After looking through all the examples I felt a bit stumped as most of them seemed like they were already finished products and repurposing them in terms of code seemed beyond my current skills.
What appealed to me was using the AI engine as an input device and teaching it different hand gestures. The idea was to make the sign alphabet shown below.
Copyright: Cavallini & Co. Sign Language Chart Poster
The KNNClassification example included a rock paper scissors demo that did hand recognition, which was the perfect start.
It took some tinkering with the source code to figure out what it was doing and how the html was rendering things from the .js file. This took a fair amount of time as most of the code was very alien looking and some very concise shortcut methods were used to optimize the code.
I had to stop and go back to Nick’s video to understand what TensorFlow and ml5 were doing and how it all fits together, as my code kept breaking and only the initial 3 values kept showing. A little mind mapping of the TensorFlow and ml5 ecosystem cleared the fog, and I could finally get the engine working with 5 values. Understanding how the two things connected helped me see what I was trying to do with the code, as most of it was copy-pasted code snippets.
The Idea for using ml5 and p5:
I now had the framework I wanted to use. I did not do the traditional chart, as my idea was to use hand gestures instead, like Stop, Attention, Thumbs Up etc. This would allow me to make a game of it, where two players could train their own sets and ping-pong signs before the time ran out. The winner would be the one who was best at getting precise predictions and at switching back and forth before the timer ran out.
These were the charts I got the icons from:
I mapped these in the window:
The next part was training the different symbols to recognize the hand gestures. This was fairly straightforward, but I could not get the AI to load the data set I had saved. I tried changing the permissions, but it did not work when I clicked load, so I kept having to retrain the data sets to get them to function.
The next thing I wanted to try was getting p5 to trigger an animation based on the gesture that was at 100%. I did this with a very simple example, a rotating cube that would turn based on the number that was at 100%. This tested the concept, but the ultimate goal was to make a meter that would work like a gauge showing where the most accurate recognition was.
This was a proof of concept where the circle would then be replaced with a meter. I tried making this purely in p5, but it was way too much math for such a simple shape, so I made an image in Illustrator and imported it.
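The idea of driving an animation or a gauge from the most confident gesture can be sketched in plain JavaScript. This is only a rough sketch under assumptions: `topGesture` and `gaugeAngle` are hypothetical helper names, and the `{ label: confidence }` object is assumed to resemble the per-label confidences that the KNN classifier reports. It is not the code from the project.

```javascript
// Hypothetical helper: return the label with the highest confidence.
// `confidencesByLabel` is assumed to look like { Stop: 0.1, ThumbsUp: 0.9 }.
function topGesture(confidencesByLabel) {
  let best = null;
  for (const [label, confidence] of Object.entries(confidencesByLabel)) {
    if (best === null || confidence > best.confidence) {
      best = { label, confidence };
    }
  }
  return best; // null when no labels have been trained yet
}

// Hypothetical helper: map a confidence in [0, 1] to a needle angle
// (in degrees) on a half-circle gauge running from minDeg to maxDeg.
function gaugeAngle(confidence, minDeg = -90, maxDeg = 90) {
  const clamped = Math.min(1, Math.max(0, confidence));
  return minDeg + clamped * (maxDeg - minDeg);
}

// Example values are made up for illustration.
const sampleConfidences = { Stop: 0.0, Attention: 0.2, ThumbsUp: 0.8 };
const winner = topGesture(sampleConfidences);
console.log(winner.label, gaugeAngle(winner.confidence));
```

In a p5 sketch, the returned angle could be passed to rotate() before drawing the imported needle image, so the math for the gauge stays in one small function.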
The final result does not have a working meter or a game, but both are edging towards the final outcome: a game where two users return the gesture they are given and then trigger a new one. If they are unable to return the gesture in time, they lose a point. When you return the gesture, the timer gets faster on the next return, until it is too hard to keep up and one person loses a point. The game ends at 10 points.
The concept for the game:
The above is a mock-up of what I would like to create with the AI engine. I still have a long way to go to realize this, but I have some of the pieces in place. It would use ml5->p5->PubNub->p5.
PubNub & PoseNose GitHub Working App Link (If by yourself for testing, you can use your phone as a second webcam)
Objective & Concept
Through PoseNet I wanted to explore networking with computer vision to visualize multiple people watching a screen together. This project tracks the position of the nose on every unique browser page and shares the data amongst the connected users. The tracking allows for users to be aware of the physical position of where the other users are. This creates a spatially aware sensation by either encouraging others to follow other “noses” or for users to move away and create their own space on the browser page.
I followed along with Daniel Shiffman’s Coding Train tutorial, where he explores what PoseNet is, what data it gives, and how you can visualize that data. In his example, he visualizes a nose as an ellipse that follows the user’s nose on the screen.
The most interesting (and surprisingly simple) part of PoseNet is that it simply changes your body into what could be perceived as “cursors” on the screen (x & y coordinates).
There are a few different examples of how to expand on this data with p5, such as creating a light nose. These PoseNet examples are interesting because they use the body as a controller. I had previously explored PoseNet in my Body Centric class in this blog post here where I expanded upon this concept.
For this project, to emphasize the ubiquity of the webcam, I wanted to explore what it would look like to have multiple people visualized on the screen together.
Starting with Shiffman’s code for tracking the nose, I used this to create the ellipse that follows the nose along with the user on the open browser.
I then found the x and y coordinates of the nose and connected my page to PubNub to publish and subscribe to the position of the ellipse.
I followed along in a previous example from my Creation and Computation class in the fall that tracks how many users are on a webpage using PubNub. In this example, every user loaded on the page sends the x and y coordinates of their cursor on the webpage when they click. The code then takes the average of all the user coordinates and draws lines to the average point of all the cursors.
I connected my nose code to PubNub and sent the coordinates of my nose. I chose the nose because it is the center of the face and most accurately depicts where the user is in relation to their screen. Upon receiving the subscribed data, I would check to see if there was a “new nose”. If there was, that user would be added to the active user array of “All Noses”. Every time a message from PubNub was received, I would check to see if its ID was in the array, and if so I would update the coordinates of where that user was on the screen.
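The add-or-update logic described above can be sketched as follows. This is a minimal sketch, not the project's code: the function name `upsertNose` and the message fields (`id`, `x`, `y`, `r`, `g`, `b`) are assumptions made for illustration, and the PubNub wiring is only indicated in a comment.

```javascript
// Hypothetical message shape: { id, x, y, r, g, b }.
// Returns true when a new nose was added, false when an existing one was updated.
function upsertNose(allNoses, message) {
  const existing = allNoses.find((nose) => nose.id === message.id);
  if (existing) {
    // Known user: only move their ellipse.
    existing.x = message.x;
    existing.y = message.y;
    return false;
  }
  // New user: remember a copy of everything they sent, including their color.
  allNoses.push({ ...message });
  return true;
}

// Assumed wiring (not run here): inside the PubNub message listener,
// call upsertNose(allNoses, event.message) for every received message.
const allNoses = [];
upsertNose(allNoses, { id: "user-a", x: 120, y: 80, r: 255, g: 0, b: 0 });
upsertNose(allNoses, { id: "user-a", x: 130, y: 85, r: 255, g: 0, b: 0 });
console.log(allNoses.length); // one nose, at its latest position
```

Because noses are never removed from the array, a user who leaves keeps their last position, which matches the trace behavior of the page.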
Two noses chasing each other on the page.
The code then loops through the array and draws an ellipse at the received/sent coordinates of each user’s nose. When a user leaves, their ellipse stays there, which shows a trace of all the users that have been active on the page.
Along with the x & y coordinates, I also sent PubNub the RGB values of the user’s nose ellipse. This was to differentiate the users on the page and to allow users to uniquely identify themselves on others’ browsers.
Video documentation of moving around another nose.
The interaction of the two noses was interesting because it prompted either an aversion to the noses overlapping or an urge to make the dots touch. Moving your nose along the screen was not as direct as expected. The motion was laggy, which prompted jerky motions from the users.
This experiment was an initial exploration into mapping multiple physical spaces together into a virtual space. In further iterations, I would make an action occur if the different positions of the noses overlapped on the browser page. I chose not to this time because I did not want to make anything that could be interpreted as a game mechanic. I wanted to see what the actual reaction would be amongst a few users. In another iteration, I would track other parts of the body, such as the eyes or wrists. Tracking only the nose was effective for capturing the facial position of the user, which is the most common angle seen when sitting at a webcam (while working at a computer).
Overall I am interested in exploring networking computer vision technologies further in a pursuit to examine the multiplicity of spaces and existences we inhabit simultaneously.
The terms ML5, TensorFlow and PoseNet have been loosely tossed around in the DF studio near the end of this semester by my peers in Body-Centric Tech, and perhaps I should feel lucky to finally dive into them, regardless of the stress of living deadline to deadline 🙂 But, snipes aside, this was a whole other realm of knowledge that was extremely new to me, and I decided that by getting a better sense of what ML5 etc. is, I would find its true appeal. Below are key bullet points I made for myself:
ML5 – Machine Learning 5, which is symbiotic with P5.
ML5 doesn’t really need p5… but p5 makes things easier for you. ML5 is there so you can use Tensorflow.js, whose mother is Tensorflow.
ML5 deals with a lot of pre-trained models.
ML5 library doesn’t know a lot about Humans.
After looking at all the juicy examples, I wanted to explore the capabilities of the Image Classification with MobileNet, since it is “trained to recognize the content of certain images.” I wondered to what extent MobileNet has been trained, and how? In what context? How did its creators train it, and who were they? What kind of MobileNet am I tampering with here? Has it perhaps been… westernized? Or just based on completely obscure datasets?
With my burning questions, I played around with the ml5 index page’s example with the drag&drop/upload an image. Below are some of my discoveries when it came to images available on my desktop.
A male Japanese celebrity is a diaper/nappy, a Sim from The Sims 4 is apparently a bathing cap, and the Poliwag from Pokémon is a CD player. Great! A lot of the ML confidence decreased when it came to anything with human-like features. I continued to upload other screenshots I’ve taken from animated/drawn media and got an interesting result:
Game Fanart, Spongebob and A cover from the Case Closed manga were ALL considered Comic Book with varying degrees of confidence. I started to realize how even images that are completely different can all be classified under one type of image. From there, I decided to upload pictures of myself (Yes, I volunteer! As Tribute!) because I was entirely curious as to how MobileNet would classify by appearance.
Surprisingly, I am Abaya. The funny thing about MobileNet classifying this as Abaya is how completely specific this word is to the Middle Eastern context, where culturally, women of the Arabian Gulf wear a black dress/cloak called an Abaya in their daily lives. I would think that only the first image could be classified as Abaya due to the black material I am wearing on my head, but then it did the same for my lighter colored one. Regardless of its confidence in recognizing me, it gives varying percentages for how sure it is that the headscarf is an Abaya instead of, say, a Hijab. I wanted to see if this was the same for random images on the internet, and it turned out to be true.
A random lady with a headscarf is more of an Abaya than the model with the Nike Sports Hijab, but both still turned out as Abaya. How fascinating! I decided to take this further by setting up a webpage to see what other labels and probabilities it would give to images that were mostly “Abaya”.
Following Daniel Shiffman’s examples, I checked various images and their probability percentages on a webpage I created called MobileNet vs Me.
Based on the Array results in the console, the results OTHER than abaya were significantly lower. Abaya was 66% while Cloak was at 29% and Sweatshirt was at 5%. I noticed that Cloak would be the next way in which I would be identified and then Sweatshirt. I tried this with a few other images and the results are below:
With different images, the third identifier in the array was Ski Mask and Bath Towel, which made Sweatshirt not a common third one. But yet, when I loaded the random images below:
Sweatshirt came out as the second result. So we have Abaya, Cloak and Sweatshirt to identify a woman wearing a headscarf, which is very interesting to me since none of the results were specific words such as Hijab or Headscarf. MobileNet didn’t identify them as what they are known as, something worn on the head for religious reasons. Perhaps a good takeaway from this is that the machine knows no concept other than what objects/animals/types of things the image might be similar to, which generalizes(?) the experience more than it specifies, regardless of how weirdly specific the identifier results were. In itself, perhaps the MobileNet dataset is paradoxical in nature.
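For anyone wanting to read the console arrays the same way, here is a rough sketch of turning a classifier result array into ranked percentages. It assumes each entry has `label` and `confidence` fields (the exact field names depend on the ml5 version), and `toPercentages` is a hypothetical helper. The sample numbers are the Abaya / Cloak / Sweatshirt figures reported above.

```javascript
// Hypothetical helper: sort results by confidence and express the top few
// as whole-number percentages.
function toPercentages(results, topK = 3) {
  return [...results]
    .sort((a, b) => b.confidence - a.confidence)
    .slice(0, topK)
    .map(({ label, confidence }) => ({
      label,
      percent: Math.round(confidence * 100),
    }));
}

// Sample data taken from the figures quoted in the post.
const sampleResults = [
  { label: "cloak", confidence: 0.29 },
  { label: "abaya", confidence: 0.66 },
  { label: "sweatshirt", confidence: 0.05 },
];

for (const { label, percent } of toPercentages(sampleResults)) {
  console.log(`${label}: ${percent}%`);
}
```

Sorting explicitly means the ranking no longer depends on the order in which the model happens to return its guesses.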
To step it up a notch, I decided to turn on my webcam to see if MobileNet can identify my appearance in real time, and how different the results would be as opposed to Abaya (or Cloak or Sweatshirt.)
It was interesting to investigate the capabilities of MobileNet in this manner to see if ML would exhibit certain biases, either political or sociological. It turned out that it is quite innocent (for now) and that I may well be identified as an Abaya, a Bonnet or even a Bandaid or a Mosquito Net. These terms could even be what one would prefer to be identified with, rather than the terms mass media chooses to describe a woman with a headscarf (or anything close to how news outlets would describe someone of Islamic faith).
To conclude, the varying probabilities that MobileNet gives us in real time could very much be a reflection of how different people within a faith can be, from the extremes to people simply living their lives, and that they should not be placed into one universal definition.
Computer vision seems very interesting to me. I checked the 4 videos posted on Canvas and started this assignment by going through all the examples available on GitHub and the ML5 website. I just wanted to use a tensorflow.js model to implement some functions on the web, and I found that I don’t need to learn tensorflow.js systematically for that: I just need a ready-made model packaged as an NPM package, such as MobileNet (image classification), coco-ssd (object detection), PoseNet (human pose estimation) or speech-commands (voice recognition). The NPM pages for these models have detailed code examples that you can copy. There are also a number of third-party, off-the-shelf model packages, such as ML5, which includes pix2pix, SketchRNN and other fun models.
We were asked to build upon one of the existing ML5 examples by changing the graphic or hardware input or output. I already had the chance to explore PoseNet in the Body-Centric course, so for this project I decided to explore something different. I found the StyleTransfer example quite interesting, so I decided to work on it, still using the webcam and pictures as input.
It is an expansion on the style transfer example in ml5: users select paintings by their favorite artists from a limited range of choices, and the selected painting changes the style of the real-time image, producing a unique abstract painting video.
Firstly, the position detector detects the movement of the object. When the object moves to the visual center of the camera system, the detector immediately sends a trigger pulse to the image acquisition part.
Then, according to a predetermined program and delay, the image acquisition part sends pulses to the camera and the lighting system, and both the camera and the light source are turned on.
The camera then starts a new scan. Before starting a new frame scan, the camera opens its exposure mechanism; the exposure time can be pre-set. The light source is turned on at the same time, and the lighting time should match the exposure time of the camera.
At this point, scanning and output officially begin. The image acquisition part obtains a digital image or video through A/D conversion. The digital image or video is stored in the memory of the processor or computer, and the processor then processes, analyzes and recognizes the image.
Step 1: allow StyleTransfer to use the camera
Step 2: select the artwork
(Start and stop the transfer process)
Step 3: check the new synthesized video
In the example there is only one painting, and the composition of the video is too abstract. I wanted to give users more choices, so I tried several other paintings to see if they had different effects.
Now I had a big problem: there was no big difference in color and composition between the video styled with the chrysanthemum painting and the one styled with the abstract painting. I don’t know what the problem is. Then I tried an abstract painting in blue.
The difference is so small that I don’t know what to do with the picture. The naked eye can only see very slight differences. But I’m going to make the framework so users can choose different images.
Users select paintings by the artists they like, and the selected painting changes the style of the real-time image, producing a unique abstract painting video.
For this workshop I chose to continue my explorations of PoseNet, which allows for real-time human pose estimation in the browser, having started learning the framework in the Body-Centric Technologies class. I wanted to try working with images and PoseNet but I couldn’t get the PoseNet model to track whenever I used images…my workaround for this was to group the images into a video. My idea was to explore how women posed on magazine covers by comparing poses from different fashion magazine covers. Below is a video showing the final results of poses that were captured when working with the PoseNet with webcam example.
I found that the body tracking worked well only if I set the size of the video to a width of 640 pixels and a height of 480 pixels, which were the dimensions used in the ml5 examples.
What is ml5.js? A wrapper for tensorflow.js that makes machine learning more approachable for creative coders and artists. It is built on top of tensorflow.js, accessed in the browser, and requires no dependencies installed apart from regular p5.js libraries.
NOTE: To use ml5.js you need to be running a local server. If you don’t have a localhost setup you can test your code in the p5.js web editor – you’ll need to create an account.
I also found that the multi-pose tracking seemed to top out at 3 poses whenever there were more than 3 poses in view. Additionally, the cover model’s skin color affected the tracking, so that at times some body parts were not tracked. The cover model’s clothes also affected whether some parts were tracked or not: at times limbs were ignored, or the clothes were tracked as additional limbs. The keypoints seemed to be detected all the time, but the lines for the skeleton were not always completed.
What are keypoints? These are 17 data points that PoseNet returns, and they reference different locations in the body/skeleton of a pose. They are returned in an array where the indices 0 to 16 each reference a particular part of the body, e.g. index 0 contains results about the nose, such as its x, y coordinates and a detection confidence score.
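To make the keypoint array more concrete, here is a small sketch that lists the 17 PoseNet part names in index order and filters out weakly detected keypoints. The keypoint shape (`part`, `score`, `position.x`, `position.y`) follows what PoseNet returns as far as I know; the helper name `confidentKeypoints`, the 0.5 threshold and the sample values are assumptions for illustration.

```javascript
// PoseNet part names in keypoint order (index 0 is the nose).
const POSENET_PARTS = [
  "nose", "leftEye", "rightEye", "leftEar", "rightEar",
  "leftShoulder", "rightShoulder", "leftElbow", "rightElbow",
  "leftWrist", "rightWrist", "leftHip", "rightHip",
  "leftKnee", "rightKnee", "leftAnkle", "rightAnkle",
];

// Hypothetical helper: keep only keypoints detected with enough confidence.
function confidentKeypoints(keypoints, minScore = 0.5) {
  return keypoints
    .filter((keypoint) => keypoint.score >= minScore)
    .map((keypoint) => ({
      part: keypoint.part,
      x: keypoint.position.x,
      y: keypoint.position.y,
    }));
}

// Made-up sample: a clearly visible nose and a poorly detected wrist.
const sampleKeypoints = [
  { part: "nose", score: 0.98, position: { x: 320, y: 120 } },
  { part: "leftWrist", score: 0.12, position: { x: 90, y: 400 } },
];

console.log(POSENET_PARTS.indexOf("nose")); // index of the nose in the array
console.log(confidentKeypoints(sampleKeypoints));
```

A filter like this may partly explain the incomplete skeleton lines: a line can only be drawn when both of its end keypoints pass the confidence threshold.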
Below are some of the images I tested with:
I’d like to continue working on this however I would like to explore using OpenPose which is a framework like PoseNet that provides more keypoints tracked as compared to PoseNet’s 17 keypoints. From my working with PoseNet so far, I find that it is more beneficial in areas where you aren’t tracking a skeleton but are doing something more with the data gotten back from keypoints e.g. right eye is at this x and y position so do certain action.
I tried some of the other ml5 examples however I wasn’t satisfied with the results. I was particularly interested in the style transfer and the interactive text generator. However, I found that in order for them to be useful to me, I would have to train my own custom models and I didn’t have the time and adequate dataset to do this.
I also tried out the video classification example, where I was toying with the idea of having the algorithm detect a specific image in a video. It quickly dawned on me that this was a case for a custom model, as the pre-trained model seemed to work best only when generic objects were in view, e.g. at times it recognized my face as a basketball, my hand as a band-aid, my hair as an abaya etc. I also noticed that if I brought the objects closer to the screen, the detection was slightly better. Below are some of my findings using MobileNet Video Classification in p5.
Pros & cons of using a pre-trained model vs. a custom model? When using a pre-trained model like PoseNet with tensorflow.js, a lot of the work has already been done for you. Creating a custom model is beneficial only if you are looking to capture a particular pose, e.g. if you want to train the machine on your own body, but in order to do this you will need tons of data. Think thousands or even hundreds of thousands of images, or 3D motion capture, to get it right. You could crowdsource the images; however, you have to think about issues of copyright and your own bias in who is in the images and where they are in the world. It is imperative to be ethical in your thinking and choices.
Another issue to keep in mind is the diversity of your source images, as a lack of it may cause problems down the line when it comes to recognizing different genders or races. Pre-trained models are not infallible either, and it is recommended that you test models out before you commit to them.
I started this experiment by going through all the examples available on the ML5 website. I found the webcam classification quite interesting, but also difficult to work with because of how sensitive it was to background objects. In addition, I already had the chance to explore PoseNet in the Body-Centric course. So for this project, I decided to explore something different. Moving away from webcams and pictures, I decided to experiment with the pitch detection example in ML5. Having already done work in digital signal processing (DSP), I found it quite fascinating how quickly and accurately the software was able to identify the musical note in the piano example. I wanted to modify the example by creating a user interaction with the algorithm, using the existing data to give users a tool for practicing their musical skill.
“Can You Sing?” is an expansion on the Piano example, where the user can select the note they are trying to mimic in order to practice specific notes. The software indicates when the user has successfully mimicked the sound by highlighting the key in green. Only then is the user allowed to select another key to repeat the experience.
To do that, I had to create a way for the user to select the note that they wanted to mimic. I divided the height of the keys in two: the top half looks for black keys and the bottom half looks for white keys. Every time a mouse button is pressed, the Y location of the mouse is checked. If it is in the bottom half of the piano shape, a white key is selected based on the X location of the mouse. If the Y position is in the top half, a black key is selected based on the X position.
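The top-half/bottom-half rule can be sketched as a single hit-test function. This is a sketch under assumptions, not the project's code: it assumes one octave of seven equal-width white keys, black keys centered on the boundaries between white keys at 60% of a white key's width, and a hypothetical `piano` rectangle of the form `{ x, y, width, height }`. The real example may draw its keys differently.

```javascript
const WHITE_NOTES = ["C", "D", "E", "F", "G", "A", "B"];
// A black key sits on the boundary after these white-key indices
// (there is none between E-F or after B).
const BLACK_NOTES = { 0: "C#", 1: "D#", 3: "F#", 4: "G#", 5: "A#" };

// Hypothetical helper: return the note under the mouse, or null.
function keyAt(mouseX, mouseY, piano) {
  const { x, y, width, height } = piano;
  const inside =
    mouseX >= x && mouseX < x + width && mouseY >= y && mouseY < y + height;
  if (!inside) return null;

  const whiteWidth = width / WHITE_NOTES.length;
  const localX = mouseX - x;

  // Bottom half: pick the white key from the X position.
  if (mouseY >= y + height / 2) {
    return WHITE_NOTES[Math.floor(localX / whiteWidth)];
  }

  // Top half: pick the black key whose span contains the X position.
  const blackWidth = whiteWidth * 0.6; // assumed proportion
  for (const [index, note] of Object.entries(BLACK_NOTES)) {
    const center = (Number(index) + 1) * whiteWidth;
    if (Math.abs(localX - center) <= blackWidth / 2) return note;
  }
  return null; // top half, but between black keys
}

const piano = { x: 0, y: 0, width: 700, height: 200 };
console.log(keyAt(50, 150, piano), keyAt(100, 50, piano));
```

In p5 this function could be called from mousePressed() with mouseX and mouseY. Returning null for clicks between black keys is one possible choice; falling back to the white key underneath would be another.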
After the key is selected, the user’s voice is converted into a note and drawn on the screen. Only if the user’s input is the same as the selected note does the note change color to green. After that, the program asks the user to select another note.
In the other examples, I found the voice command also very interesting. So I added it to this program so that every time the user matches the selected key a voice will say “Nicely Done!”. This was purely so that I could also explore this feature of ML5.
Setting up the example was very challenging. For some reason the device software would not always receive data from the microphone which I found irritating. I had to restart the browser each time that I made some changes to the code. I wasn’t able to figure out if the problem was because of the libraries or it was just the local server, but it took a few tries each time to run the application.
But the most challenging part of the program was to identify the location of each note on the screen so that the user could choose their desired note by clicking on it on the piano shape. After a few tries I got a good understanding of how the keys were drawn and was able to use the same technique to identify which position is associated with which key. I did end up dividing the keys in half based on their Y location, just to simplify the separation between the black and the white keys.
What I found useful was how the algorithm could detect the note independent of the octave. Plus, the speed at which it processed the data and its accuracy make it an ideal tool for musicians. It could simply be used as an online tuner for almost all musical instruments, which I find quite useful.
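Octave-independent note detection can be illustrated with the standard frequency-to-MIDI formula, midi = 69 + 12 * log2(f / 440), followed by taking the result modulo 12. This is a sketch of the general technique, assuming the pitch detector reports a frequency in Hz; it is not necessarily how the Piano example does it internally, and the function names are hypothetical.

```javascript
const NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"];

// Convert a frequency in Hz to a note name, ignoring the octave.
// Uses equal temperament with A4 = 440 Hz (MIDI note 69).
function noteNameFromFrequency(frequency) {
  const midi = Math.round(69 + 12 * Math.log2(frequency / 440));
  return NOTE_NAMES[((midi % 12) + 12) % 12];
}

// Hypothetical check used to decide whether to turn the selected key green.
function matchesSelectedNote(frequency, selectedNote) {
  return noteNameFromFrequency(frequency) === selectedNote;
}

console.log(noteNameFromFrequency(440));    // A
console.log(noteNameFromFrequency(220));    // A, one octave lower
console.log(noteNameFromFrequency(261.63)); // C
```

Because only the note name is compared, a user can sing in whatever octave suits their voice and still match the selected key.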
This is a fairly simple engagement with PoseNet. Tossing The Ball Around is a ball rendered in P5 that changes size based on the space between the user’s arms.
I started by messing around with some more ambitious ideas. I looked at trying to get PoseNet to create animated skeletons using .gif files in place of images. However, it seems that PoseNet doesn’t integrate with animated gifs easily.
After this I spent a fair bit of time trying to put together what I was thinking of as an “AR Costume”
I played with PoseNet and brought the code back to a place where I had access to all of the key points, and to a place where I was comfortable modifying things.
Then I went to the p5 Reference site for some tutorials on how to make particles. I imagined a series of animated streamers emanating from every key point, covering the user like a suit of grass.
I was able to create a particle system but I was not able to implement it in a way I was happy with, with multiple instances of it emanating from multiple points on the body.
I settled on experimenting with finding origins for drawings that were outside of the key points returned by PoseNet. I took the key points from the wrists and drew an ellipse at the midpoint between them. I used the dist() function to measure the distance between the user’s two wrists and draw a circle that changes size based on the user’s movements. The effect is similar to that of holding a ball or a balloon that adjusts itself constantly.
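The geometry behind the ball is small enough to sketch without p5. In the snippet below, `ballBetweenWrists` is a hypothetical helper name, the wrist objects are assumed to carry `x` and `y` fields, and Math.hypot stands in for p5's dist().

```javascript
// Hypothetical helper: place a ball at the midpoint between the wrists
// and size it by the distance between them.
function ballBetweenWrists(leftWrist, rightWrist) {
  return {
    x: (leftWrist.x + rightWrist.x) / 2,
    y: (leftWrist.y + rightWrist.y) / 2,
    // Same result as p5's dist(x1, y1, x2, y2).
    diameter: Math.hypot(rightWrist.x - leftWrist.x, rightWrist.y - leftWrist.y),
  };
}

// Made-up wrist positions for illustration.
const ball = ballBetweenWrists({ x: 0, y: 0 }, { x: 30, y: 40 });
console.log(ball); // { x: 15, y: 20, diameter: 50 }
```

In draw(), the returned values could be passed straight to ellipse(ball.x, ball.y, ball.diameter).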
I played around with coloration and making the image more complex, but I decided to quit while I was ahead.
Throughout the process, I tried to make use of coding techniques I had learned throughout the year. When I had first approached programming I knew nothing beyond the very basics. In coding this project I tried to use concepts like arrays, passing variables into functions, and object-oriented programming. I still have a lot more work to do to get comfortable, but this project demonstrated to me how far I’ve come.
My interest in ML5 is focused on real-time animated effects. Compared to professional software such as Adobe Character Animator, making real-time face animation with ML5 is simpler and more customizable. Though the result may not be as highly finished, it is a great choice for designers and coders who want to produce visual work.
I found it easier to just use the p5 editor; however, the ML5 script needs to be added to the HTML file in the p5 editor (the fourth “script”).
The model used is poseNet. It allows real-time human pose estimation: it can track, for example, where my eyes, nose and hands are, so that I can build visual work on those positions.
Then I set up the canvas and draw functions in the p5 editor. I used the gray filter to add more fun.
Next I programmed poseNet into my code. When everything is set up, we can see that ml5 recognizes “object, object, object (which should be my face)…” from the webcam.
After some research, I learned that the body parts from nose to feet are coded as 0 to 16 in poseNet. The left eye and the right eye are 1 and 2.
The first try:
As the gif shows, if I move out of the frame the circles are not able to pick up the tracking again.
The second try solved it: (if (poses.length > 0))
In fact, I can call my project successful at this point, however, I wanted to make it more finished.
In the third try, I tested the lerp function, and instead of a set size, the size of the ellipses is defined by the “distance”, which allows the ellipses to become larger or smaller as I move forward and backward:
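A rough sketch of how the second and third tries fit together, written as plain JavaScript so it runs outside the p5 editor. The `poses.length > 0` guard and the eye indices 1 and 2 come from the text above; the function name `updateEyes`, the shape of `state`, the smoothing amount of 0.5 and the use of the eye-to-eye distance as the "distance" are assumptions.

```javascript
// Same formula as p5's lerp(): move from start towards stop by `amount`.
function lerp(start, stop, amount) {
  return start + (stop - start) * amount;
}

// Hypothetical update step, meant to be called with each batch of poses.
function updateEyes(state, poses, smoothing = 0.5) {
  // Guard from the second try: keep the last known state when nobody is in view.
  if (poses.length === 0) return state;

  const keypoints = poses[0].pose.keypoints;
  const left = keypoints[1].position;  // left eye
  const right = keypoints[2].position; // right eye

  return {
    leftX: lerp(state.leftX, left.x, smoothing),
    leftY: lerp(state.leftY, left.y, smoothing),
    rightX: lerp(state.rightX, right.x, smoothing),
    rightY: lerp(state.rightY, right.y, smoothing),
    // Eyes appear further apart when the face is closer to the camera.
    size: Math.hypot(right.x - left.x, right.y - left.y),
  };
}

// Made-up pose for illustration.
const samplePoses = [{
  pose: {
    keypoints: [
      { part: "nose", position: { x: 130, y: 150 } },
      { part: "leftEye", position: { x: 100, y: 100 } },
      { part: "rightEye", position: { x: 160, y: 180 } },
    ],
  },
}];
let eyeState = { leftX: 0, leftY: 0, rightX: 0, rightY: 0, size: 0 };
eyeState = updateEyes(eyeState, samplePoses);
console.log(eyeState);
```

Smoothing the positions with lerp trades a little lag for less jitter; an amount closer to 1 follows the face more tightly.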
You can check it out here: https://vulture-boy.github.io/lstmPoetry/ [The text to speech seems to be giving the webhost some issues and only works some of the time. would recommend downloading it from the GitHub]
To accomplish this, I scraped poetry from a website and followed the tutorial listed on ml5: Training a LSTM
I used Web Scraper in Chrome in order to get the text information I needed to train the machine learning process. I needed to create a text file containing the information I wanted the algorithm to learn from, but I didn’t want to go through the laborious process of manually collecting it from individual web pages or Google searches. A web scraper lets the computer automate the task. The only information required is a ‘sitemap’ that you put together using Web Scraper’s interface: you pick out the HTML elements that contain the text, links and data of interest, which tells the scraper how to navigate the page and what to collect.
After the process is complete (or if you decide to interrupt it), you can export a .csv containing the data collected by the Web Scraper process and copy the column(s) containing the desired data into a .txt file for the training process to use.
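Copying the column by hand works, but the same step could also be scripted. Below is a minimal sketch, assuming the exported .csv has a header row and uses standard quoting (fields wrapped in double quotes, with "" for a literal quote). The column name `poem`, the sample data and the helper names are hypothetical, and writing the result to a .txt file is only indicated in a comment.

```javascript
// Minimal CSV parser: handles quoted fields, escaped quotes ("")
// and line breaks inside quoted fields.
function parseCsv(text) {
  const rows = [];
  let row = [];
  let field = "";
  let inQuotes = false;
  for (let i = 0; i < text.length; i++) {
    const ch = text[i];
    if (inQuotes) {
      if (ch === '"' && text[i + 1] === '"') {
        field += '"'; // escaped quote
        i++;
      } else if (ch === '"') {
        inQuotes = false;
      } else {
        field += ch;
      }
    } else if (ch === '"') {
      inQuotes = true;
    } else if (ch === ",") {
      row.push(field);
      field = "";
    } else if (ch === "\n") {
      row.push(field);
      rows.push(row);
      row = [];
      field = "";
    } else if (ch !== "\r") {
      field += ch;
    }
  }
  if (field !== "" || row.length > 0) {
    row.push(field);
    rows.push(row);
  }
  return rows;
}

// Join every non-empty value of one column, separated by blank lines.
function columnToText(csvText, columnName) {
  const [header, ...rows] = parseCsv(csvText);
  const index = header.indexOf(columnName);
  if (index === -1) throw new Error(`Column not found: ${columnName}`);
  return rows
    .map((cells) => cells[index] ?? "")
    .filter((value) => value !== "")
    .join("\n\n");
}

const sampleCsv = 'id,poem\n1,"Roses are red,\nviolets are blue"\n2,"She said ""hi"""\n';
console.log(columnToText(sampleCsv, "poem"));
// In Node, the result could then be saved with fs.writeFileSync("input.txt", text).
```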
Training the Process
In order to prepare my computer for training, I had to install a few packages to my Windows 10 Powershell, namely Chocolatey, Python3, and a few python packages (pip, Tensorflow, virtual environment). It’s worth noting that in order to install these I needed to enable Remote Scripts: by default, Windows 10 prevents you from running scripts inside Powershell for safety purposes.
Once I had the packages installed, I ran the train.py file included in the training package repository on a .txt file collating all the text data I collected via web scraping. Each epoch denotes one full presentation of the data to the process, and the time/batch section denotes how many seconds each batch took. The train_loss parameter indicates how close the process’ prediction was to the input data: the lower the value, the better the prediction. There are also several hyperparameters that can be adjusted to improve the quality of the result and the time it takes to process (Google has a description of this here). I used the default settings for my first batch on the poetry:
With 15 minutes of scraped data (3500 iterations, poem paragraphs), it took about 15 minutes to process.
For a second batch, I collected about 30 minutes of data from a fanfiction website (227650 iterations, sentence and paragraph sizes) and I believe it took a little over 3 hours.
I adjusted the hyperparameters as recommended in the ml5 training instructions for 8 MB of data on another 15 minute data set containing an entire novel (55000 iterations, 360 chapters), and chose to run the process on my laptop instead of my desktop computer. The average time/batch was ~7.5, far larger than my desktop’s average of ~0.25 with default settings. This was going to take approximately five days to complete, so I aborted the process. I tried again using default settings on my laptop: the iterations increased from 55000 to 178200, but the batch time was a respectable 0.115 on average.
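The run times above are consistent with a simple estimate: total time is roughly the number of iterations multiplied by the average time/batch. A quick sketch of that arithmetic, using the numbers reported above (the helper names are hypothetical, and the estimate ignores any startup or checkpointing overhead):

```javascript
// Rough estimate: every iteration processes one batch.
function estimateTrainingSeconds(iterations, secondsPerBatch) {
  return iterations * secondsPerBatch;
}

// Format a duration in seconds as days, hours and whole minutes.
function formatDuration(seconds) {
  const days = Math.floor(seconds / 86400);
  const hours = Math.floor((seconds % 86400) / 3600);
  const minutes = Math.floor((seconds % 3600) / 60);
  return `${days}d ${hours}h ${minutes}m`;
}

// Poetry on the desktop: 3500 iterations at ~0.25 s per batch.
console.log(formatDuration(estimateTrainingSeconds(3500, 0.25)));
// Novel on the laptop with adjusted hyperparameters: 55000 iterations at ~7.5 s.
console.log(formatDuration(estimateTrainingSeconds(55000, 7.5)));
// Novel on the laptop with default settings: 178200 iterations at ~0.115 s.
console.log(formatDuration(estimateTrainingSeconds(178200, 0.115)));
```

The first two estimates come out at roughly 15 minutes and a little under five days, matching what I observed. Checking this after the first few batches is a cheap way to decide whether a run is worth waiting for.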
The training file on completion creates a model folder, which can be substituted for any other LSTM model.
One of the contributed libraries for p5.js is the p5.speech library. The library is easily integrated into existing p5.js projects and has comprehensive documentation on their website. For my LSTM generator, I created a voice object and a few extra sliders to control the voice’s pitch and playback speed as well as a playback button that read the output text. Now I can listen to beautiful machine-rendered poetry!
Here’s a sample output:
The sky was blue and horry The light a bold with a more in the garden, Who heard on the moon and song the down the rasson’t good the mind her beast be oft to smell on the doss of the must the place But the see though cold to the pain With sleep the got of the brown be brain. I was the men in the like the turned and so was the chinder from the soul the Beated and seen, Some in the dome they love me fall, to year that the more the mountent to smocties, A pet the seam me and dream of the sease ends of the bry sings.
Eavesdropper is a web application that actively listens for the mention of a user’s name spoken in conversation. The application uses a voice to text API that transcribes conversations that are in listening range. The transcriptions are analyzed, and if the name of someone is said the application will sound an alert noting that that person is going to be notified through text message. Simultaneously, the clip of what was being said around the user is saved. The user then receives a text message and can go see what was being said about them.
Building upon my previous voice to text “DIY Siri” project, I wanted to play around with the concept “what if my computer could notify me if it heard something specific?”. I initially thought that it would be interesting to build directly off of the Wolfram Alpha API from the DIY Siri project to notify me if something specific was searched. From there I decided to isolate the working parts and start with the concept of “the application hears a specific word, the user gets a notification”. I chose to use names as a trigger because they are rare enough that the trigger would not fire frequently. This is important because both IFTTT and Adafruit IO have limits on sending and receiving data. IFTTT has a limit of sending up to 100 text messages a month, and Adafruit IO has a limit of updating channels 30 times a minute.
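One way to stay under a limit like 30 updates a minute is to guard the sending code on the client side. The sketch below is not part of Eavesdropper and says nothing about how Adafruit IO enforces its limit; it is a hypothetical sliding-window counter, written in Python for illustration, and the class and variable names are my own.

```python
from collections import deque


class SlidingWindowLimiter:
    """Allow at most max_events within any window_seconds span."""

    def __init__(self, max_events, window_seconds):
        self.max_events = max_events
        self.window_seconds = window_seconds
        self._timestamps = deque()

    def allow(self, now):
        # Forget updates that have fallen out of the window.
        while self._timestamps and now - self._timestamps[0] >= self.window_seconds:
            self._timestamps.popleft()
        if len(self._timestamps) < self.max_events:
            self._timestamps.append(now)
            return True
        return False


# Assumed limit taken from the text: 30 updates per 60 seconds.
feed_limiter = SlidingWindowLimiter(max_events=30, window_seconds=60)

# Simulate one update attempt per second for 31 seconds.
results = [feed_limiter.allow(now=float(second)) for second in range(31)]
print(results.count(True))   # 30 updates allowed
print(results[-1])           # False: the 31st is held back
```

Passing the current time in as `now` keeps the sketch easy to test; in a running application it would come from a clock such as `time.monotonic()`.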
I started off by using my existing code from DIY Siri and removing all of the PubNub server integration. I then changed the code to analyze the transcripts of what was being said. If my name was mentioned, the code logged this information.
My next step was to connect my Adafruit IO channel to the page. I created a new feed titled “overheard” with two channels: listening, and transcripts. Listening would indicate whether or not my name was overheard, and transcripts would save whatever was being said about me.
After creating those two channels, I connected my voice to text API to Adafruit to see if I would be able to save the value “true” and the transcript of the conversation. I tested with “if my name is included in this transcript, send the data to Adafruit”. This was successful.
Following Adafruit's guidance, I started to create an applet of my own to connect this feed to my phone. I chose Adafruit IO as the “if this” (trigger) and an SMS text message as the “then that” (action). On the Adafruit side, I chose to monitor the feed “overheard” and the channel “listening”. If “listening” was equal to the data “true”, a text message would be sent. The UX of IFTTT made it simple to connect the two platforms together.
I started testing my application with all of the parts now connected. At first, I was not receiving text messages. This was because I was sending Adafruit a boolean value and not a string. The “equal to” on the IFTTT side of the platform was comparing the channel value to the string “true”. I changed the value of what I was sending to Adafruit to a string and was able to receive a text message.
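The mismatch can be shown in a few lines. This is a simplified sketch in Python, not the application's code: `trigger_fires` is a hypothetical stand-in for the applet's “equal to” check, assumed here to be a plain string comparison, and `to_feed_value` is an invented helper name.

```python
def trigger_fires(feed_value, expected="true"):
    """Hypothetical stand-in for an 'equal to' check that compares strings."""
    return isinstance(feed_value, str) and feed_value == expected


def to_feed_value(name_heard):
    """Convert the boolean flag into the string the trigger compares against."""
    return "true" if name_heard else "false"


print(trigger_fires(True))                  # False: a boolean is not the string "true"
print(trigger_fires(to_feed_value(True)))   # True: the string matches
```

Converting explicitly at the point of sending makes it obvious which type the feed receives.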
Once I received one text message, I ended up receiving six in a row. I realized that the text-to-voice alert that played upon hearing my name was saying my name out loud through the speakers, and my application was picking that up in turn. This created an infinite loop of alerts: “Alert alert Olivia has been notified that you mentioned her and received a text message”. I attempted to stop the loop by turning off the voice recognition and restarting it. The problem was that each time a new voice recognition object is instantiated, the user has to give explicit permission again for their microphone to be activated. A quick fix, so that I could continue development, was to not use my name in the text-to-voice alert from my speakers. I chose to use “O Prior has been notified” rather than using my name, Olivia.
For the UX/UI of the application, I chose to use a simple button. When the application is not actively listening, a button appears that says “Shhhhhhh!”. If the button is clicked, a microphone permissions prompt is displayed requesting access. Once the application is listening to the room, the entire screen turns black to be “inconspicuous”. The stop button is designed to be black and appears only when the cursor hovers over it. If the name Olivia is heard in conversation, a .gif file plays showing someone pressing an alert button. The video and message loop twice before returning to a black screen.
Video demo of Eavesdropper
One challenge I faced was attempting to connect two channels to the IFTTT applet. I wanted to additionally send the transcript as data through the SMS notification. The applet that was connected to Adafruit only allowed the data of one channel to be used in the SMS. Due to the setup of the applet, I could only compare against direct values (such as greater than, is not equal to, etc.). This prevented me from using the transcript channel as a trigger to send the message. Alternatively, I could have set up the applet so that it sent a message any time the transcript channel was updated. With this method, I would have had to be concerned with character length and take a substring of the message to ensure that the data would not exceed the character limit for the SMS. I did not want to cut the transcript short, so I chose to use the boolean method. If the user wants to see what was being said about them, they can look at the transcript channel and use the time the text message was sent to find what was being said at that moment.
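For reference, the substring approach I decided against could look something like the sketch below, written in Python for illustration. The 160-character default is the usual length of a single SMS segment, used here as an assumption because I did not check the limit that the applet actually applies; the function name is my own.

```python
def fit_to_sms(transcript, limit=160, suffix="..."):
    """Shorten a transcript so it fits within an assumed SMS character limit."""
    if len(transcript) <= limit:
        return transcript
    # Leave room for the suffix that signals the text was cut.
    return transcript[: limit - len(suffix)].rstrip() + suffix


short = fit_to_sms("Olivia is late again.")
long_message = fit_to_sms("word " * 100)

print(short)               # unchanged, already under the limit
print(len(long_message))   # never more than 160
```

The cost is exactly what I wanted to avoid: anything past the limit is lost from the message.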
The other challenge I noted was with the voice-to-text API. I had to use a function that checked many different variations of “Olivia”, including different capitalizations and different punctuation. This was only an issue once, so all of the variations may not be necessary. The function that I used would be very useful if this program were adapted to listen for multiple keywords. The program loops through the transcript and checks it against a list of words that are kept in an array. The array can be modified to store whatever the user wants to be notified of in a conversation.
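An alternative to listing every variation is to normalize the transcript before comparing. The sketch below is a Python illustration of that idea rather than the function used in the application: it lowercases the transcript, replaces punctuation with spaces, and then checks each word against the keyword list. The function name is my own, and it only handles single-word keywords.

```python
import string

# Map every punctuation character to a space so "Olivia," and "Olivia's" split cleanly.
_PUNCTUATION_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def find_keywords(transcript, keywords):
    """Return the keywords (lowercased) that occur as whole words in the transcript."""
    words = transcript.lower().translate(_PUNCTUATION_TO_SPACE).split()
    wanted = {keyword.lower() for keyword in keywords}
    return [word for word in words if word in wanted]


triggers = ["Olivia"]
print(find_keywords("Did you see OLIVIA, yesterday?", triggers))  # ['olivia']
print(find_keywords("I really like olives.", triggers))           # []
```

Because whole words are compared, “olives” does not trigger a match, and adding more names to `triggers` needs no other changes.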
Next steps & conclusion
The next step for this project would be to find a way to use the data from both of the channels. I think that sending different, customized messages based on the conversation that triggered them would provide a better experience for the user and would stop the application from being repetitive.
Overall, IFTTT is an intuitive platform that uses simple and accessible logic, allowing many people to create bespoke triggers and responses. Some limitations, such as the limit of 100 SMS messages per month, may become an issue for an applet running on a grander scale. This web application is a simple demo of what can be achieved, with plenty of potential to be adapted for more bespoke applications.