Exploring PoseNet & ML5
During the workshop I was part of a group that explored PoseNet, which allows for real-time human pose estimation in the browser using the tensorflow.js library. Read more about it here. We were able to test PoseNet in the demo browser, and during these explorations I noticed that the program would slow down when using the multiple-pose capture feature. I also noticed that the skeleton drawn was fairly accurate regardless of how form-fitting or loose one’s clothing was. At the time we were not able to test the effect of different colors of clothing, as coincidentally all four of us had worn varying shades of gray. We attempted to download the Github repository found here, but we had a lot of trouble running the code; it requires a lot of dependencies and setup that we didn’t quite understand.
When I couldn’t get the demo working locally on my laptop, I tried following the Coding Train Hour of Code tutorial on using PoseNet, available here. In the tutorial Daniel Shiffman uses ml5.js and p5.js – ml5.js is a tensorflow.js wrapper that makes PoseNet and tensorflow.js more accessible to intermediate coders or people who haven’t had much experience with tensorflow.js. The tutorial is not, however, suitable for people who have never used p5.js, although in the video Shiffman links to other videos for complete beginners.
Insights from the tutorial:
In this tutorial I learned:
What is ml5.js? A wrapper for tensorflow.js that makes machine learning more approachable for creative coders and artists. It is built on top of tensorflow.js, runs in the browser, and requires no installed dependencies apart from the regular p5.js libraries. Learn more here
NOTE: To use ml5.js you need to be running a local server. If you don’t have a localhost setup you can test your code in the p5.js web editor – you’ll need to create an account.
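For reference, a minimal setup looks roughly like this – an index.html that pulls in p5.js and ml5.js from a CDN (the version numbers here are placeholders; use whatever is current) next to a sketch.js, served from a local server:

&lt;!DOCTYPE html&gt;
&lt;html&gt;
  &lt;head&gt;
    &lt;!-- p5.js and ml5.js from a CDN; versions here are assumptions --&gt;
    &lt;script src="https://cdnjs.cloudflare.com/ajax/libs/p5.js/1.4.0/p5.min.js"&gt;&lt;/script&gt;
    &lt;script src="https://unpkg.com/ml5@0.5.0/dist/ml5.min.js"&gt;&lt;/script&gt;
  &lt;/head&gt;
  &lt;body&gt;
    &lt;script src="sketch.js"&gt;&lt;/script&gt;
  &lt;/body&gt;
&lt;/html&gt;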
You can create your own Instagram-like filters! The aim of the tutorial was to create a clown-nose effect where a red nose would follow your nose on screen. In theory, once you master this tutorial you can create different effects, like adding a pair of sunglasses. I learned about the p5.js filter() function, which applies a filter to an image or video. I tested out THRESHOLD, which converts pixels to black or white depending on whether they fall below a certain threshold, and GRAY, which converts the video to grayscale. Usage is filter(THRESHOLD) or filter(GRAY).
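Here is a minimal sketch of the idea, assuming a webcam capture – filter() operates on whatever has been drawn to the canvas so far:

let video;

function setup() {
  createCanvas(640, 480);
  video = createCapture(VIDEO); // webcam feed
  video.size(width, height);
  video.hide(); // hide the raw DOM element; we draw the frames ourselves
}

function draw() {
  image(video, 0, 0, width, height); // draw the latest frame
  filter(THRESHOLD); // swap in filter(GRAY) for grayscale
}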
Pros & Cons of using a pre-trained model vs. a custom model? When using a pre-trained model like PoseNet, a lot of the work has already been done for you. Creating a custom model is beneficial only if you are looking to capture a particular pose, e.g. if you want to train the machine on your own body. But to do this you will need tons of data – think thousands or even hundreds of thousands of images, or 3D motion capture – to get it right. You could crowdsource the images, but then you have to think about issues of copyright and your own bias about who is in the images and where in the world they are. It is imperative to be ethical in your thinking and choices.
Another issue to keep in mind is the diversity of your source images, as this may cause problems down the line when it comes to recognizing different genders or races. Pre-trained models are not infallible either, and it is recommended that you test out models before you commit to them.
What are keypoints? These are 17 datapoints that PoseNet returns; they reference different locations in the body/skeleton of a pose. They are returned in an array where indices 0 to 16 each reference a particular part of the body, as shown below:
The array also includes additional information for the pose, such as the confidence score and the x,y coordinates of each keypoint. These keypoints are important, as they are how you determine where to generate your filter or effect, e.g. the clown nose.
source: TensorFlow here
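For reference, the parts corresponding to indices 0–16 in the keypoints array, along with the rough shape of each keypoint object:

// PoseNet keypoint indices 0–16, in the order they appear in the keypoints array
const PART_NAMES = [
  'nose', 'leftEye', 'rightEye', 'leftEar', 'rightEar',
  'leftShoulder', 'rightShoulder', 'leftElbow', 'rightElbow',
  'leftWrist', 'rightWrist', 'leftHip', 'rightHip',
  'leftKnee', 'rightKnee', 'leftAnkle', 'rightAnkle'
];

// each keypoint object looks roughly like:
// { score: 0.99, part: 'nose', position: { x: 301.4, y: 142.8 } }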
Some keypoint readings and confidence scores recorded from the motion capture of the image above, of me sitting down, were as follows. These results are printed to the console and are shown here with the array expanded: 0.99 “leftEye”, 0.84 “rightEye”, 0.97 “leftEar”, 0.41 “rightEar”, 0.01 “leftShoulder”, 0.00 “rightShoulder” … 0.02 “leftHip”.
Once I determined that ml5 was working correctly, I drew the clown nose – a red ellipse drawn at the x and y coordinates of my nose. To do this I used the keypoint data at index 0 of the keypoints array, which corresponds to the nose. I first needed to access index 0 of the poses array, which holds all the detected poses; this gives me the latest pose. Once I had the latest pose, I used the following to update the global variables noseX and noseY:
noseX = poses[0].pose.keypoints[0].position.x;
noseY = poses[0].pose.keypoints[0].position.y;
The nose-following crashes when you go off screen! You need an if statement to check whether at least one pose has been found; otherwise the nose gets stuck at the last place you were on screen. See the poses.length check in the sketch below.
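Putting the pieces together, a minimal version of the sketch might look like this (based on the ml5.poseNet webcam pattern from the tutorial; exact API details depend on your ml5 version):

let video;
let poses = [];
let noseX = 0;
let noseY = 0;

function setup() {
  createCanvas(640, 480);
  video = createCapture(VIDEO);
  video.size(width, height);
  video.hide();

  // load the PoseNet model and listen for results
  const poseNet = ml5.poseNet(video, () => console.log('model ready'));
  poseNet.on('pose', results => {
    poses = results; // array of detected poses; index 0 is the latest pose
  });
}

function draw() {
  image(video, 0, 0, width, height);
  if (poses.length > 0) { // guard: only update when a pose was found
    noseX = poses[0].pose.keypoints[0].position.x;
    noseY = poses[0].pose.keypoints[0].position.y;
  }
  fill(255, 0, 0);
  noStroke();
  ellipse(noseX, noseY, 50, 50); // the clown nose
}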
The red nose is too bouncy! I noticed that the red nose was a little jumpy as it moved from position to position. To fix this, I used the lerp function to smooth the values so that the nose doesn’t jump immediately to a new position. The value to use in the lerp function depends on what looks good to you. I tried 0.2 at first, but this was too choppy, so I upped it to 0.5. Since I knew how to detect the nose, I then tracked an additional keypoint, my left eye, which is at index 1.
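Inside draw(), once a pose is available, the smoothing looks roughly like this:

// assuming pose = poses[0].pose, as in the sketch above
const newX = pose.keypoints[0].position.x; // latest raw nose position
const newY = pose.keypoints[0].position.y;

// lerp(current, target, amt): amt near 1 tracks tightly, near 0 lags and smooths more
noseX = lerp(noseX, newX, 0.5);
noseY = lerp(noseY, newY, 0.5);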
Red nose is out of proportion! I learned that the distance between keypoints is bigger when you are closer to the camera and smaller when you are further away, which caused the fixed-size nose to look really big when I was far away and really small when I was close. To fix this I needed to estimate the camera distance and draw the nose proportional to the distance between my eye and nose keypoints. This corrects the proportions so that up close the nose is big, and far away it shrinks.
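A sketch of the proportional version, using the eye-to-nose distance as a stand-in for camera distance (the multiplier of 2 is a guess that would need tuning):

// assuming pose = poses[0].pose
const nose = pose.keypoints[0].position;    // index 0: nose
const leftEye = pose.keypoints[1].position; // index 1: left eye

// this distance grows as you approach the camera and shrinks as you back away
const d = dist(nose.x, nose.y, leftEye.x, leftEye.y);
fill(255, 0, 0);
noStroke();
ellipse(noseX, noseY, d * 2, d * 2); // nose diameter scales with apparent face size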
Proportions are off
It is possible to continue adding effects – e.g. I could create sunglasses or a hat to go with my red nose. I didn’t like this approach, however, as it works best only for selfies and not full-body poses: there are too many keypoints to keep track of when attempting to create a unique effect at each point, especially with the addition of lerping. You can also create an effect where there is no keypoint – e.g. there is no keypoint for the top of your head, but you can use the positions of the right and left eyes to estimate where a hat should go, as sketched below.
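A rough sketch of the hat idea – the offsets here are hypothetical and would need tuning:

// assuming pose = poses[0].pose
const leftEye = pose.keypoints[1].position;  // index 1
const rightEye = pose.keypoints[2].position; // index 2

const eyeDist = dist(leftEye.x, leftEye.y, rightEye.x, rightEye.y);
const headX = (leftEye.x + rightEye.x) / 2;             // centered between the eyes
const headY = min(leftEye.y, rightEye.y) - eyeDist * 2; // a guess at the top of the head
rect(headX - eyeDist, headY, eyeDist * 2, eyeDist);     // placeholder hat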
Video Classification Example
I was toying around with the idea of having the algorithm detect an image in a video, and explored video classification. It quickly dawned on me that this was a case for a custom model, as the pre-trained model seemed to work best only when generic objects were in view – e.g. at times it recognized my face as a basketball, my hand as a band-aid, my hair as an abaya, etc. I also noticed that if I brought the objects closer to the screen, the detection was slightly better. Below are some of my findings using MobileNet video classification in p5.
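For reference, a minimal version of the MobileNet video classification setup (based on the ml5 imageClassifier API; exact signatures vary between ml5 versions):

let video;
let label = 'waiting...';

function setup() {
  createCanvas(640, 480);
  video = createCapture(VIDEO);
  video.size(width, height);
  video.hide();

  // load MobileNet and start classifying the video feed
  const classifier = ml5.imageClassifier('MobileNet', video, () => {
    classifyVideo(classifier);
  });
}

function classifyVideo(classifier) {
  classifier.classify((error, results) => {
    if (!error) {
      // results[0] is the top guess, e.g. { label: 'basketball', confidence: 0.42 }
      label = results[0].label;
    }
    classifyVideo(classifier); // classify the next frame
  });
}

function draw() {
  image(video, 0, 0, width, height);
  fill(255);
  textSize(16);
  text(label, 10, height - 10); // show the current top label
}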
Ideation & Exploring PoseNet with Webcam:
I wanted to leverage the power of PoseNet to track poses in music videos, but also to subvert its usage to create a trivia game that I called Name That Singer. The idea was to create a video that showed only the pose skeletons dancing, and a viewer would have to guess who the singer was based on the poses on screen. I chose a viral video – Beyonce’s Single Ladies – that I assumed would be easy to figure out. I didn’t take into account how fast the dancers in the video were moving, and this made it hard to determine which song was playing when only the skeletons were showing on screen.
For this part, I decided not to use the lerp function to create a unique effect, and instead used the pre-written functions for PoseNet with webcam in ml5.js to draw the keypoints and skeleton. These pre-written functions were beneficial in this case, as my points and skeletons are identical in aesthetic, so I was able to cut down on the coding needed. I followed the tutorial here, and instead of using the webcam I loaded my own videos, as sketched below.
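The swap is small: instead of createCapture(VIDEO), load a video file with p5’s createVideo (the file path here is just a placeholder):

let video;
let poses = [];

function setup() {
  createCanvas(640, 480);

  // load a local video file instead of the webcam
  video = createVideo('assets/single-ladies.mp4', () => {
    video.loop();
    video.volume(0);
  });
  video.size(width, height);
  video.hide();

  const poseNet = ml5.poseNet(video, () => console.log('model ready'));
  poseNet.on('pose', results => (poses = results));
}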
Below are some screenshots from my testing. I also tested the poses when filters such as threshold, invert, and blur were added to the video, and found that the tracking was still really good – even with cartoons.
Artists/Creative Coding Projects:
Chris Sugrue – She is an artist and programmer working across the fields of interactive installations, audio-visual performances, and experimental interfaces. website
source: Chris Sugrue
Delicate Boundaries – Light bugs crawl off a computer screen onto human bodies as people touch the screen, exploring how our bodies might interact with the virtual world if the world in our digital devices could move into our physical world.
I liked this project [Delicate Boundaries] because it explores beyond the computer screen; it could be cool to do something similar with PoseNet, where instead of just mapping onto the screen, poses are mapped onto the body.
Real-Time Human Pose Estimation in the Browser with TensorFlow.js – here
PoseNet with webcam in ML5.js – here
Github code: here