Exploring PoseNet, p5.js & ml5.js

For this workshop I chose to continue my explorations of PoseNet, which allows for real-time human pose estimation in the browser. I had started learning the framework in the Body-Centric Technologies class. I wanted to work with still images, but I couldn’t get the PoseNet model to track whenever I used images, so my workaround was to group the images into a video. My idea was to explore how women pose on magazine covers by comparing poses from different fashion magazine covers. Below is a video showing the final results of the poses that were captured when working with the “PoseNet with webcam” example.

I found that the body tracking worked well only when the video size was set to a width of 640 pixels and a height of 480 pixels, which are the dimensions used in the ml5 examples.
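As a rough illustration of that setup, here is a minimal sketch modelled on the ml5 “PoseNet with webcam” example. It assumes the pre-1.0 ml5 API (`ml5.poseNet` and its `'pose'` event) and that p5.js and ml5.js are already loaded on the page; the constant names are my own, and this is not necessarily the exact code I ran.

```javascript
// Assumed dimensions: the ones used in the ml5 PoseNet examples.
const VIDEO_WIDTH = 640;
const VIDEO_HEIGHT = 480;

let video;      // p5 capture element that PoseNet reads frames from
let poseNet;    // ml5 PoseNet instance (pre-1.0 ml5 API assumed)
let poses = []; // most recent detections, updated on every 'pose' event

function setup() {
  createCanvas(VIDEO_WIDTH, VIDEO_HEIGHT);

  // Keep the video the same size as the canvas so the keypoint
  // coordinates line up with what is drawn.
  video = createCapture(VIDEO);
  video.size(VIDEO_WIDTH, VIDEO_HEIGHT);
  video.hide(); // hide the raw element; it is drawn onto the canvas instead

  poseNet = ml5.poseNet(video, () => console.log('PoseNet model loaded'));
  poseNet.on('pose', (results) => {
    poses = results;
  });
}

function draw() {
  image(video, 0, 0, VIDEO_WIDTH, VIDEO_HEIGHT);

  // Mark every keypoint of every detected pose with a small circle.
  for (const { pose } of poses) {
    for (const keypoint of pose.keypoints) {
      ellipse(keypoint.position.x, keypoint.position.y, 10, 10);
    }
  }
}
```

To run it on a pre-recorded video instead of the webcam, swapping `createCapture(VIDEO)` for p5’s `createVideo()` with your own file should be enough, keeping the same 640 x 480 size.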

What is ml5.js? It is a wrapper for TensorFlow.js that makes machine learning more approachable for creative coders and artists. It runs in the browser and requires no installed dependencies apart from the regular p5.js libraries.

NOTE: To use ml5.js you need to be running a local server. If you don’t have a localhost setup you can test your code in the p5.js web editor, for which you’ll need to create an account.

I also found that the multi-pose tracking seemed to top out at 3 poses, even when more than 3 people were in the frame. Additionally, the cover model’s skin color affected the tracking, so that at times some body parts were not tracked. The model’s clothes also affected whether some parts were tracked or not: at times the model’s limbs were ignored, or the clothes were tracked as additional limbs. The keypoints seemed to be detected all the time, but the lines for the skeleton were not always completed.

What are keypoints? These are the 17 data points that PoseNet returns, each referencing a different location on the body/skeleton of a pose. They are returned in an array whose indices 0 to 16 each correspond to a particular part of the body, e.g. index 0 contains the results for the nose: its x and y coordinates and a confidence score for the detection.
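To make the keypoint array more concrete, here is a small sketch of how it can be read. It assumes the result shape returned by the pre-1.0 ml5 PoseNet wrapper, where each keypoint looks like `{ part, score, position: { x, y } }`. The helper names and the 0.2 threshold are my own choices for illustration and are not part of ml5.

```javascript
// The 17 PoseNet keypoints, in the order they appear in the keypoints array
// (index 0 is the nose, index 16 is the right ankle).
const PART_NAMES = [
  'nose', 'leftEye', 'rightEye', 'leftEar', 'rightEar',
  'leftShoulder', 'rightShoulder', 'leftElbow', 'rightElbow',
  'leftWrist', 'rightWrist', 'leftHip', 'rightHip',
  'leftKnee', 'rightKnee', 'leftAnkle', 'rightAnkle',
];

// Hypothetical helper: describe the keypoint stored at a given index.
function describeKeypoint(keypoints, index) {
  const kp = keypoints[index];
  const x = Math.round(kp.position.x);
  const y = Math.round(kp.position.y);
  return `${PART_NAMES[index]}: x=${x}, y=${y}, score=${kp.score.toFixed(2)}`;
}

// Hypothetical helper: keep only the keypoints PoseNet is reasonably sure of.
// The default threshold is an arbitrary value chosen for this example.
function confidentKeypoints(keypoints, minScore = 0.2) {
  return keypoints.filter((kp) => kp.score >= minScore);
}
```

Inside a `'pose'` callback, something like `describeKeypoint(poses[0].pose.keypoints, 0)` would report where the nose of the first detected pose is and how confident the model is about it.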

Below are some of the images I tested with:


I’d like to continue working on this; however, I would like to explore OpenPose, a framework like PoseNet that tracks more keypoints than PoseNet’s 17. From my work with PoseNet so far, I find that it is most useful when you aren’t tracking a whole skeleton but are doing something with the data returned for individual keypoints, e.g. the right eye is at this x and y position, so perform a certain action.
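That “right eye is here, so do something” idea could look roughly like the sketch below. It assumes the same pre-1.0 ml5 keypoint shape (`{ part, score, position: { x, y } }`); `findPart`, `isPartInRegion`, the region object and the 0.5 score cutoff are all hypothetical names and values made up for this example.

```javascript
// Hypothetical helper: find a keypoint by its part name, e.g. 'rightEye'.
function findPart(pose, partName) {
  return pose.keypoints.find((kp) => kp.part === partName);
}

// Hypothetical helper: true when the named part is detected with enough
// confidence and sits inside a rectangular region { x, y, w, h } in pixels.
function isPartInRegion(pose, partName, region, minScore = 0.5) {
  const kp = findPart(pose, partName);
  if (!kp || kp.score < minScore) {
    return false; // part missing or detection too uncertain to act on
  }
  const { x, y } = kp.position;
  return (
    x >= region.x && x <= region.x + region.w &&
    y >= region.y && y <= region.y + region.h
  );
}

// Example use inside a 'pose' callback (zone = top-left quarter of 640 x 480):
//   const zone = { x: 0, y: 0, w: 320, h: 240 };
//   if (isPartInRegion(poses[0].pose, 'rightEye', zone)) { /* do action */ }
```

Checking the score before acting keeps a shaky detection from triggering the action by accident.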

I tried some of the other ml5 examples; however, I wasn’t satisfied with the results. I was particularly interested in the style transfer and the interactive text generator. However, I found that for them to be useful to me I would have to train my own custom models, and I didn’t have the time or an adequate dataset to do this.

I also tried out the video classification example, where I was toying with the idea of having the algorithm detect a particular image in a video. It quickly dawned on me that this was a case for a custom model, as the pre-trained model seemed to work best only when generic objects were in view. At times it recognized my face as a basketball, my hand as a band-aid, my hair as an abaya, etc. I also noticed that if I brought the objects closer to the screen, the detection was slightly better. Below are some of my findings using MobileNet Video Classification in p5.
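One way to soften misclassifications like these is to ignore guesses the model itself isn’t sure about. The sketch below assumes the classifier hands back an array of `{ label, confidence }` objects, as the ml5 image classifier did in the versions I’m aware of. `confidentLabel` and the 0.6 cutoff are hypothetical, and a filter like this doesn’t make the model any more accurate; it only hides low-confidence guesses.

```javascript
// Hypothetical helper: return the most confident label, or null when even
// the best guess falls below the cutoff (an arbitrary value for this example).
function confidentLabel(results, minConfidence = 0.6) {
  if (!results || results.length === 0) {
    return null;
  }
  // Pick the highest-confidence entry without assuming the array is sorted.
  const best = results.reduce((a, b) => (b.confidence > a.confidence ? b : a));
  return best.confidence >= minConfidence ? best.label : null;
}

// Example with made-up numbers: a face mistaken for a basketball at low
// confidence is dropped instead of being shown.
//   confidentLabel([{ label: 'basketball', confidence: 0.31 }])  // null
```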


Pros & Cons of using a pre-trained model vs. a custom model? When using a pre-trained model like PoseNet with TensorFlow.js, a lot of the work has already been done for you. Creating a custom model is beneficial only if you are looking to capture a particular pose, e.g. if you want to train the machine on your own body, but to do this you will need tons of data. Think thousands or even hundreds of thousands of images, or 3D motion capture, to get it right. You could crowdsource the images; however, you have to think about issues of copyright and about your own bias regarding who is in the images and where they are in the world. It is imperative to be ethical in your thinking and choices.

Another issue to keep in mind is the diversity of your source images, as a lack of it may cause problems down the line when it comes to recognizing different genders or races. Pre-trained models are not infallible either, and it is recommended that you test out models before you commit to them.
