Workshop 6 Notes
PoseNet Experiments: Replacing the mouse with tracked body points, and using gestures as controllers with PoseNet.
I was introduced to PoseNet through Daniel Shiffman’s Coding Train example, which uses PoseNet with p5.js. The example was thorough and explored these basic concepts:
- What does PoseNet track?
- How do you extract that data?
- How can you use the key data points found through computer vision in relation to one another?
A clip of myself following Daniel Shiffman’s PoseNet example. Note how the code isolates specific points from the “keypoints” array.
PoseNet uses computer vision to locate seventeen key points on your body (nose, eyes, ears, shoulders, elbows, wrists, hips, knees, and ankles) as pixel positions, each with a confidence score. The PoseNet library can identify the poses of multiple people in a webcam feed. My explorations focus on the actions and gestures of a single participant.
This data is then translated into JSON, which the PoseNet library exposes as a data structure that you set up at the beginning of your code.
The JSON gives you x and y coordinates for the position of each tracked point. Because the JSON is consistently structured, with each keypoint always in the same place, you can reliably pick out specific points to use as inputs.
I found that the points behave much like the p5.js mouseX and mouseY coordinates, so I decided to run a series of small experiments to see how gestures could stand in for mouse movement on the screen.
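To make the data structure concrete, here is a minimal sketch in plain JavaScript. It assumes the pose shape exposed by ml5.js’s poseNet wrapper, where each keypoint is a `{ part, score, position: { x, y } }` object (the TensorFlow.js PoseNet model reports the same per-keypoint fields). The sample values are invented and only three of the seventeen keypoints are listed, so the hypothetical `getKeypoint` helper searches by `part` name rather than by index. The point it returns can stand in for `mouseX` and `mouseY`.

```javascript
// Hypothetical sample pose shaped like PoseNet output: each keypoint has a
// part name, a confidence score (0 to 1) and a pixel position.
// Values are invented; a real pose lists all seventeen keypoints.
const samplePose = {
  score: 0.82,
  keypoints: [
    { part: "nose", score: 0.99, position: { x: 320, y: 180 } },
    { part: "rightWrist", score: 0.74, position: { x: 140, y: 95 } },
    { part: "leftKnee", score: 0.31, position: { x: 400, y: 460 } },
  ],
};

// Hypothetical helper: find a keypoint by part name and return it as a
// mouse-like {x, y} point, or null if it is missing or below minScore.
function getKeypoint(pose, part, minScore = 0.5) {
  const kp = pose.keypoints.find((k) => k.part === part);
  if (!kp || kp.score < minScore) return null;
  return { x: kp.position.x, y: kp.position.y };
}

// In a p5.js sketch, this point could replace mouseX / mouseY in draw().
const wrist = getKeypoint(samplePose, "rightWrist");
console.log(wrist); // { x: 140, y: 95 }
```

The confidence cut-off is a design choice: low-scoring points tend to jump around, so ignoring them keeps a body-driven “cursor” steadier.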
1) Screen position & interaction
I wanted to see how I could use the space of the screen to interact with my body as PoseNet analyzed it through the webcam. I wanted to explore raising my hand to see if I could activate an event within part of the screen. Interestingly, PoseNet does not track a user’s hand, but it does track their wrist. I decided to experiment with the wrist point, as it was the closest to the hand gesture I wanted to explore.
I used the method in Shiffman’s tutorial to isolate my right wrist (keypoint 10). I then divided the webcam screen into four quadrants and read the x and y coordinates of my right wrist. If those coordinates fell within the top left* quadrant, an ellipse appeared on my wrist.
*Note: the webcam image is not mirrored, so it appears flipped to the user: the right hand is shown on the left side of the screen.
Video documenting myself activating the top left quadrant with my hand.
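The quadrant test can be sketched as a small pure function. The name `quadrantOf` and the 640 by 480 canvas are assumptions for illustration; in p5.js, `width` and `height` would normally be the canvas dimensions. Because the webcam image is not mirrored, "topLeft" refers to the left of the canvas, which is where the user’s right wrist appears.

```javascript
// Hypothetical helper: name the quadrant of a width x height canvas that
// contains (x, y). Canvas origin is top-left, with y growing downwards.
function quadrantOf(x, y, width, height) {
  const vertical = y < height / 2 ? "top" : "bottom";
  const horizontal = x < width / 2 ? "Left" : "Right";
  return vertical + horizontal; // "topLeft", "topRight", "bottomLeft" or "bottomRight"
}

// In draw(), with rightWrist taken from pose.keypoints[10].position:
//   if (quadrantOf(rightWrist.x, rightWrist.y, width, height) === "topLeft") {
//     ellipse(rightWrist.x, rightWrist.y, 40);
//   }
console.log(quadrantOf(140, 95, 640, 480)); // prints topLeft
```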
I expanded on this idea by seeing how I could use this gesture as a controller to start and stop an action. I chose to draw an ellipse centred on my nose. On page load, the colour of the ellipse was chosen at random. If the coordinates of my right wrist were within the top left quadrant, the colour of my nose ellipse would start to change.
Changing the colour of my nose by raising my right wrist, which randomly chooses the red, green, and blue values.
When the wrist leaves the quadrant, the nose keeps the last colour that was randomly chosen. This was a quick proof of concept. The loop runs so quickly that the user has little control over which colour is selected.
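A minimal sketch of this start/stop logic follows, assuming a hypothetical `updateNoseColour` function called once per frame from `draw()`. While the wrist is in the quadrant it returns a fresh random colour; otherwise it returns the current colour unchanged, which is why the nose keeps the last colour chosen. Since p5.js targets 60 frames per second by default, the colour can change up to about 60 times a second, which matches the lack of control described above.

```javascript
// Hypothetical helper: pick a random {r, g, b} colour with channels 0 to 255.
// `rand` defaults to Math.random and can be swapped for a seeded generator.
function randomColour(rand = Math.random) {
  const channel = () => Math.floor(rand() * 256);
  return { r: channel(), g: channel(), b: channel() };
}

// Hypothetical per-frame update: while the wrist is inside the trigger
// quadrant a new colour is chosen every frame; once it leaves, the last
// colour is kept.
function updateNoseColour(current, wristInQuadrant, rand = Math.random) {
  return wristInQuadrant ? randomColour(rand) : current;
}

let noseColour = randomColour(); // chosen once on page load
// Simulated frames: wrist in, in, out, out.
for (const inQuadrant of [true, true, false, false]) {
  noseColour = updateNoseColour(noseColour, inQuadrant);
}
console.log(noseColour); // the colour picked on the second frame
```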
The right wrist was an interesting keypoint to choose. I found myself always focusing on my hand rather than my wrist. During testing, I noticed that I had to raise my hand higher than I anticipated to activate the quadrant. I do not think this is due to tracking accuracy, but rather to the gap between my own sense and the machine’s understanding of where my wrist starts and ends.
2) Buttons (almost)
Having isolated a specific quadrant of the page to start and stop actions, I was curious to see if I could emulate a button click and give further didactic information through HTML inputs.
I created a button that would change the colour of the nose on hover. I found the button’s coordinates, width, and height, and compared them to the position of the nose keypoint. For this experiment, I chose the nose rather than the wrist because I wanted a one-to-one relationship between the gesture and the action it controls (rather than having one body part control the output of another).
As in the previous experiment, p5.js generated a random colour for the nose. When my nose moved over the button labelled “change colour of nose”, the nose ellipse changed colour.
PoseNet tracking my nose and interacting with a button created in p5.js
This button is an artificial stand-in, since no mouse press event is fired; the interaction is detected purely through position. I considered faking a mouse press, but chose not to because this design already produced the desired result.
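The hover check reduces to a point-in-rectangle test. In this hypothetical sketch, `isInsideRect` is not a p5.js function and the button bounds are invented values. It also assumes the button and the pose share one coordinate space (for example, with the canvas placed at the top-left of the page, since p5.js’s `.position()` places DOM elements relative to the window rather than the canvas).

```javascript
// Hypothetical helper: true if a point lies inside an axis-aligned
// rectangle given by its top-left corner (x, y), width w and height h.
function isInsideRect(point, rect) {
  return (
    point.x >= rect.x &&
    point.x <= rect.x + rect.w &&
    point.y >= rect.y &&
    point.y <= rect.y + rect.h
  );
}

// Example button bounds (invented values). In p5.js these could describe a
// button made with createButton() and placed with .position(x, y).
const button = { x: 20, y: 20, w: 160, h: 40 };
const nose = { x: 90, y: 35 };
if (isInsideRect(nose, button)) {
  console.log("hover: change colour of nose");
}
```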
3) Mapping
My next experiment took gesture further by exploring mapping. I found the nose ellipse colour to be an effective representation of action on the page, so I chose to map the coordinates of my wrist across the screen to the colour of the nose ellipse.
A video of my wrist controlling the colour of my nose through a sweeping gesture.
I enjoyed how this emulated a slider. p5.js can create sliders (createSlider()) for volume and other values, such as RGB channels, and this gesture almost replaces that functionality. The next step for this gesture would be to “lock” the value of the slider rather than having a continuously mapped value.
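The mapping itself is a linear re-scale. Below is a hedged sketch of a `mapRange` function written to behave like p5.js’s built-in `map(value, start1, stop1, start2, stop2, withinBounds)`, plus a hypothetical `redFromWrist` that maps the wrist’s x coordinate on an assumed 640-pixel-wide canvas to the red channel. The notes do not say which colour channels were mapped, so the choice of red is illustrative.

```javascript
// Linear re-mapping in the style of p5.js map(): value moves from the range
// [start1, stop1] to [start2, stop2], optionally clamped to the output range.
function mapRange(value, start1, stop1, start2, stop2, withinBounds = false) {
  const mapped = start2 + ((value - start1) / (stop1 - start1)) * (stop2 - start2);
  if (!withinBounds) return mapped;
  const lo = Math.min(start2, stop2);
  const hi = Math.max(start2, stop2);
  return Math.min(Math.max(mapped, lo), hi);
}

// Hypothetical mapping: wrist x across a 640 px wide canvas drives the red
// channel. Clamping keeps a valid 0 to 255 value even if the wrist is
// reported slightly outside the canvas.
const canvasWidth = 640;
function redFromWrist(wristX) {
  return Math.round(mapRange(wristX, 0, canvasWidth, 0, 255, true));
}

console.log(redFromWrist(0), redFromWrist(320), redFromWrist(700)); // 0 128 255
```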
4) Two body points interacting together
For this experiment, I wanted to explore two questions: first, how to create an “interaction” between two points; second, whether that interaction could trigger an output such as a sound.
To test this, I tried using my knee and my elbow to trigger a gong sound when they “collided”.
I started by finding the coordinates of the knee and the elbow. The x and y coordinates describe exact points, not ranges, and no threshold for contact is given. Unlike in the previous experiments, these two sets of coordinates are not tied to regions of the screen; they need to respond in relation to each other. This posed a problem: how would I determine whether the knee and elbow were colliding? Requiring both points to have exactly the same coordinates would be too difficult and unintuitive. Instead, I decided that the knee would need to pass the y coordinate of the elbow by a small margin to count as a connection between the two points.
For the visual representation of this experiment, nothing was drawn on the screen at first. When the knee and the elbow “interacted”, an ellipse was drawn on each of them.
A video of my knee and elbow attempting to trigger the drawing of two ellipses.
This worked on roughly every other attempt. I added ten extra pixels in each direction to see if expanding the range would help. It helped only marginally.
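One reading of the collision rule is sketched below: the knee counts as meeting the elbow once its y coordinate rises to within a tolerance of the elbow’s. The 10-pixel default comes from the extra range described above; applying it only to the y axis, and the name `kneeMeetsElbow`, are assumptions, and the notes do not say which side’s knee and elbow were used.

```javascript
// Hypothetical collision rule: the knee "meets" the elbow once it has risen
// to within `tolerance` pixels of the elbow's height. Canvas y grows
// downwards, so a raised knee has a smaller y than a resting one.
// Only the y axis is checked here (an assumption).
function kneeMeetsElbow(knee, elbow, tolerance = 10) {
  return knee.y <= elbow.y + tolerance;
}

const elbow = { x: 300, y: 250 };
console.log(kneeMeetsElbow({ x: 310, y: 420 }, elbow)); // false (knee down)
console.log(kneeMeetsElbow({ x: 310, y: 258 }, elbow)); // true (within 10 px)
```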
My next step was adding sound. I found a free gong sound file and loaded it into p5. Unfortunately, this did not work as I wanted. Because the sound was triggered with the .play() function inside the draw loop, which runs about 60 times a second, any collision lasting longer than one frame (about 1/60 of a second) called .play() again on every frame. Fixing this was the beginning of a journey of callbacks and booleans that I decided not to include, as it was distracting me from exploring PoseNet.
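The repeated `.play()` calls can be avoided by triggering on the rising edge of the collision: remember the previous frame’s state and fire only when it changes from false to true. This is a hedged sketch of that idea, not an approach the notes ended up using; `createEdgeTrigger`, `collided`, and `gong` are hypothetical names, with `gong` assumed to be a p5.sound `SoundFile` loaded via `loadSound()`.

```javascript
// Hypothetical edge detector: remembers last frame's state and reports
// true only on the frame where `touching` switches from false to true.
function createEdgeTrigger() {
  let wasTouching = false;
  return function onFrame(touching) {
    const justStarted = touching && !wasTouching;
    wasTouching = touching;
    return justStarted;
  };
}

// In a p5.js draw() loop this could guard the gong, where `collided` is a
// hypothetical boolean computed from the knee and elbow each frame:
//   if (gongTrigger(collided)) gong.play();
const gongTrigger = createEdgeTrigger();
const frames = [false, true, true, true, false, true];
console.log(frames.map(gongTrigger)); // [ false, true, false, false, false, true ]
```

Checking `gong.isPlaying()` before calling `.play()` is a simpler alternative, though it only prevents overlapping playback: the gong would strike again if it finished while the knee and elbow were still touching.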
The knee-and-elbow collision was not as satisfying an action, nor as useful a controller, as I had hoped. Still, the action itself is very enjoyable, and the sound of a gong would turn the body into an instrument. This action would work better if the sound played continuously and the body controlled its volume or other aspects, rather than “starting” and “stopping” the audio.
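The continuous alternative could look like the hypothetical sketch below, in which the distance between the knee and the elbow sets the volume of a looping sound instead of starting it. The `maxDistance` of 300 pixels and the name `volumeFromDistance` are assumptions; `loop()` and `setVolume()` are p5.sound `SoundFile` methods, shown only in comments so the function runs on its own.

```javascript
// Hypothetical mapping: the closer the knee is to the elbow, the louder the
// sound. Returns a volume from 0 (maxDistance or more apart) to 1 (touching).
function volumeFromDistance(knee, elbow, maxDistance = 300) {
  const d = Math.hypot(knee.x - elbow.x, knee.y - elbow.y);
  return Math.max(0, 1 - d / maxDistance);
}

// With p5.sound, a looping SoundFile could then be driven each frame:
//   gong.loop();                                     // once, after a user gesture
//   gong.setVolume(volumeFromDistance(knee, elbow)); // every frame in draw()
console.log(volumeFromDistance({ x: 0, y: 0 }, { x: 0, y: 150 })); // 0.5
console.log(volumeFromDistance({ x: 0, y: 0 }, { x: 0, y: 450 })); // 0
```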
PoseNet provides a useful data model that lets people locate their key body points on a screen. The points are accurate coordinates that can be used to interact with the screen without a mouse. The experiments I conducted are initial explorations into using PoseNet data as input. They can serve as basic models for common mapping functions, mouse clicks, and multiple input events. The colour-changing nose output could be replaced with outputs such as sliders, changing filters, or starting and stopping events. PoseNet is a powerful tool that can take complex physical data and create a virtual map of coordinates for people to interact with online.
My next step in these explorations would be to explore more complex and combined patterns of interaction. All of these explorations are single actions that produce one result. PoseNet tracks seventeen keypoints per person, and that number multiplies with every additional person on the same screen. Multiple poses could lead to interesting group cooperation exercises, such as all participants raising their left wrist to create an output. The basic input and output experiments still apply, though the logic would need to change to account for multiple bodies.
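A group rule like this could be sketched as below, assuming an array of pose objects in the same `{ part, score, position }` keypoint format (with ml5.js’s poseNet, something like `results.map((r) => r.pose)`). Treating a wrist as “raised” when it is above the same side’s shoulder, the `minScore` cut-off, and the names `findPart` and `everyoneRaisedLeftWrist` are all assumptions for illustration.

```javascript
// Hypothetical helper: find a keypoint object by part name.
function findPart(pose, part) {
  return pose.keypoints.find((k) => k.part === part);
}

// Hypothetical group rule: every detected person must have their left wrist
// above their left shoulder (smaller y on the canvas) with enough confidence.
function everyoneRaisedLeftWrist(poses, minScore = 0.5) {
  if (poses.length === 0) return false; // nobody in frame: no output
  return poses.every((pose) => {
    const wrist = findPart(pose, "leftWrist");
    const shoulder = findPart(pose, "leftShoulder");
    if (!wrist || !shoulder) return false;
    if (wrist.score < minScore || shoulder.score < minScore) return false;
    return wrist.position.y < shoulder.position.y;
  });
}

// Invented two-person example.
const kp = (part, x, y, score = 0.9) => ({ part, score, position: { x, y } });
const group = [
  { keypoints: [kp("leftWrist", 100, 80), kp("leftShoulder", 110, 200)] },
  { keypoints: [kp("leftWrist", 400, 260), kp("leftShoulder", 390, 210)] },
];
console.log(everyoneRaisedLeftWrist(group)); // false: second wrist is low
```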
Overall, gesture-based controllers seem futuristic, but the technology has become extremely accessible through PoseNet and built-in webcams. PoseNet offers an easy introduction to browser-based computer vision applications. These experiments offer a basic introduction to common gesture interactions that PoseNet now makes possible.