Pose estimation, or the ability to detect humans and their poses from image data, is one of the most exciting — and most difficult — topics in machine learning and computer vision. Recently, Google shared PoseNet: a state-of-the-art pose estimation model that provides highly accurate pose data from image data (even when those images are blurry, low-resolution, or in black and white). This is the story of the experiment that prompted us to create this pose estimation library for the web in the first place.
Months ago, we prototyped a fun experiment called Move Mirror that lets you explore images in your browser, just by moving around. The experiment creates a unique, flipbook-like experience that follows your moves and reflects them with images of all kinds of human movement, from sports and dance to martial arts, acting, and beyond. We wanted to release the experience on the web, let others play with it, learn about machine learning, and share the experience with friends. Unfortunately, we faced a problem: a publicly accessible web-specific model for pose estimation did not exist.
With PoseNet out in the wild, we can finally release Move Mirror – a project that is a testament to the value that experimentation and play can add to serious engineering work. It was only through a true collaboration between research, product, and creative teams that we were able to build PoseNet and Move Mirror.
Move Mirror is an AI Experiment that finds your pose and matches your moves with thousands of images from around the world.
Read on for an in-depth look at how we made the experiment, what excites us about pose estimation in the browser, and the ideas on the horizon.
What is pose estimation? What is PoseNet?
As you might guess, pose estimation is a pretty complex issue: humans come in different shapes and sizes; have many joints to track (and many different ways those joints can articulate in space); and are often around other people and/or objects, leading to visual occlusion. Some people use assistive devices like wheelchairs or crutches, which may block the camera’s view of their bodies; others might not have certain limbs; and still others may have very different proportions. We want our machine learning models to be able to understand and smartly infer data about all these different bodies.
Here, you can see PoseNet’s joint detection results on folks who are using assistive devices (like canes, wheelchairs, and prosthetic limbs).
In the past, technologists have approached the problem of pose estimation using special cameras and sensors (like stereoscopic imagery, mocap suits, and infrared cameras) as well as computer vision techniques that can extract pose estimates from 2D images (like OpenPose). These solutions, while effective, tend to require expensive technology that is not widely distributed, familiarity with computer vision libraries and C++ or Python, or both. This makes it harder for the average developer to quickly get started with playful pose experiments.
This was the perfect opportunity, we realized, to connect TensorFlow.js to PoseNet. TensorFlow.js would allow users to run machine learning models right in their browser, with no server required. By porting PoseNet to TensorFlow.js, anyone with a decent webcam-equipped desktop or phone could experience and play with this technology, right from within a web browser, without having to worry about low-level computer vision libraries or setting up complicated backends and APIs. Working closely with Nikhil Thorat and Daniel Smilkov of the TensorFlow.js team, Google researchers George Papandreou and Tyler Zhu, and Dan Oved, we were able to port a version of the PoseNet model to TensorFlow.js. (You can read more about that process here.)
A few things that made us super excited about PoseNet in TensorFlow.js:
Ubiquity/Accessibility: Most developers have access to a text editor and a web browser, and usage of PoseNet is as simple as including two script tags in your HTML file — no fancy server setup required. You also don’t need any special high-res or infrared cameras or sensors to get data — in fact, we found that PoseNet still works well on low-res, black-and-white, and vintage photography.
Shareability: Because everything can run in the browser, TensorFlow.js PoseNet experiments can also be shared in the browser super-easily. No need to make operating-system-specific builds — just upload your webpage and go.
Privacy: Because all of the pose estimation can be done in the browser, none of your image data ever has to leave your computer. Rather than sending your photos to some server in the sky to do pose analysis on a centralized service (for example, a vision API that you may not control or that may fail), you can do all the pose estimation on your device, controlling exactly where your image goes. With Move Mirror, we match the (x,y) joint data that PoseNet spits out with our bank of poses on our backend, but your image stays entirely on your computer.
Okay, enough tech talk: let’s talk design!
Design and Inspiration
We spent a few weeks just goofing around with different pose estimation prototypes. For those of us who came from C++ or Kinect-hacking, just seeing our skeleton reflected back to us in our browser, using our webcam, was a pretty amazing demo on its own. We played with trails, puppets, and all sorts of other silly things before we landed on the concept that would become Move Mirror.
It probably isn’t surprising to hear that a lot of us here in the Google Creative Lab are interested in search and exploration. In talking about what we could do with pose estimation, we were tickled by the idea of being able to search an archive by pose. What if you could strike a pose and get a result that was the dance move you were doing? Or — maybe even funnier — what if you struck a pose and got a result that was the same, but totally out of context for what you were doing? How could we find weird, serendipitous connections across the breadth of human movement — from martial arts to cooking to skiing to babies taking their first steps? How might that surprise us, delight us, and make us laugh?
We took inspiration from projects like Land Lines (in which gestural data is used to explore similar lines in Google Earth) and the Cooper Hewitt Gesture Match (an on-site installation that uses pose-matching to suggest items from the archive). Aesthetically, however, we were drawn in a much faster, more real-time direction. We loved the idea of having a constant stream of images respond to your movements, blurring folks from all walks of life together, connected by your movement. Inspired by rotoscoping and timelapse photography, as used in The Johnny Cash Project, and by the trend of selfie timelapses on YouTube, we decided to lean hard on the gas pedal and attack real-time responsive pose matching in the browser, a complex problem in itself.
Gif of The Johnny Cash Project, in which more than 250,000 people individually drew frames for “Ain’t No Grave” to make a crowdsourced music video.
Building Move Mirror
Although PoseNet took care of the pose estimation for us, we still had plenty of things to figure out. The core experience is all about finding matching images to user poses, so that if you stand straight with your right arm raised up, Move Mirror finds an image where someone is standing with their right arm raised up. For that we needed three components: an image dataset, a search technique for that dataset, and a pose matching algorithm. Let’s break it down and look at each piece.
Building a dataset: searching for diversity
To create a useful dataset, we had to search for images that collectively covered a huge variety of human movement. There was no point in having 400 images of a person standing with a raised right arm if other poses were not represented in the dataset. To keep the experience consistent, we also decided we’d focus on finding only full-body images. In the end, we licensed a set of videos we thought represented not just a variety of movement, but also a diverse set of body types, skin tones, cultures, and physical abilities. We split these videos into about 80,000 still frames, then processed each image with PoseNet and stored the associated pose data. Next, let’s talk about the hard parts: pose matching and search.
We ran thousands of images through PoseNet. You’ll notice that not all images are parsed correctly, so we discarded a few to end up with a dataset of about 80,000 images.
Pose matching: the challenge of defining similarity
For Move Mirror to work, we first had to figure out how to define a ‘match’. A match is the image we return, based on the pose data we receive, when a user strikes a pose. When we talk about the ‘pose data’ coming out of PoseNet, we’re referring to a set of 17 body or face parts, such as an elbow or a left eye, that are called “keypoints”. PoseNet returns the x and y position of each keypoint in relation to the input image, plus an associated confidence score (more on this later).
PoseNet detects 17 pose keypoints on the face and body. Each keypoint has three important pieces of data: an x and a y position (together representing the pixel location in the input image where PoseNet found that keypoint) and a confidence score (how confident PoseNet is that it got that guess right).
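To make the shape of that data concrete, here is a minimal TypeScript sketch of a single keypoint and the list of 17 parts. The field names (`part`, `position`, `score`) and the part names follow the output of the TensorFlow.js PoseNet library as commonly documented; treat them as assumptions and check them against the version you use. The helper `isCompletePose` and the sample values are hypothetical, added only for illustration.

```typescript
// The 17 parts PoseNet reports (names and order assumed from the library's docs).
const PART_NAMES = [
  "nose", "leftEye", "rightEye", "leftEar", "rightEar",
  "leftShoulder", "rightShoulder", "leftElbow", "rightElbow",
  "leftWrist", "rightWrist", "leftHip", "rightHip",
  "leftKnee", "rightKnee", "leftAnkle", "rightAnkle",
] as const;

type PartName = (typeof PART_NAMES)[number];

interface Keypoint {
  part: PartName;
  position: { x: number; y: number }; // pixel location in the input image
  score: number; // confidence score, expected to lie between 0 and 1
}

// Illustrative values only, not real PoseNet output.
const sampleKeypoint: Keypoint = {
  part: "leftElbow",
  position: { x: 212.4, y: 318.9 },
  score: 0.87,
};

// Hypothetical helper: true when every expected part appears at least once.
function isCompletePose(keypoints: Keypoint[]): boolean {
  return PART_NAMES.every((partName) =>
    keypoints.some((keypoint) => keypoint.part === partName)
  );
}
```

A pose that is missing parts, or that carries very low scores, is a natural candidate for special handling before any matching is attempted.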
Deciding what ‘similarity’ meant became our first hurdle. How should we decide how similar a set of 17 keypoints from a user is to a set of 17 keypoints from an image in our dataset? We tried a few different measures for similarity and settled on two that seemed to work well: cosine similarity and a weighted match taking into account keypoint confidence scores.
Matching strategy #1: cosine distance
If we were to convert each set of 17 keypoints into a vector and plot all of them in high-dimensional space, our task of finding the two most similar poses would translate into finding the closest two vectors in this high-dimensional space. This is exactly what cosine distance allows us to do.
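As a rough illustration of that conversion, the sketch below flattens 17 (x, y) positions into a single vector of 34 numbers. The name `poseToVector` and the x-then-y ordering are assumptions made for this example, not necessarily what Move Mirror does internally.

```typescript
interface Point {
  x: number;
  y: number;
}

// Hypothetical helper: flattens keypoint positions into [x0, y0, x1, y1, ...].
function poseToVector(points: Point[]): number[] {
  const flattened: number[] = [];
  for (const point of points) {
    flattened.push(point.x, point.y);
  }
  return flattened;
}

// Build 17 made-up positions; real values would come from PoseNet.
const samplePoints: Point[] = [];
for (let i = 0; i < 17; i++) {
  samplePoints.push({ x: i, y: i * 2 });
}

// 17 keypoints with 2 coordinates each give a 34-dimensional vector.
const sampleVector = poseToVector(samplePoints);
console.log(sampleVector.length);
```

Whatever ordering you choose, it has to be the same for every pose, so that position i in one vector always refers to the same coordinate of the same body part in another.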
Cosine similarity is a measure of similarity between two vectors: basically, it measures the angle between them and returns -1 if they point in exactly opposite directions and 1 if they point in exactly the same direction. Importantly, it’s a measure of orientation and not magnitude.
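For readers who want to see the measure spelled out, here is a minimal sketch of cosine similarity over two plain number arrays: the dot product divided by the product of the two vector lengths. The name `cosineSimilarity` and the decision to throw on mismatched or all-zero vectors are choices made for this example.

```typescript
// Cosine similarity: dot(a, b) / (|a| * |b|), a value between -1 and 1.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("Vectors must have the same number of dimensions");
  }
  let dot = 0;
  let squaredNormA = 0;
  let squaredNormB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    squaredNormA += a[i] * a[i];
    squaredNormB += b[i] * b[i];
  }
  if (squaredNormA === 0 || squaredNormB === 0) {
    // The angle to an all-zero vector is undefined.
    throw new Error("Cosine similarity is undefined for a zero vector");
  }
  return dot / (Math.sqrt(squaredNormA) * Math.sqrt(squaredNormB));
}

// Same direction, different magnitude: the result is 1, up to rounding error.
const sameDirection = cosineSimilarity([1, 2], [2, 4]);
// Exactly opposite directions: the result is -1.
const oppositeDirection = cosineSimilarity([1, 0], [-1, 0]);
// Perpendicular vectors: the result is 0.
const perpendicular = cosineSimilarity([1, 0], [0, 1]);
```

The first example is the point about orientation versus magnitude: doubling every coordinate of a vector leaves its similarity to the original unchanged.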