Recently, I had the opportunity to put in a few extra hours working on one of our projects here at Clarity that leverages Microsoft’s Kinect for Windows device. I own an Xbox 360 and an Xbox One, so I’m very familiar with the capabilities (and limitations) of Kinect. At home, I use Kinect to drive my entire entertainment system by voice. The only instance where I have to use a remote is while using Windows Media Center on my Xbox 360.
With that said, I have never had the pleasure of programming against one. All of the projects I have been involved with previously (both at Clarity and elsewhere) have been server-based web applications. So I was thrilled to have the opportunity to leverage what I consider to be one of the coolest pieces of consumer-grade tech ever released.
What Were You Doing?
Part of this project uses the Kinect to identify a variety of gestures like jumping, spinning, or shaking. The Kinect’s skeleton tracking makes identifying where the primary user is at any given moment a breeze. The trouble is that gestures are not point-in-time positions but a series of movements performed over a short time interval. So it’s necessary to keep track of a series of skeletons over the course of the gesture and verify that each part of the gesture happens in the right order.
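The idea of tracking a series of skeletons can be sketched as a rolling buffer that keeps only the frames inside a fixed time window. This is an illustrative Python sketch, not the Kinect SDK: the skeleton here is a hypothetical dict of joint names to (x, y, z) tuples, and the window length is an assumption.

```python
from collections import deque
import time


class SkeletonHistory:
    """Rolling buffer of skeleton frames covering a fixed time window.

    `skeleton` is a hypothetical dict mapping joint names to (x, y, z)
    tuples; the real Kinect SDK exposes a richer type, but the buffering
    idea is the same.
    """

    def __init__(self, window_seconds=3.0):
        self.window_seconds = window_seconds
        self.frames = deque()  # (timestamp, skeleton) pairs, oldest first

    def add(self, skeleton, timestamp=None):
        timestamp = time.monotonic() if timestamp is None else timestamp
        self.frames.append((timestamp, skeleton))
        # Drop frames older than the window so memory stays bounded.
        cutoff = timestamp - self.window_seconds
        while self.frames and self.frames[0][0] < cutoff:
            self.frames.popleft()
```

Gesture checks then run over `self.frames` each time a new frame arrives, so every decision sees the whole recent history rather than a single pose.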
I was working in parallel with a counterpart in Croatia, each of us on different gestures. We both started at the same time and didn’t share code until after we were done. As a result, we used very different strategies to track a gesture. My counterpart broke each gesture into component subgestures. For jumping, that would be standing, the body moving upward, and then the body falling. Putting those three subgestures together corresponds to a single “jump” gesture.
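The subgesture approach amounts to a small state machine fed one frame at a time. A minimal sketch, assuming a hypothetical hip-joint height in meters per frame and made-up thresholds (none of this comes from the SDK):

```python
def make_jump_detector(rise_threshold=0.05, fall_threshold=0.05):
    """State machine over hip height: standing -> rising -> falling -> jump.

    `hip_y` is the vertical coordinate of a hypothetical hip-center joint;
    the thresholds are illustrative guesses, not SDK values.
    """
    state = {"phase": "standing", "last_y": None}

    def feed(hip_y):
        phase, last_y = state["phase"], state["last_y"]
        detected = False
        if last_y is not None:
            delta = hip_y - last_y
            if phase == "standing" and delta > rise_threshold:
                state["phase"] = "rising"       # body moving upward
            elif phase == "rising" and delta < -fall_threshold:
                state["phase"] = "falling"      # body falling back down
            elif phase == "falling" and abs(delta) < fall_threshold:
                detected = True                 # body settled: full jump seen
                state["phase"] = "standing"
        state["last_y"] = hip_y
        return detected

    return feed
```

Each subgesture only has to recognize its own slice of the motion, which keeps the individual checks simple.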
On the other hand, I captured the specific joint positions for each frame over a set window of time (3 seconds) and then analyzed the frames as a sequence to verify that the gesture was correct. This was especially useful for detecting spinning, because each frame depends on the frame before it.
The Kinect is not without its limitations, though. Since it’s a set of 2D cameras attempting to imitate 3D, it has difficulty detecting which way you’re facing (toward or away from the sensor). For a gesture like spinning this is problematic: the moment you’re facing away from the sensor, your skeleton gets mirrored (i.e. your left shoulder is detected as your right shoulder). This is a limitation of the v1 detection software, which doesn’t distinguish faces as part of skeleton tracking.
How Easy Was It?
Leveraging the Kinect to create gestures was surprisingly simple and straightforward for such a complex piece of technology. You get the set of skeletons for a given frame and then analyze the positions of the desired joints in each skeleton to see whether they meet the gesture requirements. All of this is outlined in the sample programs, so I was able to go from nothing to successfully detecting gestures in a matter of hours.
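A per-frame check of that kind can be as small as comparing a couple of joint positions. A minimal sketch, again using a hypothetical joint-name-to-(x, y, z) mapping rather than the SDK's actual skeleton type:

```python
def hands_above_head(skeleton):
    """Per-frame gesture check: both hands raised above the head.

    `skeleton` is a hypothetical dict of joint name -> (x, y, z);
    index 1 is the vertical (y) coordinate.
    """
    head_y = skeleton["Head"][1]
    return (skeleton["HandLeft"][1] > head_y and
            skeleton["HandRight"][1] > head_y)
```

Running a check like this on every incoming frame, and combining the results over time, is all a basic gesture detector really is.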
It was refreshing to have such an efficient experience. A lot of esoteric technology requires significant configuration just to get it into a working state, and most of the clever or difficult use cases are not covered in the samples, requiring a significant amount of trial and error or searching for solutions. I really hope to use the Kinect again in future project work.