Research: Using Kinect Depth data for Multi Touch

Posted on May 23, 2011


Main issues with TouchLib and similar libraries

TouchLib and similar libraries for multi-touch solutions use camera input to identify where the user touches a table. The blobs are either shadows cast on the surface from lighting above, or reflections from infrared sources below the table.

The big benefit of TouchLib is that it is relatively cheap. All you need to build a touch table is a (modified) webcam, some software and a computer.

The main issue with TouchLib is this:

  1. It cannot distinguish artifacts from your fingers – Where artifacts can be light sources or shadows appearing as blobs.
As all information processed by TouchLib is two-dimensional, it cannot tell whether a blob is cast by fingers touching the surface or by something else in the room.

Base Assumption

Using the 3D info from Microsoft Kinect, it should be possible to build a reliable Multi-touch table, using 3D points as input.


Blobs are items in 2D or 3D space.

Bubble boy

This Processing demo uses the Kinect Point Cloud to present bubbles, representing the person in front of the camera. That code will be my starting point.

IR dots

The Kinect projects a pattern of Infrared (IR) dots into the space it is in and measures them on a 640 x 480 grid. These dots hit objects in that space, and the reflections are recorded and translated by the Kinect device. The result is called a “point cloud”.

Retrieving fingers from Point clouds

Point clouds are “clouds of points” representing the locations where the IR dots are reflected by objects in space.
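As a hedged sketch of how such a cloud comes about: a depth image can be turned into 3D points with a standard pinhole projection. The intrinsic values below are illustrative placeholders, not calibrated Kinect values.

```python
# Illustrative intrinsics for a depth camera (assumed values, not calibrated).
FX = FY = 594.0          # focal length in pixels
CX, CY = 320.0, 240.0    # principal point (image centre)

def depth_to_point(u, v, depth_mm):
    """Project one depth pixel (u, v), with depth in millimetres,
    into a 3D point (x, y, z) in camera space, in metres."""
    z = depth_mm / 1000.0
    x = (u - CX) * z / FX
    y = (v - CY) * z / FY
    return (x, y, z)

def depth_image_to_cloud(depth, width, height):
    """Turn a flat list of depth readings into a point cloud,
    skipping pixels with no reading (depth == 0)."""
    cloud = []
    for v in range(height):
        for u in range(width):
            d = depth[v * width + u]
            if d > 0:
                cloud.append(depth_to_point(u, v, d))
    return cloud
```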

You can start finding fingers “pressing a table” using those Point clouds.

This is done by two basic steps:

  1. Optimizing scanning to reduce overhead – You do not want to scan all points in the Point Cloud. You only want the points that matter.
  2. Only reacting to “fingers” after a certain threshold – As “fingers” can be any pointy item in a room, we only want to measure them within a specific proximity to our table.
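The threshold step could look something like this minimal sketch, assuming a known table distance and a hypothetical 2 cm touch band; both constants are assumptions for illustration:

```python
# Assumed setup: camera looks at the table from a fixed, known distance.
TABLE_Z = 1.20            # table distance from the camera, in metres (assumed)
TOUCH_THRESHOLD = 0.02    # a point within 2 cm of the surface counts as a touch

def touching_points(cloud):
    """Filter a point cloud (x, y, z tuples) down to the points close
    enough to the table plane to be treated as finger contacts."""
    return [p for p in cloud
            if 0.0 < TABLE_Z - p[2] <= TOUCH_THRESHOLD]
```

Points beyond the table plane are treated as noise and dropped along with everything too far above the surface.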

Reducing overhead by using heat maps and proximity grids

I want the library to be fast. This means that it will skip as many steps as possible when processing the 3D data from the Kinect camera.

The fewer 3D points I have to scan to find out where fingers are in the 3D space, the faster the process will be and the fewer resources it will take from the computer, leaving room for other applications to do their work.

This is why I follow the process described below.

The first scan will divide the Kinect Point Cloud into a grid of 5 x 4 cells, each covering roughly 100 pixels on a side. In each cell, one point will be measured.

Based on the proximity of the point, a proximity “heat map” is created. The closer a point, the “hotter” it is.

The “hottest” points are representing objects in closest proximity – which are most interesting for us.

Each “hot” cell from the first scan will be divided into a new grid, where we sample in more detail what is going on, resulting in the “heat map detail”.

Overhead is reduced by:

  1. Creating a rough overview – By only scanning a limited set of points we can get a quick indication where objects might be. As hands and bodies are usually big enough to cover multiple 3D points, we can identify them easily enough
  2. Skipping the “cells” we do not need – For instance because objects are outside our proximity range
  3. Only zooming into the most interesting parts – By taking the “hottest” points from the initial scan
  4. Repeating steps 2 and 3 until we are satisfied – until we have the amount of detail we need to be sure
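The coarse-to-fine steps above can be sketched as follows. The cell size, range limit and `sample` helper are illustrative assumptions, not part of any existing library:

```python
# `depth` is a dict mapping (x, y) pixel coordinates to distance in metres;
# missing pixels read as infinitely far away.
CELL = 100          # coarse cell size in pixels (assumed)
MAX_RANGE = 1.5     # ignore anything farther than this, in metres (assumed)

def sample(depth, x, y):
    return depth.get((x, y), float("inf"))

def coarse_scan(depth, width, height):
    """First pass: one sample per coarse cell, building the heat map.
    The closer the sampled point, the 'hotter' the cell."""
    heat = {}
    for cy in range(0, height, CELL):
        for cx in range(0, width, CELL):
            d = sample(depth, cx + CELL // 2, cy + CELL // 2)
            if d <= MAX_RANGE:            # skip cells outside our range
                heat[(cx, cy)] = d
    return heat

def refine(depth, heat, step=10):
    """Second pass: re-sample only the hot cells on a finer grid,
    producing the 'heat map detail'."""
    detail = {}
    for (cx, cy) in heat:
        for y in range(cy, cy + CELL, step):
            for x in range(cx, cx + CELL, step):
                d = sample(depth, x, y)
                if d <= MAX_RANGE:
                    detail[(x, y)] = d
    return detail
```

Cold cells never reach `refine`, which is where the overhead reduction comes from.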

Refining the scans by prediction

Once we know where a body is and where body parts (hands) are, we can focus our scanning grids on where the hand might be in the next frame we scan.
  1. Only a limited number of scans is needed to pinpoint the object – From very coarse scans we can find the reflected dots from body parts. By following these parts we can limit the scans to those areas and find how they stick out towards the camera
  2. Increased reliability when we proceed – As the first scan is based on a rough matrix, we might miss a finger in cell “A1” which is part of the hand that is measured in cell “A2”. Once we know where the hand is in “A2” we can project a field around the center of that “hand” and scan for limbs which might be in the surrounding area
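A minimal sketch of this prediction, assuming simple linear motion between frames and an illustrative window size:

```python
# Hypothetical window half-size, in pixels (assumed value).
WINDOW = 40

def predict_next(prev, curr):
    """Linear prediction: assume the blob keeps its current velocity,
    so the next position extrapolates the last two frames."""
    return (2 * curr[0] - prev[0], 2 * curr[1] - prev[1])

def scan_window(centre, width, height):
    """Clamp a WINDOW-sized region around the predicted centre to the
    image bounds; only this region is scanned in the next frame."""
    x, y = centre
    x0, y0 = max(0, x - WINDOW), max(0, y - WINDOW)
    x1, y1 = min(width, x + WINDOW), min(height, y + WINDOW)
    return (x0, y0, x1, y1)
```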

Calibrating the table

Calibrating the table will be done in the same way as it is for touch screens: by touching reference points. Both the 2D and 3D positions of the blobs representing the fingers will be used to map the touch points from camera coordinates to screen coordinates.
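As an illustration only, a two-point calibration (assuming the mapping is a simple linear scale and offset per axis, which ignores lens distortion) could be sketched like this:

```python
def calibrate(cam_a, cam_b, screen_a, screen_b):
    """Fit a per-axis scale and offset from two touched reference points,
    so that cam_a maps to screen_a and cam_b maps to screen_b.
    Returns a function from camera (x, y) to screen (x, y)."""
    sx = (screen_b[0] - screen_a[0]) / (cam_b[0] - cam_a[0])
    sy = (screen_b[1] - screen_a[1]) / (cam_b[1] - cam_a[1])
    ox = screen_a[0] - sx * cam_a[0]
    oy = screen_a[1] - sy * cam_a[1]
    return lambda p: (sx * p[0] + ox, sy * p[1] + oy)
```

Touching opposite corners of the projected image gives the two reference pairs; more reference points would allow a least-squares fit instead.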

No skeletons

It is possible to create a very sophisticated “Minority Report” interface using the Kinect, some libraries and some programming smarts. This is – however – beyond our needs.
For a multi-touch table we only need to track “blobs” which are:
  1. Fingers
  2. Objects

Moving it further: Using blobs and proximity for a simplified Minority Report interface

To create a simplified “Minority Report” type of interface, you need two things:
  1. Blobs and their relationship to each other – Are they close to each other? What is the angle or rotation between them?
  2. The distance of each blob related to the camera – Are they close or far away?
From this you can create a 2 hand, 4 finger “Minority Report” interface where the fingers and the rotation of the hand define:
  1. “Contact” – Are we doing a “mouse over” or really “grabbing” an object?
  2. “Speed” – By “dialling” the hand we change the rotation of the two fingers on each hand. The bigger the rotation of the hand, the bigger the speed
  3. “Direction” – Dialling left will move objects to the left, dialling right will move them to the right. Dialling can also induce zooming in / out.
  4. “Pressure” – Moving your hands closer and further away from the camera can increase and decrease “pressure” on an object, or can lead to tilting of an object.
  5. Moving objects around – Once you “grab” objects with the fingers and move the hand, you can move the object
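Sketching the hand parameters above, assuming each hand is reduced to two finger blobs with (x, y, z) coordinates and a hypothetical 8 cm grab threshold:

```python
import math

# Assumed value: fingers closer than 8 cm count as "grabbing".
GRAB_DISTANCE = 0.08

def hand_gesture(finger_a, finger_b):
    """Reduce one hand's two finger blobs to (grabbing, angle, depth):
    the finger spread drives "contact", the angle of the line between
    the fingers drives "speed"/"direction", and the average distance
    to the camera drives "pressure"."""
    dx = finger_b[0] - finger_a[0]
    dy = finger_b[1] - finger_a[1]
    spread = math.hypot(dx, dy)
    angle = math.degrees(math.atan2(dy, dx))    # rotation of the "dial"
    depth = (finger_a[2] + finger_b[2]) / 2.0
    return (spread < GRAB_DISTANCE, angle, depth)
```

Tracking the angle over successive frames then gives the dialling speed and direction, and tracking the depth gives the change in pressure.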