Human Figure Tracking Using a Cascade-Based Architecture
Georgia Institute of Technology
Advisor: Dr. James Rehg
January 2004 - May 2005
Human figure tracking is an important open problem in computer vision because of its wide variety of applications. These range from security (surveillance and reconnaissance) to historical preservation and exploration to creative work, such as re-creating real-life objects on the computer and building computer models of human motion (for example, in a sports game or a dance sequence). Figure tracking refers to the ability to identify a person in an image or video, follow that person, and identify his or her actions throughout subsequent frames. A further goal of this project is to eventually describe a video's content automatically, making it easy to index and search large databases of video data quickly and efficiently, much as Google allows us to do today for text data.
As a first step towards this goal, my research at Georgia Tech is focused on building an effective detector for a human arm. Although a variety of limb detectors currently exist, they are of rather poor quality: they frequently lose track of the limb they are following, and they are slow. We hope to build a much better detector based on principles developed in [1], [2], and [3]. This cascade-based architecture offers several advantages over previous techniques. It is very fast (face detection implemented using this system can be performed at real-time speeds). We also hope it will be more robust, because it relies not on kinematic models (which suffer from the problems mentioned above) but only on image data from pairs of frames.
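The speed of a cascade comes from rejecting most candidate windows after only a few cheap tests. The following is a minimal sketch of that idea, not the detector built in this project: it computes rectangle features from an integral image (as in the Viola-Jones line of work cited above) and stops at the first stage whose score falls below its threshold. The function names, the two example stages, and their thresholds are all hypothetical and chosen only for illustration.

```python
import numpy as np

def integral_image(img):
    """Summed-area table with a zero row and column prepended."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.float64)
    ii[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
    return ii

def rect_sum(ii, top, left, height, width):
    """Sum of the pixels in a rectangle, using four table lookups."""
    bottom, right = top + height, left + width
    return ii[bottom, right] - ii[top, right] - ii[bottom, left] + ii[top, left]

def two_rect_feature(ii, top, left, height, width):
    """Left half minus right half of a rectangle (an illustrative feature)."""
    half = width // 2
    return (rect_sum(ii, top, left, height, half)
            - rect_sum(ii, top, left + half, height, half))

def run_cascade(window, stages):
    """Return (accepted, stages_evaluated).

    Each stage is a (score_fn, threshold) pair; a window is rejected at the
    first stage whose score is below the threshold.
    """
    ii = integral_image(window)
    for i, (score_fn, threshold) in enumerate(stages, start=1):
        if score_fn(ii) < threshold:
            return False, i  # rejected early; later stages are never run
    return True, len(stages)

# Hypothetical two-stage cascade for a 4x4 window (thresholds are made up).
example_stages = [
    (lambda ii: two_rect_feature(ii, 0, 0, 4, 4), 4.0),  # bright-left edge
    (lambda ii: rect_sum(ii, 0, 0, 4, 4), 6.0),          # enough total brightness
]

edge_window = np.zeros((4, 4))
edge_window[:, :2] = 1.0          # left half bright, right half dark
flat_window = np.zeros((4, 4))    # uniform window, no edge

print(run_cascade(edge_window, example_stages))
print(run_cascade(flat_window, example_stages))
```

In a trained cascade the stages would be boosted combinations of many such features, with thresholds learned from the labelled data; the early-exit loop is the part that keeps the average cost per window low.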
My role in this project has been to generate training data to improve the quality of our detector (since this is a supervised learning problem, the quality of the results is directly dependent on the quality of our labelled input data). There are two main aspects to this: generating positive, labelled data (i.e. data that contains an arm, with the arm labelled in some way), and generating negative data (i.e. data that does not contain an arm).
The negative data is much easier to generate because it does not have to be labelled. However, the requirement that it not contain human arms makes it somewhat challenging to find in large amounts, since most videos focus on humans and thus contain arms. Therefore, I first obtained data from various video sources that were largely free of humans, and removed any remaining human portions by hand.
The much more challenging problem is labelling data containing arms. My first approach consisted of generating this data using a 3D animation software package named Poser. This software is tailored specifically towards modelling and animating human figures, and thus made it easy to create a variety of videos of different people walking in different ways. By creating a virtual camera that was "attached" to the arm, I could produce videos in which the arm stayed in the same position in every frame while the rest of the body and the background moved behind it. In this way, the data was easily labelled, and we obtained some promising results using this method.
However, a major problem we have encountered with this approach is that the positive training data does not accurately represent the test data we will face, since the training data looks "animated" and the test data will consist of real people. Therefore, there is a strong need to obtain real data and label it automatically (because manually labelling video data frame-by-frame is a very time-consuming and labor-intensive process). To do this, we are making use of a motion-capture (mocap) lab that Georgia Tech has access to. The basic idea is to attach motion capture balls to an actor, who then moves in various ways while being recorded by a video camera as well as the mocap system. Once the video data and the mocap data have been synchronized and calibrated, the locations of the markers on the person can be used to automatically label the portion of the arm that we wish to use for training data. This method, once working, would allow easy creation of custom training data for the limb detector, enabling us to quickly improve the detector by recording training videos that target whatever aspects we wish to teach it (e.g. different backgrounds, different people, different clothing, different actions).
The first step of this process is to synchronize (i.e. align in time) and calibrate (i.e. align in space) the video and mocap datastreams. We follow an approach used in [4] to perform these two alignments. From this, we are able to quickly obtain the time offsets and camera transformations needed to project mocap data onto the video stream.
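Once a time offset and a camera transformation are known, labelling reduces to picking the mocap sample that matches each video frame and projecting its markers into the image. The sketch below illustrates that step under stated assumptions: a standard 3x4 pinhole projection matrix `P`, constant frame rates, and a simple padded bounding box around the projected arm markers. These names, the frame rates, and the box margin are hypothetical, and this is not necessarily how the alignment in [4] is computed.

```python
import numpy as np

def project_points(P, points_3d):
    """Project Nx3 world points to Nx2 pixel coordinates with a 3x4 camera matrix."""
    homogeneous = np.hstack([points_3d, np.ones((len(points_3d), 1))])
    projected = homogeneous @ P.T
    return projected[:, :2] / projected[:, 2:3]  # divide by depth

def mocap_frame_for_video(video_frame, video_fps, mocap_fps, offset_s):
    """Index of the mocap sample closest in time to a given video frame.

    offset_s is the (assumed known) time in seconds by which the mocap
    stream leads the video stream.
    """
    t = video_frame / video_fps + offset_s
    return int(round(t * mocap_fps))

def arm_box(P, markers_3d, margin=5.0):
    """Axis-aligned box (x_min, y_min, x_max, y_max) around projected markers."""
    pixels = project_points(P, markers_3d)
    x_min, y_min = pixels.min(axis=0) - margin
    x_max, y_max = pixels.max(axis=0) + margin
    return float(x_min), float(y_min), float(x_max), float(y_max)

# Illustrative camera: focal length 100 px, principal point (50, 50),
# located at the world origin and looking down the +z axis.
K = np.array([[100.0, 0.0, 50.0],
              [0.0, 100.0, 50.0],
              [0.0, 0.0, 1.0]])
P = K @ np.hstack([np.eye(3), np.zeros((3, 1))])

# Two made-up arm markers, 2 units in front of the camera.
markers = np.array([[0.0, 0.0, 2.0],
                    [1.0, 0.0, 2.0]])

print(project_points(P, markers))
print(arm_box(P, markers))
print(mocap_frame_for_video(30, video_fps=30.0, mocap_fps=120.0, offset_s=0.5))
```

A real labelling pipeline would also need to handle lens distortion and markers hidden from the video camera, which this sketch ignores.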
Currently we are developing a hardware solution to the synchronization problem: sending a common clock signal to the mocap cameras and the video cameras so that they all capture data at the same time. This should eliminate the need to synchronize the streams afterwards, allowing us to focus on gathering data and improving our detector. In addition, we hope to eventually have several cameras in the mocap lab available for general use (i.e. for recording synchronized and calibrated video and mocap data).
Images:
Some examples of the simple features used in building complete classifiers. Notice that they are extremely simple shapes, which can be computed very quickly.
An example of how the different features match a given face. A complete classifier can be built by using several of these features together.
The cascade architecture is demonstrated in this illustration. At each node, if the test input image does not match the classifier, it is rejected outright. Thus, for most input windows, only a minimum of computation has to be performed, at the early stages.
A simple image pair demonstrating the new kinds of input windows used when evaluating a video (i.e. a sequential pair of frames, their difference, and their difference when the second image is shifted in each of the four directions).
Sample training image for the pedestrian detection system given in [2]. The yellow boxes were labelled manually.
Another training image.
Another training image.
Another training image.
A sample result of the detector. The white box was generated by the detector.
Another result.
Another result.
Another result.
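The frame-pair inputs mentioned in the captions above (a pair of frames, their difference, and the differences after shifting the second frame in each of the four directions) can be sketched as follows. This is an illustrative reading of the motion inputs described in [2], not code from this project; the one-pixel shift, the edge-replicating border handling, and the function names are assumptions.

```python
import numpy as np

def shift(img, dy, dx):
    """Shift image content by (dy, dx) pixels, replicating edge pixels.

    dy and dx must each be -1, 0, or 1 because only one pixel of padding
    is added on every side.
    """
    padded = np.pad(img, 1, mode="edge")
    h, w = img.shape
    return padded[1 - dy:1 - dy + h, 1 - dx:1 - dx + w]

def motion_images(frame_a, frame_b):
    """Absolute differences between frame_a and (shifted copies of) frame_b."""
    a = frame_a.astype(np.float64)
    b = frame_b.astype(np.float64)
    return {
        "delta": np.abs(a - b),
        "up": np.abs(a - shift(b, -1, 0)),
        "down": np.abs(a - shift(b, 1, 0)),
        "left": np.abs(a - shift(b, 0, -1)),
        "right": np.abs(a - shift(b, 0, 1)),
    }

# Toy example: a single bright pixel that moves one pixel to the right.
frame_a = np.zeros((5, 5))
frame_a[2, 2] = 1.0
frame_b = np.zeros((5, 5))
frame_b[2, 3] = 1.0

images = motion_images(frame_a, frame_b)
for name, image in images.items():
    print(name, image.sum())
```

In this toy case the "left" image is all zeros, because shifting the second frame one pixel to the left undoes the rightward motion; comparing the shifted differences in this way gives the detector a cheap cue about the direction of motion.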
References:
[1] P. Viola and M. Jones. Fast and robust classification using asymmetric AdaBoost and a detector cascade. Advances in Neural Information Processing Systems 14 (NIPS*2001), MIT Press, 2002.
[2] P. Viola, M. Jones, and D. Snow. Detecting pedestrians using patterns of motion and appearance. Proceedings of the International Conference on Computer Vision, 2003.
[3] J. Wu, J. M. Rehg, and M. D. Mullin. Learning a rare event detection cascade by direct feature selection. Advances in Neural Information Processing Systems 16 (NIPS*2003), MIT Press, 2004.
[4] P. Sand, L. McMillan, and J. Popovic. Continuous capture of skin deformation. ACM Transactions on Graphics, Vol. 22, No. 3 (Proceedings of SIGGRAPH 2003, San Diego, CA, July 27-31), 2003.