Specious Logic

Neeraj Kumar's Personal Website

Research

Human Figure Tracking Using a Cascade-Based Architecture
Georgia Institute of Technology
Advisor: Dr. James Rehg
January 2004 - May 2005

Human figure tracking is an important open problem in computer vision because of its wide variety of applications. These range from security uses such as surveillance and reconnaissance, to historical uses such as preservation and exploration, to creative work such as re-creating real-life objects on the computer and building computer models of human motion (for example, in a sports game or a dance sequence). Figure tracking refers to the ability to identify a person in an image or video, follow that person, and identify his or her actions throughout subsequent frames. Another goal of this project is eventually to describe a video's content automatically, making it easy to index and search large databases of video data quickly and efficiently, much as Google allows us to do today for text data.

As a first step towards this goal, my research at Georgia Tech is focused on building an extremely effective detector for a human arm. Although a variety of limb detectors currently exist, they perform rather poorly: they frequently lose track of the limb they are following, and they are slow. We hope to build a much better detector based on principles developed in papers [1], [2], and [3]. This cascade-based architecture offers many advantages over previous techniques. It is very fast (face detection implemented using this system can be performed at real-time speeds), and we hope it will also be more robust because it relies not on kinematic models (which suffer from the problems mentioned above) but only on image data from pairs of frames.
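The speed of a cascade comes from rejecting most candidate windows after only a little work. The following is a minimal sketch of that idea, not our actual detector: the names `run_cascade` and `make_threshold_stage` are hypothetical, and each toy stage simply thresholds one value, where a real stage would be a trained classifier.

```python
from typing import Callable, List, Sequence

Window = Sequence[float]           # hypothetical: a flattened image window
Stage = Callable[[Window], bool]   # True means "may still contain the target"


def run_cascade(window: Window, stages: List[Stage]) -> bool:
    """Return True only if every stage accepts the window."""
    for stage in stages:
        if not stage(window):
            return False  # rejected early; later stages never run
    return True


def make_threshold_stage(index: int, threshold: float) -> Stage:
    """Toy stage (an assumption): accept when one value exceeds a threshold."""
    return lambda window: window[index] > threshold


stages = [
    make_threshold_stage(0, 0.2),
    make_threshold_stage(1, 0.5),
    make_threshold_stage(2, 0.8),
]

print(run_cascade([0.1, 0.9, 0.9], stages))  # False: rejected by the first stage
print(run_cascade([0.9, 0.9, 0.9], stages))  # True: passes every stage
```

Ordering the stages from cheapest to most expensive is what keeps the average cost per window low, since most windows in an image do not contain the target.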

My role in this project has been to generate training data to improve the quality of our detector. Since this is a supervised learning problem, the quality of the results depends directly on the quality of our labelled input data. There are two main aspects to this: generating positive, labelled data (i.e. data containing an arm, with the arm labelled in some way) and negative data (i.e. data that does not contain an arm).

The negative data is much easier to generate because it does not have to be labelled. However, the requirement that it not contain human arms makes it somewhat challenging to find large amounts of data, since most videos focus on humans and thus contain arms. Therefore, I first obtained data from various video sources that did not contain humans, removing any remaining human portions by hand.

The much more challenging problem is labelling data containing arms. My first approach consisted of generating this data using a 3D animation software package named Poser. This software is tailored specifically towards modelling and animating human figures, and thus allowed me to easily create a variety of videos of different people walking in different ways. By creating a virtual camera that was "attached" to the arm, I could create videos in which the arm stayed in the same position in each video frame while the rest of the body and the background moved behind it. In this way, the data was easily labelled, and we obtained some promising results using this method.

However, a major problem we've encountered with this approach is that the positive training data doesn't accurately represent the types of test data we will face, since the training data looks "animated" and the test data will consist of real people. Therefore, there is a strong need to get real data and label it automatically (because manually labelling video data frame-by-frame is a very time-consuming and labor-intensive process). To do this, we are making use of a motion-capture (mocap) lab that Georgia Tech has access to. The basic idea is to attach motion-capture balls to an actor who can then move in various ways, while being recorded by a video camera as well as the mocap system. Once a synchronization and calibration between the video data and the mocap data has been obtained, the locations of the markers on the person can be used to automatically label the portion of the arm that we wish to use for training data. This method, once working, would allow easy creation of custom training data for the limb detector, enabling us to quickly improve the detector by taking various training videos depending on what aspects we wish to teach the detector (e.g. different backgrounds, different people, different clothing, different actions, etc.).

The first step of this process is to obtain the synchronization (i.e. alignment in time) and calibration (i.e. alignment in space) of the video and mocap datastreams. We follow an approach used in [4] to perform these two alignments. From this, we are able to quickly obtain the necessary offsets and camera transformations to allow us to project mocap data onto the video stream.
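Once a time offset and a camera transformation are known, labelling a frame amounts to looking up the matching mocap frame and projecting its markers into the image. The sketch below illustrates this under stated assumptions: a standard pinhole model with a 3x4 camera matrix, and made-up frame rates, offset, and marker positions. The function names are hypothetical, and this is not the procedure from [4].

```python
from typing import List, Sequence, Tuple

Point2 = Tuple[float, float]
Point3 = Tuple[float, float, float]


def mocap_frame_for(video_frame: int, video_fps: float,
                    mocap_fps: float, offset_s: float) -> int:
    """Synchronization: map a video frame number to the nearest mocap frame."""
    t = video_frame / video_fps + offset_s  # time on the mocap clock, in seconds
    return round(t * mocap_fps)


def project(camera: Sequence[Sequence[float]], point: Point3) -> Point2:
    """Calibration: project a 3D marker through a 3x4 camera matrix."""
    homogeneous = (point[0], point[1], point[2], 1.0)
    u, v, w = (sum(c * x for c, x in zip(row, homogeneous)) for row in camera)
    return (u / w, v / w)


def bounding_box(points: List[Point2], margin: float = 0.0):
    """Label: the box (x_min, y_min, x_max, y_max) around the projected markers."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (min(xs) - margin, min(ys) - margin, max(xs) + margin, max(ys) + margin)


# Assumed camera: focal length 500 px, principal point (320, 240), at the origin.
CAMERA = [
    [500.0, 0.0, 320.0, 0.0],
    [0.0, 500.0, 240.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]
arm_markers = [(0.0, 0.0, 2.0), (0.25, 0.5, 2.0)]  # made-up marker positions

pixels = [project(CAMERA, m) for m in arm_markers]
print(mocap_frame_for(30, 30.0, 120.0, 0.5))  # 180
print(pixels)                                 # [(320.0, 240.0), (382.5, 365.0)]
print(bounding_box(pixels, margin=10.0))      # (310.0, 230.0, 392.5, 375.0)
```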

Currently we are developing a hardware solution to the syncing problem: sending a clock signal to both the mocap cameras and the video cameras so that they all capture data at the same time. This should completely eliminate the problem of synchronization, allowing us to focus on gathering data and improving our detector. In addition, we hope to eventually have several cameras in the mocap lab available for general use (i.e. for capturing synchronized and calibrated video and mocap data).

Images:


Some examples of the simple features used in building complete classifiers. Notice that they are extremely simple shapes, which can be computed very quickly.
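To illustrate why such features are cheap, here is a minimal sketch assuming they are rectangle features of the kind used in [1] and [2], evaluated with an integral image (summed-area table). The function names and the tiny test image are hypothetical.

```python
from typing import List

Image = List[List[float]]


def integral_image(img: Image) -> Image:
    """ii[y][x] holds the sum of img over all rows < y and columns < x."""
    h, w = len(img), len(img[0])
    ii = [[0.0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0.0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def rect_sum(ii: Image, top: int, left: int, height: int, width: int) -> float:
    """Sum of any rectangle from four table lookups, whatever its size."""
    bottom, right = top + height, left + width
    return ii[bottom][right] - ii[top][right] - ii[bottom][left] + ii[top][left]


def two_rect_feature(ii: Image, top: int, left: int, height: int, width: int) -> float:
    """Left half minus right half of a rectangle (width assumed even)."""
    half = width // 2
    return (rect_sum(ii, top, left, height, half)
            - rect_sum(ii, top, left + half, height, half))


img = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
]
ii = integral_image(img)
print(rect_sum(ii, 0, 0, 2, 4))          # 4.0
print(two_rect_feature(ii, 0, 0, 2, 4))  # 4.0 (bright left half, dark right half)
```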

An example of how the different features match to a given face. A complete classifier can be built by using several of these features together.

The cascade architecture is demonstrated in this illustration. At each node, if the test input image does not match the classifier, it is rejected outright. Thus, most input windows require only a minimum of computation, performed at the early stages.

A simple image pair demonstrating the new kinds of input windows used when evaluating a video (i.e. this is a sequential pair of frames, their difference, and their difference when the second image is shifted in each of the four directions).
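These five difference images can be computed as in the following minimal sketch. It assumes a one-pixel shift, absolute differences, and zero padding at the border; the function names and the tiny frames are made up for illustration.

```python
from typing import Dict, List

Image = List[List[int]]


def shift(img: Image, dy: int, dx: int, fill: int = 0) -> Image:
    """Shift an image by (dy, dx) pixels, padding uncovered pixels with `fill`."""
    h, w = len(img), len(img[0])
    out = [[fill] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            sy, sx = y - dy, x - dx
            if 0 <= sy < h and 0 <= sx < w:
                out[y][x] = img[sy][sx]
    return out


def abs_diff(a: Image, b: Image) -> Image:
    """Pixel-wise absolute difference of two equally sized images."""
    return [[abs(pa - pb) for pa, pb in zip(ra, rb)] for ra, rb in zip(a, b)]


def motion_windows(frame1: Image, frame2: Image) -> Dict[str, Image]:
    """Plain difference plus differences with frame2 shifted in four directions."""
    shifts = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
    windows = {"diff": abs_diff(frame1, frame2)}
    for name, (dy, dx) in shifts.items():
        windows[name] = abs_diff(frame1, shift(frame2, dy, dx))
    return windows


frame1 = [[0, 9, 0],
          [0, 0, 0]]
frame2 = [[0, 0, 9],
          [0, 0, 0]]  # the bright pixel has moved one step to the right

windows = motion_windows(frame1, frame2)
print(windows["diff"])  # [[0, 9, 9], [0, 0, 0]]
print(windows["left"])  # [[0, 0, 0], [0, 0, 0]]
```

In this toy case the "left" image is empty because shifting the second frame back by one pixel undoes the motion, which is the kind of cue that lets a detector tell direction of movement apart from mere change.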

Sample training image for the pedestrian detection system given in [2]. The yellow boxes were labelled manually.

Another training image.

Another training image.

Another training image.

A sample result of the detector. The white box was generated by the detector.

Another result.

Another result.

Another result.

References:

[1] P. Viola and M. Jones. Fast and robust classification using asymmetric AdaBoost and a detector cascade. Advances in Neural Information Processing Systems 14 (NIPS*2001), MIT Press, 2002.

[2] P. Viola, M. Jones, and D. Snow. Detecting pedestrians using patterns of motion and appearance. Proceedings of the International Conference on Computer Vision, 2003.

[3] J. Wu, J. M. Rehg, and M. D. Mullin. Learning a rare event detection cascade by direct feature selection. Advances in Neural Information Processing Systems 16 (NIPS*2003), MIT Press, 2004.

[4] P. Sand, L. McMillan, and J. Popovic. Continuous Capture of Skin Deformation. ACM Transactions on Graphics, Vol. 22, No. 3 (Proceedings of SIGGRAPH 2003, San Diego, CA, July 27-31), 2003.


Alignment Registration of Multiple Range Scans
University of North Carolina at Chapel Hill
Advisor: Dr. Lars Nyland (now at the Colorado School of Mines)
Summer 2002, Summer 2003

At UNC, I worked on a project that used a laser rangefinder device to capture and store large environments as computer models, in which my specific role consisted of creating a program to automatically align multiple overlapping sets of geometric data.

The models created by this project have been used for many different purposes, most prominently historical preservation and reconstruction (for forensic, artistic, or entertainment purposes). In the historical preservation category, one of the highlights was scanning Thomas Jefferson’s Monticello residence and putting the resulting computer model on display at the New Orleans Museum of Art for an exhibit detailing Jefferson’s life. This exhibit was very successful, as it allowed people to "virtually" look into his study in full photo-realistic detail, with view dependence in the exhibit (i.e. giving spectators a 3D-accurate representation of the room when looking at it from different angles).

Another showcase use was constructing and scanning a mock crime scene, complete with murder victim. The scene could then be analyzed in detail using the computer model. In addition, a motion tracking system allowed a user holding a tracking device to explore the body virtually from all angles.

Looking to the future, this setup can be used to create photo-realistic 3D environments for virtual reality simulators, video games, and motion pictures.

The DeltaSphere laser range finder we used (co-designed by Dr. Nyland) scans an area using a rotating laser and calculates the distance to each point using the time of travel. The same area is then photographed by a digital camera (placed at the same center of focus). This two-part scanning process is done from a variety of angles, to obtain complete geometric (3D) and photographic (2D) data of the entire scene. The processing phase then aligns the multiple range scans to each other, removes or blends duplicate information (such as textures for the same region as captured from different angles), maps the 2D pictures onto the 3D geometry, and finally simplifies the data to make it smaller and more manageable. Each step of this process is quite involved, but the alignment registration of the different 3D scans (the stage I worked on) is a particularly challenging problem because the presence of noisy data makes finding an accurate alignment match very difficult.
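As a rough illustration of how a time-of-travel measurement becomes a 3D point, the sketch below converts a round-trip time and two scan angles into Cartesian coordinates. The angle convention (azimuth around the vertical axis, elevation above the horizontal plane) and the function names are assumptions, not the DeltaSphere's actual data format.

```python
import math
from typing import Tuple

SPEED_OF_LIGHT = 299_792_458.0  # metres per second


def range_from_time_of_flight(round_trip_s: float) -> float:
    """The pulse travels out and back, so the distance is half the path length."""
    return SPEED_OF_LIGHT * round_trip_s / 2.0


def to_cartesian(r: float, azimuth: float, elevation: float) -> Tuple[float, float, float]:
    """Convert a range sample and its scan angles (radians) to an (x, y, z) point."""
    horizontal = r * math.cos(elevation)
    return (horizontal * math.cos(azimuth),
            horizontal * math.sin(azimuth),
            r * math.sin(elevation))


distance = range_from_time_of_flight(20e-9)  # a 20 ns round trip
print(round(distance, 3))                    # 2.998 (metres)
print(to_cartesian(2.0, 0.0, 0.0))           # (2.0, 0.0, 0.0)
```

Because light covers roughly 30 cm per nanosecond, even small timing errors translate into visible range noise, which is one reason the captured surfaces are not perfectly smooth.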

The program I wrote for this task expanded on the work of others in the field, particularly the Stanford University dissertation of Dr. Szymon Rusinkiewicz [5]. In this dissertation, Dr. Rusinkiewicz proposed a new method for registration that was ideally suited to our project because it exploited our knowledge of the camera location to more accurately sample the search space (i.e. in a radial manner as opposed to a linear one). I implemented this technique for aligning two scans, building it with a much more intuitive user interface than previous systems; later I expanded this framework to align multiple scans simultaneously. However, I found that even the relatively small amount of noise in the input data greatly hampered our ability to achieve a perfect match, as local optimizations threw off the global match, and attempts at direct global registration were thwarted by the large non-overlapping portions of the data. Work on this problem still continues.
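To give a feel for what "aligning two scans" means, here is a deliberately simplified sketch: a closed-form least-squares rigid fit in 2D, assuming the point correspondences are already known. It is not the method from [5] and not the program described above (which works on 3D data and must also find the correspondences); all names and the sample points are hypothetical.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def centroid(points: List[Point]) -> Point:
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def fit_rigid_2d(source: List[Point], target: List[Point]):
    """Least-squares rotation angle and translation mapping source onto target."""
    sc, tc = centroid(source), centroid(target)
    dot = cross = 0.0
    for (sx, sy), (tx, ty) in zip(source, target):
        ax, ay = sx - sc[0], sy - sc[1]  # source point relative to its centroid
        bx, by = tx - tc[0], ty - tc[1]  # target point relative to its centroid
        dot += ax * bx + ay * by
        cross += ax * by - ay * bx
    angle = math.atan2(cross, dot)
    c, s = math.cos(angle), math.sin(angle)
    translation = (tc[0] - (c * sc[0] - s * sc[1]),
                   tc[1] - (s * sc[0] + c * sc[1]))
    return angle, translation


def apply_rigid_2d(points: List[Point], angle: float, translation: Point) -> List[Point]:
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y + translation[0], s * x + c * y + translation[1])
            for x, y in points]


source = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
# The same points rotated by 90 degrees and moved by (2, 3).
target = [(2.0, 3.0), (2.0, 4.0), (1.0, 3.0)]

angle, translation = fit_rigid_2d(source, target)
print(round(math.degrees(angle), 6))        # approximately 90 degrees
print([round(t, 6) for t in translation])   # approximately [2.0, 3.0]
```

With noisy data and unknown correspondences, a fit like this is typically repeated inside an iterative loop, and it is that loop which can settle into the kind of local optimum mentioned above.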

Images:


This is a 360 degree view of the murder scene, as captured by the range scanner. Notice that it contains geometric data only (i.e. no color data), and that it is distorted due to the radial scanning pattern of the DeltaSphere. The blue areas are those where the laser sent out by the scanner did not return. These are usually glass or other highly reflective surfaces.

This is the interface to the scanning program, which shows a closeup from the scan. Notice that real-world distances can be found using the program. This data is originally in the form of a point cloud, but the program can be used to generate VRML files or other tessellated outputs.

This shows the geometric complexity of the captured scene. Notice that even flat walls are highly tessellated, due to slight errors in the measurement, which cause the flat wall to appear to have some local curvature. This is just one of the effects that have to be corrected in post-processing.

One (cleaned-up) example of the output of the scanning process. This is NOT simply a photo! Notice that there are still some problems, such as the warping of the floor, the somewhat conflicting lighting on the walls, the holes in the floor and on the lamps, etc.

This is a view of the murder scene from an angle which was NOT captured by the DeltaSphere. Here we can see just how much data is still missing from the complete room; this leads to what is currently an active research problem: the next best scan. The goal of this problem is to determine where to place the scanner next so as to get the most coverage of the missing areas.

Another view from an uncaptured angle. Notice the numerous artifacts throughout the room, suggesting the need for more post-processing work to clean up the data.


A view of Monticello, as captured by the DeltaSphere. This was later made into a stereogram by (art)n, which is now on display at UNC-CH.


A different view of Monticello, shown partially rendered in wireframe, to highlight the geometric complexity captured by the DeltaSphere.

A 360 degree scan showing a garage.

The garage as seen from above. The large circular hole appears because the DeltaSphere was placed directly in the center of that circle, so the floor and ceiling there were not scanned.

A side view of the garage.

Another view of the garage.

References:

[5] S. Rusinkiewicz. Real-time Acquisition and Rendering of Large 3D Models. Ph.D. dissertation, Stanford University, Stanford, California, Aug. 2001.

©2005 Neeraj Kumar
neeraj.kumar1 (at) gmail.com