Human Detection & Re-Identification for Mass Transit Environments
Overview and Significance
Large networks of cameras are ubiquitous in urban life, especially in densely populated environments such as airports, train stations and sports arenas. For cost and practicality, most cameras in such networks are widely spaced so that their fields of view are non-overlapping. Automatically matching objects, especially humans, which re-appear across different cameras in such networks, is a key research goal in computer vision and a critical problem in homeland-security-related surveillance applications.
In recent years, the fundamental research question behind automatically matching objects has been distilled into the human re-identification (re-id) problem. That is, given a cropped rectangle of pixels representing a human in one view, a re-id algorithm produces a similarity score for each candidate in a gallery of similarly cropped human rectangles from a second view. Computer vision research in re-id largely focuses on two issues. The first is feature selection; i.e., determining effective ways to extract representative information from each cropped rectangle to produce descriptors. The second is metric learning; i.e., determining effective ways to compare descriptors from different viewpoints. Feature selection and metric learning should work together so that images of the same person from different points of view yield high similarity, while images of different people yield low similarity. Re-id algorithms are typically validated on benchmarking datasets agreed upon by the academic community, notably the VIPeR, ETHZ and i-LIDS MCTS datasets.
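The interplay between descriptors and a learned metric can be illustrated with a minimal sketch. Everything here is an assumption chosen for illustration, not the project's method: the descriptor is a toy intensity histogram, the function names (`histogram_descriptor`, `metric_distance`, `rank_gallery`) are hypothetical, and the matrix `M` stands in for whatever a metric-learning step would produce (the identity matrix reduces it to squared Euclidean distance).

```python
from typing import Dict, List, Sequence

def histogram_descriptor(pixels: Sequence[int], bins: int = 4) -> List[float]:
    """Toy feature: normalized intensity histogram of a crop (values 0..255)."""
    counts = [0.0] * bins
    for p in pixels:
        counts[min(p * bins // 256, bins - 1)] += 1.0
    total = sum(counts) or 1.0
    return [c / total for c in counts]

def metric_distance(x: List[float], y: List[float], M: List[List[float]]) -> float:
    """Squared Mahalanobis-style distance (x - y)^T M (x - y).

    M is assumed to come from a metric-learning step; lower means more similar.
    """
    d = [a - b for a, b in zip(x, y)]
    n = len(d)
    return sum(d[i] * M[i][j] * d[j] for i in range(n) for j in range(n))

def rank_gallery(probe: List[float],
                 gallery: Dict[str, List[float]],
                 M: List[List[float]]) -> List[str]:
    """Return gallery ids ordered from most to least similar to the probe."""
    scored = [(metric_distance(probe, desc, M), gid) for gid, desc in gallery.items()]
    return [gid for _, gid in sorted(scored)]

# Hypothetical data: one probe crop and two gallery crops, as flat pixel lists.
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
probe = histogram_descriptor([10, 20, 200, 250])
gallery = {
    "a": histogram_descriptor([12, 25, 210, 240]),    # similar appearance
    "b": histogram_descriptor([100, 110, 120, 130]),  # different appearance
}
ranking = rank_gallery(probe, gallery, identity)
```

In this sketch, feature selection corresponds to the choice of `histogram_descriptor` and metric learning to the choice of `M`; a real system would replace both with far richer models.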
However, feature selection and metric learning represent only two aspects of creating an effective real-world re-id algorithm. In practice, a re-id system must be fully autonomous from the point that an end user draws a rectangle around a person of interest to the point that candidates are presented to them. This implies that the system must automatically detect and track humans in the field of view of all cameras with speed and accuracy. In practice, the candidates in the re-id gallery are thus automatically generated and are typically of much lower quality than the hand-curated gallery of a benchmark dataset; in fact, many candidate rectangles may not even represent humans. Furthermore, in a typical branching camera network, the camera in which the target reappears is unknown, so there are actually several separate galleries to search. The timing of the reappearance is also unknown; the galleries will be constantly updated with new candidates over the course of minutes or hours instead of being presented to the algorithm all at once. Finally, real-world re-id maps naturally onto a multi-shot problem. That is, there are multiple images available to describe both the target and the matching candidates, since after a target of interest is detected in the field of view of one camera, they are usually tracked until leaving the current view.
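The streaming, multi-gallery, multi-shot setting described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the deployed system: the class `StreamingGalleries`, its method names, the camera and track ids, and the choice of the minimum pairwise distance as the set-to-set score are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Descriptor = List[float]

def sq_dist(x: Descriptor, y: Descriptor) -> float:
    """Squared Euclidean distance between two descriptors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def set_distance(target: List[Descriptor], candidate: List[Descriptor]) -> float:
    """Multi-shot score (assumed): smallest pairwise distance between image sets."""
    return min(sq_dist(t, c) for t in target for c in candidate)

class StreamingGalleries:
    """One gallery per camera, each filled incrementally by a tracker."""

    def __init__(self, target_shots: List[Descriptor]) -> None:
        self.target = target_shots
        # camera id -> {track id -> descriptors observed so far}
        self.tracks: Dict[str, Dict[int, List[Descriptor]]] = defaultdict(dict)

    def add_observation(self, camera: str, track_id: int, desc: Descriptor) -> None:
        """Record one more image of a tracked candidate in a given camera."""
        self.tracks[camera].setdefault(track_id, []).append(desc)

    def ranked(self, top_k: int = 5) -> List[Tuple[str, int]]:
        """Best (camera, track) candidates across all galleries, most similar first."""
        scored = [
            (set_distance(self.target, shots), camera, track_id)
            for camera, tracks in self.tracks.items()
            for track_id, shots in tracks.items()
        ]
        return [(cam, tid) for _, cam, tid in sorted(scored)[:top_k]]

# Hypothetical usage: the target was tracked for two frames in the tagging camera.
galleries = StreamingGalleries([[0.0, 1.0], [0.1, 0.9]])
galleries.add_observation("cam2", 7, [0.9, 0.1])
first = galleries.ranked()
galleries.add_observation("cam3", 4, [0.1, 0.95])
second = galleries.ranked()
galleries.add_observation("cam2", 7, [0.1, 0.9])  # a later, better view of track 7
third = galleries.ranked()
```

Re-scoring every track on each call keeps the sketch short; a low-latency implementation would more plausibly update only the track that received the new observation.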
Additionally, the deployment of a re-id algorithm in a real-world environment faces many practical constraints not typically encountered in an academic research lab. In contrast to recently purchased, high-quality digital cameras, a legacy surveillance system is likely to contain low-quality, perhaps even analog, cameras whose positions and orientations cannot be altered to improve performance. The video data collected by cameras in the network is likely to be transmitted to secure servers over limited-bandwidth links, and these servers are likely to have limited storage since many cameras’ data must be compressed and archived. These servers are also likely to be closed off from the internet, so that any algorithm upgrades and testing must be done physically on-site. Because the algorithm must run autonomously, it requires a robust, crash-proof software architecture that exploits any available computational resources (e.g., parallel or distributed processing) while still guaranteeing low latency. On the front end, the algorithm must run in real time, updating a ranked list of matching candidates as fast as they appear in each potential camera, and the results must be presented to the user in an easy-to-use, non-technical interface.
This project addresses the design and deployment of real-world re-id algorithms specifically designed for mass transit environments, in which we had to surmount the above challenges. We call this implementation of re-id “tag-and-track” because the system begins with the user tagging a person of interest in one camera and then attempts to track them throughout the broader airport camera network in real time. This involves:
1) The design and analysis of new computer vision algorithms for the human detection and tracking, feature selection, and metric learning problems in re-id; and
2) The selection and implementation of suitable algorithms in a modular, low-latency software architecture deployed at the Cleveland-Hopkins International Airport (CLE) for validation of the overall framework in an ALERT-designed camera network testbed.
Video surveillance is an integral aspect of homeland security monitoring, forensic analysis and intelligence collection. The research projects in this area were directly motivated (and in fact, requested) by DHS officials as critical needs for their surveillance infrastructure.
Phase 2 Year 2 Annual Report
Rensselaer Polytechnic Institute
Students Currently Involved in Project
- Srikrishna Karanam
- Austin Li
- Gyanendra Sharma