
Advanced Surveillance 


Overview

The project targets the domain of threat detection and assessment as well as public safety applications. The overall aim of the project is to develop new methods and technologies for providing situational awareness and proactive response to security issues. Specifically, the current focus is on:

    • Building situational awareness and identity inference from surveillance video and/or low-quality images
    • Real-time forensic analysis of multiple surveillance video feeds

Background and Motivation

There is evidence that record-only surveillance systems provide a significant deterrent to criminal acts by helping to identify and prosecute offenders after the event. However, record-only systems provide little or no deterrent to acts of terror and do not prevent or mitigate harm. This can only be achieved with systems that raise an alarm before or during the harmful event, facilitating an appropriate human security response.

The user-inspired motivation for this project is to develop a way to provide reliable real-time alarms and situational awareness from existing surveillance networks without the enormous cost of intensive human monitoring. The scientific challenge is to analyse a vast number of video streams in real time to detect a range of events relevant to security needs. To achieve this goal, a small subset of the parallel visual cognition ability of the human brain must be developed in a form which can be implemented on embedded hardware.

A key problem for "face in the crowd" identity inference from existing surveillance cameras in public spaces (such as mass transit centres) is the issue of pose mismatches between probe and reference faces. In addition to accuracy, scalability is also important, which necessarily limits the complexity of face classification algorithms. Uncontrolled face recognition from CCTV video is a grand challenge. While most reports in the literature focus on passport-quality face recognition, there is little work on video-based face recognition. Various techniques are being developed to address limitations inherent in video-based face recognition in particular and surveillance in general.

[Figure: Multi-person tracking and labelling at a metropolitan railway station]

[Figure: Face recognition at a metropolitan railway station]

The Research and its Potential Impact

Even though automatic identity inference of cooperative subjects through face recognition has achieved good results in controlled applications such as passport control (i.e. high-resolution images and known pose, lighting, and expression), recognition in CCTV conditions is considerably more challenging. In summary, there are a number of key problems for “face in the crowd” recognition from existing surveillance cameras in public spaces (such as mass transit centres): (i) imprecise localisation/alignment (resulting in translation, scaling and in-plane rotation issues), (ii) low-resolution video, (iii) illumination variations, (iv) expression variations, (v) pose variations (out-of-plane rotations), (vi) scalability, and (vii) real-time performance.

In general, an appearance-based face recognition system typically comprises region of interest extraction (face localisation and segmentation), followed by feature extraction and classification. The desired output of the localisation stage is a size-normalised face image with eyes at fixed locations. However, there are no explicit guarantees that this will be the case: the output face might be at the wrong scale, or subject to translations (i.e. shifts) and/or rotations, due to the quality of the data and/or the nature of the face localisation algorithm. Robustly extracting stable features can also be a challenge. Even if the output of the localisation step is as desired, there are still issues with low-resolution video and with varying illumination, expression, and pose, all of which can affect the information extracted from a given image.
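As a minimal sketch of the normalisation step described above, assuming the localiser reports two eye coordinates, the similarity transform (scale, in-plane rotation, translation) that moves the detected eyes to fixed positions can be computed from those two points alone. The function names and the target eye positions (chosen for a hypothetical 64x64 crop) are illustrative assumptions, not the project's implementation.

```python
import cmath
import math

# Hypothetical canonical eye positions inside a 64x64 normalised face crop.
TARGET_LEFT = (16.0, 24.0)
TARGET_RIGHT = (48.0, 24.0)


def eye_alignment_transform(left_eye, right_eye,
                            target_left=TARGET_LEFT, target_right=TARGET_RIGHT):
    """Return (a, b) so that z -> a*z + b maps the detected eyes onto the targets.

    2D points (x, y) are encoded as complex numbers x + iy. The complex
    factor a carries scale and in-plane rotation; b carries translation.
    """
    s0, s1 = complex(*left_eye), complex(*right_eye)
    d0, d1 = complex(*target_left), complex(*target_right)
    if s0 == s1:
        raise ValueError("eye positions must be distinct")
    a = (d1 - d0) / (s1 - s0)
    b = d0 - a * s0
    return a, b


def apply_transform(transform, point):
    """Map an image point (x, y) into the normalised face coordinate frame."""
    a, b = transform
    z = a * complex(*point) + b
    return (z.real, z.imag)


def scale_and_rotation(transform):
    """Return the scale factor and in-plane rotation (degrees) of the transform."""
    a, _ = transform
    return abs(a), math.degrees(cmath.phase(a))
```

Any error in the detected eye positions propagates directly into the scale, rotation and shift of the normalised face, which is why later stages cannot safely assume perfect alignment.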

According to Phillips et al., head pose is believed to be the hardest factor to model [1]. In mass transport systems, surveillance cameras are often mounted in the ceiling in places such as railway platforms and passenger trains. Since the subjects are generally not posing for the camera, it is rare to obtain a true frontal face image. As it is not feasible to consider remounting all the cameras to improve recognition performance, any practical system must have effective pose compensation or be specifically designed to handle pose variations. A further complication is that we might have only one frontal gallery image of each person of interest (e.g. a passport photograph or a mugshot).

In addition to robustness and accuracy, scalability and fast performance are also of prime importance for surveillance. A recognition system should be able to handle large numbers of people (e.g. peak hour at a railway station), possibly processing hundreds of video streams. While it is possible to set up elaborate parallel computation machines, there are always cost considerations limiting the number of CPUs available for processing. In this context, an algorithm should be able to run in real time or better, which necessarily limits complexity.

While there are existing approaches which solve one or two of the above-mentioned problems, there is currently no algorithm which concurrently addresses all of them. Most reports in the literature focus on passport-quality face recognition, with little work specific to addressing surveillance conditions [2]. Although real-time face localisation is achievable [3], the quality of the localisation is variable due to the inherent nature of the approach (e.g. relatively large steps in scale). As such, any subsequent feature extraction and pattern recognition algorithms should take this variability into account. However, much research on facial feature extraction and classification naively assumes that the face localisation step is perfect. Most research on holistic face recognition can be placed into this category. (In holistic approaches, the spatial relations between face areas, such as the eyes and nose, are in effect rigidly kept.)
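The effect of scanning at discrete scales can be illustrated with a small calculation. The sketch below assumes a detector whose window grows geometrically from a base size. The 24-pixel base window and the 1.25 scale factor are illustrative assumptions, and the function names are hypothetical. It reports how far the nearest available window size can be from the true face size.

```python
import math


def detector_scales(base_size, scale_factor, max_size):
    """List the window sizes a geometric scale scan would visit."""
    sizes = []
    size = float(base_size)
    while size <= max_size:
        sizes.append(size)
        size *= scale_factor
    return sizes


def nearest_scale(face_size, sizes):
    """Pick the window size closest to the true face size in log-scale terms."""
    return min(sizes, key=lambda s: abs(math.log(s / face_size)))


def relative_size_error(face_size, sizes):
    """Relative mismatch between the chosen window and the true face size."""
    return nearest_scale(face_size, sizes) / face_size - 1.0


if __name__ == "__main__":
    sizes = detector_scales(24, 1.25, 100)
    for face in (27, 33, 52):
        print(face, nearest_scale(face, sizes), round(relative_size_error(face, sizes), 3))
```

Under these assumptions the size mismatch can approach sqrt(1.25) - 1, or roughly 12%, before any translation or rotation error is considered, so feature extraction that expects exactly size-normalised faces is likely to be fragile.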

Previous approaches to addressing pose variation include quasi-3D approaches [4, 5], where the 3D shape is inferred from 2D images. However, the computer-graphics-based approach presented in [4] suffers from high computational costs, while the method presented in [5] relies on good illumination conditions. We note that while true 3D-based approaches in theory allow face matching at various poses, current 3D sensing hardware has too many limitations [6], including cost and range. Moreover, unlike 2D recognition, 3D technology cannot be retrofitted to existing surveillance systems. Other approaches for dealing with pose variations include the synthesis of new images at previously unseen views [7, 8], direct synthesis of face model parameters [9], and local feature-based representations with relaxed constraints on the spatial relations between face parts [10, 11, 12].
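To illustrate what relaxing spatial constraints means in practice, the following is a minimal bag-of-features sketch: the face is cut into small patches, each patch is assigned to its nearest entry in a codebook, and only the resulting histogram is kept, so the position of each patch is discarded. This is a simplified illustration and not the method of any of the cited works. The codebook, patch size, and function names are hypothetical.

```python
def extract_patches(image, patch_size):
    """Split a 2D list of pixel values into non-overlapping square patches.

    Each patch is flattened into a tuple of pixel values.
    """
    rows, cols = len(image), len(image[0])
    patches = []
    for top in range(0, rows - patch_size + 1, patch_size):
        for left in range(0, cols - patch_size + 1, patch_size):
            patches.append(tuple(image[r][c]
                                 for r in range(top, top + patch_size)
                                 for c in range(left, left + patch_size)))
    return patches


def nearest_codeword(patch, codebook):
    """Index of the codebook entry with the smallest squared distance to the patch."""
    def sq_dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return min(range(len(codebook)), key=lambda i: sq_dist(patch, codebook[i]))


def bag_of_features(image, codebook, patch_size):
    """Normalised histogram of codeword assignments; patch locations are ignored."""
    patches = extract_patches(image, patch_size)
    if not patches:
        raise ValueError("image is smaller than one patch")
    counts = [0] * len(codebook)
    for patch in patches:
        counts[nearest_codeword(patch, codebook)] += 1
    return [count / len(patches) for count in counts]
```

Because the histogram ignores where each patch came from, two images whose patches are merely rearranged produce the same descriptor. That tolerance to misalignment and pose change is gained at the cost of discarding some discriminative layout information.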

[1] P.J. Phillips, P. Grother, R. Micheals, D.M. Blackburn, E. Tabassi, M. Bone. Face recognition vendor test 2002. Analysis and Modeling of Faces and Gestures, 2003.
[2] S. Zhou, V. Krueger, R. Chellappa. Probabilistic recognition of human faces from video. Computer Vision and Image Understanding, Vol. 91, 2003, pp. 214-245.
[3] P. Viola, M.J. Jones. Robust real-time face detection. International Journal of Computer Vision, Vol. 57, No. 2, 2004, pp. 137-154.
[4] V. Blanz, T. Vetter. Face recognition based on fitting a 3D morphable model. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 25, No. 9, 2003, pp. 1063-1074.
[5] J. Zhang, Y. Yan, M. Lades. Face recognition: eigenface, elastic matching, and neural nets. Proceedings of the IEEE, Vol. 85, No. 9, 1997, pp. 1423-1435.
[6] K. Bowyer, K. Chang, P. Flynn. A survey of approaches and challenges in 3D and multimodal 3D+2D face recognition. Computer Vision and Image Understanding, Vol. 101, No. 1, 2006, pp. 1-15.
[7] V. Blanz, P. Grother, P.J. Phillips, T. Vetter. Face recognition based on frontal views generated from non-frontal images. International Conference on Computer Vision and Pattern Recognition, Vol. 2, 2005, pp. 454-461.
[8] T. Shan, B.C. Lovell. Face recognition robust to head pose from one sample image. International Conference on Pattern Recognition, Vol. 1, 2006, pp. 515-518.
[9] C. Sanderson, S. Bengio, Y. Gao. On transforming statistical models for non-frontal face verification. Pattern Recognition, Vol. 39, No. 2, 2006, pp. 288-302.
[10] F. Cardinaux, C. Sanderson, S. Bengio. User authentication via adapted statistical models of face images. IEEE Transactions on Signal Processing, Vol. 54, No. 1, 2006, pp. 361-373.
[11] S. Lucey, T. Chen. Learning patch dependencies for improved pose mismatched face verification. International Conference on Computer Vision and Pattern Recognition, Vol. 1, 2006, pp. 909-915.
[12] E. Nowak, F. Jurie, B. Triggs. Sampling strategies for bag-of-features image classification. European Conference on Computer Vision, Part IV, Lecture Notes in Computer Science, Vol. 3954, 2006, pp. 490-503.