Research
Overview
The project targets threat detection and assessment as well as public safety applications. The overall aim of the project is to develop new methods and technologies for providing situational awareness and proactive response to security issues. Specifically, the current focus is on:
Background and Motivation
There is evidence that record-only surveillance systems provide a significant deterrent to criminal acts by helping to identify and prosecute the offenders after the event. However, record-only systems provide little or no deterrent to acts of terror and do not prevent or mitigate harm. This can only be achieved with systems that raise an alarm before or during the harmful event, facilitating an appropriate human security response. The user-inspired motivation for this project is to provide reliable real-time alarms and situational awareness from existing surveillance networks without the enormous cost of intensive human monitoring. The scientific challenge is to analyse a vast number of video streams in real-time to detect a range of events relevant to security needs. To achieve this goal, a small subset of the parallel visual cognition ability of the human brain must be developed in a form which can be implemented on embedded hardware.
A key problem for "face in the crowd" identity inference from existing surveillance cameras in public spaces (such as mass transit centres) is pose mismatch between probe and reference faces. In addition to accuracy, scalability is also important, which necessarily limits the complexity of face classification algorithms. Uncontrolled face recognition from CCTV video is a grand challenge. While most reports in the literature focus on passport-quality face recognition, there is little work on video-based face recognition. Various techniques are being developed to address limitations inherent in video-based face recognition in particular and surveillance in general.
The Research and its Potential Impact
Even though automatic identity inference of cooperative subjects through face recognition has achieved good results in controlled applications such as passport control (i.e. high-resolution images and known pose, lighting, and expression), recognition in CCTV conditions is considerably more challenging. In summary, there are a number of key problems for “face in the crowd” recognition from existing surveillance cameras in public spaces (such as mass transit centres): (i) imprecise localisation/alignment (resulting in translation, scaling and in-plane rotation issues), (ii) low-resolution video, (iii) illumination variations, (iv) expression variations, (v) pose variations (out-of-plane rotations), (vi) scalability, (vii) real-time performance.
In general, an appearance-based face recognition system comprises region of interest extraction (face localisation and segmentation), followed by feature extraction and classification. The desired output of the localisation stage is a size-normalised face image with eyes at fixed locations. However, there are no explicit guarantees that this will be the case: the output face might be at the wrong scale, or subject to translations (i.e. shifts) and/or rotations, due to the quality of the data and/or the nature of the face localisation algorithm. Robustly extracting stable features can also be a challenge. Even if the output of the localisation step is as desired, there are still issues with low-resolution video and with varying illumination, expression, and pose, all of which can affect the information extracted from a given image.
According to Phillips et al., head pose is believed to be the hardest factor to model [1]. In mass transport systems, surveillance cameras are often mounted in the ceiling in places such as railway platforms and passenger trains. Since the subjects are generally not posing for the camera, it is rare to obtain a true frontal face image.
As it is not feasible to consider remounting all the cameras to improve recognition performance, any practical system must have effective pose compensation or be specifically designed to handle pose variations. A further complication is that we might have only one frontal gallery image of each person of interest (e.g. a passport photograph or a mugshot).
In addition to robustness and accuracy, scalability and fast performance are also of prime importance for surveillance. A recognition system should be able to handle large numbers of people (e.g. peak hour at a railway station), possibly processing hundreds of video streams. While it is possible to set up elaborate parallel computation machines, there are always cost considerations limiting the number of CPUs available for processing. In this context, an algorithm should be able to run in real-time or better, which necessarily limits complexity.
While there are existing approaches which solve one or two of the above-mentioned problems, there is currently no algorithm which concurrently addresses all of them. Most reports in the literature focus on passport-quality face recognition, with little work specific to surveillance conditions [2]. Although real-time face localisation is achievable [3], the quality of the localisation is variable due to the inherent nature of the approach (e.g. relatively large steps in scale). As such, any subsequent feature extraction and pattern recognition algorithms should take this variability into account. However, much research on facial feature extraction and classification naively assumes that the face localisation step is perfect. Most research on holistic face recognition can be placed into this category. (In holistic approaches, the spatial relations between face areas, such as the eyes and nose, are in effect rigidly kept.)
Previous approaches to addressing pose variation include quasi-3D approaches [4, 5], where the 3D shape is inferred from 2D images. However, the computer graphics based approach presented in [4] suffers from high computational costs, while the method presented in [5] relies on good illumination conditions. We note that while true 3D based approaches in theory allow face matching at various poses, current 3D sensing hardware has too many limitations [6], including cost and range. Moreover, unlike 2D recognition, 3D technology cannot be retrofitted to existing surveillance systems. Other approaches for dealing with pose variations include the synthesis of new images at previously unseen views [7, 8], direct synthesis of face model parameters [9], and local feature based representations with relaxed constraints on the spatial relations between face parts [10, 11, 12].
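The localisation target mentioned earlier (a size-normalised face image with eyes at fixed locations) can be illustrated with a small sketch. It is not the project's method. It assumes that the localisation stage supplies left and right eye coordinates, and it only computes the 2D similarity transform (scale, in-plane rotation, translation) that would move those eyes onto canonical positions. The canonical positions, the 64x64 image size and all function names are hypothetical choices made for illustration. Resampling the image itself is omitted.

```python
import cmath
import math

# Hypothetical canonical eye positions (in pixels) for a 64x64 normalised face.
CANONICAL_LEFT_EYE = (20.0, 24.0)
CANONICAL_RIGHT_EYE = (44.0, 24.0)


def eye_alignment_transform(left_eye, right_eye,
                            target_left=CANONICAL_LEFT_EYE,
                            target_right=CANONICAL_RIGHT_EYE):
    """Return complex (a, b) so that z -> a*z + b maps the eyes onto the targets.

    Each (x, y) point is treated as the complex number x + iy. A 2D similarity
    transform is then a single complex multiplication plus a complex offset.
    """
    p1, p2 = complex(*left_eye), complex(*right_eye)
    q1, q2 = complex(*target_left), complex(*target_right)
    if p1 == p2:
        raise ValueError("eye positions must be distinct")
    a = (q2 - q1) / (p2 - p1)  # |a| is the scale, arg(a) the in-plane rotation
    b = q1 - a * p1            # translation that pins the left eye to its target
    return a, b


def apply_transform(a, b, point):
    """Map an (x, y) point through z -> a*z + b and return it as a tuple."""
    z = a * complex(*point) + b
    return (z.real, z.imag)


def scale_and_rotation_degrees(a):
    """Split the complex factor into (scale, rotation angle in degrees)."""
    return abs(a), math.degrees(cmath.phase(a))


# Example: a face detected at twice the canonical size, with no rotation.
a, b = eye_alignment_transform((100.0, 120.0), (148.0, 120.0))
aligned_right_eye = apply_transform(a, b, (148.0, 120.0))
scale, angle = scale_and_rotation_degrees(a)
```

In this example the inter-eye distance is 48 pixels against a canonical 24, so the recovered scale is 0.5 and the rotation is zero. An error in the detected eye positions carries straight through to the transform, which is one way to see why later stages cannot assume perfect localisation.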


