VOCUS: A Visual Attention System for Object Detection and Goal-Directed Search
The PhD thesis was written at the Fraunhofer Institute for Autonomous Intelligent Systems (AIS) in Sankt Augustin, Germany. It was submitted to the University of Bonn in April 2005 and accepted in July 2005.
In 2006, it was published in Springer's series "Lecture Notes in Artificial Intelligence (LNAI)" as Vol. 3899, Springer Berlin/Heidelberg, ISBN 3-540-32759-2. You can access the thesis via the Springer online pages or directly here: PDF (© Springer).
Visual attention is a mechanism in human perception that selects relevant regions from a scene and provides them to higher-level processes such as object recognition. This enables humans to act effectively in their environment despite the complexity of the perceivable sensor data. Computational vision systems face the same problem as humans: there is a large amount of available information to process, and if efficient processing is to be achieved, possibly even real-time performance in robotic applications, the order in which a scene is investigated has to be determined intelligently. A promising approach to this end is the use of computational attention systems that simulate human visual attention.
This thesis introduces the biologically motivated computational attention system VOCUS (Visual Object detection with a CompUtational attention System), which detects regions of interest in images. It operates in two modes: an exploration mode, in which no task is provided, and a search mode with a specified target. In exploration mode, regions of interest are defined by strong contrasts (e.g., color or intensity contrasts) and by the uniqueness of a feature; for example, a black sheep is salient in a flock of white sheep. In search mode, the system uses previously learned information about a target object to bias the saliency computations towards the target. Various experiments show that the target is found on average with fewer than three fixations, that usually fewer than five training images suffice to learn the target information, and that the system is largely robust to viewpoint changes and illumination variations.
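The two modes can be illustrated with a small sketch. This is a deliberate simplification, not the actual VOCUS implementation: box blurs stand in for the Gaussian-pyramid center-surround differences, the uniqueness weight 1/sqrt(m) (with m the number of strong local maxima) follows the idea that a map with one strong peak, the black sheep, is more informative than a map with many peaks, and the top-down combination of exciting and inhibiting maps is reduced to its bare form. All function names and parameters are illustrative.

```python
import numpy as np

def box_blur(img, r):
    """Average over a (2r+1)-pixel window along each axis (edge padding)."""
    out = img.astype(float)
    for axis in (0, 1):
        pad = [(r, r) if a == axis else (0, 0) for a in (0, 1)]
        csum = np.insert(np.cumsum(np.pad(out, pad, mode="edge"), axis=axis),
                         0, 0.0, axis=axis)
        n, win = out.shape[axis], 2 * r + 1
        out = (np.take(csum, np.arange(win, win + n), axis=axis)
               - np.take(csum, np.arange(n), axis=axis)) / win
    return out

def center_surround(img, c=1, s=4):
    """Contrast map: |fine blur - coarse blur|, a stand-in for the
    center-surround differences computed on image pyramids."""
    return np.abs(box_blur(img, c) - box_blur(img, s))

def uniqueness_weight(fmap, frac=0.5):
    """1/sqrt(m), where m counts local maxima above frac * max.
    One strong peak keeps its weight; many competing peaks suppress the map."""
    t = frac * fmap.max()
    c = fmap[1:-1, 1:-1]
    peak = c >= t
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if dy or dx:  # compare against each of the 8 neighbors
                peak &= c >= fmap[1 + dy:fmap.shape[0] - 1 + dy,
                                  1 + dx:fmap.shape[1] - 1 + dx]
    return 1.0 / np.sqrt(max(int(peak.sum()), 1))

def bottom_up_saliency(feature_maps):
    """Exploration mode: sum of uniqueness-weighted feature maps."""
    return sum(uniqueness_weight(f) * f for f in feature_maps)

def top_down_saliency(feature_maps, weights):
    """Search mode (simplified): feature maps whose learned target weight
    w exceeds 1 excite the saliency map, maps with w < 1 inhibit it.
    A weight would be learned as the ratio of mean activation inside the
    target region to the mean activation outside it."""
    excite = sum(w * f for f, w in zip(feature_maps, weights) if w > 1)
    inhibit = sum((1.0 / w) * f for f, w in zip(feature_maps, weights) if w < 1)
    return excite - inhibit
```

For a single bright patch on a dark background, `bottom_up_saliency([center_surround(img)])` peaks at the patch center; in search mode, the same machinery shifts the peak towards whichever features the learned weights favor.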
Furthermore, we demonstrate how VOCUS profits from additional sensor data: we apply the system to depth and reflectance data from a 3D laser scanner and show the advantages that these laser modes provide. By fusing the data of both modes, the system can consider distinct object properties, and its flexibility increases with each additional kind of data. Finally, the regions of interest provided by VOCUS serve as input for a classifier that recognizes the object in the detected region. We show how, and in which cases, classification is sped up and detection quality is improved by the attentional front end. This approach is especially useful when many object classes have to be considered, a situation that occurs frequently in robotics.
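The fusion across sensor modes can be sketched as a normalized combination of per-mode saliency maps. Again this is only a hedged illustration of the idea, not the thesis's actual fusion scheme; the function names and the equal default weighting are assumptions.

```python
import numpy as np

def normalize(m):
    """Scale a map to [0, 1]; a flat map stays zero."""
    m = m.astype(float)
    span = m.max() - m.min()
    return (m - m.min()) / span if span > 0 else np.zeros_like(m)

def fuse_modes(mode_maps, weights=None):
    """Combine saliency maps from different sensor modes (e.g. laser depth
    and laser reflectance) into one map. Each mode is normalized first so
    that no sensor dominates merely because of its value range."""
    if weights is None:
        weights = [1.0] * len(mode_maps)  # assumption: equal trust per mode
    fused = sum(w * normalize(m) for m, w in zip(mode_maps, weights))
    return fused / sum(weights)
```

The point of the normalization step is that a depth map measured in millimeters and a reflectance map in unit intensities contribute on equal footing; an object that stands out in either mode then remains visible in the fused map.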
VOCUS provides a powerful approach to improving existing vision systems by concentrating computational resources on regions that are more likely to contain relevant information. The more the complexity and power of vision systems increase in the future, the more they will profit from an attentional front end like VOCUS.