CASY: Cognitive Assistive Systems
Spokespersons for the DAAD PhD Programme CASY and the associated PhD Programme CINACS
Prof. Dr. Stefan Wermter (CASY)
Head of Knowledge Technology
University of Hamburg, Department of Informatics
Prof. Dr. Jianwei Zhang (CINACS)
Head of Technical Aspects of Multimodal Systems
University of Hamburg, Department of Informatics
The programme “Cognitive Assistive Systems (CASY)” will contribute to the next generation of human-centred systems for human-computer and human-robot collaboration. A central requirement for these systems is a high level of robustness and increased adaptivity, so that they can act more naturally under uncertain conditions. To address this need, research will focus on cognitively motivated multi-modal integration and human-robot interaction. Modalities from sensors and processed representational modalities will be combined in multi-modal perception or communication models and evaluated for their use in assistive systems based on computational, statistical, neural or hybrid approaches.
Aims and Objectives
Learning and memory are key to human and machine intelligence and behaviour. Through frequent interactions between users and the real world, diverse skills can be acquired incrementally from users and integrated to support intelligent cooperation. However, to collaborate effectively with a computer, mobile device or intelligent robot, next-generation human-centred systems for human-computer and human-robot collaboration need to be more robust, more adaptive, more predictive, and to act more naturally under uncertain conditions. This aim is key to acceptance by users and to intuitive collaboration between humans and cognitive assistive systems. Projects following this human-centred approach make it possible to tackle challenges such as hybrid architectures based on data, information and knowledge; personalized interfaces and learning robots; novel learning technologies in self-aware machines; and complex system engineering. From a cognitive perspective, the adaptation of multi-sensory processes and the communication across different modalities and systems in the brain are of great interest. From our computational perspective, we are concerned with the design and implementation of these observed cross-modal interactions and their use in computational systems.
The research questions to be addressed include:
- Which role does multi-modality play in human-computer and human-robot interaction?
- How can the principles of cross-modal interaction observed in natural systems be implemented in artificial systems?
- How is cross-modal information integrated to support robust decision making?
- Which codes does the brain use to allow communication across different modalities and systems?
- How do multi-sensory learning and alignment adapt and change in the long run?
- How are the principles of cross-modal interaction at the perceptual level related to the principles of cross-modal interaction and integration in higher cognitive functions such as language processing and image understanding?
- How can human-computer and human-robot interaction be enhanced with rapid response capabilities to provide immediate confirming or corrective feedback?
T1: Neurocognitive Architectures for Human-Robot Assistance
All our cognitive performance rests on the brain, and brain-based cognitive processing still outperforms computational processing in tasks such as interaction or navigation. We are therefore pursuing a research programme on how brain-inspired findings from cognitive science and neuroscience can lead to new forms of embodied neurocognitive architectures based on robust, learned and embodied processing. The research builds on our previous brain-inspired models of higher-level cognitive functions related to language, vision, working memory, and intelligent control systems (Wermter et al. 2001, Wermter et al. 2004, Murray et al. 2009, Wermter et al. 2009). Our research has led to unsupervised learning algorithms that explain neural development in the visual and motor cortex (Wermter and Elshaw 2003, Wermter et al. 2005). Biological evidence on mirror neurons (Rizzolatti and Arbib 1998) has contributed to our models of the motor cortex (Elshaw et al. 2004, Weber et al. 2005) and their relation to language representations (Panchev and Wermter 2006, Pulvermüller and Fadiga 2010). This neurocognitive evidence from the brain informs the design of a computational methodology and a hybrid robotic architecture based on computational neural networks and statistical symbolic representations. This brain-inspired embodied robot approach may not only lead to a better functional understanding of brain processing, but in the long run also to robust adaptive assistants that function under noisy conditions. To focus the project, we will concentrate on human-robot assistance in a home environment, where a learning humanoid robot will be trained to assist with small tasks such as identifying a human based on sound and vision, approaching a person, delivering small carried items, and finding its way back to a charging station.
Neural and statistical symbolic structures will be integrated; for example, we have identified neural fields that can be used interchangeably with statistical particle filters (Yan et al. 2011). Emphasis will be put on integrating the implementations and representations into the neural and statistical architecture.
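To illustrate the statistical side of such an architecture, the following is a minimal bootstrap particle filter for tracking a person's one-dimensional position from noisy measurements. This is a generic textbook sketch with synthetic data and invented noise parameters, not the implementation of Yan et al. (2011), which combines such a filter with neural fields and a ceiling-mounted camera.

```python
import numpy as np

rng = np.random.default_rng(0)

def particle_filter_step(particles, weights, measurement,
                         motion_noise=0.3, meas_noise=0.5):
    """One predict-update-resample cycle of a bootstrap particle filter."""
    # Predict: diffuse particles according to a simple motion model.
    particles = particles + rng.normal(0.0, motion_noise, size=particles.shape)
    # Update: reweight each particle by its Gaussian measurement likelihood.
    weights = weights * np.exp(-0.5 * ((particles - measurement) / meas_noise) ** 2)
    weights /= weights.sum()
    # Resample: draw particles proportionally to their weights.
    idx = rng.choice(len(particles), size=len(particles), p=weights)
    return particles[idx], np.full(len(particles), 1.0 / len(particles))

# Synthetic scenario: a person walks along a corridor (1-D position).
particles = rng.uniform(0.0, 10.0, 200)      # initial belief: anywhere
weights = np.full(200, 1.0 / 200)
true_pos = 2.0
for _ in range(30):
    true_pos += 0.2                          # person moves to the right
    z = true_pos + rng.normal(0.0, 0.5)      # noisy camera measurement
    particles, weights = particle_filter_step(particles, weights, z)

estimate = particles.mean()                  # posterior mean position
```

The posterior mean stays close to the true position despite the noisy measurements; in the hybrid architecture, a neural field maintaining a peak of activation over the same state space could be substituted for this filter.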
T2: Interactive and Incremental Spoken Language Comprehension for Assistive Systems
Language comprehension is a complex task that involves a range of different representational levels and processing components mediating between them. In addition to the phonetics of sounds, syllables and lexical items, the range of phenomena language comprehension has to deal with includes at least syntactic and semantic regularities, which establish the meaning of isolated utterances, as well as aspects of discourse structure, which capture the relationships between utterances. Moreover, suprasegmental (prosodic) properties of the speech signal not only indicate sentence mode and phrasal boundaries, but also help to guide the attention of the hearer to those parts of an utterance the speaker wants to emphasize. Usually, multi-component and multi-level architectures are adopted to cope with this multitude of issues. Complexity increases further if multi-modal interaction, possibly with different sensory input channels, is considered. In the simplest case, a dialog system is composed of four components: speech recognition and synthesis, a backend application, and a dialog model that connects the other components. Even in this case, non-trivial interaction patterns between the components arise if the whole system is expected to work incrementally and to exhibit an immediate response capability. This is particularly important in the context of cognitive assistive technology. The project aims at investigating such interaction patterns under conditions of recognition uncertainty. In particular, the effect of top-down expectations originating from the verbal or situational context will be examined. These can already be studied in the interplay between a speech recognizer and a dialog model, but more sophisticated architectures including components that explicitly deal with world knowledge and user intentions might also be considered.
A dialog model represents the regularities of typical verbal exchanges by means of transitions between dialog states, and is thus able to capture essential aspects of discourse structure across the individual verbal contributions of the dialog partners. Moreover, it can be used to generate dynamic, i.e. dialog-state-dependent, predictions, which are usually highly effective in reducing the hypothesis space of the speech recognizer.
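Such dialog-state-dependent predictions can be sketched, purely illustratively, as a small finite-state dialog model whose current state constrains the vocabulary the recognizer should expect next. All state names, words, and function names below are hypothetical examples, not part of the project's actual architecture:

```python
# Hypothetical finite-state dialog model for an assistive robot.
# Each state lists the words the recognizer should expect next,
# shrinking its hypothesis space to a handful of candidates.
DIALOG_MODEL = {
    "greet":    {"expects": ["hello", "hi"],                "next": "ask_task"},
    "ask_task": {"expects": ["fetch", "navigate", "stop"],  "next": "confirm"},
    "confirm":  {"expects": ["yes", "no"],                  "next": "greet"},
}

def predict_vocabulary(state):
    """Dynamic, dialog-state-dependent prediction passed to the recognizer."""
    return DIALOG_MODEL[state]["expects"]

def advance(state, recognized_word):
    """Transition to the next dialog state if the word matches an expectation."""
    if recognized_word in DIALOG_MODEL[state]["expects"]:
        return DIALOG_MODEL[state]["next"]
    return state  # unexpected word: stay in the same state and re-prompt
```

For instance, in the `ask_task` state the recognizer only needs to discriminate between three words rather than its full vocabulary; an incremental system would additionally feed partial hypotheses back before the utterance is complete.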
T3: Multi-Sensor Based Situation Awareness for Human-Centric Assistive Navigation
The objective of this PhD topic is to investigate fundamental issues of situation awareness in assistive navigation for blind persons accessing GPS-denied environments with wearable sensors and assistive display units (i.e. audio and tactile). People with normal vision orient themselves in physical space and navigate from place to place with ease, even in unfamiliar environments. However, these are challenging tasks for people who are blind or have significant visual impairments, even with the help of electronic travel aids and vision techniques (Laurent and Christian 2007, Cardin et al. 2007, Ran et al. 2004, Coughlan et al. 2006). While GPS-guided travel aids show much promise for waypoint navigation in outdoor environments, there is still a paucity of technologies to help blind people independently find doors, rooms, elevators, stairs, bathrooms, and other building amenities in GPS-denied indoor environments. The ability of blind and visually impaired people to access, understand, and explore unfamiliar environments will improve their inclusion and integration into society. The proposal is backed up by previous research experience in computer vision and the interaction of different modalities of sensory and cognitive systems (Mäder et al. 2008, Chen et al. 2008, Zhang et al. 2011, Elmogy et al. 2011). The aim of this project is to 1) develop assistive navigation software based on the simultaneous localization and mapping (SLAM) principle, which will be able to detect landmarks by feature extraction (spatial and temporal, textual and iconic, etc.), generate and update a navigation map, register landmarks in the map, associate new measurements with landmarks, and detect loop closures, thereby helping blind users obtain a global perception of the environment and make travel decisions; and 2) develop new methods and algorithms for understanding the surrounding environment (i.e. detecting elevators, exits, bathrooms, doors, etc.) that handle occlusions, large intra-class variations, small inter-class variations, and different views and scales by associating contextual information.
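The map-management loop in aim 1), registering landmarks, associating new measurements with them, and flagging re-observations as potential loop closures, can be sketched as follows. The class, the nearest-neighbour gating threshold, and the landmark labels are illustrative assumptions for exposition, not the project's implementation (a full SLAM system would also estimate the user's pose and landmark uncertainties):

```python
import math

class LandmarkMap:
    """Toy landmark map illustrating registration, data association,
    and loop-closure detection in a SLAM-style navigation aid."""

    def __init__(self, gate=1.0):
        self.landmarks = []   # list of (x, y, label), e.g. ("door", ...)
        self.gate = gate      # maximum association distance (metres, assumed)

    def associate(self, x, y):
        """Return the index of the nearest landmark within the gate, else None."""
        best, best_dist = None, self.gate
        for i, (lx, ly, _) in enumerate(self.landmarks):
            dist = math.hypot(x - lx, y - ly)
            if dist < best_dist:
                best, best_dist = i, dist
        return best

    def observe(self, x, y, label):
        """Associate a measurement with the map; register it if unmatched.
        Re-observing a known landmark is the cue for a loop closure."""
        i = self.associate(x, y)
        if i is None:
            self.landmarks.append((x, y, label))
            return "registered"
        return "re-observed"  # matching a known landmark closes a loop
```

When the user returns near an already-registered door, `observe` reports a re-observation instead of adding a duplicate landmark; in a full SLAM system this event would trigger a global correction of the map.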
T4: Verbally Assisted Exploration of Virtual-Tactile Graphical Representations
The use of spatial knowledge is ubiquitous in our daily life. For example, selecting a residence to rent or planning a route through unknown territory is not possible without spatial knowledge. In both cases, maps, i.e. external representations, are effective means to solve such problems. External representations (graphs, charts, or diagrams), which are related to but different from maps, are based on graphical elements that constitute meaning through their spatial properties and the spatial relations among them (Kosslyn 1989, Habel & Acartürk 2007, 2011, Acartürk et al. 2008). For visually impaired people, the information presented by traditional, i.e. printed or drawn, maps, graphs, etc. is not directly accessible. To overcome these limitations, appropriate substitutes for such graphical representations are required. Whereas visual perception supports comprehension processes that switch between global and local aspects of a graphical representation, haptic perception has a more local and, in particular, a more sequential character. Thus, compared to visual maps, one drawback of tactile maps is the restriction of the haptic sense with regard to the simultaneous perception of information (see Loomis et al. 1991). The additional effort in haptic perception leads, for example in the case of exploring embossed maps, to specific limitations in building up cognitive maps, such as a sparse density of information and less adequate survey knowledge compared to route knowledge. Similar problems occur in exploring physically realized tactile graphs and diagrams. To overcome the problems caused by the restriction of the haptic sense in the simultaneous perception of information, providing additional information, such as assistance through the auditory channel, has proved helpful (Wang et al. 2009, Lohmann et al. 2011, Lohmann & Habel 2012).
This project aims at investigating multimodal comprehension and generation on two layers: (1) the representational layer, e.g. language and graphics, and (2) the perceptual layer, in particular the sensory channels of vision, audition, and haptics. The main research questions to be investigated in this project concern the interplay between processes on different layers (and sub-layers). The similarity of the two processing structures, one leading to the cognitive representation of the environment and the other to the desired action, promises a deeper understanding of sensory and motor cortices. A robotic actor will plan and move while considering a moving human in a dynamic environment.
Acarturk, C., Habel, C., Cagiltay, K. & Alacam, O. (2008). Multimodal Comprehension of Language and Graphics: Graphs with and without annotations. Journal of Eye Movement Research, 1(3):2, pp. 1-15.
Baker, A., Porpora, K. (2009) http://www.nytimes.com/2009/05/02/nyregion/02elevator.html, May 1, 2009.
Cardin, S., Thalmann, D., Vexo, F. (2007) Wearable Obstacle Detection System for Visually Impaired People, The Visual Computer, Vol. 23(2), pp. 109-118.
Chen, S., Li, Y.F., Zhang, J. (2008), Vision Processing for Realtime 3D Data Acquisition Based on Coded Structured Light, IEEE Transactions on Image Processing, Vol. 17(2), pp. 167-176.
Coughlan, J., Manduchi, R., Shen, H. (2006) Cell Phone-based Wayfinding for the Visually Impaired. 1st Int. Workshop on Mobile Vision.
Elmogy, M., Habel, C., and Zhang, J. (2011) Multimodal Cognitive Interface for Robot Navigation, Cognitive Processing, Vol. 12(1), pp. 53-65.
Elshaw M., Weber C., Zochios A., Wermter S. (2004) An associator network approach to robot learning by imitation through vision, motor control and language. In Proc. IJCNN 2004, pp. 591-596.
Habel, C., & Acartürk, C. (2007) On reciprocal improvement in multimodal generation: Co-reference by text and information graphics. In I. van der Sluis, M. Theune, E. Reiter & E. Krahmer (Eds.), Proceedings of MOG 2007: The workshop on multimodal output generation, pp. 69-80. University of Aberdeen, UK.
Habel, C., & Acartürk, C. (2011) Causal inference in graph-text constellations: Designing verbally annotated graphs. Tsinghua Science and Technology, 16(1), pp. 7-12.
Kosslyn S M. (1989) Understanding charts and graphs. Applied Cognitive Psychology, 3(3), pp. 185-226.
Laurent, B., and Christian, T. (2007) A sonar system modeled after spatial hearing and echolocating bats for blind mobility aid, International Journal of Physical Sciences Vol. 2 (4), pp. 104-111.
Lohmann, K., Eschenbach, C. & Habel, C. (2011) Linking Spatial Haptic Perception to Linguistic Representations: Assisting Utterances for Tactile-Map Explorations. In M. Egenhofer, N. Guidice, R. Moratz & M. Worboys (eds.) Spatial Information Theory, 10th International Conference, COSIT 2011. pp. 328–349. Berlin: Springer-Verlag.
Lohmann, K. & Habel, C. (2012) Extended Verbal Assistance Facilitates Knowledge Acquisition of Virtual Tactile Maps. In: Schill, K., Stachniss, C., Uttal, D. (eds.). Spatial Cognition VIII. LNCS. Springer, Berlin.
Loomis, J., Klatzky, R., Lederman, S. (1991) Similarity of Tactual and Visual Picture Recognition with Limited Field of View. Perception 20, pp. 167-177.
Mäder, A., Bistry, H., and Zhang, J. (2008) Intelligent Vision Systems for Robotic Applications, International Journal of Information Acquisition, World Scientific Publishing Company, 2008, Vol. 5(3), pp. 259-267.
Murray J., Erwin H., Wermter S. (2009) Robotic sound-source localisation architecture using cross-correlation and recurrent neural networks. Neural Networks 22, pp. 173-189.
Panchev C., Wermter S. (2006) Temporal Sequence Detection with Spiking Neurons: Towards Recognizing Robot Language Instruction. Connection Science 18(1), pp. 1-22.
Pulvermüller F., Fadiga L. (2010) Active perception: Sensorimotor circuits as a cortical basis for language. Nature Review Neuroscience 11(5), pp. 351-360.
Ran, L., Helal, S., Moore, S., (2004) Drishti: An Integrated Indoor/Outdoor Blind Navigation System and Service, 2nd IEEE Int. Conf. on Pervasive Computing and Communications (PerCom'04).
Rizzolatti G., Arbib M. (1998) Language within our grasp. Trends in Neuroscience 21(5), pp.188-194.
Wang, Z., Li, B., Hedgpeth, T., Haven, T. (2009) Instant Tactile-audio Map: Enabling Access to Digital Maps for People with Visual Impairment. Proceedings of the 11th International ACM SIGACCESS Conference on Computers and Accessibility, pp. 43-50. ACM, Pittsburgh, PA.
Weber C., Wermter S., Elshaw M. (2005) A hybrid generative and predictive model of the motor cortex. Neural Networks 19(4), pp. 339-353.
Wermter S., Austin J., Willshaw D. (2001) Emergent Neural Architectures based on Neuroscience. Springer, Heidelberg.
Wermter S., Elshaw M. (2003) Learning robot actions based on self-organising language memory. Neural Networks 16(5-6), pp. 661-669.
Wermter S., Weber C., Elshaw M., Panchev C., Erwin H., Pulvermüller F. (2004) Towards multimodal neural network robot learning. Robotics and Autonomous Systems 47, pp. 171-175.
Wermter S., Weber C., Elshaw M., Gallese V., Pulvermüller F. (2005) Neural grounding of robot language in action. In: Biomimetic Neural Learning for Intelligent Robots, pp. 162-181. Springer.
Wermter S., Page M., Knowles M., Gallese V., Pulvermüller F., Taylor J. (2009) Multimodal communication in animals, humans and robots: An introduction to perspectives in brain-inspired informatics. Neural Networks 22(2), pp. 111-115.
Yan W., Weber C., Wermter S. (2011) A hybrid probabilistic neural model for person tracking based on a ceiling-mounted camera, Journal of Ambient Intelligence and Smart Environments 3(3), pp. 237-252.
Zhang, J., Xiao, J., Zhang, J., Zhang, H., and Chen, S. (2011) Integrate multi-modal cues for category-independent object detection and localization. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 801-806.