Seminar Talks

Seminar/Oberseminar: Knowledge Technology

The seminar is open to interested students and staff.

News/Aktuelles

In the semester the group seminar is scheduled for Tuesday 14:15. If you want to receive the latest talk announcements to be informed about ongoing work of the Knowledge Technology research group, please write an email to: mostafa.kotb@uni-hamburg.de with your current email address and a short statement. You will be then included in our mailing list.

General Information/Allgemeine Informationen
LV-Nummer:	64-487
Lecturer:	Prof. Stefan Wermter
Period:	2 UE / Wöchentlich Tue 14:00-16:00 / 2:15pm
Room:	D-220
Language:	English/Deutsch

Contents/Inhalte

Talks, presentations and discussion of recent results from new research projects. Topics include Bio-inspired learning algorithms, Neural Networks, Hybrid Knowledge Technology Systems, Data Mining and Human-Robot-Interaction.

Diskussion aktueller Ergebnisse aus laufenden Forschungsvorhaben und Drittmittelprojekten. Themen sind u.a. Bio-inspirierte Lernalgorithmen, Neuronale Netzwerke, Hybrid Knowledge Technology Systems, Data Mining, Mensch-Roboter-Interaktion.

Objective/Lernziel

Get to know the research topics in the Knowledge Technology Group. Discussion, planning and immersion of current research projects. Possibility for choosing interesting studies in the Knowledge Technology Labs or Diploma-, Master Thesis topics.

Kennenlernen der Forschungsthemen der WTM Arbeitsgruppe. Diskussion, Planung und Vertiefung aktueller Forschungsprojekte. Gelegenheit für die Wahl interessanter Studien-, Diplom-, Masterarbeitsthemen.

Procedure/Vorgehen

Contributions and discussions of participants / Beiträge und Diskussionen der Teilnehmer:

Winter semester 24/25

Winter Semester 2024/2025
DATE	TITLE	NAME
ABSTRACT.
05.11	Benchmarking and Improving Inverse Kinematics for Endoscopic Inspection Tasks	Timon Engelke
In the context of robotic automation, inspection tasks are becoming increasingly important. In endoscopic inspections, a robot has to enter into a hollow cavity, such as a fuel tank. For motion planning within these constrained environments, inverse kinematics (IK) plays a crucial role. This thesis analyzes the deficiencies of current IK solvers in these environments. For this purpose, a benchmarking software and a benchmarking dataset are created. Based on two state-of-the-art solvers, BioIK and RelaxedIK, several solver improvements with a focus on endoscopic inspections are implemented and evaluated. The results show that directly including collision checking into the solvers is not feasible with the currently integrated collision detection library. The solve rates were significantly increased by implementing cost functions that instead make use of geometric properties of the problem. Additional analyses of the influence of optimization solvers and solver parameters reveal little room for improvement, while adapting the seed state shows greater potential.
22.10	Open-Vocabulary Robotic Object Manipulation using Foundation Models	Stig Griebenow
Classical vision-language-action models primarily rely on unidirectional commu- nication, limiting their capacity to interact naturally with humans. Recent ad- vancements in cross-modal AI aim to address this limitation by building models which also output language. One such model, CrossT5, successfully leverages a pre-trained large language model to perform language-based tasks, but it lacks visual generalization capabilities which restrict it to perform actions only on ob- jects seen during training. To address these limitations, this research proposes the OWL×T5 model, which integrates the OWLv2 open-vocabulary object detection model into the CrossT5 framework. The objective is to enable zero-shot general- ization of robot actions to novel objects. We train OWL×T5 on a small dataset generated in a simulated environment using the NICO humanoid robot. For evalu- ation, we introduce a new dataset, CLAEO, featuring everyday object interactions. Results indicate that the model successfully generalizes actions to new objects not seen during training, while preserving the original language-based capabilities and data efficiency of the CrossT5 architecture.
15.10	Analysing the Impact of Prompt Variability on Distilled Knowledge in Anomaly Segmentation	Daniela Winter
In the domain of marine bird segmentation, there is a lack of datasets with annotated bird segmentation masks using aerial images, and obtaining such annotated datasets is both, time-consuming and costly. To address this problem, this work presents a novel approach to marine bird segmentation in aerial images by utilizing a foundation model for anomaly detection called Segment Any Anomaly (SAA). SAA demonstrates very good zero-shot performance in detecting marine birds in aerial images because the task is similar to anomaly segmentation. We use SAA's predictions as teacher soft labels in a knowledge distillation process to train a more compact student model. This smaller student model can generate predictions more quickly and requires less memory and computational power. SAA requires some domain expert knowledge in the form of prompts to generalize well across different areas of anomaly segmentation. We explore how varying these prompts influences the performance of the student model. Through systematic experimentation, we adjust these prompts to optimize the student model's segmentation performance. Compared to standard methods, our approach improves training efficiency and achieves superior segmentation quality without the need for any annotated data. This is a significant advancement for automatic marine bird segmentation, given the lack of large public annotated aerial marine bird segmentation datasets. Moreover, this method of leveraging foundation models for specific tasks can be generalized to other areas, highlighting the potential of adaptive prompting in foundation models..
01.10	Interactive imitation learning through feedback from large language models	Jonas Wener
Recent advancements in Machine Learning provide tools to train autonomous agents capable of handling the increasing complexity of sequential decision-making in robotics. Imitation Learning (IL) is a prominent approach, where agents learn to control robots based on human-provided demonstrations. However, IL normally suffers from compounding errors from data mismatches, where the learned agent faces situations not covered in the demonstrations. Interactive Imitation Learning (IIL) improves IL on these issues by allowing agents to learn through interactive feedback from human teachers, improving its performance. Despite these improvements, both approaches still invoke significant costs due to the necessity of human involvement. Leveraging the emergent capabilities of Large Language Models (LLMs) in reasoning and generating human-like responses, this research introduces LLM-iTeach --- a novel IIL framework that utilizes an LLM as an interactive teacher to enhance agent performance while alleviating the dependence on human resources. LLM-iTeach uses a hierarchical prompting strategy to guide the LLM in generating a decision-making policy in Python code. This allows the LLM to provide corrective and evaluative feedback interactively. We evaluate LLM-iTeach against baseline methods, such as Behavior Cloning an IL method and CEILing, a state-of-the-art IIL method, on various robotic manipulation tasks. Our results demonstrate that LLM-iTeach surpasses Behavior Cloning in the success rate and matches that of CEILing. These findings highlight the potential of LLMs as cost-effective, human-like teachers in interactive learning environments, suggesting a promising direction for training autonomous agents.

Summer Semester 2024

Summer Semester 2024
24.09	Efficient Supernet Training with Informed Initialization	Emad Aghajanzadeh
The growing range of deep learning applications raises the importance of efficient deployment strategies to optimize accuracy across different hardware and latency conditions. Manual design and conventional neural architecture search (NAS) cannot scale well with the increasing number of deployment scenario. This is because they need to perform the training and search from scratch for each of scenarios. Once-for-all (OFA) training was proposed to reduce the NAS cost by decoupling the expensive training part from the more affordable search phase, and do the training only once. Although this idea significantly reduced the overall NAS time, still the training time to obtain a well-trained supernet is prohibitive for many use-cases. We propose to alleviate this problem by taking advantage of pre-trained models to make the training converge faster. This pre-trained model should effectively transfer its knowledge to the other networks while preserving its competence, which is challenging due to the weight-sharing nature of supernet training. In this work, we leverage neural adaptation methods to transfer the knowledge from one network to the other on-the-fly and before the training phase. As a result, the overall supernet training remains unchanged and only an initialization step is added. Our experiments on CIFAR100 and ImageNet showed 2$\times$ faster convergence and a 2.5\% accuracy improvement during full training. These results highlight how effectively integrating a pre-trained model into the supernet accelerates convergence and enhances performance.
10.09	From Synthetic Data to Real-World HRI: Leveraging 3D Vision for Humanoid Robotic Perception	Lukáš Gajdošech
This talk presents recent work at the intersection of Computer Vision, 3D perception, and Human-Robot Interaction (HRI). The author’s prior research has focused on object detection, classification, and localization in industrial settings. Due to the scarcity of available data from such environments, these problems were tackled by training models on synthetic data. However, this approach encountered significant limitations when applied to transparent objects like glass containers. The complexity of photorealistically rendering transparent materials led to poor generalization in real-world scenarios. Building on this experience, the current project shifts focus toward real-world data capture in the context of Human-Robot Interaction (HRI). The author collaborates with a robotics team to enhance a humanoid robot’s ability to collaborate with humans in the task of pouring liquids into transparent containers. This task presents unique challenges in perception, scene understanding, and real-time interaction. Additionally, the design of the data-capturing procedure enabling further research in different areas of computer vision will be discussed. These include advances in Neural Radiance Fields (NeRFs) evaluation, surface fusion, depth estimation, and transparent object detection..
10.09	Details Make a Difference: Object State-Sensitive Neurorobotic Task Planning	Xiaowen Sun
The state of an object reflects its current status or condition and is important for a robot's task planning and manipulation. However, detecting an object's state and generating a state-sensitive plan for robots is challenging. Recently, pre-trained Large Language Models (LLMs) and Vision-Language Models (VLMs) have shown impressive capabilities in generating plans. However, to the best of our knowledge, there is hardly any investigation on whether LLMs or VLMs can also generate object state-sensitive plans. To study this, we introduce an Object State-Sensitive Agent (OSSA), a task-planning agent empowered by pre-trained neural networks. We propose two methods for OSSA: (i) a modular model consisting of a pre-trained vision processing module (dense captioning model, DCM) and a natural language processing model (LLM), and (ii) a monolithic model consisting only of a VLM. To quantitatively evaluate the performances of the two methods, we use tabletop scenarios where the task is to clear the table. We contribute a multimodal benchmark dataset that takes object states into consideration. Our results show that both methods can be used for object state-sensitive tasks, but the monolithic approach outperforms the modular approach..
03.09	LeoLM: Evaluating and Enhancing German Language Proficiency in Pretrained Large Language Models	Björn Plüster
This thesis investigates the effectiveness of continued pre-training in enhancing the German language capabilities of large language models (LLMs) initially trained predominantly on English data. We introduce LeoLM, a series of models derived from the Llama-2 family, adapted through continued pre-training on a large corpus of high-quality German texts. Our research addresses four key questions: the improvement of German language capabilities, effective evaluation methodologies for non-English LLMs, mitigation of English-centric bias, and the influence of model size on these aspects. Through comprehensive experiments and evaluations, we demonstrate significant improvements in German language tasks across various benchmarks, including MMLU-DE, ARC-DE, and HellaSwag-DE. Our models show enhanced performance in German-specific knowledge, definition generation, and translation tasks. Notably, we observe a clear shift in cultural frame of reference towards German, as evidenced by our novel Cultural Bias Evaluation. The study reveals that larger models (13B and 70B parameters) exhibit more balanced improvements across both German and English tasks, suggesting better retention of cross-lingual knowledge. Our approach proves effective in mitigating English-centric bias while largely maintaining English language capabilities, especially in larger models. This research contributes to the field of multilingual AI by demonstrating an efficient method for adapting LLMs to new languages without training from scratch. Additionally, all trained models, including base models and chat-tuned variants, have been made publicly available on the HuggingFace platform (https://huggingface.co/LeoLM), facilitating further research and practical applications in German language AI.
27.08	Compositional Generalization in Transformers	Imran Ibrahim
MSc Spot talk.
27.08	Explaining the Decision Process in Object Recognition Tasks using Natural Language Annotations	Christoph Wessarges
Traditional image annotation tools typically focus on the classification accuracy of labeling objects within an image, which has been sufficient as training data for object recognition models. However, such models hide the reasons for the labeling decision, which would be of great value for the transparency and credibility of a model’s results. Annotation tools lack the functionality to enable annotators to provide insights into their decision process. However, a supplementary explanation may highlight important details about the visible concepts of an object entity and which concepts are visible in a specific given image. Naturally, not every image depicts all the distinguishing features of that object’s class. Therefore, these explanations represent a relevant knowledge source about the descriptional features of a given image. Mining a large amount of explanations can give interesting insights into the dataset and its samples, like summarizing the features that are typically present in most instances of a particular class. These explanations can be further processed by applying text annotation methods. AI systems may profit from such to improve the expressiveness of its training features. This research pursues the question how to effectively design an annotation tool that supports the inclusion of explanations and their processing, and how such a tool can facilitate learning of mapping an image and a set of concepts to its correct class label to enable AI models to self-generate an explanation for their decision-making process. This work comprises of two main contributions: 1) An annotation system prototype is presented, that demands annotators to supply an additional formulated explanation, in order to describe the visible queues of an image, which represents a particular object that ought to be annotated. 2) The tool integration of an exemplary XAI strategy to analyse and process such explanation annotations in order to extract the key concept phrases automatically.
20.08	2D Object Detection and Segmentation on NICOL Robot	Laura Jouvet
Object detection has become indispensable in many fields of computer vision like autonomous vehicles, security or augmented reality. It is also one of the most important parts of the NICOL robot, as it is the starting point for almost all its actions, such as grasping objects or answering questions asked by humans about its environment. In this research, we compare the well-known object detectors YOLO-World, OWLv2 and ViLD on our own dataset to determine which is the most effective with NICOL robot. Then, as one of the most widely used methods for object detection is image segmentation, we try to add the Segment Anything Model (SAM) to our architecture in order to improve our detections.
20.08	Out-of-distribution Detection in Computer Vision based on Deep Metric Learning	Swetlana Shaban
This thesis develops and discusses Deep Metric Learning (DML) approaches for the problem of Out-Of-Distribution (OOD) detection. Our experimental studies focus on two benchmark datasets, CIFAR-10 and CIFAR-100. We implemented and trained both DML and pure classification models of varying sizes to ensure that DML-based neural network reaches the same quality in classification task. Subsequently, we implemented and compared two OOD detection algorithms. The results show that the ResNet-18-based DML model performs similar to its deeper modification ResNet-50 on CIFAR-10 dataset (84% vs. 89% accuracy). The performance difference was more significant on CIFAR-100 dataset (54% vs. 65% accuracy). Our OOD detection algorithms showed promising results, achieving AUROC scores of 79.5% for CIFAR-10 as In-Distribution and CIFAR-100 as Out-Of-Distribution, and 70.4% for CIFAR-100 as In-Distribution and CIFAR-10 as Out-Of-Distribution. Throughout the thesis we investigate the reasons for misclassifications made by algorithms. As a result, we raise a question which classes should one consider as OOD. Additionally, we observed that models tend to learn simple features sufficient to distinguish In-Distribution classes, leading to misclassification of OOD images based on background or object position..
13.08	Human Object Pointing Matching via Transformer-Based Attention	Luca Mueller
BSc Spot talk.
13.08	Extending NICO's Reach through Torso Movement	Hendrik Pfennig
We perform a simulated robotic tabletop grasping task on the humanoid robot NICO. The key innovation of this work is the addition of its hip joint to the kinematic chain, enabling more flexibility and an extended reach. We synthesize a grasping dataset from scratch using Inverse Kinematics and leverage Stereo Vision as well as proprioceptive data to predict full body poses in a supervised setting. To provide a stable and precise gradient descent, we employ a composed loss function to output an excess of information to ensure precise internal state representations. The simulation environment CoppeliaSim is used for both recording and evaluation. We successfully fit a model to solve the proposed object grasping task, showing feasibility of the extended kinematic chain. Not only do we increase the area of reachable object positions by 45%, we also show an increase in performance for short range grasping. To closely examine the architecture we propose, we conduct an ablation study giving further insights into the contributions of its parts.
06.08	Human Navigation Driven: Modeling Visual Localization with Cognitive Graphs and Local Scenes	Eric Bergter
We present a novel approach to create a modular nested topological model that mimics two essential human navigation strategies to process complex environments by simplifying them. The first strategy, the local scene approach, breaks down visually observed spaces into smaller, localized scenes. Within these scenes, features or landmarks serve as anchors to the internal model. The second strategy, the cognitive graph, connects the local scenes with each other using labeled graph-based topology. This internal representation treats each location as its own virtual space, where objects are connected by their relative size and position rather than absolute units. Graphs can be freely linked and stored inside other nodes. By nesting graphs it is possible to store multiple models side by side without interference, eliminating the need for a global error correction mechanism. This concept is designed to be a basic framework that can advance our understanding of the perception and navigation to handle large complex and dynamic environments. The resulting prototype can construct graphs and do localization within them. As a result of the graph creation process a new type of dataset called TileWorld is presented which is specialized for learning spacial structures from 2D images.
30.07	LatentTrace: Using the latent space of an encoded anomalous image to retrace the target image	Leon Koch
In the context of industrial processes, various neural networks (NN) can be trained and applied for anomaly detection. But AI does not always uphold the necessary standards of transparency and explainability necessary for the respective industry domain. In this research, we propose LatentTrace, a method that uses a neural network to encode an image containing an anomaly and retrace the corresponding anomaly-free image with the same background from an existing dataset. This allows us to calculate the difference between two images and thus extract the anomaly. To encode the images, we use an autoencoder trained on an unlabeled image dataset from the aviation domain. In this manner, we do use AI, but the output was retraced, not created by it, which helps partly bypass the problem of the inherent lack of transparency of AI and therefore increase user trust in this AI-powered anomaly detection method.
18.06	Progressing from gesture recognition to the transcription of sign languages	Lukas Braach
This thesis explores the applicability of transformer architectures, specifically Video-Vision-Transformers (ViViTs), in sign language recognition. Leveraging advancements in Gesture Recognition, the study aims to establish a performance baseline using the RWTH Phoenix 2014 dataset and assess pre-training knowledge transfer to the sign language recognition application field. Methodologically, a ViViT architecture, incorporating the recent DINOv2 model for spatial vision feature extraction, is proposed. A key point in the study is to measure the impact of recent Machine Learning breakthroughs—like ViViTs and unified pre-training—to the downstream task performance. By comparing the custom encoder architecture with a pre-trained VideoMAE in an ablation study, the efficacy of different approaches to sign language recognition is evaluated.
11.06	Evaluation of a Human-Like Artificial Intelligence Coach for Indoor Putting Practice	Berk Gungor
The advancements in open-source large language models have demonstrated the feasibility of executing relatively small-scale Large Language Models(LLMs) on personal computing devices while achieving satisfactory outcomes. With the help of the latest improvements in Generative AI, this study examines the role of LLMs in providing assistance in routine tasks, with a specific focus on indoor putting practice. We explored how generative AI can contribute significantly to proficiency in sports training and the study uses a human-like AI assistant integrated with innovative technologies such as speech-to-text and text-to-speech to emulate human coaching and offers insights into the potential of AI as a personal assistant. The research provides an in-depth evaluation of the AI-enabled tools using large language models, demonstrating their benefits in improving golfers’ performance. Consequently, the main focus of this thesis is not only to provide a cutting-edge tool for golfers seeking to improve their putting skills but also to contribute significantly to the intersection of AI and sports training.
28.05	Object detection with the Segment-Anything Model on NICOL robot	Laura Jouvet
Spot talk.
28.05	Explaining Continual Learning with Concepts	Priscilla Cortese
Spot talk.
14.05	Zero-Hero: Contextual Skill Discovery with Large Language Models	Xufeng Zhao
Language-conditioned robotic skills have become the bridge between high-level Large Language Models (LLMs) reasoning to low-level robotic control. Existing works aim to collect diverse fundamental skills at the beginning manually, either top-down to decompose a complex task into atomic robotic actions or bottom-up to bootstrap as many combinations as possible to cover a wider range of task possibilities. However, these decompositions or combinations are restricted to the initial skill library. For example, a ``grasping'' capability can never emerge from a skill library containing only diverse ``pushing'' objects. Traditional skill discovery with reinforcement learning discovers new skills but often exhaustively results in non-meaningful behaviors. In this work, we propose a fully automatic skill discovery framework, Zero-Hero, to acquire semantic skills from zero to hero, in which an LLM is utilized to continually propose new contextual tasks, as well as corresponding success determination functions and reward functions for a reinforcement learning agent to optimize with.
07.05	Domain Adaption as Auxiliary Task for Sim-to-Real Transfer in Vision-based Neuro-Robotic Control	Connor Gäde
Architectures for vision-based robot manipulation often utilize separate domain adaption models to allow sim-to-real transfer and an inverse kinematics solver to allow the actual policy to operate in Cartesian space. We present a novel end-to-end visuomotor architecture that combines domain adaption and inherent inverse kinematics in one model. Using the same latent encoding, it jointly learns to reconstruct canonical simulation images from randomized inputs and to predict the corresponding joint angles that minimize the Cartesian error towards a depicted target object via differentiable forward kinematics. We evaluate our model in a sim-to-real grasping experiment with the NICO humanoid robot by comparing different randomization and adaption conditions both directly and with additional real-world finetuning. Our combined method significantly increases the resulting accuracy and allows a finetuned model to reach a success rate of 80.30%, outperforming a real-world model trained with six times as much real data..
23.04	Using a Pre-Trained Open-Vocabulary Object Detection Model for a Cross-Modal Action-Language Model	Stig Griebenow
MSc Spot talk.
23.04	Aiding Sim-to-Real Robot Sequence Detection using Domain Randomization	Mattes Repnak
For machine learning, one of the most important processes is data generation or collection, which are time consuming. To cut down on this time, the technique known as sim-to-real transfer tries to train models on simulated data. This is helpful for tasks in the robotics domain, such as the recognition of simple objects. While the method has shown success, models trained on simulated data do not perform as well as models trained on the proper real world data. In an effort to overcome this gap, we examine a few methods. First, we investigate if the masking method improves the accuracy. This method works by using masks that tell the model what parts of the output to calculate the loss on. The hypothesis here is that the model may disentangle concepts and more clearly learn different concepts. Further, we look at the contrastive loss method, which uses multiple samples to contrast against each other for loss calculations. The method of contrastive loss has been shown to increase the accuracy of models and has yet to be used with domain randomization. Lastly, we examine the potential of not using only simulated data but mixing it with real world data for training to achieve a greater accuracy. The former two methods show small success, and both are able to perform better than the baseline but only by a small margin. The mixing method however shows promise. For the dataset mixing, we can see the model trained with a ratio of around 25% real world data and 75% simulated data achieves slightly better results in terms of overall and action accuracy above the model trained on only the real world data..
16.04	LeoLM: Evaluating and Enhancing German Language Proficiency in Pretrained Large Langauge Models	Björn Plüster
The release of Meta's Llama Models marked a significant milestone for the open-source language model community, providing access to powerful and practical large language models (LLMs) that can be used on consumer hardware. However, initial attempts to fine-tune these models on German resulted in models that produced simplistic German output and often reverted back to English due to the predominantly English pretraining data. To address these limitations and enhance German language proficiency, we perform large-scale continual pretraining on the suite of Llama-2 models on a corpus of 65 billion tokens of high-quality German texts. We make these pretrained models publicly available to the open-source community. Furthermore, we conduct a comprehensive evaluation of the models using a diverse set of benchmark tasks, including translated versions of common benchmarks and newly introduced German-specific benchmarks. Finally, we aim to assess the extent to which the continual pretraining improves the models' German language capabilities, investigate the effectiveness of shifting the models' US-centric perspective towards a German viewpoint, and explore the impact of tokenizer extension on downstream performance.".
16.04	Inverse Kinematics for Neuro-Robotic Grasping with Humanoid Embodied Agents	Jan-Gerrit
We extend our priorly proposed inverse kinematics (IK) method CycleIK, which was introduced for the NICOL robot, to various robot manipulator designs. The neuro-inspired IK method is utilized by a Bezier curve-based motion planner that allows the planning of trajectories directly in Cartesian space. The motion planner is deployed to the physical robot hardware of NICO and NICOL with an embodied agent application that enables a human-in-the-loop grasping scenario. We compare CycleIK to traditional numerical IK methods and state-of-the-art neural IK methods on five different robot platforms. In addition, the success rate for 100 grasp attempts on each robot is measured..
09.04	Cartesian Loss in End-to-End Neuro-Robotic Grasping	Thore Schönefeld
BSc Spot talk.
09.04	Imitation learning for robots	MSc group
Imitation Learning is a technique that aims to learn a policy that mimics the behavior of the expert from the given samples. The samples are shown in the form of observations i.e. current state and actions. One of the significant advantages of the approach is that unlike reinforcement learning it does not require a reward function and hence can be used in cases with sparse rewards or rather tedious real-world scenarios containing multiple factors to consider. But just like with every other scientific venture, Imitation Learning also has its challenges such as data collection for critical domains, low success rate in long-horizon manipulation tasks, incapability in generalization of tasks & inability to work in an environment with disturbances. This paper walks through the history of imitation learning, brings the most commonly adopted imitation learning techniques in robotic scenarios, and presents multiple pairwise and common comparisons among them in a well-structured tabulated format, all while discussing the future scope and the most promising imitation learning solutions to the problems in robot domain in recent days..

Winter Semester 2023/2024

Winter Semester 2023/2024
26.03.24	Impact of a Robot's Agreeableness on a User's Likeability using LLMs	MSc group
HRI Project.
27.02.24	Let It Grow: Modular Growing Neural Networks for Continual Learning	Daniel Speck
State-of-the-art models are often of high complexity, and thus, require a large number of parameters. Training such models is computationally expensive, requires large datasets as well as finetuning of hyperparameters, which is especially challenging if new tasks, new data or new domains are encountered and have to be learned. Since real-world applications often require adapting models to new tasks, data or domains, large models are not always a feasible solution. Models that start with a small set of parameters and grow as needed are a promising and biologically plausible approach to this problem. Moreover, real-world applications and adaptations to new tasks, data or domains emphasizes the need for continual learning, i.e., the ability to learn new tasks without forgetting previously learned ones and adapting existing knowledge to new tasks efficiently. There are real-world tasks that, by nature, require continual learning, e.g., when there are domain shifts in the input. In this work, we propose “GrowBlockNet”, a modular growing network and growing policy to grow the network as needed. Our network starts with a small number of parameters (1.2 million) and grows using a growing policy to evaluate model performance and grow when the network capacity seems to be insufficient, instead of using a large model that might overfit. The growing policy is using an exponential moving average based on the per-batch training loss, which is a novel approach and significant step towards autonomous growing and real-world applications, as our approach does not require any context or domain information. We evaluate our approach using the CelebA-HQ dataset as a multilabel classification task in a continual learning scenario: we compare our model in a domain-incremental learning scenario as well as baselines on the entire dataset. Our results show that our approach is able to outperform the state-of-the-art model EfficientNetV2 and EdgeNeXt in accuracy and deliver comparable results in F1 Score. Furthermore, we show that our approach is capable of dealing with domain shifts without any context or domain information and catastrophic forgetting is less severe than in EfficientNetV2. Apart from that, we also show the efficacy and modularity of our approach by using the same dataset in a generative task, namely image-to-image translation, where we set up a generative adversarial network (GAN) to translate images from one domain to another. Both networks, generator and discriminator, are using our growing models.
20.02.24	Accelerating Interactive Robot Learning through Feedback from Large Language Models	Jonas Werner
Reinforcement Learning (RL) has emerged as a powerful technique for decision-making tasks. Its reliance on long training periods has been addressed by various approaches, one of which is the exploitation of prior knowledge. This usually relies on humans providing demonstrations and/or guiding the learning phase, called Imitation Learning. Whilst this is effective in drastically accelerating learning, the requirement of human resources is still costly. This Master's thesis will research how this acceleration can also be achieved by guiding the learning phase with a Large Language Model (LLM). It will be examined to what extent its ability to provide evaluative and corrective feedback improves learning. The practical part and the experiments will be based on the CEILing architecture which uses a human to provide feedback interactively during the learning phase of an RL Agent. Using the Python library RLBench a series of experiments will be simulated with a Panda-Robot-Arm and their success rates will be compared with a baseline of behavior cloning.
13.02.24	Human-Robot Intention Understanding Using Non-verbal Communication	Hassan Ali
The interaction between humans and robots is expected to be natural and intuitive, hence resembling human-human interaction rather than just relying on explicit commands. Intention understanding is fundamental in Human-Robot Interaction (HRI) and involves perceiving subtle cues from human actions and gestures to infer their underlying goals, therefore, it is key for enabling effective and seamless communication. Nevertheless, interpreting human intentions is challenging since human behavior is complex and context-related. Additionally, human intentions are often conveyed with non-verbal cues, which are challenging to decipher. Recent research with Large Language Models (LLMs) in robotics have showcased their reasoning capabilities, which can go beyond mere language applications. In this research, we explore the possibility of utilizing LLMs as an intent reasoning framework for a collaborative task using objects between a user and the NICOL robot. We also look into improving NICOL’s understanding of non-verbal cues, especially hand gestures and body language.
6.02.24	Interpreting speech processing systems for goal-driven agents	Julia Gachot

30.01.24	Factors for Fine Geometry Representation of Neural Fields for Semantic Segmentation	Thilo Fryen
Recently, conditional neural fields for semantic segmentation of images have been introduced. They provide an interesting alternative to CNNs and transformers for this task. This thesis is a systematic analysis of different factors of the proposed neural field architecture that influence the fidelity of the produced segmentation masks. Multiple factors have been identified and investigated: The type and dimension of the pixel coordinate encoding, the conditioning strategy, the depth of the encoder, the feature map size, and the architecture of the neural field decoder. Furthermore, I explore the introduction of skip-connections between encoder and decoder. The extensive experiments lead to several findings about the factors and the resulting neural field architecture achieves an improvement of 1.4% IoU over previous results. However, for some factors, no clear guidelines can be derived. I attribute this to the characteristics of the Potsdam dataset used in the experiments. Therefore, I additionally conduct an ablation study with the GTA5 dataset.
23.01.24	Partial Model Merging	Korbinian Koch
It is commonly known that ensembling two models trained on the same task often outperforms each model individually, while doubling the inference cost. A novel stream of research on model merging explores whether it is possible to combine multiple models into one by interpolating their weights, and could serve as a more efficient alternative to ensembles. It is already viable to interpolate models fine-tuned from a shared base model with the same performance increase as when they are ensembled. However, the interpolation of models that were trained independently or on different datasets still poses a significant challenge, and results in merged models with a higher loss and lower accuracy that is only avoidable when the models are prohibitively wide. In this thesis we postulate that model merging and ensembling represent two extremes of a spectrum by either enforcing a complete overlap or keeping all parameters separate. We evaluate methods that are situated in between, overlapping and interpolating the parameters of both models just in parts, while keeping the remaining parameters unchanged. We demonstrate the ability of such partial model merging to gracefully eliminate any existing barriers, with loss and accuracy approaching those of ensembling as the width increase approaches 100%. We observe this gradual improvement across all architectures and model dimensions, even if the endpoints were trained on disjoint or biased datasets. We also show that specific layers of the models have a significantly higher contribution towards the occurring barriers, and that reducing their overlap first allows us to reduce the barriers quicker. Contrary to previous beliefs these layers cannot be identified just by looking at the average correlations between units. Using a simple baseline method inspired by our findings, we are able to achieve zero-barrier connectivity with respect to both test loss and test accuracy between two regular-width VGG11s independently trained on CIFAR10 and SVHN with a parameter increase of just 22% and 24% respectively, and without any additional training after merging. Similar interpolation-based connectivity on independently trained VGGs has never been achieved experimentally using full merging, even for high width multipliers.
23.01.24	Out-Of-Distribution Detection in Computer Vision based on Deep Metric Learning	Swetlana Shaban
This thesis considers the application of Deep Metric Learning (DML) techniques to the problem of Out-Of-Distribution (OOD) detection. The study aims at finding a small but efficient deep neural network in contrast to papers published by the community. The purpose of this network is OOD detection and multi-class classification. In the scope of this work, the computational experiments will be carried out on CIFAR10 and CIFAR100 datasets. This will involve the usage of state-of-the-art DML methods as well as a study of the influence of various algorithm parameters on the experiment result. I expect to reduce the model and embedding size significantly while preserving competitive classification accuracy and OOD detection quality. In conclusion, I plan to compare different neural networks created during the research and ones presented in scientific papers.
16.01.24	Enabling Action Crossmodality for a Pretrained Large Language Model	Anton Caesar
Recently, the Transformer architecture and large scale datasets have brought huge improvements to natural language processing and vision tasks. However, action and bidirectional action-language tasks are less developed, as these require more specific and labelled data. The aim of this project is to enable these robotic action capabilities for a pretrained LLM by combining it with a crossmodal architecture. Specifically, I split up a pretrained T5 LLM between its encoder and decoder and insert the Crossmodal Transformer component of the PTAE bidirectional action-language model. I then train the new model, called CrossT5, on a new dataset, consisting of unimodal language translation and crossmodal bidirectional action-language translation. The results show that it quickly reestablishes the capabilities of the original T5, requiring 5.7 million times less data than the original training. Furthermore, CrossT5 achieves high accuracy for the crossmodal PTAE language-action tasks, and leverages T5's natural language capabilities to act more robustly when tested with language commands not included in the dataset. The results demonstrate that the approach is successful and efficient in extending an LLM with the low-level robotic control skills of vision-action models.
9.01.24	Knowledge Transfer: Progressing from Gesture Recognition to the Transcription of Sign Languages	Lukas Braach
This thesis explores the applicability of transformer architectures, specifically Video-Vision-Transformers (ViViTs), in Sign Language Transcription. Leveraging advancements in Gesture Recognition, the study aims to establish a performance baseline using the RWTH Phoenix 2014 dataset and assess pre-training knowledge transfer to the Sign Language Transcription application field. Methodologically, a ViViT architecture, incorporating the recent DINOv2 model for spatial vision feature extraction, is proposed. A key point in the study is to measure the impact of recent Machine Learning breakthroughs—like ViViTs and unified pre-training—to the downstream task performance by performing ablation studies. We expect to match recent benchmark results while additionally providing rich resources for other researchers: Model checkpoints, source code, and datasets will be published to the HuggingFace Hub, ultimately accelerating further Sign Language Transcription ML research.
9.01.24	Imitating Humans with NICOL	Josua Spisak
Abstract: How can imitation learning work, when a robot has to imitate a human? Significant discrepancies exist between humans and robots in both body and mind. In this research talk we will explore how to overcome or circumvent these challenges. The talk will go over the basic mechanisms found in biology and how we can use machine learning to copy them. Our approach uses Action segmentation and object detection models to analyses the demonstrations and then replicates them with he robot through symbolic reasoning. This talk will focus on discussing the strengths and limitations of this approach and future ideas on how to improve it.
9.01.24	LatentTrace: Using the latent space of an encoded anomalous image to retrace the target image	Leon Koch
In the context of industrial processes, various Artificial intelligence models can be trained and applied for anomaly detection. But AI does not always uphold the necessary standards of transparency and explainability necessary for the respective industry domain. In this research, I propose a method I call "latentTrace". I use AI to encode the anomalous images and retrace the corresponding image without anomaly from an already existing dataset. To encode the images, I use an autoencoder trained on an unlabeled image dataset from the aviation domain. Like this, I do use AI, but the output was retraced and not created by it, which bypasses the problem of the inherent intransparency of AI. This paper analyzes how well this method works by looking at different autoencoder architectures, sizes of anomalies and comparison methods for the comparison of the latent spaces.
19.12.23	A toolbox for running machine learning experiments	Martin Gromniak
In every new machine learning project, there are recurring tasks - from structuring datasets to hyperparameter tuning. A workflow which decouples these engineering tasks from actual research promises efficient work. In this seminar, I will present a toolbox for running machine learning experiments across different projects which targets this goal. The toolbox is build around open source libraries, particularly Pytorch Lightning and Hydra. I will also demonstrate the associated workflow live.
19.12.23	Improved Active Speaker Detection Using Human Pose Estimation	Augustine Ekweariri
Active speaker detection refers to the task of identifying and distinguishing the primary speaker(s) in a given video. Recent techniques, which are based on audio-visual fusion, were done by analysing the facial patterns of each detected person in a given video frame. This technique may lead to poor performance in situations with varied illumination, low resolution, occlusion or non-speaking facial motions. This thesis presents a novel approach for Active Speaker Detection (ASD) which integrates human pose information and cross-attention mechanisms to enhance the accuracy of active speaker detection models. The research leverages pose keypoints obtained from MMPose libraries, allowing us to capture upper body movements and non-verbal cues of potential active speakers. We then use cross-attention mechanisms, inspired by TalkNet, to dynamically align audio and visual modalities, enabling our model to synchronise the temporal dependencies between these modalities. A series of experiments were conducted to evaluate our models with baseline and well-performing models like ASC, TalkNet, FaVoA, and OpticalFlow on the AVA-ActiveSpeaker dataset. The results demonstrate that the integration of pose features and cross-attention mechanism significantly improves our model's performance. Our model, named AVP, which fuses pose features into audio-visual embedding and implements the cross-attention mechanism, displayed a remarkable performance and outperformed the baseline model by 0.5% in terms of mean average precision (mAP). Through ablation studies, we explored the influence of context size on the effectiveness of our method. Our findings suggest that the number of speakers in the context significantly impacts the model's ability to capture the dynamics of inter-speaker interactions. We observed that increasing the number of speakers in the context leads to an improved ASD model. We examined the effect of each modality (audio, visual and pose). We found that our model performed better as we fussed more modalities and achieved the best results when all the modalities were fussed.
12.12.23	2D to 3D Human Pose Lifting with Backbone Feature Maps	Klaas Degkwitz
3D human pose estimation from monocular image data is often done in a 2D to 3D lifting approach. The 2D to 3D lifting approach leverages the performance of state-of-the-art 2D human pose detection models by first detecting the 2D pose from the input image to then lift the 2D output into 3D space. However, 2D to 3D lifting from 2D pose data alone is an ill-posed problem since there exists an infinite number of 3D poses that could be projected onto a given 2D pose. Possible depth hints contained in the input image that could solve ambiguous scenarios are discarded in the first step of the 2D to 3D lifting approach. This thesis investigates if the performance of a state-of-the-art 2D to 3D lifting model can be improved by feeding additional information to the model that might be contained in the input image. It can be demonstrated that additional input in the form of extracted and encoded feature maps from the backbone of the 2D detection model can improve the performance of the 2D to 3D lifting model for data from the Human3.6M dataset.
12.12.23	Explaining Robot Behavior Using Concept Bottleneck Models	Sergio Lanza
The integration of the Large Language model (LLM) into robots can enhance their speech proficiencies and language skills. However, these models are well known as black box models, i.e. no clear information on their internal reasoning. One potential solution is to use the Concept Bottleneck model (CBM), a popular approach to extracting information from neural networks. CBM consists of adding an extra hidden neuron layer at the top of the network in which each node corresponds to a human concept. So the output of the model can be interpreted as a linear combination of human-related concepts. Despite their success, these models have not yet been tested on LLM, but only for vision tasks. In this research, I will present my proposal for applying CBM to an LLM fine-tuned Visual Questioning Answer dataset. I will describe some of the implementation details and also some future problems that I may encounter and some possible solutions. The final objective of this work is to demonstrate that CBM can be useful for text and text-visual applications and show its advantages in a robotic context to achieve an explainable robot. Indeed, in the last phase of this application, this new LLM will be applied to the Pepper Robot.
05.12.23	Evaluation of a Human-Like Artificial Intelligence Coach for Indoor Putting Practice	Berk Gungor
Artificial intelligence applications in the sports industry are getting more popular based on sporters’ needs. Especially during the past few years, we have started seeing various mobile applications that help users improve their performance in different sports branches. On the other hand, there are limited AI trainer applications for other sports fields such as golf. In order to work with Golfers, various networks such as ball tracking to analyze performance, large language models to have a chat with users and speech signal processing to enable oral communication during the training must be implemented. A key feature of the AI Coach will be its ability to monitor the golfer’s performance over time, providing insights and strategies to elevate their skill level progressively. The highest point of this project will not only provide a cutting-edge tool for golfers seeking to improve their putting skills but also contribute significantly to the intersection of AI and sports training.
05.12.23	Reinforcement Learning with Large Language Models in Robotics	Kun Chu
Reinforcement Learning (RL) plays an important role in the robotic manipulation domain since it allows self-learning from trial-and-error interactions with the environment. Still, sample efficiency and reward specification seriously limit its potential. One possible solution involves learning from expert guidance. However, obtaining a human expert is impractical due to the high cost of supervising an RL agent, and developing an automatic supervisor is a challenging endeavor. Large Language Models (LLMs) demonstrate remarkable abilities to provide human-like feedback on user inputs in natural language. Nevertheless, they are not designed to directly control low-level robotic motions, as their pretraining is based on vast internet data rather than specific robotics data. In the paper that has been accepted by the CoRL 2023 Workshop, we introduce the Lafite-RL (Language agent feedback interactive Reinforcement Learning) framework, which enables RL agents to learn robotic tasks efficiently by taking advantage of LLMs’ timely feedback. Our experiments conducted on RLBench tasks illustrate that, with simple prompt design in natural language, the Lafite-RL agent exhibits improved learning capabilities when guided by an LLM. It outperforms the baseline in terms of both learning efficiency and success rate, underscoring the efficacy of the rewards provided by an LLM. In this turn of Oberseminar, I will introduce Lafite-RL, recent works on using LLMs to bootstrap RL in robotics, and my findings on some online spot talks in CoRL 2023.
28.11.23	Exploring the Capabilities of Large Language Models on Emotional Conversation Data Generation	Burak Can Kaplan
Emotion recognition in conversations (ERC) is a field of study that focuses on identifying and understanding human emotions expressed during interactions. Its main goal is to recognize and interpret emotions conveyed within dialogues, increasing the quality of human computer interaction. However, there are several challenges which directly decreases the performances of current ERC models, e.g, the lack of quality data. The majority of popular ERC datasets are highly biased, unnatural and unbalanced. Moreover, the existing datasets usually differ from each other in terms of language and labeling, which makes it even harder to use them together. The number of existing datasets is quite low and they are too costly to generate, particularly in terms of participant recruitment, copyright considerations, unbiased dialogue construction, and accurate emotion labeling due to the subjective nature of emotional interpretation. In this spot talk, a solution will be devised to the problem of generating unbiased ERC datasets by using a small large language model as a cheap and efficient solution to tackle this problem. This work aims to answer two research questions: "How well large language models generate consistent multi-party emotion recognition in conversation data that can be used to feed ERC models?" and "How can the quality of the ERC data generated by an LLM be evaluated?"
14.11.23	Continual Learning for Language-Conditioned Robotic Manipulation	Lennart Bengtson
Language is an essential and versatile medium for communicating needs and requirements. Therefore, natural language is becoming increasingly important for communication with artificial agents and desirable, especially for embodied agents interacting with their physical environment. This requires grounding a growing number of natural language concepts continuously with visual perception and other modalities (language grounding). This master thesis has two main objectives: it aims to bridge the gap between simplistic single modality setups for Continual Learning (CL) and more realistic scenarios, and it aims to explore how language can be used to improve methods for CL. To do this, this work investigates the problem of learning to solve a sequence of simulated object manipulation tasks based on visual input for which the goal is specified by natural language instructions. It explores different CL methods for this scenario and shows their suitability through experimental results with a focus on parameter isolation. Based on an analysis of a state-of-the-art parameter isolation method, this work introduces the new CL method of Language-Conditioned Sub-networks (LCS), which uses information from natural language to aid the prevention of catastrophic forgetting and the reuse of previously learned concepts for learning novel tasks. Experiments show that the method achieves competitive results in terms of average accuracy in the continual learning setup. Furthermore, the experiments indicate that new tasks are learned more efficiently using LCS (i.e., higher accuracy is achieved after seeing fewer data samples). While the generalisability to different CL setups and base models remains open, these results provide a proof-of-concept on how information from natural language can help to improve CL methods in terms of the reusability and transfer of knowledge.
02.11.23	Gesture Recognition for Human-Robot Communication Using Transformers	Chams Alassil Khoury
Gestures are crucial in Human-Robot Interaction systems, as they resemble a form of human-to-human communication. Since they involve movements over a relatively long period, gesture recognition requires the utilization of deep-learning models, such as Transformers, that can efficiently learn long-distance dependencies. Although originally introduced for natural language processing, Transformers demonstrate encouraging performance in real-time vision tasks. Motivated by this success, we introduce in this thesis GRHAM, a Transformer-based framework that integrates distinct arm and hand modalities. GRHAM’s hand extractor, based on a pre-trained YOLOv7 pose estimator, shows enhanced robustness against variations in lighting and skin colors compared to skin color-based methods. Our evaluation of GRHAM on the Montalbano V1 dataset demonstrates improvements in distinguishing highly similar gestures, surpassing a state-of-the-art framework called Snapture. In this thesis, we additionally present a real-time version of GRHAM designed for potential deployment on robots.

Summer Semester 2023

Summer Semester 2023
19.09.23	Visual Feature Extraction Enhancement in Paired Transformer Autoencoders	Berk Gungor
Innovative method for enhancing visual feature extraction by leveraging the capabilities of cross-modal learning tasks. MIRNet is employed to enhance the images that are under not ideal conditions and adapt within Paired Transformer Autoencoders framework. Different brightness conditions have been tested and compared to observe MIRNet’s performance on cross-modal learning applications that are performed with tabletop tasks.
05.09.23	Multivariate Normal Methods in Pre-trained Models for Continual Learning	Hans Hergen Lehmann
In this thesis, we present a novel methodology and insight for continual learning with high-dimensional, pre-trained embeddings. This approach utilizes multivariate normal methods, enhancing performance in targeted continual learning tasks compared to traditional techniques. Our experiments demonstrate the efficacy of these concepts, outperforming conventional methods on some downstream continual learning tasks. However, our study also highlights some limitations of our approach, such as ill-conditioned matrices or numerical instability. These findings advance our understanding of the strengths and potential challenges of incorporating classical statistical methods into pre-trained models for continual learning, opening new avenues for future investigations.
22.08.23	Social Attention Prediction in a Free-Viewing Eye Tracking Task	Maximilian Keiff
This work investigates the prediction of social attention in a free-viewing task using saliency prediction models. In particular, we study the effects of including binaural audio and experiment with aggregating scanpaths from multiple participants to combine different scanpath behaviors in a single model. For this, we create a new dataset of eye-tracking recordings from 41 participants watching videos of social interaction scenario on a curved wall-sized screen. To predict their social attention, we develop a saliency model with a fixation history. The model is evaluated using common scanpath metrics such as AUC and NSS. To investigate the explainability of the model predictions, we compared the transitions between Areas of Interest (AOI) with the original dataset. Our results showed good performance in detecting social cues and learning behaviors associated with social attention.
15.08.23	EXPLAINABLE LIFELONG STREAMING LEARNING	Prof. Chu Kiong Loo
Real-time on-device continual learning applications are used on mobile phones, consumer robots, and smart appliances. Such devices have limited processing and memory storage capabilities, whereas continual learning acquires data over a long period of time. By necessity, lifelong learning algorithms have to be able to operate under such constraints while delivering satisfactory performance. This study presents the Explainable Lifelong Learning (ExLL) model, which incorporates several important traits: 1) learning to learn, in a single pass, from streaming data with scarce examples and resources; 2) a self-organizing prototype-based architecture that expands as needed and clusters streaming data into separable groups by similarity and preserves data against catastrophic forgetting; 3) an interpretable architecture to convert the clusters into explainable IF-THEN rules as well as to justify model predictions in terms of what is similar and dissimilar to the inference; and 4) inferences at the global and local level using a pairwise decision fusion process to enhance the accuracy of the inference.
11.07.23	Learning Vision-Language Tasks while Allowing Reinforced or Supervised Feedback	Leonhard Rattay
Supervised learning and reinforcement learning are two different machine learning techniques with individual traits and therefore are often preferred in different use cases. However, for certain tasks, both techniques can have their advantages in different areas of the problem. Instead of picking one technique over the other or splitting the task into separate sub-tasks, we analyse if there are merits in a shared architecture. For this, we create a model that allows either supervised or reinforced feedback. We test our model in two different scenarios: In the first scenario the model is given a picture of three cubes on a table and the task is to correctly name the colors of all the cubes in the picture. In the second scenario the model is given a picture of a randomly placed cube on a table and the task is to simulate a simplified sweeping motion of the cube. Based on the color of the cube, the sweeping motion has to be either from left to right or from right to left. In both scenarios we split the dataset in half. For one half we give exclusively supervised feedback and for the other half exclusively reinforced feedback. During the training we combine both data halves by randomly mixing the samples together, and compare a combined training process to an exclusively RL or SL training process. As a result, we show that a combined model is able to drastically improve the training speed of reinforcement learning, but can decrease the training speed of supervised learning.
04.07.23	Improving Dataset Enrichment for Fuel Consumption Prediction	Vinh Ngu
His thesis presents that a reliable prediction model for fuel consumption with competitive accuracy (1.69 l/100km) is possible based on a feature-poor data set. The dataset used contains no vehicle-specific information other than the fuel level. The first part of the thesis deals with reconstructing the approach of Bousonville et al.. . Inspired by the existing approach, which considers topological and climatic data, road information and weather data from publicly available sources were subsequently added to the dataset. The second part of the thesis investigates open questions and evaluates possibilities for improvement experimentally, which results from part 1. Unspecific parameters, which are important, e.g. for the preprocessing of the data set, were determined empirically. It could be shown that the empirically determined parameters and procedures led to an optimisation of the prediction model. Due to the narrow range of essential features in each data set and the use of publicly available data sets (e.g. OpenStreetMap, MeteoStat, etc.), the reproducibility and integrability of this work is ensured.” The question I have is whether the abstract has to be identical to that of the thesis. Or whether I could formulate it a little differently. If the latter is possible, then I would phrase the abstract like this: “This work aims to create a machine-learning model for predicting the fuel consumption of trucks. In cooperation with the company Co2Opt, a data set was provided on which the work is based. This dataset, untypical for datasets in this field, contains no vehicle-related features like weight, vehicle model, etc. The thesis is divided into two parts. The first part deals with the question of to what extent a modern approach in the field is suitable for this data set. The mentioned approach is replicated as well as possible, and the results are compared. The second part of the paper investigates which approaches can be used to improve the model from the first part.
27.06.23	CycleIK - Neuro-inspired Inverse Kinematics	Jan-Gerrit Habekost
CycleIK is a novel neuro-inspired approach to the inverse kinematics problem (IK), which aims to align the 6D pose of the kinematic end-effector link of a robot with a given target pose by calculating a suitable configuration for the motor positions. Two neural architectures, a Multi-Layer Perceptron (MLP) and a Generative Adversarial Network (GAN), are presented that can solve the IK problem with high enough precision to be deployed directly to physical robot hardware. CycleIK is embedded in a neuro-genetic IK pipeline that allows for further optimization of the neural solutions with the genetic algorithm Gaikpy and sequential least squares programming. The pipeline is evaluated on the novel semi-humanoid NICOL robot, the Neuro-Inspired COLlaborator, which has two redundant manipulators with eight degrees of freedom. Three dense training datasets were collected from two differently-sized Cartesian workspaces of the right arm of the NICOL robot. Several experiments on the precision of the MLP and GAN for the different datasets are presented before the conclusions for further extensions to the neurorobotic framework are drawn.
20.06.23	Generalizing a Bidirectional Autoencoder Model in the Simulation to the Real World by Domain Randomization and Data Augmentation	Erik Alaverdyan
In machine learning and artificial intelligence, one major problem is the lack of training data for the model at hand. Recording the data in the real world is a painstaking, time-intensive effort, and sometimes costly endeavor. By recreating the parts from the real-world environment which are important for the task of the model in the simulator (CoppeliaSim), the missing and costly training data can be substituted. In order to transfer the model now to the real world the sim-to-real gap must be bridged. This gap describes the differences between the simulation and the real world (e.g. different lighting conditions, mismatched proportions, and much more). To cross this gap domain randomization seems to help. By having the simulated environment be less static but more random, the model can generalize better to different lighting, movement, geometry, etc. circumstances, that it could encounter in the real world. The shortcomings of this approach are the bigger size of training data and with that, the time to generate the datasets increases. This is why having some of the domain randomization achieved via data augmentation is more resource conscious. Domain randomization is most often applied by changing the environment in the simulator, but data augmentations on the other hand is an algorithm that runs over the dataset and modifies it (e.g. by applying a blur effect on an image sequence). By modifying the data produced in the simulation with different algorithms, some of the in-simulation computational heavy variations can be emulated with data augmentations. Reducing the time that is needed for the data creation. In this thesis, we show the impact of domain randomization and data augmentation on the basis of the bidirectional PTAE model proposed by Özdemir et. al. [8], by having the transfer from the simulation to the real world. This was achieved with an accuracy of 20.83%. Additionally, we show that the capability of the PGAE (the basis for the PTAE) to classify interactions by an opposite agent is present in the PTAE as well. With an accuracy of 85% with no augmentations and 97% with noise augmentation.
13.06.23	Gender Bias in Deep Learning-based Forecasting of Medical Parameters	Yannic Jänike
This thesis investigates the use of deep learning models for forecasting medical parameters, including vital signs, blood gas analysis, and laboratory measurements, and focuses on potential gender bias in the models. Specifically, we compare the performance of Long Short-Term Memory (LSTM) and Transformer models in predicting medical parameters for di↵erent genders. We also examine the external validity of our findings by validating the results on an external dataset called MIMIC. We find that LSTMs perform well at predicting di↵erent vital parameters and that the models can generalize to an external test set. Over and undersampling a specific gender in the training data does not improve the performance of the gender during testing. Furthermore, we only find a slight di↵erence in performance between genders.
06.06.23	Generalization of Transformer-Based Models on Visual Question Answering Tasks	Frederic Voigt
In this thesis the use of special transformer models for vision-language tasks is investigated. Special focus is on the performance of the model on evaluation data that require the model to have a deeper understanding of the task by differing in certain aspects from the training data. Transformers are used which implement depthwise recurrence (Universal Transformers) and have already been successfully used for similar tasks on text-based data. Visual Question Answering is used as an example task and the CLEVR dataset with the CoGenT split is examined. It is found that Universal Transformers can handle textual data well but show difficulties in processing visual data at least within the configurations, methods, and approaches studied. However, it is also noted that longer pre-training of the models, other techniques such as regularizations, or fundamentally different approaches may need to be tried in order to use Universal Transformers for vision-language tasks.
30.05.23	Improving Compositional Generalization By Learning Concepts Individually	Ramtin Nouri
Neural networks are powerful function approximators but often fail to generalize beyond their dataset. In this work, we investigate models and techniques for improved out-of-distribution generalization in the context of action recognition tasks. In particular, we focus on compositional generalization, where the data and target labels are compositionally made up of multiple properties. To improve compositional generalization, we propose two novel approaches, that use a ConvLSTM-based model. The first approach involves unsupervised pre-training on an extended dataset using next-frame prediction. We hypothesize, that this method allows the model to learn a more general representation of the data. Our second approach introduces a new mask, that ignores certain words in the target label. This mask is given to the model as a second input, allowing the model to focus on individual properties. We evaluated the performance of both approaches on Volquardsen et al.’s dataset, and our newly introduced ARCoGen dataset. The results of both approaches significantly outperform the baseline on Volquardsen et al.’s dataset on compositional generalization. While the baseline fails completely in the hardest configuration, achieving 0% compositional generalization accuracy, our masking approach could achieve 74.6%. However, the accuracies on our proposed ARCoGen dataset were not as strong. While the results of the masking approach were reasonable, given the more complex dataset, the unsupervised pre-training approach performed poorly and requires further work. These findings may pave the way for more effective and generalizable action recognition systems in real-world applications.
23.05.23	Composite Dataset Training for Improved Pose Detection in the Wild	Leon Trochelmann
Human Pose Detection is a fundamental computer vision task that allows computers to detect a human being in a much more sophisticated way than a simple bounding box. The field has seen great strides in recent years with models becoming more and more sophisticated and reaching better and better scores on common benchmarks. The underlying goal is to obtain more precise and robust detection on in-the-wild data: Data reflecting real world use cases which is free from any potential biases that the popular datasets may be subject to. In order to minimise such biases and improve performance on in-the-wild data, we present composite dataset training: A method to reconcile differences between datasets so that a model can be trained on all of their data. The performance of such models is evaluated on different validation sets and compared to that of models trained on the individual datasets. A comparative analysis is also conducted between models trained on a composite dataset and models trained on individual datasets of equal size. This analysis aims to explore the impact of dataset-external variety versus dataset-internal variety. The results show that a model trained on a composite dataset always outperforms models trained on the individual parts. Furthermore the composite-trained models outperform models trained on individual datasets of equal size during validation on external data, demonstrating that external variety can lead to better generalisation than internal variety.
25.04.23	Multi-Modal Representation Learning for Emotion Recognition in Continuous Domain	Navneet Singh Arora
Emotion recognition (ER) is vital, given the complexity and relation to human expression and perception across modalities. Emotion analysis involves identifying emotions using either discrete or dimensional labels through interlocutor utterances. Recent research with large-scale models, such as Wav2Vec 2.0, achieves excellent performance on emotion estimation, benefitting from the availability of real-world datasets. However, curating these domain-specific datasets is an expensive and time-consuming process. As contrastive-based representation learning emphasises negative samples, random sampling within a mini-batch is ineffective, leading to too-weak or too-hard samples, making representations less robust. These researches, moreover, do not utilise the strengths of the data modality and fail to establish any association between the modality and the dimensions of the emotion. In my research talk, I will present a novel approach to address these challenges, utilising non-contrastive representations and iteratively improving them through contrastive learning with pseudo-labelled negative samples.
18.04.23	Continual Learning for Robust Speech Recognition	Theresa Pekarek-Rosin
While Automatic Speech Recognition (ASR) models have shown incredible advances with the introduction of unsupervised or self-supervised training techniques, these improvements are still only limited to a subsection of languages and speakers. Older people or people with disabilities would benefit the most from reliable speech recognition and yet they are among the group of speakers state-of-the-art (SOTA) ASR models struggle with. Through transfer learning we can adapt large-scale ASR models to very specific domains, however fine-tuning the entire model is often not necessary and can even decrease performance for general speech recognition. This trade-off can be lessened with continual learning. In my research talk, I will present the results of my dataset collection of German elderly and disordered speech, as well as preliminary results of combining layer-specific fine-tuning with Experience Replay in three different SOTA architectures. Additionally, I will give an overview of future research directions.
11.04.23	Exploring Causal Explainability for RL Models: Insights into Reward Decomposition, State Disentanglement and Causality	Wenhao Lu
In this presentation, we first delve into the reward decomposition (RD) technique and its role in promoting transparency within RL models. Further, we explore the connection between RD and state disentanglement, highlighting how the latter can uncover causal relations among states, actions, and rewards when causality principles are incorporated into the learning loop. Through this process, we can form a meaningful clustering of the state space, which takes the form of a graph with concept-related states as nodes that are nested. Starting with this (state-action) graph, we examine how it can facilitate one-step transition explanation. Lastly, we briefly touch upon the possible formats of explanation artifacts that can be conveyed to target users, and how large language models like ChatGPT can leverage these artifacts to foster mutual trust between human and models.
04.04.23	Achieving Reliable Localization on the Pepper Robot using a SLAM based Approach in an Office Scenario	Eke Suichies
Reliable localization is a critical requirement for mobile robots to perform various tasks in different environments. This presentation discusses the relevance of reliable localization and how it could be achieved using a Simultaneous Localization and Mapping (SLAM) based approach. The proposed method involves creating two live SLAM maps from LIDAR and odometry data, which are then registered against a known ground truth map in real-time. The matching of the live and ground truth maps is done by identifying and matching features of the respective maps and performing a robust estimation method on them. When the maps are aligned properly this form of localization can be used for path planning outside of the live mapped area. The proposed approach can be extended to other mobile robots in similar environments to achieve reliable localization.

Winter Semester 2022/2023

Winter Semester 2022/2023
28.03.23	Continual Learning for Speech Accent Adaptation	Henri Kordt
Automatic speech recognition (ASR) is a large, ever-growing research field. With ASR systems being established in various day-to-day interactions with a variety of speakers, reliable and efficient performance is of great importance. While English is by far the most popular language to train speech models on, even the best-performing models struggle, when being confronted with English accented speech. That is mostly due to the fact, that accented speech data is not as available, especially when compared to the amount of data available for the standard accent. Transfer learning is a learning technique, that aims to use already obtained knowledge, i.e. from a pre-trained model, to transfer it to another domain. Transfer learning often causes phenomena like catastrophic forgetting or catastrophic interference, which result in performance decreases for the original domain. Continual Learning introduces methods to counteract this behavior, to progressively train a deep learning model. This thesis analyses different popular Continual Learning approaches and tests for their applicability in speech accent adaptation. I furthermore provide novel results for Continual Learning on state-of-the-art ASR models, and analyze performance, scalability, robustness, and lastly data requirements of the different Continual Learning approaches. The data shows, that Elastic Weight Consolidation is able to significantly decrease catastrophic forgetting but is unable to learn multiple accents sequentially. Experience Replay also mitigates forgetting when fine-tuning with limited amounts of data, but is not able to stop catastrophic forgetting on large amounts of data. Lastly, Learning without Forgetting reduces forgetting in all scenarios, but usually shows a worse performance than the other two approaches. Additionally, for approaches that require additional data, the choice and amount of data are directly correlated to the resulting performance.
21.03.23	Neural Fields for 3D Scene Understanding	Martin Gromniak
First, I will introduce neural fields as a representation for learning 3D geometry. Second, current research directions on neural fields and open research questions will be present. Third, I will show possible applications of neural fields for object-centric tasks and point out links to other research at WTM.
14.03.23	Capturing Long-Term Dynamics During Salient Video Representation Learning	Theodor Wulff
Recent years have witnessed a rise in transformer-based architectures across different fields of research. In this thesis, an approach is presented to learn a joint representation of video frames and past saliency information with respect to the task of video salient object detection (VSOD). The system first extracts spatiotemporal tokens from the stream of video frames and past saliency maps. This lets the model incorporate short-term information within the token while being able to make long-term connections between tokens in other layers. The core of the system consists of a dual-stream transformer encoder architecture to process the extracted sequences independently before fusing the two modalities in a later step. Additionally, in order to guide the learning process, a saliency-based masking scheme is applied to the input frames to learn a profitable embedding via attempting to recover the masked salient regions from the input. The model benefits from the embedding that resulted from the direct encoding of salient information in the video frame by being able to focus its prediction on the target regions. Using the saliency-based pretraining, the model is able to improve its performance and utilize the prior information more accurately. However, the additional prior information also poses a new challenge since it biases the model towards specific regions in the image.
28.02.23	Visually Grounded Human-Robot Dialog	Xiaowen Sun
First, We will introduce our latest research in the visually grounded human-robot dialog and discuss how to exploit existing LLMs (large language models) in our work. Second, we will show the overview of recent visual and language tasks and discuss the novelty task that is related to our domain (human-robot interaction). Third, we will briefly present SIMMC (Situated and Interactive Multimodal Conversations) and the current SOTA models, and our current results.
07.02.23	Using a multimodal explainable artificial intelligence to determine the content of containers	Josua Spisak
Robots are starting to become a more important part of our society. If they are to act in our daily lives, they will have to interact with containers of all kinds, and the content of said containers will change how they need to interact with them. In this thesis, we explore how to detect the fill level of a bottle. We detect the fill level from the perspective of the robot NICOL and use three senses for a multimodal approach. We gather the data from the robot that picks the bottle up, moves it around and puts it back down. During this process three senses are recorded. These senses are the visual sense, the tactile sense, and the proprioceptive sense. For each of the senses, we use neural networks to classify the fill level. We also present multiple multimodal architectures to combine the senses and manage to achieve an accuracy higher than any individual sense was able to provide. Throughout all our steps, we use grad-cam and provide confidence value to make the results easy to interpret and understand. This thesis makes contributions to XAI, multimodal machine learning and neurorobotics.
10.01.23	Predictive modelling with uncertainty for intrinsically motivated reinforcement learning	Philip Bradfield
This thesis presents model-based soft actor-critic (MBSAC), a novel model-based reinforcement learning framework based on the soft actor-critic (SAC). Using a forward model consisting of an ensemble of neural networks, MBSAC generates an estimate of the model’s predictive uncertainty for a given state-action and uses this as an intrinsic reward to direct the agent’s exploration. The framework is tested in simulation on a variety of robotics environments and the results of the experiments are presented and analysed.
20.12.22	Visual-Language Co-Attention for Disambiguation in Human-Robot Dialog	André Riedel
In the foreseeable future, natural language interaction with robots will likely become indispensable. This will require models that can process visual and linguistic input together. In their work, Sun et al. (2022) have presented a hybrid neural architecture for this scenario. They present a complex framework that combines object recognition (PR), dialog state tracking (DST), automatic speech recognition (ASR), text-to-speech (TTS), robot arm planner (RAP), and human-robot interaction policy (HRIP). Their approach has limitations, particularly with regard to generalizability. With the technological progress in machine learning in recent years, large Transformer models have become increasingly popular due to their success. In this work, VisDial-BERT is evaluated as a replacement of the core part of the previous architecture. This architecture enables end-to-end learning, combining the previous DST and HRIP. The new architecture is investigated in four experiments. First, VisDial-BERT is compared with the architecture of Sun et al. on a new comparison dataset analog to the original dataset. Here, VisDial-BERT performs comparably, attesting to its general usability. Second, the capacity of the pretrained VisDial-BERT is investigated without further training on a slightly modified dataset that contains direct dialog rounds and thus covers some more complex domains without any fine-tuning. Here, the results are not yet satisfactory but better than random guessing. In the third experiment, the previous experiment is extended to include fine tuning. This gives us a much better result with 80% recall@1 and 98% recall@3. Most remaining errors originate from the wrong identification of positions. In the fourth experiment the unlearning of the VisDial dataset is evaluated at different fine tuning checkpoints to evaluate the remaining proficiency. Here we find that after 1 fine tuning epoch, most of the added value (72% absolute recall@1 gain) is gained for the problem at hand with a loss of only 3% for the mean reciprocal rank on the VisDial dataset. Further research could build upon this work and further address problems related to the identification of positions.
13.12.22	Visual Feature Extraction Enhancement in Paired Transformer AutoEncoders	Berk Güngör

06.12.22	A Short Survey on Interactive Reinforcement Learning	Kun Chu
Reinforcement learning (RL) enables agents to self-perform tasks by only interacting with the environment. However, the problem of sample efficiency has largely limited the application of RL to real-life situations. Interactive RL has been proven as a powerful technique that enables agents to learn to solve tasks from corrective or instructive feedback provided by human users. In this way, the agents could leverage an understanding of such human feedback to improve their learning efficiency. However, more research on this topic is required to better understand how diverse human feedback enhances the agent's learning. In this oberseminar, some latest interactive RL methods will be investigated from the perspective of human feedback sources, like explicit or implicit, unimodal or multimodal, and so on. In the end, some open questions and potential directions will be discussed.
29.11.22	Merging Supervised Learning with Reinforcement Learning for Vision-to-Language Tasks	Leonhard Rattay

22.11.22	Extracting a Speech Preprocessor from a Conformer for Automatic Speech Recognition	Patrik Eickhoff
In recent research, in the domain of speech processing, large End-to-End (E2E) systems for Automatic Speech Recognition (ASR) have reported state-of-the-art performance on various benchmarks. These systems intrinsically learn how to handle and remove noise conditions from speech. Previous research has shown, that it is possible to extract the denoising capabilities of these models into a preprocessor network. However, the proposed method was limited to specific fully convolutional architectures with deep residual connections. In this thesis we propose a novel method to extract the denoising capabilities from a Conformer ASR model, that can also be applied to other architectures. We evaluate our preprocessor for both pretrained ASR models and training models from scratch. The models were trained and evaluated on LibriSpeech and the Noisy Speech Database. Our results show, that the preprocessor improves the Word Error Rate of downstream ASR models on noisy data and leads to faster training convergence.
08.11.22	Learning Concepts - A Developmental Lifelong Learning Approach to Visual Question Answering	Ramin Farkhondeh
The combination of lifelong learning and visual question answering (VQA) is largely unexplored, although both are essential for an independently acting agent. Lifelong learning is based on the task of learning a constant flow of data and tasks. This can lead to problems, especially with regard to forgetting data that has already been learned and transferring knowledge onto new tasks. To counteract this forgetting and to check the knowledge transfer, MDETR-CL was introduced, a fusion of MDETR and REMIND. Since a concept-oriented VQA dataset has already provided reduced data usage and faster convergence of a model in curriculum learning, we pick up the idea to test the model. The results of these tests show that MDETR-CL achieves better results than the REMIND model, but the hoped-for knowledge transfer can not be measured.
01.11.22	Developing a Language Model based on Reinforcement Learning	Tom Herrmann
The field of artificial intelligence has attracted much attention in recent years, including in the area of natural language generation. For humans, it may seem natural to speak and find connections. For machines, this does not come naturally. This thesis takes a previously developed model and examines how well the model can be extended to include natural language generation. First the preparation and then the different expositions and the reconstruction of the network are discussed. The experimental results show that the model is able to generate meaningful questions in natural language with comparable performance and accuracy. In addition, the performance improvement aspect of the network will be addressed. We explain our results and compare them with the results of the previous model and explain several possibilities for improvement and approaches for future work.
25.10.22	Navigating in Partial Observability using Hierarchical Reinforcement Learning with Hindsight Experience	Jose Sanchez
Navigating through an environment is a fundamental capability for many service, warehouse or domestic robots. Traditional navigation approaches involve using constant localization, mapping and planning under the assumption of full knowledge of the environment. This thesis focuses on expanding current reinforcement learning approaches to solve a navigation task in a partially observable environment. The approach makes use of hierarchical recurrent deep reinforcement learning, with the addition of hindsight experience replay in order to enhance learning by learning from failed attempts. Such architecture is evaluated in maze-like environments designed. The results show that most of the times hindsight experience effectively helped the model to learn from failed experiences even in partial observability.
18.10.22	Benchmarking Faithfulness: Towards Accurate Natural Language Explanations in Vision-Language Tasks	Jakob Ambsdorf
Whereas most current explanation methods for deep neural models are not intuitively understandable for lay users, natural language explanations (NLEs) promise to enable the communication of a model's decision-making in an easily intelligible way. Current models successfully generate convincing explanations, but it is an open question how well the NLEs actually represent the reasoning process of the models — a property called faithfulness. Although the development of metrics to measure faithfulness is crucial to designing more faithful models, current metrics are either not applicable to NLEs or are not designed to compare different model architectures across multiple modalities. Building on prior research and a theoretical rationale, we address this issue by proposing three faithfulness metrics: Attribution-Similarity, NLE-Sufficiency, and NLE-Comprehensiveness. The efficacy of the metrics is evaluated on the e-ViL benchmark for vision-language tasks and NLE generation by systematically applying modifications to the performant e-UG model. We show that in certain cases, the removal of redundant inputs to the explanation-generation module of e-UG successively increases the model's faithfulness on the linguistic modality as measured by Attribution-Similarity. Further, our analysis demonstrates that NLE-Sufficiency and -Comprehensiveness are not necessarily correlated to Attribution-Similarity, and we discuss how the two metrics can be utilized to gain further insights into the explanation generation process.
11.10.22	Graph Neural Networks for Facial Expression Recognition	Brenda S. Gutierrez Torres
The traditional problem of facial expression recognition (FER) has been addressed with multiple approaches varying from standard computer vision techniques to state-of-the-art deep learning networks. However, only few research articles have been focused on applying graph learning representation to this task. Graph neural networks have gained attention in the last few years, mainly because of their power of abstraction and the versatility to model complex real-world scenarios. In this thesis, we propose the use of graph representation and graph neural networks to model and classify facial expressions from humans. We have trained the PointNet++ method proposed by R. Qi et al. (2017) to recognize the extreme facial expressions point cloud samples from the CoMA data set. The model was evaluated in terms of precision, recall, accuracy and f1-score. Further experiments with five different point scales were performed in order to measure the variability of the model when the resolution of the data decreases. Aiming to analyze the features learned by the model to distinguish between the different facial expressions, instance-level explanations from the model were generated for five facial expressions using the LIME-3D approach. Our results demonstrate an overall high performance of 94.6% for the model trained with +3k points per class and a constant performance above 90$\%$ when the resolution is decreased down to 512 points. The highest f1-score was obtained for the high smile class, while the lowest value corresponds to the mouth middle expression. From the visual explanations, the model has learned to focus on different face sections according to the class being processed and no specific area was outstanding among the others. Although it has been proved by different authors that images-based approaches succeed in the task of recognizing facial expressions, with the results grounded above, graph neural networks should be considered a promising alternative research field for this problem.
04.10.22	Conceptual Prototype-based Explanations in Image Classification	Bjarne Jessen
This thesis gives contributions in the area of visual XAI. An existing neural model that gives intrinsic prototypical explanations of its decisions is investigated. To extract conceptual knowledge, a transformation sensitivity analysis is performed. The analysis compares how strongly matches between image parts and piece of prototypical information change when a particular transformation (for instance color shift or image blur) is applied on the image. A real dataset as well as a synthetic dataset is considered. The analysis confirms intuitive assumptions about the conceptual information. For the real dataset a lower test accuracy of the trained model is observed, however the prototypes learned in the model are better separated conceptually.

Summer Semester 2022

Summer Semester 2022
30.08.22	Sim-to-Real Neural Learning with Domain Randomisation for Humanoid Robot Grasping	Connor Gäde
Collecting large amounts of training data with a real robot to learn visuomotor abilities is time-consuming and limited by expensive robotic hardware. Simulators provide a safe, distributable way to collect data, but due to discrepancies between simulation and reality, learned strategies often do not transfer to the real world. This paper examines whether domain randomisation can increase the real-world performance of a model trained entirely in simulation without additional fine-tuning. We replicate a reach-to-grasp experiment with the NICO humanoid robot in simulation and develop a method to autonomously create training data for a supervised learning approach with an end-to-end convolutional neural architecture. We compare model performance and real-world transferability for different amounts of data and randomisation conditions. Our results show that domain randomisation improves the transferability of a model and can mitigate negative effects of overfitting.
28.06.22	The variance of the critics as an additional intrinsic reward to soft actor critic reinforcement learning	Franziska Eder
Reward sparsity is a common problem in deep reinforcement learning. Therefore, I developed an algorithm that implements the soft actor-critic approach and adds the variance of the critics to the q-value to promote exploration. Evaluating this algorithm in various test environments results in the conclusion that adding the variance of the critics can especially improve the performance of the SAC algorithm in complex physical environments.
14.06.22	Quantifying the Quality of the Visual Network Explainability Approaches	Paul Striker
Convolutional neural networks are broadly used for image classification. The networks have proven the capability of classifying images with a high accuracy. But the reason for the decision of the neural network isn't clear to both the developers of the network and the users. The field of explainability is a growing concern due to the increased usage of neural network in our everyday life. Explainability can help the users to trust a model and could help satisfy governmental regulators in the admission procedure of new products using neural networks. There are approaches for the visualization of the areas in an image that are being used for the classification of the image. In our work we will implement two of the approaches and use them to derive bounding boxes for the objects in the images. We will also compare the generated bounding boxes with the objects to receive a numerical value for the quality of our bounding boxes. The main contribution of our work are four different approaches for deriving the bounding boxes from the heatmaps. We also provide a comparison of the approaches for deriving the bounding boxes.
24.05.22	Tackling The Binding Problem And Compositional Generalization In Multimodal Language Learning	Caspar Volquardsen
The goal of artificial neural networks is to generalize outside of known training data, which means successfully solving tasks for novel data. But artificial neural networks still fall short of human-level generalization and require a very large number of training examples to succeed. Model architectures that further improve generalization capabilities are therefore still an open research question. We created a multimodal dataset from simulation for measuring the compositional generalization of neural networks in multimodal language learning. The dataset consists of sequences showing a robot arm interacting with objects on a table in a simple 3D environment, with the goal of describing the interaction. Compositional object features, multiple actions, and distracting objects pose challenges to the model. We show that a Long Short-Term Memory (LSTM) encoder-decoder architecture jointly trained together with a vision-encoder surpasses previous performance and handles multiple visible objects. Visualization of important input dimensions shows that a model that is trained with multiple objects, but not a model trained on just one object, has learned to ignore irrelevant objects. Furthermore we show that additional modalities in the input improve the overall performance. We conclude that the underlying training data has a significant influence on the model’s capability to generalize compositionally.
26.04.22	Evaluating the Lottery Ticket Hypothesis for Pruning Neural Networks	Vincent Rolf
We investigate the Lottery Ticket Hypothesis and its associated neural network pruning technique and evaluate it on a ResNet-56 architecture with the CIFAR-10 dataset. It is shown that with regards to computational speedup, the technique outperforms all other pruning techniques which reported data for the same architecture/dataset pair. We further investigate this by presenting a practical evaluation of the technique which runs the models as JavaScript in the web browser. The evaluation shows that in practice, the computational speedup is even greater, achieving a 40 to 100 times reduction in computation time with little loss of test accuracy. This is especially useful for deploying compressed versions of neural networks to mobile web browsers, since computational power on mobile devices is often limited. As part of our investigation, we present a compiler which translates pruned ResNet models to JavaScript in a sparse manner.
19.04.22	Effective and Sample-Efficient Planning for Robotic Manipulation Using Demonstrations	Mostafa Kotb
Planning in robotic manipulation settings is challenging due to the high-dimensional continuous action space which leads to an extremely hard exploration problem. For a robot to solve a manipulation task, the learned world model needs to effectively plan in areas that are related to the task execution. Unfortunately, these task-related areas are difficult to discover by exploration alone, and a world model may not receive sufficient training data from these areas to learn from, which will lead to bad convergence towards the desired behaviour. In order to bootstrap world model learning in task-related areas, we propose the use of demonstrations - given for instance by a human teacher - as initial training data, which is a common technique in model-free reinforcement learning. First, pretraining the model-based algorithm with given demonstrations to boost model learning, then further fine-tune by interacting with the environment to generalize beyond the demonstrations. Evaluation with two simulated manipulation tasks demonstrates that pretraining with demonstrations leads to a faster and more sample-efficient fine-tuning phase than learning from scratch.
12.04.22	Impact Makes a Sound and Sound Makes an Impact: Sound Guides Representations and Explorations	Xufeng Zhao
Sound is one of the most informative and abundant modalities in the real world, while being robust to sense without contacts by small and cheap sensors that can be placed on mobile devices. Although deep learning is capable of extracting information from multiple sensory inputs, there has been little use of sound for the control and learning of robotic actions. For unsupervised reinforcement learning, an agent is expected to actively collect experiences and jointly learn representations and policies in a self-supervised way. We build realistic robotic manipulation scenarios with physics-based sound simulation, and propose the Intrinsic Sound Curioisty Module (ISCM). The ISCM provides feedback to a reinforcement learner to learn robust representations and to reward a more efficient exploration behaviour. We perform experiments with sound enabled during pre-training and disabled during adaptaion, and show that representations learned by ISCM outperform the ones by vision-only baselines; pre-trained policies can accelerate the learning process when applied to downstream tasks.
05.04.22	Reinforcement Learning for Goal-Oriented Visual Dialogue	Yuliia Lysa
The Visual Dialogue task includes the agent holding a conversation referencing question history, putting the current question into context, and processing visual content. The focus of this thesis lies on the compositionality of the dialogue questions. The combinations of the visual features are not explicitly defined in the data set, hence the model has to independently arrange them to achieve compositionality. The objective is to generate an optimal question sequence by reducing the total sequence length, adapting questions to the presented visual content, and constructing complex compositional questions for target prediction on a given image. For this task, we apply the novel Recurrent Attention Model architecture in combination with reinforcement learning to the modified MNIST data set.

Winter Semester 2021/2022

Winter Semester 2021/2022
01.03.22	Deep Planning Network for Robotic Object Existence Prediction by Navigation	Henry Apostle

21.12.21	Sim-to-real transfer of an end-to-end visuomotor grasping task with a humanoid Robot	Connor Gäde
In order to manipulate and interact with our environment, humanoid robots require a number of different visuomotor skills such as grasping. Collecting large amounts of training data with a real robot to learn these abilities is often time-consuming and requires prolonged access to expensive hardware which in turn can get worn out and damaged due to heavy use. Simulators provide a safer and easily distributable alternative to collect data. However, as simulations do not accurately represent a real environment, it can be difficult to transfer a learned strategy from simulation to the real world. In this thesis, we examine the effects of domain randomization to increase the transferability of a grasping strategy trained in simulation to the real world. We replicated a grasping experiment with the NICO robot within the CoppeliaSim robot simulator and developed a method to autonomously create training data for a supervised learning approach with an end-to-end convolutional architecture. Using this method we collected three datasets of 2000 samples, two of which were randomized in different visual characteristics. We optimized a neural architecture for each of these datasets to solve the grasping task and compared their performance to a fourth model trained with real-world data. Our results show that domain randomization increases the transferability of a strategy trained in simulation to the real world. We show that without randomization, increasing amounts of training data can negatively impact the performance of the model when applied to the real world.
14.12.21	Neural Network Learning for Robust Speech Recognition (PhD Thesis Defense)	Leyuan Qu
Recently, end-to-end architectures have been influential for modeling of automatic speech recognition (ASR) systems, which aim to directly map acoustic inputs to character or word sequence and to simplify the training procedure. Plenty of end-to-end architectures have been proposed, for instance, Connectionist Temporal Classification (CTC), Sequence Transduction with Recurrent Neural Networks (RNN-T) and an attention-based encoder-decoder, which have accomplished significant success and performance on a variety of benchmarks or even reached human level on some tasks. However, although advanced deep neural network architectures have been proposed, in adverse environments, the performance of ASR systems suffers from significant degradation because of environmental noise or ambient reverberation. This thesis contributes to the robustness of ASR systems by leveraging additional visual sequences, face information and domain knowledge. We achieve significant improvement on speech reconstruction, speech separation, end-to-end modeling and OOV word recognition tasks.
14.12.21	Explaining Deep Reinforcement Learning Agents through Reward Decomposition and Hierarchical Goals	Finn Rietz
Deep neural network-based artificial intelligence systems are achieving outstanding results across many domains, including reinforcement learning. However, deep neural networks are black box models that are very hard to interpret. To fully explain the network’s input-output mapping is even more challenging; an unfortunate truth that is amplified by the increasing size and complexity of these networks. Thus, to enable the employment of deep neural network-based reinforcement learning systems in real-world, potentially safety-critical domains, the interpretability of these systems must be improved. This thesis aims to improve the explainability and interpretability of deep reinforcement learning systems by contributing a novel, jointly hierarchical and reward-decomposed reinforcement learning architecture that leverages local action explanations and the more abstract contextualization that hierarchical goals provide, for increased explainability. We demonstrate the feasibility of such an architecture and conduct a qualitative evaluation of its explanation capabilities on an industry-plausible stochastic warehouse environment. Additionally, we investigate entropy-based weighting techniques to simplify the resulting explanations. Our qualitative analysis shows that hierarchical goals are strictly necessary for the correct interpretation of local explanations, given reasonably complex environments. This finding reveals an entirely new aspect under which local explanations have to be analyzed, which is underlined when considering that many worthwhile, real-world reinforcement learning use-cases feature hierarchical characteristics. Furthermore, we empirically demonstrate that entropy-based weighting techniques can prevent irrelevant factors from contributing to the decision, which significantly simplifies explanation complexity.
16.11.21	Deep Learning for Prostate Cancer First Incidence Prediction	Sebastian Zörner
Prostate cancer is the fourth most commonly occurring cancer overall and the fifth leading cause of death from cancer for men. Due to its slow-growing nature and occurrence at a late age, treating prostate cancer is not just a matter of the most effective intervention. Instead, there is a trade-off between disease curement and quality of life. Therefore, early and precise risk assessment is crucial to minimise uneasing invasive procedures and the psychological effects of false-positive detection. This thesis evaluates state-of-the-art knowledge extraction methods for medical decision support to detect the risk for prostate cancer using tabular data from Electronic Health Records. A new benchmark dataset is created from the 'Prostate, Lung, Colorectal and Ovarian Cancer Screening Trial' survey data to determine which types of machine learning models can learn from tabular data. A systematic comparison of five supervised and five unsupervised machine learning models is conducted, including Deep Learning models such as Stacked Denoising Autoencoders and TabNet and classical machine learning algorithms such as Gradient Boosting Decision Trees and Support Vector Machines. The models are evaluated for prediction performance using accuracy, AUROC, precision and recall, as well as reliability diagrams and the Integrated Calibration Index for model calibration. In addition, we propose a model-agnostic nomogram generator based on local surrogate modelling, designed to support patient-level explanations on arbitrary complex models. Evidence was found that the addition of predictor variables from health data increases prediction performance. The comparison results also show that a Multi-Layer Perceptron can match the baseline Logistic Regression in accuracy and AUROC. In contrast, unsupervised pretraining using autoencoders show no significant improvement over traditional approaches. Consequently, we can confirm that linear models remain competitive on learning from tabular data, while neural networks promise to improve prediction performance. Using the individualised nomograms proposed, physicians can benefit from novel neural architectures while retaining the ability to conduct patient-level risk assessments.
11.11.21	Elements of intelligence in cognitive robotics via cross-modal learning	Prof. Igor Farkas
In order to interact intelligently with the world, the cognitive robots must acquire a number of abilities typically linking different modalities. The abilities include vision, motor skills, sense of touch or language understanding, to name a few. In the talk, I will present an overview of several examples of simple models based on artificial neural networks. The models are trained by supervised, unsupervised or reinforcement learning, all of which are relevant in human learning. Each model is inspired by a concrete neural or cognitive phenomenon and tested in a simulated humanoid robot iCub.
09.11.21	Building a Multimodal Dataset for Transfer Learning of an Automatic Lipreading System in Neural Networks	Gerald Schwiebert
In this Thesis we present the dataset we created called GLips (German Lips) consisting of 250,000 videos of the faces of the speakers of the Hessian Parliament. The data was copied from the official YouTube channel with the permission of the Hessian Parliament and processed into a format similar to the LRW (Lip Reading in the Wild) dataset using an automatic pipeline. Our goal is to use transfer learning on both datasets to show that lipreading has language-independent features, so that datasets of different languages can be used to improve lip reading models. Furthermore, we address the difficulties regarding copyrights and DSGVO in the automated acquisition of large video datasets for the creation of datasets in Germany.
02.11.21	Snapture - a Hybrid Hand Gesture Recognition System	Hassan Ali
Hand gestures are essential in everyday human communication and have become an integral part of Human-Robot Interaction. They enable an intuitive form of interaction and provide a seamless HRI experience. Powered by deep learning, many dynamic gesture recognition frameworks have been proposed in recent years. Despite their impressive results, they fall short in recognizing some unique characteristics of each hand gesture. Some gestures are driven by motion, while others require a precise interpretation of the hand details. Additionally, many of these approaches rely on multimodal data, which might limit their robustness for robot applications. In this thesis, we investigate the integration of the motion and hand pose aspects of gestures using RGB data only. Our proposed hybrid gesture recognition framework called Snapture is an extension of the CNNLSTM model and considers the hand posture through an additional static channel. Furthermore, our architecture can determine whether the specific hand and fingers arrangement is needed to recognize a gesture based on its motion profile. Our evaluation of Snapture on the GRIT and Montalbano datasets shows promising results of the framework and that it can surpass the performance of the CNNLSTM architecture. Moreover, our analysis on a gesture class basis demonstrates that integrating the hand details improves the classification of subtle and indistinctive movements, which can be vital in critical scenarios.
26.10.21	A taxonomical foundation for explainable AI and the research-reality gap	Florian Woeste
Artificial intelligence (AI) and machine learning continue to produce breakthrough sand set record benchmarks in computer vision, natural language processing and more. Yet, industrial applications often struggle to reproduce research results or implement models successfully due to a lack of stakeholder trust and acceptance. Explainable AI is an emerging field that offers the potential to provide in-sights to remedy these issues. So, why isn’t everybody utilizing explainability? It turns out that a discordant research landscape has produced inconsistent definitions and disagreements about best practices and standards, starting even at the definition and use of explainability versus interpretability. The importance of target audiences and the granularity of explainable AI (XAI) goals is another point of disagreements, e.g., whether a relevant distinction exists between trust and confidence. This thesis funnels current XAI research into a workable taxonomy and presents a summary of available XAI methods. Evidence for a research-reality gap is presented by combining existing studies with new expert interviews. Findings are that a lack of successful use cases, limited knowledge about XAI in practice, and unclear applicability criteria, i.e., when to apply what model, are main drivers. To aid in closing the research-reality gap we suggest a set of systemic improvements for XAI research and propose a flowchart designed to aid decision makers in selecting appropriate XAI methods for their goals and target audiences.
12.10.21	Lifelong learning of NLP tasks with a dual-memory architecture	Jasper Gerwers
The growing dual-memory (GDM) was successfully used on the lifelong learning vision dataset CORe50. We adapt it for use on lifelong natural language processing (NLP) tasks and discuss the different aspects of the GDM in a lifelong NLP context.
05.10.21	Knowledge Extraction of Class Activation Maps	Filipe Silvestrim
Class Activation Maps (CAMs) is a XAI technique designed to expose the internal neural activation from CNNs through heatmaps. My thesis examines CAM’s ability to generate comprehensible explanations. A technique revolving around CAMs is proposed, aiming to extract textual explanation for a CNN classification task. In our work, different network architectures, datasets, and CAMs extraction techniques were investigated. Early results show a positive trend towards producing plausible textual explanations. Subsequent experiments reveal the underlying problems in extracting interpretable information from CAMs.

Summer Semester 2021

Summer Semester 2021
28.09.21	Lifelong Language Learning of Isotropic Sentence Embeddings with Self-Organizing Networks	Theresa Pekarek-Rosin
Continual lifelong language learning (LLL) has only recently started to gain traction with the introduction of pre-trained language models like BERT. These models produce contextual word embeddings, which can be adapted to many different tasks with little architectural changes and fine-tuning. However, a well-known problem with these embeddings is that they suffer from anisotropy. This leads to high degrees of cosine similarity between random words, which in turn leads to meaningless distances between word and sentence representations. On the other hand, dynamically expanding network structures like the SOINN have shown great potential for lifelong learning scenarios, due to their ability to adapt to shifting data distributions, while firmly retaining previously attained knowledge. To bridge the gap between the language representations a model like BERT produces and a SOINN structure requires, a generative flow-based network is used to translate the BERT sentence embeddings into latent Gaussian feature vectors. These isotropic embeddings are then used to train a new SOINN architecture (LB-SOINN+), which combines advantages from the Load-Balancing SOINN (LB-SOINN) and the SOINN+, in a continual learning sentiment analysis task. The results show that the LB-SOINN+ and the LB-SOINN are able to keep model performances high and prevent catastrophic forgetting over multiple task iterations. A visualization of the embeddings shows that the flow network produces similar and stable embeddings for all tasks in the setup. However, to properly introduce isotropy into the embeddings, the network needs to be trained on the dataset it will be used on, which implies that a flow-based approach to anisotropy might not be fitting for all LLL scenarios.
21.09.21	Pruning Neural Networks with Supermasks	Vincent Rolfs
We present a novel method for reducing neural network size and enhancing performance based on the Lottery ticket Hypothesis and Supermasks. We evaluate the method on the MNIST dataset and achieve an n-percent reduction compared to established grid search methods.
14.09.21	Object-Oriented Hierarchical Reinforcement Learning	Kemal Adibelli
Reinforcement learning enables intelligent agents to learn and adapt to sequential decision-making tasks. However, current reinforcement learning algorithms require a huge amount of training data, which is impractical in real-life applications. One way of increasing the sample efficiency is the hierarchical abstraction, hierarchical reinforcement learning. In hierarchical reinforcement learning, a complex decision-making task is decomposed into smaller, simpler sub-tasks. In this thesis, I will implement an object-oriented goal representation in hierarchical reinforcement learning by modifying the hierarchical actor-critic algorithm to evaluate the influence of an object-oriented representation regarding sample efficiency.
31.08.21	Explainable Continual Learning for Robotics	Prof. Chu Kiong Loo
Robotic agents have to learn to adapt and interact with their environment using a continuous stream of observations. Continual learning would then be effective in an autonomous agent or robot, which would learn autonomously through time about the external world, and incrementally develop a set of complex skills and knowledge. Since robotic agents are employed in a growing number of domains, making them explainable is an impelling priority. Understanding the motivation for a robotic agent’s behavior is important for successful collaborations between robots and humans. In this research, we develop an explainable continual learning architecture that combines reasoning and learning in a synergy.
10.08.21	Robot Reaching For Moving Target Using Reinforcement Learning With A Learned Model	Lennart Clasmeier
Model-based reinforcement learning with learned world models has shown great success in traditionally challenging tasks like Go, Chess, or Atari games in the last years. In this study, we apply the MuZero algorithm to a dynamic robotic reaching task to determine how well the algorithm performs in a continuous multidimensional, dynamic setting. We created a robotic reaching task in the simulation environment CoppeliaSim to teach a robotic agent to intercept a rolling ball before it rolls out of reach of the robot. We manage to train an agent to reach the ball with a high success rate and high trajectory quality. Instead of chasing the target, the agent learns how to find an interception course similar to the analytic solution.
06.07.21	Multimodal Feature Visualisation for XAI	Tom Weber
Feature visualisation techniques like Grad-Cam or Layerwise-Relevance-Propagation are explainable artificial intelligence (XAI) techniques that trace neural network decisions from the top-layer back into the input space. This enables practitioners to extract relevance values of each individual feature to interpret their importance in the task-at-hand. Through the pioneering work in the computer vision community, these methods are usually employed in unimodal visual recognition tasks. However, real-world, robotic tasks are often multimodal and require the integration of information from different sources. This research talk focuses on extending feature visualisations to multimodal tasks, the accompanying problems and future directions.
29.06.21	Learning Bidirectional Translation between Robot Action and Linguistic Description	Markus Heidrich
Learning to translate between multiple modalities is necessary for robots that interact with humans in everyday situations, as linguistic instructions have to be transformed into corresponding robot actions, and conversely the robot needs to be able to inform humans about what it is doing. Recent studies have proposed the use of recurrent autoencoders (RAEs) which learn to bidirectionally between robot action and linguistic instructions through the addition of a binding loss that forces corresponding actions and description to be close in the learned feature representation. The main goal of this thesis is to investigate the capabilities of such a model to generate qualitative actions. We adapt a proposed model by removing the binding loss with a fully connected neural net that binds the two modalities and additionally provides the decoders with a command token that informs what translation is to be executed. We show that accurate action generation can be achieved in simple scenarios. When testing in a continuous action space we show that the model struggles to perform well, sufficiently.
22.06.21	Speech Recognition with Pre-trained Deep Neural Networks Aided by Visual Context	Julian M. Behrensen
With speech recognition growing in real-life relevance and the possibility to include visual information to strengthen the recognition process, we explore two approaches in the multi-modal speech recognition domain. Work assignments or questions for a robot occurring in human-robot interaction are often tied to the surrounding of the interacting parties. This leads to the opportunity to utilise an object recogni- tion neural network to extract the additional information. Our two multi-modal automatic speech recognition implementations rely on audio preprocessing through a speech recognition model to score the generated utterance candidates. The first approach, Word-Treshold-Rescoring, shifts the candidates’ scores based on their correlation to the visual data, whereas the second approach modifies the masking mechanism of the GPT-2 base scorer, named GPT-2 FU, to incorporate the visual information. The modified GPT-2 approach in combination with the visual factor rescoring, we call Word-Threshold-Rescoring, leverages the additional information further. This thesis aims to show the capabilities of these two approaches for multi-modal automatic speech recognition developed and tested on the How2 dataset. Our results show that the current implementation of the Word-Threshold-Rescoring cannot, on average, improve the speech recognition performance, whereas the GPT-2 FU model shows an improvement.
08.06.21	Learning to Autonomously Reach Objects with NICO and Grow When Required Networks	Nima Rahrakhshan
The act of reaching is a fundamental yet complex skill for a robotic agent, requiring a high degree of visuomotor control and coordination. Considering dynamic environments, a robotic agent capable to autonomously react and adapt to novel information is desired. This thesis adopts a developmental robotics approach by developing internal models embodied in the humanoid robot NICO, through which visuomotor coordination is autonomously learned to enable the act of reaching an object. The internal models are implemented with Grow When Required Networks, which learn and build the associations between the temporally correlated sensorimotor information acquired through body-environment interaction. The humanoid robot NICO learns more complex motoric behavior increasingly, by first learning how to direct the gaze towards a visual stimulus, then to control the arm, and finally to learn eye-hand coordination through which the reach action emerges. Furthermore, an environmental change, introduced by a morphologic change in NICO’s body, shows the adaptability of the proposed model to an unexpected event. The humanoid robot NICO is able to reach objects with a 76% success rate, with most of the error stemming from NICO reaching too far or too short.
01.06.21	Embodied Language Learning with Paired Variational Autoencoders	Ozan Özdemir
Language acquisition is an integral part of developmental robotics, which aims at understanding the key components in human development and learning to utilise them in artificial agents. Similar to human infants, robots can learn language while interacting with objects in their environments and receiving linguistic input. This process, also coined as embodied language learning, can enhance language acquisition in robots via multiple modalities including visual and sensorimotor input. In this work, we explore ways to translate a simple action in a tabletop environment into various linguistic commands based on an existing approach which exploits the idea of multiple autoencoders. While the existing approach focuses on strict one-to-one mappings between actions and descriptions by implicitly binding two standard autoencoders in the latent space, we propose a variational autoencoder model to facilitate one-to-many mapping between actions and descriptions. Additionally, for extracting visual features, we employ channel-separated convolutional autoencoders to better handle complex visual input. The results show that our model outperforms the existing approach in associating multiple commands with the corresponding action.
25.05.21	GASP: Gated Attention for Saliency Prediction	Fares Abawi
Saliency prediction refers to the computational task of modeling overt attention. Social cues greatly influence our attention, consequently altering our eye movements and behavior. To emphasize the efficacy of such features, we present a neural model for integrating social cues and weighting their influences. Our model consists of two stages. During the first stage, we detect two social cues by following gaze, estimating gaze direction, and recognizing affect. These features are then transformed into spatiotemporal maps through image processing operations. The transformed representations are propagated to the second stage (GASP) where we explore various techniques of late fusion for integrating social cues and introduce two sub-networks for directing attention to relevant stimuli. Our experiments indicate that fusion approaches achieve better results for static integration methods, whereas non-fusion approaches for which the influence of each modality is unknown, result in better outcomes when coupled with recurrent models for dynamic saliency prediction. We show that gaze direction and affective representations contribute a prediction to ground-truth correspondence improvement of at least 5% compared to dynamic saliency models without social cues. Furthermore, affective representations improve GASP, supporting the necessity of considering affect-biased attention in predicting saliency.
18.05.21	Efficient Robotic Object Existence Prediction by Occlusion Reasoning	Mengdi Li
Reasoning about potential occlusions is essential for robots to efficiently predict whether an object exists in an environment. Though existing work shows that a robot with active perception can achieve various tasks, it is still unclear if occlusion reasoning can be achieved. To answer this question, we introduce the task of robotic object existence prediction: when being asked about an object, a robot needs to move as few steps as possible around a table with randomly placed objects to predict whether the queried object exists. To address this problem, we propose a novel recurrent neural network model that can be jointly trained with supervised and reinforcement learning methods using a curriculum training strategy. Experimental results show that 1) both active perception and occlusion reasoning are necessary to successfully achieve the task; 2) the proposed model demonstrates a good occlusion reasoning ability by achieving a similar prediction accuracy to an exhaustive exploration baseline while requiring only about 10% of the baseline's number of movement steps on average; and 3) the model generalizes to novel object combinations with a moderate loss of accuracy.
11.05.21	Variational autoencoder for speech enhancement with a noise-aware encoder	Huajian Fang
Recently, a generative variational autoencoder (VAE) has been proposed for speech enhancement to model speech statistics. However, this approach only uses clean speech in the training phase, making the estimation particularly sensitive to noise presence, especially in low signal-to-noise ratios (SNRs). To increase the robustness of the VAE, we propose to include noise information in the training phase by using a noise-aware encoder trained on noisy-clean speech pairs. We evaluate our approach on real recordings of different noisy environments and acoustic conditions using two different noise datasets. We show that our proposed noise-aware VAE outperforms the standard VAE in terms of overall distortion without increasing the number of model parameters. At the same time, we demonstrate that our model is capable of generalizing to unseen noise conditions better than a supervised feedforward deep neural network (DNN). Furthermore, we demonstrate the robustness of the model performance to a reduction of the noisy-clean speech training data size.
11.05.21	Target Speaker Text-to-Speech Synthesis from a Face Image Reference using Global Style Embeddings	Björn Plüster
The existence of a learnable cross-modal association between a person’s face and their voice is recently becoming more and more evident. This provides the basis for the task of target speaker text-to-speech (TTS) synthesis from face reference. In this thesis, we approach this task by proposing a cross-modal model architecture combining existing unimodal models. We use a Tacotron 2 multi-speaker TTS model with speaker embeddings based on Global Style Tokens that model a speaker’s voice from a short voice reference. We teach a FaceNet face encoder to predict these embeddings from a static face image reference and thus predict a speaker’s voice and vocal characteristics. The use of proven and readily available models enables quality speech synthesis and an easily extensible model architecture. Experimental results show good matching ability while retaining better voice naturalness than previous models. We examine these results and our architecture's limitations and discuss possible avenues of improvement for future work.
04.05.21	Lifelong Learning from Event-based Data	Vadym Gryshchuk
In this thesis, we address lifelong learning from data generated by event cameras. Contemporary methods for incremental learning are predominantly based on the frame-based data recorded by conventional shutter cameras. An event camera delivers high dynamic range, low power consumption, and high temporal resolution, thus making it suitable for dynamic environments, in which knowledge must be accumulated incrementally. We propose an architecture for lifelong learning that is composed of two modules: a feature extractor and an incremental learner. The feature extractor is utilized as a self-supervised sparse convolutional neural network, which processes event-based data. The incremental learner uses a habituation-based method that works in tandem with other existing techniques to mitigate catastrophic forgetting. The conducted experiments show that our proposed method is capable of effective incremental learning without forgetting previously learned experiences.
28.04.21	Disentanglement, Compositionality, Specification: Representation Learning with Generative Adversarial Networks (PhD thesis defense)	Tobias Hinz
Learning good data representations is one of the most important tasks for machine learning. Generative adversarial nets (GANs) offer a powerful framework for learning image representations and offer functionalities such as image generation and editing. This thesis introduces three approaches to improve the learned image representations and their usage for several downstream tasks. Based on these approaches we show how we can use GANs to learn disentangled, compositional, and highly specific representations and how these can be applied to various tasks such as image generation, image-to-image translation, text-to-image synthesis, and image editing.

Winter Semester 2020/2021

Winter Semester 2020/2021
30.03.21	Human Pose Estimation with LSTM Confidence Boosting (MSc thesis defense)	Sachin Sharma
Determining human pose from images is a challenging task that has been of great interest in Human-Computer Interaction studies. It is an essential feature for various applications involving human tracking, activity recognition, human motion, and augmented reality applications. While the traditional convolution neural network (CNN) based approaches have shown impressive results, it falls short in complex or occlusion scenarios. This thesis addresses 2D human pose estimation for a single person in the view of novel feature extraction using a boosting network to ameliorate the human pose estimation task. We focus on getting a real-time human pose using the heat maps for each joint in the human body using a neural network approach focusing on the Convolution Neural Network (CNN) and the Convolution Long Short Term Memory (ConvLSTM) units. CNN-based architectures are good at feature extraction from images; therefore, we propose a two-fold process. First, we propose feature extraction using a state-of-the-art network and condense them to relevant features through an hourglass architecture. Second, using convLSTM to enhance those features. We intend to exploit the graphical spatial correlation among different body parts, enabling the network to learn the joints' inter-dependencies and improve network performance. Our proposed method achieves an accuracy of 86% with 50% fewer parameters compared to the state-of-the-art network. Using our convLSTM module validates the convLSTM improves the network's prediction capability and boosts the prediction's confidence. Based on network and results, we present insights about training neural networks for real-world applications.
24.03.21	Periodicity, Surprisal, Attention: Skip Conditions for Recurrent Neural Networks (PhD thesis defense)	Tayfun Alpay
Recurrent neural networks are powerful models used for sequence learning. This thesis addresses the typical limitation of recurrent neural networks to fully process and update for every timestep of an input sequence, even if the presented information is redundant or not relevant to the given task. Based on the conditional computation framework, we introduce and investigate periodicity, surprisal, and attention as constraints to model effective skipping. We evaluate the presented methodologies in the context of natural language processing and explore how we can balance potential trade-offs between model performance and skipping.
23.03.21	A Novel Approach to Flight Phase Identication using Machine Learning (MSc thesis defense)	Emy Arts
Flight phases are essential for many applications of aviation research. This project presents a novel machine learning model for the identification of flight phases. The identification is performed on aircraft trajectory data, which in contrast to other flight data, is a publicly available resource obtained through the Automatic Dependent Surveillance Broadcast (ADS-B) concept. With the help of supervised simulation data, a model that aims at improving state-of-the-art flight phase identification on trajectory data is developed. The model combines K-means clustering, allowing the segmentation to capture transitions between phases more closely, with a Long Short-Term Memory (LSTM) network, able to learn the dynamics of a flight. The improvement of this work, compared to the state-of-the-art model by Sun et al.[1] based on fuzzy logic, comprises: increasing the accuracy by more than 2%, adhering to the International Civil Aviation Organisation (ICAO) standard, and increasing the number of flight phases to include take-off, landing, and others. The latter shows potential, considering the performance on simulation data, however, requires more realistic training data to achieve similar performance on actual flights. [1] Junzi Sun, Joost Ellerbroek, and Jacco Hoekstra. Large-scale flight phase identification from ads-b data using machine learning methods. In 7th International Conference on Research in Air Transportation, 2016.
02.03.21	Speed-Reading the Input of an Audio-Visual Speech Recognizer (MSc thesis defense)	Fabio Wendt
Audio-visual Speech Recognizers (AVSR) have to deal with many different multimodal fusion challenges, like the different representations of the video and audio data, the different sampling rates, and missing alignments. Additionally, many AVSRs show a low influence of the video modality on the results and making the additional computational cost by adding the video modality to an Automatic Speech Recognizer (ASR), a lousy tradeoff. A possible solution for some of the challenges can be integrating speed-reading techniques like a Skip-RNN into an AVSR. This architecture learns to skip input tokens and lower the computational cost and help identify the essential tokens in a sequence. This work focuses on analyzing the effects of integrating a Skip-RNN into an AVSR, like the system's update rates and performance. For this work, the video encoder, the audio encoder, and the decoder get equipped with a Skip-RNN. The results show success in reducing the number of states that need to be updated and getting models as good as the baseline models, CMA AVSR and DAM AVSR. Additionally, the experiments found models that showed an increased cross-modal alignment, making the models better at dealing with noisy data and the video channel's influence on the overall decoders' output.
09.02.21	Evaluating Generative Teaching Networks As A Means To Accelerate Neural Architecture Search (BSc thesis defense)	Lennart Bergmann
Generative Teaching Networks (GTN) are a new method to quickly train and evaluate neuralnetworks first introduced in 2019 by Such et al.[1] GTNs are able to accelerate the training of neural architectures by condensing the training data into much fewer images by learning to generate batches of images that are optimized to train the model more efficiently than real data would, thereby reducing the data and therefor also time needed to train a model. One of the most exciting use cases is the automated search for better neural architectures where thousands of architectures need to be trained & evaluated. This work largely focuseson reproducing and expanding upon the results achieved by Such et al. Results show that GTNs work on MNIST and CIFAR10 and can be used to find well performing architectures via random search. A novel contribution of this paper is the combination of GTNs with Neural Architecture Optimization (NAO). I show that using GTNs to train and evaluate the architectures in NAO can lead to a 75x speed-up while still finding architectures of comparable performance. [1] https://arxiv.org/pdf/1912.07768.pdf
26.01.21	Haptic Object Classification with Self-Organizing Neural Networks (BSc thesis defense)	Sophie Gräfnitz
In contrast to the visual domain, there have been few object recognition experiments with unsupervised self-organizing artificial neural networks in the haptic domain. Based on the latest related approaches, we conduct a series of experiments with a realistic 16-object data set to evaluate if unsupervised self-organizing approaches are a suitable choice for haptic object clustering and classification. Therefore we implement the self-organizing unsupervised models SOM, GrowingGrid and Growing Cell Structure (GCS). All of these approaches were used in the haptic domain before. The most promising model (GCS) is optimized and evaluated whether it can compete with the prior best classifier on the 16-object data. Because we are working with unsupervised learners we get the opportunity to test their ability to generalize object properties in order to approximate continuous learning. Hence we conduct a generalization experiment with the GCS to test its generalization capabilities with the given data set.
19.01.21	Neurorobotic Audio-visual Object Classification (MSc thesis defense)	Glareh Mir
Audio-visual object recognition is a natural human ability. Recent advances in the field of neurorobotics have made it possible to replicate this behavior in humanoid robots. This study aims to examine multi-modal and uni-modal approaches to achieve audio-visual object classification via deep neural networks. We first created a dataset in a physical robotic scenario. The dataset is composed of synchronized image frames and raw audio samples. In the dataset 16 objects, which were also used in haptic datasets, are represented. For each object, 50 samples are collected of the robot dropping these objects onto a slope to gain audio-visual information. The objects are mostly soft toys, and some are visually or audibly similar. The data collected is then used to train and evaluate different neural models for cross-modal object recognition. The models aimed at studying the effects of integrating information from different modalities into multi-modal models: Two different approaches for each uni-modal and two multi-modal approaches are compared. The results show a clear advantage of using multi-modal information for the classification task. Furthermore, we show that using pre-trained uni-modal models in multi-modal models yields better results and higher classification accuracy than training from scratch.
12.01.21	Learning then, learning now and every second in between: Lifelong Machine Learning with a Humanoid Robot (MSc thesis defense)	Aleksej Logacjov
Lifelong learning (LL) is a long-standing challenge in machine learning due to catastrophic forgetting. The catastrophic forgetting problem states that continuously learning from novel experiences leads to a decrease in the performance of previously acquired knowledge. Two recently published LL approaches are the Growing Dual Memory (GDM) and the Self-organizing Incremental Neural Network+ (SOINN+). Both are growing neural networks that create new neurons in response to novel sensory experience. The latter approach shows state-of-the-art LL clustering performance with low memory requirements regarding the number of nodes. However, classification capabilities are not investigated. This thesis proposes an associative SOINN+ approach for classification tasks. Three main properties of the GDM model are adopted to enable this. (1) Context learning is used to facilitate the consideration of previous network activations, (2) each neuron learns a histogram of class labels from the input, and (3) additional constraints for node creation and weight adaptation are introduced. The proposed model and the GDM approach are evaluated on a novel LL object recognition dataset recorded in a nearly real-world virtual environment. In this dataset, a virtual humanoid robot is manipulating 100 different objects belonging to ten categories. To simulate different environments, real-world and artificially created background images, grouped into four different complexity levels, are utilized. The proposed associative SOINN+ shows similar state-of-the-art classification accuracy results as the best GDM architecture of this work while consisting of at least approximately 100 times fewer neurons. This reduced number of nodes results in lower memory requirements, making this approach more suitable for mobile robots.
05.01.21	Model-Free Surprise-Based Curiosity by Heuristic Value-Function Approximation in Hierarchical Reinforcement Learning (MSc thesis defense)	Armin Schaare
Artificial curiosity is often used in reinforcement learning to generate additional feedback, resulting in a more efficient and successful training of the agent, especially in sparse rewards environments. In this thesis, I introduce Intention Driven Curious HAC (IDCHAC) -- a novel approach to artificial curiosity. While many previous approaches use a forward model of the environment, comparing its predictions to actual outcomes to generate intrinsic rewards proportional to their difference, IDCHAC is model-free. By exploiting the policy's hierarchical structure and having a heuristic for the goodness of state-actions, it can obtain intrinsic rewards without a model for the environment. I evaluated IDCHAC against CHAC and vanilla HAC in four different continuous actions and states, sparse rewards environments. While IDCHAC shows an apparent performance increase over HAC in the simplest of these environments, they perform similarly in the two moderate ones, in which CHAC is better than both. However, in the most complex one, IDCHAC and CHAC fail while HAC manages to train a policy successfully. I show that a bias towards the end-goal, introduced through the heuristic, is responsible for the failure of IDCHAC in this environment. Future work needs to focus on solving this issue and expanding the possible benefits of IDCHAC to more complex environments.
22.12.20	Commonsense Explanations withNatural Language Generation (BSc thesis defense)	Christian Rahe
A major advantage of generalized language models built on the Transformer architecture is that they can adapt to tasks through fine-tuning alone. Still, many applications in task-solving contexts involve direct manipulation of the model architecture, which can be challenging without sufficient domain knowledge.In this thesis, I develop and compare different ways of representing a multiplechoice commonsense task in plaintext for a generative Transformer. To do this, I train multiple instances of GPT-2355M on the SemEval 2020 Task 4, each with a different plaintext representation of the question set, without making any modifications to the language model. I show that in an answer replication setting,embedding a natural language description of the problem specifications has a pos-itive effect on performance. However, the best overall accuracy is achieved when numbering the multiple choice options and expressing the correct option throughits corresponding number. In such a setting, I do not find any performance benefit from adding natural language prompts.
15.12.20	Model-Based Hierarchical Actor-Critic Reinforcement Learning (MSc thesis defense)	Frank Röder
Reinforcement learning agents suffer from sample efficiency. The demand for training data grows proportionally with the difficulty of the task. Model-based reinforcement learning agents address this problem of sample-efficiency by acquiring reusable knowledge. They learn a model of their environment to e.g. mimic itsproperties as a simulator or calculate the best immediate actions to take by planning ahead. Recent work of the past few years has shown that these approaches can reduce the number of environment interactions and improve overall learning. Another promising approach is hierarchical reinforcement learning, which decomposes a potentially hard task into a set of simpler subtask. Latest publications showcase the advantage of generating subtasks with short sequences of decisions to accelerate learning using interdependent policies. While most model-based approaches consider flat agents, they disregard the potential of hierarchical abstractions. As most of the hierarchical methods are model-free, it is feasible to apply model-based policy optimization for multiple levels of policies. Cognitive science and neuroscience provide strong evidence that humans continually use their internal model to simulate at different abstraction levels mentally; hence it seems promising to use this for reinforcement learning agents. This thesis proposes a novel computational architecture that expands a curiosity-driven hierarchical actor-critic framework. We introduce a stronger dynamics model to simulate entire environment episodes by predicting long action horizons to augment the agent’s experiences. Next, we demonstrate that our approach can effectively learn a dynamics model in a curiosity-driven manner to achieve a high sample-efficiency. As a result, we experimentally show that the total amount of environment episodes is significantly reduced by reaching the same or higher performance as the baseline.
08.12.20	Generalization in Multi-Modal Language Learning from Simulation (BSc thesis defense)	Aaron Eisermann
Neural Networks can be powerful function approximators, able to model high dimensional feature distributions from a subset of examples drawn from the target distribution. Naturally, they are very good at generalizing within the limits of their target function, but they often fail to generalize outside of the explicitly learned feature space. It is therefore an open topic of research whether and how neural network-based architectures can be deployed for systematic reasoning. Many studies have shown evidence for poor generalization, but they often work with abstract data or are limited to single-channel input. We know that humans learn and interact through a combination of multiple perceptions and rarely rely on just one. To investigate compositional generalization in a multi-modal setting, we automatically generate an extensible data set with multi-modal input sequences from simulation. We investigate the influence of the underlying training data distribution on compositional generalization in a minimal LSTM based network trained ina supervised, time-continuous setting. We find compositional generalization to be limited to training data with certain properties, such as data with lots of color overlap between objects. We also determine that multi-modality can strongly improve compositional generalization in settings where the visual input alone struggles to generalize.
01.12.20	Sample Efficient Reinforcement Learning in Sparse Reward Environments with Planning-Integrated Policy (MSc thesis defense)	Christoper Wulur
Model-free algorithms are well-known for their capability of learning without requiring any prior knowledge. They learn solely from their experiences to produce an optimal policy. However, it requires model-free agents to have a vast amount of samples. This condition is aggravated when dealing with sparse reward environments where most states contain zero rewards. This thesis developed a model-based approach to tackle the high sample complexity problem in sparse reward settings. A particle swarm optimization (PSO) planner is employed as the action selection mechanism, aiding agents to discover rewards more. Consequently, agents learn more efficiently, reducing sample complexity.
24.11.20	We need to go Deeper? Constructing a Preprocessor based on what Speech CNN Layers hear (MSc thesis defense)	Matthias Möller
In this work, we address the question if we can clean speech through an automatic speech recognition system. Therefore, we introduce a new preprocessor architecture that is tasked to clean spectral representations. Our preprocessor architecture is divided into encoder and decoder. The encoder consists of the hidden layers of an already trained automatic speech recognition system called Jasper. The decoder consists of two separate highway networks. The decoder reconstructs logarithmized Mel spectrograms that are normalized along the temporal axis. We train three different preprocessors that differ in the amount of reused Jasper layers. We train the preprocessor to predict clean speech from noisy speech. Furthermore, we only train the batch normalization in Jasper and the decoder. Thereby, we aim to let our preprocessor learn which kernels in Jasper support the reconstruction of clean speech. The outputs of our three preprocessors indicate that background noise was removed in each preprocessor. Further, we see tendencies that deviations to the clean spectrogram can correlate to the behavior of an automatic speech recognition system. We observe that the lower Mel bins are less reconstructed with a female speaker. We speculate that this could relate to the aspect that these bins are below the fundamental frequency. Further, we feed the output of our preprocessors back into Jasper. We notice that the preprocessor that reuses most Jasper blocks reduces the word error rate in most conditions. Additionally, we train a smaller Jasper model with the preprocessor that reuses most Jasper blocks. We observe a lower training curve than a Jasper model that was trained with a data augmentation techniques. On the other hand, we observe that the training curve is less stable. We speculate that this results from the removal of areas that are irrelevant to speech. Further, we observe that the validation curve of the model that trains with our preprocessor is lower. After training, we observe that the Jasper model that uses the output of our preprocessor outperforms the other model in clean and noisy speech.
17.11.20	Improving Model-Based Reinforcement Learning with Internal State Representations through Self-Supervision (BSc thesis defense)	Julien Scholz
Using a model of the environment, reinforcement learning agents can plan their future moves and achieve super-human performance in board games like Chess, Shogi, and Go, while remaining relatively sample efficient. As demonstrated by the MuZero Algorithm, the environment model can even be learned dynamically, generalizing the agent to many more tasks while at the same time achieving state-of-the-art performance. Notably, MuZero uses internal state representations instead of real environment states for its predictions. In this thesis, we introduce two additional, independent loss terms to MuZero’s overall loss function, which work entirely unsupervised and act as constraints to stabilize the learning process. Experiments show that they provide a significant performance increase in simple OpenAI Gym environments. Our modifications also enable self-supervised pretraining for MuZero, meaning the algorithm can learn about environment dynamics before a goal is made available.
10.11.20	Multimodal Integration for Empathy Recognition (MSc thesis defense)	Nicole Xirakia
Empathy is one key ingredient in human-human interactions that enhances our communication and understanding of each other and improves our interpersonal relationships. Considering the rapid pace within which social agents have been entering our lives over the past years, the need to improve our social interactions with them becomes more urgent. Agents that can predict human affective behavior can result in more natural human-robot interactions. This work proposes a model that integrates data from multiple modalities, audio, vision, and language, to recognize valence-based affective behavior in a dyadic interaction setup. We investigate the complexity of integrating and processing asynchronous multi-sensori data by exploring the capabilities of two novel recurrent neural network architectures, the SkipRNN and Phased LSTM, designed to ignore irrelevant and repetitive input data. We compare the performance of the two architectures to the conventional LSTM. We also examine the eGeMAPS acoustic set’s performance, designed to be applied to valence-based affective speech recognition tasks, by comparing it against the state-of- the-art MFCC acoustic set. Our model is trained and evaluated on the OMG-Empathy datasets that employ valence-based annotations, and it is set around semi-scripted dyadic human interactions. Our results show that valence-based tasks require higher volume datasets to properly generalize the task, where most of our experiments suffered from overfitting the training set. In terms of synchronizing the different modalities, the SkipRNN achieved the highest performance across all other experiments, where the baseline LSTM achieved a comparable performance. In contrast, Phased LSTM achieved the worst performance across all experiments, which we believe is due to the small size of the OMG-Empathy dataset, combined with customizing the network to ignore any repetition in the input data. Finally, the acoustic set performance results showed that eGeMAPS could increase the valence recognition accuracy when combined with the LSTM model. At the same time, it achieved a significantly worse performance in the SkipRNN and Phased LSTM models compared to the MFCC acoustic set. This demonstrates that eGeMAPS require high volume valence-based datasets to increase the learning performance in tasks that aim to recognize affective behavior.
03.11.20	Using the Reformer for Efficient Summarization (BSc thesis defense)	Yannick Wehr
The Transformer model has become a popular architecture for natural language processing tasks, but models have become so large that they are only feasible to run in very large and costly settings. To address this problem, the Reformer, a more efficient version of the Transformer, has been created. To advance the research of automatic summarization systems, we apply the Reformer to two summarization datasets. We run two experiments to investigate the individual techniques used by the Reformer: Our first experiment compares chunked feed forward networks and reversible layers to a baseline Transformer. Our second experiment compares LSH attention and full attention.

Summer Semester 2020

Summer Semester 2020
27.10.20	Towards Optimizing Dialogue Actions For Improved Sentiment Receival (BSc thesis defense)	Hendrik Pfennig
Changing features of language, such as Sentiment or Tense, are a difficult task for neural networks due to the complicated structure of human language. Previous work has tackled this problem by shifting representations of language in latent space. In this work an alternative approach to optimize quantifiable features of language is presented. The main difference to state-of-the-art approaches is that optimization will take place in vector space, rather than latent space. The approach uses an LSTM-based binary Sentiment Classifier combined with a Word2Vec Word Embedding. Results indicate that the dimensionality of the Word Embedding is a key factor for successful optimization. A Word Embedding with 300 dimensions was not suitable for the optimization, while an Embedding with 20 dimensions yielded acceptable results. The reason why this approach is novel is the possibility to combine it with a Language Generation Model, potentially enabling dialog agents to manipulate the Sentiment of interlocutors. As for the Language Model, further research is required to design an architecture that produces outputs in a differentiable way.
20.10.20	Navigation with Pepper Robot with ROS and Octomap (BSc thesis defense)	Florian Kock
The robot Pepper of Softbank Robotics is a humanoid mobile robot developed to get in contact with humans socially. Due to its mobility, size and agility, the robot needs to know its surroundings in a three-dimensional representation for advanced tasks. By evaluating and merging different sensors, an accurate mapping of the environment can be created to generate plans for actions. This thesis gives an approach to implement ROS, OctoMap and potentially a navigation system on Pepper's local computer. Furthermore, it indicates distinct technical limitations with the hardware and software environment.
13.10.20	Improvement of the Inference Time of a German State-of-the-Art Automatic Speech Recognition System (MSc thesis defense)	Victoria Wagner
In current automatic speech recognition (ASR) systems the focus is on the best possible performance, measured in word error rate ( WER), of the models. However, it has been shown that inference time is also an important factor in the adaptation of ASR systems. It is important because a long inference time leads to uncertainty on the user side. Therefore, in this thesis, we address the issue of training a current approach to speech recognition in German and improving it with respect to floating operations per second (FLOPs). To this end we train Jasper, an end-to-end ASR system that uses convolutional layers and introduces a novel optimizer called NovoGrad and residual topology. Jasper is trained in German to extend the little research on non-English speech recognition systems. We use an automated approach called MorphNet, which offers the possibility to directly optimize a neural network with respect to specific resources, to optimize our trained model. We successfully optimize our developed model. We find that the pruning of neurons with low activations in networks trained with NovoGrad is sufficient to reduce FLOPs and even improve the networks performance. Thus, we find that time and computationally intensive steps of the MorphNet algorithm are not necessary.
06.10.20	Attention Mechanism for Dynamic Weighting between Audio-Visual Inputs for Speech Recognition (MSc thesis defense)	Mariela Sanchez
Speech recognition is a challenging task in artificial intelligence. The idea of using audio and visual cues, working in conjunction, could enhance the process of recognizing speech. Previous research was mainly about using either one of these cues or combining both to produce recognized words or sentences. In this thesis, a dynamic weighting between audio inputs and visual inputs is proposed. This weighting mechanism, known as attention, computes audio and video streams' weights based on their reliability under noisy and clean conditions. Several methods have been proposed before to improve recognition performance, fusing audio and visual modalities. The technique used in this thesis is a modality attention mechanism that relies on the features of each modality, either audio or video, to perform word recognition. Experimental results indicate that the proposed model could enhance audio-visual speech recognition under noise conditions for both streams. However, further research would be necessary to refine the mechanism and execute experiments on a larger scale.
22.09.20	Localization and orientation detection of Humanoid Robots (BSc thesis defense)	Eric Bergter
The RoboCup is annual international robotics competition with the goal to imitate human like soccer playing. One of many challenges is the coordination between robots, for example when they want to pass the ball to another player. For that, a robot needs to be able to detect and determine the location and orientation of other robots. To accomplish this task, the robots have to rely almost completely on the visual sensor, a camera located in it head. In this project the problem of the pose estimation is approached by simplifying and reducing it into object classification problem. This is then trained and tested with the YOLOv3 model. Furthermore, the hole process from start to finish should be as simple and as automated as possible. Different ways to generate, train and test the models are explored. My research findings indicate that it is possible to approach simple pose estimation as object classification, and partially automate the data generation with CGI and AprilTag Software. The models are capable of detecting the location of the robots correctly in ~80% of the tested cases, and are able to predict the rotation with an average error of ~60°. After testing and comparing the results, the models trained only with CGI data were able to compete in a real world scenario with models trained with manually labeled data.
15.09.20	Combining Monte Carlo Localization with a Neural Network based Particle Filter Seeder for solving the kidnapped robot problem	Massimo Innocentini
The aim of this paper is to present an augmentation to the particle filter by introducing a neural network based seeder of particles. The seeder is based on a neural network classifier which is trained on 2D LiDAR data. The neural network leverages the concept of place fields to represent specific locations in space which are then used as reference by the particle filter to sample additional particles. Particle filters are often used as basis for mobile robot localization, a common application is called AMCL (Adaptive Monte Carlo Localization). The utilization of the seeder results in a more robust localization, which is capable of recovery from robot kidnapping and globally localize while reducing the amount of particles required compared to the standard AMCL.
25.08.20	Effect of Demonstrations in Reinforcement Learning	Ankit Srivastava
Reinforcement Learning (RL) has a persistent problem while dealing with sparse rewards. In the past, many techniques and methods have been introduced to overcome this problem. One of the techniques called Hindsight Experience Replay (HER) has been introduced over Deep Deterministic Policy Gradients (DDPG) algorithm to allow learning from sparse rewards. However, HER based RL has an exploration problem with the environment due to which Nair et al. used demonstrations which builds on top of DDPG and HER to overcome this exploration and speedup over RL on simulated robotics tasks. Their work can accelerate learning in sparse environments but RL needs lots of samples for complex tasks. In our work, we tried to extend the work of Nair et al. to approx the best number of demonstrations needed to solve the complex task FetchPickAndPlace, avoiding exploration in sparse environments. For this, we made an additional assumption that we can collect a large set of demonstrations. Furthermore, we trained our model with a different subset of demonstrations (10, 100, 500, 1000, 2500, 5000, 7500 and 10000) by controlling the number of samples to be used from the demonstration buffer per batch of size 1024. We observed, passing 512 samples from demonstration buffer per batch of size 1024 would achieve the best performance. Also, we concluded that using demonstrations set of 2500 for FetchPickAndPlace task would enhance the agent learning.
18.08.20	A Comparison of CNN-Based Methods for Real-Time Visual Object Recognition (BSc thesis defense)	Jiayi Wang
Object detection is an essential ability for a robot system operating in populated environments. Since objects are often embedded in complex scenes, the detection results are strongly influenced by changing environmental conditions. This thesis provides a comparative evaluation of object recognition models based on convolutional neural networks (CNN) and investigates the effects of changing background, varying object size and distance, and occlusion on the accuracy and reliability of object recognition. Three architectures commonly used in modern object detectors, which are You Only Look Oncev3 (YOLOv3), Single Shot MultiBox Detector(SSD) and Mask R-CNN, were selected for the experiments. The focus is on the detection of commonplace objects in household and the Autonomous Robot Indoor Dataset was chosen to model the natural indoor scene. With this constellationYOLOv3 offers the highest accuracy according to the metrics Average Precision(AP). Both with occluded and small or distant objects YOLOv3 achieved an ac-curacy of more than 0.85 AP on average. In addition, YOLOv3 also outperforms its competitors with an inference time of 29 ms, showing a good balance between accuracy and speed. The Mask R-CNN achieved second place with a slightly lower performance especially in occlusion. SSD, on the other hand, always detects smaller hidden objects worse than YOLOv3 and Mask R-CNN.
11.08.20	Real-time indoor collision avoidance through monocular depth estimation (BSc thesis defense)	Ramtin Nouri
When using robots a functioning collision avoidance algorithm is very important, as the consequences can be very expensive if any damages occur. Existing collision avoidance methods are usually based on sensor data like SONAR, infrared, LIDAR or RGB-D cameras. Even though inexpensive sensors exist and some have a good accuracy each one has its disadvantages. Some sensors generally only measure the distance in a very restricted angle and comparably small range in front of them. LIDAR sensors can detect a great horizontal field of view but not a vertical one, so that it might not detect objects under the sensor, like objects on the floor, or above it, like the top of a table. The robot perceives the space as passable but will collide with the objects on the floor or might collide with the table. RGB-D camera's measuring distance is limited and surfaces, that are transparent, glossy, or absorbing, very often can not be measured correctly. In this paper, I present a neural network approach of estimating the depth using only a single independent RGB frame. Standard cameras already are mostly available and I show how utilizing them can lead to fewer collisions. I train a deep residual convolutional neural network using data gathered from a simulator and calculate when the robot should stop based on the predicted depth map.
04.08.20	Gesture Recognition with Neural Networks and OpenPose (BSc thesis defense)	Lennart Uhrmacher
Gestures and body language are not only an important part of non-verbal communication between humans, but also a popular method of communication between humans and robots. There are many approaches to recognize gestures, but many of them have spatial limitations or are not suitable for real time. As part of the RoboCup, we present a new approach to the visual detection of gestures, to allow the first communication between the soccer playing robots and the human coach. Our approach is the combination of the multi-person pose estimation framework OpenPose, which allows the detection of body joints and limbs in images, and neural networks. We also present two datasets for gesture recognition, which were created to test the performance of our approach.
28.07.20	On the Influence of Document Length for Sentiment Analysis (BSc thesis defense)	Erkan Karabulut
The present bachelor thesis investigates the so-called attention mechanism on the task of sentiment analysis. Since feedback for products and services in form of customer reviews are very common today, we will use german product reviews as our dataset. The objective is to further investigate the performance of our models with two kind of datasets which distinguish in the part that one dataset has less, two to four, sentences per document and the other dataset has at least seven sentences per document. The experiments show that we get two clusters in which the two datasets have end up for our three models. One cluster, which builds from the ”short” dataset, is around 54% accuracy and the cluster, which comes from the ”long” dataset, is located at around 64% accuracy. On both datasets, the models with attention mechanisms are superior then our baseline LSTM network.
07.07.20	Auditory Perception for Deep Reinforcement Learning in Robotics (BSc thesis defense)	Thilo Fryen
Over the last decades, deep reinforcement learning has been a very successful approach to machine learning. However, there is a lack of research that combines reinforcement learning with auditory observations. This thesis explores the use of auditory perception for deep reinforcement learning. An agent learns to play the theremin with a 6 DOF robotic arm in a simulated environment. I train it with deep deterministic policy gradients and hindsight experience replay and provide only auditory observations for the task. To enable the agent to learn to play a sequence of notes, I introduce the time-dependent goal, an extension of using a 'goal' as an additional parameter for the value function. I compare the influence of the following time- to frequency-domain transforms on the performance of the agent: Constant-Q transform, short-time discrete Fourier transform STFT, and STFT with mel scaling. My evaluation also investigates the impact of actuating the robotic arm with inverse kinematics versus direct joint manipulation. Additionally, I examine the agent's robustness, specifically towards auditory noise. I deploy a policy that was pretrained in the simulation on a real robot with 1 DOF. I conclude that auditory perception is useful for deep reinforcement learning in robotics. I provide a video of the experiment results at https://cloud.mafiasi.de/s/yQCbyC7kqE9PRLt
16.06.20	Detection and Evaluation of Whole-body Posture and Movement	Nicolas Duczek
To detect the posture of a person and being able to identify the specific body parts like limbs or the head is among the easiest tasks for human, while it is among the harder tasks for an artificial intelligence. However, it is hard for both man and machine to evaluate the quality of specific sets of posture and movement for example in weight training or physiotherapy without proper knowledge about the exercise to be performed. The question arises how an artificial intelligence can gain this knowledge and process it in a way to give reasonable feedback about the correctness of the performed exercise to a person. Therefore, in this talk different approaches of posture detection will be discussed and previous works on scoring possibilities explored.
09.06.20	Towards Robust Visual Reasoning for Embodied Agents	Jae Hee Lee
Reasoning about the properties of physical objects and their relations seems to be one of the basic skills that humans possess. For example, a three-year old child can easily answer questions like "Is there a white swan in the lake?" or "Is there a black cat on the roof?". What is remarkable here is that the child can also answer a question like "Is there a black swan on the roof?", although most likely she has not seen a black swan in her life, not to mention a black swan on a roof. An AI agent, however, is not yet able to successfully handle such unseen combinations of learned concepts. In this talk I want to discuss the aforementioned problem in more detail and then lay out how we want to tackle the problem within the CML C4 project.
26.05.20	Overview of Compression Methods for Neural Networks	Lasse Westphal
In recent years state-of-the-art networks in domains like speech-recognition and image classification have immensely grown in size. This makes them hard to train, expensive to run and on some embedded platforms even impossible to use. That is why the compression of existing models and the development of smaller networks, lately gained more attention. In this paper, four groups of compression methods are presented: pruning, that removes unimportant weights, low-rank approximation, that substitutes layers by layers with smaller rank, quantization methods that reduce the storage requirement per weight and knowledge distillation, where a teacher offers a student network information about the domain additionally to the ground-truth labels. Furthermore, we show how these approaches can be combined and what else besides compression can be done, to attain small, efficient models.
19.05.20	Neural Networks for Detecting Irrelevant Questions during Visual Question Answering	Mengdi Li
Visual question answering (VQA) is a task to produce correct answers to questions about images. When given an irrelevant question to an image, existing models for VQA will still produce an answer rather than predict that the question is irrelevant. This situation not only shows that current models do not truly understand images and questions but also can be misleading in real-world application scenarios. Based on our hypothesis that abilities required for detecting irrelevant questions are similar to those required for answering questions, we study what performance a state-of-the-art VQA network can achieve when trained on irrelevant question detection. Then, we analyse influences of reasoning and relational modeling on the task of irrelevant question detection. Our experimental results show that a VQA network trained on an irrelevant question detection dataset outperforms existing state-of-the-art methods by a big margin on the task of irrelevant question detection. Ablation studies show that explicit reasoning and relational modeling benefits irrelevant question detection as well as the VQA task. At last, we investigate a straight-forward idea of integrating the ability to detect irrelevant questions into VQA models by joint training with extended VQA data containing irrelevant cases. We find that joint training has a negative impact on the model's performance on the VQA task, while the accuracy on relevance detection is maintained. In this paper we claim that an efficient neural network designed for VQA can achieve high accuracy on detecting relevance, however integrating the ability to detect relevance into a VQA model by joint training will lead to degradation of performance on the VQA task.
12.05.20	Multimodal Target Speech Separation with Voice and Face References	Leyuan Qu
Target speech separation refers to isolating target speech from a multi-speaker mixture signal by conditioning on auxiliary information about the target speaker. Different from the mainstream audio-visual approaches which usually require simultaneous visual streams as additional input, e.g. the corresponding lip movement sequences, in our approach we propose the novel use of a single face profile of the target speaker to separate expected clean speech. We exploit the fact that the image of a face contains information about the person's speech sound. Compared to using a simultaneous visual sequence, a face image is easier to obtain by pre-enrollment or on websites, which enables the system to generalize to devices without cameras. To this end, we incorporate face embeddings extracted from a pretrained model for face recognition into the speech separation, which guide the system in predicting a target speaker mask in the time-frequency domain. The experimental results show that a pre-enrolled face image is able to benefit separating expected speech signals. Additionally, face information is complementary to voice reference and we show that further improvement can be achieved when combing both face and voice embeddings.
05.05.20	Evaluating unsupervised GANs on synthetic datasets of increasing complexity (MSc thesis defense)	Marco Jendryczko
Much focus is put on GANs to produce high-resolution, realistic images from highly complex datasets of human faces. Only very little research engages with the learning behavior of GANs to explore theories that minimize the current trial and error nature in the field. This paper provides the new synthetic dataset Sort-of-Shapeworld, which should serve as a mutual foundation for more systematic research in this area. Aside, a new metric is introduced to measure the distance between frequency distributions, which helps to quantify GANs modeling capabilities. We found, on feature-level, different distributions to influence GAN performance positively and negatively, depending on the architecture. Further, we propose different approaches to extract self-defined features from GAN generated images and discuss their practical value, as well as their impact on evaluation.
28.04.20	Variational Autoencoder with Global- and Medium Timescale Auxiliaries for Identity- and Emotion Recognition from Speech (BSc thesis defense)	Hussam Almotlak
Unsupervised learning is based on the idea of self-organization to find hidden patterns and features in the data without the need for labels. Variational autoencoders (VAEs) are unsupervised learning generative models that create low dimensional representations of the input data and learn by regenerating the same input from that representation. Recently, VAEs were used to extract representations from audio data, which possess not only content-dependent information but also speaker-dependent information such as gender, health status, and speaker ID. Speaker-dependent information does not change with time, unlike the content-dependent information. VAEs with two timescale variables were then introduced to disentangle speaker-dependent from content-dependent information. This work introduces a third, i.e. medium timescale into a VAE. So instead of having only a global and a local timescale variable, this model holds a global, a medium, and a local variable. We tested the model on three downstream applications: speaker identification, gender classification, and emotion recognition, where each hidden representation behaved better on some specific tasks than the others. Speaker ID and gender were best reported by the global variable, while emotion was best extracted when using the medium. Furthermore, we conducted experiments on gender transformation using the high-level features encoded inside the global and medium timescale variables to verify the effectiveness of the variables. We also visualized the variables using t-SNE plots.

Winter Semester 2019/2020

Winter Semester 2019/2020
03.03.20	Using Deep Neural Networks andTransfer Learning for Body GestureLocalization in Multi-Person Scenarios	Marco Lengua
Gesture recognition in multi person scenes poses a difficult challenge for state ofthe arts neural network architectures. Most approaches to this challenge use hand-crafted features and use the skeleton modality of depth cameras. Prior research hasshown that an end-to-end architecture is capable of localizing salient body motionin such scenes. This thesis aims to improve this approach by applying transferlearning to overcome disadvantages due to the limited dataset size. Building onthis existing work we ask: To what extend is transfer learning applicable andcan it improve the baseline models (CNN2D, CNN3D)? To answer this question,two models are applied reusing the VGG19 and InceptionResNetV2 architectures,respectively. These were pre-trained on the ImageNet dataset although for a quitedifferent task. Our two models were optimized and then compared thoroughly withthe baseline models. We conclude that the InceptionResNetV2 model outperformsthe baseline models in some categories but the VGG19 model achieves generallyworse results except for scenarios where only passive persons are present.
18.02.20	Skin Radiance Classification with Deep Neural Networks	Felipe Arrivabene
Skin Radiance is one of the most desired parameters in cosmetic products and can be perceived differently by different people. It can be understood as the result of young and healthy skin, but since this parameter depends on human perception of physical interactions between skin and light, it cannot be measured by standard methods. In order to quantify such subjective parameters, human interpretation has to be taken into account. For this reason, in this work, we try to model the perception of skin radiance based on consumer input. Deep Learning models are fairly used to model human perception in different fields, such as face emotion analysis in computer vision or sentiment analysis in the text. In this work, we compare deep learning methods, such as CNN, ACGAN and CAAE for classification with the most used method, the (human) specialist assessment. The models and the specialist analysed the same dataset from women faces labelled with three classes: ?under average?, ?average? and ?over average? skin radiance, prepared at Beiersdorf AG. The best performance came from a pre- trained model, reaching over 65% accuracy, while the specialist assessment had only 39%. However, we could notice promising results using the CAAE architecture as well. The results also show the necessity of a better understanding of skin radiance in order to prepare a better-labelled dataset for this task, with more images and better-defined term.
04.02.20	Ensemble-Based Multi-View Lipreading	Sönke Behrend
Lipreading is the ability to transcribe speech solely from the movement of lips. Recent advances in Deep Learning and new large-scale datasets made significant leaps in the recognition rate of automatic lipreading possible. However, most of these improvements have been demonstrated on near frontal lipreading. Generalizing to the full spectrum of possible head poses a speaker may adapt throughout the length of their monologue still proves a significant challenge. In this thesis, we propose an expert model approach to apply a divide-and-conquer strategy to multi-view lipreading, allowing to almost close the gap in performance between frontal- and non-frontal views of faces. Lipreading tasks can be split into two main approaches, where one tries to classify a limited set of words and phrases and the more complex task of lipreading full sentences with large vocabularies up to tens of thousands of unique words. To tackle this challenge, we will start with the simpler tasks of classifying words from video sequences. Our main contributions are an expert model for lipreading words, which is compared against a non-expert baseline model, to show that our approach can close the performance gap for different view angles. Furthermore, we propose an attention mechanism to analyze and visualize learned expert behavior to assist in interpreting expert models. Finally, we extend the word lipreading model to the more complex task of lipreading full sentences, exceeding professional human lipreaders by a large margin.
28.01.20	Adaptive Model Mediated Control Using Reinforcement Learning	Hadi Bek
Due to similarities in learning techniques, Reinforcement Learning (RL) is the closest alternative to human-level intelligence. Teleoperation systems using RL can adapt to new environmental conditions and deal with high uncertainty due to long-time delays. In this thesis, we propose a method that takes advantage of RL capabilities to extend the human reach in dangerous remote environments. The proposed method utilizes the Model Mediated Teleoperation (MMT) concept in which the teleoperator interacts with a simulated setup that resembles the real environment. The simulation can provide instant haptic feedback where the data from the real environment are delayed. The proposed approach enables haptic feedback teleoperation of high-DOF dexterous robots under long time delays in a time-varying environment with high uncertainty. In existence of time delay, when the data is received by the remote system the environment may change drastically, therefore, the attempt for task execution will fail. To prevent failure, an intelligence system is realized in two layers, the first layer utilizes the Dynamic Movement Primitives (DMP) which accounts for certain changes in the environment. DMPs can adjust the shape of a trajectory based on given criteria, for example, a new target position or avoiding a new obstacle. But in an uncertain environment, DMPs fail, therefore, the second layer of intelligence makes use of different reinforcement learning methods based on expectation-maximization, stochastic optimal control and policy gradient to guarantee the successful completion of the task. Furthermore, To ensure the safety of the system, and speed up the learning process, each learning session for RL happens in multiple simulations of the remote system and environment, simultaneously. The proposed approach was realized on DLR's haptic hand-arm user interface/exoskeleton, Exodex Adam. It has been used for the first time in this work as the master device to teleoperate a high-DOF dexterous robot. This slave device is an anthropomorphic hand-arm system combining a five-finger hand (FFH) attached to a custom configured DLR lightweight robot (LWR 4+) more closely fitting to the kinematics of the human arm. An augmented reality visualization implemented on the Microsoft Hololens fuses the slave device and virtual environment models to provide environment immersion for the teleoperator. A preliminary user-study was carried out to help evaluate the human-robot interaction capabilities and performance of the system. Meanwhile, the RL approaches are evaluated separately in two different levels of difficulty; with and without uncertainty in perceived object position. The results from the unweighted NASA Task load Index (NASA TLX) and System Usability Score (SUS) questionnaires show a low workload (27) and above-average perceived usability (71). The learning results show all RL methods can find a solution for all challenges in a limited time. Meanwhile, the method based on stochastic optimal control has a better performance. The results also show DMPs to be effective at adapting to new conditions where there is no uncertainty involved.
21.01.20	Research Talk	Henrique Siqueira
TBD
14.01.20	Neural collision detection for robots via current prediction	Björn Sygo
Currently, the number of robots that interact with humans is on the rise, and one of the most important features for these robots is to detect collisions with itself, other objects or humans. Mostly this is done by using touch or torque sensors, but since the NICO robot, developed by the Knowledge Technology Group at the Universit\"at Hamburg, has neither of these, a different approach is needed. Since other research has already shown that it is possible to use the current values of the robots motors to detect collisions, this thesis takes this approach and advances it for the entire reachable space of the left arm of the robot, since it was only done for a predefined area yet. To do this, first an MLP was developed to ensure the robot doesn't collide with itself and second an MLP was developed to predict the currents for the robot motors. These predictions was then used to detect collisions. The results show, that it is possible to use this approach but the method to finally detect the collisions could certainly be enhanced.
07.01.20	BSc thesis defense	Josua Spisak
In this thesis, we explored the uses of neural networks to help robots with grasping. Grasping is a skill that almost every human possesses. We use it constantly in our daily life, eating, washing, working, most things we do rely on it. We can hardly imagine not being able to grasp as we learned it at a very young age without any difficulties. Robots exist for a multitude of reasons and tasks. One of these tasks would be to help and support us. To fulfil this task, grasping is often a must. Sadly robots aren't able to learn this skill just as easy as we are. We separated grasping in 3 sub-tasks for this thesis, recognizing the object, locating the object and the actual movement. We did all of these tasks in a simulation so that its easier to control the environment and detect potential problems. For the Simulation we used VREP. The robot we used is called NICO (Neuro-Inspired COmpanion) which is a child-sized humanoid robot. His eyes are used to take a picture of a table in front of him on which an object is placed. The coordinates of the object relative to the picture are then detected and forwarded to an artificial neural network. This neural network uses the 2D-Coordinates to calculate the 3D-Coordinates of the object relative to our simulation. Afterwards, the 3D-Coordinates are used by an inverse kinematic algorithm to make NICO extend his arm towards the Object and grasp it. All of this is done in a modular approach so that it is easy to test and detect errors as well as potential risks. In this thesis, we mainly focus on the ANN that is used to transform the 2D Coordinates into 3D-Coordinates. Because of the use of a simulated environment we were able to collect as much data as we needed and even managed to optimize it with a novel approach. At first, we trained the ANN on the area that we wanted to grasp in. Unfortunately, the network wasn't precise enough until we used a larger area to train the network. This was surprisingly successful and helped us to greatly reduce the loss. We continue with the calculated coordinates to let the inverse kinematic algorithm define in which way the robot needs to move to reach these coordinates. In the end, the robot successfully reaches the object.
17.12.19	Bayesian Learning for Neural Attention (MSc defense)	Thomas Hummel
Neural attention plays a central role in many state-of-the-art models for tasks like neural machine translation, image captioning and (visual) question answering. By allowing the model to focus on the part of the input which it deems relevant for the output, attention can help with modeling of long sequences in sequence-to-sequence learning, but also with reducing the complexity when the input is large. Soft attention is the most-widely used attention mechanism as it is easy to implement and end-to-end differentiable. Probabilistic attention approaches like hard attention are often much harder to train but have been shown to outperform its deterministic counterparts when trained well. However, beside their better performance, other advantages of probabilistic attention models have not been extensively explored until now. A recent probabilistic approach, variational attention, models attention as a latent alignment problem and utilizes amortized variational inference for learning. In this thesis, we investigated variational attention as well as two other related probabilistic methods regarding their generalization capabilities and interpretability of learned attention distributions in comparison to deterministic soft attention on visual question answering tasks, namely VQA 2.0 and ShapeWorld. Our results confirm that probabilistic attention methods have the ability to outperform soft attention. Depending on the task, the performance of variational attention differs for which we reveal potential causes. Moreover, we observe that probabilistic attention methods can improve generalization when training data is scarce. Our results suggest that learning attention in a probabilistically principled way does not necessarily improve the interpretability of attention. However, our findings indicate that sampling of well-learned attention distributions of high entropy can improve reasoning about model decisions.
10.12.19	Generative Text-to-Image Synthesis	Tobias Hinz
We will introduce our latest research in the area of generative text-to-image synthesis, specifically in the context of challenging, complicated scenes from a heterogeneous domain. We will show current SOTA approaches and what they currently can and can not do. Furthermore, we will discuss current evaluation metrics for this task, their weaknesses, and introduce a novel evaluation metric specifically for text-to-image synthesis.
03.12.19	Investigating the impact of domain specific fine-tuning for transfer learning in sentiment analysis (BSc thesis defense)	Viktor Eckhardt
In Deep Learning the amount of data used for training plays an important role. The larger the dataset, the better a model can generalize and the less it tends tooverfit. Since task specific data is often limited a lot of research is aimed at TransferLearning. For Natural Language Processing (NLP) a popular approach is to train amodel on a large text corpus to capture syntactic and semantic meaning and applythis knowledge to a downstream task. For the last couple of years, extracting only the first layer of the pretrained model has been the common way for transferring knowledge. However, recent research has shown that using the entire model and fine-tuning it improves results in different NLP tasks. Since task specific fine-tuning has been shown to improve results, this thesis provides an extensive analysis of fine-tuning a model with data from the same domain of the target task. An extension to Universal Language Model Fine-tuning (ULMFiT) is proposed and examined on the target task of Sentiment Analysis.The results show that Transfer Learning in Sentiment Analysis profits from domain specific fine-tuning, especially for similiar domains and a large amount of domain specific training data.
12.11.19	Generation of Waveform Audio from Mel-spectrograms (BSc thesis defense)	Hauke Dau
Recent research introduced new ways to convert text to high-quality speech. New models for this task, specially Tacotron introduced in 2018 generate in a first step mel spectrograms out of text. In a second step, these spectrograms are converted to audio. There are several methods to do this conversion. The recently developed neuronal models, WaveGlow and WaveNET Vocodercan generate high-quality audio out of the mel spectrograms. The pretrained models are trained on single english speaker datasets. To generate audio with other speakers or languages, a retraining of these models is necessary. The training of these models require several weeks on a single GPU (GTX 1080Ti). This makes experiments in which it is necessary to retrain these models very time consuming. Our approach to solve this problem is a deep neural network, which converts mel spectrograms to full spectrograms. These contain more informations than mel spectrograms and can be converted to high quality audio. This is shown in this thesis that with a reduced dataset the introduced model achieves a much faster training time compared to current models. These results are yet limited to small training datasets.
05.11.19	NICO weight prediction based on current values using neural networks and regression (BSc thesis defense)	Thomas Hofmann
Robots are used in various contexts. For example, they are used for research purposes in order to close the gap between capabilities of humans and humanoid robots. This work focuses on the perception of weight. Humanoid robots are equipped with multiple sensors, motors and peripherals, that allow the interaction with the environment and its objects. We introduce an approach to extend perception of a robot to determine the weight of an object by lifting it using internal motor values. The humanoid robot used for this thesis does not support weight estimation out-of-the-box. To achieve our goal, we use the properties of the motors in order to solve the problem. Recurrent neural networks will be trained in order to predict current values in the robot motors for lifting sequences with different join angles without weight. These real measured current values are then subtracted from the predictions, with the aim to feed the current difference with a set of further information to a linear regression. The regression then concludes on the weight lifted by the robot. We conduct two experiments. First we explore current value and weight predictors for a simple lifting task with an extended arm. The second experiment builds on the results of the first experiment and expands the task complexity by using varying angles in joints to simulate different torque on the shoulder joint. In order to find a good performing regression model for the second attempt, different approaches for feature selection are compared.
22.10.19	Humanoid NICO-Robot learning Visio-Motor-Skills using Self-Organizing-Map (BSc thesis defense)	Patrick Wittke
Die Bachelor Arbeit "NICO Robot learning Visio Motor Skills using Self-Organizing Map" ist in Zusammenarbeit mit dem Fachbereich WMT der Universität Hamburg entstanden. Diese Ausarbeitung beschäftigt sich mit dem Lernverhalten von Self-Organizing Maps und deren Brauchbarkeit für das Erlernen von Greifaufgaben mittels des humanoiden NICO Roboters. Die Self-Organizing Map ist in der Lage, verschiedene Freiheitsgrade zu lernen und zu organisieren ohne dabei "reversed kinematics" zu gebrauchen. Es wird versucht, mit modifizierten Neuronen vorhersagen über das Ergebnis zu erlangen, um zuverlässigere Ergebnisse generieren zu können. Im Vorfeld wird anhand von Beispielen, in denen Farben sich selber organisieren, das Verhalten der Self-Organizing-Map beschrieben und analysiert. Hier konnten sehr anschauliche und vielversprechende Ergebnisse erzeugt werden. Weiterführend werden Beispiele mit der Robotersimulation in V-REP gezeigt und deren resultierende Ergebnisse ausgewertet. Insgesamt konnte die Self-Organizing Map beim Erlernen der Freiheitsgrade keine guten Ergebnisse erzielen.
15.10.19	Deep Reinforcement Learning for Real-World Autonomous Driving (MSc thesis defense)	Julius Mayer
Autonomous driving, attempting to model human driver behaviour, is an evolution of advanced driver assistant systems into self-driving cars, requiring little to no human assistance. In this work, deep reinforcement learning is employed to train a data-driven driver model (D3M) for the trajectory planning task of real-world autonomous driving. To investigate how machine learning and rule-based expert systems can be combined into a hybrid-system, a new reinforcement learning framework was implemented and integrated into an existing autonomous driving system. D3M's layered system architecture is discussed with respect to the goal of applying such a system in the real-world. Furthermore, the necessary elements that need to be given for such a system to work and the main challenges that were encountered are presented. The 'Deep Deterministic Policy Gradients' Algorithm (DDPG) is used to train the actor-critic agent with apprenticeship learning for continuous control.
08.10.19	Reinforcement Learning as a Strategy for Engaging Task-Oriented Dialogue Systems (MSc thesis defense)	Viet Cuong Nguyen
The objective of a dialogue system or conversational agent is to converse with a human using natural language in order to fulfil a specific purpose. There are a different number of ways to categorise these dialogue systems. In this work, we focus on task-oriented (also known as goal-oriented) and non-task-oriented dialogue systems, as well as a framework that can interleave the contents generated by these two types of systems. The framework utilises reinforcement learning as a dialogue management strategy to induce the behaviour required for goal-oriented applications while keeping the user engaged with non-task contents when the task module fails or is deemed uninteresting. This thesis will focus on evaluating the effectiveness and soundness of using reinforcement learning to introduce such behaviour by comparing the proposed framework to a pure task-oriented system.

Summer Semester 2019

Summer Semester 2019
25.06.19	TBD (BSc thesis defense)	Timo Kreuzer
TBD
18.06.19	Evaluation of Grayscale Hand Gesture Segmentation with Fully Convolutional Neural Networks (BSc thesis defense)	Lukas Uzolas
Static hand gestures can convey information through the posture of the hand alone. However, problems arise when gestures are presented in front of complex backgrounds. Segmentation represents one method to counteract this problem by extracting the hand from the image. These methods are often based on hand-crafted algorithms using skin color, which limits their application under natural conditions. We approach the segmentation tasks by utilizing two Fully Convolutional Neural Networks, namely the Light Weight RefineNet and DeepLabv3+, and evaluate their influence on grayscale gesture classification tasks. We find that the Light Weight RefineNet has an overall better performance, and a fine-tuned version can improve recognition accuracy on most of the gesture data sets. We further explore how gesture classification could be exploited to learn an intermediate segmentation in a Convolutional Neural Network, but this method fails to yield satisfying results at its current state.
11.06.19	Adaptive Goal-Directed Behavior using Planning with Reinforcement Rewards (BSc thesis defense)	Joshua Schimmelpfenning
This thesis explores look-ahead planning in a continuous, open-horizon task aimed towards reinforcement learning algorithms. Look-ahead planning optimization uses a world-model to predict future states given a sequence of actions and uses reinforcements from those states to optimize the actions with gradient descent. The goal of this thesis is to show the algorithms capabilities for reinforcement learning, represented by the pendulum swing-up task. Our results show that the current algorithm is not capable of solving the swing-up task reliably. However, it can hold the pendulum once it is upright. Further, we found that the algorithm is not suited for agents with constrained action-spaces, except when the goal can be reached directly.
28.05.19	Approximate Bayesian Inference in Recurrent Neural Networks (MSc thesis defense)	Robert Brown
Traditional Neural Networks (MLPs, CNNs, and RNNs) all exhibit the unfortunate characteristic of over-confidence in uncertain environments, and lack reliable representations of uncertainty to counteract this tendency. In this work, we provide evidence of these claims and present an existing solution to the problem: Monte Carlo Dropout (MC Dropout). MC Dropout approximates Bayesian inference in Neural Architectures by using Stochastic Regularization Techniques (such as dropout) to perform multiple stochastic passes through a network in a process analogous to posterior sampling. In the context of Recurrent Neural Networks, the underlying theory behind the MC Dropout algorithm depends on a very specific implementation of recurrent dropout, and in this work we investigate the ability of alternative Regularization Techniques (such as Zoneout) to perform similar forms of approximate Bayesian inference using the MC Dropout technique. Namely, we compare three variations of MC Dropout which utilize three different Stochastic Regularization implementations (Recurrent Dropout without Memory Loss, Zoneout, and the original Variational RNN implementation), and demonstrate an extreme level of similarity among all of them. Our results show that the particular flavor of Stochastic Regularization used in the MC Dropout algorithm has relatively little effect on the algorithm's behavior, and implies that different Stochastic Regularization Techniques can safely be used to perform Approximate Bayesian Inference in RNN. In practical terms, this means that MC Dropout can be performed with confidence using any of the popular deep learning frameworks, without having to be concerned with which specific flavor of dropout is implemented in that particular framework.
21.05.19	Learning Speech Characteristics in Neural Encoder-Decoder Architectures (MSc thesis defense)	Sebastian Springenberg
Unsupervised feature representation learning with neural networks facilitates the discovery of explanatory factors underlying data without the need for annotated datasets. Recently, neural encoder-decoder architectures have been shown to extract useful representations by encoding data into a latent intermediate representation and then decoding this representation again to obtain a reconstruction. Variational autoencoders (VAEs) follow a similar architecture but further constrain the latent representation to follow a distribution. This can increase efficiency in information encoding, resulting in more expressive representations. This thesis will introduce two VAE models to obtain speech characteristics from audio recordings. The first architecture is a simple predictive VAE which encodes input into a latent variable at each step in time to predict the next speech-frame. With the intention of disentangling global from local information, we will further derive an auxiliary lower bound. This lower bound is used to develop a model that we will call the multi-timescale Aux-VAE. The Aux-VAE operates similar to the simpler model but includes an additional auxiliary variable which is used to extract information at the global sequence level. Learned representations will be evaluated with respect to their ability to cover speech characteristics. Here, we will assess how well repesentations are separated in the latent space with regard to speaker identity and gender. In order to investigate transfarability, they will further be applied in a subsequent speaker identification task.
14.05.19	Using Aspect-based Sentiment Analysis on Restaurant Reviews (BSc thesis defense)	Tom Kastek
With aspect-based sentiment analysis, a sentence is analyzed by estimating the sentiment in a text in relation to an aspect, like a word in the text. In this bachelor?s thesis, the principle is used to estimate the sentiment in relation to a term in a sentence as well as the sentiment in relation to a category the sentence is about. The terms are extracted in a subtask beforehand. For the term analysis, two existing long short term memory models are used, whereas for the category analysis for this thesis a long short-term memory model is implemented, which in a first step estimates the category of a sentence in the same model.
07.05.19	Supplementation of Training Data for a Robotic Arm using a Parameterized Generative Neural Network (BSc thesis defense)	Klaas Degwitz
Training a robot arm for a grasping task with supervised learning requires appropriate training data. Producing this training data in sufficient amounts can often be a difficult task. It would therefore be useful to develop a method, that manages to generate additional training data from a small available data set. With the recent attention on Generative Adversarial Networks (GANs) through projects like the generation of fictitious celebrity faces by NVIDIA [11], GANs seem to provide a promising solution to this scenario. This thesis evaluates if training results can be improved by supplementing training data with images generated using a Boundary Equilibrium Generative Adversarial Network (BEGAN). The generated data is created by swapping objects in the available training images for other objects. This is a challenge in contrast to the known BEGAN application of face manipulation, since the objects are relatively small and their position is changing from image to image. Swapping was successfully executed by empirically determining certain hyper-parameters and specific preparation of the BEGANs training data. The usage of BEGANs for robotic visuomotor learning not only contributes to the field of neurorobotic learning, but also contributes a way of quantifying the success of a BEGAN application beyond the mere aesthetic assessment by humans.
30.04.19	PhD Thesis Defense	Jorge Davila-Chacon
TBD
23.04.19	Scale Estimation in Visual Object Tracking	Finn Rietz
Robust visual object tracking is a key requirement in the field of robotics, as it enables advanced robot environment interaction. In the past, Convolutional Neural Networks (CNNs) have been shown to be powerful feature extractors and are viable for visual object tracking, based on the extracted features. While most state-of-the-art trackers achieve strong results in controlled settings, significant scale variations of the target object during tracking poses a challenging problem, because it requires online learning of new visual features, corresponding to the object at a different scale. As a consequence, scale variation requires specific and sophisticated care. This thesis provides thorough analysis of the available algorithms for handling scale variations, from which two algorithms have been carefully selected and implemented in the HIOB tracking framework. Both algorithms are extended to support independent scaling across the x and y-axis, additionally, update strategies are developed that control on which frames the algorithms are executed. The TB100 dataset is used for evaluation on a broad and diverse range of sequence. Additionally, evaluation on the NICO dataset reveals performance of the two algorithms when facing typical robot environment interaction challenges. The results show strong scale estimation capabilities of one algorithm, while the second algorithm that has been developed shows promising potential.
16.04.19	Intermediate Representations in Deep Multimodal Neural Networks (MSc thesis defense)	Fares Abawi
Artificial neural networks designed to learn multiple tasks were shown to outperform those with a single objective. We explore the effectiveness of introducing multiple auxiliary tasks to improve the performance of a multimodal neural network. We create a synthetic dataset explicitly designed for a robotic grasping task, with the goal of grasping objects and relocating them based on natural language commands. We design a neural model combining multi-sensory input through intermediate fusion. The multimodal integration process combines individual deep neural models specializing in specific tasks, composing a single network for grasping objects given an image and a natural language command. The vision and natural language modalities perform domain-specific tasks with independent output heads branching from both neural models. We refer to these output heads as intermediate representations. We use a symbolic Robot Command Language (RCL) as an intermediate representation between the language network and the fusion network. The vision network has two intermediate representations for localizing objects in images and for classifying them. We ablate the intermediate representations forming all possible combinations. Our experiments show that certain intermediate representations result in an overwhelming loss contribution to the entire model, distorting the main task's objective. However, other losses contribute positively to the model's overall performance and act as regularizers. Our results also indicate that choosing RCL as an intermediate representation outperforms natural language as an intermediate representation.
09.04.19	3D Viewpoint Optimization via 2D Image Matching (MSc thesis defense)	Paul Anton
The 3D viewpoint estimation task is relevant for a wide range of computer vision applications since it provides a way to model the 3D environment of intelligent systems. The ability to estimate the position of an agent relative to the entities in its environment is as fundamental to the agent's functioning in that environment as the detection and recognition of those entities. The task has therefore gained increased research interest and has been tackled in a vast landscape of scenarios, with current solutions varying in terms of representation, problem formulation and methods. After extensively surveying the existing literature on the topic, we formulate our approach for solving the 3D viewpoint estimation task in a clinical context. The scenario requires 3D spatial guidance of a mobile imaging device in order to precisely reposition it at a perspective previously used to scan a patient's body part, having as reference an image acquired from that same viewpoint. By comparing the reference image to a second image acquired from a new viewpoint, we succeed in extracting 3D information corresponding to the change between the two viewpoints. The first method we investigate is a feature-based method, that uses SIFT keypoints identified in the two images to register them and deduce the 3D transformation between their corresponding viewpoints. We develop a second method which is based on a popular CNN architecture trained to regress the 6-DoF change values from one viewpoint to another by learning on pairs of their corresponding images. We find that the deep learning model is performing better than the feature-based approach and we provide a comparative discussion of their performance.

Winter Semester 18/19

Winter Semester 2018/2019
26.03.19	Introduction Talk	Dong Hai Phong Ngyuen
Dong Hai Phuong Nguyen finished his PhD thesis at the Istituto Italiano di Tecnologia (IIT) in Genova and joined our group 01.03.2019. He will be working with Manfred in the DFG project "IDEAS".
05.03.19	Object Classification by Haptic Weight Exploration with a Humanoid Robot (BSc Defense)	Connor Gaede
While interacting with an object, a human is able to determine the objects physical characteristics in relation to previously explored ones through haptic perception. To adapt such a system for a humanoid robot, the sensory information of multiple different motors and specialized sensors has to be integrated into a singular model. In this thesis, we examine the capabilities of a humanoid robot to distinguish a set of 16 objects based on their weight. We recorded a haptic dataset with a total of 19200 haptic measurements, collecting arm motor currents and positions as well as tactile feedback from three fingertip sensors, produced by a NICO robot lifting each of these object 100 times. With this dataset, we optimize three neural architectures to classify different subsets of the recorded objects with increasing minimum weight distances between individual objects. We then compare the influence of both tactile information and proprioception on the classification result by evaluating the isolated performance of these two types of sensory information individually for each set of objects.
26.02.19	A Deep Autoencoder for Unsupervised Learning of Place- and Head Direction Cells from Image Sequences (BSc Defense)	Oliver Heidmann
In this talk we examine the capabilities of auto-encoders to generate place- and head direction cells. First we will introduce supervised encoders which are capable of producing place and head direction through supervised training, then combine our results with a simple decoder and examine the properties of the resulting auto-encoders and their necessary alterations. The unsupervised training of the auto-encodes will be done with a loss function that enforces slowness on the middle layer to generate the wanted cell properties. We found that auto-encoders hold promise for the generation of place- and head direction cells but that the mechanism we chose is hard to control and possibly need revision and more in depth research.
29.01.19	An Investigation Into Hector SLAM Using A 360° LIDAR Sensor (BSc Defense)	Dominik Scherer
Simultaneous Localization and Mapping, or SLAM for short, is one of the funda- mental computational problems in robotics research. Many applications, ranging from simple factory robots to self-driving cars, rely on the implementation of an effective SLAM-solving algorithm. One such implementation originally developed for Urban Search and Rescue scenarios is called Hector. This work will provide an introduction to the theoretical aspects of SLAM, as well as a presentation and discussion of the results gained from a series of experiments investigating the Hector system. These experiments involved the mapping of an indoor environment with different sets of parameters utilising a 360° laser scanner, and also subsequent navigation tests on the maps generated thereby. Positive results in terms of noise reduction and obstacle avoidance could be gained. The information provided within this thesis may prove useful to anyone looking for a concise overview of the Hector algorithm and the theoretical foundations pertaining to SLAM.
29.01.19	Laser Based Gmapping Experiments In An Indoor Environment (BSc Defense)	Nele Lips
Gmapping is one of the most widely used methods to solve the problem of simul- taneous localization and mapping (SLAM). As an implementation of FastSLAM, it is based on the Rao-Blackwellized particle filter and allows for the construction of accurate 2D grid maps of arbitrary environments. The algorithm is available in ROS as open source and comes with a set of user definable parameters. In this thesis, the results of a series of experiments are presented and discussed that were aimed to analyse the impact that these parameters have on the mapping quality of the algorithm. To this end, several routes were mapped and navigation with the commonly used AMCL algorithm was performed on the generated maps. The outcome of the experiments proved to be unexpected and revealed some interesting aspects about the mapping process. Furthermore, the thesis gives an overview of the theoretical aspects of SLAM and Gmapping.
22.01.19	Attention based Image Captioning with Recurrent Neural Networks (BSc Defense)	Lasse Westphal
Attention based image captioning systems can generate textual descriptions of images and have the ability to focus on different parts of the images at different time-steps. In the last years, gradually better attention based image captioning systems have been published. However, it is often unclear, which changes of a model had which share in the enhancement of the observed performance. With the goal of a better understanding of the influence of different parts of a model on its performance, we implement and optimize two attention based image captioning models, examine the quality of the attention and highlight observed difficulties in the training of such systems.
18.12.18	Hierarchical Reinforcement Learning in Sparse Feedback Environments with the help of Intrinsic Motivation (BSc Defense)	Luc Baracat
Reinforcement Learning has shown a lot of promise for tasks in which agents need to perform a series of actions only by learning from feedback from the environment. In large environments with sparse feedback, it is difficult for agents to learn robust value-functions due to lack of exploration. In this thesis, we will present the concept of Hierarchical Reinforcement Learning (HRL), a type of Reinforcement Learning making use of the hierarchical structure of tasks. HRL defines tasks as a series of goals, which in return are defined as a series of actions. This allows for the sequential achievement of each goal in order to achieve the given task. Modified variations of the hierarchical Deep Q Network (h-DQN) architecture, inspired by Kulkarni et al. [TDK16] will be presented. These architectures attempt to simplify the original architecture by several processes, such as the sharing of the input and fully-connected layers, in order to obtain a structure that is more comparable to the humanbrain. Parallels to biology will be drawn, by mapping the architecture of HRL to the brain with the help of the findings established in neurobiology and psychology. The advantage of HRL over regular RL is going to be shown for tasks in environments with sparse feedback.
04.12.18	Socrates Project Presentation	Antonio Andriella
Every 3 seconds someone develops dementia worldwide. Brain-training exercises, preferably involving also physical activity, have shown their potential to monitor and improve the brain function of people affected by Alzheimer Disease (AD) or Mild Cognitive Impairment (MCI). The main goal of my research is to propose a symbolic representation to model Human-Robot interactions involving different actors that enables high-level planning, reasoning about the goals, and allows to adapt the robot to the learned user behaviour in assistive scenarios, like brain-training exercises. I will present a Cognitive Framework designed in collaboration with Fundacio ACE, a non-profit organization dedicated to diagnosis and treatments of people with AD and other dementias. The aim of the framework is to assist mild dementia patients during brain-training sessions using Syndrom KurzTest neuropsychological exercises (SKT). The system is able to perceive and adapt to the user's behaviour and is composed of two main modules. The adaptive module based on representing the human-robot interaction as a planning problem, that can adapt to the user performance offering different encouragements and assistive actions using both verbal and gesture communication in order to minimize the time spent to solve the exercise and the number of mistakes. The safety module, which constantly monitors the possibility of physical contact and selects the most appropriate action to react to an unsafe event. Background: Antonio Andriella has started his PhD thesis on robot personalisation in March 2017 at the Institut de Robòtica i Informàtica Industrial, CSIC-UPC, Barcelona, within the joint SOCRATES project. His research interests include perception, robot manipulation, and human-robot interaction He is currently visiting the KT group as a guest researcher.
27.11.18	Student Thesis Spottalks	WTM Students
Finn Rietz - Scale Estimation in Visual Object Tracking Timo Kreuzer - Robots on the Move: An Evaluation of the Combination of a Deictic Gesture Interface and Object Recognition with the NICO robot Tom Kastek - Using Aspect-based Sentiment Analysis on Restaurant Reviews Vincent Dahmen - Evaluating domain control mechanism on NMT using real life data sets Julian Lorenz - Real-time control of a robot using EMG-classification with neural networks
20.11.18	Student Thesis Spottalks	WTM Students
Fares Abawi (MSc) - Intermediate Representations In Deep Multimodal Neural Networks Timo Kreuzer (BSc) - Robots on the Move: An Evaluation of the Combination of a Deictic Gesture Interface and Object Recognition with the NICO robot Klaas Degwitz (BSc) - Enhancement of training data for a robot arm with a generative network Edna Teich (MSc) - SLAM using Associative Growing When Required networks Andre Schledermann (BSc) - Improving Fall Detection Using Growing When Required networks
06.11.18	An Unsupervised Hybrid Neural Network for Novelty Detection in Acoustic Event Detection (MSc Defense)	Jonas Papmeier
The problem to detect and classify acoustic events in an audio stream is tackled with increasing success in the research field of acoustic event detection. The approaches so far assume a closed-world, where they train and test their system on a limited set of event classes. In a realistic setting events of classes not known the model can appear at any time. Yet, the detection and adaption to new previously unknown classes as done in novelty detection remains an open problem. We propose a hybrid network which uses a variant of self-organizing networks, the grow when required network (GWR), to deal with the detection and incremental adaption of novel inputs within acoustic event detection, and an autoencoder to leverage the power of deep neural networks for feature extraction. We expect to increase the performance of the GWR with an appropriate learned representation. We, therefore, evaluate multiple autoencoder variants to find the most suitable one. As research into the direction of novelty detection is missing so far in acoustic event detection, we develop an evaluation protocol based on the publicly available DCASE 2016 Challenge Task 2 dataset for event detection in synthetic acoustic scenes. The creation of artificial scenes allows us full control over the classes appearing in the training set. At the same time, we have no control over the test set with the DCASE dataset and reduce the chance to overestimate performance. For our evaluation, we incrementally train our model on each class and test on all classes after each new class. All untrained classes are treated as novel. In our evaluation, we can estimate the effect of the incremental adaption and the effect of unknown events on detection performance, as well as the novelty detection performance of our model. Our results show that our approach is capable to adapt incrementally to new classes and that our novelty detection mechanism helps to improve performance when multiple unknown classes appear. Yet, the learned representation is so far not powerful enough as it does not group or separate all classes well. Hence the performance of our approach does not reach state-of-the-art approaches in a closed-world setting. With respect to the performance under the appearance of unknown classes a clear gap between our empirical ideal performance for our approach and our approaches performance shows that the problem is not solved so far and improvements can be made with better representations and novelty detection mechanisms.
30.10.18	Adapting to User Context in a Reinforcement Learning-based Dialogue System (MSc Defense)	Mehran Sheikholeslami
Personalized dialogue systems provide a tailored conversation with their target audience according to their preferences. In a goal-oriented dialogue system, a user may have preferences for available options at each step as fixed values. Knowing these values leads to having personalized dialogues also with shorter length. But in a long-term interaction, user preferences may change over time or be context dependent, due to various external reasons. Also, they may have different dynamics in their preference and preference changes. Therefore, this master thesis proposes a dialogue system which is able to adapt to user preferences. Built upon PyDial, a dynamic preference user simulator is created to simulate different user behaviors in a long-term relationship. By providing a memory module for the dialogue system, reinforcement learning is applied as policy in order to control the dialogue with information from the memory and current dialogue. Although the system is domain-independent, the experiments are done in Cambridge Restaurant domain in order to explore its behavior.
23.10.18	Scene Description using Affective Labeling (BSc Defense)	Merlin Sewina
The problem of describing the visual content of an image is a trivial task for nearly any human. Even explaining what kind of emotional effect the image has seems trivial. However, this task of affective scene description proves to be a huge challenge within the field of computer science. In this thesis, the problem of affective scene description is explored via the use of convolutional and recurrent neural networks. A convolutional only approach is proposed to solve this task on a dataset of images, labelled with continuous arousal and valence values, using the raw data as input. Additionally, temporal processing capabilities are added to the network to explore the performance change of such a modification to the network. This work shows that no significant difference in performance by training one network for both, arousal and valence values at the same time, or training two networks with either arousal or valence value output can be found. The results of this thesis also show that using the dataset "LIRIS-ACCEDE" for single frame convolutional neural networks is not effcient. At last this thesis explores the selflearned features of the convolutional network.
12.10.18	Social robots for children and rats: The OPAL and iRat projects	Scott Heath
Robots require social abilities in order to operate in human-occupied environments. For robots to be used in applications such as health and education, they need to be able to effectively interact with people. Creating robots that are capable of social interaction requires considering: form, motion, interaction dynamics and safety. Designing and implementing form and motion rely on feedback from trials with children and insight from psychologists, interaction designers and linguists, in tandem with careful engineering of hardware and software. Suitable social interaction dynamics require a robot to be able to perform fast motions, while safety in close proximity of children requires limiting the forces and torques that a robot's limbs can exert. The OPAL project aims to develop a child-friendly robot and use it to investigate child-robot interactions. The project follows an iterative design process where new versions of the robot (each named Opie) are created and then trialed with children as soon as possible. Three different Opie versions are currently involved in studies: i) a foam prototype, which is used in imitation and storytelling studies, ii) a version with actuated arms that is used in piloting, and iii) a flat-packable version that has been co-designed with the Ngukurr Language Centre (an centre supporting revitalisation of indigenous Australian languages).
09.10.18	Multi-Attribute Face Generation Using Conditional BEGAN (MSc Defense)	Ahmed Elshinawi
Automatic synthesis of faces from visual attributes with Artificial Neural Networks is among the most popular researched topics in the last few years. The conditioned face generation based on visual characteristics is a complex task and has a wide range of applications. Training neural networks to generate natural looking images is a hard problem to solve with conventional machine learning concepts because the problem is that the generated images do not appear natural to a human eye; hence Generative Adversarial Networks (GAN) have been introduced as a solution in recent years. GANs are a new class of generative models that are widely used for performing such generative tasks. However many state of art generative adversarial models either focused only on image generation or image-to-image translation or the images quality was not as high as expected. In this work, we aim at conditional image generation based on a novel, two-stage reconstruction architecture that learns to associate conditions with the disentangled representation of images attributes and generates novel conditioned images at the same time. Our proposed model is also capable of performing simultaneous attributes editing. It can be conditioned on any vector when conditioned on class labels from CelebA dataset; the model was able to generate diverse, realistic facial images with various facial attributes. This work further proposes two evaluation metrics to evaluate a conditional generative adversarial model objectively. We proposed to perform evaluation based on quantitative and qualitative metrics. We utilized this method since it leverages the incorporation of results from an objective and subjective point of view. Our novel conditional adversarial autoencoder architecture is trained to extract high-level features from natural images associating them with their respective condition and to reconstruct the images from those conditioned features.
02.10.18	Haptic Object Classification with Convolutional Neural Network using Humanoid Sensor Hand (BSc Defense)	Emil Gasanov
By living in a complex and constantly changing environment, the human being is dependent on gathering as much surrounding information as possible. In most physical exploration tasks, humans use their hands to obtain a variety of physical information about a particular object. In this thesis, we research the influence of physical attributes on haptic recognition. Therefore, a dataset of 16 different objects by using a multimodal sensory humanoid hand is recorded. Starting from a basic network, we optimize this systematically to find out how the individual parameters interact with this data set. We then evaluate different sensor-subsystem combinations to analyze their impact on object property classification.

Summer Semester 2018

Summer Semester 2018
25.09.18	Weakly-supervised Learning For Object Classification And Localization In Robotic Grasping	Hwei Geok Ng
A robust grasping ability in cluttered environments is an important ability for assistive humanoid robots. This thesis presents a weakly-supervised object localization approach for visually-guided object grasping tasks on the Neuro-Inspired COmpanion (NICO) humanoid robot. The weakly-supervised approach generates object localization maps by requiring only object class labels as ground truths for model training. The realized approach consists of a vision and a grasping module. The vision module is a classification Convolutional Neural Network (CNN) combined with the weakly-supervised Gradient-Weighted Class Activation Mapping (Grad-CAM) approach. It takes an input image and a given class label to generate object localization maps. The grasping module is a regression CNN, which takes object localization maps as inputs and generates respective joint predictions for object grasping. The vision module yields 100% object localization rate in both the single and the multiple objects evaluations; its performance is as good as the fully-supervised Faster R-CNN baseline system for object localization. The grasping module yields an overall grasp accuracy of 82% for the single and the multiple objects evaluations. The grasp accuracy in the single object evaluation is lower compared to 86.1% of the baseline system, due to more samples being used for training in the baseline system. The 82% grasp accuracy in the multiple objects evaluation is higher compared to 61.1% of the baseline system. Additionally, due to its more stable model, the variance in the grasp accuracy of the grasping module is lower depending on the grasped object. Comparing the offsets of the single and the multiple objects settings, the current system has 0% offset and the baseline system 25%. The grasping module shows no adversarial effect on the grasping accuracy from single to multiple objects, achieving 20.9% (for four object categories) and 36% (for seven object categories) higher grasp accuracies than the baseline system in the overall multiple objects settings. The end-to-end, biologically-plausible approach saves time and effort from bounding box annotations, which also prevents human annotation bias. The realized approach contributes to the grasping task by providing transparency of its decision-making process, pointing out image regions that it uses to make a grasping prediction. Furthermore, the approach enables amendments to be made for ambiguous predictions based on user`s input, which offers more flexibility to its learning process and at the same time enhances the human-robot interaction.
11.09.18	Exercising Companion NAO	Maham Tanveer
The personality of a robot serves the social behaviour expected from a Socially Assistive Robot (SAR) in the long run. Mood and emotions contribute to 'personality' and they are means of exhibiting necessary and required social behaviour. The designed personalities for a SAR can be demonstrated in the example of the social context of an exercising companion. It has been shown in previous studies that the presence of co-active robot can enhance motivation and engagement. But what effect does the personality of a SAR have on motivation and engagement? Is personality even perceived by users as intended by robot designers? This independent study provides a system and HRI experiment design for a SAR with three conditions representing personalities with the NAO robot acting as an exercising companion. The focus of the independent study is not to monitor the exercise performance but to present an experimental design that can analyze how the trainees perceive the personalities and social abilities of the exercising coach NAO.
04.09.18	Transfer Learning of Face Representations Using Semi-supervised Strategies	Erik Fliesswasser
A robot's visual perception of a human in the context of social interactions is a key to a beneficial behavior. Deep neural networks that are trained to perceive attributes like the identity, the age, the gender, and the emotion from the counterpart's face have a high potential and belong to the state-of-the-art in this regard. However, the success of deep learning is dependent on a large amount of task-specific labeled data. Transfer learning and semi-supervised learning are two approaches to tackle the lack of labeled data in the field of facial recognition tasks and are subject to this thesis. A two-stage procedure is proposed which uses unsupervised learning of general face representations with a generative adversarial network and subsequent supervised learning of task-specific features and similar facial recognition tasks. Furthermore, the transfer of high-level features with progressive neural networks is explored in order to share task-specific knowledge. The proposed networks are compared to purely supervised baseline models as well as to state-of-the-art benchmark results. Additionally, the learned face representations and task-specific high-level features are visualized in order to perform a deeper analysis of the results. The performance analysis shows that the use of pre-trained face representations yields similar results to the baseline models while converging much faster and being robust against reduced amounts of training data. Furthermore, state-of-the-art benchmark results are achieved for the tasks of gender and age recognition. The high-level feature transfer improves the performance of identity recognition, whereas the transfer of identity features consistently decreases the performances of the other tasks. The approach of using progressive neural networks to transfer high-level features has shown to be beneficial for networks with suboptimal features but has a negative influence if the learned features already extract distinctive information. The analysis of the feature visualization shows very general facial features for the shared face representation layer while high-level layers exhibit distinctive task-specific features confirming the basic idea of learning low- and mid-level representations separately from high-level features. The results show that the proposed network and its training procedure are useful in situations where the amount of labeled data is very small. This thesis contributes a network architecture and a training procedure which are a competitive way to train deep neural networks with small amounts of labeled data in the context of facial recognition tasks.
28.08.18	Learning to Recognize Objects through Observation and Active Exploration with the Robot NICO	Vadym Gryshuk
In this talk, we address the problem of lifelong learning on humanoid robots. The contemporary methods for object recognition are mostly based on batch learning and do not consider the image sequences of videos that are crucial for incremental learning. Therefore, we propose the architecture for lifelong learning, which learns incrementally from the image sequences. The proposed architecture consists of two modules. The first module is based on a Convolutional Neural Network (CNN), initially pretrained offline on a large dataset for the feature extraction of the input data. The second module is the Recurrent Grow When Required (RGWR) network for continuous object learning and recognition. The conducted experiments and the results on classification tasks show the capability of our model to effectively learn and recognize objects in a lifelong manner. Furthermore, we collected a new dataset, NICO Vision, for continuous object learning
21.08.18	Continuous Deep Reinforcement Learning with Normalized Advantage Functions for Robot Grasping	Joan Pape
The research on robots together with machine learning are currently very relevant topics. At the university of Hamburg, most research in this area is done using the custom developed NICO robot. However, the use of a real robot creates the need for cautious approaches thus hindering the learning process. This work instead attempts to produce better policies by executing the learning in a simulated V-Rep environment. To measure this improvement, objective score functions are presented to evaluate various aspects of the task. This thesis then focuses on, various increasingly difficult tasks with the common objective of grasping or reaching for an object. Those are utilized in a way similar to curriculum learning, to find the important aspects of an optimal policy. The tasks begin with a simple one degree of freedom arm and range to a six degree of freedom model of the NICO robot. The continuous deep reinforcement learning algorithm Normalized Advantage Functions is used for this. Additionally, further optimization attempts like asynchronous learning are applied along the way.
14.08.18	Implementation and Evaluation of a Deictic Gesture Interface with the NICO robot	David Biertimpel
In everyday interactions, people intuitively reference entities in their environment by pointing at them. These so-called deictic gestures allow directing other people's attention to a desired referent. In the field of Human-Robot-Interaction deictic gesture are of frequent interest as they enable people to apply familiar behavior to shift the robot's focus. However, despite being intuitive, deictic gestures possess an inherent ambiguity. Depending on the perspective and the target's proximity to other entities, the actual target of a deictic gesture may sometimes be diffcult to identify, even for a human interaction partner. To this end, this thesis investigates whether we can create a natural deictic gesture interface with the humanoid robot NICO that is capable of recognizing a gesture's target also in ambiguous object constellations. In order to address this task we introduce two approaches: First, we approximate a pointing array from the hand posture of the experimenter. Subsequently, we predict the gesture's target by using a Growing When Required network (GWR). Finally, we create experimental set-ups to evaluate our approaches.
07.08.18	Facial Expression Editing using a Conditional Adversarial Autoencoder	Alexandra Lindt
Recently introduced Deep Generative Models have achieved impressive results in the field of automated facial expression editing. However, the models presented so far are trained solely on images annotated with discrete emotion categories. Therefore, they are limited in the modelling of continuous emotions to the extent that they must always represent them as a composition of basic emotions. To overcome this limitation, we investigate how a continuous emotion representation can be used to modulate the automated manipulation of facial expressions. For this purpose, we modify a Deep Generative Model so that it can be used to edit facial expressions in face images according to continuous two-dimensional emotion labels. One dimension represents the value of an emotion, the second describes its degree of excitement. The effectiveness of our proposed model is demonstrated in a quantitative analysis using classifier networks as well as in a subsequent qualitative analysis.
31.07.18	WTM Student Spottalks	Thorsten Ploß Lasse Westphal

24.07.18	Independent Study Presentations	WTM Students
Thomas Hummel - Uncertainty with Recurrent Neural Network Models Robert Brown - Representing Uncertainty with Bayesian Neural Network
17.07.18	Question Answering using Recurrent Neural Networks with Attention Mechanisms (BSc Defence), Student Thesis Spottalks	Michael Nelskamp, WTM Students
Michael Nelskamp Title: Question Answering using Recurrent Neural Networks with Attention Mechanisms Abstract:Recurrent Neural Networks (RNNs) have recently been applied to natural language processing (NLP) like Question Answering (QA). To solve a QA problem a RNN processes a text as well as a question referring to the text in order to find an answer to the question. We propose two new models for QA, which are HAN-doc-vec and HAN-ptr. These models implement concepts of existing NLP models like the Attention Sum Reader Network (ASRN) or the Hierarchical Attention Network (HAN). In the thesis we explore whether the hierarchical structure of HAN-doc-vec and HAN-ptr enhances the performance of existing models on a selected QA task, namely the Children?s Book Test (CBT). Nele Lips - Investigation of SLAM on TurtleBot3 with a 360° Laser Dominik Scherer - Investigation of SLAM on TurtleBot3 with a RealSense 3D Sensor Merlin Sewina - Scene description using affective representation
10.07.18	Generating place cells on a real robot platform (BSc Defence), Feature Learning on EEG Recordings (BSc Defence)	Philipp Quach Marian Wiskow
Philipp Quach Title: Generating place cells on a real robot platform Abstract: Place and Head Direction cells as they are found in the hippocampus of animals play an important role in how the brain learns to navigate. Computer algorithms that train these cells have been applied on agents in simulated environments. However due to increased complexity of the real world it should be validated that algorithms for robot-localising, which have previously succeeded in simulation, work on a real robot platform. We have modified the Ratlab platform which is one such simulated environment to use pictures of the real world as input instead of those generated in a simulation. Marian Wiskow Title: Feature Learning on EEG Recordings Abstract: Unravelling patterns in neurophysiological data such as Electroencephalography (EEG) is paramount to progress in many fields of research and medicine. Recently, Deep Learning has proven exceptional potential for pattern recognition in areas once championed by human perception and cognition. There is a growing number of studies showing Convolutional Networks (CNNs) can perform well in EEG classification. It remains to be investigated, however, whether the features extracted by such methods are neuroscientifically meaningful. The aim of this thesis is to contribute to this question by evaluating a recent approach in the light of newly established findings on fine motor control in the aged brain. In this approach, EEG recordings are represented as series of topology-preserving images to be classified by a Convolutional Network. State-of-the-art regularization techniques are deployed to reduce the risk of overfitting. Visualization of network activation behavior and saliency during class prediction indicates a certain degree of neuroscientific validity of the learned features, but confirms noise susceptibility as a main challenge in the field. The results are extended by an exploratory study on the discrimination of fine motor tasks in stroke patients.
03.07.18	Effect of a Humanoid's Active Role During Learning with Embodied Dialogue System (BSc Defence)	Theresa Pekarek-Rosin
Teaching robots skills and tasks is a time-consuming process and has to be highly and quickly customizable, especially for robots living and working alongside human beings. Transferring the responsibility of supervising the training to the non-expert user allows social robots to adapt to high variability environments more quickly. A dialogue system, providing feedback and instructions should allow humans to train robots intuitively and efficiently. In this study the effect of a humanoid's active role on both the learning process and the user's perception was examined, showing that while training with an embodied dialogue system might be less efficient, it also increases user comfort, affinity and confidence.
26.06.18	Research Talk: Lifelong Learning and Catastrophic Forgetting	German I. Parisi
Artificial agents interacting in highly dynamic environments are required to continually acquire and fine-tune their knowledge over time. In contrast to popular deep neural network models that typically rely on a large batch of annotated training samples, lifelong learning systems must account for situations in which the number of tasks is not known a priori and the data samples become incrementally available. Although significant advances have been made in domain-specific lifelong learning with artificial neural networks, extensive research efforts are required for the development of autonomous agents which currently suffer from flexibility, robustness, and scalability issues. In this talk, I will introduce the main challenges linked to lifelong learning and discuss well-established and emerging research trends motivated by experimentally observed lifelong learning factors in biological systems. Useful references: Parisi, Tani, Weber, Wermter (2018) Lifelong Learning of Spatiotemporal Representations with Dual-Memory Recurrent Self-Organization. arXiv:1805.10966. Parisi, Kemker, Part, Kanan, Wermter (2018) Continual Lifelong Learning with Neural Networks: A Review. arXiv:1802.07569 Parisi, Tani, Weber, Wermter (2017) Lifelong Learning of Human Actions with Deep Neural Network Self-Organization. Neural Networks 96:137-149.
19.06.18	Artificial neural networks for collision detection by current prediction in robotic motors (BSc Defence)	Franziska Pappenhausen
Robots that are able to move their body parts are also able to collide with them and their environment unintentionally. These collisions can cause damage to the robots motors if unrecognized, since the motors will start to consume more power and start to overheat. A MLP for current prediction was developed in a previous work. The main approach of this thesis is to detect a collision by comparing the actual consumed current and the predicted current. First MLPs are developed for collision detection in general to prove the possibility of collision detection and to show how fast a MLP can detect a collision. Then, MLPs are developed for collision localization on the robots body. The motivation behind this is to keep the impact of a reaction, needed to correct a collision, to a minimum. In particular, these approaches are developed for the NICO robot from the Knowledge Technology research group of the University of Hamburg. NICO is a humanoid robot developed for various task many including grasping. Therefore, collisions can occur. For collision detection evidence was given, that it is possible to detect collisions per current prediction. However, the approach of collision localization shows that the tested locations are often confused.
12.06.18	Neuro-Symbolic Learning of Action Plans (MSc Defence)	Andreas Grenzing
This Master Thesis explores the usage of machine learning, to train a humanoid robot high-level actions (i.e. actions defined by an underlying, hierarchical plan of primitive actions). The idea is to utilize the fact that deep neural networks are able to learn hierachical representations from an AI perspective such models may bridge the gap between traditional strategic planning and sub-symbolic motion performance. In the first part we present a research overview of human motion, actions and behavior to clarify which parts of human/robotic action performance such a neuro-symbolic model would address. In the second part we present a concept of how a deep convolutional model may learn action plans. To evaluate our concept, we define and record a dataset of high-level gestures with the NICO robot. Evaluating our model shows some aspects of the hierarchical definition but were not sufficient to conclusively decide whether our concept is viable or not. Besides preliminary tests, time did not permit to achieve the generation of motoric data from the high-level gestures.
05.06.18	Neuro-symbolic foundations of robotic embodiment in cognition, action and creation	Manfred Eppe
Human cognition is determined by embodiment and by the actions that become possible due to embodiment. The importance of embodiment for human cognition becomes very evident in natural language, which is full of metaphors about motion and action. For example, we say that somebody "clicks himself through the internet", even though no physical movement is involved. To investigate embodiment in human cognition, I have developed a logical epistemic action semantics to model motion and action of a mobile robot by means of discrete state transitions, and I have shown that a similar state transition semantics is useful for modelling cognitive foundations of human creativity to generate music and to invent mathematical lemmas. I have, furthermore, investigated embodiment in language by realizing a human-robot interaction framework that utilizes symbolic embodied semantics to realize a human-robotic dialog system. In addition to these symbol-semantic approaches, I have addressed language understanding and semantics in robot perception and control using convolutional and recurrent neural networks. For my future work, I plan to combine the symbolic and neural approaches related to robotic embodiment to investigate the neuro-symbolic foundations of cognition and action. I plan to realize this research by investigating creative robotic problem solving in different domains.
29.05.18	Visual Object Tracking for Robotic Applications (BSc Defence)	Tobias Knöppler
Visual Object Tracking is a challenging task and while there has been a lot of activity and publications in this regard over the last years, very few have been evaluated for specific practical applications. This thesis attempts to optimize and streamline the HIOB tracking framework to make it useful for robotic applications.
29.05.18	Storm Surge Prediction for Hamburg with Artificial Neural Networks (BSc Defence)	Meike Zimmermann
The city of Hamburg, located on the river Elbe, is subject to strong tidal fluctuations due to its closeness to the North Sea coast. For the protection of the harbour and lower areas of the city, accurate predictions of water levels especially during storm surge events are necessary. While traditional methods have been thoroughly researched for predictions for Hamburg, the use of artificial intelligence for the same task has yet to be explored experimentally. Artifical neural networks (ANN) have been successfully applied to sea level forecasts in other locations. However, few examples use recurrent neural networks (RNN), even though RNN are regularly used for time series modelling. This thesis presents a comparison of different ANN models used to predict the tide curve for Hamburg. Gated recurrent unit (GRU) and Elman recurrent neural networks (ERNN) are compared with multilayer perceptron (MLP) models for tide level forecasting. The evaluation considers the prediction of the tide curve in general and of storm surge events in particular.
15.05.18	Increasing the robustness of deep neural networks for text classification by examining adversarial examples (MSc Defence)	Marcus Soll
Adversarial examples are specially crafted samples, where noise is added onto regular samples to make neural networks misclassify the sample despite having no detectable noise for humans. This thesis will explore adversarial examples in the text domain by conducting three experiments with the goal of increasing the robustness of neural networks. The first experiment shows that adversarial examples are easy to craft for text classification tasks and that these adversarial examples transfer between different models. The second experiment shows that defensive distillation does not increase the robustness of a model to adversarial examples. The third experiment shows that adding adversarial examples to the trainings set of a neural network will not increase the overall accuracy of that network. All neural networks tested have a simple architecture based on a single 1-dimensional convolutional layer.
24.04.18	A Speech Enhancement Generative Adversarial Network for the WTM Robots (BSc Defence)	Patrick Eickhoff
The rapid advances in automatic speech recognition (ASR) systems shown in recent years have made speech a viable way of interacting with machines. The knowledge technology research group (WTM) at the university of hamburg is performing extensive research on interactions between neuro-inspired robots and humans. However background noise and ego-noise produced by the robots interferes heavily with their ASR systems. A recent publication by Pascual et al. (2017) proposed a novel generative adversarial network for speech enhancement, the SEGAN. In this thesis we train and test the SEGAN model on different noise conditions and evaluate, if it could provide improvement for the robots ASR systems.
tbd
17.04.18	Research Spottalks	Pablo Barros, Tayfun Alpay
Tayfun: "In this talk, I will introduce and explain the idea of "learning to activate" in recurrent neural networks, and discuss how it can be linked with information-theoretic metrics to potentially achieve time-scaled memory representations." (Paper: Surprisal-Based Activation in Recurrent Neural Networks, preprint of accepted paper for ESANN 2018) Pablo: "In this talk I will discuss how to train deep neural networks to learn crossmodal expectation in a way to enforce the unisensory learning and coincident multimodal binding. Also, how to integrate concepts such as crossmodal correspondence, semantic congruence and unity assumption into the learning mechanisms of the network, and how this helps to describe congruent and incongruent stimuli." (Paper: "Expectation Learning for Adaptive Crossmodal Stimuli Association", preprint of accepted paper for IJCNN/WCCI 2018)
10.04.18	Modelling Affective Cores for Behaviour Modulation in Social Robots (MSc Defence)	Nikhil Churamani
Emotions play an important role in every-day human conversations. They are used by people to express intent and meaning, providing a context to the interaction. Understanding this emotional context will enable robots to provide a natural inter- action experience to the users. Furthermore, using such an understanding of their affective behaviour shall enable the robot to form intrinsic models of emotional appraisal, such as an affective memory and intrinsic mood, helping them improve their performance in Human-Robot Interaction (HRI) scenarios. In this thesis, a deep, hybrid neural architecture is presented for realising emotional appraisal in a robot. The robot models multi-modal perception capabilities to extract the affective context from its interactions with the users. This is used to build an affective memory for the user, modulating the estimation of an intrinsic mood in the robot. This work further proposes modelling of an affective core, acting as a behavioural bias instilled in the robot. Similar to humans, this affective core of the robot influences emotional appraisal, depending upon how it perceives the environment, given its inherent behavioural disposition. The effect of waiting-time perception and social conditioning of the robot is explored, influencing its emotional appraisal. The affective core biases are shown to model significantly different estimates of the emotional state of the robot for the same emotional stimuli, influencing its affective understanding of the environment. To further substantiate this influence on the interaction capabilities of the robot, the Ultimatum Game scenario is realised with the Neuro-Inspired Companion (NICO) robot learning to negotiate with human participants. The robot is shown to offer a varying interaction experience to the users under different affective core biases. This thesis proposes the use of different behavioural biases on the emotional appraisal of the robot. These biases are acquired as a result of the robot?s repeated interactions with humans and influence how it appraises these interactions with them. Such adaptive capabilities in the robot can prove to be useful in realising different personality traits, contributing positively towards improving human-robot interactions.
03.04.18	Reinforcement Learning for Incremental Dialogue Management (MSc Defence)	Thi Linh Chi
Misunderstanding and non-understanding, the communication errors, has a dramatic impact on user satisfaction during human-robot interaction. To solve thisissue, it is proposed to use a deep Reinforcement Learning alg orithm for the dialogue manager to identify accurately the robotic agents level of understanding. It combined information retrieved from user requests with visual cues from the interaction scene to select an appropriate action. The next course of action is to request for clarification then update clarification from the human agent until the robotic agent achieves a full understanding. The deep reinforcement learning modelwith visual cue supp orts significantly minim ized misunderstanding errors. It also included different enhancements, prioritized memory replay and auxiliary loss, to optimize image processing part in the complex deep reinforcement learning model.

Winter Semester 2017/2018

Winter Semester 2017/2018

27.03.18	Research Talk	Tobias Hinz
"Understanding and interpreting image contents requires both recognizing individual entities within an image and their relationships with each other. While there has been some interest at visual relationship detection in the past years, current models need fully labeled data sets and are fully discriminative. Furthermore, recent work indicates that "simple" feed-forward networks like convolutional neural networks are not able to "understand" visual relations on a conceptual level. Rather, it seems that they learn relationships by heart and can not generalize them to novel scenes or objects. We propose using generative models to facilitate learning visual relationships even in (partially) unlabeled data sets. Additionally, we will incorporate attention mechanisms to enable the system to focus on individual objects and relationships. With the combination of generative models and attention mechanisms we hope to avoid the drawbacks of "simple" feed-forward mechanisms which only get to see the input once, while also being able to make use of unlabeled images in learning visual relationships." Related papers: - Lu, Cewu, et al. "Visual relationship detection with language priors." /European Conference on Computer Vision/. Springer, Cham, 2016. - Santoro, Adam, et al. "A simple neural network module for relational reasoning." /Advances in neural information processing systems/. 2017. - Ricci, Matthew, Junkyung Kim, and Thomas Serre. "Not-So-CLEVR: Visual Relations Strain Feedforward Neural Networks." /arXiv preprint arXiv:1802.03390/ (2018).
20.03.18	WTM Student Spottalks	WTM Students
Yong Wang: Towards Unsupervised Learning of Action-relevant Representations Guided by Self-motion Information (advisor: Cornelius) Erik Fließwasser: Transfer Learning of Face Representations Using a Semi-Supervised Strategy (advisor: Pablo) Ahmed Elshinawi: Multi-Attribute Face Generation Using a Conditional BEGAN (advisor: Pablo) Theresa Pekarek-Rosin: Effect of a Humanoid's Active Role during Learning with Embodied Dialogue System (advisor: Matthias) Joan Tomas Pape: Continuous Deep Reinforcement Learning with Normalized Advantage Functions for Robot Grasping (advisor: Matthias)
13.03.18	Introduction Talk: Articulatory features for robot speech recognition	Leyuan Qu
Current automatic speech recognition systems make use of a single source of information about their input, viz. a preprocessed form of the acoustic speech signal, which encodes the time-frequency distribution of signal energy. And in a real environment, ambient noise and reverberation usually degrade quality and reliability of speech recognition. Articulatory features (AFs) composed of probabilities (or, more generally, scores) are abstract classes describing the most essential articulatory properties of speech sounds, e.g. voiced, nasal and rounded. They are independent of acoustic variation such as speaker-dependent spectral differences, background noise, and room reverberation. In this talk, I will present and discuss how to extract AFs from speech using neural networks and apply AFs to robot speech recognition.
06.03.18	Generic Robot Walking using Central Pattern Generation with Neural Networks	Bente Reichard
This thesis presents a new bio-inspired approach for robot motion. The idea is to model the natural movement generation of humans closely. The natural central pattern generators are modeled with two artificial neural networks. This is based on a medical approach predicting muscle activity of a walking human. Ideal muscle activation patterns for humans are known, for robots these need to be discovered. This is done using a genetic algorithm. Thus, not only the generation but also the learning is natural and adaptable.
27.02.18	Research Spottalks	Xiaomao Zhou, Burhan Hafez
Xiaomao: "Advances in neuroscience have long suggested that animal navigations are related to place and head-direction cells which mainly inhabit in animal's hippocampus and exhibit space-related firing properties. To explain their functional roles in navigation, we propose a self-organizing method for robot navigation based on learned place and head-direction cells, which includes modeling hippocampal cells from visual inputs, spatial learning through self-organization, neural planning and action control. Preliminary results from simulation show that our method can navigate a robot to the desired position smoothly and efficiently. This spottalk will mainly elaborate how to implement the developed navigation system on a real robot platform." Burhan: "Fast online learning of continuous visuomotor skills in sparse reward environment is an open challenge for existing reinforcement learning algorithms. We address this by extending our Intrinsically motivated Continuous Actor-Critic algorithm using representations learned with a hierarchical SFA network and a proposed convolutional autoencoder-based actor-critic network, where we obtain state-of-the-art results on the task of reach-for-grasping from raw visual input."
20.02.18	WTM Student Spottalks	WTM Students
Alexandra Lindt - Facial Expression Editing Using a Conditional Adversarial Autoencoder Meike Zimmermann - Tide and Storm Surge Prediction for Hamburg with Artificial Neural Networks Emil Gasanov - Haptic Object Classification with Convolutional Neural Network using humanoid Sensor Hand
30.01.18	Hierarchical Control for Bipedal Locomotion using Central Pattern Generators and Neural Networks	Sayantan Auddy
The walking movement of humans is graceful and robust. This movement is the result of synchronization between the neural mechanisms which generate rhythmic motion, and the dynamics of the skeletal structure. The overall movement is optimized by high-level control centers. Drawing inspiration from this mechanism, a hierarchical controller for bipedal locomotion of robots is proposed in this thesis. Artificial central pattern generators (CPGs) mimic the behavior of the neural mechanisms which produce rhythmic motion in animals. Many existing methods use CPG networks for bipedal locomotion, but most of them focus solely on the CPGs. The proposed controller augments the functionality of a CPG network by adding a novel high-level controller on top of it. Thus, at the lower level, a CPG network with feedback pathways controls the individual joints. The parameters of the CPG network are optimized by a genetic algorithm. At the higher level, a neural network modulates the output of the CPG network in order to optimize the robot's movements with respect to an overall objective. In this case, the objective is to minimize the lateral deviation while walking. The neural network is trained using reinforcement learning. The proposed controller was successfully used to produce stable bipedal locomotion for the NICO robot in simulation. Results of experiments show that the high-level controller was able to improve the performance of the low-level CPG network. Additionally, by comparing the performance of CPG networks with and without feedback mechanisms, the relative effectiveness of low-level feedback has been shown. The proposed controller is not strongly coupled to a particular robot model and is modular in design. The results obtained, by using this controller in simulation, encourage its use on the physical robot in the future.
23.01.18	Personalization of a digital social assistant	Maham Tanveer
The thesis will fill the gap with existing research in emotion recognition and emotion modeling in social robotics by implementing the concept of personalization, and suggesting a stand alone architecture / model integrated with an already existing social assistant in a home environment. The re- search will focus on emotion recognition (primary and secondary emotions from text) and personalization in socially intelligent robots (having deep cognition models like humans and offer human style of social intelligence). With the concept of personalization come the challenges like learning, adap- tion and memory. Socially Assistive Robotics (SAR) systems should adapt to each users physical and cognitive abilities and in this process should learn the patterns from interactions with users (and other robots for example) to boost its own attempts at interaction with its environment; referred to as personalized social interaction [14]. The learning process for SAR should hence be adaptive. The SAR should be able to store the emotional input from user in some form of memory, and use that effectively when taking decisions. The proposed architecture would effect the software part as well as hardware of the digital social assistant e.g. through head gestures (nodding if happy and shaking if sad), through LEDs and lights etc. Such an integration of emotions recognition and modeling into the existing architecture of such digital social assistants can be used to analyze where emotions can be useful in the overall architecture, areas where the emotional data can be used in the architecture, as well as correlation of personalization and emotion recognition with the assistants performance experienced by its users.
22.01.18	Bachelor Thesis Defenses	Yan Li, Jose Rodriguez, Tim Reipschläger
Can robot nociception increase the human-likeness of grasping postures? Abstract: Today a multitude of different humanoid robots exist, built within university projects or by companies. A commonly used technique to perform tasks with a robot is to train a neural network, which handles the robot?s actuators to fulfill the task. However, in most cases, only the looks of the humanoid robot are human-like. During tasks robots have non-human-like postures and motions because the main focus is to fulfill the task in a fast and accurate way. Even if human-likeness is desired, there is no human-likeness evaluation tool, which is commonly used or accepted. This work compares the postures generated by different reinforcement learning conditions during a grasping task in regards to human-likeness, to determine if a condition involving nociception is able to generate more human-like postures than other conditions. For the human-likeness evaluation of robot postures, an ergonomic assessment is used under the hypothesis that humans tend to have grasping postures, that are not painful and do not strain muscles or joints more than needed. Comparison of Design Methodologies of Echo State Networks Abstract: In this work, we compare the design methodologies of Echo State Networks proposed by Ozturk et al. (2007), Boedecker et al. (2009) and Yildiz et al. (2012). We test their capability to predict Mackey-Glass time chaotic series and to learn Elman's grammar, which are standard benchmarks. Bio-Inspired Auditory Signal Processing for Speech Recognition Abstract: With the advances in technology, speech is becoming a standard way to interplay with computers. For the human being, communication through speech has been around for thousands of years. Years of evolution have reproduced a fascinating speech recognition system that performs well in noisy environments, while Automatic Speech Recognition tools perform poorly. However, it is an inevitable problem for computers to endure noisy environment. Known models of the peripheral auditory system and speech enhancement neural approaches have the potential of improving verbal interaction with a computer. This thesis presents the possibility of improving the Word Error Rate of the Google Speech Recognition web API with the introduction of a bio-inspired filter.
16.01.18	Vector Representation of Chat Dialogs for Text Classification	Fabio Wendt
In the task of text classification, models like Word2Vec and Doc2Vec can be used to produce and shape a vector representation of words and documents. Such vectors can be used in a classifier, like a multilayer perceptron, to automatically classify new texts into different categories. For this work, we are working with chat dialogs, which are special forms of text. Their structure differs from commonly used texts for such a task. Chat dialogs are shorter, contain spelling errors, and are written by two authors. This work will find out how to use Word2Vec and Doc2Vec as a base for a new model that produces vector representations for chat dialogs and is used successfully in an MLP. Therefore different models were tested with different parameter settings. The best models then were tested with the classifier. We have found two models that yield to an accuracy of around 90% for the MLP. The first model uses the word vectors trained by Word2Vec to produce a vector for chat dialogs. The second model combines two Doc2Vec models and merges the dialog vectors into one.
09.01.18	Object Detection and Pose Estimation with Convolutional Neural Networks Learned from Synthetic Data	Josip Josifovski
Instance-based object detection and fine pose estimation is still an active research problem in computer vision. While the traditional interest-point-based approaches perform reasonably well under controlled environments and for rigid objects with detailed textures, those approaches fall short when the environment is less-constrained or the objects of interest are texture-less or partially-deformable. On the other hand, CNN-based approaches have shown impressive results in more general object recognition tasks like image classification and category-based object detection and coarse pose estimation, with strong generalization abilities across different object properties and scenes. Still, their performance in those general problems is not quite at the stage of very high precision. This thesis looks into the possibility of getting the best of both worlds, by reducing the scope of application of CNNs from category-based to instance-based while in the same time maximizing their performance and achieving instance-based fine pose estimation in situations where traditional interest-point approaches perform poorly. The main problem of applying a CNN-based approach in this scenario is that it requires a lot of precisely annotated training data and producing such data manually requires a lot of effort. To overcome this problem, a typical approach is to simulate the whole image acquisition process and produce a large amount of training data. But the question that arises is whether the model trained on such synthetic data is applicable in real-world scenario. To analyse this question, we propose a CNN-based pose estimation model that is trained with synthetic data produced from the (possibly imperfect) 3D models of the objects. We also propose a methodology under which we can carefully examine how different parameters of the model, the synthetic data and the training procedure influence on the results of the model on real-world data and whether the model trained with synthetic data can be successfully applied on actual images of the objects. Our results show that the proposed model can be trained with synthetic images of the objects and still be successfully applied in a real-world scenario. Based on the results, we present more general insights about training neural models with synthetic images for application on real-world images.
22.12.17	Neural Forecasting of the Sharpness of Images Taken by Walking Humanoid Robots	Daniel Kreischer
Inspired by how the human brain applies saccadic masking to suspend the processing of blurred visuals during movement of the eyes, this thesis lays the foundation for a selective approach of delaying the capturing of camera frames on a humanoid robot in order to avoid phases in the gait that produce heavily blurred images. Beneficial retarding of individual frames in contrast to skipping them increases the average sharpness of images taken, without significantly decreasing the frame rate. Being a passive technique, the procedure is especially suitable for simple robots which otherwise lack capabilities like mechanically stabilising the imaging device or applying reconstructive post-processing. In order to quantify the ability of the platform to capture sharp images at a particular moment, we neurally predict the sharpness of frames from the current and prior values of the inertial and joint sensors and the actuators of the robot. We use a no-reference (NR) metric that produces comparable values across images with varying content to extract sharpness values from captured images. The results are then used to train an echo state network (ESN) with different learning techniques. The sensor values used as inputs are sampled at a higher rate than the camera frames from which the target values for the neural training are derived. Hence, the training data contains inputs for which the correct outputs are unknown. In this thesis, we develop an offline learning algorithm that copes with such missing target values. We compare it to online learning techniques trained on incomplete data and show that it performs comparable to training on complete data sets. The amount of images that can be captured continuously is limited by the memory of the robot. This leads to short sequences of training data. We show that looping these sequences enables comparable performance to learning in conventional setups. Forecasting image sharpness from the sensor values of the robot proved to be challenging. Due to the robot drifting while walking, the motives of the recorded images varied widely, from clear views of the surroundings to featureless close-ups of obstacles. This created a complex target signal for the training. The precision of the sharpness values forecast by the network trained in this thesis are insufficient for it to be deployed on a robot. The developed offline learning algorithm, however, is able to reproduce recognisable features of the target function.
04.12.17	Multisensory Interaction as an Emergent Property of Cortico-Collicular Self-Organizing Learning	German I. Parisi
The brain integrates information from multiple sensory modalities yielding a coherent, robust, and efficient interaction with the crossmodal environment. Multisensory interaction is progressively fine-tuned throughout the lifespan. Low-level stimulus characteristics (e.g., spatial proximity and temporal coincidence) are available before the formation of learned perceptual representations to bind increasingly more complex higher-level characteristics (e.g. semantic congruency). From a computational perspective, multisensory integration from spatio-temporal properties of stimuli has been widely studied, resulting in a generous number of subcortical processing approaches such as the superior colliculus. Similarly, approaches for binding high-level representations at a cortical level have been proposed. However, cortical and subcortical models have been proposed separately and the interaction of cortico-collicular representations has received considerably less attention. In this talk, I will present and discuss a model developed in collaboration with the Biological Psychology and Neuropsychology group of UniHH. We propose a neural network architecture that learns the interaction of spatial-associative representations to perform causal inference and binding of audiovisual stimuli. This model shows that complex multisensory functions can emerge as a property of self-organizing learning in a cortico-collicular architecture. In the subcortical layer, topographically arranged networks account for the interaction of audiovisual stimuli in terms of population codes. In the cortical layer, congruent audiovisual representations are obtained via the experience-driven development of feature-based associations. Our novel proposal is that activity-driven levels of associative congruency (obtained as a by-product of the self-organizing neurodynamics) can be used as top-down modulatory projections to the subcortical layer. This results in semantically related audiovisual pairs yielding a higher level of integration than unrelated pairs (unity effect). I will present current results and a series of planned experiments to validate our model for multisensory interaction in complex environments.
21.11.17	Improved Interpretation of Robot Commands via Crossmodal Validation with a Spatial Planner	Julian Tobergte
As the desire for a flawless human-robot interaction arises from the accelerated technological progress, robots need a sophisticated understanding of natural language when interpreting the instructions given by a person. However, due to the fact that natural language is ambiguous by design, and human speech is transmitted via a noisy channel, issues can always emerge. Therefore, an internal representation of the interpretation of an instruction needs to be flexible, but still appropriate for evaluating the exact parameters for executing it. In this thesis a declarative representation, which uses logical predicates to express relations, is evaluated in terms of feasibility for inferring execution parameters of an instruction, and in terms of perceptibility of ambiguity. This is realised in the context of a 3D Blocks World, for which collected commands directed at a robotic arm were annotated with semantic interpretation data. Each command directed the robot to move a single object out of a constellation of cubes and prisms with different colors. The conducted experiment, consisting of the evaluation of the presumed start and goal position of the moved object for all commands, shows that the chosen representation is suited for the aforementioned task. The implemented system was able to immediately find the correct solution in 99.67% of the commands in the test data, while being able to spot ambiguous input in the remaining cases. Additionally, a concept for disambiguating the remaining commands is formulated.
07.11.17	Spot Talks and Independent Study Presentations	WTM Students
Fares Abawi - Surprisal in Recurrent Neural Networks (Independent study, advisor Tayfun) Sebastian Springenberg - Attention Mechanisms in Neural Encoder-Decoder Architectures (Independent Study, advisor Cornelius) Nikhil Churamani - Modelling Affective Cores for Action Modulation in Social Robots (Independent Study, advisor Pablo) Franziska Papenhausen - Artificial neural networks for collision detection and current intensity prediction in robotic motors (BSc thesis, advisor Sven) Tobias Knöppler - Hierarchical Object tracking (HIOB) on the NICO robot (BSc thesis, advisor Stefan H.) Patrick Eickhoff - Speech Enhancement Generative Adversarial Network for the WTM Robots (BSc thesis, advisor Cornelius)
10.10.17	Behavioral patterns and neural mechanisms underlying conflict processing	Di Fu
Cognitive control (also called “executive function “) is a form of top-down information processing with functions like storage, plan and manipulation to successfully produce the appropriate response depending on the current goals or long-term plans. When people have to overcome automatic behavior (like reward seeking, addiction, etc.), or the current task is unfamiliar, dangerous, or related with decision-making processing, cognitive control plays a role of vital importance. In psychological experiments, researchers explore the inner mechanisms of cognitive control by studying cognitive conflict processing. Stimulus-response compatibility (SRC) tasks are always adopted and modified by controlling parameters like conflict types or conflict effect to mimic complex conflict circumstances in the real life. My talk will present previous work and findings in the field of cognitive conflict processing, including the paradigms we have designed, developing behavioral patterns, task-related fMRI and resting-state fMRI results, and theory models we had proposed. Finally, I will introduce the plan and progress of some new A5 experiments in the context of the CML project.

Summer Semester 2017

Summer Semester 2017
26.09.17	Realizing Inverse Kinematics For a Humanoid Robot With a Neural Network Trained By Backpropagation of Pose Errors	Vinh Ngu
This talk presents the inverse kinematics solution using neural networks for a robotic arm in 3-dimension. The presented algorithm uses a pre-trained neural network that has been trained beforehand to solve forward kinematics to fine-tune another neural network which is learning inverse kinematics. Therefore, a neural network is firstly trained to learn the forward kinematics. So with a given sequence of joint configurations the trained neural network is able to return a position and orientation in Cartesian space. Afterwards, the pre-trained neural network will be fully connected with another feed-forward multilayer perceptron. This means, the input layer of the pre-trained network is connected with the output layer of the network that is to be trained. Finally, the new neural network, consisting of an untrained structure and a trained one (pre-trained neural network) in the rear part, can be trained supervised using pairs with identical position and orientation. Due to this architecture, the neural network in the front (untrained) will be trained through back-propagation to solve inverse kinematics. The algorithm suggests a transferable concept for other problems in the same category as inverse kinematics. Furthermore, this algorithm has the advantage of using unprepared training samples. The results show that this approach leads to good accuracy for the tackled learning problem.
15.09.17	Intelligent System for Monitoring Persons with Impaired Mobility using Soft Deep Learnin	Bruno Fernandes
Bedridden patients can present physical and mental illnes because of this condition. There are some protocols for avoiding such issues, but either they are to expensive, sensitive to human error or just not applied because of its complexity. We propose the development of a system with techniques of artificial intelligence and pattern recognition to identify the emotional state and posture of bedridden patients, based on the identification of facial and body gestures, called SIAPA (Sistema Inteligente de Apoio ao Paciente Acamado, Intelligent Patient Support System ). The recognition of emotion will enable the patient's emotional state to be inferred so that the proposed system can interact with it and recommend activities that combat a possible depression. Recognition of body gestures will allow SIAPA to identify the patient's posture on the bed over time, verifying the possibility of the appearance of pressure ulcers due to the fact that the patient remains in the same position for a long period of time. In this presentation, we will show the developmental steps of the SIAPA and give a brief introduction about the University of Pernambuco, host of such project.
05.09.17	Neural Relation Extraction with Multi-lingual Attention	Yankai Lin
Relation extraction has been widely used for finding unknown relational facts from the plain text. Most existing methods focus on exploiting mono-lingual data for relation extraction, ignoring massive information from the texts in various languages. To address this issue, we introduce a multi-lingual neural relation extraction framework, which employs mono-lingual attention to utilize the information within mono-lingual texts and further proposes cross-lingual attention to consider the information consistency and complementarity among cross-lingual texts. Experimental results on real-world datasets show that our model can take advantage of multi-lingual texts and consistently achieve significant improvements on relation extraction as compared with baselines.
05.09.17	The Impact of Personalisation on Human-Robot Interaction in Learning Scenarios	Nikhil Churamani
Advancements in Human-Robot Interaction involve robots being more responsive and adaptive to the human user they are interacting with. For example, robots model a personalised dialogue with humans, adapting the conversation to accommodate the user's preferences in order to allow natural interactions. This study investigates the impact of such personalised interaction capabilities of a human companion robot on its social acceptance, perceived intelligence and likeability in a human-robot interaction scenario. In order to measure this impact, the study makes use of an object learning scenario where the user teaches different objects to the robot using natural language. An interaction module is built on top of the learning scenario which engages the user in a personalised conversation before teaching the robot to recognise different objects. The two systems, i.e. with and without the interaction module, are compared with respect to how different users rate the robot on its intelligence and sociability. Although the system equipped with personalised interaction capabilities is rated lower on social acceptance, it is perceived as more intelligent and likeable by the users.
22.08.17	Generative Neural Models for Scene and Concept Understanding	Tobias Hinz
Humans are exceptionally good at understanding and effectively describing complex visual scenes, i.e. to infer the salient entities and their characteristics. They can deduce how objects in the scene interact and what kind of relationship they have with each other. Humans are able to learn this in a mostly unsupervised manner and can even identify and focus on certain aspects of a scene based on external contexts such as questions. Capabilities like this are needed to understand complex visual scenes and to extract knowledge from them, e.g. to answer questions or to reason about a scene. In order to acquire these skills we need to develop methodologies that can learn the underlying concepts and relationships, preferably in an unsupervised way. For this we need good representations for visual scenes which enable us to reason about the underlying concepts. We propose a framework based on generative models, that can learn interpretable and modifiable representations in a mostly unsupervised manner based on scenes and associated data, e.g. questions or commands. These representations then enable us to reason about individual concepts and allow for modifying these concepts and contents of a scene in a controlled manner. Finally, the methodologies developed for learning these representations can then be used for a multitude of applications, such as un-/semisupervised learning, scene labeling (conditioned on external information), command execution, visual question answering, data augmentation, and data generation.
15.08.17	Hey Robot, Why Don't You Talk To Me?	Hwei Geok Ng
This talk introduces the techniques used in an interaction scenario, realised using the Neuro-Inspired Companion (NICO) robot. NICO engages the users in a personalised conversation where the robot always tracks the users' face, remembers them and interacts with them using natural language. NICO can also learn to perform tasks such as remembering and recalling objects and thus can assist users in their daily chores. The interaction system helps the users to interact as naturally as possible with the robot, enriching their experience with the robot, making it more interesting and engaging.
25.07.17	Student Thesis Spottalks	WTM Students
Josip Josifovski - Object detection and pose estimation with convolutional neural networks learned from synthetic data (advisor Matthias) Maham Tanveer - Personalization in a digital social assistant (advisor Sascha) Andreas Grenzing - Neuro-Symbolic Learning of Action Planning (advisors Matthias, German)
18.07.17	An adaptive emotional framework for social robots	Henrique Siqueira
Human-Robot Interaction (HRI) has been receiving increasing attention from scientific communities and industries. This attention can be explained by the growing number of robots in our daily lives, from robots that share workspace with humans in the industrial sector to sociable companion robots living with the elderly in nursing homes. In the context of social robots, one of the most difficult challenges is to make the communication capability of robots as natural as possible. Although the modeling of emotional behaviors through emotion recognition demonstrated several benefits and significant improvements in HRI, the majority of recent approaches have used strongly supervised learning techniques that are not able to adapt to the large variation between individuals and situations in real world applications. A robot capable of understanding these variations could conduct a more natural interaction by given the best-tailored response to a person, specially to the elderly in which the emotional variations are less evident due to physical or cognitive fatigue. To give a robot not just the ability to perceive emotions, but also to adapt and respond accordingly to a person and a situation, we propose an adaptive emotional framework for social robots that will be able to learn emotional variations through interactions using face and upper-body features. The framework will be composed of different artificial neural architectures, including a feedforward for face detection and emotion recognition, an auto-associative for face recognition, and a recurrent architecture for decision-making. The neural networks will be pre-trained using public datasets, and their training will be continued during social interaction. Finally, the proposed framework will be evaluated using the Interaction Quality concept that is the definition of a single metric to measure the effectiveness of an HRI.
18.07.17	Learning Natural Language and Prosodic Features for Human-Robot Interaction	Alexander Sutherland
Emotion is a vital aspect of establishing empathy and understanding between humans in interactions, which in turn leads to humans experiencing those interactions more favourably. In order for robotic agents to fit more naturally into the daily life of humans, they should possess an ability to detect emotions in humans and respond to these emotions accordingly. The goal of responding to these emotions is to reduce frustration and improve interaction quality in Human-Robot Interaction (HRI). Speech is one of the primary forms of communication between humans and is also one of the most emotionally salient, making it particularly interesting when attempting to classify emotions in humans. We will present an overview of our research, which focuses on finding emotional features in speech, primarily in language, and how we intend on using an ensemble architecture to analyse these features from temporal and symbolic perspectives. Furthermore we will touch on the subject of how, once found, emotions can be used to adapt and steer dialogue to make HRI interactions more enjoyable for humans.
11.07.17	Künstliche Neuronale Netze zur Vorhersage der Stromstärke von Robotermotoren	Aleksej Logacjov
Der NICO-Roboter des Arbeitsbereiches Knowledge Technology ist ein humanoider Roboter, der für diverse Aufgaben entwickelt wurde. Durch Bewegungen seiner Gliedmaßen können Kollisionen mit Hindernissen entstehen. Der Roboter besitzt jedoch keine Kollisionserkennung, um reagieren zu können. Dies kann dazu führen, dass Schäden am Roboter entstehen. Diese Problematik diente dieser Arbeit als Motivation. Kommt es zu einer Kollision, steigt die Stromstärke der Motoren in den Gelenken des Roboters an. Mittels Vorhersagen der Stromstärke, können Rückschlüße auf eine mögliche Kollision getroffen werden. Zunächst wurde ein MLP trainiert, welches die Stromstärke eines Kopfmotors vorhersagen konnte. Anschließend wurde ein weiteres MLP trainiert, mit dessen Hilfe eine Vorhersage von vier Motoren im Arm getroffen werden konnte. Beide Netze lieferten vielversprechende Ergebnisse. Die Kollisionserkennung wurde mittels der Vorhersagen und einem weiteren MLP durchgeführt. Diese erkannte erfolgreich Kollisionen von bestimmten Bewegungsabläufen.
10.07.17	Doctoral Defense: Teaching Robots With Interactive Reinforcement Learning	Francisco Cruz Naranjo
We investigate learning approaches, more specifically interactive reinforcement learning to perform a domestic task. We use parent-like advice to explore two set-ups: agent-agent and human-agent interaction. This thesis contributes to knowledge in terms of studying the interplay of multi-modal interactive feedback and contextual affordances. Overall, we investigate which parameters influence the interactive reinforcement learning process and show that the apprenticeship of reinforcement learning agents can be sped up by means of interactive parent-like advice, multi-modal feedback, and affordances-driven environmental models.
04.07.17, 11 am, room F-132	Predicting values and controling actions: an overview of bio-inspired models of mnemonic synergies	Frédéric Alexandre Research Director Mnemosyne team Inria Bordeaux
Mnemosyne is a team in computational neuroscience in Bordeaux, hosted in the NeuroCampus with neuroscientists and medical people. Our scientific positioning is about Mnemonic Synergy (hence Mnemosyne), the study of interactions between different kinds of learning and their organization in the brain in a system of memories. I will explain in more details this positioning and give some examples of current researches in the team, including modeling pavlovian conditioning by considering the different kinds of afferences of the amygdala and modeling operant conditioning through interactions between the prefrontal cortex and the basal ganglia.
27.06.17	Student Thesis Spottalks	WTM Students
Yang Li - Comparison of Design Methodologies of Echo State Networks José Rodriguez - Bio-inspired Auditory Preprocessing for Speech Recognition Tim Reipschläger - Learning Inverse Kinematics with Human-like Poses Fabio Wendt - Vector Representation of chat dialogues for text categorization Julian Tobergte - Improved Interpretation of Robot Commands via Crossmodal Validation with a Spatial Planner
20.06.17	Reinforcement Learning using Symbolic Representation for Performing Spoken Language Instructions	Mohammad Ali Zamani
Spoken language is one of the most efficient ways to instruct robots about performing domestic tasks. However, the state of the environment has to be considered to successfully plan and execute the actions. We propose a system which can learn to recognize the user's intention and map it to a goal for a reinforcement learning (RL) system. This system is then used to generate a sequence of actions toward this goal considering the state of the environment. Symbolic representations are used for both input and output of a Deep RL module. To show the effectiveness of our approach, the TellMeDave corpus is used to train the intention detection model and in a second step train the RL module towards the detected objective represented by a set of state predicates. We show that the system can successfully recognize instructions from this corpus and map them to the corresponding objective as well as train an RL system with symbolic input.
13.06.17	Improving Post-Processing Of Speech Recognition With Lexical Tree Search	Andre Grund
Present approaches to ASR consist of two main models: the acoustic and language model, which are being jointly searched for an optimum solution. Where acoustic models are largely task independent, language models can be restricted to particular tasks. The hybrid ASR-system DOCKS combines strong acoustic models provided by Google Search by Voice and the stationary ASR-system Sphinx-4 to improve the result provided by Google with domain-specific language models and restricted vocabularies. DOCKS can be operated with high ordered statistical n-gram models, but it is currently limited to a flat representation of the search space. This leads to extensive creation of search states and, therefore, to computational inefficiency during decoding, which causes the approach to perform slow regarding execution times. This thesis presents a lexical tree dictionary representation for the search space, which reduces the number of search states and the complexity during decoding of DOCKS and empowers DOCKS to run in real-time. Our experiments show that this approach outperforms the flat representation in execution times for large restricted vocabularies. The results also show that the performance in terms of word error rate is not negatively affected.
06.06.17	Improving the Quality of Crowdsourced Data Acquisition	Kamila Ignatowicz
As the efforts to make human-robot interaction more sophisticated are progressing, large data sets are needed to improve robots’ language capabilities. These data sets need to be of high diversity to make a robot understand a great variety of phrases. Appropriate data is usually specific for a task, and cannot be found in existing data sets, so it has to be collected. Research teams usually do not manage to find a large and diverse crowd to collect the data, so the data sets end up to be monotonous, and robots can only learn a small variety of sentences from them. In recent years, crowdsourcing has emerged as a method to collect large data sets from all over the world online. However, the quality of the generated data is poor, and the workload required to evaluate it too high to make crowdsourcing worthwhile. This thesis shows that crowdsourced data quality can be improved through tutorials and assessment of the work of others, providing simple techniques to make crowdsourcing more feasible for the collection of data sets. The findings suggests that crowdsourcing could establish itself as a method of data collection in the field of human-robot interaction, opening many new possibilities for the lingual training of robots.
06.06.17	Keras Workshop	Tayfun Alpay
Tayfun Alpay: Overview of Current Framework Landscape Pablo Barros: Feedforward Neural Networks in Keras Egor Lakomkin: Recurrent Neural Networks in Keras Erik Strahl: Running Keras, Theano and Tensorflow on Knowledge Technology Servers and Special Workstations - Basic Concept and Usage Manfred Eppe: A Guide to Fast and System-Independent Deployment and Development of GPU-Enabled Tools with Docker
30.05.17	Human-robot language interaction for learning and generalizing safety concepts	Chandrakant Bothe
In a conversation, humans use changes in a dialogue to predict safety-critical situations and use them to react accordingly. We propose to use the same cues for safer human-robot interaction for early verbal detection of dangerous situations. The sentiment helps to guide a decision-making process in the verbal interaction, which can be used as a feedback signal in the dialogue-based context learning. Due to the limited availability of annotated dialogue corpora, we use a classification tool for utterances to neurally learn sentiment changes within dialogues and ultimately predict that of the upcoming utterances. We train a recurrent neural network on context sequences of words, defined as two utterances of each speaker, to predict the sentiment class of the next utterance. Our results show that the model learned safety-cues which can be used for a safer human-robot interaction.
16.05.17	Object Tracking with Convolutional Neural Networks	Peer Springstübbe
Object tracking in video streams is a requirement in many fields of computer science. It is a challenging task because it requires an extraction of visual features from the video stream to identify the tracked object, as well as the ability to adapt to a changing appearance of the object over time. Convolutional Neural Networks (CNNs) have been proven to be powerful at extracting object representing features in image data, and, being a Machine Learning method, they are also able to adapt to a changing of those features over the tracking's duration. The thesis provides a thorough analysis of how a solid CNN based object model can be created during the tracking process by obtaining training data from the video stream. An update strategy is developed that incrementally trains the central CNN time efficiently and avoids bad data samples from being used. A deeper understanding of how the CNN is able to provide an adaptive model is gained throughout the development. In order to conduct the necessary research, a modular tracking framework is constructed, that achieves state of the art results in two commonly used tracking benchmarks.
09.05.17	Comparison of behaviour-based architectures for human-robot collaboration in a package delivery task	Melanie Remmels
Various ways of robotic based package delivery are currently developed in ifferent companies, in start-ups as well as in major enterprises. However, these systems do not collaborate with humans to perform their delivery. In collaborative human-robot tasks, it is of utmost importance that the robot performs the right action at the right time. This behaviour control is typically implemented using the Belief-Desire-Intention-concept (BDI) or state machines. Unfortunately, there is no well accepted common guideline when to use which one of these two approaches. From the literature it follows that state machines are often used for smaller, manageable applications and the BDI concept for hierarchically nested problems. In this thesis, we investigate a collaborative package delivery scenario and evaluate if state machines or the BDI-concept are better suited for this use case. The behaviour, as central module of the use case, has been implemented with PROFETA, a BDI-framework, and SMACH, a state machine implementation for ROS. Both implementations have been analysed and compared in different categories. Our results show, that the BDI-framework was easier to implement, it needed less characters than the state machine implementation. In the dynamic comparison, where the CPU and memory usage have been analysed, both frameworks performed similarly. A big difference has been detected in the comparison of subjective parameters. SMACH is easier to use because tutorials are provided and it is supported by a community. Overall, SMACH should be preferred as framework for the implementation of small use cases like the one presented in this thesis. The BDI concept should be used for larger applications, with an implementation that provides better subjective values than PROFETA.
25.04.17	Weakly supervised acoustic emotion recognition using recurrent neural networks	Egor Lakomkin
A central goal of Human-Robot Interaction (HRI) is to enable robots to be part of our everyday lives. As gestures play a big role in human communication, gesture recognition is an essential task robots should be capable of. Existing gesture recognition work mostly focuses on scenarios with one person in the scene. However, scenes with more than one person are of particular interest, as robots may face multiple people in natural environments. Motivated by the ability of humans to focus on important parts of a scene, this thesis proposes an attention mechanism for dynamic body gestures. For this task, we designed two neural architectures with 2D and 3D convolution kernels. Additionally, we present an approach to train the proposed architectures using a generated dataset. We created this dataset by combining recordings of gestures and generated artificial scenes with multiple people. Our experiments showed that the proposed networks are able to learn an attention mechanism for gestures. 2D convolutions performed well for body gestures with distinct arm positions. 3D convolutions use spatio-temporal features which showed an advantage on subtle gestures with less characteristic arm positions. To demonstrate a potential application of our architectures, we present a concept study to apply the proposed attention mechanism to the task of gesture recognition.
24.04.17	Doctoral Defense: Inspection of Echo State Networks for Dynamic Gestures	Doreen Jirak
We investigate Echo State Networks (ESN), which implement a new training paradigm for recurrent neural networks. We first demonstrate their gesture classification performance considering two feature sets with very distinct complexity. Second, we introduce the recurrence analysis for qualitative and quantitative description of the gesture input and the system dynamics of an ESN, and show that the methodology complements classic stability measures. Finally, we address the reservoir itself and propose an algorithm for pruning connectivity in a one-shot learning scenario.
18.04.17	Structuring and writing a PhD thesis: Guided discussion	Stefan Heinrich, Prof. Stefan Wermter
11.04.17	Syntactic Reanalysis in Language Models for Speech Recognition	Johannes Twiefel
State-of-the-art speech recognition systems steadily increase their performance using different variants of deep neural networks and postprocess the results by employing N-gram statistical models trained on a large amount of data coming from the general-purpose domain. While achieving an excellent per- formance regarding Word Error Rate (17.343% on our Human- Robot Interaction data set), state-of-the-art systems generate hypotheses that are grammatically incorrect in 57.316% of the cases. Moreover, if employed in a restricted domain (e.g. Human- Robot Interaction), around 50% of the hypotheses contain out-of- domain words. The latter are confused with similarly pronounced in-domain words and cannot be interpreted by a domain-specific inference system. The state-of-the-art speech recognition systems lack a mecha- nism that addresses syntactic correctness of hypotheses, which would be somehow comparable to the P600 Event-Related- Potential (ERP) that could be recorded from the brain. We propose a system that is inspired by this P600 ERP and can detect and repair grammatically incorrect or infrequent sentence forms. It is inspired by a computational neuroscience model that we developed previously. The current system is still a proof- of-concept version of a future neurobiologically more plausible neural network model. Hence, the resulting system postpro- cesses sentence hypotheses of state-of-the-art speech recognition systems, producing in-domain words in 100% of the cases, syntactically and grammatically correct hypotheses in 90.319% of the cases. Moreover, it reduces the Word Error Rate to 11.038%.

Winter Semester 2016/2017

Winter Semester 2016/2017
28.03.17	A Self-Organizing Model for Affective Memory	Pablo Barros
Emotions are related to many different parts of our lives: from the perception of the environment around us to different learning processes and natural communication. Therefore, it is very hard to achieve an automatic emotion recognition system which is adaptable enough to be used in real-world scenarios. This paper proposes the use of a growing and self-organizing affective memory architecture to improve the adaptability of the Cross-channel Convolution Neural Network emotion recognition model. The architecture we propose, besides being adaptable to new subjects and scenarios also presents means to perceive and model human behavior in an unsupervised fashion enabling it to deal with never seen emotion expressions. We demonstrate in our experiments that the proposed model is competitive compared with the state-of-the-art approach, and how it can be used in different affective behavior analysis scenarios.
14.03.17	Detecting and Localizing Body Gestures in a Multi-Person Scenario Using a Deep Neural Network	Florian Letsch
A central goal of Human-Robot Interaction (HRI) is to enable robots to be part of our everyday lives. As gestures play a big role in human communication, gesture recognition is an essential task robots should be capable of. Existing gesture recognition work mostly focuses on scenarios with one person in the scene. However, scenes with more than one person are of particular interest, as robots may face multiple people in natural environments. Motivated by the ability of humans to focus on important parts of a scene, this thesis proposes an attention mechanism for dynamic body gestures. For this task, we designed two neural architectures with 2D and 3D convolution kernels. Additionally, we present an approach to train the proposed architectures using a generated dataset. We created this dataset by combining recordings of gestures and generated artificial scenes with multiple people. Our experiments showed that the proposed networks are able to learn an attention mechanism for gestures. 2D convolutions performed well for body gestures with distinct arm positions. 3D convolutions use spatio-temporal features which showed an advantage on subtle gestures with less characteristic arm positions. To demonstrate a potential application of our architectures, we present a concept study to apply the proposed attention mechanism to the task of gesture recognition.
10.03.17	Doctoral Defense: Multimodal Learning of Actions with Deep Neural Network Self-Organization	German I. Parisi
Understanding other people's actions plays a crucial role in our everyday lives. Human beings are able to reliably discriminate a series of socially relevant cues from body motion, with this ability being supported by highly skilled visual perception and other modalities. The main goal of this thesis is the modeling of artificial learning architectures for action perception with focus on the development of multimodal action representations. As a modeling foundation to address our research question, we focus on hierarchies of self-organizing neural networks motivated by experience-driven cortical organization.
28.02.17	End-to-end Self-Learning of Visuomotor Skills through Interaction with the Environment	Matthias Kerzel
Deep learning is dependent on large amounts of annotated training data. For the development of robotic visuomotor skills in complex environments, generating suitable training data is challenging and time-consuming. Deep reinforcement learning tries to alleviate this problem by letting the robot learn unsupervised through trial and error; this, however, comes at the cost of long training times. In this talk, an approach for learning visuomotor skills from explorative interaction with the environment is presented. The robot generates its training data based on initial motor skills acquired through demonstration. End-to-end learning of visuomotor skills is realized with a deep convolutional neural network that combines two important subtasks of grasping, object localization and inverse kinematic, in one integrated architecture.
07.02.17	A Bio-Inspired Robot Navigation System	Xiaomao Zhou
Robot navigation is the core technology and also a difficult task related to robot perception, planning, implementation and many other aspects in mobile robot research. Different from probabilistic approaches which are mainly based on Kalman or Particle Filters, bio-inspired robot navigation is inspired by navigation behavior in animals and makes hypotheses about how the navigation skills are acquired and implemented in an animal's brain and body. Thus it is a promising direction to mimic animal behavior and attempt to achieve their learning ability for intelligent robots. Head-direction cells, place cells and grid cells have been found to play a fundamental role in rat navigation and have been computationally modeled. we aim to develop a navigation system which is based on these biological findings for a real Jackal robot platform. Essentially, robot navigation requires the capabilities of localization, map building, path planning and obstacle avoidance. We intend to just use robot's own cameras and odometry to complete these work, which can significantly simplify the system.
07.02.17	Curiosity-driven Reinforcement Learning for Neural Network-based Robotic Grasping	Burhan Hafez
Reinforcement Learning (RL) helps robots learn behaviors including complex manipulation and navigation tasks autonomously, purely from trial-and-error interactions with the environment and resulting in an optimal policy, which represents the desired behavior of the robot. However, exploring the environment to collect useful experience data is often difficult, especially for grasping tasks where high-dimensional visual data and multiple degrees of freedom are involved. The PhD project proposes to address this RL problem in continuous state and action space by looking into exploration strategies inspired by infant cognition and sensorimotor development, particularly the use of curiosity as an intrinsic motivation. Curiosity plays an important role in child development and we believe it has similar impact on robot learning of grasping skills.
24.01.17	Towards Intelligent Social Robots	Sascha Griffiths
For cognitive science robotics is an interesting tool for studying the human mind while at the same time working on solutions to real world applications of state-of-the-art research. Both social robotics and the related field of developmental robotics have become highly interdisciplinary and one of the main challenges here is that this type of research needs to be informed by both an understanding of the social science and the engineering component. To achieve more natural interaction with robotic systems, inspiration may be sought in the more advanced field of intelligent virtual agents (IVA) or embodied conversational agents (ECA). This community is in closer interaction with the dialogue systems community than the human-robot interaction scene. Further, these interfaces are much more developed with respect to speech-gesture alignment, social variation and naturalness. The main difference, however, is that whereas the robotics community often treats user interfaces and especially speech-interfaces as add-ons in most architectures, the interaction is at the heart of ECAs and tightly integrated into the software architecture. What needs to be achieved for similar systems in robotics is a transfer from the virtual to the physical, taking "real world" constraints of the physical embodiment into consideration. Finally, human-robot interaction can benefit from good research and experiment design in the style of psychological studies.
19.01.17	Using narrative structure to improve comprehension of science articles	Monica Gonzales-Marquez
Why is reading scientific articles so difficult? Most learners experience a deep-rooted anxiety when told they must read scientific texts of any type. When asked, they describe the process as overwhelming, frightening, and intimidating (Negrete & Lartigue, 2004). Yet, when asked to read a novel, no such feelings emerge. This is very likely because readers know what to expect: a progression of events leading to a conclusion of some type. This is clear for two reasons. First, everyone has spent their entire lives submerged in stories. We begin hearing them on our mother's lap, and eventually progress to guided readings of Madame Bovary at university. Second, as many have long intuited, we love stories because our brains are wired for them (e.g. Turner, 1996; Chow, et al, 2014). Note the passionate outrage when a television show in cancelled without an ending! There is increasing scientific evidence (Ibid.) that story is deeply entwined in our psyches, and that in fact, it may be the primary structuring mechanism used by the brain. The experience most readers have with classic narrative is in stark contrast to reading science. Most never receive much training, other than being told repeatedly that what they are reading is not a story. We argue that this assessment is incorrect. If narrative underlies general information structuring, then it should also underlie scientific literature. As such, we propose that once students learn to read science as narrative, their comprehension will increase significantly. To test this prediction, we developed a pedagogical method that used narrative structure to guide readers in constructing the story told in a research article. We also developed a brief questionnaire on general reading comprehension. We tested the efficacy of the method in a three-condition study using three separate university classes. The first received no training, the norm for most students. The second received training on the parts of a research article, i.e. introduction, methods, etc. The third received the narrative training. All groups read actual scientific articles and completed the comprehension questionnaire. The collective responses were treated as a small corpus, and coded for type of language used, type of response, etc., among other features. Our results showed that general comprehension was significantly greater in the narrative condition, as compared to the other two conditions. We will discuss these findings with a view toward further research and classroom applications. References Chow, H. M., Mar, R. A., Xu, Y., Liu, S., Wagage, S., & Braun, A. R. (2014). Embodied comprehension of stories: Interactions between language regions and modality-specific neural systems. Journal of Cognitive Neuroscience, 26, pp. 279-295. Negrete, A. and Lartigue C. (2011). The science of telling stories: Evaluating science communication via narratives (RIRC method). Journal of Media and Communication Studies Vol. 2(4), pp. 98-110 Turner, M. (1998) The Literary Mind: The Origins of Thought and Language. Oxford University Press.
10.01.17	P300 classification in EEG signals using Ladder Networks	Bastian Thiede
In recent advances a new type of semi-supervised neural networks, named Ladder Networks, made it possible for neural networks to utilize unlabeled and labeled data combined in a single cost function. These Ladder Networks achieve state of the art performance on MNIST (a benchmark dataset about handwritten digits) and CIFAR, a dataset containing 80 million small pictures and their respective class. To further test their abilities, EEG data of the BCI-Wadsworth 2004 dataset was used to build a P300 classifier as a core component of a brain-computer interface. The performance was evaluated with all 64 EEG channels as well as using only 14 EEG channels corresponding to the position of electrodes used by the Emotiv Brainwear-EEG. This thesis shows that Ladder Networks have the potential to increase the performance of a classifier even if labeled data is sparse. Computational efforts aside this thesis concludes that it is possible to achieve state of the art results with Ladder Networks while utilizing only 30% labeled data.
22.12.16	Embodied Language in Robotics	Dr. Manfred Eppe
Talk The talk will cover modern challenges of Robotic Natural Language Understanding (NLU) and discuss the importance of grounded and embodied language in robotics. Moreover, recent approaches and principles for building embodied language robots are presented and discussed. Short Bio Dr. Manfred Eppe is currently a postdoctoral fellow at the International Computer Science Institute (ICSI) at the University of California at Berkeley. His research concerns neuronal approaches for semantic data representation and Human Machine Interaction using natural language and other modalities. Before coming to Berkeley, he was based at the Artificial Intelligence Research Institute (IIIA-CSIC) in Bercelona, Spain, within the COINVENT project, where he worked on AI methods for Computational Creativity. He obtained his PhD in Computer Science from the University of Bremen, Germany in 2014, where he gained a background in logical formalizations of epistemic action theory and commonsense reasoning with applications in assistive service robotics and smart environments.
20.12.16	Brain-Inspired Visuo-Haptic Object Recognition	Sibel Toprak
In artificial visuo-haptic object recognition systems, vision and haptics are usually modeled as separate processes. This, however, is far from what actually happens in the human brain, the object recognition capabilities of which we would like such systems to have: A lot of multimodal and crossmodal interactions take place between the two sensory modalities there. Generally, three main principles can be identified as underlying the processing of the visual and haptic object-related stimuli: 1. Hierarchical processing 2. the divergence of the processing onto substreams for object shape and material perception, and 3. the experience-driven self-organization of the integratory neural circuits. The question arises whether an object recognition system can benefit in terms of performance from following a more brain-inspired strategy to combining the visual and haptic input, which we set out to answer in this thesis. We compare the brain-inspired integration strategy with conventionally used integration strategies on data collected with a robot that was enhanced with inexpensive contact microphones as tactile sensors. The results of our experiments involving eleven every-day objects show that 1. the contact microphones constitute a good alternative to capturing tactile information and that 2. the brain-inspired strategy is more robust to unreliable inputs.
13.12.16	Towards Understanding Auditory Representations in Emotional Expressions	Iris Wieser
Conveying emotional expressions in speech is a key element of human communication. However, until today there is no clear consensus among researchers which prosodic and acoustic features are most relevant for emotional expressions in speech. Many state-of-the-art emotion recognition systems only select the hand-crafted features that achieve a higher classification accuracy in a specific scenario. In contrast to these methods, the aim in this thesis is to contribute to a deeper understanding of the acoustic and prosodic features that are relevant for the perception of emotional states. An artificial deep neural network is proposed that allows to learn the auditory features directly from the unprocessed speech signal. For the analysis of the implicitly learned representations three different visualization methods are utilized: filter visualizations, data-driven visualizations, and network-driven visualizations. On the IEMOCAP dataset, the implemented network has shown to be competitive with state-of-the-art approaches. The results indicate that the network has learned novel features that are complementary to hand-crafted features. The analysis gives detailed insights on the learned representations. For example, the visualizations indicate that higher frequencies, a faster speaking rate, and longer utterances are used for more excited emotional states. Also, clear and constant pauses in an utterance indicate a more negative emotional state. The proposed approach is a general method to enable a deeper analysis and understanding of the most relevant acoustic and prosodic features to perceive emotional expressions in speech.
06.12.16	Semi-Supervised Learning with Recurrent Ladder Networks	Marian Tietz
Almost all available data is unlabeled, so that most current neural networks are not able to take advantage of this data. Deep neural networks achieve state-of-the-art results in many areas by means of better hardware utilization and improvement of the models. However, the main driving factor is the amount of available labeled data. Annotating data always involves human labor and is therefore expensive. The natural conclusion to this is to use semi-supervised models that gain information from labeled and unlabeled data alike. Recently, Ladder Networks have shown promising results in semi-supervised learning of convolutional and feed-forward networks. They reach state-of-the-art results with only a small fraction of the available labels. This thesis introduces Recurrent Ladder Networks that add the benefits of semi- supervised training to domains such as speech and language processing. Speech recognition experiments on the TIMIT corpus show that this model is able to increase the relative accuracy by 10% just by adding unsupervised data. Several variations, including a minimal one that requires little implementation effort but still achieves very similar results, are discussed and evaluated for a comprehensive overview of Recurrent Ladder Networks.
29.11.16	A Compressing Auto-encoder as a Developmental Model of Grid Cells	Anthony Kiggunde
The metric representation of space during navigation is attributed to a number of specialized cells located in the hippocampus and entorhinal corti, which represent and integrate measured distance, direction of motion, speed or any other extra inputs from different parts of the mam- malian brain. Neuroscientific evidence has led to the identification of place cells- these cells fire when an animal is in a specific location; head- direction cells - these fire when an animal faces a chosen direction; and grid cells - these form triangular grid-like patterns that tile the entire environment as an animal moves. Biologically, grid cells are organised into modules where the receptive fields of the cells in one module have the same spacing and orientation (between 30 - 60 degrees) but the scale differs in others forming multiple spatially scaled modules that precisely encode position over a large space. Although the mechanisms through which these multiple spatially scaled modules emerge are still of research interest, existing neural models attribute this modular behaviour to odometry such that the change of the triangular tessellating grid cell firing is influenced by the animal's velocity and direction inputs. In our auto-encoder model, we prescribe to evidence suggesting the existence of auto-associative networks within the entorhinal cortex which cohesively support the emerging activity patterns that spatially represent space during navigation. In this contribution, we present an un-supervised learning auto-encoder model that employs recurrent and temporal sequencing mechanisms of an Elman neural network to enable the emergence of multiple spatially scaled modules of grid cells.
22.11.16	MSc thesis defense	Tobias Hinz
Many modern learning algorithms, such as artificial neural networks, require the practitioner to manually set the values of many hyperparameters before the learning process can begin. However, with modern deep neural networks the evaluation of a given hyperparameter setting takes a lot of time and the search space for the optimization procedure is usually very high dimensional. We propose the usage of a genetic algorithm (GA) to optimize the hyperparameters specifically for convolutional neural networks (CNNs). Additionally, we suggest to use a lower dimensional representation of the original data to quickly identify promising areas in the hyperparameter space. We compare the results of the GA and the hyperparameter optimization on lower dimensional inputs to the Tree of Parzen Estimators (TPE) algorithm, an optimization algorithm based on Bayesian probabilities. Our experiments show that the GA using lower dimensional data representations for earlier generations needs less time to find hyperparameters that perform similar or better than those found by the TPE algorithm. Preliminary results also indicate that under some conditions the GA may be better suited to handle external constraints such as memory limitations. Our results suggest that using genetic algorithms and low dimensional data representations for the earlier generations during the optimization process are a promising way to find appropriate hyperparameters for CNNs and a given problem.
08.11.16	Word2Vec and Echo State Network For Thematic Role Assignment	Surender Kumar
Humans have a remarkable capability of acquiring language and in particular more than one language. More interestingly, they learn it within the same neural computing substrate. But, how the structure of a sentence is mapped to its meaning within the brain is still an open issue. To understand this process of language acquisition, several neural language models have been proposed. Most of the existing models are either using syntactic parse trees to create handcrafted features or using localist vector representation of words to process the sentences. Hinaut et al. proposed the θRARes (Thematic Role Assignment Reservoir) neural language acquisition model, for the Thematic Role Assignment (TRA) task. The model learns the thematic roles, i.e., "Who did what to whom", purely from the grammatical structure of the sentences. The semantic information was removed from the sentences by replacing semantic words with a unique token (`SW'). The model process the transformed sentences considering the words as discrete atomic symbols i.e. using localist word representation. The localist representation of the words do not carry any semantic and syntactic information about the words. This thesis proposes an end-to-end neural language model, Word2Vec-θRARes, in an extension of θRARes model. In addition to utilizing syntactic information and modeling temporal aspect of the input sentences, the proposed model also takes into account the semantics of the words. The model contains a Word2Vec unit to generate the distributed vector representation of words which captures the semantic and syntactic relationships. The model also contains a Recurrent Neural Network namely, Echo state Network (ESN), with fixed random connections, which is capable of modeling the sequential data. Thus, the model receives a raw sentence as an input which is processed word by word across time, and the meaning of the sentence is generated as an output. The results of the experiments conducted using the Word2Vec-θRARes model shows that the exposure of semantic and syntactic information to the ESN increases the performance on the TRA task. Apart from this, the Word2Vec-θRARes model also re-analyse the meaning of the sentence on arrival of words. Also, in some cases, the model correctly predicts the meaning of the sentences even before the sentences are finished.
01.11.16	Knowledge Technology Students Spot Talks	Kamila Ignatowicz, Andre Grund Peer Springstübbe, Aleksey Logacjov
25.10.16	Smartphone Controlled Intelligent Robot Platform	Pascal Folleher
Elderly people often need some form of assistance during their daily life. A small affordable assistant robot could help by reminding users to take their medicine or by looking for lost items etc. Other uses for such a robot could be in education or HRI research. This work aims at creating a homogenous robotic platform from heterogeneous hardware and software components. These components include a Raspberry Pi 2, an Arduino board, an Android smartphone, a RGB-D camera and the DOCKS speech post-processing. The ROS framework is used to take care of most communication and to enable the potential use of existing ROS modules. Experiments were conducted to proof that the integration of components is working and that the robot possesses basic navigation capabilities. The experiments show that the aim was mostly achieved but the speech recognition and motor control will need some improvements. The resulting robot is a step in the right direction but more work needs to be done for a practical use as assistance robot or for use in education.
18.10.16	Information theory for neural coding in a nutshell	Sascha Meyen
Neural coding concerns the neural representation of information. A neural network, whether it is artifical or biological, can encode information using different coding schemes such as rate coding or temporal coding. Applying information theory allows researchers to measure how much information different coding schemes convey. If one coding scheme conveys more information than another then researchers can conclude that this coding scheme is used by the neural network. However researchers have to be careful when applying information theory because estimators are biased. This presentation gives an introduction with examples on how to perform an analysis comparing different neural coding schemes using information theory.
11.10.16	Analysis of Spectral Data using Similarity Search	Thomas Klinger
Chromatography is a method from analytic chemistry to purify a mixture of chemical components. Mixtures are processed with a machine and their pure components are extracted. During the analysis, electric signals are captured and recorded. Analyzing data from chemical experiments using Machine Learning algorithms is becoming more and more popular for exploration and intelligent applications due the availability of software tools that can handle large datasets. A method of interest for many applications is to search through a collection of samples and find the most similar one given a new, possibly unknown sample. Chemical software is often based on a traditional way of visualizing the data which makes it difficult to compare two different datasets for similarity. Noise and sparsity make analysis more difficult. To overcome this issues, we explore a novel method for preprocessing and viewing chemical datasets based on a two-step dimensionality reduction method and an image-based representation which combines the classic approaches. Principal Component Analysis (PCA) is the standard method for preprocessing chemical datasets, Linear Discriminant Analysis (LDA) is evaluated as an alternative which performs superior on the underlying dataset. Finally, K-Nearest Neighbors is evaluated for finding the most similar sample using different preprocessed datasets.

Summer Semester 2016

Summer Semester 2016
27.09.16	Emotion Recognition from Body Expressions with a Recurrent Self-Organizing Neural Architecture	Nourhan Elfaramawy
The understanding of affective states plays a key role in social relationships between humans. However, it has been shown to be a very challenging task to model in artificial systems. Accordingly, for robots to communicate in a more natural and proactive way, they must respond to human emotional states. In this thesis, a recurrent self-organizing neural architecture that can effectively recognize affective states from human full-body motion patterns was implemented. A dataset called Bodily Expressions of Emotion (BEE) was collected to evaluate the system. This dataset is composed of full-body movements of 6 different emotions: Anger, Fear, Happiness, Neutral, Sadness, and Surprise, that was recorded using a depth sensor. Fifteen participants who did not take part in the data collection were asked to label the data in order to have a human comparison measure. The BEE dataset was used for training and validating the system. Experimental results showed that the system could learn emotional representations from body motion and that the performance of the system was very close to humans performance when recognizing emotions using the same dataset.
23.08.16	Neural Network Classification Of Materials Through Feedback From 3D Haptic Sensor	Moaaz Maamoon
Touching is a major part of the human sensory system which is used for expanding the human recognizability of the surrounding environment. Touching is used for recognizing different surfaces and it has a contribution in the proprioception which is represented in the joint position sense. The recognition of surfaces in robots is an important addition to the robot's sensibility of the surroundings. Biologically inspired, a pressure force sensor has been integrated into a robotic arm to classify surfaces using feedback from the sensor only. And this is biologically inspired from the human mechanoreceptors. The sensor determines the surface deformation caused by the contact force with objects. Using the real-time feedback from the sensor, the robotic arm controller and experimental sampling methods were developed. An artificial neural network was used afterward to classify the surfaces and to evaluate the efficiency of experiments in terms of stability, reproducibility and results accuracy.
16.08.16	Using Natural Language Feedback in a Neuro-inspired Integrated Multimodal Robotic Architecture	Johannes Twiefel
In this talk we present a multi-modal human robot interaction architecture which is able to combine information coming from different sensory inputs, and can generate feedback for the user which helps to teach him/her implicitly how to interact with the robot. The system combines vision, speech and language with inference and feedback. The system environment consists of a Nao robot which has to learn objects situated on a table only by understanding absolute and relative object locations uttered by the user and afterwards points on a desired object to show what it has learned. The results of a user study and performance test show the usefulness of the feedback produced by the system and also justify the usage of the system in real-world applications, as its classification accuracy of multimodal input is around 80.8%. In the experiments, the system was able to detect inconsistent input coming from different sensory modules in all cases and could generate useful feedback for the user from this information.
09.08.16	Recognition of Transitive Actions with Hierarchical Neural Network Learning	Luiza Mici
The recognition of actions that involve the use of objects has remained a challenging task. We present a hierarchical self- organizing neural architecture for learning to recognize transitive actions from RGB-D videos. We process separately body poses extracted from depth map sequences and object features from RGB images. These cues are subsequently integrated to learn action-object mappings in a self- organized manner in order to overcome the visual ambiguities introduced by the processing of body postures alone. Experimental results on a dataset of daily actions show that the integration of action-object pairs significantly increases classification performance.
02.08.16	Neural Network-Based Control of a Humanoid Robot for Grasping Tasks	Niklas Widulle
A neural network based controller for a humanoid robot arm is designed. The basis of the controller is an inverse kinematics solution, it defines how the robot must move its joints in order to achieve a specified motion of the end effector. This mapping is learned by a multilayer perceptron. Two learning strategies are implemented, optimized, and compared: goal and motor babbling. Motor babbling relies on random motor commands while the more recent goal babbling approach uses goal directed motions. Each of these methods is employed in both a three-dimensional and a six-dimensional task space which includes the orientation of the end effector. Additionally, a solution for relative inverse kinematics is implemented and tested. These are finally tested and compared using the NICO robot.
26.07.16	The Effects of Regularization on Learning Facial Expressions with Convolutional Neural Networks Learning Multiple Timescales in Recurrent Neural Networks	Tobias Hinz, Tayfun Alpay
The Effects of Regularization on Learning Facial Expressions with Convolutional Neural Networks (Tobias Hinz) Convolutional neural networks (CNNs) have become effective instruments in facial expression recognition. Very good results can be achieved with deep CNNs possessing many layers and providing a good internal representation of the learned data. Due to the potentially high complexity of CNNs on the other hand they are prone to overfitting and as a result, regularization techniques are needed to improve the performance and minimize overfitting. However, it is not yet clear how these regularization techniques affect the learned representation of faces. In this paper we examine the effects of novel regularization techniques on the training and performance of CNNs and their learned features. We train a CNN using dropout, max pooling dropout, batch normalization and different combinations of these three. We show that a combination of these methods can have a big impact on the performance of a CNN, almost halving its validation error. A visualization technique is applied to the CNNs to highlight their activations for different inputs, illustrating a significant difference between a standard CNN and a regularized CNN. Learning Multiple Timescales in Recurrent Neural Networks (Tayfun Alpay) Recurrent Neural Networks (RNNs) are powerful architectures for sequence learning. Recent advances on the vanishing gradient problem have led to improved results and an increased research interest. Among recent proposals are architectural innovations that allow the emergence of multiple timescales during training. This paper explores a number of architectures for sequence generation and prediction tasks with long-term relationships. We compare the Simple Recurrent Network (SRN) and Long Short-Term Memory (LSTM) with the recently proposed Clockwork RNN (CWRNN), Structurally Constrained Recurrent Network (SCRN), and Recurrent Plausibility Network (RPN) with regard to their capabilities of learning multiple timescales. Our results show that partitioning hidden layers under distinct temporal constraints enables the learning of multiple timescales, which contributes to the understanding of the fundamental conditions that allow RNNs to self-organize to accurate temporal abstractions.
12.07.16	Ball Localization for Robocup Soccer using Convolutional Neural Networks	Daniel Speck
The latest change of rules for the RoboCup Soccer in the humanoid league allows the balls' surface to consist of up to 50% of any color or pattern, whereas at least 50% have to be white. This makes ball localization an even more important and challenging task, because multi-color balls have changing color histograms and patterns, depending on the balls' current orientation and movement. To handle this newly introduced difficulty this thesis proposes a neural architecture, a convolutional neural network (CNN), to localize the ball in various scenes. CNNs are capable of learning invariances in images and are used in several image recognition tasks. In this thesis, CNNs are designed to locate a ball by training two output layers, one for the x- and one for the y-coordinates. The CNNs get trained with normal distributions fitted around the ball center. This makes it possible for the network to not just locate the ball position but also to provide an estimation of the noise. The whole image is processed by the architecture in full size, no sliding-window approach is used. Afterwards, the CNNs output gets filtered by a recurrent neural network (RNN), which additionally tries to predict future positions of the ball.
28.06.16	Recent Results on Analyses and Learning in Convolutional Neural Networks (to be presented at IJCNN 2016)	Pablo Barros, Nima Mousavi
Learning Auditory Neural Representations for Emotion Recognition (Pablo Barros) Auditory emotion recognition has become a very important topic in recent years. However, still after the development of some architectures and frameworks, generalization is a big problem. Our model examines the capability of deep neural networks to learn specific features for different kinds of auditory emotion recognition: speech and music-based recognition. We propose the use of a cross-channel architecture to improve the generalization aspects of complex auditory recognition by the integration of previously learned knowledge of specific representation into a high-level auditory descriptor. We evaluate our models using the SAVEE dataset, the GTZAN dataset and the EmotiW corpus, and show comparable results with state-of-the-art approaches. Understanding How Deep Neural Networks Learn Face Expression (Nima Mousavi) Deep neural networks have been used successfully for several different computer vision-related tasks, including facial expression recognition. In spite of the good results, it is still not clear why these networks achieve such good recognition rates. One way to learn more about deep neural networks is to visualise and understand what they are learning, and to do so techniques such as deconvolution could play a significant role. In this paper, we train a Convolutional Neural Network (CNN) and Lateral Inhibition Pyramidal Neural Network (LIPNet) to learn facial expressions. Then, we use the deconvolution process to visualise the learned features of the CNN and we introduce a novel mechanism for visualising the internal representation of the LIPNet. We perform a series of experiments, training our networks with the Cohn-Kanade data set and show what kind of facial structures compose the learned emotion expression representation. Then, we use the trained networks to recognise images from the Jaffe data set and demonstrate that the learned representations are present in different face images, emphasizing the generalization aspects of these networks. We discuss the different representations that each network learns and how they differ from each other. We also discuss how each learned representation contributes to the recognition process and how they can be compared to the emotional notation Facial Action Coding System - Facs. Finally, we explain how the principles of invariance, redundancy and filtering, common for deep networks, contribute to the learned features and to the facial expression recognition task in general.
20.06.16	Doctoral Defense: Natural Language Acquisition in Recurrent Neural Architectures	Stefan Heinrich
Our understandings of the behavioural and mechanistic characteristics for natural language are still in its infancy and we need to bridge the gap between the insights from linguistics, neuroscience, and behavioural psychology. To contribute an understanding of the appropriate characteristics in a brain-inspired neural architecture that favour language acquisition, recurrent neural models have been developed for embodied and multi-modal language processing, embedded in a developmental robotics framework. In this disputation the main contributions from the study of these models are reported.
14.06.16	Multi-modal Integration of Dynamic Audiovisual Patterns for an Interactive Reinforcement Learning Scenario	Francisco Cruz
Robots in domestic environments are receiving more attention, especially in scenarios where they should interact with parent-like trainers for dynamically acquiring and refining knowledge. A prominent paradigm for dynamically learning new tasks has been reinforcement learning. However, due to excessive time needed for the learning process, a promising extension has been made by incorporating an external parent-like trainer into the learning cycle in order to scaffold and speed up the apprenticeship using advice about what actions should be performed for achieving a goal. In interactive reinforcement learning, different uni-modal control interfaces have been proposed that are often quite limited and do not take into account multiple sensor modalities. In this paper, we propose the integration of audiovisual patterns to provide advice to the agent using multi-modal information. In our approach, advice can be given using either speech, gestures, or a combination of both. We introduce a neural network-based approach to integrate multi-modal information from uni-modal modules based on their confidence. Results show that multi-modal integration leads to a better performance of interactive reinforcement learning with the robot being able to learn faster with greater rewards compared to uni-modal scenarios.
07.06.16	Knowledge Technology Students Spot Talks	Sibel Toprak, Daniel Kreischer Melanie Remmels, Tobias Hinz
31.05.16	Language-Modulated Safer Actions using Deep Neural Networks	Mohammad Ali Zamani
In the future, robots are expected to work as a companion with humans in various areas ranging from service robots to humanoid robots. Dynamic and unpredictable human/domestic environments force developers to improve safety for human-robot cooperation. One natural approach for humans is to warn about threats using natural spoken language. Then, robots should be able to modulate safer actions by a syntactic/semantic understanding of those warnings. In pour research deep neural networks will be used as the main learning approach of the natural language processing part. However, besides warning messages, other modalities seem necessary to gain a better understanding of threats such as prosody and vision. Generating safer actions depending on context can be performed by reinforcement learning or simply by choosing from an available action set. Moreover, possible tasks and scenarios as well as datasets and platforms will be discussed in this talk.
17.05.16	Human Robot Interaction by Verbal Dialogues with Inferring and Learning of Safety Concepts	Chandrakant Bothe
When robots interact with humans by verbal dialogs, then the question arises whether the robot could predict dangerous situations or actions through language processing. As the robot demand in the area of domestic service increases, the safety in human-robot interaction becomes crucial. It could be beneficial, that the robot could learn safe and unsafe actions, to prevent the dangers that can occur in an uncertain environment. This talk will focus on exploring or generating the corpora for learning safe and unsafe concepts, and I will propose the utilization of annotated or unannotated corpora in a natural language interface. I already used a rule-based verbal dialogue system, Yet Another Dialogue Tool Kit (YADTK), to build a verbal interface. The PhD research under the SECURE project will be focused on inferring and learning of safety concepts primarilybylanguage alone, and will be tested on the NAO humanoid robot platform. We will discuss different possible corpora and datasets, with possible language models and planned experiments.
10.05.16	Towards Effective Classification of Imbalanced Data with Convolutional Neural Networks	Vidwarth Raj
Class imbalance is a problem in machine learning often found in real world datasets where one of the class dominates entire dataset. Most machine learning algorithms fail in classification of such dataset if class to class separability is poor. This thesis aims at finding methods to improve performance on poorly separated imbalanced datasets using Cost Sensitive Neural Networks. The focus dataset which have these characteristics is from XING AG and the problem is to predict a user's job change in near future. Results show that despite being imbalanced and poorly separated, performance metrics such as G-Mean as high as 92.8\% could be reached by using cost sensitive Convolutional Neural Networks to find patterns in time series. This thesis also generalizes the methods used and results show that it can be applied to any imbalanced dataset.
04.05.16	Inertial Measurement Unit Based Multi User Gesture Recognition	Stephan Tietz
Pattern recognition is an active field of research in the artificial neural networks community. In this thesis an artificial neural network, the so called Echo State Network (ESN) will be utilized to recognize ten di?erent hand gestures recorded by the inertial measurement unit of a regular smartphone. There are many different approaches for gesture recognition, mostly consisting of a segmentation and a classification phase. We will give a short overview over the different approaches and then present our own framework, from data collection to the finished classifer. We show how to set up the environment of the ESN to allow for good classi?cation results. As we use real world data the samples are of different length and power. Additionally collected sensor values contain noise and bias. We will show that the nature of the ESN not only allows to deal with these problems, but also others the possibility to perform segmentation and classification at the same time. We achieved classification accuracies as good as comparable approaches that use complicated segmentation algorithms by utilizing only a single artificial neural network.
03.05.16	Doctoral Defense: Neurocomputational Mechanisms for Adaptive Self-Preservative Robot Behaviour	Nicolas Navarro
The field of neurocognitive robotics takes the processing mechanisms of the brain as inspiration and guidance: computer implementations of robot perception and action should be based on brain-like neural architectures and biologically plausible learning mechanisms. Unsupervised learning and reinforcement learning have led to good results on the emergence of internal sensory representations and intelligent reward-seeking behaviours, respectively. However, other aspects of animal behaviour are generally not considered, even though it has been argued that only a more comprehensive study of animal behaviour can lead to a deeper understanding of intelligent behaviour. This thesis does not attempt to provide a comprehensive model of animal behaviour, but rather tries to draw attention to the need for it by presenting the potential of neglected aspects of animal behaviour such as self-preservative behaviour. Self-preservative behaviours are believed to impose the ground rules for more complex and motivated behaviour. Although many of these innate responses are hard-coded in the brain, they are not sufficient for the organisms' survival. They have to adapt, by learning, to new and unexpected situations within their lifetime and thereby be able to interact effectively with their environment. A key component on the lifetime adaptation is the formation of associations/memories between environmental predictors and relevant events, which mainly rely on punishment and reward learning. We believe that a deeper understanding of innate and learned defensive mechanisms could also be helpful in developing future robot generations, making them more adaptable and robust. Therefore, in this thesis, we study and develop three neurocomputational self-preservative mechanisms in the context of humanoid service robots to demonstrate the potential and feasibility of including bio-inspired adaptive self-preservative mechanisms as part of real-world robotic systems. Our aim is to present possible ways in which robots can be endowed with such adaptive self-preservative mechanisms at different neurocognitive levels, going from abstract biological models to neurocomputational models. The first experiment addresses the problem of search for an appetitive stimulus. Here a reinforcement learning (SARSA) algorithm was optimized to learn in a real-world scenario and manoeuvre a humanoid robot towards a charging station. The second experiment focuses on the role of punishment and nociceptive sensory input in motor learning. Both types of feedback play an important role in driving attention, and modulating decision making and action. However, they have not been thoroughly studied in computational models. Here, we compared the effect of both types of feedback on an Actor-Critic learning algorithm (CACLA). Finally, in our last experiment, we studied the role of noxious stimuli in the formation of anticipatory behaviour. This experiment is based on Pavlovian and instrumental conditioning and how environmental cues can be used to anticipate negative outcomes. A hybrid approach using an echo state network (ESN) and a dopamine modulated Pavlovian conditioning model was used to anticipate nociceptive sensory input based on auditory cues. In all three experiments we showed how often neglected, self-preservative mechanisms could solve meaningful artificial intelligence problems while providing the basis for new neuro-inspired computational processes. In particular, we showed how bio-inspired sensorimotor signals associated with nociception and pain can be exploited for learning beyond triggering reactive behaviours. We also developed novel extensions to the learning algorithms used.
02.05.16	Embodied Affective Decision Making in Robots	Dr. Robert Lowe
The importance of the role of affect (e.g. drives, motivations, emotions) in decision making has been increasingly recognized by researchers in the fields of neuroscience and psychology in recent years. It has also been of contemporary interest to roboticists with a focus on issues concerning 'embodiment'. In this talk, I will present work carried out over several projects that focuses on affective mechanisms used to guide decision making in robots. This talk will consist of two parts covering past and recent research in the area of embodied affective decision making in robots. In the first part, drawing from examples of my own, and my PhD students' work, I will provide examples from evolutionary robotics and human-robot interaction as to how affective mechanisms can be exploited in robotics to produce adaptive behavior and decision making, i.e. that which is not the direct product of learning. In the second part, I will discuss recent work on tactile interaction between humans and robots. The ability to reliably convey and interpret emotional signals through touch (as a form of embodied affective interaction) provides an important source of information for appropriate social decision making. Recent results from a human-robot tactile interaction study will be presented that show how emotions can be expressed according to a number of different dimensions amenable to tactile sensing. Bio Dr. Robert Lowe is Assistant Professor (Docent) of Cognitive Science at the Interaction Lab at the University of Skövde, Sweden and also research leader of the newly formed ICE (Interaction, Cognition and Emotion) Lab at the University of Gothenburg, Sweden. Robert studied Psychology at the University of Reading, England and Computer Science at the University of Hertfordshire, England where he also did his PhD. He has worked on a number of European projects - FP6 ICEA: Integrating Cognition, Emotion and Autonomy; 2009-2013), Marie Curie ITN funded RobotDoC: Robotics for Development of Cognition, as co-PI (2010-2013), FP7 NeuralDynamics (2011-2015) contributing computational models of affective mechanisms for decision making and learning for use in robotic systems. He has served on the PC of, and chaired at, several international conferences and workshops, such as the forthcoming: Lorentz Centre Workshop on "Emotions as Feedback Signals" in Leiden 2016 and Alife 2016. He recently organized or co-organized workshops on emotion and robotics (IWINAC 2014, IROS 2015, AMSTA 2016) and is an editorial board member of Frontiers in Psychology and the International Journal of Advanced Robotics Systems (IJARS). Robert has published over 50 articles in peer-reviewed journals, conference or workshop proceedings. His main area of expertise is in developing neuropsychological models for use in socio-technological applications including evolutionary robotics and human-robot interaction and emotions theory and research.
19.04.16	Identifying Dangerous Situations from Acoustic Data	Egor Lakomkin
What if the robot could predict dangerous situations in the home environment? It could be very beneficial in various aspects: the robot could be aware of its behaviour and prevent potential harm to a person or it could signal if there is a threat to a person, which might save lives. In this talk I will present a research project called "Identifying Dangerous Situations from Acoustic Data using Prosody Features" as a part of the SECURE project. The goal is to teach a robot to identify threatening situations, for instance, when a person screams for help. My current PhD research will be focused on acoustic and speech processing by means of deep neural networks with exploiting unlabelled data and testing them on the iCub robotic platform. In this talk, the background of the project will be presented and the possible datasets and experiments will be discussed.
12.04.16	Player Modeling in Texas Hold'em Poker with Echo State Networks	Alexander Klassen
Games have always been traditionally a research area for artifcial intelligence. Through them, a platform is given to directly test computational capabilities against human cognitive abilities and knowledge in a certain domain. Poker is currently the most played card game and one of the few games left where professional human players play at a higher level then computer players. This is mainly because of the imperfect information character of this game. To handle this issue, several basic directions emerged in research on this topic over the last twenty years. One of them is the exploitive counter strategy. The aim of this strategy is to fill the missing information gap by incorporating a model of the opponent and moreover, to exploit his suboptimal game play with aid of this additional knowledge. Although Poker is undeniably a game where the decision for a certain action is heavily influenced by previous game events, all found publications so far used machine learning approaches, being suboptimal for temporal tasks, when modelling the opponent. In this thesis we introduce Echo State Networks for modelling a players' strategy in Texas Hold'em Poker. Echo State Networks are Recurrent Neural Networks following the Reservoir Computing paradigm. As such, they are not only able to detect patterns in temporal dependencies, but are also provided with effective and fast training procedures.
05.04.16	Improving the Post-Processing of Cloud-based Speech Recognition with Tri-Gram Language Models	Valentin Strauß
Automated speech recognition (ASR) deals with the processing of spoken audio-data into processable data. The hybrid ASR-system DOCKS uses Googles "Search by Voice" as cloud-based ASR-system to receive a text-hypotheses for an audio-input and Sphinx-4 as local ASR-system to improve the result with domain-specific language models and a restricted vocabulary. Currently a bi-gram language model is the highest-order n-gram model for the DOCKS system. In this thesis a tri-gram language model for the hybrid ASR-system DOCKS is presented and the performance of this approach was measured by evaluating it on different data-sets. The results of the experiments show, that the performance of the hybrid ASR-system is highly depended on how well the domain of the inputs is covered by the language-model. For a highly restricted domain with n-gram models that sufficiently cover it, the tri-gram language model post-processing approach presented in this thesis achieves the best performance compared to other local, cloud-based and hybrid ASR-systems.

Winter Semester 2015/2016

Winter Semester 2015/2016
29.03.16	Knowledge Technology Students Spot Talks	Stefan Thomas, Florian Letsch Anthony Kiggundu, Pascal Folleher
17.03.16	Doctoral Defense: Indoor Vision-based Robot Navigation: a Neural Probabilistic Approach	Wenjie Yan
We present a neural probabilistic robot localization and navigation model that provides a mere concept while enabling far-reaching functionality. Through emulating several basic functionalities of the brain the system is able to achieve complex tasks such as robust target tracking, environment learning through observation, and flexible robot navigation in a home-like environment. The concept of our work is implemented and evaluated using a robot platform in a home-like environment, and the results show that our neural system helps a robot to realize different functions successfully.
08.03.16	Self-Organization of Temporal Dynamics in Recurrent Neural Networks	Tayfun Alpay
Gradient-based learning in Recurrent Neural Networks (RNNs) is difficult due to the vanishing gradient problem. This problem arises when RNNs have to be unfolded in time, making them too deep for efficient backpropagation. However, this temporal depth is required to effectively capture long-term dependencies in sequences. Multiple solutions have recently been proposed that show promising improvements and a better understanding of the issues at hand. Along these proposals are RNN models that allow the emergence of multiple timescales during training. To contribute to the understanding of the fundamental conditions that allow RNNs to self-organize to rich temporal dynamics, a number of these models are explored on sequence generation and prediction tasks. We compare the Simple Recurrent Network (SRN) with the recently proposed Clockwork RNN (CWRNN), Structurally Constrained Recurrent Network (SCRN), and the Recurrent Plausibility Network (RPN). The timescaling mechanisms of the SCRN and the RPN are adapted to the CWRNN. The widely used Long Short-Term Memory (LSTM), which is specifically designed to deal with vanishing gradients, is included as well.
01.03.16	Analysis of modulated spiking neural network	Markus Soll
In his Bachelor thesis Stefan Bruhns has created a new network type called 'modulated spiking neural network'. This thesis expands his work by analysing the performance of the new network on different tasks and comparing them to different other network types.
16.02.16	Emergence of Multimodal Action Representations from Neural Self-Organization	German I. Parisi
The integration of multisensory information plays a crucial role in autonomous robotics. In this talk, I will vividly argue that robust multimodal representations can naturally develop in a self-organized manner from co-occurring multisensory inputs. In particular, I propose a hierarchical learning architecture with growing self-organizing neural networks for learning human actions from audio-visual inputs. The hierarchical processing of visual inputs allows to obtain progressively specialized neurons encoding latent spatiotemporal dynamics of the input, consistent with neurophysiological evidence for increasingly large temporal receptive windows in the human cortex. Associative links to bind unimodal representations are incrementally learned by a semi-supervised algorithm with bidirectional recursive connectivity. Multimodal representations of actions are obtained using the co-activation of action features from video sequences and labels from automatic speech recognition.
02.02.16	Knowledge Technology Students Spot Talks	Thomas Klinger, Vidwarth Raj Stephan Tietz, Benjamin Scholz, Niklas Widulle, Daniel Speck
19.01.16	Crowdsourced Data Generation - How to Design a Website for Citizen Scientists?	Carina Garber
Along with the ongoing progress in robotics, human robot interaction becomes increasingly important. The EchoRob project aims at natural language acquisition in robots, based on an artificial neural network. It therefore needs a huge amount of language data from the context of human robot interaction in form of texts, speech and annotations. To obtain such data is challenging and time consuming. That is why in this thesis I introduce a design approach for a crowdsourcing website that enables laypersons to help generating that data on a website. I not only suggest the design for a specific set of tasks that should lead to high quality data but also a gamification system that provides important feedback to the users and keeps them motivated during the long run. My design suggestions are based on findings from motivational psychology, research on other citizen science projects and a PACT analysis of the planned website.
12.01.16	Optimal Stable Marriage Mapping with Particle Swarm Optimization	Tobias Tilly
The thesis deals with the Stable Marriage Problem (SMP) by reference to the Student Topic Assignment (STA) at Universities. In the SMP one tries to achieve stable mappings between two groups which have priorities for each other. A valid mapping is accomplished, if a bijection 1:1 between these groups is reached. While there is already an algorithm for solving the SMP by Gale and Shapley there is no way to get the best, but still invalid solution for the problem, if the SMP is not solvable. The Particle Swarm Optimization (PSO), originally proposed by Kennedy, Eberhart and Shi, is a swarm-intelligence-based algorithm. It is utilized to solve optimization problems like the STA. The peculiarity in the student to topic assignment is that it seems to be an instance of the Stable Marriage Problem, but rather is in the class of Assignment Problems (AP). Discussing this controversy and solving the STA, this thesis will show four PSO variations developed for this purpose.
15.12.15	Deep Neural Representation Analysis through Hierarchical Features Visualization	Nima Mousavi
We used visualisations to analyse CNNs trained on facial expression recognition by visualising neurons or feature maps. Through the visualisations we are able to understand the decision-making process and modify the databases and the architecture as needed. As a consequence, we learned that more sophisticated arrangements of classes allows filter maps to learn more differentiated information about the domain. Training CNNs on top of these data representations will improve classification results for all other classification tasks in the domain. We conduct multiple experiments directed at different parts of the architecture, through which we illuminate the mechanisms of CNNs and in this way facilitate the understanding of them. Furthermore, analysis through visualisations allows us to detect deficiencies of the architecture apart from the commonly known phenomenon of ``overfitting'', which might be the first thought that comes to mind, especially in case of generalisation to other databases. These include the suitability of the distribution of images throughout the classes, which can not be easily detected.
01.12.15	Hierarchical Reinforcement Learning using Neural Networks	Sohaib Younis
Hierarchical reinforcement learning solves complex reinforcement learning tasks by dividing them into subtasks and solving them individually in order to solve the whole task. This thesis presents some approaches that try to solve a complex reinforcement learning problem hierarchically using biologically inspired neural networks. The currently available hierarchical reinforcement algorithms are able to solve complex tasks hierarchically once the hierarchy has been defined by the programmer but generally are unable to find any hierarchy in the tasks themselves. The goal of this thesis, in addition to solving a complex learning task, is to discover hierarchy in the task. The thesis then presents the architectures for each approach and how they performed. The results of these approaches show that overall they were able to find some structure, although not hierarchical, and solve the defined task hierarchically.
24.11.15	Gesture Recognition with a Convolutional Long Short Term Memory Recurrent Neural Network	Eleni Tsironi
Motivated by the efficiency of Convolutional Neural Networks in implicit feature extraction and the ability of Long Short-Term Memory Recurrent Neural Networks in learning long-range time dependencies, this thesis proposes a Convolutional Long Short Term Memory Recurrent Neural Network (CNNLSTM) for the task of gesture recognition. The proposed system requires as input a stream of differential images, each of them representing the extracted body motion. Each of the differential images is processed by the proposed CNNLSTM architecture, in order to be assigned a gesture label. CNNLSTM consists of two consecutive Convolutional layers, the flattened output of which goes through a Long Short-Term Memory Recurrent layer. The output of the recurrent layer is fed to a softmax output layer. CNNLSTM is a deep recurrent neural network model and therefore can be trained with Backpropagation Through Time (BPTT). The performance of the system is tested on three different datasets and proves significantly higher than the performance of a baseline feedforward Convolutional Neural Network. CNNLSTM also exhibits common learning behaviour in all the tested datasets independently of the difficulty of each dataset.
10.11.15	Using online gesture learning with Echo State Networks to control an iCub robot head	Konstantin Kobs
Gestures are one of the main communication channels in Human-Robot Interaction (HRI). Because every individual performs gestures differently, personal robots need to be able to adapt to various persons and environments, while efficiently managing their computational resources. In this thesis, we present an approach to realize gesture recognition with Echo State Networks in an online learning scenario. Echo State Networks are a special type of Recurrent Neural Networks, but have the advantage of a more power effective training. We introduce an efficient preprocessing method to prepare image sequences for the network, which is then trained by using a mini-batch gradient descent algorithm to change the network's outputs over time. The recognized gestures are then used to control an iCub robot head for HRI. We show that Echo State Networks can achieve a good performance on the given test data and reach a decent level of adaption in a live scenario, where new unseen data is presented to the robot by a new subject.
27.10.15	Doctoral Defense: One Computer Scientist's (Deep) Superior Colliculus. Modeling, understanding, and learning from a multisensory midbrain structure	Johannes Bauer
The superior colliculus is a mid-brain region which integrates sensory input to localize mul-tisensory stimuli. This thesis aims to close the gap between models of SC physiology and system-level behavior, and development thereof. A new model of the SC based on self-organized statistical learning is proposed and it is shown to replicate a variety of important biological phenomena. It is then applied to a problem in robotics: binaural sound-source localization. We show that our algorithm can learn to per-form state-of-the-art sound-source localization.
27.10.15	Emotional Expression Recognition with a Cross-Channel Convolutional Neural Networks for Human-Robot Interaction	Pablo Barros
The study of emotions has attracted considerable attention in several areas, from artificial intelligence and psychology to neuroscience. The use of emotions in decision-making processes is just one example of how multi-disciplinary the research in emotions are. In my work I assume that communication in a Human-Robot interaction scenario, for instance with gestures, could benefit from additional human emotion recognition. In this presentation, I will show a deep neural network model which is able to recognize spontaneous emotional expressions and to classify them into 3 categories: positive, negative or neutral. The model was evaluated in two experiments, one using benchmark datasets, and the other using an HRI scenario with a humanoid robotic head, which itself gives emotional feedback.
20.10.15	Tutorial on hyperopt: Python library for parameter optimization	Xavier Hinaut
You already have tweaked some (hyper-)parameters to improve the performance of your algorithm? Then this tutorial is for you: "\|hyperopt\| is a Python library for optimizing over awkward search spaces with real-valued, discrete, and conditional dimensions." It enables you to make your hyper-parameter search more automatic and reproducible than if you just do by "trial and error". The library comes with a lot of useful functionalities that you do not have to implement yourself (e.g. Random or Tree of Parzen Estimators (TPE) search algorithms). In particular, you can define your search space on a very fine-grain level of details: continuous, discrete, or logarithmic, and you can have conditional parameters (some may exist or not depending on other parameters), which means you can search over different type of search spaces at the same time. This tutorial will take a look on the basics and trying to get hints on the advanced parts. Follow the link hyperopt for further information. For any questions left, join us and make the tutorial a dynamic exchange for all of us. If you want to know other existing sub-librairies (ConvNets, scikit-learn, etc), click here

Summer Semester 2015

Summer Semester 2015
25.09.15	Knowledge Technology Students Spot Talks	Tayfun Alpay, Markus Soll, Alexander Klassen
18.09.15	Object recognition using point cloud-based neural networks	Marcelo Borghetti
It is known from studies in the monkey ventral pathway that two-dimensional pattern representations are processed in areas such as V2, V4 and IT cortex. However, the role played by the brain to process three-dimensional shape is unclear. Some recent studies suggest that IT neurons encode also three-dimensional spatial configurations, useful for understanding objects, for example. This hypothesis is consistent with classical theories about shape representation based on the concept of Geons. In this work we will presented a neural network designed to extract three-dimensional features from point cloud data for object classification. At the same time, the importance of multiple view points in the recognition, largely studied in psychology and confirmed by many computational studies, is taken into account. Experiments using the Washington Dataset composed of 51 categories and 300 instances of objects shows an improvement using three-dimensional features and motivates further investigation.
28.08.15	Design and Control of Biologically Inspired Joints	Nils Rokita
As robots have to do more and more human tasks, they need more human-like moving capabilities. In this thesis, a biologically-inspired joint operating with tendons is built. The controlling for such joints is examined on this example. It is tried to control the joint with learning algorithms like goal babbling and neural networks. On the one hand, tendons have some advantages over motors directly in the joint as they are space-saving when applied to fingers or can protect the motor from the forces if the robot falls down. On the other hand, controlling a joint operated by tendons is much harder than a simple servo controlled by a PID-Controller.
07.08.15	How to Recognize Spontaneous Emotions?	Pablo Barros
The use of emotional states for Human-Robot-Interaction (HRI) scenarios has attracted considerable attention in recent years. One of the most challenging task is to recognize the spontaneous expression of emotions. Every person has a different way to express emotions, and this is aggravated by the complexity of interaction with different subjects, multi-cue information and different environments. How to deal with the particularities of each person, each expression and each cue? In this work, I will show how our convolutional neural network based architecture learns features that can represent these particularities, and which are the advantages of using them to recognize spontaneous emotions.
24.07.15	Modeling Color Vision with Coding Strategies of Retinal Ganglion Cells	Daniel von Poschinger-Camphausen
The main motivation of this work is to gain insight into improving the reliability of color information by following the coding strategies of the lower stages of the human visual system. In comparison to human perception, the color information of a camera is unreliable as it is greatly in uenced by the luminosity and color of the illumination of a scene. Established unsupervised computational models of self-organizing RGC and V1 receptive fields are extended to process RGB images and are trained upon images of natural scenes. The results show that localized color opponent RF are emerging in chromatically distinct channels, each entirely covering the visual space. The opponent texture of the resulting RF is filtering more biologically plausible chromatic contrasts, which are regarded as the fundamental building blocks of the human visual system's property of color constancy. Because of the unmatched performance the visual system produces, filters capturing more biological plausible chromatic contrasts appear to be of value and are suggesting that computational models utilizing such preprocessing possibly show an improved performance.
03.07.15	Learning Human Motion Feedback with Neural Self-Organization	German I. Parisi
The correct execution of well-defined movements in sport disciplines may increase the body's mechanical efficiency and reduce the risk of injury. While there exists an extensive number of learning-based approaches for the recognition of human actions, the task of computing and providing feedback for correcting inaccurate movements has received significantly less attention in the literature. We present a learning system for automatically providing feedback on a set of learned movements captured with a depth sensor. The proposed system provides visual assistance to the person performing an exercise by displaying real-time feedback to correct possible inaccurate postures and motion. The learning architecture uses recursive neural network self-organization extended for predicting the correct continuation of the training movements. We introduce three mechanisms for computing feedback on the correctness of overall movement and individual body joints. For evaluation purposes, we collected a data set with 17 athletes performing 3 powerlifting exercises. Our results show promising system performance for the detection of mistakes in movements on this data set.
03.07.15	Reinforcement learning with interactive feedback using speech guidance	Francisco Cruz
Recently robots are being used more frequently as assistants in domestic scenarios. In this context we train an apprentice robot to perform a cleaning task using interactive reinforcement learning since it has been shown to be an efficient learning approach benefiting from human expertise for performing domestic tasks. The robotic agent obtains interactive feedback via a speech recognition system which is tested to work with five different microphones concerning their polar patterns and distance to the teacher to recognize sentences in different instruction classes. Moreover, the reinforcement learning approach uses situated affordances to allow the robot to complete the cleaning task in every episode anticipating when chosen actions are possible to be performed. Situated affordances and interaction allow to improve the convergence speed of reinforcement learning, and the results also show that the system is robust against wrong instructions that result from errors of the speech recognition system.
26.06.15	Knowledge Technology Students Spot Talks	P. Folleher, C. Garber V. Strauß, T. Tilly
The Knowledge Technology group is a highly international and interdisciplinary group with research topics on methods in Neural Networks, Hybrid Systems in the fields of Cognitive Robotics, Vision and Language Processing. To increase synergy effects with our students and providing a platform for research discussions and theses preparations, we established the Knowledge Technology student spottalks. The student is supposed to motivate the topic, formulate the research questions(s) addressed in the thesis, and sketch the methodology in 5 slides. Feedback by the group will be given. The talk must not exceed 10 minutes in order to leave time for feedback and discussion. Speaker Carina Garber - Crowdsourced Data Generation: How to Design a Website for Citizen Scientists? Valentin Strauß - Improving the Post-Processing of Cloud-based Speech Recognition with Tri-Gram Language Models Tobias Tilly - Optimal Stable Marriage Mapping with PSO Pascal Folleher - Forecasting Image Quality on a Moving Humanoid Robot for Variable Image Sampling
19.06.15	Prediction of merchant ship fuel consumption with neural networks	Ahmed Saleh
This thesis focuses on discovering the effect of the ship's surrounding environmental conditions on the speed and fuel consumption, which contribute to the final cost of a planned voyage. This thesis is supported by Skysails GmbH, the world famous developer of the SkySails kite propulsion system. Skysails produces and distributes a marine performance management tool that operates in many cargo ships. Therefore, a huge amount of ambient data have been acquired from different vessels that operate on harsh offshore environment. In this thesis, we are investigating whether the regression methods (neural networks, support vector regression) are able to derive fuel consumption parameters for a future voyage of travelling ship or more generally learning a true function from ambient data.
05.06.15	Gesture Recognition and Semantic Parsing with Echo State Networks	Johannes Twiefel, Doreen Jirak
Reservoir Computing (RC) is an attractive training mechanism for recurrent neural networks due to its ease and has therefore gained popularity in the neural network community. In this talk, we will present our approaches to RC addressing the Echo State Networks applied to gesture recognition and semantic parsing. We will present our experimental set ups and show recent results from our research.
29.05.15	Neural-network-based Action Object Semantics for Assistive Robotics	Luiza Mici
In this talk I will briefly present my background and motivation, preliminary work done recently and the outline of my research project. The ongoing development of robotics and the growth in number of elderly individuals have led to an increasing interest on applying assistive robotics in order to maintain or even improve the quality of care. At the hospital, in elderly care centres and domestic environments a patient can be assisted by making sure he/she performs essential daily activities such as eating, drinking and taking medications. Therefore, thorough understanding of the human-object interaction activities is the most important building block of an effective assistive framework. Our focus is on domestic scenarios in which human activities involve humans and objects of common use. The recognition of human-object interaction activities requires the analysis of the objects present in the scene, the analysis of human actions, and finally the analysis of interplay between objects and human actions. Therefore, our goal is the implementation and evaluation of a neural-network-based framework, able to learn, represent and integrate two different perception tasks: object recognition and human action recognition.
28.06.15	Multi-sensory neural reinforcement learning using the options framework	Dr. Victor Uc-Cetina
In reinforcement learning, the options framework allows us to implement intelligent agents capable to learn temporal abstractions. Such framework is based on the theory of semi-Markov decision processes. In this talk, I will give an introduction to the options framework and I will discuss some research lines that can be followed to investigate the design of intelligent agents capable to perform both: multi-sensory integration and temporal abstractions learning by means of options. Short Bio Victor Uc-Cetina received a PhD degree (Dr. rer. nat.) in Computer Science from Humboldt-Universität zu Berlin, Germany, in 2009, a M.Sc. in Intelligent Systems from the Monterrey Institute of Technology and Higher Education, Mexico, in 2002, and a B.S. in Computer Systems Engineering from the Merida Institute of Technology, Mexico, in 1999. He is currently a professor in the Faculty of Mathematics at the Autonomous University of Yucatan, in Mexico. His research interests are mainly focused on artificial intelligence and machine learning methods.
15.05.15	WTM Students Spot Talks	E.Tsironi, J.Yalpi, K.Kobs N.Mousavi, S. Younis
28.04.15	Recurrent Neural Network-based Probabilistic Language Model	Sathyanarayanan Kuppusami
Statistical n-gram language models are widely used for their state of the art performance in a continuous speech recognition system. In a domain based scenario, the sequences vary at large for expressing same context by the speakers. But, holding all possible sequences in training corpora for estimating n-gram probabilities is practically difficult. Capturing long distance dependencies from a sequence is an important feature in language models that can provide non-zero probability for a sparse sequence during recognition. A simpler back-off n-gram model has a problem of estimating the probabilities for sparse data, if the size of n-gram increases. Also deducing knowledge from training patterns can help the language models to generalize on an unknown sequence or word by its linguistic properties like noun, singular or plural, novel position in a sentence. For a weaker generalization, n-gram model needs huge sizes of corpus for training. A simple recurrent neural network based language model approach is proposed here to efficiently overcome the above difficulties for domain based corpora. A neural probabilistic language model is directly integrated with the cloud based speech recognition system for its real time usage rather than the common two-pass n-best list rescoring method. Probabilistic n-gram and neural network language models are evaluated on the TIMIT Core Test and Scripted Robot Command corpus for their overall performance based on word error rate and precision. The experiments show that both the language models perform equally well with no significant difference in word error rate, but recurrent neural network model exhibited generalization and capturing long term dependencies slightly better than n-gram model for the corpora.
21.04.15	Modulated Spiking-Neurons	Stefan Bruhns
A way of neuromodulation of spiking neurons and its impact to artificial neural networks is shown. In this context the difference to networks with spiking neurons without neuromodulation and known GasNet model are taken into account. A Shape-Discrimination-Task is used for comparison. In that task a robot should recognize two different objects based on its shape. Neuromodulation offers more ways to map dynamics within a neural network.
17.04.15	Doctoral Defense: Artificial Neural Models for Feedback Pathways for Sensorimotor Integration	Junpei Zhong
The brain comprises hierarchical modules on various physiological levels. Neural feedback signals modulate the neural activities via inhibitory or excitatory connections within/between these levels. They have predictive and filtering functions on the neuronal population coding of the bottom-up sensory-driven signals in the perception-action system. In this thesis, we propose that the predictive role of the feedback pathways at most levels of action and perception can be modelled by the recurrent connections in different artificial cognitive platforms. This will be examined by three recurrent neural network models. Furthermore, the three models and experiments with them show that the recurrent neural networks are able to model feedback pathways and to exhibit the feedback-related sensorimotor predictive functions.
16.04.15	Developmental Robotics for Embodied Language Learning	Prof. Angelo Cangelosi
Growing theoretical and experimental research on action and language processing and on number learning and space representation clearly demonstrates the role of embodiment in cognition and language processing. In psychology and neuroscience this evidence constitutes the basis of embodied cognition, also known as grounded cognition (Pezzulo et al. 2012). In robotics, these studies have important implications for the design of linguistic capabilities in cognitive agents and robots for human-robot communication, and have led to the new interdisciplinary approach of Developmental Robotics (Cangelosi & Schlesinger 2015). During the talk we will present examples of developmental robotics models and experimental results from iCub experiments on the embodiment biases in early word acquisition studies, on word order cues for lexical development and number and space interaction effects. The presentation will also discuss the implications for the "symbol grounding problem" (Cangelosi, 2012) and how embodied robots can help addressing the issue of embodied cognition and the grounding of symbol manipulation use on sensorimotor intelligence. References Cangelosi A. (2012). Solutions and open challenges for the symbol grounding problem. International Journal of Signs and Semiotic Systems, 1(1), 49-54 (with commentaries) Cangelosi A, Schlesinger M (2015). Developmental Robotics: From Babies to Robots. Cambridge, MA: MIT Press. Pezzulo G., Barsalou L.W., Cangelosi A., Fischer M.H., McRae K, Spivey M.J. (2011). The mechanics of embodiment: a dialog on embodiment and computational modelling. Frontiers in Psychology, 2(5), 1-21 Biography Angelo Cangelosi is Professor of Artificial Intelligence and Cognition and the Director of the Centre for Robotics and Neural Systems at Plymouth University (UK). Cangelosi studied psychology and cognitive science at the Universities of Rome La Sapienza and at the University of Genoa, and has been visiting scholar at the University of California San Diego and the University of Southampton. Cangelosi's main research expertise is on language and cognitive modelling in humanoid robots, on language evolution in multi-agent systems, and the application of bio-inspired techniques to robot control (e.g. swarm of UAVs). He is the coordinator of the Marie Curie ITN "RobotDoC: Robotics for Development of Cognition" (2009-2014) and the UK EPSRC project "BABEL: Bio-inspired Architecture for Brain Embodied Language" (2012-2016), and of the FP7 project "ITALK" completed in 2012. Cangelosi has produced more than 200 scientific publications, and has chaired numerous workshops and conferences including the IEEE ICDL-EpiRob 2011 Conference (Frankfurt, August 2011). In 2012-13 he was Chair of the International IEEE Technical Committee on Autonomous Mental Development. He will be Editor-in-Chief of the IEEE Transactions from Autonomous Development from January 2015, and currently is Editor of the journal Interaction Studies. He is co-author, with Matt Schlesinger, of the forthcoming book "Developmental Robotics: From Babies to Robots" (MIT Press, 2015).
02.04.15	Structural interference in approach and avoidance behaviors by a neurally controlled robot	Jonas Sperling
Abstract: Humans learn to adapt to high complex, high dimensional real world problems. The separation of neural circuits into modules, which act on disjunct functions, enables the human to adapt. Dopamine in the basal ganglia encodes differences of expected and experienced reward. External stimuli do not only occur in positive matter, but also appear as negative signals. The amygdala represents strong negative external stimuli as fear. The nuclei complex adapts fast to fear experiences. Over time, discrimination takes place and the fear signals are understood more precisely. There are two types of dopamine releases that reinforce learning. Phasic mid brain dopamine fires on positive external stimuli. Tonic dopmaine is suppressed on negative external stimuli. The phasic activation, from positive stimuli, is unproportionally higher than the suppression of tonic dopamine. As a result a "dopamine-ramp" occurs. The ramp disadvantages learning on negative stimuli in the basal ganglia. The abstraction of dopamine as temporal differential (TD) error is used to teach an agent to navigate towards a state with reward. As well as for reward in navigation, artificial neural networks can be used to map input signal to states with fear. The artificial neural network maps input signals to Q-values, which encode volitional behavior as an answer of the input signals. Contrary external stimuli, reinforced on one neural structure, cause interference. An artificial multi layer perceptron (MLP) shows no behavior adaptation to solutions that are reinforced with contrary signals. An agent, controlled by a single structure MLP, is not capable to learn the mapping of input signals to actions. The positive TD error for reward contradicts the fear reinforcement. As a result of this interference, the agent does not learn the non-linear approximation of the input-ouput mapping. As a remedy to this problem, two separate neural circuits can have distinct weight changes via TD error for pleasant reward and TD error for fear. Each neural circuit of the separated MLP has half the amount of neurons in the hidden layer. The separation into two neural circuits, show disjunct stepwise optimization towards fear and reward learning. The disjoined structure shows adaptation to conjugate external stimuli. In practice, the discretization of neural structures into disjunct functions, is not proportional to the complexity of the minimizing error function. Contrary action selections of the agent extend exploration.

Winter Semester 2014/2015

Winter Semester 2014/2015
31.03.15	Development of a Multimodal Android Interface for ROS-based Robots	Oliver Degener
Speech-based human-robot communication is natural and efficient but also very prone to error. Therefore, the purpose of this bachelor thesis was the development of a multimodal Android interface for ROS-based robots. The application connects to a ROS master and awaits closed-ended question requests. Users can either speak to the robot or use the touchscreen to answer these questions. To address the issue of poor status visualisation in existing interfaces the application utilises an avatar, animated graphics and icons. User tests have shown that the provided solution is capable of delivering resilient communication in common scenarios. The results of the speech recognition have been improved by the post-processing software DOCKS. Errors in the speech recognition were handled gracefully by the touchscreen.
20.03.15	Interaction is more beneficial in complex reinforcement learning problems than in simple ones	Chris Stahlhut
Giving interactive feedback, other than well done / badly done alone, can speed up reinforcement learning. However, the amount of feedback needed to improve the learning speed and performance has not been thoroughly investigated. To narrow this gap, we study the effects of one type of interaction: we allow the learner to ask a teacher whether the last performed action was good or not and if not, the learner can undo that action and choose another one; hence the learner avoids bad action sequences. This allows the interactive learner to reduce the overall number of steps necessary to reach its goal and learn faster than a non-interactive learner. Our results show that while interaction does not increase the learning speed in a simple task with 1 degree of freedom, it does speed up learning significantly in more complex tasks with 2 or 3 degrees of freedom.
19.03.15	Mechanical Design of the Arms and Neural Arm Control for the Humanoid Robot Platform Nimbro-OP	Atif Mahboob
Nowadays, many humanoid teen sized robot platforms have been developed by different research groups. The idea is either to fulfill the research purposes or to produce a specific task fulfilling machine. This imposes many challenges on the design of algorithms for different actions like walking or reaching some target. There are many sophisticated humanoid research platforms available, but one crucial thing to look at is the developmental cost associated with them. As the name describes, humanoid robots are the robots that resemble humans in their design as well as their way of performance. This thesis aims at contributing to the arm design for humanoid robots and to the learning to approaching a target. We used 3D plastic printing for part manufacturing. An arm with multi degrees of freedom is designed to enable the robot free movement around the body. This thesis also contributes in the field of designing a simulator model of a robot. The research carried out in this thesis takes the early learning in infants as the basis. We used this idea to develop a learning algorithm that eventually enable the robot to reach goals in 3D space with accuracy.
12.03.15	Language Acquisition using Echo State Networks applied to the iCub-robot	Xavier Hinaut
Primates can learn complex sequences that can be represented in the form of categories, and even more complex hierarchical structures such as language. The prefrontal cortex is involved in the representation of the context of such sequential structures, and the striatum permits the association of this context with appropriate behaviours. In this experiment, the Reservoir Computing paradigm is used: a recurrent neural network with fixed random connections models the prefrontal cortex, and the read-out layer (i.e. output layer) models the striatum. The model was applied to language syntactic processing, especially thematic role assignment: given a sentence this corresponds to answer the question "Who did what to whom?". The model processes categories (i.e. abstractions) of sentences which are called "grammatical constructions". The model is able to (1) process correctly the majority of the grammatical constructions that were not learned, demonstrating generalization capabilities, and (2) to make online predictions while processing a grammatical construction. Moreover, we observed that when the model processes less frequent constructions an important shift in output predictions occurs. It is proposed that a significant modification of predictions in a short period of time is responsible for generating Evoked-Related Potentials (ERP) like the P600: this typically occurs when unusual sentences are processed. The use of the model for complex abstract sequence processing shows that the faculty of representation and learning of sequences in the brain may be based on highly recurrent connections. This experiment suggests that artificial recurrent neural networks can provide insight into the underlying mechanisms of human cortico-striatal function in sentence processing. Finally, to show the ability of the model to deal with a real-world application, the model was successfully applied in the framework of human-robot interaction for both sentence comprehension and production. The sentence production model was obtained by "reversing" the sentence comprehension model: it processes meanings as input and produces sequence of words in output. More recently, an incremental version of the model with Hebbian-like learning has also been developed, demonstrating the biological plausibility of the underlying mechanisms. This incremental version could be used in a developmental learning scheme for robots.
10.03.15	Timescale-Independent Pattern Detection with Echo State Networks	Timur Olzhabaev
Certain common problems in pattern detection may be regarded as timescale re- lated, e.g. detection of patterns of varying lengths, slightly (time-)warped variations of same patterns and short-scale variations of pattern values, such as noise. This thesis investigates those problems in the context of pattern detection with 'Echo State Networks' (ESNs), a reservoir-based type of recurrent neural networks. It aims to look at ESN behaviour and to shed light on ESN parameter in uence on good performance when confronted with those problems. We conduct experiments on synthetic data and optimize different ESN models using evolutionary optimiza- tion in the context of timescale dependent problems. We then analyse behaviour and parameter distributions.
04.03.15	Determining Optimal Feature-Vector Length for Text Categorization Approaches	Marco Jendryczko
The thesis analyses different supervised text categorization methods to find optimal feature vector sizes for given application scenarios. We use feature selection algorithms to reduce the dimensionality of the input data and supervised classifiers for the actual categorization task. Several metrics are used to evaluate and to compare the different approaches. Based on the results suggestions are given for the applications scenarios. As a basis the vector space model is used. Feature selection is done by Chi-Square, information gain and oneR. Each of them is combined with the classifiers, we used supervised self organizing maps (SOM), decision trees and the support vector machine (SVM). The result suggest that self organizing maps with information gain are more suitable for time-constrained applications, where information gain in combination with support vector machine is better for best performance with no time constraint. In general, SVMs are performing very well and decision trees are very efficient in classification. This thesis also gives a guideline to analyse performance with different feature vector sizes for future work.
10.02.15	Interactive Reinforcement Learning through Speech Guidance in a Domestic Scenario	Francisco Cruz
In the recent past robots are being used more frequently as assistants in domestic scenarios. In this context we train an apprentice robot to perform a cleaning task using interactive reinforcement learning since it has been shown to be an efficient learning approach benefiting from human expertise on performing domestic tasks. The robotic agent obtains interactive feedback via a speech recognition system which is tested to work with five different microphones concerning their polar patterns and distance to the teacher to recognize sentences in different instruction classes. Moreover, the reinforcement learning approach uses situated affordances to allow the robot to complete the cleaning task in every episode anticipating when chosen actions are possible to be performed. Situated affordances and interaction allow to improve the convergence speed of reinforcement learning, and the results also show that the system is robust against wrong instructions that result from errors of the speech recognition system.
20.01.15	Intelligent Pathfinding of Humanoid Robots on the Example of RoboCup	Martin Poppinga
Classic pathfinding algorithms face different problems for proper navigation in a dynamically changing environment like in the RoboCup humanoid Soccer league. Working with uncertain and noisy data and missing an effective localization, an intelligent approach was tested. Classic potential fields and a evolutionary trained CTRNN were combined for an improved navigation which includes finding an effective path to the goal, the correct orientation of the robot and obstacle avoidance. This combination was successfully tested in a simplified 2D Simulation.
13.01.15	Learning Large-Scale Unsupervised Motion Features	Dennis Hamester
Unsupervised learning of features or representations has been shown to yield highly competitive results in comparison to hand-crafted features. This can be observed for an increasing number of Machine Learning tasks, most prominently object recognition. However, for related tasks like action recognition in videos, hand-crafted features are still the dominant method of choice. In this talk, I will give an overview of recent research and Neural Network models for unsupervised motion features. Most of these model are extensions of either Restricted Boltzmann Machines or Autoencoders and incorporate higher-order multiplicative interactions. The occurrence of multiplicative units is very common and their necessity will be motivated. Higher-order interactions on the other hand require many additional model parameters and generally result in more difficult training. As an alternative, I will introduce a model that avoids complex higher-order interactions and is much easier to train. This model works by detecting synchrony between temporally separated features using just multiplicative units. Recently, I have successfully generalized this model to arbitrarily sized videos using a weight-sharing / convolutional approach. The presentation will conclude with some open questions regarding the application and design of experiments. Current ideas for future research directions will be given as well.
16.12.14	Biologically Inspired Localization On Humanoid Soccer Playing Robots	Marc Bestmann
To enable the localization in the RoboCup Soccer context, the algorithm Rat- SLAM, which does simultaneous localization and mapping, was ported into an existing RoboCup framework. The algorithm was extended to use the data from the existing image processing and to improve thereby the position matching. The use of the distances to the goal posts and of the lines in relation to the robot were examined. It was shown that these additional data can improve the matching of the RatSLAM algorithm.
09.12.14	Learning objects from RGB-D sensors using point cloud based neural networks	Marcelo Borghetti
The efficient recognition of parts of the environment is relevant for a variety of scenarios ranging from lower-level grasping and manipulation to higher-level robotic assistance. In this presentation we will show an approach for scene understanding for assistive robotics based on learning to recognize different objects from RGB-D devices. Using the depth information it is possible to compute descriptors that capture the geometrical relations among the points that constitute an object or extract features from multiple viewpoints. We developed a framework for testing different neural models that receive this depth information as input. Also, we propose an approach using three-dimensional RGB-D information as input to Convolutional Neural Networks. Results reported F1 scores greater than 0.9 for the majority of the objects tested, showing that the adopted approach is efficient as well for classification.
25.11.14	Structuring and writing a thesis: Guided discussion for PhD students	Stefan Heinrich, Prof. Stefan Wermter
The idea for this gathering is to discuss what a good PhD thesis needs in terms of structure and coherence. For this we want to have a look at some examples for good theses with varying style and scope to provide a good start as well as hints on considerations and pitfalls for PhD students on all levels. Discussions will include corner stones that PhD committees are looking for as well as why a thesis always is individual but also special and knowledge about writing a Master's thesis helps only to some extent.
11.11.14	Real-Time Gesture Recognition Using a Humanoid Robot with a Deep Neural Architecture	Pablo Barros
Three-dimensional (3D) integral imaging allows one to reconstruct a 3D scene, including range information, and provides sectional refocused imaging of 3D objects at different ranges. This paper explores the potential use of 3D passive sensing integral imaging for human gesture recognition tasks from sequences of reconstructed 3D video scenes. As a preliminary testbed, the 3D integral imaging sensing is implemented using an array of cameras with the appropriate algorithms for 3D scene reconstruction. Recognition experiments are performed by acquiring 3D video scenes of multiple hand gestures performed by ten people. We analyze the capability and performance of gesture recognition using 3D integral imaging representations at given distances and compare its performance with the use of standard two-dimensional (2D) single-camera videos. To the best of our knowledge, this is the first report on using 3D integral imaging for human gesture recognition.
21.10.14	Learning to Reach with Interactive Reinforcement Learning	Chris Stahlhut
Giving interactive feedback, other than well/bad done alone, allows to learn faster in Reinforcement Learning. If we reduce the number of wrong actions the learner takes, e.g. by undoing bad actions, we can reduce the number of steps necessary to learn a task. However, the amount of help needed is not thoroughly researched. This thesis aims to answer this question with two tasks. In both tasks the learner learns to move one arm each to reach for an arbitrary position within its reach. The first task is with a shoulder as joint alone, the second with a shoulder and an elbow. We will discover that the advantage of undoing bad actions manifests itself only in the arm with elbow and shoulder and not in the arm with a shoulder as only joint. We will see that not only did the interactive learner learn faster, it learned to take less steps to reach its target, too. Furthermore, it needed less configuration effort to get good result, while the non-interactive learner did not learn very well in some configurations. If both learners learn well, we need a lot of feedback to show an increase in the behaviour, as well as the learning speed. If the difference between both learners is large, even a probability of 10% of interactive feedback shows an improvement of both properties. With a probability of 60% or more, both properties become more stable than the property of the original learner without interactive feedback.

Summer Semester 2014

Summer Semester 2014
30.09.14	Initial Steps on the robot Nicu: design and implementation plans	Atif Mahboob
In the current research on many robots, one goal is to make inexpensive and safe robots for application in domestic environments. Therefore, we have started to develop a humanoid robot called NICU using the existing robot hardware for further extensions to employ in Human-Robot interaction. NICU exhibits the walking capabilities of Humanoid Open Platform Nimbro-OP and the interaction capabilities of i-Cub Head. The existing arms with three degree of freedom have been already transformed into arms with five degree of freedom and also a grip is included. The arm design is not based strictly on any existing arm manipulator, rather it is a new idea implemented by part design and later printing those parts by a 3D printer. The expectation at the end of the research work is to make NICU capable of performing smoothly with new arms, i.e. to perform grasping using neural network techniques and giving a human a friendly look. The talk is mainly about explaining the progress so far, the approaches used and to have discussions about approaches possibly to be considered for future implementations.
23.09.14	Learning of Human Motion Feedback by Neural Network Self-Organization	Florian von Stosch
In many physical activities such as dancing, sports, and strength training, specific movements have to be learned and perfected. Therefore, it is helpful to receive feedback on how well one is performing a movement and how to improve it. This thesis examines how to give motion feedback automatically by means of learning a model of correct motion and extracting feedback from it. The proposed learning architecture uses a recursive self-organizing neural network model to learn postural and temporal aspects of motion. We present three techniques for feedback extraction from a thus learned model, which give information about the whole body and about individual joints. To evaluate our approach, we captured a body pose dataset of athletes performing compound barbell lifts and dumbbell exercises using a depth-sensing camera. Results show a recall and precision of 0.9 and higher for detection of mistakes in movements on single-subject data.
26.08.14	A Multichannel Convolutional Neural Network for Hand Posture Recognition	Pablo Barros
Natural communication between humans involves hand gestures, which has an impact on research in human-robot interaction. In a real-world scenario, understanding human gestures by a robot is hard due to several challenges like hand segmentation. To recognize hand postures this work proposes a novel convolutional implementation. The model is able to recognize hand postures recorded by a robot camera in real-time, in a real-world application scenario. The proposed model was also evaluated in a benchmark database.
18.08.14	Better Generalization with Forecasts	Dr. Mark Ring
Autonomous agents operating in the real world require a method for representing what they know about the world. Certain recent approaches to representing world knowledge are based in prediction: they describe the world as a set of predictions about what the agent will perceive if it acts in various ways. One predictive method that shows particular promise is the General Value Function (GVF). These GVFs (also called "forecasts") have many benefits, including their simplicity and adaptability, uniformity, flexibility, and semantic interpretability. That is: forecasts can encode, with minimal structure, predictions that can be learned and refined through experience; they can encode small- and large-scale predictions equally well; they can be used without modification for a wide variety of environments and sensorimotor apparatus, and their predictions correspond well to human notions of knowledge. Consequently, forecasts may also more readily capture regularities in an autonomous agent�s sensorimotor stream than other predictive methods. In this talk I will describe recent research that my colleague Tom Schaul and I did to investigate the ability of forecasts to capture such regularities. We generated focused sets of forecasts and measured their capacity for generalization, and then compared the results to a closely related predictive method (PSRs) already shown to have good generalization abilities. Our results indicate quite dramatically that forecasts provide a substantial improvement over PSRs in terms of generalization: the features they encode generalize better to as-yet-unseen parts of the agent's state space and also lead to better value-function approximation in reinforcement-learning agents. Mark Ring received his Ph.D. from the University of Texas at Austin in 1994, focusing then, as now, on continual learning. He has held research positions at the GMD (now Fraunhofer) Institute in Bonn, Germany, at the University of Alberta in Canada (working closely with Richard Sutton), and at IDSIA in Lugano, Switzerland.
12.08.14	Reinforcement Learning for Assistive Robots	Francisco Cruz Naranjo
Interactive reinforcement learning constitutes an alternative for improving convergence speed in reinforcement learning methods. In this work, we investigate inter-agent training and present an approach for knowledge transfer in a domestic scenario where a first agent is trained by reinforcement learning and afterwards transfers selected knowledge to a second agent by instructions to achieve more efficient training. we combine this approach with action-space pruning by using knowledge on affordances and show that it significantly improves convergence speed in both classic and interactive reinforcement learning scenarios.
16.07.14	Improving Domain-independent Cloud-based Speech Recognition with Domain-dependent Phonetic Post-processing	Johannes Twiefel
Automatic speech recognition (ASR) technology has been developed to such a level that off-the-shelf dis- tributed speech recognition services are available (free of cost), which allow researchers to integrate speech into their applications with little development effort or expert knowledge leading to better results compared with previously used open-source tools. Often, however, such services do not accept language models or grammars but process free speech from any domain. While results are very good given the enor- mous size of the search space, results frequently contain out-of-domain words or constructs that cannot be un- derstood by subsequent domain-dependent natural lan- guage understanding (NLU) components. We present a versatile post-processing technique based on phonetic distance that integrates domain knowledge with open- domain ASR results, leading to improved ASR perfor- mance. Notably, our technique is able to make use of domain restrictions using various degrees of domain knowledge, ranging from pure vocabulary restrictions via grammars or N-Grams to restrictions of the accept- able utterances. We present results for a variety of cor- pora (mainly from human-robot interaction) where our combined approach significantly outperforms Google ASR as well as a plain open-source ASR solution.
08.07.14	Self-Organizing Learning Systems for Human Activity Recognition	German Parisi
Recently, there has been a significant interest on ambient intelligence for the recognition of human activity. In this context, the classification of human actions and gestures has shown to be a challenging task to accomplish with an artificial system. Despite the prominent use of depth sensors for pose-motion feature estimation, the question how to process this information for efficiently learning human activity is still open. Additionally, in order to operate in natural scenarios, the recognition process must be robust under noisy conditions and exhibit real-time characteristics. In this talk, I will introduce my previous research work on the recognition of full-body actions and hand gestures using self-organizing networks. I will then present a learning framework based on neurobiological evidence for deploying my approach on mobile robot platforms. I will focus on the processing of time-varying data with incremental self-organizing networks, the extension of these architectures for classification tasks, and a pragmantic comparison between feedforwards and recurrent self-organizing architectures for noisy input.
07.07.14	Machine Learning of Motor Skills in Robotics: From Simple Skills to Table Tennis and Manipulation	Dr. Jan Peters (TU Darmstadt)
Autonomous robots that can assist humans in situations of daily life have been a long standing vision of robotics, artificial intelligence, and cognitive sciences. A first step towards this goal is to create robots that can learn tasks triggered by visual stimuli from higher level instruction. However, learning techniques have yet to live up to this promise as only few methods manage to scale to high-dimensional manipulator or humanoid robots. In this talk, we investigate a general framework suitable for learning motor skills in robotics including both manipulation of static and dynamic objects that are perceived using vision. The resulting approach relies on a representation of motor skills by parameterized motor primitive policies acting as building blocks of movement generation, and a learned task execution module that transforms these movements into motor commands. We discuss task-appropriate learning approaches for imitation learning, model learning and reinforcement learning for robots with many degrees of freedom that perceive the manipulated objects using robot vision. Empirical evaluations on a several robot systems illustrate the effectiveness and applicability to learning control on an anthropomorphic robot arm. These robot motor skills range from basic visuo-motor skills to playing robot table tennis against a human being and manipulation of various objects. Biography Jan Peters is a full professor (W3) for Intelligent Autonomous Systems at the Computer Science Department of the Technische Universitaet Darmstadt and at the same time a senior research scientist and group leader at the Max-Planck Institute for Intelligent Systems, where he heads the interdepartmental Robot Learning Group. Jan Peters has received the Dick Volz Best 2007 US PhD Thesis Runner Up Award, the Robotics: Science & Systems - Early Career Spotlight, the INNS Young Investigator Award, and the IEEE Robotics & Automation Society's Early Career Award. Jan Peters has studied Computer Science, Electrical, Mechanical and Control Engineering at TU Munich and FernUni Hagen in Germany, at the National University of Singapore (NUS) and the University of Southern California (USC). He has received four Master's degrees in these disciplines as well as a Computer Science PhD from USC. Jan Peters has performed research in Germany at DLR, TU Munich and the Max Planck Institute for Biological Cybernetics (in addition to the institutions above), in Japan at the Advanced Telecommunication Research Center (ATR), at USC and at both NUS and Siemens Advanced Engineering in Singapore.
01.07.14	Robotic Implementation and Evolution of Multisensory Localisation with Reward Mediated Learning (Defense)	Andreas Grenzing
In this thesis we transfer a multisensory reinforcement learning model, developed by Weisswange et al., to an actual robotic platform (an iCub) in a real experimental set-up. We evaluate the issues in doing so and prepare and conduct a localisation experiment for evaluating the performance of the learner with respect to our set-up.
24.06.14	Bio-Inspired Sound Source Localisation for Robot Speech Recognition	Jorge Davila Chacon
Sound source localisation (SSL) is important for the safety of robots deployed in dynamic environments and for human-robot communication. In this talk I explain how SSL can considerably improve the performance of automatic speech recognition (ASR) with humanoid robots by tracking the position of a speaker. The embodiment of humanoid robots can provide spatial cues that are important for robust SSL. Such spatial cues come from the interaural time (ITD) and level differences (ILD) of sound sensed by the robot. I present a model where auditory spatial cues are represented with spiking neural networks modelling relevant areas in the mammalian auditory pathway. It models the medial and lateral superior olives for the extraction of ITDs and ILDs. Afterwards, both cues are integrated with a model of the inferior colliculus (IC) using Bayesian inference. Finally, a feed-forward neural network is used to classify the IC output. This final layer is not intended to model a particular area of the auditory system. Nevertheless, it is important to handle non-linearities produced by reverberation and ego-noise from the robot. Once the sound source angle is estimated, the humanoid can turn its head to the angle where ASR exhibits the best performance. Interestingly, ASR does not have the best performance at the angle where the ear microphone points directly to the speaker, but at the angle where the ear pinna reflects sound most intensely towards the microphone. To conclude the talk, I will discuss possible extensions of this model to include vision, e.g. for segregating speech at the neural level in addition to the behavioural level. I look forward to your participation in the seminar.
10.06.14	Restricted Boltzmann Machines for Sequence Learning	Dennis Hamester
In my current work, I am exploring Restricted Boltzmann Machines (RBM) as building blocks for Fingerspelling (FS) recognition. FS is part of many sign languages and is used by deaf people to manually spell out names and words. My FS dataset is given as a labelled sequence of hand postures, captured and represented densely using the hand pose estimation developed during my Master's Thesis. FS recognition is particularly interesting, because neighbouring postures are not isolated but can form higher level structures bearing complex semantics, like syllables and words. This motivates the use of RBMs, because they can be easily stacked on top of each other to form so called Deep Belief Networks in order to learn more complex structures. RBMs are a subclass of undirected probabilistic graphical models. Despite their origin in probability theory, RBMs have close connections to Artificial Neural Networks (ANN). From this perspective, RBMs are simple feed forward ANNs with an unsupervised Hebbian learning rule. This way, RBMs can be interpreted as mapping observations onto latent units that best explain the input. Certain variants, e.g. the Conditional RBM (CRBM), are also capable of learning sequential data. In this talk I will present the FS recognition task and the FS dataset. Furthermore, (Conditional) RBMs will be introduced as well as some common use-cases and results.
20.05.14	Who's Afraid Of Singular Matrices? Reservoir Computing for Dynamic Gestures	Doreen Jirak
Using Recurrent Neural Networks (RNN) for sequence modeling is one of the fundamental approaches in the area of neurally-inspired computations. Derived from classic dynamical system theory, computations in RNNs trained by Backpropagation Through Time (BPTT) suffer from vanishing error gradients, trapping in local minima and bifurcation. Reservoir Computing (RC) offers an alternative for RNN training by fixing network topology and facilitating learning by simple linear readout of the network weights. In this talk I will give an introduction to RC, especially Echo-State-Networks (ESN), and explain why this approach could be useful for dynamic gesture recognition. In a classification task, I employed ESN as ensembles for 5 commanding gestures. First experiments show promising classification results. In a later stage, I would like to transfer the approach to humanoid robots for application in a Human-Robot-Interaction (HRI) scenario. The steps to achieve this comprise using ESNs also as motor generating network to learn the robot proactively participate in HRI. Also, the multimodal aspect using gestures as visual and language as additional auditive channel are of special interest for my work.
06.05.14	Towards Bioinspired Robot Navigation - RatSLAM with the Humanoid Robot NAO	Stefan Müller
Mapping, localization and navigation are three basic, but important tasks for autonomous robots. They have always been major challenges in robotics. Variations in environment and local conditions force robots to behave more robustly than possible with hard-coded behaviors. Over time many approaches dealing with this problem have been done. Most of them use data collected from odometry or landmarks and a priori generated maps of the environment. During the last decades, ever increasing computational power has led to powerful probabilistic Simultaneous Localization and Mapping (SLAM) algorithms, which deal with real-world sensory data using Kalman filters, particle filters and/or expectation-maximization algorithms (EM). Therefore, SLAM is well investigated, developed and used most frequently nowadays. In the last years, there has been much research in biological motivated mapping and localization approaches. These approaches try to achieve accurate mapping with the use of noisy inaccurate sensors. In 2004, a biologically inspired SLAM approach has been introduced that offers a solution for concurrent mapping, localization and navigation that can be used on real robots. This approach is called RatSLAM. It is based on a computational model of parts of the rodent's hippocampus, which maintains the believed location inside the world. RatSLAM uses landmark sensing and odometric information in combination with a competitive attractor network (CAN) to form a topologically consistent representation with largely Cartesian properties. For now, RatSLAM was used for wheeled robots and for straightforward flying quadrotor platforms. It was never used in combination with humanoids because of their increased constraints. In this presentation an approach is presented that uses RatSLAM as a basis for an adapted mapping and navigation solution for the humanoid robot NAO from Aldebaran Robotics.
22.04.14	Planning and Decision Making Strategies in Environments with Partial Knowledge	Marcelo Borghetti Soares
Mobility in Wireless Sensor Networks (WSN) are employed today not only as an efficient mechanism to increase the network's lifetime but as an effective cooperative way to perform difficult tasks ranging from coverage to exploration in complex environment. There are several research issues in both planning and control of robots carrying out missions in synergy with WSNs. Not only there are relevant optimization problems that need be tackled, but also questions related to the inherent stochastic nature of real robots, real sensor nodes and real environments. The objetive of this seminar is to present planning and decision making strategies that can be serve as inspiration to develop solutions that model the rational behaviour of a team of robots acting cooperatively.

Winter Semester 2013/2014

Winter Semester 2013/2014
28.03.14	Master Project Presentation: Human-Robot-Interaction	Master students HRI project
The dissemination of robotic technologies for domestic purposes was the driving force for our Master project "Human-Robot-Interaction" (HRI). The goal of our project was to focus on how HRI-related important topics like vision, navigation and behaviour features the purpose of robotics in a domestic environment. A simple task for humans, yet difficult to implement on robots, was to navigate from the KT kitchen to different other office rooms searching for used cups. The project simplified the object recognition approach by implementing an interacting senario between the robot and the users in the room where for the used cups was asked for. In this presentation, we will present our algorithms in more detail and give insight into the performance of our experiments, which had been carried out in the KT environment. The talk is followed by a live-demonstration.
25.03.14	Comparison of Different Learning Algorithms in a Dynamic Environment	Max Bryan
The well-known machine learning algorithm Reinforcement Learning uses Q-values, which spread the information about which action to take from the goal state to the surrounding states. Thereby, over several trials, the information is back-propagated. Experiments with rats or human beings, on the other hand, have shown that decision-making can be seen as forward thinking, respectively thinking in creating plans and executing them. The thesis investigated different machine learning algorithms, both backward and forward oriented, and tested their behavior in a dynamic environment, in which the goal condition changes over time. The Q-value-based algorithms all suffered from deprecated Q-values after each goal change. A planning algorithm, which only learns a model of the environment and does not use Q-values, performed best in small environments, but suffered from an exponentially growing size of possible paths to check. So on the one hand, planning algorithms that use Q-values can reduce the number of planning paths to check effectively if the Q-values are useful. On the other hand, they can compensate for the deprecated Q-values to some extent.
11.03.14	Development and Evaluation of Semantically Constrained Speech Recognition Architectures	Johannes Twiefel
Cloud-based automated speech recognition (ASR) systems employ large acoustic models which lead to good performance in a wide domain like web searches or dictations. In restricted domains (like ones using a constrained vocabulary), these systems often do not perform as well as local ASR systems because they are not adjustable to that domain due to API restrictions. (Users are able to pass audio data to the system, but no further configurations are possible like passing lan- guage models.) We investigated the option of post-processing the results of cloud-based ASR systems instead of employing adjustable local ASR systems. The post-processing heuristics are based on phonetic distances and different matching techniques and differ in the extent of applied domain knowledge by ranging from vocabularies over statistical language models and grammars to a fix list of sentences. To be able to compare our approach with related approaches we chose corpora from the human robot interaction (HRI) domain and a well-known standardized corpus. The results of our experiments show that most of our heuristics significantly outperformed the Google ASR system and a local open-source ASR system. We also observed that providing more domain knowledge to the post-processing system leads to better performance.
04.03.14	Learning to reach with interactive machine learning	Chris Stahlhut
Every robotic arm has to reach in order to grasp an object and reinforcement learning allows the agent to learn this task. To date, multiple approaches have been shown to work but it may take many steps until the learner reaches the target reliably. Interactive machine learning allows a human or other external, more knowing agent to help the learner along. It may reduce the number of steps to reach the target for the first time by "nudging" it towards a better position or giving more informative feedback. In my thesis I intend to explore how these additional influences can be used to augment an existing reinforcement learning algorithm and if or how much they increase the learning performance. I plan to test this augmented algorithm in a reaching task with a simple 2 dimensional 2 DoF arm.
04.03.14	Learning of Human Motion Feedback by Neural Network Self-Organization	Florian von Stosch
In many physical activities such as dancing, sports, and strength training, specific movements have to be learned and perfected. Therefore, it is helpful to receive feedback on how well one is performing a movement and how to improve it. This thesis aims to develop a learning framework for providing feedback on the correctness of human body movements. The system will use a feature extraction module for capturing relevant properties of body motion. A recurrent self-organizing neural network architecture will be then used to learn a set of correct movements. At detection time, the system will estimate a measurement on the correctness of the test motion sequence and provide qualitative feedback on how to fix mistakes in the dynamics of body postures. The framework will be evaluated on datasets containing both correct and wrong body motions from depth map video streams.
11.02.14	Interactive Language Understanding with Multiple Timescale Recurrent Neural Networks	Stefan Heinrich
Natural language is the cognitive capability that clearly distinguishes humans from other living beings but is by far not yet understood. In contrast to other cognitive functions or methods to obtain precise information how language is processed and - more important how - language has developed in the brain are very limited. Traditional generativists theories on language processing are often problematic in explaining 'the how', because they neglect the involvement of embodiment and situated context, i.e. the close relation of language processing and proprioception. Recent constructivists theories are often limited to grounded items on single word level, not capturing the complex temporal dynamics. In recent research we aimed at bridging this gap in terms of searching for the crucial characteristics of a neural architecture that allows for the emergence of language. With a MTRNN-based architecture that combines the processing of complex phonetic utterances with visual experience we showed that the MTRNN inherits some important characteristics in terms of connectivity, plasticity, and temporal abstraction. In the upcoming talk I want to follow up on these studies and report on the recent refinements and extensions of the architecture. In the new multimodal architecture that also reflects the cell assemblies theory we aim at taking motor movements into account and show that the modelled characteristics allow for the abstractions of concepts from visual and proprioceptive perception as well as the generalised verbalisation of concepts in natural language.
28.01.14	Sound Source Localisation for Speech Recognition	Jorge Davila-Chacon
It is possible to think of daily life examples where sound source localisation improves the understanding of language in noisy environments, such as when elderly people with auditory deficits turn their heads to 'face' the speaker with their ear instead of their eyes. Following this heuristic, we explored the impact of such turning behaviour for improving speech recognition using the iCub humanoid head. More specifically, we assessed the feasibility and impact of this approach by measuring 1) the time it takes the robot to face a sound source from different starting angles, and 2) whether a specific angle between the sound source and the robot's face improves speech recognition. After initially obtaining promising results, a second round of experiments displayed oscillations in the system's performance. However, further inspection in the mechanism used for extracting sound signals revealed a clear advantage in the use of sound source localisation for improving speech recognition.
14.01.14	Experiments in Multisensory Localization	Johannes Bauer
Over the last more-than-two years, I have been theorizing about multi-sensory integration. I have developed a model of multi-sensory integration in the superior colliculus (SC) and shown that it both performs near-optimally and reproduces certain aspects of the neurophysiology of the SC and the behavior of SC-using animals... in computer simulations. It is high time that I applied this model in a physical experiment. In the first part of this talk, I will speak about the goals and motivation of my PHD project and give a description of the model I have developed. This will be followed by a discussion of the biological phenomena this model reproduces and a blue-skies outlook on some of the next steps in modeling I want to take. The second part of the talk will be concerned with the currently pressing issue of designing a robotic experiment for my model. To give you an intuition of the conceptual problems I am facing, I will first give you my thoughts on what I think an experiment can do for me in principle. After that, I will explain some technicalities of the infrastructure available for the experiment, the kind of output I want to generate, and criteria for the input. The talk will end with an invitation to the audience to give their thoughts and suggestions.
17.12.13	Simulation of humanoid robots in V-Rep and Webots	Gerrit Steinberg
This bachelor thesis investigated two different kinds of robot simulation programs for comparison, namely Webots and the open-source simulator V-Rep. The former is well-established for robotic simulations but expensive, the latter is more cost-effective and thus interesting for future usage in robotic projects. An important aspect in the simulation of robots is the simulation of humanoid robots, with NAO (a specific humanoid robot manufactured by French robotics company Aldebaran Robotics) taking a special interest in this field due to its widespread use. Although the V-Rep environment does provide an existing model for the NAO robot, that model is not controllable through NAOQi. In this regard, the question arises whether V-Rep can be adapted to control the NAO just as easily as Webots can. The NAO runs on a software called Naoqi. The simulator Webots offers already a robot controller that uses the functionality of Naoqi, making it easier for the programmer to specifically tell the robot what the user wants him to do. In V-Rep, this function is not available. After extensive research, it was found out how exactly the NAO is able to be controlled via V-Rep. This research consisted of examining various control methods, followed by an explicit comparison between V-Rep and Webots in terms of their complexity, implementation, as well as similarities and differences In the end, it can be concluded that V-Rep represents a suitable alternative to Webots. Although not all of the functionality was tested, it was shown that V-Rep has, in some parts, even advantages over Webots, for example the choice of physics engines and the possibility to run V-Rep much faster than Webots.
05.12.13	Dimensionality reduction and coordinated movement control: How nature alleviates the 'curse of dimensionality'	Naveen Kuppuswamy, ETH Zürich
The versatility and adaptability of behaviour demonstrated in nature is usually ascribed to the large neuro-mechanical redundancy in the body of organisms. The resulting high-dimensionality however complicates our understanding of the principles underlying coordinated movement control and learning; the so-called "Curse of Dimensionality" in motor control. Elucidating the principles by which the Central Nervous System (CNS) alleviates this curse in achieving motor coordination is vital for both understanding natural intelligence as well as for developing novel control techniques for complex robots. In this talk, a mitigation strategy based on dimensionality reduction in control is presented, for the facilitation of learning real-time movement coordination. In the first part, the dimensionality problem and conventional techniques for planning and control in robotic systems are reviewed along with the neuroscientific hypotheses of optimal motor control, developmental skill acquisition, and muscle synergies. In the second part, a control-theoretic framework is presented for a synthetic analysis of reduced dimensional behaviour in systems. The framework is used for analysing the different factors facilitating dimensionality reduction in systems, i.e. (i) natural dynamics, (ii) output (task-space relevance) and, (iii) input (modularisation of the control). On the basis of various simulation experiments, a principle of exploiting reduced dimensionality is proposed for the design and control of systems that must autonomously acquire motor skills. The implications of this principle are then discussed for the neuroscientific theories of motor coordination and development and for the design and control of high-dimensional artificial systems such as biomimetic robots. Short Bio: Naveen Kuppuswamy holds a Master of Science (MS) in the field of Electrical Engineering from the Robot Intelligence Lab of the Korea Advanced Institute of Sience and Technology (KAIST). He successfully defended his PhD-thesis concerning goal-oriented motor control in the Artificial Intelligence Lab within the group of Prof. Dr. Rolf Pfeifer (University of Zurich, Switzerland). He is affiliated in recent research projects like RobotDoc and Amarsi.
03.12.13	Three Models of Haar-NN	Nils Meins
In this talk we will present our work according to HaarNN-Ensembles. HaarNN-Ensembles use Haar-feature-like pattern as input for Hebb-learning Neural Networks combined with the ensemble learning algorithm called 'Adaboost'. We consider three different models of HaarNN: First, the Bayes-like stable pattern model which combines the stable states of an Hopfield Neural Network with the Bayes method [1]. The second method is in analog to the first, but uses the cluster algorithm K-means for grouping the input. The third approach groups the input pattern for the different classes through an moving average and has online learning ability. [1] Meins, N., Magg, S., Wermter, S. Neural Hopfield-ensemble for multi-class head pose detection. Proceedings of International Joint Conference on Neural Networks (IJCNN). pp. 1327-1334, Dallas, 2013.
26.11.13	BSc Defense: Robot Navigation Using a Cognitive Map in Webots	Swantje Thüring
Die hier vorgestellte Arbeit beschäftigt sich mit der Navigation von mobilen Robotern und gliedert sich in zwei Arbeitsschritte. Zunächst wurde das von W. Yan et al. [1] vorgestellte Framework in der Simulationsumgebung Webots implementiert und damit für Simulationen nutzbar gemacht. Damit ist es möglich, Roboter in Webots mit der von Yan vorgestellten Navigation zu betreiben. Im zweiten Schritt wurden diese Roboter verschiedenen, aufeinander aufbauenden Testszenarien ausgesetzt, um dort ihre Fähigkeit zur Navigation auf die Probe zu stellen. Diese Tests wurden für diese Arbeit dokumentiert und ausgewertet, um schließlich Rückschlüsse zu ziehen, inwiefern die Navigation mit verschiedenen Umgebungen, aber auch mit unterschiedlichen Robotern funktioniert. [1] Yan, W., Weber, C., Wermter, S. A Neural Approach for Robot Navigation based on Cognitive Map Learning. 2012 International Joint Conference on Neural Networks (IJCNN2012), pp. 1146-1153, Brisbane, Australia, 2012.
19.11.13	Hierarchical Predictive Mechanism Modelling based on Recurrent Neural Networks	Junpei Zhong
Predictive mechanisms exist hierarchically in the cortical areas, which compensate delays via prediction so that the cortical areas encode present, not past, events. To realize such compensation in the context of artificial neural networks and cognitive robotics, we suggest to use recurrent connections to create an internal memory to store the previous dynamics. In the first experiment, we emphasized that a short-term memory should exist within the mirror neuron system to assist with the action understanding and prediction. A recurrent network with parametric biases was thus applied to realize this system function in which the parametric bias units recognize and encode walking patterns from another robot. Furthermore, in the second experiment, we focused on the encoding of information in the primary visual cortex. Specifically, the convergence of ventral and dorsal pathways with predictive functions may give rise to the understanding of object affordance and object manipulation and its control. To model this, we designed a recurrent predictive network with a horizontal product where the information of object feature and object movement becomes successfully separated in its two hidden layers. Since the dorsal pathway also allows for motor-relevant representations, in the third experiment, we also developed such recurrent connections for sensory latency compensation which also support smoother and faster behaviours in sensorimotor integration tasks. We expanded the use of recurrent connections in the sensorimotor system, particularly in the sensory prediction part, so that the recurrent connections could compensate the delay in the sensory percepts. A continuous actor-critic automaton (CACLA) was also applied in the generation of smooth behaviours corresponding to the predictive sensory percepts. In the last experiment, we combined the first and second experiments and proposed that the learning of the visual pathways also leads to the conceptualization of visual information. This was realized by recurrent networks with a horizontal product too, which was extended with parametric bias units. We exemplified this model through a robot passively observing an object to learn its features and movements and meanwhile the pre-symbolic representation was self-organized in the parametric units.
04.11.13	Neurocomputational mechanisms for self-protective robot behaviour	Nicolas Navarro-Guerrero
Self-defensive and survival neural circuits belong to the most important and essential capabilities of any organism. They are believed to impose the ground rules for more complex and motivated behaviour. Although many of these innate responses are hard-coded in the brain, they are not sufficient for the organisms' survival. They have to adapt to new and unexpected situations within their lifetime and thereby be able to interact effectively with their environment. A key component on the lifetime adaptation is the formation of associations/memories between environmental predictors and relevant events, which mainly rely on punishment and reward learning. We believe that a deeper understanding of innate and learned defensive mechanisms could also be helpful in developing future robot generations, making them more adaptable and robust. In this research project, we are studying and developing neurocomputational self-protective mechanisms for a humanoid service robot such as association of appetitive stimuli and more importantly association of aversive or noxious stimuli. Our first experiment addressed reward-driven learning where a robot is trained to seek for appetitive stimuli. Here a reinforcement learning (SARSA) algorithm was optimized to learn in a real-world scenario and manoeuvre a humanoid robot towards a charging station. On a second experiment we study the role of noxious stimuli on the formation of anticipatory behaviour. This experiment based on the vast literature on Pavlovian and instrumental conditioning and how environmental cues can be used to anticipate negative outcomes. A hybrid approach using an echo state network (ESN) and a dopamine modulated Pavlovian conditioning was used to anticipate painful signal based on auditory cues. Finally, our current research focuses on emulated pain signals due to its important role in attracting attention and modulating decision making and action. Pain is an unpleasant sensory experience associated with actual or potential tissue damage and thus a strong indicator for behaviour readjustment. However maladaptive behaviours may arise leading to persistent avoidance behaviour and ironically to more pain and long-term disability. Particularly, we are studying the boundaries between adaptive and maladaptive withdrawal behaviours in a motor learning task. Specifically, a continuous actor-critic automaton (CACLA) algorithm optimized for real-world learning is used to teach a humanoid robot to reach for a moving target.
22.10.13	Gestures and Reference Instructions for Deep Robot Learning	Pablo Barros
The communication task is one of the most important regarding the Human-Robot Interaction (HRI). It is the main form of cooperation between humans and robots, and thus several studies have been made focusing the natural language interpretation. One of the main points of the human-robot communication is the joint attention using gestures and speech. This kind of communication leads to scenarios when a human can express things like: " Robot, get me that tool" or "Give it to me" always pointing to some object. This problem has a multi-modal temporal nature and thus is still a complex pattern to be recognized. My current PhD work proposes the use of deep architectures allied with computer vision and speech recognition techniques to solve this multi-modal problem. In this talk the motivation and the background for this particular problem will be presented, as the project plan to walk through this problem using the deep architectures advantages. An application scenario and the methodology of this work will be discussed as well as the expected results.
15.10.13	Neural-based Recognition of Human Actions from Depth Images	German I. Parisi
In the last 5 years there has been a relevant increase for research on ambient intelligence for the recognition of human activities and behavioral patterns in indoor environments. The recent introduction of low- cost range sensors using structured light technologies (e.g. Microsoft Kinect and ASUS Xtion) allows to efficiently estimate object distances from the observed scene and obtain a mapping in terms of both projective and real world coordinates, overcoming a set of historical limitations imposed by most color cameras. Not surprisingly, this novel sensory trend led to an increasing number of vision-based applications using depth maps instead of, or in combination with, RGB pixel matrices. The use of depth information has shown to increase robustness under varying light conditions and significantly reduce computational effort for demanding visual tasks such as the tracking of moving human targets. However, despite the development of new sensor technologies and an exponential growth in computational power, the human brain continues to outperform vision-based applications for object segmentation, tracking and behavior analysis in terms of performance and accuracy. One of the major reasons for this flaw is the disregard of the dynamics of the human visual system for encoding and processing visual stimuli to solve those visual tasks that a human observer can achieve with little effort. It is therefore of great importance to study convenient motion representations and hierarchical neural architectures that better take advantage of processed depth information. In this seminar I present the state of the art in learning systems for human action recognition using depth images and analyse the use of biologically inspired architectures to develop applications for real time behavior analysis. For this purpose, I describe a set of strategies derived from neurocognitive evidence to extract and process relevant motion features from an observed scene and use them as input for a hierarchy of self-organizing networks, e.g. SOM and other extensions. I conclude with future directions on human activity recognition and the introduction of mobile robot platforms to extend depth sensors for active tracking.

Summer Semester 2013

Summer Semester 2013
24.09.13	Neural Network Ensembles Using Haar-like Features for Face Detection	Tayfun Alpay
Face detection has experienced a growing interest in the last decade with the rise of intelligent real-time systems. Applications range from video surveillance to mobile apps and tagging in social media. We have explored the performance of neural network ensembles performing face detection by applying perceptron learning to the Viola Jones Object Detection Framework. We have trained single and multi-layered perceptrons as weak classifiers, which use Haar-like feature templates as inputs. These networks are grouped into ensembles by AdaBoost. Our focus is on the binary classification problem, although we also achieve successful multiclass head pose detection (three viewing directions) with multi-layered perceptrons. In this thesis we evaluate the impact of boosting on the neural networks, as well as the use of different types of feature templates. Our results show that neural network ensembles can yield a high detection rate with a low number of false positives.
27.08.13	Supervised and interactive reinforcement learning for robots actions	Francisco Cruz
Humans learn actions on different ways, one of them is researching our environment and trying repeatedly up to achieve a desired aim. Learning may be supported by e.g. a person advising another one in the process of using a coffee machine the first time. Reinforcement Learning has demonstrated being a useful learning approach, but, as recent work has shown, lacks often performance. This work looks for improving learning speed of a robot within human agents interaction. It is proposed to use a domestic scenario, where a human supports a robot in cleaning tasks of a plane surface (e.g. a table). In this talk first ideas and scenarios will be presented. In addition, the proposal invites discussion about using interactive reinforcement learning for action prediction and neuro-biologically motivated representations.
06.08.13	Human robotic simulation in a new work environment	Gerrit Steinberg
Webots is a state of the art robot simulator used by WTM but with the high price tag isn't there something else we could use? I will explore a recently introduced simulation environment called V-REP to simulate robots and try to find out what's different to Webots in general and whether possible improvements are worth the effort to switch to V-REP. Since WTM already has an extensive body of work done for Webots, easy transfer of existing modules for the NAO robot to V-REP is essential to justify a move towards the new environment. I will try to explore and compare several ways of merging existing code into the V-REP environment and the outcome of this project will be (one or more) modules to interface NAOqi and therefore previous work with a simulated NAO in V-REP. Additional analysis on simulation performance will follow to complete the analysis of possible advantages and disadvantages.
17.07.13	Grasping in 3 dimensions with a NAO robot using Self-Organizing Maps and stereovision	Timm Holler
This thesis deals with the experiment of training different SOM set ups for the task of grasping in three dimensions with a robot arm. The training and recognition is based on data provided by one, respectively two cameras. Three different SOM set ups were trained and evaluated. One set up was a single SOM trained with the data of one camera(1x2D), the other setup was one SOM trained with data of two cameras (1x4D) and the last set up consisted of two SOM, each trained with the data of a different camera(2x2D). The results suggests that given the context of the experiment the 2x2D is the superior configuration and that the set ups that involved both cameras simultaneously are significantly more precise than the set up trained with data from only one camera.
16.07.13	Hierarchical SOM-based Detection of Novel Behavior for 3D Human Tracking	German Parisi
We present a hierarchical SOM-based architecture for the detection of novel behavior in indoor environments. The system can unsupervisedly learn normal activity and then report novel behavioral patterns as abnormal. The learning stage is based on the clustering of motion with self-organizing maps. With this approach, no domain-specific knowledge on normal actions is required. During the tracking stage, we extract human motion properties expressed in terms of multi-dimensional flow vectors. From this representation, three classes of motion descriptors are encoded: trajectories, body features and directions. During the training phase, SOM networks are responsible for learning a specific class of descriptors. For a more accurate clustering of motion, we detect and remove outliers from the training data. At detection time, we propose a hybrid neural-statistical method for 3D posture recognition in real time. New observations are tested for novelty and reported if they deviate from the learned behavior. Experiments were performed in two different tracking scenarios with fixed and mobile depth sensor. In order to exhibit the validity of the proposed methodology, several experimental setups and the evaluation of obtained results are presented.
11.07.13	Neural Hopfield-ensemble for multi-class head pose detection	German Parisi/Nils Meins
Multi-class object detection is perhaps the most important task for many computer vision systems and mobile robots. In this work we will show that Hopfield Neural Network (HNN) ensembles can successfully detect and classify objects from several classes by taking advantage of head-pose estimation. The single HNNs are using pixel sums of Haar-like features as input, resulting in HNNs with a small number of neurons. An advantage of using these in ensembles is their compact form. Although it was shown that such HNNs can only memorise few patterns, by utilising a naive-Bayes mechanism we were able to exploit the multi-class ability of single HNNs within an ensemble. In this work we report successful head pose classification, which presents a 4-class problem (3 poses + negatives). Results show that successful classification can be achieved with small training sets and ensembles, making this approach an interesting choice for online learning and robotics.
11.07.13	Master project: Human-Robot-Interaction - Final Presentation	Faisal Rashed, Maximilian Bryan, Longshan Sun Ahmed Saleh, Florian von Stosch
A robot system for finding previously learned objects in a known environment In the Human-Robot Interaction Project, we implemented a mechanism for a robot to find previously learned objects in a semi-complex indoor environment. We integrated our mechanisms into the behaviour and learning framework of the previous project team, which enables an humanoid robot (Nao) - capable of getting taught via speech - to distinguish between different objects on a static table and to point to the object in question. In addition, we combined the Nao with a XTion and a Turtlebot platform, for which we developed a search algorithm on top of an efficient navigation system. We also enhanced the existing speech recognition as well as segmentation and object recognition to cope with our scope. Furthermore, we evaluated our approach in terms of speech recognition, object recognition and search algorithm performance to show the effectiveness of our project.
02.07.13	Neural and Statistical Processing of Auditory Spatial Cues	Jorge Davila-Chacon
When confronting binaural sound source localisation (SSL) algorithms with different environments and robotic platforms, there is an increasing need for non-linear integration methods of spatial cues. Based on interaural time and level differences, we can compare the performance of neural vs. statistical SSL methods. For this purpose, we implemented a testing architecture with three degrees of freedom, i.e. a different combination of a) representation of binaural cues, b) clustering and c) classification algorithms. The heuristic for the selection of methods is the same at each degree of freedom: to compare the impact of traditional statistical techniques versus machine learning algorithms with different degrees of biological inspiration. The overall performance is evaluated in the analysis of each system, including the accuracy of its output, training time and adequateness for life-long learning. The results support the use of hybrid systems consisting different kinds of artificial neural networks, as they present an effective compromise between the characteristics evaluated.
18.06.13	MSc Defense: Improved Full-DOF Hand Pose Estimation Using Depth Images	Dennis Hamester
Hand pose estimation is the task of deriving a hand's articulation from sensory input. Traditional approaches to this challenging problem rely on either costly hardware or measure directly at the user's hand (e.g. data-gloves). This thesis presents a CV-based method, which captures the complete kinematic state of the hand. Recent research indicates that the Kinect is a viable device to implement marker-less pose estimation, thus enabling unobtrusive and natural HCI applications. These approaches formulate pose estimation as an optimization problem: a high dimensional hypothesis space is constructed from a hand model, in which particle swarms search for the best solution. We propose various changes to this novel approach. For preprocessing, we do not rely on skin color, but exploit the properties of depth images and recognize hands by their shape. Our hand model is extended to include anatomical constraints of hand motion by applying a PCA. This also allows us to treat pose estimation as a problem with variable dimensionality. The most important benefit becomes visible once our PCA-enhanced model is combined with biased particle swarms: accuracy and performance of pose estimation are improved significantly. Several experiments were conducted as part of this thesis. These give proof to the superior properties of our pose estimation.
04.06.13	Self-Organization and Statistics for Modeling Multi-Sensory Integration in the Superior Colliculus	Johannes Bauer
The superior colliculus (SC), or optic tectum (OT) as it is called in non-mammals, is a prime example of natural multi-sensory integration. A brain region found in all vertebrates, it combines auditory, visual, and tactile information to locate events in space, and generates motor output resulting in orienting behavior. The talk will give a brief introduction to/refresher of SC physiology, on the behavioral and neurological levels, motivate the merit of modeling the SC from the perspective of a computer scientist, and then discuss why and how self-organization and statistics are good tools for the task.
21.05.13	Quo Vadis Gesture Recognition?	Doreen Jirak
07.05.13	Recurrent Neural Networks for Grammatical Structure Processing, with an Application to Human-Robot Interaction	Xavier Hinaut
Primates can learn complex sequences that can be represented in the form of categories, and even more complex hierarchical structures such as language. The prefrontal cortex is involved in the representation of the context of such sequential structures, and the striatum permits the association of this context with appropriate behaviours. In my PhD work, the Reservoir Computing paradigm is used: a recurrent neural network with fixed random connections models the prefrontal cortex, and the read-out layer (i.e. output layer) models the striatum. The model was applied to language syntactic processing, especially thematic role assignment: given a sentence this corresponds to answer the question "Who did what to whom?". The model processes categories (i.e. abstractions) of sentences which are called "grammatical constructions". The model is able to (1) process correctly the majority of the grammatical constructions that were not learned, demonstrating generalization capabilities, and (2) to make online predictions while processing a grammatical construction. Moreover, we observed that when the model processes less frequent constructions an important shift in output predictions occurs. It is proposed that a significant modification of predictions in a short period of time could be related to Evoked-Related Potentials (ERP) like the P600 which typically occurs when unusual sentences are processed. Finally, to show the ability of the model to deal with a real-world application, the model was applied in the framework of human-robot interaction for both sentence comprehension and production. The sentence production model was obtained by "reversing" the sentence comprehension model: it processes meanings as input and produces sequence of words in output. Short Bio: Xavier Hinaut received an M.S. in Computer Science from the University of Technology of Compiègne (France) in 2008 and an M.S. in Cognitive Science from the Ecole Pratique des Hautes Etudes (France) in 2009. In early 2013, he obtained a PhD in Computational Neuroscience at the INSERM Stem Cell and Brain Research Institute in Lyon (France) under the supervision of Peter Ford Dominey. His research explores brain mechanisms of complex sequence processing, in particular on language syntactic comprehension. He is interested in using Recurrent Neural Networks to model these processes and investigating how such models can be used in developmental robotics.
23.04.13	Impact of pain-like signal on robot learning: more than a trigger of reactive behaviours	Nicolas Navarro
Self-defensive and survival neural circuits belong to the most important and essential capabilities of any organism. They are believed to impose the ground rules for more complex and motivated behaviour (Sternson, 2013). Pain signals have a key role in these processes attracting attention and motivating decision and actions. Although pain signals are quickly linked to their role in reactive behaviour, they also play a key role in learning, for instance leaning to predict pain enables agents to avoid injuries. Moreover we believe that pain signals may also speed up learning and lead to smoother and safer behaviour. To test this hypothesis we have design an object reaching experiment with a Nao humanoid robot. Here we use reinforcement learning (Continuous Actor-Critic Learning Automaton, a.k.a. CACLA) to test the impact of pain-like signal on learning. Learning from scratch approach is used to compare learning speed and subjective aesthetic robot's pose when reaching for an object.
09.04.13	Embodied Language Understanding with a Multiple Timescale Recurrent Neural Network	Stefan Heinrich
How the human brain understands natural language and what we can learn for intelligent systems is open research. Recently, researchers claimed that language is embodied in most - if not all - sensory and sensorimotor modalities and that the brain's architecture favours the emergence of language. In this talk I will report our recent study about the characteristics of such an architecture and propose a model based on the Multiple Timescale Recurrent Neural Network, extended by embodied visual perception. The study shows that such an architecture can learn the meaning of utterances with respect to visual perception and that it can produce verbal utterances that correctly describe previously unknown scenes. In addition I would like to open a discussions about integrating visual sensations for multiple objects in the scene over time to investigate the characteristics of an architecture that is supposed to reflect relations between objects in natural language.

Winter Semester 2012/2013

Winter Semester 2012/2013
26.03.13	Progress and Open Questions in Robot Navigation based on Cognitive Map	Wenjie Yan
As a fundamental research topic, autonomous indoor robot navigation is challenging in robotics. Although many models for map-building and planning exist, it is difficult to integrate in the real world due to the high amount of noise and complexity. Following our previous work [1], we present a neural cognitive model for environment mapping and robot navigation based on learning spatial structure by observing the persons' movement using a ceiling-mounted camera. A robot can plan and navigate to any given position in the room based on this map, and adapt it based on interaction with the obstacles. In addition, consistent with the suggestion of [2] that a correspondence between the notions of visual appearance and spatial location representations exists, salient visual features are learned and stored in the map during navigation. This anchoring of visual features in the map space enables the robot to find and navigate to a target object by showing an image of it. We implemented this model on a humanoid robot and tests are conducted in a home-like environment. Results of experiments are evaluated and we conclude our work with an outlook for the future development. [1] W. Yan, C. Weber, and S. Wermter, "A neural approach for robot navigation based on cognitive map learning." in Proceedings in the 2012 International Joint Conference on Neural Networks (IJCNN2012). Brisbane, Australian: IEEE, June 2012, pp. 1146-1153. [2] M. J. Farah, K. M. Hammond, D. N. Levine and R. Calvanio, "Visual and spatial mental imagery: Dissociable systems of representation." Cognitive Psychology, Volume 20, Issue 4, October 1988, pp 439-462.
12.03.13	BSc Defense: Application and Analysis of Classification in Textcategorization: A Comparative Study	Stefan Thomas
Text categorization is the task of discovering the category or class text documents belongs to, or in other words spotting the correct topic for text documents. While many machine learning schemes for building automatic classifiers exist today, these are typically resource demanding and do not always achieve the best results when given the whole contents of the documents. A popular solution to these problems is called feature selection. The features (e.g. terms) in a document collection are given weights based on a simple scheme, and then ranked by these weights. Next, each document is represented using only the top ranked features, typically only a few percent of the features. The classifier is then built in considerably less time, and might even improve accuracy. In this work we implemented different feature selection algorithms, create feature vectors of different lengths and use them together with different classification algorithms. The quality of the classification is used to examine the interactions between the algorithms. We found that the concrete selection algorithm is less important for the resulting quality of the classification than the length of the feature vector. To compute our experiments faster, we used a parallel implementation for the feature selection and the classification task.
27.02.13	Visual 3D Reconstruction, Registration and Recognition	Kevin Köser, ETH Zürich
Kevin Köser is a senior researcher in the Computer Vision and Geometry Lab at the Swiss Federal Insitute of Technology (ETH) Zurich. His research interests lie in 3D computer vision problems, such as shape and motion extraction from video, visual geolocalization, registration and tracking and recently involved novel feature representations and geometric relations as well as efficient robust matching schemes. He has worked in several national and international collaborations on 3D reconstruction, registration and recognition with industrial partners such as the BBC, Bosch, Keystone, Leica Geosystems, Navteq, Nokia, Swiss Television SF, Testo, Volkswagen and XSens. Dr. Köser obtained his PhD (Dr.-Ing.) from the University of Kiel for his thesis "Geometric Estimation with Local Affine Frames and Free-form Surfaces" (dissertation award 2009) and his research was awarded several prizes, most notably the DAGM 2011 Main Prize (German Association for Pattern Recognition). He also serves as a member of the program commitee for the major international conferences on computer vision and in robotics and teaches the classes "3D Photography" and the "Computer Vision Lab" at ETH Zurich.
26.02.13	BSc Defense: Evolving Locomotion for a Humanoid Robot	Heye Vöcking
The purpose of the bachelor thesis was the evolution of artificial neural networks to develop locomotion for the Darwin robot. The DARwIn-OP, in the future referred to as Darwin, is a 50cm tall humanoid robot which is used, amongst others, in the RoboCup for robot soccer. The main problem in robot soccer is a robust and fast locomotion. Since a humanoid robot is a very complex system, it is difficult to handcraft a robust walking algorithm. Furthermore it needs to be adjusted by hand if the floor or the weight distribution of the robot itself is changed. An approach to automatically develop a walking algorithm is biologically in- spired evolution in which a gradual improvement of individual solutions can be achieved over many generations. The advantage of evolution is that the problem itself does not need to be solved by hand, but only needs a so-called fitness function, which rates the quality of a solution. In addition, the execution of the required experiments can be parallelized very well. But evolution has also certain difficulties which must be overcome. For example, tens of thousands of experiments need to be performed in order to find a good solution in a complex search space. We have developed a system, which exploits the concurrency offered by evo- lution and performs the experiments in the Webots simulator on many different computers in parallel, therefore finding solutions in a reasonable amount of time. It uses a realistic model of the Darwin, which makes a transfer of a good solution to the real Darwin realistic. This would be beneficial for further support of the RoboCup soccer team, which can achieve even better results in future championships. The focus of this work is on oscillating pattern generation within the artificial neural network (ANN) and by external sources, as well as the impact of neurons in the hidden layer of the ANN. The experiments have shown, that an ANN is able to generate a pattern without the use of a central pattern generator. Furthermore the results indicate that at least 4 neurons in the hidden layer have to be present to evolve a locomotion.
12.02.13	Adaboost and Hopfield Neural Networks on Different Image Representations for Robust Face Detection	Nils Meins
Face detection is an active research area com- prising the fields of computer vision, machine learning and intelligent robotics. However, this area is still challenging due to many problems arising from image processing and the further steps necessary for the detection process. In this work we focus on Hopfield Neural Network (HNN) and ensemble learning. It extends our recent work by two components: the simultaneous usage of different image representations and combinations as well as variations in the training procedure. Using the HNN within an ensemble achieves high detection rates but shows no increase in false detection rates, as is commonly the case. We present our experimental setup and investigate the robustness of our architecture. Our results indicate, that with the presented methods the face detection system is flexible regarding varying environmental conditions, leading to a higher robustness.
29.01.13	Asymmetric agency and social sentience in modeling situated communication for human-robot teaming	Dr.ir. Geert-Jan M. Kruijff (DFKI Saarbrücken)
The talk looks at situated collaboration between humans and robots, for example in complex situations like disaster response. This collaborative context gives rise to several issues. One, we need to start from the assumption of asymmetric agency: Humans and robots experience and understand the world differently. The asymmetry implies that symbols cannot be considered abstract types, embodying an objective truth. Instead, different agents employ different types in building up understanding, or more precisely they construct subjective judgments as proofs for why a particular type can be applied to some experience. This then gives rise to another issue, namely how these different actors could then arrive at some mutually shared understanding or ''common ground.'' We see this as the need to align judgments, and formally construct this as an alignment between (abductive) proofs over multi-agent beliefs and intentions. This places grounding meaning in context in a new light: Grounding meaning, between actors, is the alignment of judgments against subjective experience, within a social, situated context. The intentional aspects of social dynamics play thereby just as much a role as do beliefs. In a collaborative activity, meaning becomes socially and situationally construed referential content, with a semi-persistent nature that is subject both to social dynamics (why the meaning is construed and used) and environment dynamics (what the meaning is construed for). This requires an actor, particularly a robot, to be explicitly aware of these social dynamics and its own role(s) in it. The talk captures this through the notion of social sentience.
15.01.13	A Predictive Network Architecture for a Robust and Smooth Robot Docking Behavior	Junpei Zhong
As the brain attempts to compensate the latency of its sensorimotor cycle, a biologically-inspired learning model of predictive sensorimotor integration is proposed to compensate that latency in a robotic system. In this model, an Elman network is employed for sensory prediction and a Continuous Actor-Critic Learning Automaton (CACLA) trained for smooth action generation. For a robot approaching experiment, this architecture also improves the smoothness of the robot's trajectory and therefore brings a faster and more accurate approaching behavior. Experimental results prove that this novel architecture improves the approaching success-rate and decreases the approaching time.
18.12.12	Real-time Object Detection Algorithm in Brain-Inspired Vision System Technology	Dr. Yasuomi Daishin Sato (Kitakyushu, Japan)
In order to make robot vision closer to human vision, we have to further progress brain-inspired vision system technology, consisting of three different subunits of "recognition," "eye movement control" and "memory." We here present current development of the recognition subunit. We focus on improving the Elastic Graph Matching (EGM) for object detection. The EGM is often regarded to be one of neural models for visual object recognition that are successfully applied into industries, but still has unsolved problems for the real-time processing. To solve such problems, we suggest different scale similarity map and then develop the real time EGM-based object detection, comparable to the existing computer vision technology. We will present briefly a binocular movement control model, which will be integrated with the EGM-based object detection for realizing an object tacking in the near future.
04.12.12	Vision-based Human Tracking for Novel Behavior Detection	German Parisi
Vision-based human action tracking and novel behavior detection represent a cornerstone of data-driven scene analysis. I will present a learning framework for real-time novel behavior detection from depth maps video sequences. Based on neurocognitive evidence, the framework prioritizes the extraction of spatio-temporal properties, such as location and direction of motion. Unlike other related work discussed in the literature, my framework does not require domain- specific prior knowledge on performed actions. Spatio-temporal patterns are processed to obtain suitable representations for unsupervised clustering of the extracted data. In the field of novelty detection, self-organizing maps (SOM) have shown to be a prominent method for learning patterns from training data. Techniques and experimental results on SOM-based novelty detection will be presented and discussed.
27.11.12	A topographic Elman network	Dr. Cornelius Weber
Neural fields are a popular method of neural information processing, because of their intuitive topographic representation of neural activity. However, their centre-surround competitive connectivity is usually hand-designed or trained with the wanted activity pattern applied directly to all units. I will present an Elman network in which the hidden layer learns to become like a neural field, while direct feedback is given only to the output layer. The example network has a one-dimensional "chain" of neurons on each, input, hidden and output layer. A static/moving hill of neural activation on the input layer shall be reconstructed/predicted by the network on the output layer. Topographic weight initialisation and certain constraints on weights and activations make the network learn the desired centre-surround connectivity on the hidden layer, and a topographic static/moving activity hill emerges there, too. The network also learns to predict an activity hill that can be either static or move in either of two directions. Then the connectivity becomes more complicated and the network's mechanisms deserve further investigation.
13.11.12	Model-based Hand Pose Estimation for Gesture Recognition Using Kinect	Dennis Hamester
Estimating a detailed human hand configuration in 3D space is a challenging task but has diverse interesting applications in the field of Human-Computer Interaction. These include natural, intuitive user interfaces, learning by demonstration and speech-replacing gestures, either static or dynamic. The availability of cheap depth sensors in the recent years (Kinect, Xtion) offers new opportunities to approach vision-based gesture recognition. One possibility for a gesture recognition system is to provide a full hand description for correct pose estimation. Here, a discrete set of possible poses is defined for searching a continuous space for optimal solutions. One approach to this task was proposed by Oikonomidis et al. (2012). They reformulate the problem as an optimization problem, which is then solved using standard Particle Swarm Optimization. But the proposed method suffers from two drawbacks: (1) it uses only basic preprocessing, resulting in uncertainties during the early stages and (2) a high computational cost effectively preventing real-time applications. Thus, the goal of this Master thesis is to improve upon their work. Early image preprocessing will be replaced by previous work done at WTM. To achieve real-time performance, certain key areas in their theoretical foundations have been identified, which will be revised. Based on a working hand pose estimation, a system for static gesture recognition will be implemented. This system will be used to evaluate the advantages and limitations of the refined methods.
30.10.12	Bioinspired Simultaneous Localization and Mapping: Using ratSLAM with the humanoid robot NAO	Stefan Müller
Simultaneouslocalization and mapping (SLAM) has always been one major challenge in robotics. Variations in environment and local conditions forcerobots to behave more robustly than possible with hard-coded behaviors. Over time many different approaches dealing with this problem have been done. Most of them use data collected from odometryor landmarks and pre-generated maps of the environment. Increased computational power has led to powerful SLAM algorithms, which deal with real-world sensory data using Kalman filters and/or expectation-maximization (EM) algorithms. Therefore SLAM is well investigated and developed. In 2004 a new approach in SLAM has been introduced: ratSLAM. It is based on a computational model of specific parts of the rat hippocampus, which maintains the rodent's believed location in the world. In this model, landmark sensing and odometric information are used by a competitive attractor network to form a topologically consistent representation with largely Cartesian properties. For now ratSLAM has only been used and tested on wheeled robots. In this talk I will give an overview of the current state-of-the-art. Furthermore, problems that appear when bringing ratSLAM to the next level by integration in a humanoid robot are pointed out and possible solutions are given.
16.10.12	Modelling Locomotion on the NAO humanoid and Explore its Learning Capability	Gauss Lee
Modelling locomotion is a very interesting but challenging research area in robotics, especially by using a bio-inspired neuronal controllers. Recently, the approaches to modelling motor abilities on robots have developed a lot by systematically applying CPGs (central pattern generators) to different sorts of robotic platforms including both quadrupedal and bipedal robots. In this talk, in the first part, a relatively new CPG architecture (an integration of oscillatory recurrent neural network and feed-forward network ) is proposed to represent the parts controlling the rhythmic movement like crawling and walking. The biological grounds for this architecture is also introduced. In the second part, an interesting learning architecture, CPG-ACTOR-CRTIC, is introduced to involve the learning mechanisms and proffers potential meanings for exploring motor learning capabilities of infants.
11.10.12	Spot Talk: Zusammenführung verschiedener lernfähiger Systeme am Nao	Pascal Schröter
09.10.12	BSc Defense: Reccurent Neural Networks for Speech Recognition on Humanoid Robots	Sebastian Tschörner
There are different approaches for Automatic Speech Recognition systems to map the audio input into a string of phones which will then be processed further. Most systems use Hidden Markov Models to build up a model of each phone. Lately the use of neural networks of different kinds has grown. The advantage of these networks is the greater robustness against noise. A drawback is the problem of learning long time dependencies. The error between the said phone and the recognised phone is used to train the networks. For long lasting phones this error vanish or blow up. With the proposition of a new type of artificial neurons, called 'Long Short Term Memory', the problem of vanishing errors is solved and there were studies on neural networks for Automatic Speech Recognition containing this new type of neuron. For speech recognition on robots there is the problem of strong noise coming from the robots fans or motors. Based on the greater noise robustness of neural networks we examine the advantages of Long Short Term Memory neural networks compared to Hidden Markov Model on an Automatic Speech Recognition task of audio data, recorded on a humanoid robot of type iCub. We show that the predominant noise of the robot prevents any correct recognition results on systems based on Hidden Markov Models, as well as on systems based on neural networks.
02.10.12	Master project: Human-Robot-Interaction - Final Presentation	Pascal Folleher, Peer Springstübe, Johannes Twiefel
In this final project presentation, a hybrid learning system is introduced as the result of the human-robot interaction (HRI) project. The system uses a Nao robot for interaction, which can be controlled by speech. It is able to learn presented objects and recognize them afterwards. Furthermore, the system can differentiate between learned objects and point at the object asked for. A hybrid system is used, which consists of region-of-interest (ROI) extraction, different feature extractors and a neural network for each feature extractor. The outputs of each network are used for ensemble learning. In a performance test, the system recognizes 100 percent of the learned objects. The system consists of many loosely coupled modules which are connected via a robot operating system (ROS) to be able to distribute the computations to several machines. The project presentation includes a talk and an interactive demonstration and provides room for interesting discussions.

Summer Semester 2012

Summer Semester 2012
18.09.12	Working@WTM: Presentations of Current Projects in Cognitive Robotics	WTM Student Assistents
Presentation Titles Pascal Schroeter (Supervisor: Cornelius Weber): "Zusammenführung verschiedener lernfähiger Systeme am Nao" Marc Bestmann (Supervisor: Sven Magg): "Darwin - A framework for evolution and execution of neural networks on robots" Heye Vöcking (Supervisor: Stefan Heinrich): "Recognizing Objects in Images for NAO Using ROS and OpenCV" Mario Mohr (Supervisor: Nicolas Navarro): "A simple implementation of bio-inspired auditory processing"
28.08.12	Modelling of the Contribution of Gesture to Learning to Count Using the iCub Humanoid Robot	Marek Rucinski (Plymouth, UK)
In my talk, I am going to present a robotic connectionist model of the contribution of gesture to learning to count. The model is based on a recurrent artificial neural network architecture, connected with the iCub humanoid robot. I compare counting performance of the model in two conditions: without and with gesture, showing that the proprioceptive signal connected with gesture indeed carries information which may be exploited when learning to count. I also compare the behaviour of the model with that of human children, based on current experimental psychology literature. Finally, I present the prospects for future work, analysing the ways in which the model can be extended.
24.07.12	Energy Efficient Trajectory Planning for Object Grasping by the NAO Robot	Abhronil Sengupta
The presentation focuses on the trajectory planning task for object grasping with reference to the NAO robot. The intermediate trajectory for movement from a specific initial arm configuration to a final one is modeled using Bezier curves and the parameters of the curve are optimally determined using energy minimization criteria. The optimization is performed using a bio-inspired evolutionary algorithm, namely Differential Evolution. The talk will focus on the algorithmic aspects of the trajectory planner along with discussions on the feasibility for online implementation on the NAO robot.
10.07.12	Hybrid Neural Ensembles for Face Detection	Nils Meins
The success of an ensemble of classifiers depends on the diversity of the underlying features. If a classifier can address more different aspects of the analyzed objects, this allows to improve an ensemble. In this talk I will present an ensemble using as classifier members a Hopfield Neural Network (HNN) that uses Haar-like features as an input template. I will analyse the HNN as the only classifier type and also combine it with threshold classifiers to a hybrid neural ensemble, so that the resulting ensemble contains --as members-- threshold and neural classifiers. This ensemble architecture is evaluated for the domain of face detection. I will show that a HNN that uses summed pixel intensities as input for the classification has the ability to improve the performance of an ensemble.
03.07.12	BSc Proposal: Rekurrente Neuronale Netze zur Spracherkennung auf humanoiden Robotern	Sebastian Tschörner
26.06.12	Manipulation von Trainingsbildern zur Verbesserung eines Gesichtsklassifikators	Darko Rakic
In dieser Arbeit wird ein mehrstufiges Objekterkennungssystem am Beispiel von Gesichtern vorgestellt, die von einer humanoiden Roboterplattform aufgenommen wurden. Vier Stufen kennzeichnen das Verfahren, ein Farbfilter, eine Normalverteilung, eine "Support Vector Machine" (SVM) und eine einfache Methode, die sich der Hauptkomponentenanalyse und einem Abstandsmaß bedient, das bestimmt, ob es sich um ein Gesicht handelt oder nicht. Dieses Verfahren beinhaltet nur eine Personendetektion. Zu einem Gesicht wird ein Merkmalsvektor erzeugt, der verschiedenen Eigenschaften, wie dem Bild als solchem, Histogramme, Gradienten und weitere inne hat. In den ersten beiden Stufen des Verfahrens findet mittels Diskriminanzanalyse eine Klassifikation in die Kategorien: Gesicht, Nicht-Gesicht oder eventuelles Gesicht statt. Die SVM erlernt eine maximal trennende Hyperebene aus den Trainingsdaten und entscheidet bei einem Testbild auf welcher Seite der Hyperebene die Projektion landet, um so die Klassenzugehörigkeit zu bestimmen. Abschließend wird von den übriggebliebenen Kandidaten mit Hilfe von "Eigengesichtern" (eigenfaces) ein Abstandsmaß berechnet, um zu bestimmen, ob es sich um ein Gesicht handelt. Um die Resultate zu verbessern, soll die Trainingsmenge verändert werden. Den Fragen, wie wichtig ist die Wahl der Trainingsmenge, wie kann die Trainingsmenge verändert werden um bessere Ergebnisse zu erzielen, welche Vergrößerungsmethoden bringen etwas, welche Rolle spielt die Methodik bei der Erstellung, wird im Rahmen dieser Untersuchung nachgegangen. Methoden zur Veränderung der Trainingsmenge, wie Streckung und Stauchung, werden angewandt, um eine Verbesserung der Trainingsdaten zu erzeugen. Untersuchungen haben gezeigt, dass die getesteten Methoden, Verbesserungen bei der Klassifikation erbracht haben.
19.06.12	SOM-based Text Categorization	Stefan Thomas
19.06.12	Speech Recognition for Human-Robot Interaction	Stefan Thomas, Stefan Heinrich
In this workshop, I will present an overview on automatic speech recognition (ASR). The main components like language models, dictionaries, grammars, search managers and acoustic models will be briefly explained. Then I will show how ASR is implemented in three different open- source frameworks and how they can be used: First, I will present the "Julius" engine, how to install it on a Linux-based systems and how the different components are defined by a configuration. Second, I will introduce the the RWTH automatic speech recognition (RASR) system. Third, I will show how the CMU Sphinx tool-kit can be used on the NAO robot and how some shell scripts can be used to generate the different components. Finally, I will have a closer look at the acoustic model training with sphinxtrain. Sources: http://www-i6.informatik.rwth-aachen.de/rwth-asr/ http://cmusphinx.sourceforge.net/
05.06.12	Sound Source Localization: Binaural Cues Integration with Bayesian Inference	Jorge D. Chacon
I will present the performance results of a spiking neural network for binaural sound source localisation (SSL) using the Nao robot. In order to explore the effect of the robot's embodiment on the sound signal, the cues used for SSL were the interaural time (ITD) and level (ILD) differences. As biological plausibility was pursued in our implementation, ITDs and ILDs were extracted with models of the medial superior olive (MSO) and the lateral superior olive (LSO). The influence of the robot's body on the sound signal was quantified through a statistical analysis of the MSO and LSO response of sounds produced at different azimuth angles. These descriptive values proved essential to achieve accurate SSL. Finally, the outputs from the MSO and LSO were integrated in a model of the inferior colliculus (IC). The connection weights between the MSO and LSO neurons to the IC neurons were estimated using Bayesian inference. This inference process allowed the algorithm to perform robustly under high levels of ego-noise. Keywords: Bio-mimetic binaural sound source localisation, Spiking neural networks, Bayesian inference.
29.05.12	Image Processing for Gesture Recognition Using the Kinect	Dennis Hamester, Doreen Jirak
22.05.12	A SOM-Based Model for Multi-Sensory Integration in the Superior Colliculus	Johannes Bauer
We present an algorithm based on the self-organizing map (SOM) which models multi-sensory integration as realized by the superior colliculus (SC). Our algorithm differs from other algorithms for multi-sensory integration in that it learns mappings between modalities' coordinate systems, it learns their respective reliabilities for different points in space, and uses mappings and reliabilities to perform cue integration. It does this in only one learning phase without supervision and such that calculations and data structures are local to individual neurons. Our simulations indicate that our algorithm can learn near-optimal integration of input from noisy sensory modalities.
15.05.12	Adaptive Learning and Language Understanding in an Omni-Modal MTRNN	Stefan Heinrich
With the focus on CTRNN as a method for grounding utterances in natural language in omni-modal experiences, recent findings in training a state-of the art CTRNN will be presented and a novel idea, in terms of extending such a network to take into account all available modalities, will be discussed. The talk will be divided into two parts: 1) Recent research has revealed that hierarchical linguistic structures can emerge in a recurrent neural network with a sufficient number of delayed context layers. As a representative of this type of network the Multiple Timescale Recurrent Neural Network (MTRNN) has been proposed for recognising and generating known as well as unknown linguistic utterances. However the training of utterances performed in other approaches demands a high training effort. To overcome this issue, we analyse the training demands and propose a robust mechanism for adaptive learning rates and internal states to speed up the training process substantially. In addition we compare the generalisation of the network for the adaptive mechanism as well as the standard fixed learning rates finding at least equal capabilities. 2) The MTRNN revealed interesting capabilities but also limitations: To generalise and therefore generate untrained utterances, the network needs to "recognise" the initial state value for some context units. In addition no integration of multi- or omni-modal experience have been accomplished so far. As part of current and future work an extended MTRNN will be discussed that overcomes these issues. For this Oberseminar also some preliminary results of generalising utterances will be outlined.
08.05.12	Image-based Place Models for Geographic Recommendations	Christoph Schlieder
Classical recommender systems solve an information filtering task. They suggest a data object, typically a text document describing some product or service, that is likely to be relevant to the user based upon her or his previous choices. A geographic recommender system recommends items from a library of geo-referenced objects. In multi-object recommendation, collections of items are suggested which should consist of somehow similar exemplars but, at the same time, must show variability. A geographic multi-object recommender suggests, for instance, a list of cities to visit or a slide show of images illustrating a certain place. The talk gives an introduction into the research field and presents the approach taken by our Tripost multi-object recommender for selecting a small set of geo-referenced images of a touristic location. Motivated by this research, we ask how different conceptualizations of a city can be identified in web-based image collections. The method is illustrated by analyzing image data set from the cities of Amsterdam, Bamberg, Cardiff, and Dublin. In addition, the talk presents a recently collected data set of GPS tracks and geo-referenced photographs taken from visitors of the Old Town of Bamberg. Our data suggests that differences in the frequency of spatial choices need to be taken into account when building place models for recommender systems.
08.05.12	Gesture Recognition with Conditional Random Fields	Doreen Jirak
Gesture recognition provides a natural interface for Human-Robot-Interaction. Until now, several important methods have been exploited in order to correctly recognise and classify gestures, for instance, Neural Networks and Hidden Markov Models (HMM). However, most techniques rely on additional technical devices to ease data capturing and classification process. Vision-based gesture recognition in contrast is most user-comfortable, but also the most challenging approach to the recognition problem. In this talk we will motivate usage of statistical models like Conditional Random Fields (CRF)[1]. A CRF is a derivative of the well-known Markov Networks, which have been extensively used in computer vision problems like image segmentation and object recognition. As CRFs are probabilistic graphical models, it can handle missing or erroneous data by reasoning under uncertainty, thus further motivating application in the field of Artificial Intelligence. Recent studies showed good results compared to state-of-the-art models like HMM, so we will extend this formalism to the gesture recognition problem concerning commanding gestures, which will be further used in a human-robot-scenario. Future work on graphical models in terms of different topologies to model sub-gestures and so to enhance the prediction and therefore recognition accuracy will be discussed. [1] Klinger, R. & Tomanek, K. Classical Probabilistic Models and Conditional Random Fields, Technical Report, Department of Computer Science, Dortmund University of Technology, 2007
17.04.12	Robots&Co: Knowledge Technology Lab Visit for HIForum e.V.	Doreen Jirak, Erik Strahl

Winter Semester 2011/2012

Winter Semester 2011/2012
28.03.12	Applying the functional actor-model to heterogenous distributed systems in particular the sensors and actuators of a Nao robot.	Prof. Esser (HAW) (Nils Meins)
27.03.12	Learning Features and Transformations with a Predictive Horizontal Product Model	Junpei Zhong
According to the theory of two parallel visual pathways[1], the 'dorsal pathway' encodes spatial information, invariant of stimulus-specific properties, while the 'ventral pathway' encodes object feature identity, invariant of positions and sizes. In this talk, I will propose a novel biologically-inspired neural architecture constitutes of a horizontal product in a multi-layer network in order to acquire this property of the visual system. Experiments show that, learned in an unsupervised manner, two sets of hidden units represent cells in the ventral and dorsal pathways, respectively. Moreover, the fast responding dorsal-like cells have become direction selective while the slow responding ventral-like cells encode 'object identity', which are analogous to the recording of different populations of complex cells in V1 [2]. [1] M. Mishkin, L. Ungerleider, and K. Macko, "Object vision and spatial vision: Two cortical pathways," Trends in Neurosciences, vol. 6, pp. 414-417, 1983. [2] N. Priebe, S. Lisberger, and J. Movshon. Tuning for spatiotemporal frequency and speed in directionally selective neurons of macaque striate cortex. Journal of Neuroscience, 26(11):2941-2950, 2006.
13.03.12	A Neural Approach for Robot Navigation based on Cognitive Map Learning	Wenjie Yan
In this talk we present a neural network architecture for a robot learning new navigation behavior by observing a human?s movement in a room. While indoor robot navigation is challenging due to the high complexity of real environments and the possible dynamic changes in a room, a human can explore a room easily without any collisions. We therefore propose a neural network that builds up a neural memory for spatial representations and path planning using a person?s movements as observed from a ceiling-mounted camera. Based on the human?s motion, the robot learns a map that is used for path planning and motor-action codings. We evaluate our model with a detailed case study and show that the robot navigates effectively.
02.03.12	RoboCup@Home: Benchmarking Social Robots	Dr. Tijn van der Zant (Groningen) (Jorge D. Chacon)
RoboCup@Home is the largest domestic service robot benchmark and arguably also the most complex for any fully autonomous robot to participate in. The benchmark is set up as a competition, where the robots can do a series of tests in an apartment. The tests become more and more complex over time and usually have the form of a story, such as that the robot has to behave as a butler, or that somebody lost an item and the robot has to find it. In the benchmarks we test for all the aspects that we deem necessary to create a social robot. Since nobody exactly knows what it takes to create a social robot, even our benchmarks are being put under scientific scrutiny in order to improve on them. We avoid the trap of moving towards some local optimum of robotics bodies and minds by changing the benchmarks every two years. In order to keep track of the improvements we have developed methodologies for statistical benchmarking. This is needed because the robots change, but also because we under-specify many aspects of the competition, both for the tests and for the environments, since the real world is also under-specified. We aim to create a general purpose social robot that can be put to use in many situations. The robot has to be able to behave itself under highly uncertain and dynamic situations. Therefor we take the robots outside of the apartment and into the shopping mall. We started with a very successful 'shopping' test. This year we plan to take the robots for a walk and take the elevator to get to the shopping mall. Since the robot has to be fully autonomous, there is a large focus on artificial intelligence. We increased the focus on AI by introducing the first test ever that focuses on the understanding of the robots with regards to their environment. This test is so unstructured that we have a computer program that generates the problems that the robot has to solve, of course within certain boundaries.
28.02.12	BSc Defense: Grounding of Language in Sensorimotor World Interaction of a Humanoid Robot, using Neural Networks	Thomas Christian Blank (TU Harburg)
Symbol systems provide techniques to emulate processes of the human mind, such as abstract reasoning, natural language and problem solving, in artificial intelligent agents. To purposefully employ these techniques in artificial agents, however, these symbols need to be grounded, i.e. the agent must be able to connect a symbol to its real world referent and vice versa. Marocco et al. (2010) [1] proposed a method that uses neural networks to ground symbols in the sensorimotoric experience of a humanoid robot. After training, the networks were capable to identify three different situations and utter a corresponding proto-word. We tried to employ the proposed method on a different humanoid robot, NAO. In the process, we identified and investigated the parameters that in uence the stability of the proposed method and the quality of its results. [1] Davide Marocco, Angelo Cangelosi, Kerstin Fischer, and Tony Belpaeme. Grounding action words in the sensorimotor interaction with the world: experiments with a simulated icub humanoid robot. Frontiers in Neurorobotics, 4:1-15, 2010.
21.02.12	Smartboard Introduction	Dataport Company
14.02.12	BSc Defense: Neuronal Networks for Bipedal Walking of a NAO robot	Patrick Schmolke
Many scientific works have shown the usefulness of artificial neuronal network based robot controllers. The choice for simple robots has been object approach, for example approximating the position towards geometrical line patterns, while for humanoid robots it has been the bipedal walking problem. As the latter is still a challenge, the approach of finding a neuronal network based controller by means of an evolutionary algorithm was topic of the bachelor thesis. Although none of the found controllers reached expectations, the process itself seemed to work properly. The results of the project including the approach and an analysis of known issues will be presented.
07.02.12	A neurocomputational amygdala model of auditory fear conditioning: A hybrid system approach	Nicolas Navarro
In this work, we present a neurocomputational model for auditory-cue fear acquisition. Computational fear conditioning has experienced a growing interest over the last few years, because of its possible implications for clinical treatment of human anxiety disorders. Fear learning is a simple and robust learning paradigm that involves sensory and motor aspects [1]. We argue that a deeper study of the mechanisms underlying fear circuits in the brain will contribute not only to a better conceptual understanding of neural fear processing in general but also to the development of safer robots. Our specific model is based on LeDoux's dual-route hypothesis of fear and also dopamine modulated Pavlovian conditioning. Our reservoir approach is capable of learning the temporal relationship between auditory sensory cues and an aversive or appetitive stimulus. The model was tested as a neural network simulation but it was designed to be used with minor modifications on a robotic platform. Applications of this learning mechanism may include artificial self-protective systems, autonomous battery recharging [2] or detection of emergency situations. [1] H.-C. Pape and D. Pare, "Plastic synaptic networks of the amygdala for the acquisition, expression, and extinction of conditioned fear, Physiological reviews, vol. 90, no. 2, pp. 419-463, 2010. [2] N. Navarro, C. Weber, and S. Wermter, "Real-world reinforcement learning for autonomous humanoid robot charging in a home environment," in Proceedings of the Annual Conference Towards Autonomous Robotic Systems (TAROS), ser. Lecture Notes in Computer Science, R. Groß, L. Alboul, C. Melhuish, M. Witkowski, T. Prescott, and J. Penders, Eds., vol. 6856. Berlin, Heidelberg: Springer-Verlag, 2011, pp. 231-240.
24.01.12	Efficiency through differentiation in homogeneous and heterogeneous groups of agents	Sven Magg
Foraging tasks are commonly used in homogeneous or heterogeneous swarm experiments since the principles behind foraging are similar to other tasks (e.g. collecting, harvesting, search, etc). In this talk I present work done as part of my PhD thesis, in which I've extended the results of a previous robot foraging experiment by Labella. Labella had shown that a group of agents, adapting their activity according to their individual success, has a consistently higher efficiency in terms of collected objects per time spent outside their nest. By analysing the foraging behaviour of homogeneous and heterogeneous groups using his adaptation algorithm, I've been able to identify the reasons responsible for the efficiency increase and show that - even for my very abstract multi-agent scenario - efficiency does not necessarily increase when adapting agent activity to individual success rates.
20.12.11	BSc Defense: Geräuschquellenlokalisation mit einem humanoiden Roboter	Robert Keßler
Um eine solide Interaktion mit einem menschlichen Benutzer möglich zu machen, muss jeder Roboter in Zukunft ein umfangreiches auditorisches System bereitstellen. Neben der Spracherkennung ist dabei auch die Geräuschquellenlokalisation ein notwendiger Bestandteil. Diese Arbeit zeigt die Realisierung eines Lokalisierungsverfahrens in der horizontalen Ebene, mit einem humanoidem Roboter unter Nutzung von Cross Correlation und Künstlichen Neuronalen Netzen. Der Roboter soll 24 verschiedene Richtungen am vollen 360 Grad Kreis, mit einem jeweiligen Abstand von 15 Grad, unterscheiden. Dies gelingt dem vorgestellten Verfahren mit einer Genauigkeit von bis zu 92%. Dabei treten die fehlerhaften Klassifikationen hauptsächlich zwischen benachbarten Richtungen auf.
13.12.11	Emergence of neural Schemata during Intercepting a moving Target	Prof. Andreas Fleischer (Nils Meins)
In principle biological systems have to reduce complexity of possible states to obtain an efficient functional structure. These states may be different in nature. One may consider the protein space of a molecule, interactions between molecules in a cell, interactions in a neural network or interactions between biological agents. During the evolutionary process a functional structure has to develop a sufficient fitness or it will end. A genetically determined protein may appear to evolve slowly. However, the neurophysiological system of an agent has to adapt to biological relevant characteristics of the information flow in order to be able to react immediately on demands of the environment. Thus, it is a challenging question, what these different levels of building a functional adaptive structure have in common. An attempt will be put forward about combining schema theory with the approach of reaction-diffusion systems. Experimental results of movement control are evaluated on the basis of such a model. Reaction-diffusion systems provide a rich variety of activity patterns. Different types will be discussed. How do biological patterns emerge and how does the behaviour of an agent lead to a schema. The protein space can be considered to be like a picture, each possible protein and each pixel to represent a point in a high dimensional space, respectively. Such a picture may be disturbed or noisy. However, if besides others a certain picture is stored in an association matrix a recurrent attractor net may reconstruct the original picture. However, reaching a stable attractor point, that is a static picture, would be boring. On the basis of specific association matrices a recurrent system can be constructed which provides a sequence of consecutive states. This leads to the hypothesis that complex movements are based on such consecutive states of motion primitives. From a modeling point of view, it should be possible to develop a neural network which allows for implementing a schema, i. e. a motor engram, which interacts with feedback control in an iterative manner. Neural population coding is supposed to represent a suitable modeling approach. Considering two interacting neural layers of a reaction-diffusion system they may generate an attractor state and a moving ‘bubble’, respectively. The bubble running along a motor engram activates a set of muscles and causes the arm to move towards a target. Modeling results are evaluated against the background of experimental data obtained from hand movements. They support the hypothesis of a motor schema. It determines the trajectory of intercepting a moving target. It comes up, that in conjunction with the motor schema the transition probability of the bubble elicits a new motion primitive and its direction during the next step. The Gestalt of the movement trajectory is determined by a schema which is adapted to the requirements of the task.
23.11.11	Building Robust and Adaptive Predictive Systems	Prof. Bogdan Gabrys (Prof. Stefan Wermter)
Moving away from engineering maxim of "simple is beautiful" towards biological statement of "complexity is not a problem". Automatic classification or prediction model building directly from data with a minimal or no human supervision is already absolutely crucial in order to stay competitive and maximally exploit the data in quickly changing business environments. However, the methodology for ensuring that created predictive models and systems are as good as possible should be in place before using them with confidence. No human supervision in model building or operation also implies that one should use powerful enough techniques which: a) can learn the data to any degree of accuracy and b) be able to adapt to changing environments. In order to address these challenging issues and following various inspirations coming from biology coupled with current engineering practices, we propose a major departure from the standard ways of building adaptive, intelligent predictive systems and moving somewhat away from the engineering maxim of "simple is beautiful" to biological statement of "complexity is not a problem" by utilising the biological metaphors of redundant but complementary pathways, interconnected cyclic processes, models that can be created as well as destroyed in easy way, batteries of sensors in form of pools of complementary approaches, hierarchical organisation of constantly optimised and adaptable components. The results of extensive testing for many different benchmark problems and various snapshots of interesting results covering the last decade of our research into predictive modelling will be shown throughout the presentation and a number of challenging real world problems will be discussed in the context of the main goals of a major EU INFER project which we currently coordinate.
22.11.11	PhD Proposal Meeting: Towards object identification and robot orientation based on multi-modal alignment and localisation	Johannes Bauer Jorge D. Chacon
07.11.11	Development of Coordinated Eye and Head Movements	Dr. Cornelius Weber
Various optimality principles have been proposed to explain the characteristics of coordinated eye and head movements during visual orienting behavior. However, the neural substrate of the underlying computations has been left unspecified. At the same time, researchers have suggested several neural models to underly the generation of saccades, but these do not include online learning as a mechanism of optimization. Here, we suggest an open-loop neural controller with a local adaptation mechanism that minimizes a proposed cost function. Simulations show that the characteristics of coordinated eye and head movements generated by this model match the experimental data in many aspects, including the relationship between amplitude, duration and peak velocity in head-restrained conditions, and the relative contribution of eye and head to the total gaze shift in head-free conditions. Our model brings together an optimality principle, a neural architecture, and a local learning mechanism into a unified control scheme for coordinated eye and head movements. The model may well generalise to further open-loop controlled movements, such as fast reaching movements.
25.10.11	Ein Frage Antwort System fuer mathematische Textaufgaben	Christian Liguda (Bielefeld)
Eine Grundschule hat bei einem Wettbewerb 30 Tischtennisschläger gewonnen. 2 weitere spendet ein Lehrer. Zu jedem Schläger gehören 3 Bälle. Alles soll an die 8 Klassen verteilt werden. Wie viele Schläger und Bälle bekommt jede Klasse ?''. Auch wenn die durch eine solche Aufgabe gegebenen Rechnungen durch Computeralgebra Programme lösbar sind, so gibt es nur wenige Programme, welche auch Textaufgaben bearbeiten können. In dem Vortrag wird zunächst ein sehr kurzer Überblick über diese bisherigen Systeme gegeben. Da die meisten neueren Programme auf einem ähnlichen Ansatz basieren wird dieser, sowie einige damit verbundene Probleme, knapp dargestellt. Anschließend wird dann ein, im Rahmen einer Masterarbeit, entwickeltes System vorgestellt, welches in der Lage ist, Textaufgaben zu lösen. Hierbei wird insbesondere eine neue Form der Wissensrepräsentation eingeführt, um die Struktur einer Textaufgabe darzustellen, mit welcher sich mehrere Probleme vorangegangener Systeme lösen lassen. Des Weiteren wird auf die Transformation von Texten in ein entsprechendes Modell eingegangen. Abschließend wird eine kleine Evaluierung des Systems sowie weitere Ausblicke besprochen.
04.10.11	Programming NN with OpenCL	Stefan Heinrich
This short talk will provide an introduction to parallel programming and the parallel programming API OpenCL. Furthermore some experiences in using OpenCL for implementing a training algorithm including the possible speedups will be reviewed.
04.10.11	Impressions from the IROS2011	Stefan Heinrich, Junpei Zhong
In this interactive talk, a wrap up of the Cognitive Neuroscience Robotics workshop and the (largest) International Conference on Intelligent RObots and Systems 2011 will be given and discussed. Topics include recent developement in humanoid and assistive robotics and emerging novel research directions. Particulary interesting papers related to research topics in the group will be provided afterwards.

Summer Semester 2011

Summer Semester 2011
20.09.11	BSc Proposal: Grounding language in sensorimotor world interaction of a humanoid robot using neural networks.	Thomas Blank
In this talk a Bachelor project of implementing a model of robot-world interaction [1] on a NAO will be proposed: The robot is supposed to push an object - like a ball or a box - and observe the objects' action. At the same time it perceives a word and should learn to connect this word with the observed action (grounding). These grounded names will be learned with a recurrent neural network and should enable the robot to classify an observation correctly later. The main goal of this project is to get insight into the possibilities and limitations of this model, applied on a real NAO robot. [1] Davide Marocco, Angelo Cangelosi, Kerstin Fischer, and Tony Belpaeme. Grounding action words in the sensorimotor interaction with the world: Experiments with a simulated iCub humanoid robot. Frontiers in Neurorobotics,4(7), 2010.
16.09.11	Artificial evolution of body structure using webots: likely problems and solutions	Willi Fiebranz
In this talk I would like to present my work on evolution of body structures using Webots. After a short overview about my psychologically motivated research in general, I will provide a closer look at what I did and how evolution was incorporated into Webots. A 3D body plan similar to stick figures was evolved together with a neural network controller, fully within the 3D Webots environment using Matlab for the controllers. I will discuss problems I've encountered and how those were (or could be) solved, which might be useful to everyone who will attempt to do similar simulations with dynamic changes in robot structure and parameters within Webots.
13.09.11	Robotic modelling of interactions between numbers and space	Marek Rucinski, (Prof. Stefan Wermter)
Mathematical cognition is a branch of cognitive science that aims at understanding various aspects of numerical processing in the brain. The field has been receiving an increasing attention during the past two decades, and due to the contributions from such fields as experimental psychology, neuroscience and computational modelling, substantial progress has been made. Another tool that could be used for the benefit of mathematical cognition studies is robotic modelling - which has a principal advantage over traditional computational modelling methods in a sense that it inherently takes into account the aspect of embodiment. This is in line with the growing amount of evidence that many aspects of mathematical cognition (including the number concept itself, despite it's pure and abstract appeal), are deeply rooted in our very basic interactions with the environment. In the work presented here we look at certain manifestations of an intrinsic connection between representations of number and space in the brain, namely the size and distance effects, SNARC effect and Posner-SNARC effect. Building on previous purely computational modelling approaches, we formulated a robotic model of such interactions. This enabled us to achieve two goals. First, we obtained a model that is much less arbitrary than previous computational models. Second, we showed how certain characteristic patterns of connections, crucial for reproducing aforementioned behavioural phenomena, can emerge as a result of a simple developmental process due to the robot morphology and environmental correlations.
26.08.11	Visual-Topological Mapping: An approach for indoor robot navigation	Jorge D. Chacon
Using visual features for robotic navigation in natural and changing environments remains to be an unsolved problem. This paper describes the use of nearest neighbor algorithm on salient visual features for indoor mapping. The mapping procedure consists on defining an arbitrary number of vector fields -or paths- in a domestic environment, pointing towards a set of target locations. From these specified paths, the robot constantly estimates its most probable location, adjusts its orientation and moves forward a predefined distance, in order to converge to a target location. A SURF (Speeded Up Robust Features) implementation was used for object detection and scene recognition. For every location in a path, the perceived scene was represented with the set of detected objects, and their particular geometric properties relative to the robot's body. A first experiment confirmed that the deviation from the trained paths -due to odometric error- decreased as the number of trained intermediate-locations increased. A second experiment showed promising results when testing the effectiveness of nearest neighbor search for estimating the robot location. Finally in a third experiment, a satisfactory convergence rate of the robot towards predefined target locations, was observed under real world circumstances. Keywords: Topological robot-mapping, visual object-recognition, nearest neighbor, vector-field navigation.
23.08.11	CINACS: Discussions on computational mechanisms in the superior colliculus	Johannes Bauer
18.08.11	Learning Robot Skills for Everyday Manipulation Tasks	Dr. Freek Stulp (TU Berlin)
In most activities of daily living, related tasks are encountered over and over again. Countless times, we flip light switches, insert keys in locks, pour coffee, brush our teeth, etc. To exploit this regularity, humans reuse existing motor skills for recurring tasks. For robots, using a set of motor skills also drastically reduces the search space for control, and thus facilitates learning. Therefore, my main research goal is to leverage the advantages of the skill-centric approach to achieve autonomous robots that operate flexibly, robustly and safely in human environments. The specific topics I will consider in this presentation are: * Representing robotic skills with Dynamic Movement Primitives. * Optimizing the robustness, efficiency and safety of skills through model-free reinforcement learning in high-dimensional spaces. * Grounding symbolic knowledge about skills in observed behavior, and exploiting this knowledge to generate sequences of skills for more complex tasks. * Learning task-relevant perceptual features in face model fitting applications, and my current ideas on how this can/should be used in the context of skill-based robotics. Since assistive robots in clinical and household environments must be able to manipulate their environment, my main application domain for evaluating these methods is (mobile) manipulation.
16.08.11	CTRNNs, GasNets, and the Question of Evolvability	Sven Magg
After GasNets were introduced by Husbands et. al. in 1998, they were used for evolution of controllers in many different tasks - from pattern generator tasks to quadrupedal walking. The findings in these experiments were that GasNets can be evolved faster than the same controller type without gaseous connections. Other studies also compared GasNet controllers to other controllers using continuous time recurrent (CTRNN) or probabilistic (PNN) neural networks. In terms of evolvability (length of evolutionary run until a successful controller is found or also robustness of evolved controllers), GasNets always outperformed other network types or the GasNets without gas in these experiments. In this talk, I am going to introduce GasNets and present work done for my MSc thesis. In this research one possible reason for the evolutionary advantage - the propensity to form pattern generator sub-networks - of GasNets was investigated. It was shown that they were outperformed by CTRNNs on a simple shape discrimination task, in which those sub-networks should not lead to an advantage. Different variants of CTRNNs and GasNets were investigated to find the cause of the performance differences and additional results on pattern generator tasks were presented at the end which finally led to the rejection of the initial hypothesis.
12.07.11	Neural Ensembles	Nils Meins
In this talk we propose a Hopfield neural network (HNN) working on the basis of Haar feature for visual object detection and more specificially for frontal faces. We will compare our HNN model with the Threshold classifier and use of the Haar feature as Viola and Jones has proposed it. The motivation is to have more benefit from the feature itself with the ontop HNN Classifier instead of the threshold classifier. As a result we try to figure out inhowfar the HNN contributes to a more robust and precise ensemble. We will train our ensembles for comparison with the ensemble algorithm AdaBoost. We will show our results with frontal faces datasets.
05.07.11	Camera-based identification and localization of objects for flexible robots	Carsten Fries
The presentation presents a camera-based identification and localization of objects. In this regard, a unique 3d surface reconstruction is performed for each object, in which object-specific features are assigned to each surface point. The reconstruction is based on different perspective camera views. In these images features are sought which can be reliably detected in other views. This allows finding feature correspondences between images, thereby enabling the 3d reconstruction. The so-called 3d feature model (surface points + features) is used to identify an object in an arbitrary image and to determine its spatial position and orientation. Furthermore, an articulated robot performs mobilization activities. With the known spatial pose gripping and moving operations can be performed on objects.
28.06.11	Vision-based Gesture Recognition	Doreen Jirak
In the last decades, gesture recognition systems based on different input devices, for instance data-gloves, have been developed. Recent applications employ acceleration data derived from according wrist-watches or the Wiimote. Although these approaches have their advantages, they hinder the ease and naturalness for the user in the Human-Computer-Interaction (HCI). In contrast, vision-based systems provide an intuitive interface. However, major limitations are due to typical low-level vision problems like segmentation, illumination changes and self-occlusion, to name a few. Additionally, most algorithms do not meet realtime constraints necessary for online-processing of the gestures. This talk will give an introduction to gesture recognition in the context of HCI. Therefore, the most important developments will be introduced and an overview of the individual processing stages will be given. A major focus will be on the significant methods employed so far in this research area. Finally, a neuroscientifically inspired model will be presented, serving both for implementation and extensions. The theoretical implications will be discussed with the audience. The practical issues will be presented in a following talk (part II).
24.05.11	Recurrent Neural Networks with Parametric Bias - Theory & Application	Dr. Jens Kleesiek
This talk will be subdivided into two parts. After an introduction of the original Recurrent Neural Network with Parametric Bias (RNNPB) architecture an improved version (iRNNPB), with respect to stability, speed of convergence and accuracy is presented. The applied changes that lead to this boost in performance are elucidated and examples are given showing the intriguing capabilities of this recurrent neural network, including storage and retrieval, recognition and generalization of time series. In the second part the iRNNPB is applied in a real world robot scenario for action-driven classification of object categories. It will be shown how the architecture handles real world multi-modal sensorimotor data acquired from a humanoid robot and how it compares to canonical methods used for classification. Furthermore, the generalization potential concerning object attributes, its robustness against noise and how this can be used for de-noising of sensory channels is demonstrated.
17.05.11	Word and sentence processing in the brain - a project proposal	Stefan Heinrich
In the last decade the debate has been intensified if humans have a amodel system for language and higher cognition or if everything we think or talk about is grounded in the modalities of our sensory, sensori-motor or motor systems. Additionally we received neuroscientific evidence for the second view and valuable models of how language is likely to be distributed over the brain. Models, we have seen so far were restricted to single words on the one hand and to words which were closely related to specific motor actions and regions on the other. From the point of view of a computer scientist I will wrap up in short these research and the open chances to contribute. Furthermore I will present some new ideas about models, which capture the distribution of words in the brain that are related to different audio-video input and some higher concepts like traits or interrelationships of objects. I want to shape these ideas in a PhD project proposal and discuss the implications for application on cognitive robots.
10.05.11	BSc project proposal: Sound Source Localisation with the NAO	Robert Keßler
Im Bereich der heutigen Serviceroboter ergibt sich die Problematik, dass der gewöhnliche Anwender sich selten mit speziellen Eingabemethoden auskennt. Da bei zwischenmenschlicher Kommunikation Sprache als Hauptmedium genutzt wird, liegt es nahe auch die Mensch-Roboter-Interaktion (MRI) durch natürliche Sprache zu realisieren. Sei es als Serviceroboter in der Gastronomie, zu Hause oder als Rettungsroboter in Kriesengebieten - in allen Bereichen ist es für den Roboter wichtig, dass er nicht nur erfährt, welche Aufgabe ihm mittels Sprache gestellt wird, sondern auch woher die Information kommt. Beispielsweise soll ein Serviceroboter im Haushalt mit einfachen Mikrofonen festellen können woher ein Hilferuf kommt, um darauf entsprechend zu reagieren. Wenn der Roboter die Geräuschquelle gefunden hat, können andere Teilsysteme des Roboters weitergehende Informationen sammeln (Bsp. Videobilder). Durch welche Methoden eine Geräuschquelle im Raum georten werden kann ist hierbei die Hauptfragestellung, die bearbeitet werden soll. Die Motivation diese Fragestellung zu bearbeiten kommt daher, dass zwar schon die Grundlagen für einige Verfahren in den neunziger Jahren bekannt waren, den Robotern es jedoch an Rechenleistung fehlte. Heutzutage ist die Technik soweit, das auch autonome Roboter in der Lage sind die relativ intensiven Berechnungen durchzuführen. Dies spannt ein attraktives Anwendungsgebiet auf.
26.04.11	Learning ambient object categories for affordance prediction	Junpei Zhong
This talk is split into two parts. In the first part, features of the new NAO laser head are firstly introduced. With the help of this laser head, it is possible to accurately scan indoor environment and create the map from it. Secondly, some previous works about SLAM based on laser sensor will be presented. From the past decades, many researchers attempt to solve the problem of simultaneous localization and mapping, commonly called SLAM.Many papers use different ways to implement or optimize SLAM methods, but most of them concentrate a small part of SLAM. With representative methods of each part of SLAM, the motivation of this talk is to help ourselves understand SLAM better and to explore future research possibilities arising from SLAM.
19.04.11	Journal Club: Unsupervised Learning of Invariant Feature Hierarchies with Applications to Object Recognition	Wenjie Yan

Winter Semester 2010/2011

Winter Semester 2010/2011
29.03.11	KSERA Workshop week: Towards socially assistive robot that helps elderly people	Dr. Cornelius Weber, Nils Meins, Wenjie Yan
Our objective is to research into the foundations, representations and applications of intelligent systems. This research in computer science is focused on Knowledge Technology and Knowledge Management motivated by natural systems, e.g. biology, cognition and neuroscience. We want to study and exploit nature-inspired, in particular hybrid neural and symbolic representations, in order to build next-generation adaptive knowledge-based systems, learning ambient intelligence systems, multimodal neural agents, self organising knowledge discovery systems and human robot interaction systems. Applications include medical reasoning under uncertainty, intelligent retrieval systems, interactive systems, adaptive engineering, data/text mining systems, cognitive and neuroscience-inspired robots, speech/language systems, intelligent web agents and hybrid techniques for medical diagnosis.
29.03.11	Towards a computational amygdala model	Nicolas Navarro
Recent neuroscientific evidence suggests that the human amygdala mediates performance in many reward-based decision-making tasks. This group of nuclei located deep within the medial temporal lobes seems to represent the affective/emotional valence, i.e. state value, of a situation, necessary for coordinating physiological, behavioral and cognitive responses. These findings directly link to computational neural reinforcement learning (RL) techniques that are well established within our group. In general, computational models dealing with emotional circuits, especially fear, are rare. Here we suggest a novel model comprising microcircuits for fear acquisition, processing and extinction. The model aims at filling the computational modeling gap between neurophysiological methods and neuroimaging studies. In addition to a computational study we plan to implement the proposed neural architecture on a real-world robot leading to a demonstrator capable of predicting aversive and appetitive situations.
22.03.11	Journal Club: Joint Haar-like features for face detection	Nils Meins
08.03.11	NAO Laser Head and SLAM	Junpei Zhong
This talk is split into two parts. In the first part, features of the new NAO laser head are firstly introduced. With the help of this laser head, it is possible to accurately scan indoor environment and create the map from it. Secondly, some previous works about SLAM based on laser sensor will be presented. From the past decades, many researchers attempt to solve the problem of simultaneous localization and mapping, commonly called SLAM.Many papers use different ways to implement or optimize SLAM methods, but most of them concentrate a small part of SLAM. With representative methods of each part of SLAM, the motivation of this talk is to help ourselves understand SLAM better and to explore future research possibilities arising from SLAM.
24.02.11	KSERA Integration: Towards approaching a person with the NAO robot.	Wenjie Yan, Elena Torta
08.02.11	Visual Feedback for Saccade Learning	Dr. Cornelius Weber
Saccades direct gaze direction to peripheral visual targets. Their magnitude is adapted throughout lifetime, and adaptation can be easily induced experimentally. An adaptation scheme that also works in newborns is unlikely to rely on sophisticated object recognition, memory, and geometrical difference computation. The lack of a geometrical model is also appealing to roboticists as it renders unnecessary any hardware-specific calibration/parameterization. We consider possible feedback signals that drive saccade adaptation, and differentiate two possible mechanisms. The first is a signed feedback, which contains information about whether the saccade was too long or too short. The second exploits the high foveal resolution, containing information about the resolution increase of the saccade target but not containing any vectorial error signal. We implement both mechanisms in a model and find experimental support in the first mechanism, at least in the case of horizontal saccades. Considering the separate neuroanatomical pathways for horizontal and vertical saccade control, we tentatively suggest the first mechanism to implement horizontal saccade adaptation and the second mechanism to implement vertical saccade adaptation.
25.01.11	Determining Cooperation in Multiagent Systems with Cultural Traits	Stefan Heinrich
Achieving cooperation among autonomous and rational agents is still a major challenge. In the past, altruistic cooperation was generally explained through genetic kinship relations. However, the theory of 'cultural kin' is an approach that tries to explain altruism through cultural relatedness. To promote cooperation among autonomous and rational agents, this work transfers the idea of cultural characteristics, which benefits social behaviour, to multiagent systems (MAS). Accordingly, agents are characterised by cultural traits, which they can imitate from their neighbours and are supposed to solve tasks, for which they need the cooperation of other agents in most cases. The interaction of cooperation and cultural trait propagation will be investigated in a theoretical analysis and in an empirical simulation in a particular developed framework. As a novelty, schemata will be analysed that are beyond the well-studied one-to-one interaction.
24.01.11	Feature reduction and Ensemble-Learning with AdaBoost for Licenceplate Localisation	Nils Meins
In this talk we are analysing the ensemble learning AdaBoost according to the example of the licence plate localization. We analyse the behaviour of the learning method using different classifiers which are combined to an ensemble. One disadvantage of AdaBoost as an ensemble-learning method is the long training duration. We can update AdaBoost a bit by making only minor changes. That way we are able to reduce a large quantity of information. Using less information as a basis we are able to train a good essemble classifier in a short time. To train the final ensemble classifier we use AdaBoost in its unmodified form. Finally we compare our results with two different cascades. One with an exhaustive search and one that was trained with a randomised limited attribute quantity. Using Threshold-, Binning-, FuzzyBinning classifiers and the "Joint Haar-like Feature" we demonstrate that our method is giving good results.
10.01.11	Neural Knowledge Technology	Prof. Dr. Stefan Wermter
In this talk we study the potential of nature-inspired, in particular computational neural and symbolic representations, in order to build new adaptive knowledge-based systems, learning ambient intelligence systems, multimodal neural agents, self organizing knowledge discovery systems and human robot interaction systems. We will give an overview of learning neural symbolic knowledge technologies and robots from a perspective of integrative hybrid intelligent systems. We illustrate some new developments including also examples under development in the new Knowledge Technology lab.
21.12.10	Backpropagation Reloaded & A Project Proposal	Dr. Jens Kleesiek
The seminar is split into two parts. In the first half an extension of the classical backpropagation algorithm will be discussed and preliminary results will be shown. In the second part thoughts about a new project will be presented, hopefully culminating in a lively discussion with valuable comments and suggestions. The classical backpropagation algorithm led to a "renaissance" in the artificial neuronal network community when it was (re-)introduced in the mid-1980s and has been used vastly until then. However, in recent years alternative methods have gained popularity and are often used to solve classification tasks or regression problems. This is due to different causes ranging from scalability issues to difficulties in interpreting the hidden layer. To possibly overcome some of these problems and enhance performance as well as the speed of convergence we suggest to extend the algorithm with the concept of intrinsic plasticity (IP), leading to a sparse firing pattern of the neurons. I will review classical backpropagation, elucidate the concept of IP and show preliminary results of the IP enhanced backpropagation algorithm. In the second part of the talk I will present some thoughts for a new project. Using the concept of Sensorimotor Contingencies (SMCs) and its original idea, namely that "seeing is a way of acting", I want to discuss with you how to convey this notion to a NAO experiment giving rise to object discrimination.
07.12.10	Ksera Workshop - Towards socially assistive robot that helps elderly people	Wenjie Yan
Person tracking is an important topic in ambient living systems as well as in computer vision. Detecting a person from a ceiling-mounted camera is a challenge, since the person's appearance is very different from the top or from a side, and the shape of the person changes significantly when moving around the room. This paper proposes a novel algorithm for real-time person tracking using particle filters and different visual streams as cues for recognition. We present a new architecture to integrate different vision streams into a particle filter with adaptive reliability of each stream. In particular, a short-term memory mechanism is modeled, which enhances the robustness of the tracking system against noise.
16.11.10	Hybrid Neural Systems	Doreen Jirak
02.11.10	Grounding Cognition: Psychological Foundations for Development of Natural Language Dialogues.	Stefan Heinrich
Barsalou (2008) claims that conceptions of grounded cognition reject the standard view that amodal symbols represent knowledge in semantic memory. Furthermore it is unlikely that the brain contains amodal symbols; if it does, they work together with modal representations to create cognition. The talk in the Oberseminar will be a combination of: 1. A presentation of Lawrence W. Barsalou's paper 'Grounded Cognition' in a journal club fashion and 2. A discussion of implications for future research directions and the development of computational models of grounding language. Reference: Lawrence W. Barsalou. Grounded Cognition. Annual review of psychology, 59:617-645, Annual Reviews, 2008
19.10.10	RobotDoC Workshop: Towards Affective Mechanisms and Mirror Neurons in Cognitive Robots	Nicolas Navarro, Junpei Zhong
Multi-robot Exploration Behavior based on Mirror Neuron System by Junpei Zhong The mirror neuron system is a neuron system which is believed to play a role in action understanding and imitation among humans and other primate species. There have been a number of researchers who contributed to build up a computational functional model of mirror neuron system, mostly focused on motor action imitation. However, my research question will be at a higher level in action selection of multi-robot system and examines: how to select an action for robots coordination and how it supports the social organization between agents? Humanoid robot behavior modulated by an artificial self-protective affective system by Nicolas Navarro Affective mechanisms are a result of a distributed neural system affecting several behavioral levels and for this reason there is no easy way to separate them from cognition. Moreover, affective mechanisms could be a requisite for intelligence, which is supported by neuroscientific insights of its key role in long-term memory and allocation of attention, two important characteristics ascribed to intelligent organisms. During years, studies have shown that the amygdala (an old brain structure) is involved in emotional evaluation of stimuli, and furthermore, it is involved in the associations of neutral stimuli with innate responses. Innate responses such as fear, are essential for the survival of any animal. For this reason, we believe they are an interesting start point to build an artificial intelligent system.
14.10.10	The Centre of Intelligent Systems and Robotics: A presentation for master students.	Doreen Jirak, Stefan Heinrich

Summer Semester 2010

Summer Semester 2010
27.09.10	Journal Club: Mirror Neurons and Learning in Bio-Robotics	Junpei Zhong, Nicolas Navarro
21.09.10	Object Affordances in the Context of Sensory Motor Contingencies	Dr. Jens Kleesiek
09.09.10	Cinacs Summer School: Workshop & Lab Visits	Prof. Dr. Stefan Wermter, Dr. Cornelius Weber, Stefan Heinrich
30.08.10	NeuroSoftware Meeting	Dr. Cornelius Weber
25.08.10	PhD Proposal Meeting	Stefan Heinrich, Wenjie Yan, Nicolas Navarro, Junpei Zhong
23.08.10	Journal Club: Emergence of learning; Learning with a self-organizing, dynamic neural system	Nicolas Navarro, Wenjie Yan
13.08.10	Journal Club: Language models; Multi-Robot Learning	Junpei Zhong, Stefan Heinrich
06.08.10	Journal Club: Grounding action words in the interaction with the world	Stefan Heinrich
28.07.10	Initial overview of multi-robot behavior based on mirror neurons	Junpei Zhong
16.07.10	Grasp-oriented visual perception	Christine Upadek
09.07.10	Learning Object Affordances in the Context of SMCs	Dr. Jens Kleesiek
02.07.10	Towards neuronal emotional robotics	Nicolas Navarro
25.06.10	Ksera: Person identification and approaching: a review	Wenjie Yan
18.06.10	Interacting Robots: a review	Stefan Heinrich
11.06.10	POS Tagging and Neural Network approaches	Arne Koehn
04.06.10	Ambient Intelligence and Robots	Felix Lindner
28.05.10	Journal Club: Particle filters	Dr. Cornelius Weber
14.05.10	Parsing and Neural Networks	Niels Beuck
07.05.10	Neuronale Netzwerk Simulationen in Java	Robert Keßler / NeuNet AG
30.04.10	Fehlererkennung u. -korrektur bei String-Matching	Andree Monsees
23.04.10	Neural networks for RobotDoc, Ksera, Nao, ...	Dr. Cornelius Weber
16.04.10	Erste Schritte mit dem Nao Roboter	Dieter Jessen
09.04.10	Einführung	Prof. Dr. Stefan Wermter