Web Interfaces for Language Processing Systems SS2018
The following systems were successfully completed as part of the master's project "Web Interfaces for Language Processing Systems" in the summer semester 2018.
I. Active Learning Platform for comment classification
Machine learning is used to classify comments into the classes good, neutral, and bad.
Team
Felix Fröhlich
Purpose
Label comments using an active learning approach.
How it works
In the first step, a pre-trained machine learning model predicts class probabilities for unlabeled data points. These probabilities indicate how confident the model is about each data point, which makes it possible to preselect the points the model is least sure about and have the user label them. The model learns the most from exactly these data points.
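The selection step described above can be sketched as least-confidence uncertainty sampling. The classifier and the toy data below are illustrative assumptions, not the platform's actual model:

```python
# Illustrative sketch of uncertainty sampling (least-confidence strategy);
# the classifier and the toy data are hypothetical stand-ins.
import numpy as np
from sklearn.linear_model import LogisticRegression

def select_uncertain(model, X_unlabeled, n=5):
    """Return indices of the n points the model is least confident about."""
    probs = model.predict_proba(X_unlabeled)
    confidence = probs.max(axis=1)      # highest class probability per point
    return np.argsort(confidence)[:n]   # least confident first

# Toy data: 2-D points, three classes (good / neutral / bad)
rng = np.random.default_rng(0)
X_train = rng.normal(size=(30, 2)) + np.repeat([[0, 0], [3, 0], [0, 3]], 10, axis=0)
y_train = np.repeat([0, 1, 2], 10)
X_pool = rng.normal(size=(100, 2)) * 2  # unlabeled pool

model = LogisticRegression(max_iter=200).fit(X_train, y_train)
staging = select_uncertain(model, X_pool, n=5)
print(staging)  # indices of the comments to send to the staging table
```

The selected indices are exactly the points worth showing to a human annotator first.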
The backend of this web application offers a REST API for interaction with the server. Data is stored in an SQLite database with three tables. The largest table contains 100,000 unlabeled comments, a subset of the One Million Posts Corpus of user comments submitted to an Austrian newspaper website. The second table contains all comments that have been labeled by users of the platform. Between them sits the so-called staging table, which holds the comments the machine learning algorithm is unsure about. The frontend gets its data from this staging table.
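The three-table flow could look like the following sketch; the column names and values are assumptions for illustration, not the project's actual schema:

```python
# Hypothetical sketch of the three-table layout: unlabeled pool -> staging
# (uncertain comments) -> labeled. Column names are illustrative assumptions.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE unlabeled (id INTEGER PRIMARY KEY, text TEXT);
CREATE TABLE staging   (id INTEGER PRIMARY KEY, text TEXT, uncertainty REAL);
CREATE TABLE labeled   (id INTEGER PRIMARY KEY, text TEXT, label TEXT);
""")

# An uncertain comment moves from the pool into the staging table ...
conn.execute("INSERT INTO unlabeled VALUES (1, 'Great article!')")
conn.execute("INSERT INTO staging SELECT id, text, 0.41 FROM unlabeled WHERE id = 1")

# ... and, once the user has labeled it, into the labeled table.
conn.execute("INSERT INTO labeled SELECT id, text, 'good' FROM staging WHERE id = 1")
conn.execute("DELETE FROM staging WHERE id = 1")
print(conn.execute("SELECT label FROM labeled").fetchone())  # ('good',)
```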
Source Code
The source code is available here
II. Speech of the day
This project extends the "Network of the Day" project with audio processing capability.
Team
Steffen Stahlhacke
Purpose
Use news videos as source material and use the Kaldi toolkit to convert speech to text. The frontend component of the Network of the Day is used to visualize the graphs.
How it works
This project crawls, downloads, and transcribes YouTube playlists and videos of daily published news channels. Its main aim is to provide the transcribed videos to the Network of the Day project.
The project depends on the kaldi-gstreamer project, the CoreNLP project, and youtube-dl.
In the processing pipeline, the youtube-dl downloader is used to collect and download the videos and playlists. As a postprocessing step, the downloads are converted to MP3 with ffmpeg. The MP3 files are then sent to a kaldi-gstreamer server, which transcribes the audio files to text. For this step, a Kaldi ASR model is necessary (e.g. the German model from UHH-LT). As a third step, "Words To Sentence" and "True Case" annotation are done via CoreNLP. The last step is to transform, compress, and export the data to fit the requirements of the NoD core.
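The four pipeline stages can be sketched as a simple chain of functions. The stage bodies below are stubs standing in for the real youtube-dl, ffmpeg, kaldi-gstreamer, and CoreNLP calls; the example URL and return values are illustrative:

```python
# Sketch of the four-stage pipeline; each function is a stub for the
# external tool named in its comment.
def download(url):
    return url + ".webm"                       # youtube-dl: fetch the video

def to_mp3(video_path):
    return video_path.replace(".webm", ".mp3")  # ffmpeg: extract MP3 audio

def transcribe(mp3_path):
    return "das ist ein test"                  # kaldi-gstreamer: ASR output

def annotate(text):
    return text.capitalize() + "."             # CoreNLP: true-casing, sentences

def pipeline(url):
    # Feed each stage's output into the next stage, in order.
    return annotate(transcribe(to_mp3(download(url))))

print(pipeline("news_clip"))  # Das ist ein test.
```

In the real project, each stage hands off files rather than strings, but the ordering is the same.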
Source Code
The source code is available here.
III. Unspeech Visualization
Team
Robert Repenning
Project Description
The unspeech visualization uses the t-SNE algorithm to visualize the unspeech embeddings.
More info about unspeech can be found at http://unspeech.net/preview. The unspeech application is divided into three main areas. The middle canvas shows the visualization of the data points, which can be interacted with by hovering the mouse over them or clicking them. At the top you can find the main navigation. On the right side of the title bar (T-SNE Visualization), there are three buttons:
- RUN: starts the web worker that calculates the relative positions of the data points.
- STOP: stops/interrupts the web worker.
- DATASET: a dropdown menu where the different datasets can be selected.
On the left side is the tooltip and settings ribbon. Under SELECTION you can see information about the currently selected point: its name as well as a timestamp. With play and pause, the audio file associated with the data point can be played.
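The core computation the web worker performs can be sketched offline with scikit-learn. The embedding size and the random vectors below are placeholders, not the actual unspeech data:

```python
# Illustrative t-SNE projection of toy embedding vectors; the real
# application runs this computation in a browser web worker instead.
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(50, 64))  # stand-in for unspeech embeddings

# Project the high-dimensional embeddings down to 2-D canvas positions.
points = TSNE(n_components=2, perplexity=10,
              init="random", random_state=0).fit_transform(embeddings)
print(points.shape)  # one (x, y) position per data point
```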
Source Code
The source code is available here.
IV. Thumbnail Annotator
Basic Idea
The idea is to get 'context-sensitive' thumbnails for specific words (nouns, named entities) that appear in an input text and to induce a 'sense' by editing the priority of the shown thumbnails. For example, take the sentence "My computer has a mouse and a gaming keyboard". If a user clicks on the word "mouse", a list of thumbnails of animals and technical devices shows up. The user can now increment or decrement the priority of the thumbnails to induce the sense of the word "mouse" in this specific context. If we then input the sentence "I have a mouse and a dog." and the user clicks on the word "mouse" again, a list of thumbnails shows up. Since the word now appears in a different context, the thumbnails will not carry the earlier prioritization, and the user can induce the sense of "mouse" for this context again. The goal of the project is a PoC application including a web frontend as well as a REST API. The most important technologies and frameworks used to achieve this goal are UIMA, DKPro, Spring (Boot), Redis, Bootstrap, Vue.js, and Docker.
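The priority mechanism above amounts to keeping scores per (word, context) pair, so "mouse" next to "computer" and "mouse" next to "dog" accumulate independent rankings. A minimal sketch, assuming a plain dictionary in place of the project's Redis store and with hypothetical thumbnail names:

```python
# Hypothetical sketch of per-(word, context) thumbnail priorities; the real
# project persists these scores in Redis.
from collections import defaultdict

priorities = defaultdict(lambda: defaultdict(int))

def vote(word, context, thumbnail, delta):
    """Increment (+) or decrement (-) a thumbnail's priority for a word-in-context."""
    priorities[(word, context)][thumbnail] += delta

def ranked_thumbnails(word, context):
    """Return the thumbnails for this word-in-context, highest priority first."""
    scores = priorities[(word, context)]
    return sorted(scores, key=scores.get, reverse=True)

vote("mouse", "computer", "computer_mouse.jpg", +2)
vote("mouse", "computer", "animal_mouse.jpg", -1)
vote("mouse", "animal", "animal_mouse.jpg", +1)   # a separate context
print(ranked_thumbnails("mouse", "computer"))  # ['computer_mouse.jpg', 'animal_mouse.jpg']
```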
V. Argument Search Engine
Analyze, annotate, and visualize arguments found in a text using a web interface.
Team
Philipp Heidenreich, Oleksiy Oliynyk
Key features
- User interface for analyzing and visualizing texts
- Extend the named entity tagger with argument information
- Batch processing of texts to extract arguments
- Search through the processed data
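The search feature over batch-processed texts can be sketched as filtering annotated argument phrases. The annotation labels (claim/premise) and example sentences below are illustrative assumptions, not the project's actual data model:

```python
# Minimal sketch of searching annotated argument phrases; labels and
# documents are hypothetical examples.
documents = [
    {"text": "Smoking should be banned", "label": "claim"},
    {"text": "because it harms bystanders", "label": "premise"},
    {"text": "The weather was nice", "label": "none"},
]

def search_arguments(query, docs):
    """Return annotated argument phrases (claims/premises) matching the query."""
    return [d for d in docs
            if d["label"] != "none" and query.lower() in d["text"].lower()]

print(search_arguments("smoking", documents))
```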
Task
- Get and visualize argument structures
- Provide related information for found argument phrases
- Enable deeper investigation
Source Code
The source code is available here.