Web Interfaces for Language Processing Systems SS2017

The following systems were successfully completed as part of the master project Web Interfaces for Language Processing Systems in the summer semester 2017.

I. AnonML

Machine Learning to anonymize court decisions.

Team

Mirco Franzek and Matthias Schildwächter

Purpose

Court decisions in Germany can only be published in anonymized form.
Anonymization is done by hand.
Only a very small fraction of all decisions is published.
Courts are not capable of anonymizing a larger share.
There is little research about how the law is applied by lower-level courts.

Idea

Create software that is able to anonymize court decisions automatically.
Enable legal researchers and practitioners to better predict court decisions.
Improve transparency and legitimacy – truly make the world a better place.
Create the foundation for a better database and better search tools.

Requirements

Recognize text fragments that have to be anonymized: names, addresses, locations, license plates, companies, identifying descriptions, etc.
Possibly extract anonymization rules from already anonymized decisions.
Replace these fragments with placeholders, or delete them.
Calculate confidence scores to flag cases that require manual review.
No legal knowledge required.
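
The requirements above can be illustrated with a minimal sketch. It is not the AnonML implementation: the `Detection` class, the `anonymize` function, the `[LABEL]` placeholder format, and the 0.8 threshold are all hypothetical, and the detections are written by hand here rather than produced by a trained model.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    """A text span that a (hypothetical) recognizer marked as sensitive."""
    start: int         # inclusive character offset
    end: int           # exclusive character offset
    label: str         # e.g. "NAME", "LOCATION", "LICENSE_PLATE"
    confidence: float  # recognizer score in [0, 1]


def anonymize(text, detections, threshold=0.8):
    """Replace detected spans with placeholders.

    Returns the anonymized text and the detections whose confidence is
    below `threshold`, which should be checked by a human.
    """
    needs_review = [d for d in detections if d.confidence < threshold]
    # Replace from right to left so earlier offsets stay valid.
    for d in sorted(detections, key=lambda d: d.start, reverse=True):
        text = text[:d.start] + f"[{d.label}]" + text[d.end:]
    return text, needs_review


# Hand-written example input (assumed, not taken from the project data).
sample = "Max Mustermann lives in Hamburg."
found = [
    Detection(0, 14, "NAME", 0.95),
    Detection(24, 31, "LOCATION", 0.60),
]
anonymized, review = anonymize(sample, found)
print(anonymized)
print([d.label for d in review])
```

In this sketch every detection is replaced, and the low-confidence ones are additionally returned so that a reviewer can accept or decline them.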

Data

A few hundred decisions from the European Court of Justice.

Result

The software suggests possible anonymizations which have to be accepted or declined.
The user can add missing anonymizations.
The application can be retrained with the information from manually corrected decisions to improve the suggestions and speed up the process.
Finally, the anonymized document can be exported.

Documentation

The documentation containing the installation guide and general descriptions is available here.

Source Code

The source code is available here.

Demo

II. new/s/leak Extension

This project extends the graph and document processing functionalities of the new/s/leak project.

Team

Alvin Fazrie and Thorben Wiese

Purpose

Enable adding new entities and keywords to the system.
Enable adding new entity types.
Provide keyword graphs alongside the entity graph.
Enhance entity and keyword blacklisting.
Improve analyzability of connections between entities, keywords and tags.
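
As a rough illustration of a keyword graph with blacklisting, the sketch below counts how often two keywords occur in the same document. It is an assumption about how such a graph could be built, not the new/s/leak implementation; the function name `cooccurrence_graph` and the sample keywords are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_graph(docs, blacklist=frozenset()):
    """Count how often two keywords appear in the same document.

    `docs` is an iterable of keyword collections, one per document.
    Blacklisted keywords are dropped before edges are counted.
    """
    edges = Counter()
    for keywords in docs:
        # Deduplicate and sort so each pair has one canonical order.
        kept = sorted(set(keywords) - set(blacklist))
        for a, b in combinations(kept, 2):
            edges[(a, b)] += 1
    return edges


# Hypothetical keyword sets, one list per document.
docs = [
    ["energy", "contract", "meeting"],
    ["energy", "contract"],
    ["meeting", "energy"],
]
graph = cooccurrence_graph(docs, blacklist={"meeting"})
print(graph)
```

Each key of the returned counter is an edge between two keywords, and its value can serve as the edge weight when the graph is drawn.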

Documentation

The documentation is available here.

Source Code

The source code is available here.

Demo

The demo, which is based on the Enron Email Dataset, is available here.

III. News-crawler

Team

Sönke Behrendt

Project Description

This project crawls daily published news articles and extracts, indexes, and processes their content. The extracted content is indexed in Elasticsearch for further processing. The project also provides tooling to extract and preprocess the content for the NoD project.
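
To make the indexing step concrete, the following sketch builds a request body in the newline-delimited JSON format that the Elasticsearch bulk API expects. It is only an illustration under assumptions: the index name `news`, the article fields, and the use of the article URL as document ID are hypothetical and are not taken from the crawler's source code.

```python
import json


def bulk_index_payload(articles, index="news"):
    """Build the NDJSON body for Elasticsearch's _bulk endpoint.

    Each article becomes two lines: an action line naming the target
    index and document ID, followed by the document itself.
    """
    lines = []
    for article in articles:
        action = {"index": {"_index": index, "_id": article["url"]}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(article))
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


# Hypothetical extracted article.
articles = [
    {
        "url": "https://example.org/a",
        "title": "Example headline",
        "content": "Body text.",
    },
]
payload = bulk_index_payload(articles)
print(payload)
```

Such a payload would be sent with a POST request to the `/_bulk` endpoint using the content type `application/x-ndjson`.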

Documentation and Source Code

The source code and documentation are available at https://github.com/thesoenke/news-crawler

Demo

The demo is available here.