DIVID-DJ: Data Extraction and Interactive Visualization of Unexplored Textual Datasets for Investigative Data-Driven Journalism
Objectives
Journalists investigate novel and relevant stories in order to report them in article publications. Often, such stories need to be discovered within large text document collections with different levels of confidentiality (e.g. public parliament records, collections
Investigating unstructured document collections is laborious: the volume of content can be vast, e.g., the WikiLeaks PlusD dataset released in 2013 contains around 1.7 million cables. The plain text parts mostly contain uninteresting content that conceals the crucial storylines, and if journalists do not know in advance what to look for in the document collections, they can only vaguely target all people and organizations (“named entities”) of public interest. There
As an additional constraint, the texts have to be analyzed under time pressure, because the journalistic value of each story decreases rapidly if other media publish it first. Moreover, the workload for processing confidential documents frequently falls on the few journalists who are allowed to access them.
To meet these challenges, we identified automated support for immediate extraction of valuable information from large text document collections as vital for journalists.
Methods
In this project, we will combine NLP technology with advanced InfoVis interfaces into a novel tool for investigative data-driven journalism that addresses research questions regarding the combination of NLP and InfoVis techniques.
After document conversion, which is provided by the Spiegel IT department, NLP tools will extract named entities and their relationships from the document collections. This is the only language-dependent
The results of the natural language processing stage will be used as input for an interactive visualization developed specifically for this purpose in this project. The aggregate view of the extracted entities forms the basic structure of the entity graph, which serves as a visual entry point to the document collection. The novel visualization will need to show diverse information about entities and their relationships, such as data sources, data frequencies, or data changes over time.
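To illustrate how extracted entities could be aggregated into an entity graph, the following is a minimal sketch and not the project's actual implementation. It assumes the NLP stage has already produced a list of named entities per document; the sample documents, the field names (id, year, entities), and the function build_entity_graph are hypothetical. Entities become nodes, co-occurrence within a document becomes an edge, and each edge records its frequency, its source documents, and its counts per year.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical NLP output: one record per document with its extracted entities.
docs = [
    {"id": "doc1", "year": 2010, "entities": ["Alice Example", "Acme Corp"]},
    {"id": "doc2", "year": 2011, "entities": ["Alice Example", "Acme Corp", "Bob Sample"]},
    {"id": "doc3", "year": 2011, "entities": ["Bob Sample", "Acme Corp"]},
]


def build_entity_graph(docs):
    """Aggregate per-document entities into a co-occurrence graph."""
    node_freq = Counter()                 # entity -> number of documents mentioning it
    edge_freq = Counter()                 # (entity, entity) -> number of shared documents
    edge_sources = defaultdict(list)      # (entity, entity) -> ids of shared documents
    edge_by_year = defaultdict(Counter)   # (entity, entity) -> year -> count
    for doc in docs:
        # Deduplicate and sort so that each pair has one canonical ordering.
        entities = sorted(set(doc["entities"]))
        node_freq.update(entities)
        for pair in combinations(entities, 2):
            edge_freq[pair] += 1
            edge_sources[pair].append(doc["id"])
            edge_by_year[pair][doc["year"]] += 1
    return node_freq, edge_freq, edge_sources, edge_by_year


node_freq, edge_freq, edge_sources, edge_by_year = build_entity_graph(docs)
```

In this sketch, the edge between "Acme Corp" and "Bob Sample" carries its source documents and a per-year count, which is the kind of information a visualization could map to edge thickness or a timeline.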
Added Value for Science and Journalism
The project will advance both science and investigative journalism. On the scientific side, we will cover both language technologies and information visualization. The main contribution of this project is the connection of these technologies for use in investigative journalism.
In the information visualization area, we will contribute novel guidelines for effective visual designs and interaction techniques, specifically for the purposes of investigative journalism. This includes the visualization of very large collections, the visualization of how entity networks progress over time, and interaction with the visualization for effective information access.
The key contribution on the natural language processing side is the effective combination of named entity recognition and keyword extraction for use in collection visualization and information access. Further research questions include efficient indexing for entering the local view, the ranking and selection of keywords, and the integration of tools that identify and resolve temporal references within the documents.
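As one possible illustration of keyword ranking and selection, the sketch below scores the terms of a document with TF-IDF, a common baseline. The project description does not state which ranking method is used, so TF-IDF is an assumption here, and the function rank_keywords and the sample token lists are hypothetical.

```python
import math
from collections import Counter


def rank_keywords(docs, doc_index, top_k=3):
    """Rank the terms of one tokenized document by TF-IDF against the collection."""
    tokens = docs[doc_index]
    if not tokens:
        return []
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for doc_tokens in docs:
        df.update(set(doc_tokens))
    tf = Counter(tokens)
    scores = {
        term: (count / len(tokens)) * math.log(n_docs / df[term])
        for term, count in tf.items()
    }
    # Highest score first; ties are broken alphabetically for stable output.
    return sorted(scores, key=lambda term: (-scores[term], term))[:top_k]


# Hypothetical, already tokenized documents.
sample_docs = [
    ["embassy", "cable", "embassy", "visit"],
    ["cable", "minister", "visit"],
    ["cable", "trade", "minister"],
]
top_terms = rank_keywords(sample_docs, 0, top_k=2)
```

Terms that occur in every document, such as "cable" in the sample data, receive a score of zero and fall to the bottom of the ranking, which is why such a score can help separate distinctive keywords from collection-wide vocabulary.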
The added value for journalism is threefold: First, journalists will get their hands on a tool that does not require extensive training and that allows them to quickly spot interesting parts of the dataset. Second, the tool supports sharing particular views,
Follow our blog at http://newsleak.io/ for updates!
Project Data
Funding Body: Volkswagenstiftung
Project volume: 96K Euro
Project Duration: Nov 2015 - Oct 2016
Project Partners
1. Spiegel-Verlag Rudolf Augstein GmbH & Co. KG, Hamburg, Germany
Primary Investigator: Marcel Rosenbach
Coordinator: Dr. Michaela Regneri
Main competence and project contribution: Journalist, Knowledge
2. Technische Universität Darmstadt, FG Language Technology, Darmstadt, Germany
Primary Investigator: Prof. Dr. Chris Biemann
Executive Staff: Dr. Alexander Panchenko, Seid Muhie Yimam, Uli Fahrer
Main competence and project contribution: Natural language processing of large text corpora
3. Technische Universität Darmstadt, Interactive Graphics Systems Group, Darmstadt, Germany
Primary Investigator: Dr. Tatiana von Landesberger
Executive Staff: Kathrin Ballweg
Main competence and project contribution: Visual analysis of large datasets,
Publication
- Müller, M., Ballweg, K., von Landesberger, T., Yimam, S.M., Fahrer, U., Biemann, C., Rosenbach, M., Regneri, M., Ulrich, H. (2017). Guidance for Multi-Type Entity Graphs from Text Collections. EuroVis Workshop on Visual Analytics 2017, Barcelona, Spain (pdf)
- Yimam, S.M., Ulrich, H., von Landesberger, T., Rosenbach, M., Regneri, M., Panchenko, A., Lehmann, F., Fahrer, U., Biemann, C. and Ballweg, K. (2016): new/s/leak – Information Extraction and Visualization for Investigative Data Journalists. ACL 2016 Demo Session, Berlin, Germany (pdf)
- Ballweg K., Zouhar F., Wilhelmi-Dworski P., von Landesberger T., Fahrer U., Panchenko A., Yimam S.M., Biemann C., Regneri M., Ulrich H. (2016): new/s/leak – A Tool for Visual Exploration of Large Text Document Collections in the Journalistic Domain, Baltimore, MD, USA (pdf)