Lexical Chains for German
In a NAACL 2013 paper based on the Master's thesis of Steffen Remus (supervised by Chris Biemann), 100 German documents from the Tiger/Salsa corpus were each annotated for lexical chains by two annotators. Newly developed statistical methods for lexical chaining were evaluated against this data.
Introduction
The identification of lexical chains is an important building block in modern natural language processing applications such as summarization or text segmentation. In order to extract lexical chains, it is necessary to identify the important lexico-semantic relations in a text.
Traditional approaches mostly rely on knowledge resources such as thesauri or lexical databases like WordNet. Hence, the quality of the extracted lexical chains depends heavily on the quality and coverage of the entries in the particular knowledge resource. Statistical methods, on the other hand, have proven to deliver good results in many natural language processing applications where the only prerequisite is a sufficiently large, high-quality data collection.
The thesis examines the suitability of statistical methods for the task of identifying lexico-semantic relations in order to build proper lexical chains. Four algorithms are developed that utilize either latent Dirichlet allocation (LDA) as a probabilistic topic modeling framework or the log-likelihood ratio (LLR) as an indicator for statistically significant co-occurring terms.
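As an illustration of the LLR idea mentioned above, the following is a minimal sketch of the standard log-likelihood ratio statistic (in Dunning's formulation) for a 2x2 co-occurrence table. It is not taken from the thesis or the released tool; the function names `llr` and `llr_from_counts`, the counting scheme (co-occurrence within some unit such as a sentence), and the example counts are assumptions made for illustration only.

```python
import math


def _xlogx(x):
    # x * ln(x), with the usual convention that 0 * ln(0) = 0
    return x * math.log(x) if x > 0 else 0.0


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) for a 2x2 contingency table.

    k11: units containing both terms A and B
    k12: units containing A but not B
    k21: units containing B but not A
    k22: units containing neither term
    """
    n = k11 + k12 + k21 + k22
    cells = _xlogx(k11) + _xlogx(k12) + _xlogx(k21) + _xlogx(k22)
    rows = _xlogx(k11 + k12) + _xlogx(k21 + k22)
    cols = _xlogx(k11 + k21) + _xlogx(k12 + k22)
    # Equivalent to 2 * sum(O * ln(O / E)) over the four cells
    return 2.0 * (cells + _xlogx(n) - rows - cols)


def llr_from_counts(n_ab, n_a, n_b, n):
    """Build the 2x2 table from marginal counts (hypothetical helper).

    n_ab: units containing both terms, n_a / n_b: units containing
    term A / term B, n: total number of units.
    """
    return llr(n_ab, n_a - n_ab, n_b - n_ab, n - n_a - n_b + n_ab)


if __name__ == "__main__":
    # Made-up counts: two terms that co-occur far more often than chance
    print(llr_from_counts(n_ab=30, n_a=40, n_b=50, n=1000))
```

A score near zero means the two terms co-occur about as often as expected under independence; larger scores indicate a statistically more significant association. Note that the statistic itself does not distinguish positive from negative association, so a check that the observed co-occurrence count exceeds the expected one is usually added in practice.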
An intrinsic evaluation of these algorithms against trivial and established baseline algorithms confirms the hypothesis that statistical methods are suitable for this task. For this purpose, lexical chains were annotated manually and a suitable measure for comparing lexical chains was developed.
Supporting annotation guidelines were developed, and the MMAX annotation toolkit was adapted to carry out an annotation project in which two annotators annotated 100 documents from the German Salsa 2.0 corpus.
The annotations and the software for lexical chain annotation are freely available from this page.
Documentation
- Annotation Guidelines for Lexical Chains as used in this project (pdf)
- Annotation Companion: how to use the MMAX lexical chaining annotation tool (pdf)
- Steffen Remus (2012): Automatically Identifying Lexical Chains by Means of Statistical Methods - A Knowledge-Free Approach. MA Thesis, TU Darmstadt (pdf)
Downloads
- Lexical Chain data only - for 100 documents, 2 annotators (zip, 170KB)
- LexChainMMAX tool for lexical chain annotation, without automatic lexical chainers (zip, 12MB) [*updated 24.06.2013]
- LexChainMMAX tool for lexical chain annotation, including automatic lexical chainers as described in the thesis (zip, 28MB) [*updated 24.06.2013]
The software is distributed open-source under the Apache 2.0 license.
To operate the tool and to access the full documents, you will need to download Tiger 2.1 and Salsa 2.0 (subject to licensing terms); see Readme.txt for details.
Citation
- Remus, S., and Biemann, C. (2013): Three Knowledge-Free Methods for Automatic Lexical Chain Extraction. Proceedings of NAACL-2013, Atlanta, GA, USA (pdf)
This data and software were created in the framework of the LOEWE project "Text as a Product".