German Named Entity Recognition Data

Data and Task Setup

The GermEval 2014 NER Shared Task builds on a new dataset with German Named Entity annotation with the following properties:

The data was sampled from German Wikipedia and News Corpora as a collection of citations.
The dataset covers over 31,000 sentences corresponding to over 590,000 tokens.
The NER annotation follows the NoSta-D guidelines, which extend the Tübingen Treebank guidelines. They use four main NER categories with sub-structure and annotate NEs embedded in other NEs, such as [ORG FC Kickers [LOC Darmstadt]].

The data is available for download below. The dataset is distributed under the CC-BY 4.0 license (creativecommons.org/licenses/by/4.0/).

Data Format

The following snippet shows an example of the TSV format this data is distributed in.

# de.wikipedia.org/wiki/Manfred_Korfmann [2009-10-17]
1 Aufgrund O O
2 seiner O O
3 Initiative O O
4 fand O O
5 2001/2002 O O
6 in O O
7 Stuttgart B-LOC O
8 , O O
9 Braunschweig B-LOC O
10 und O O
11 Bonn B-LOC O
12 eine O O
13 große O O
14 und O O
15 publizistisch O O
16 vielbeachtete O O
17 Troia-Ausstellung B-LOCpart O
18 statt O O
19 , O O
20 „ O O
21 Troia B-OTH B-LOC
22 - I-OTH O
23 Traum I-OTH O
24 und I-OTH O
25 Wirklichkeit I-OTH O
26 “ O O
27 . O O

The example sentence, "Aufgrund seiner Initiative fand 2001/2002 in Stuttgart, Braunschweig und Bonn eine große und publizistisch vielbeachtete Troia-Ausstellung statt, „Troia - Traum und Wirklichkeit“." contains five top-level named entities: the locations Stuttgart, Braunschweig and Bonn, the noun Troia-Ausstellung, which includes a location as one of its parts, and the title of the event, Troia - Traum und Wirklichkeit, which in turn contains the embedded location name Troia.
The sentence is encoded as one token per line, with information provided in tab-separated columns. The first column contains either a #, which signals the source the sentence is cited from and the date it was retrieved, or the token number within the sentence. The second column contains the token.
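The line layout described above can be read with a few lines of Python. The following is a minimal sketch, not the official reader: it assumes that columns are separated by literal tab characters and that sentences are separated by blank lines, and the names `Sentence` and `read_sentences` are illustrative only. Any columns after the token are kept verbatim.

```python
from dataclasses import dataclass, field
from typing import Iterable, Iterator, List, Optional


@dataclass
class Sentence:
    source: str = ""                                      # text after the leading "#"
    tokens: List[str] = field(default_factory=list)       # second column
    tags: List[List[str]] = field(default_factory=list)   # remaining columns, verbatim


def read_sentences(lines: Iterable[str]) -> Iterator[Sentence]:
    """Yield one Sentence per block of token lines."""
    current: Optional[Sentence] = None
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            # Assumption: a blank line ends the current sentence.
            if current is not None and current.tokens:
                yield current
            current = None
            continue
        if line.startswith("#"):
            # A "#" line carries the source and retrieval date and starts a new sentence.
            if current is not None and current.tokens:
                yield current
            current = Sentence(source=line[1:].strip())
            continue
        cols = line.split("\t")
        if len(cols) < 2:
            continue  # skip lines that do not have a token column
        if current is None:
            current = Sentence()
        current.tokens.append(cols[1])
        current.tags.append(cols[2:])
    if current is not None and current.tokens:
        yield current
```

Because the function only iterates over lines, it accepts an open file handle (opened with encoding="utf-8") as well as a list of strings.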
Name spans are encoded in the BIO scheme: B- marks the first token of a span, I- marks a token inside a span, and O marks a token outside any span. Outer spans are encoded in the third column, embedded spans in the fourth column. We have refrained from adding a further column for third-level embedded spans for the purpose of this evaluation, since they occurred only very rarely during annotation. See the paper [1] below for more information on the dataset and on the annotation guidelines.
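To turn one tag column back into labelled spans, the BIO tags can be decoded as sketched below. This is an illustrative helper, not part of the official tooling: the name `bio_to_spans` is hypothetical, and the lenient handling of an I- tag that does not continue a span (it opens a new one) is an assumption rather than something the guidelines prescribe. Running it once on the third column and once on the fourth yields the outer and the embedded spans respectively.

```python
from typing import List, Optional, Tuple

Span = Tuple[int, int, str]  # (start index, end index exclusive, label)


def bio_to_spans(tags: List[str]) -> List[Span]:
    """Decode a sequence of BIO tags such as B-LOC, I-LOC, O into spans."""
    spans: List[Span] = []
    start: Optional[int] = None
    label: Optional[str] = None
    for i, tag in enumerate(tags):
        # Split only at the first hyphen, so labels like "LOCpart" stay intact.
        prefix, _, name = tag.partition("-")
        continues = prefix == "I" and name == label
        if not continues:
            if label is not None:
                spans.append((start, i, label))  # close the open span
            start, label = None, None
            if prefix in ("B", "I"):
                # Assumption: a stray I- tag opens a new span instead of raising.
                start, label = i, name
    if label is not None:
        spans.append((start, len(tags), label))
    return spans


# Tokens 20 to 27 of the example sentence: „ Troia - Traum und Wirklichkeit “ .
outer = ["O", "B-OTH", "I-OTH", "I-OTH", "I-OTH", "I-OTH", "O", "O"]
inner = ["O", "B-LOC", "O", "O", "O", "O", "O", "O"]
outer_spans = bio_to_spans(outer)  # [(1, 6, "OTH")]
inner_spans = bio_to_spans(inner)  # [(1, 2, "LOC")]
```

Indices are zero-based positions within the list passed in, so they need to be shifted by one to match the token numbers in the first column.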

Task Setup

We split the dataset into training, development and test sets and provide the datasets in a tab-separated (TSV) format.

The full dataset can be downloaded here. Local mirror here. The data is split into the following subsets:

Training Set
Development Set
Test Set (in unannotated form and in annotated form)

In an ongoing project, people from ABBYY are correcting annotation errors. You can find their project here: http://www.dialog-21.ru/en/germeval2014/

The official evaluation scripts and further details are available at https://sites.google.com/site/germeval2014ner/. See the publications below for the annotation guidelines and the outcome of the GermEval 2014 challenge.

Publications

[1] Darina Benikova, Chris Biemann, Marc Reznicek (2014): NoSta-D Named Entity Annotation for German: Guidelines and Dataset. In Proceedings of LREC 2014, Reykjavik, Iceland (pdf)
[2] Darina Benikova, Chris Biemann, Max Kisselew, Sebastian Pado (2014): GermEval 2014 Named Entity Recognition Shared Task: Companion Paper. In Proceedings of the KONVENS GermEval Shared Task on Named Entity Recognition, Hildesheim, Germany (pdf)