German Named Entity Recognition Data
Data and Task Setup
The GermEval 2014 NER Shared Task builds on a new dataset of German text annotated with named entities. The dataset has the following properties:
- The data was sampled from German Wikipedia and News Corpora as a collection of citations.
- The dataset covers over 31,000 sentences corresponding to over 590,000 tokens.
- The NER annotation uses the NoSta-D guidelines, which extend the Tübingen Treebank guidelines, using four main NER categories with sub-structure, and annotating embeddings among NEs such as [ORG FC Kickers [LOC Darmstadt]].
The data is available for download below and is distributed under the CC-BY license (creativecommons.org/licenses/by/4.0/).
Data Format
The following snippet shows an example of the TSV format this data is distributed in.
# de.wikipedia.org/wiki/Manfred_Korfmann [2009-10-17]
1 Aufgrund O O
2 seiner O O
3 Initiative O O
4
5 2001/2002 O O
6 in O O
7 Stuttgart B-LOC O
8 , O O
9 Braunschweig B-LOC O
10 und O O
11 Bonn B-LOC O
12
13 große O O
14 und O O
15
16
17 Troia-Ausstellung B-LOCpart O
18
19 , O O
20 „ O O
21 Troia B-OTH B-LOC
22 - I-OTH O
23 Traum I-OTH O
24 und I-OTH O
25 Wirklichkeit I-OTH O
26 “ O O
27 . O O
The example sentence, "Aufgrund seiner Initiative
The sentence is encoded as one token per line, with information provided in tab-separated columns. The first column contains either a #, which signals the source the sentence is cited from and the date it was retrieved, or the token number within the sentence. The second column contains the token.
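The column layout described above can be read with a few lines of Python. The following is a minimal sketch, not an official reader: the names `Token`, `Sentence`, and `read_sentences` are hypothetical, and the code assumes that columns are separated by tab characters and that a blank line or a new `#` line ends the current sentence.

```python
from typing import Iterable, Iterator, List, NamedTuple, Optional


class Token(NamedTuple):
    index: int  # token number within the sentence (column 1)
    form: str   # the token itself (column 2)
    outer: str  # BIO tag of the outer span (column 3)
    inner: str  # BIO tag of the embedded span (column 4)


class Sentence(NamedTuple):
    source: Optional[str]  # content of the "#" line, if one was seen
    tokens: List[Token]


def read_sentences(lines: Iterable[str]) -> Iterator[Sentence]:
    """Group tab-separated token lines into sentences."""
    source: Optional[str] = None
    tokens: List[Token] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            # Assumption: a blank line closes the current sentence.
            if tokens:
                yield Sentence(source, tokens)
            source, tokens = None, []
        elif line.startswith("#"):
            # A "#" line carries the source and retrieval date of the next sentence.
            if tokens:
                yield Sentence(source, tokens)
                tokens = []
            source = line[1:].strip()
        else:
            cols = line.split("\t")
            if len(cols) < 4:
                raise ValueError(f"expected 4 tab-separated columns: {line!r}")
            tokens.append(Token(int(cols[0]), cols[1], cols[2], cols[3]))
    if tokens:
        yield Sentence(source, tokens)


# The first three token lines of the example sentence, with explicit tabs.
SAMPLE = [
    "#\tde.wikipedia.org/wiki/Manfred_Korfmann [2009-10-17]\n",
    "1\tAufgrund\tO\tO\n",
    "2\tseiner\tO\tO\n",
    "3\tInitiative\tO\tO\n",
    "\n",
]

for sentence in read_sentences(SAMPLE):
    print(sentence.source, [t.form for t in sentence.tokens])
```

For a file on disk, passing an open file handle (opened with `encoding="utf-8"`) in place of `SAMPLE` should work the same way, since the function only iterates over lines.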
Name spans are encoded in the BIO-scheme. Outer spans are encoded in the third column, embedded spans in the fourth column. We have refrained from adding a third column for third-level embedded spans for the purpose of this
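To illustrate how the BIO tags in the two annotation columns map to entity spans, here is a small sketch that decodes one column at a time. The function name `bio_to_spans` is hypothetical, and treating a stray `I-` tag as the start of a new span is a lenient choice made for this example, not something the data format prescribes. The input is the tail of the example sentence above (tokens 19 to 27).

```python
from typing import List, Optional, Tuple

Span = Tuple[int, int, str]  # (start index, end index exclusive, label)


def bio_to_spans(tags: List[str]) -> List[Span]:
    """Convert a sequence of BIO tags into labeled spans."""
    spans: List[Span] = []
    start: Optional[int] = None
    label: Optional[str] = None
    for i, tag in enumerate(tags):
        prefix, _, name = tag.partition("-")
        if prefix == "I" and label is not None and name == label:
            continue  # the open span continues
        if label is not None:
            spans.append((start, i, label))  # close the open span
            start, label = None, None
        if prefix in ("B", "I"):
            # Lenient assumption: an I- tag with no matching open span starts one.
            start, label = i, name
    if label is not None:
        spans.append((start, len(tags), label))
    return spans


# Tokens 19-27 of the example sentence with their outer and embedded tags.
tokens = [",", "„", "Troia", "-", "Traum", "und", "Wirklichkeit", "“", "."]
outer = ["O", "O", "B-OTH", "I-OTH", "I-OTH", "I-OTH", "I-OTH", "O", "O"]
inner = ["O", "O", "B-LOC", "O", "O", "O", "O", "O", "O"]

for column in (outer, inner):
    for start, end, label in bio_to_spans(column):
        print(label, " ".join(tokens[start:end]))
# prints:
# OTH Troia - Traum und Wirklichkeit
# LOC Troia
```

Decoding each column independently keeps the nesting implicit: an embedded span from the fourth column always lies inside some outer span from the third.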
Task Setup
We split the dataset into training, development and test sets and provide the datasets in a tab-separated (TSV) format.
The full dataset can be downloaded here (local mirror here). It is split into the following subsets:
- Training Set
- Development Set
- Test Set (in unannotated form and in annotated form)
In an ongoing project, people from ABBYY are correcting annotation errors. Their project is available at http://www.dialog-21.ru/en/germeval2014/
More details and the official evaluation scripts are available at https://sites.google.com/site/germeval2014ner/. The publications listed below describe the annotation guidelines and the outcome of the GermEval 2014 challenge.
Publications
- Darina Benikova, Chris Biemann, Marc Reznicek (2014): NoSta-D Named Entity Annotation for German: Guidelines and Dataset. In Proceedings of LREC 2014, Reykjavik, Iceland (pdf)
- Darina Benikova, Chris Biemann, Max Kisselew, Sebastian Pado (2014): GermEval 2014 Named Entity Recognition Shared Task: Companion Paper. In Proceedings of the KONVENS GermEval Shared Task on Named Entity Recognition, Hildesheim, Germany (pdf)