Complex word identification datasets
Introduction
Complex Word Identification (CWI) is a sub-task of lexical simplification (LS) that identifies difficult words or phrases in a text. Very few CWI datasets are available, and most are limited to English. To alleviate this problem, we have collected CWI datasets for English, German, and Spanish.
Data set collection procedures
We collected complex word and phrase annotations (sequences of words of at most 50 characters) from native and non-native English, German, and Spanish speakers using the Amazon Mechanical Turk (MTurk) crowdsourcing platform. Annotations are collected at the paragraph level: each HIT (Human Intelligence Task) presents a paragraph of 5-10 sentences.
The English datasets cover three genres:
- Professionally written news
- News written by amateurs (WikiNews)
- Wikipedia articles
The German and Spanish datasets are compiled from German Wikipedia and Spanish Wikipedia articles.
Data formats
These datasets contain complex phrase (CP) annotations together with annotator statistics. Each line represents a sentence with one complex phrase annotation and its associated information, with fields separated by a TAB character.
- The first column shows the HIT ID of the sentence. All sentences with the same ID belong to the same HIT.
- The second column shows the sentence that contains the complex phrase annotation.
- The third and fourth columns display the start and end offsets of the complex phrase annotation in this sentence.
- The fifth column represents the actual complex phrase annotation.
- The sixth, seventh, and eighth columns show the number of native annotators, the number of non-native annotators and the total number of annotators who have marked this complex phrase.
Examples
ID1 Both China and the Philippines flexed their muscles on Wednesday. 31 37 flexed 2 7 9
ID1 Both China and the Philippines flexed their muscles on Wednesday. 31 51 flexed their muscles 4 2 6
Here, the phrase "flexed" is marked as complex by 2 native and 7 non-native English speakers, whereas the phrase "flexed their muscles" is marked by 4 native and 2 non-native English speakers.
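The column layout above can be read with a few lines of Python. The following is a minimal sketch, not an official loader: the names `ComplexPhrase` and `parse_line` are hypothetical, and the sample string rebuilds the two example rows with TAB separators as described under Data formats. Judging from these two rows, the offsets appear to be zero-based character positions with an exclusive end, so the phrase can be recovered by slicing the sentence; treat that as an observation from the examples rather than a documented guarantee.

```python
from typing import NamedTuple


class ComplexPhrase(NamedTuple):
    """One annotation line (hypothetical container, not part of the dataset)."""
    hit_id: str
    sentence: str
    start: int
    end: int
    phrase: str
    native: int
    non_native: int
    total: int


def parse_line(line: str) -> ComplexPhrase:
    """Split one TAB-separated line into the eight documented columns."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8:
        raise ValueError(f"expected at least 8 columns, got {len(fields)}")
    hit_id, sentence, start, end, phrase, native, non_native, total = fields[:8]
    return ComplexPhrase(hit_id, sentence, int(start), int(end), phrase,
                         int(native), int(non_native), int(total))


# The two example rows from this page, rebuilt with TAB separators.
SENTENCE = "Both China and the Philippines flexed their muscles on Wednesday."
SAMPLE = (
    f"ID1\t{SENTENCE}\t31\t37\tflexed\t2\t7\t9\n"
    f"ID1\t{SENTENCE}\t31\t51\tflexed their muscles\t4\t2\t6\n"
)

records = [parse_line(line) for line in SAMPLE.splitlines()]
for rec in records:
    # Assumption inferred from the examples: zero-based start, exclusive end.
    assert rec.sentence[rec.start:rec.end] == rec.phrase
    # In both example rows the total is the sum of the two group counts.
    assert rec.native + rec.non_native == rec.total
    print(rec.hit_id, repr(rec.phrase), rec.native, rec.non_native, rec.total)
```

To read a downloaded file, the same function can be applied to each line of the file opened with UTF-8 encoding; records sharing a `hit_id` belong to the same HIT.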
Download
The datasets are available from here
Complex Word Identification (CWI) Shared Task 2018 Dataset
The CWI datasets were used for the Complex Word Identification (CWI) Shared Task 2018, with an additional dataset for French included. The dataset formats are described on the CWI 2018 shared task page (here).
The datasets as used in the CWI 2018 shared task are available here.
The rankings of the participants for the binary classification task and the probabilistic classification task are available here and here, respectively.
CWI Dataset with Language levels
This version of the CWI dataset contains additional information about the language level of the non-native English annotators. The data format is the same as that of the CWI shared task dataset, plus four additional columns. These columns show the number of non-native annotators at the advanced, intermediate, and beginner levels, and the number whose level was not provided. The not-provided column counts workers who did not state their English language level during the experiments. The data are available here.
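As a rough illustration, the four language-level counts could be pulled from a row as follows. This sketch assumes the four columns are appended at the end of each TAB-separated line in the order listed above, which this page does not state explicitly; check the downloaded files before relying on it. The function name `level_counts` is hypothetical, and the sample row is a made-up placeholder, not data from the dataset.

```python
from typing import Dict

# Order taken from the description above; trailing position is an assumption.
LEVELS = ("advanced", "intermediate", "beginner", "not_provided")


def level_counts(line: str) -> Dict[str, int]:
    """Return the last four TAB-separated fields as per-level annotator counts."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < len(LEVELS):
        raise ValueError(f"expected at least {len(LEVELS)} columns, got {len(fields)}")
    return {name: int(value) for name, value in zip(LEVELS, fields[-len(LEVELS):])}


# Made-up placeholder row: the leading shared-task columns are elided.
row = "<shared-task columns>\t3\t2\t1\t1"
counts = level_counts(row)
print(counts)
```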
License
The data is distributed under the CC-BY 4.0 license; see https://creativecommons.org/licenses/by/4.0/ for details.
Publications
Please cite one of these publications if you use the data in your research:
- Seid Muhie Yimam, Sanja Štajner, Martin Riedl, and Chris Biemann (2017): CWIG3G2 - Complex Word Identification Task across Three Text Genres and Two User Groups. In Proceedings of The 8th International Joint Conference on Natural Language Processing (IJCNLP 2017). Taipei, Taiwan [for English data in different genres]
- Seid Muhie Yimam, Sanja Štajner, Martin Riedl, and Chris Biemann (2017): Multilingual and Cross-Lingual Complex Word Identification. In Proceedings of The 2017 International Conference on Recent Advances in Natural Language Processing (RANLP). Varna, Bulgaria [for multilingual data]