TWSI Sense Substituter
Author and Contact: Chris Biemann
First, download the TWSI data: twsi-data-1.0.0.zip (62MB packed, 1.3GB unpacked). See also TWSI data.
Next, you need the TWSI code. You can download the Java source and executables (twsi-1.0.2.zip -- 7.6 MB packed, 43MB unpacked) or, if you are a Maven user, use the artifact from Maven Central.
The original release of TWSI (2012-02) is still available: TWSISenseSubstituter.zip (80MB packed, 1.4GB unpacked); UIMA annotator SenseAnnotator.zip (18MB packed, without models).
TWSI Sense Substituter is software that produces lexical substitutions in context for over 1000 frequent nouns. The software processes English text.
This functionality is realized by a supervised word sense disambiguation system, which is trained on sense-labeled occurrences of target words. A separate classification model is trained for each word and is used to decide which sense an unseen occurrence most likely belongs to. Each sense has an associated list of substitutions, which are injected into the text using inline annotation.
Example output (target = capital):
The company manages over 10 million dollars in <target="capital" lemma="capital" sense="capital@@3" confidence="0.98863536" substitutions="[money,27][asset,14][fund,14][investment,14][cash,13][financial asset,10][finance,8][stock,8][wealth,6][property,5]"> assets .
When imposing <target="capital" lemma="capital" sense="capital@@5" confidence="0.99133354" substitutions="[major,6][main,5][essential,4][capital offense,3][death penalty,3][execution,2][extreme,2][ultimate,2]"> punishment , the state becomes a murder of sorts .
In the first sentence, the classifier identifies the sense of the target as "capital@@3" and yields the substitution list for that sense, in this case "[money,27][asset,14][fund,14]...". The numbers indicate how often people supplied the substitution for this sense in an annotation task; see the references for details.
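For downstream processing, the inline annotation can be turned back into structured data. The following is a minimal Python sketch, assuming the attribute order and quoting are exactly as in the example output above and that substitutions contain no commas or square brackets. The names `Annotation` and `parse_annotations` are hypothetical helpers, not part of the TWSI distribution.

```python
import re
from dataclasses import dataclass

# One inline annotation, in the attribute order of the example output.
TAG_RE = re.compile(
    r'<target="(?P<target>[^"]*)" lemma="(?P<lemma>[^"]*)" '
    r'sense="(?P<sense>[^"]*)" confidence="(?P<confidence>[^"]*)" '
    r'substitutions="(?P<subs>[^"]*)">'
)
# One "[substitution,count]" pair inside the substitutions attribute.
# ASSUMPTION: substitutions never contain ',' '[' or ']'.
SUB_RE = re.compile(r'\[([^,\[\]]+),(\d+)\]')


@dataclass
class Annotation:
    target: str
    lemma: str
    sense: str
    confidence: float
    substitutions: list  # list of (substitution, count) tuples


def parse_annotations(line):
    """Return all inline annotations found in one line of output."""
    found = []
    for m in TAG_RE.finditer(line):
        subs = [(s, int(n)) for s, n in SUB_RE.findall(m.group("subs"))]
        found.append(Annotation(
            target=m.group("target"),
            lemma=m.group("lemma"),
            sense=m.group("sense"),
            confidence=float(m.group("confidence")),
            substitutions=subs,
        ))
    return found
```

Calling `parse_annotations` on the first example sentence gives one `Annotation` whose `sense` is `capital@@3` and whose first substitution is `("money", 27)`.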
Running the Substitution System
The substitution system can be run on the command line. As parameters, it needs a configuration file (which contains paths to the data and the pre-trained classifier models), and the name of a file containing English plain text.
java -Xmx1G -jar SenseSubstituter.jar run -c src/test/resources/Test2/TWSI2_config.conf -t src/main/resources/TWSI2/darmstadt.txt
[-c|--configfile] : configuration file
[-t|--textfile] : text file
Note that at least 1G of heap space is needed (option -Xmx). If you get an out-of-memory error, increase the heap size.
The program loads classifiers as it encounters noun targets. Loading a classifier takes some time -- the program might seem slow in the beginning but speed increases once many classifiers have been loaded.
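The start-up behavior described above is what a load-on-first-use cache produces. The Python sketch below illustrates that pattern only; it is not the actual TWSI implementation, and `LazyModelCache` and the `loader` callback are hypothetical names.

```python
class LazyModelCache:
    """Load a per-word model the first time its target is seen, then reuse it."""

    def __init__(self, loader):
        self._loader = loader   # function: lemma -> model (the slow step)
        self._models = {}       # lemma -> already loaded model
        self.loads = 0          # number of slow loads performed so far

    def get(self, lemma):
        if lemma not in self._models:
            # First occurrence of this target: pay the loading cost once.
            self._models[lemma] = self._loader(lemma)
            self.loads += 1
        return self._models[lemma]
```

With such a cache, a text that mentions "capital" many times triggers only one slow load for that word, which matches the observation that throughput improves as more classifiers are already in memory.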
Most users will merely run the substitution system, or integrate it in their own pipeline, without training their own models.
Training the Substitution System
Train the substitution system yourself if you have additional data or your own data, if you want to use other cluster features, or if you want to try another machine learning algorithm. As far as algorithms go, the best results have been obtained with Weka's AODE, followed by Weka's SMO. Since the latter produces much smaller models that require less memory, the software uses the SMO classifier by default.
For training, you will have to specify the location of the data in a configuration file. Then, you call the training procedure by:
java -Xmx1G -jar SenseSubstituter.jar train -c path/to/your_config_file.conf -tr path/to/your_sense_annotation_file
[-c|--configfile] : configuration file
[-tr|--trainfile] : training file (sentences with sense labels)
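When the tool is called from a script, the run and train invocations differ only in the mode and in the flag for the data file. The Python sketch below assembles the argument list from the options documented above; `build_command` is a hypothetical helper, and the default jar name and heap size are simply the values used in the examples.

```python
def build_command(mode, config_file, data_file,
                  heap="1G", jar="SenseSubstituter.jar"):
    """Build the argument list for a 'run' or 'train' invocation."""
    # 'run' takes the text file via -t, 'train' takes the training file via -tr.
    data_flags = {"run": "-t", "train": "-tr"}
    if mode not in data_flags:
        raise ValueError("mode must be 'run' or 'train', got %r" % (mode,))
    return [
        "java", "-Xmx" + heap, "-jar", jar, mode,
        "-c", config_file,
        data_flags[mode], data_file,
    ]
```

The resulting list can be passed to `subprocess.run(...)`; raising `heap` (for example to `"2G"`) is the adjustment to make after an out-of-memory error.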
The configuration file needs to specify the following:
- taggerModelDir: the directory where the POS tagger models are located
- taggerModelPrefix: filename prefix for the POS tagger models
- mainDir: the directory where all other data is located.
- trainingsData: sentences with sense labels in a 4-column, tab separated format
- ambiguousWordsFile: list of words that have more than 1 sense and thus need classification models
- inventoryFile: file containing filenames of word clusters that are used as features in the classifier. See references for details.
- substitutionsFile: file that contains the substitutions per sense
- monosemousWordsFile: list of words that have only a single sense. These words still get substitutions, but do not need to be disambiguated by a classifier.
- lemmaMapFile: mapping between full forms and lemma targets (e.g. houses -> house)
- modelFolder: folder where the classification models are stored
- wekaClassifier: the name of the WEKA classifier
- wekaOptions: options of the WEKA classifier (default empty)
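Before starting a long training run, it can help to check that a configuration file names every setting listed above. The Python sketch below does this under a loud assumption: it treats the file as `key = value` lines with `#` comments, which this page does not confirm, so compare against the shipped configuration files before relying on it. `parse_config` and `missing_keys` are hypothetical helpers.

```python
# Settings listed above; wekaOptions may be empty, so it is treated as optional.
REQUIRED_KEYS = [
    "taggerModelDir", "taggerModelPrefix", "mainDir", "trainingsData",
    "ambiguousWordsFile", "inventoryFile", "substitutionsFile",
    "monosemousWordsFile", "lemmaMapFile", "modelFolder", "wekaClassifier",
]


def parse_config(text):
    """ASSUMPTION: one 'key = value' pair per line, '#' starts a comment line."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not a key/value line under the assumed format
        settings[key.strip()] = value.strip()
    return settings


def missing_keys(settings):
    """Return the required settings that are absent, in the order listed above."""
    return [key for key in REQUIRED_KEYS if key not in settings]
```

An empty result from `missing_keys(parse_config(text))` means all listed settings are present; it does not check that the referenced files exist.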
As an example, check the configuration files in the conf/ folder and the respective files referenced therein.
- ambiguousWords = One word per line
- inventories = first line:#inventories, rest: one filename per line
- 4ColTrainingData = tab separated, Col1:SenseClass Col2:target Col3:Lemma Col4:Sentence
- substitutionFile = tab separated, Col1:SenseClass Col2:target Col3:substitution Col4:#occurrences/weight
- lemmafile = tab separated, Col1:target Col2:fullform
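The tab-separated layouts above are straightforward to read programmatically. The Python sketch below parses the 4-column training data and the substitution file, then renders the substitutions of one sense in the bracketed form shown in the example output. It assumes the fourth substitution column holds an integer count and treats the SenseClass column as an opaque string; the function names are hypothetical.

```python
from collections import defaultdict


def read_training_rows(lines):
    """Parse rows of: SenseClass <TAB> target <TAB> Lemma <TAB> Sentence."""
    rows = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Split at most 3 times so the sentence column stays in one piece.
        sense, target, lemma, sentence = line.split("\t", 3)
        rows.append({"sense": sense, "target": target,
                     "lemma": lemma, "sentence": sentence})
    return rows


def read_substitutions(lines):
    """Parse rows of: SenseClass <TAB> target <TAB> substitution <TAB> count."""
    by_sense = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        sense, target, substitution, count = line.split("\t")
        # ASSUMPTION: the weight column is an integer count.
        by_sense[(sense, target)].append((substitution, int(count)))
    for subs in by_sense.values():
        subs.sort(key=lambda pair: -pair[1])  # most frequent first
    return dict(by_sense)


def render_substitutions(subs):
    """Format (substitution, count) pairs like the inline annotation does."""
    return "".join("[%s,%d]" % (s, n) for s, n in subs)
```

Keying the result by the pair (SenseClass, target) avoids guessing how sense identifiers are written in the file.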
For training and running, the same configuration file should be used.
Distribution and Licenses
The software is distributed under the GNU General Public License (GPL), see LICENSE.TXT. The reason is that the code references the Weka Machine Learning library and some tools from the Stanford NLP software, both of which are distributed under this license. In case you need a different license, please contact us.
The data used to train the models, the TWSI 2.0, is distributed under a Creative Commons License (share-alike).
Please cite the following paper(s) if you use this software in your research:
- [Technical aspects of data and software] Biemann, C. (2012): Turk Bootstrap Word Sense Inventory 2.0: A Large-Scale Resource for Lexical Substitution. Proceedings of LREC 2012, Istanbul, Turkey.
- [Software and Methodology] Biemann, C. (2012): Creating a system for lexical substitutions from scratch using crowdsourcing. Language Resources and Evaluation, Vol. 47, No. 1, pp. 97-112. Springer. DOI 10.1007/s10579-012-9180-5.