Contextualizing Distributional Similarity and Mapping to Knowledge
This Apache-licensed project provides a software solution for automatic text expansion using contextualized distributional similarity. The main contributors are
- TU Darmstadt, Computer Science Department, FG Language Technology
- IBM T.J. Watson Research - Watson Technology
Please go to www.jobimtext.org for a detailed project description, or see the demo at http://maggie.lt.informatik.tu-darmstadt.de:10080/jobim/
Download and Contributing
The project is hosted on SourceForge at https://sourceforge.net/p/jobimtext and is available under the Apache 2.0 License. Please contact us if you plan to contribute to or use the software.
- Chris Biemann (UHH, project lead)
- Bonaventura Coppola (TU DA, frame semantics, large scale engineering)
- Alfio Gliozzo (IBM, project lead)
- Michael R. Glass (IBM, machine learning)
- Martin Riedl (UHH, lead developer, scientific experiments, software engineering)
- Eugen Ruppert (UHH, developer, experiments, documentation)
- Riedl M., Steuer R., Biemann C. (2014): Distributed Distributional Similarities of Google Books over Centuries. Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC 2014), Reykjavik, Iceland. (pdf, resources)
- Gliozzo A., Biemann C., Riedl M., Coppola B., Glass M. R., Hatem M. (2013): JoBimText Visualizer: A Graph-based Approach to Contextualizing Distributional Similarity. Proceedings of the 8th Workshop on TextGraphs in conjunction with EMNLP 2013 (pdf)
- Biemann C., Riedl M. (2013): From Global to Local Similarities: A Graph-Based Contextualization Method using Distributional Thesauri. Proceedings of the 8th Workshop on TextGraphs in conjunction with EMNLP 2013 (pdf)
- Riedl M., Biemann C. (2013): Scaling to Large^3 Data: An efficient and effective method to compute Distributional Thesauri. Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, EMNLP 2013 (pdf)
- Journal Article: Biemann, C., Riedl, M. (2013): Text: Now in 2D! A Framework for Lexical Expansion with Contextual Similarity. Journal of Language Modelling 1(1):55–95 (pdf)
- Slides from the Two-Days Tutorial on Watson and the DeepQA Architecture, March 18/19, 2013, TU Darmstadt
- "Beyond Jeopardy! - Adapting Watson to new domains using Distributional Semantics" - slides of Alfio Gliozzo's talk at ICSI Berkeley, November 2012
- "Text: Now in 2D — Lexical Expansion using Contextual Similarity" - slides of Chris Biemann's talk at ETS Princeton, September 2012
The Distributional Semantics Manifesto
What is distributional semantics?
Distributional Semantics is about building a fully unsupervised framework for computational semantics. It addresses traditional computational semantics problems such as lexical ambiguity and variability, word sense disambiguation and lexical substitutability, paraphrasing, frame induction and parsing, and textual entailment. Our methodology avoids rule-based systems and hand-labeled data for supervised learning. The goal is to build a semantic analyzer that can adapt itself to new domains and languages after unsupervised learning from large corpora of raw text. At the same time, the output of distributional semantics is a contextual thesaurus, representing sense clusters and the properties characterizing each cluster. Finally, a major goal of the Distributional Semantics framework is to map induced linguistic knowledge to existing knowledge bases, such as semantic web data and databases, allowing entity linking and disambiguation with respect to pre-conceptualized domain models and enabling a new range of applications.
What is the theory and technology behind distributional semantics?
Distributional semantics is based on well-established linguistic theories and on a radical machine learning approach. It has its roots in de Saussure's structural linguistics hypothesis and in the semiotic principles distinguishing expressions from meaning and reference. Structural semantics claims that meaning can be fully defined by semantic oppositions and relations between words, in particular syntagmatic and paradigmatic relations. Paradigmatic relations are established in absentia and represent substitutability between words that preserves meaning, whereas syntagmatic relations are mostly syntactic relations that can be identified by a syntactic parser. The distributional hypothesis, formulated by Zellig S. Harris, claims that paradigmatic relations can be detected by mining distributional properties of syntagmatic relations, allowing us to acquire paradigmatic relations in a fully unsupervised way.
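The distributional hypothesis described above can be illustrated with a minimal sketch. It assumes a handful of made-up (word, context feature) observations standing in for syntagmatic relations from a parser, and it scores two words as paradigmatically related by counting the context features they share. The data, the feature notation, and the function name are hypothetical and are not taken from the JoBimText code base.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy observations: (word, syntagmatic context feature) pairs,
# such as a syntactic parser might produce. Not real corpus data.
OBSERVATIONS = [
    ("cat", "subj_of:purr"),
    ("cat", "obj_of:feed"),
    ("cat", "mod:furry"),
    ("dog", "obj_of:feed"),
    ("dog", "mod:furry"),
    ("dog", "subj_of:bark"),
    ("car", "obj_of:drive"),
    ("car", "mod:fast"),
]


def shared_feature_counts(observations):
    """Return {(word_a, word_b): number of shared context features}."""
    # Collect the set of context features observed for each word.
    features = defaultdict(set)
    for word, feature in observations:
        features[word].add(feature)

    # Compare every pair of words and keep those with at least one
    # context feature in common.
    similarities = {}
    for a, b in combinations(sorted(features), 2):
        overlap = len(features[a] & features[b])
        if overlap:
            similarities[(a, b)] = overlap
    return similarities


SIMILARITIES = shared_feature_counts(OBSERVATIONS)
print(SIMILARITIES)
```

In this toy setting, "cat" and "dog" come out as similar because they occur in the same contexts, while "car" shares no context with either. No labeled data is involved, which is the sense in which the acquisition is unsupervised.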
On the machine learning side, unsupervised learning and complex systems are part of the Distributional Semantics framework. We target algorithms that can be parallelized and executed on large computer clusters for scalability. To this end, we build local models of semantic relations rather than global models, which allows the computation to be parallelized and executed using search engine technology such as inverted indices and MapReduce.
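As a rough illustration of why local models parallelize well, the following sketch builds an inverted index from context features to words and then derives word-pair counts one posting list at a time. It runs in a single process and only mimics the map and reduce steps; the toy data and all function names are assumptions made for this example, not the project's actual pipeline.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical toy input: (word, context feature) pairs.
PAIRS = [
    ("coffee", "obj_of:drink"),
    ("coffee", "mod:hot"),
    ("coffee", "obj_of:brew"),
    ("tea", "obj_of:drink"),
    ("tea", "mod:hot"),
    ("bike", "obj_of:ride"),
]


def build_inverted_index(pairs):
    """Group words by context feature: feature -> set of words."""
    index = defaultdict(set)
    for word, feature in pairs:
        index[feature].add(word)
    return index


def pairs_for_feature(words):
    """Map-like step: emit every word pair that shares one feature.

    Each posting list is processed without looking at any other,
    so on a cluster these calls could run on different machines.
    """
    return list(combinations(sorted(words), 2))


def count_shared_features(index):
    """Reduce-like step: sum the emitted pairs over all features."""
    counts = Counter()
    for words in index.values():
        counts.update(pairs_for_feature(words))
    return counts


INDEX = build_inverted_index(PAIRS)
COUNTS = count_shared_features(INDEX)
print(COUNTS)
```

The design point is that no step needs a global view of the data: each feature's posting list is handled independently, and the partial counts are merged only at the end.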
Some relevant references for this project, ordered by topic.
Linguistics, Distributional Semantics, Structuralism
de Saussure, F. (1916). Cours de linguistique générale. Librairie Payot & Cie, Paris.
Harris, Z. (1954). Distributional Structure. Word 10 (2/3)
Miller, G. A., Charles, W. G. (1991): Contextual Correlates of Semantic Similarity. Language and Cognitive Processes 6 (1): 1–28
Biemann, C. (2011): Structure Discovery in Natural Language. In G. Hirst, E. Hovy and M. Johnson (Series Eds.): Theory and Applications of Natural Language Processing, Springer Heidelberg Dordrecht London New York
Gliozzo, A., Strapparava, C. (2009): Semantic Domains in Computational Linguistics. Springer. ISBN: 978-3-540-68156-4
Lin, D. (1998). Automatic retrieval and clustering of similar words. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and the 17th International Conference on Computational Linguistics, volume 2 of ACL ’98, pages 768–774, Stroudsburg, PA, USA. Association for Computational Linguistics.
Bär, D., Biemann, C., Gurevych, I., and Zesch, T. (2012). UKP: Computing semantic textual similarity by combining multiple content similarity measures. In Proceedings of the 6th International Workshop on Semantic Evaluation, pages 435–440.
Biemann, C. (2010): Co-occurrence Cluster Features for Lexical Substitutions in Context. Proceedings of the 5th Workshop on TextGraphs in conjunction with ACL 2010, Uppsala, Sweden
Biemann, C. (2006): Chinese Whispers - an Efficient Graph Clustering Algorithm and its Application to Natural Language Processing Problems. Proceedings of the HLT-NAACL-06 Workshop on Textgraphs-06, New York, USA
Widdows, D. and Dorow, B. (2002): A graph model for unsupervised lexical acquisition. In Proceedings of the 19th international conference on Computational linguistics - Volume 1 (COLING '02), Vol. 1.
Machine Learning for Contextualization
Viterbi, A.J. (1967). Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory 13 (2): 260–269. doi:10.1109/TIT.1967.1054010
Hastings, W.K. (1970). "Monte Carlo Sampling Methods Using Markov Chains and Their Applications". Biometrika 57 (1): 97–109. doi:10.1093/biomet/57.1.97
Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022.