Release of a new web-scale dependency-parsed corpus based on CommonCrawl
27 October 2017, by Reinhard Zierke
The LT group released DepCC, a new large dependency-parsed corpus constructed from the web crawls of the CommonCrawl project. DepCC is the largest linguistically analyzed English corpus to date: it comprises 365 million documents, 252 billion tokens, and 7.5 billion named entity occurrences in 14.3 billion sentences from a web-scale crawl. We demonstrated the utility of the corpus on the verb similarity task: a distributional model trained on DepCC outperforms state-of-the-art verb similarity models trained on smaller corpora such as Wikipedia. The corpus is available for direct download and on Amazon S3.
The files are currently distributed from a server at Hamburg University: http://ltdata1.informatik.uni-hamburg.de/depcc. In addition, you can access the corpus from Amazon S3, as it was published permanently by CommonCrawl as a part of their "