Release of a new web-scale dependency-parsed corpus based on CommonCrawl

27 October 2017, by Reinhard Zierke

The LT group has released DepCC, a new large dependency-parsed corpus constructed from the web crawls of the CommonCrawl project. DepCC is the largest linguistically analyzed English corpus to date: it comprises 365 million documents from a web-scale crawl, amounting to 14.3 billion sentences, 252 billion tokens, and 7.5 billion named entity occurrences. We demonstrated the utility of the corpus on the verb similarity task: a distributional model trained on DepCC outperforms state-of-the-art verb similarity models trained on smaller corpora, such as Wikipedia. The corpus is available for download and can also be accessed in the Amazon cloud via the S3 file system.

The files are currently distributed from a server at Hamburg University: http://ltdata1.informatik.uni-hamburg.de/depcc. In addition, the corpus is accessible on Amazon S3, since CommonCrawl has published it permanently as part of their "contrib" resources: https://commoncrawl.s3.amazonaws.com/contrib/depcc/CC-MAIN-2016-07/index.html. If you use the corpus on Amazon, you do not need to download it: you can use it directly from the S3 file system for free (in the us-east-1 region) from any Amazon EC2 instance. More details are available on the official page of the corpus.
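
For readers who want to address the corpus through S3 tooling rather than a browser, the following minimal sketch derives the bucket name and key prefix from the Amazon URL given above. The helper name `s3_location` is hypothetical, and the code assumes the URL follows the virtual-hosted S3 style (`<bucket>.s3.amazonaws.com/<key>`); it does not contact Amazon or assume any particular file names inside the corpus directory.

```python
from urllib.parse import urlparse

# URL copied from the announcement above.
INDEX_URL = "https://commoncrawl.s3.amazonaws.com/contrib/depcc/CC-MAIN-2016-07/index.html"


def s3_location(url):
    """Split a virtual-hosted-style S3 URL into (bucket, key prefix).

    Hypothetical helper for illustration only.
    """
    parsed = urlparse(url)
    # Virtual-hosted style: the bucket is the host part before ".s3.".
    bucket = parsed.netloc.split(".s3.")[0]
    # Drop the trailing file name (index.html) to keep only the directory prefix.
    prefix = parsed.path.lstrip("/").rsplit("/", 1)[0] + "/"
    return bucket, prefix


bucket, prefix = s3_location(INDEX_URL)
print(f"s3://{bucket}/{prefix}")
```

The resulting `s3://` location can then be handed to whichever S3 client you use on an EC2 instance; the exact layout of files under that prefix is described on the corpus page and is not assumed here.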