WebCorpus in a Nutshell
The WebCorpus execution chain enables the calculation of statistics on large corpora extracted from web crawls by using the massively distributable MapReduce framework Hadoop.
WebCorpus is a Hadoop-based Java tool chain for processing large corpora extracted from web crawls and computing statistics on them. Its goal is to generate information such as n-gram counts, cooccurrence counts, or isolated sentences from a large corpus of webpages for a language of choice.
Parallel processing of such tasks can lead to a huge performance benefit over serial processing. The MapReduce paradigm provides a programming model for parallel processing, and the Hadoop framework is a massively distributable execution framework for such MapReduce algorithms. The system is designed as a pipeline of various Hadoop MapReduce jobs:
- Extract documents from the raw data and provide them for further tasks in a standardized format aligned with metadata such as document URL and crawl date. We will refer to this job as DocumentJob.
- Web crawling usually leads to a lot of noise, so some basic cleanup tasks need to be performed:
  - Deduplication of documents, as a document might occur multiple times due to recrawling or context variations, such as a print page for a document that already occurred as a normal page. (DeduplicationJob, DeduplicationByHostJob)
  - Filtering of documents with malformed encodings. (UTF8Job)
- Inner segmentation of documents for further processing:
  - Detection of paragraphs in documents. (DocumentJob)
  - Detection of sentences in paragraphs. (SentenceJob)
  - Filtering of sentences with a language other than the chosen language. (LanguageJob)
- Annotate sentences with tokens and parts of speech. (SentenceAnnotateJob)
- Generate corpus of n-grams. (NGramCountJob, POSNGramCountJob, NGramWithPOSCountJob)
- Generate corpus of cooccurrences. (CooccurrenceJob)
- Extract sentences with clearly detected language in a standardized format. (SentenceExtractJob)
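To make the MapReduce model behind the counting jobs more concrete, the following is a minimal sketch of n-gram counting split into a map phase and a reduce phase. It is an illustration only and not the code of NGramCountJob: it runs in memory without Hadoop, it assumes simple whitespace tokenization, and the names NGramCountSketch, map, reduce, and countNGrams are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative only: simulates the map and reduce phases in memory, without Hadoop. */
public class NGramCountSketch {

    /** Map phase: emit one (n-gram, 1) pair for every n-gram of a whitespace-tokenized sentence. */
    static List<Map.Entry<String, Integer>> map(String sentence, int n) {
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        String trimmed = sentence.trim();
        if (trimmed.isEmpty()) {
            return emitted;
        }
        // Assumption: tokens are separated by whitespace; a real tokenizer would be more careful.
        String[] tokens = trimmed.split("\\s+");
        for (int i = 0; i + n <= tokens.length; i++) {
            String ngram = String.join(" ", Arrays.copyOfRange(tokens, i, i + n));
            emitted.add(Map.entry(ngram, 1));
        }
        return emitted;
    }

    /** Reduce phase: sum all counts that share the same n-gram key. */
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            counts.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return counts;
    }

    /** Runs the map phase over every sentence, then reduces the collected pairs. */
    static Map<String, Integer> countNGrams(List<String> sentences, int n) {
        List<Map.Entry<String, Integer>> all = new ArrayList<>();
        for (String sentence : sentences) {
            all.addAll(map(sentence, n));
        }
        return reduce(all);
    }

    public static void main(String[] args) {
        List<String> sentences = List.of("the cat sat", "the cat ran");
        // prints {cat ran=1, cat sat=1, the cat=2}
        System.out.println(countNGrams(sentences, 2));
    }
}
```

In a Hadoop job the map calls would run in parallel over input splits and the framework would group the emitted pairs by key before the reduce step, which is where the performance benefit over serial processing comes from.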
See Figure 1 for a visualization of the proposed execution chain.
Figure 1: Visualization of the WebCorpus Pipeline
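As a rough illustration of the cleanup stage early in the chain, the sketch below shows one possible way to reject documents with malformed UTF-8 and to drop exact duplicates. It is not the implementation of UTF8Job or DeduplicationJob: the class CleanupSketch and its methods are hypothetical, the normalization (collapsing whitespace and lowercasing) is an assumption, and exact matching like this would not catch near-duplicates such as print pages.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Illustrative only: in-memory versions of two cleanup steps, without Hadoop. */
public class CleanupSketch {

    /** Returns true if the raw bytes decode as strict UTF-8, false if they are malformed. */
    static boolean isValidUtf8(byte[] raw) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(raw));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    /** Assumed normalization: trim, collapse whitespace runs, and lowercase. */
    static String normalize(String text) {
        return text.trim().replaceAll("\\s+", " ").toLowerCase(Locale.ROOT);
    }

    /** Keeps the first document seen for each normalized text and drops later copies. */
    static List<String> deduplicate(List<String> documents) {
        Set<String> seen = new HashSet<>();
        List<String> unique = new ArrayList<>();
        for (String document : documents) {
            // Set.add returns false when the normalized text was already present.
            if (seen.add(normalize(document))) {
                unique.add(document);
            }
        }
        return unique;
    }

    public static void main(String[] args) {
        byte[] malformed = {(byte) 0xC3, (byte) 0x28}; // lead byte without a valid continuation byte
        System.out.println(isValidUtf8("hello".getBytes(StandardCharsets.UTF_8))); // prints true
        System.out.println(isValidUtf8(malformed));                                // prints false
        // prints [Hello  World, Other]
        System.out.println(deduplicate(List.of("Hello  World", "hello world", "Other")));
    }
}
```

On a cluster, the same idea could be expressed by emitting a key derived from the normalized text in the map phase and keeping one document per key in the reduce phase.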
Please see http://sourceforge.net/p/webcorpus for further information.