Call for Papers

Over the last decades, Digital Humanities has evolved dramatically, from simple database applications to complex systems involving the most recent state of the art in Computer Science. Language Technology in particular plays a major role, whether for processing the metadata of recorded objects or for analyzing and interpreting content.

Applying Language Technology methods to objects from the humanities in general, and to historical archives in particular, is a challenge for NLP-related research: data is heterogeneous (image/text), often incomplete (e.g. OCR errors), multilingual within one document (historical documents with Latin and/or classical Greek paragraphs) and difficult to structure (paragraphs, titles and pages are somewhat different in historical texts).

Corpus-based methods, nowadays standard in NLP research, often cannot be applied because the large amounts of training data they require are missing.

Moreover, requirements for tools in Digital Humanities, especially tools dedicated to cultural heritage objects, are different from the ones applied to modern texts.

Thus, performing research in Digital Humanities also involves adapting existing NLP tools to historical variants of languages; developing tools for new languages; making tools robust to syntactic deviation; and adapting semantic resources.

Central and Eastern Europe, as well as the Middle East and North Africa, have always been characterized by a high concentration of languages and cultures interacting with each other. Within a relatively small area, texts written in at least 10 alphabets (Arabic, Hebrew, Armenian, Georgian, Greek, Cyrillic, Ge'ez, Syriac, Latin and Coptic) can be found. At the same time, information within these texts is important beyond the borders of a given language or script (e.g. documents in Ge'ez are often translations of lost Coptic or ancient Greek texts). Places, persons and events have language-dependent names but refer to the same individual or geographical location.

Unfortunately, especially in this area, many historical documents are in poor condition; many languages or dialects have become extinct over time, and written evidence of them is rare. Digital methods seem the ideal means for preserving and investigating this rich cultural heritage. However, up to now, concerted activities seem to be absent, probably also due to the lack of adequate NLP resources and tools. Thus, it is necessary to evaluate existing technology, monitor current activities and network research teams in this area, all of which are aims of the proposed workshop.

We are looking for original, unpublished work related to (but not limited to) one of the following topics:

Corpora of diachronic variants and language dialects,
NLP tools for processing historical documents,
Intelligent search in digital archives,
(Semi-)automatic (meta-)annotation of historical texts,
Treating uncertain and vague information from historical documents,
Ontologies for historical texts,
Evaluation of current frameworks (CLARIN, DARIAH) on DH objects related to historical texts,
Machine learning approaches for under-resourced DH objects,
Methods for dealing with incompletely specified objects (e.g. partially known features or values),
Automatic extraction of metadata,
Metadata interoperability for digital objects,
Intelligent search in digital historical archives,
Geo- and time references in historical documents,

focusing on languages from the above-mentioned area.