Modelling Vagueness and Uncertainty in DH
Digital Humanities (DH) aims not only to archive and make available materials (in particular historical artefacts) but also to bring greater scientific reflection into the humanities by promoting computational methods. However, more than ten years of consistent use of computer-aided research have not led to a hermeneutically adequate digital modelling of historical objects. In most DH attempts, the main crux remains the storage of objects in database architectures designed for natural-science applications, annotation with very general metadata, mark-up with shallow linguistic information regardless of the language or the purpose of the document, and quantitative analysis. Not only do images and texts become artificially precise, but the mutual illumination of texts and other media loses its traditional hermeneutic power.
Vagueness is one of the most important, most significant, but most difficult features of historical objects, especially texts and images. Whereas ambiguity (several distinct but clear meanings) and uncertainty (conceptually clear but unknown or forgotten data) are relatively well-describable phenomena, vagueness is not defined by semantics or pragmatics.
This workshop aims to bring together, for the first time, experts in the representation of vagueness and uncertainty and scholars from DH who have gone beyond the state of the art in their research and tried to apply existing theories such as fuzzy logic in their work.
Programme & Abstracts 09.07.2020
10:00 – 10:15 Opening session
10:15 – 11:15 Session 1 Chair: Walther v. Hahn
Invited Talk: Manfred Thaller, “The Fog of History”
Historians frequently have to make decisions based on evidence which is incomplete, contradictory and anything but precise: the “Fog of History”. Historical information systems, and the software systems used for their implementation, must therefore be based upon a model of information which reflects these properties of the available sources.
The presentation starts by categorizing the problems contributing to this situation:
Examples for each of these problems are briefly discussed.
We propose that technical support for these problems can be provided by solutions for four technical challenges:
Examples for each type of solution are given, referring to particularly promising branches of computer science.
Finally, a rough sketch of a structural model to integrate these solutions is given. It assumes that the basic data structure needed for such a model is graph-oriented, using graphs not only for the data to be held by the information system but also for most of the supporting objects representing the knowledge needed for interpretation. The implementation of these graphs has to embed the mathematical structure of the graph into an environment which, however, goes beyond the classical definition of a graph.
11:15 – 11:30 Break
11:30 – 12:15 Session 2 Chair: Michael Piotrowski
Cristina Vertan, Walther v. Hahn, University of Hamburg: „Detecting, Processing and Visualising Vagueness and Uncertainty Sources in Multilingual Historical Data Collections“
Digital methods can facilitate analysis of the reliability of translations as well as of the historical facts claimed by the author. In order to be effective, these methods must consider an intrinsic feature of natural language: the ability to produce vague utterances. The project HerCoRe (Hermeneutic and Computer-based Approaches for Investigating Reliability, Consistency and Vagueness in Historical Sources) aims at modelling and annotating five levels of vague assertions.
We develop an annotation formalism which allows:
The knowledge-base backbone is a fuzzy ontology modelled in OWL 2. We distinguish between fixed concepts and relations (like geographical elements: river, mountain, island) and notions for which several “contexts” can be defined. E.g. a geographical notion like “Danube” is, within one historical context, a border of the administrative notion “Ottoman Empire”, and in another one the border of the so-called administrative notion “Roman Empire”. The historical contexts are specified by further fuzzy data properties (e.g. time, placement).
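The idea of context-dependent fuzzy assertions can be sketched in a few lines of Python. This is a hypothetical illustration only: the context labels and membership degrees are invented, and the project's actual OWL 2 ontology is not reproduced here.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FuzzyAssertion:
    subject: str
    relation: str
    obj: str
    context: str   # historical context in which the assertion holds
    degree: float  # membership degree in [0, 1]

class FuzzyKB:
    """Toy store for context-dependent fuzzy assertions."""
    def __init__(self):
        self.assertions = []

    def add(self, *args):
        self.assertions.append(FuzzyAssertion(*args))

    def degree(self, subject, relation, obj, context):
        """Degree to which an assertion holds in a given context (0.0 if absent)."""
        for a in self.assertions:
            if (a.subject, a.relation, a.obj, a.context) == (subject, relation, obj, context):
                return a.degree
        return 0.0

kb = FuzzyKB()
# "Danube" borders different polities depending on the historical context:
kb.add("Danube", "borderOf", "Ottoman Empire", "ctx_ottoman_17c", 0.9)
kb.add("Danube", "borderOf", "Roman Empire", "ctx_roman_2c", 0.8)

print(kb.degree("Danube", "borderOf", "Ottoman Empire", "ctx_ottoman_17c"))  # 0.9
print(kb.degree("Danube", "borderOf", "Ottoman Empire", "ctx_roman_2c"))     # 0.0
```

The same query about the Danube returns different degrees depending on which historical context is supplied, which is the behaviour the context mechanism above is meant to capture.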
For the detection of linguistic vagueness we follow a multilingual approach. We initially collected lists of indicators in the three languages involved in the project (Latin, German and Romanian). Based on (Pinkal 1980) and (Pinkal 1985), we distinguish between:
The initial collections of linguistic indicators are enriched through synsets in the corresponding Wordnets.
In this contribution we will present the general set-up of the system, the annotation framework, as well as the computer-based approach for marking different types of vagueness and uncertainty.
12:15 – 12:30 Discussions
12:30 – 13:30 Lunch Break
13:30 – 15:45 Session 3 Chair: Cristina Vertan
13:30 – 14:00 Wieslawa Duzy, Polish Academy of Sciences: „Unclarity and Uncertainty of Historical Spatial Data from Poland“
The paper presents experiences and case studies analysed and elaborated in the Department of Historical Atlas, Institute of History, Polish Academy of Sciences. The projects discussed focus mostly on spatial data, covering, among others, place names, types of settlements, and their locality. Our research questionnaire also includes mereological issues and relations, as well as the identification and georeferencing of historical settlements. Data sets are collected from various historical sources, i.e. 16th-century tax registers, 18th- and 19th-century cadastres, and 20th-century national surveys. Last but not least, historical maps are essential for our projects.
The paper will focus on the unclarity and uncertainty of historical sources, with some insight into the discussion on building stable identifiers and semantics. It will discuss the whole process of collecting historical data and preparing it to be explained and understood properly, and to be joined with external data sets. Some of our former results, outcomes and ongoing research are presented online:
14:00 – 14:30 Davor Lauc, University of Zagreb: „Reasoning about inexact temporal expressions using fuzzy logic and deep learning“
The problem of temporal reasoning has been known since antiquity. The contemporary logical analysis began in the second half of the twentieth century, when Arthur Prior developed temporal logic as a variant of modal logic. An alternative approach, based on classical predicate logic, was devised by the philosopher Donald Davidson. Davidson's proposal was an inspiration for the event calculus as well as the interval algebra developed by James Allen. To this day, many other approaches to representation and reasoning in this domain have been developed, such as temporal databases, temporal logic programming and many others.
All these approaches represent temporal determinants either as a point on the timeline or as a uniform interval. Many temporal expressions occurring in a range of social sciences and humanities, particularly history, as well as in everyday reasoning, cannot be adequately represented in this way. For example, if something happened during the Industrial Revolution, it is possible that it happened in 1745, but less likely than in 1801. Also, if we have an exact record that a person became a mother in 1848, she could have been born on any day from 1790 to 1839, but is more likely to have been born in 1828 than in 1800. A second level of uncertainty stems from the credibility or reliability of the source. For example, when verifying whether two events in two historical references are concurrent, and there is no exact overlap, we will be inclined to regard an event that occurred in the temporal vicinity as more likely the same, given the possibility of error in the source.
In this research, we investigate possibilities of representing the semantics of inexact temporal expressions using fuzzy sets. For the purpose of empirical validation of the reasoning models, the first part of the research includes developing a NER system for recognising inexact dates in text. Although there are well-developed models for identifying temporal expressions, such as SUTime and HeidelTime, their support for vague temporal expressions is limited. The developed transformer-based model is applied to the English Wikipedia with a satisfactory F-score. The second part of the research is the development of a neural model for generating a fuzzy set representing the meaning of a temporal expression. Fuzzy sets representing temporal data at the granularity of dates are huge, so a model for dimensionality reduction is developed to facilitate efficient storage and manipulation of these sets. The third part of the research is the development of different formal models for relations and operations on inexact dates. The fundamental and most challenging relation is that of similarity between two imprecise dates. It is modelled using a fuzzy-logic t-norm operation, with parameters learned from empirical data. The fourth part of the research includes a logical and philosophical analysis of the problem of similarity in the context of temporal reasoning.
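The kind of representation described above can be illustrated with a minimal Python sketch, assuming trapezoidal membership functions over years and the minimum t-norm. The year ranges below are invented for illustration and are not the project's learned parameters.

```python
# An inexact date as a trapezoidal fuzzy set over years, and a
# possibility-style similarity using the minimum t-norm:
#   sim(A, B) = max_x min(mu_A(x), mu_B(x))

def trapezoid(a, b, c, d):
    """Membership function: rises on a..b, plateau on b..c, falls on c..d."""
    def mu(x):
        if x <= a or x >= d:
            return 0.0
        if b <= x <= c:
            return 1.0
        if x < b:
            return (x - a) / (b - a)
        return (d - x) / (d - c)
    return mu

# "During the Industrial Revolution" (boundary years are assumptions):
industrial_rev = trapezoid(1740, 1780, 1820, 1850)
# "Around 1801":
around_1801 = trapezoid(1796, 1800, 1802, 1806)

def similarity(mu_a, mu_b, domain):
    """Degree of overlap between two fuzzy dates over a discrete domain."""
    return max(min(mu_a(x), mu_b(x)) for x in domain)

years = range(1700, 1900)
print(industrial_rev(1745))  # possible, but low degree (0.125)
print(industrial_rev(1801))  # full membership (1.0)
print(similarity(industrial_rev, around_1801, years))
```

Note how 1745 receives a small but non-zero degree while 1801 lies on the plateau, matching the intuition in the abstract that both years are possible but not equally likely.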
14:30 – 15:00 Dimitar Iliev, University of Sofia: „What’s in a Name? Encoding Ambiguity in Ancient Greek Inscriptions from Bulgaria“
The significant ancient epigraphic heritage in Bulgaria includes more than 5,000 Greek inscriptions. A selection of them is currently being encoded in EpiDoc TEI XML and visualized and indexed through a customized EFES platform for the purposes of the Telamon epigraphic collection created by the University of Sofia in the framework of the National CLaDA-BG Consortium (CLARIN+DARIAH). The corpus aims mainly at the encoding and representation of historical figures and events, local names and places, dates, numbers, and sacred or administrative offices. In short, rather than creating a linguistic corpus of inscriptional texts, the focus of the collection is on different sorts of Named Entities. Among them, attested names of persons are probably the most interesting for the target scholarly audience of the collection. They are also the most difficult to approach. On the one hand, this is partly due to the ambiguity of reference inherent in proper names by definition. When dealing with historical names, editors of texts, including those of digital corpora, face various methodological problems concerning the identification of a historical person behind a name, or approaching names according to their onomastic and prosopographical value at the same time. The overlapping linguistic, cultural and political traditions of the Roman province of Thrace in the first centuries CE also contribute to the exciting complexity of its naming system. The extent of one's name and the comparative relevance of its different elements and accompanying nicknames or titles is a constantly open issue requiring different approaches to the encoding and processing of digital inscriptions. Last but not least, monuments on stone often provide texts open to different interpretations due to physical damage caused by time or human influence.
Thus, the inevitable lacunae are frequently filled with the aid of differing research hypotheses, constructs and paradigms: a range of alternative information that has to be encoded and modelled by the digital editors. The current talk will present various issues connected with these ambiguities and uncertainties around attested names in the ancient Greek inscriptions from Bulgaria. It will also examine the proposed approaches towards encoding them, creating RDF representations based on them, and indexing them in the front-end service applied in the work on the Telamon corpus (current repository: https://github.com/DHLabUniSofia/Telamon-EFES).
15:00 – 15:45 Discussions
15:45 – 16:00 Break
16:00 – 17:15 Session 4 Chair: Cristina Vertan
16:00 – 16:30 Alexandra Poulovassilis, Birkbeck Knowledge Lab: „Managing missing and uncertain data on the UK museum sector“
There is a general consensus that the UK museum sector has a problem with data. Over the last forty years, numerous reports have commented on the lack of an authoritative list of museums: there is no way of identifying how many museums there are in the UK, what they are, when they opened, and what levels of visitors they have. The Mapping Museums project is in part a response to this issue. The project aims to analyse the emergence and development of the UK museum sector from 1960 until the present day, with particular emphasis on the wave of small independent museums that opened from the mid-1970s onwards. The project has involved extensive archival research, capturing data on over 4,000 museums, conceptualizing that information, and designing ways of searching and visualizing the ensuing knowledge base. Here, we focus on the techniques we adopted within the knowledge base for managing the missing and uncertain data that was encountered during the project's two-year data collection process.
16:30 – 17:00 Marc Kupietz, IDS Mannheim, „Coping with Uncertainty in Synchronous Corpus Linguistics“
Corpus linguistics, like most empirical disciplines, is fundamentally affected by uncertainty. The typical modus operandi of the empirical disciplines is to extrapolate from observations on sample data in order to draw more general conclusions about the properties of the population under study. If the sample does not coincide with the population, such inferences are always subject to uncertainty. However, especially in synchronous corpus linguistics there are further, particular sources of uncertainty. In contrast to historical corpus linguistics, where the known set of published documents can be regarded as the population, but above all in contrast to other disciplines, synchronous corpus linguistics has the fundamental problem that the population typically cannot be defined in a way that is even remotely operationalizable. This means that the quality of a corpus cannot in general be assessed in terms of how well it represents the population as a sample, to what extent it allows inferences, or whether inferential statistics is applicable at all (Koplenig 2017). Further problems arise because 1.) language has aspects of an artifact, so that systematicity is only of limited use as a criterion for plausibility (Keibel & Kupietz 2009), 2.) the context variables that influence language are unknown and variable, so that the application of stratified sampling techniques to avoid sampling errors is difficult (Kupietz 2016), and 3.) interpretative categories such as parts of speech need to be used in the form of automatic annotations that are not necessarily empirically adequate (Belica et al. 2011). In my paper I will discuss some approaches to dealing with these difficulties and to how, in the context of digital humanities and science, as well as sometimes arts and crafts, synchronous corpus linguistics can be performed with a prospect of knowledge gain, despite the many sources of uncertainty.
17:00 – 17:15 Discussions
Programme & Abstracts 10.07.2020
10:00 – 10:45 Session 5 Chair: Walther v. Hahn
Invited Talk: Michael Piotrowski, University of Lausanne: What Are We Uncertain About? The Challenge of Historiographical Uncertainty
When people talk about uncertainty in a historical context in digital humanities, most of the time they talk about questions such as the exact date of birth of a person, whether two names refer to one or two persons, what geographical location a place name refers to, or the location of a person at a specific point in time. These are important questions, and it is important to find ways to computationally model the associated uncertainty. However, history is ultimately not about drawing exact maps or timelines, even if those can certainly help: history is about causality. In this talk, I would like to reflect on some of the issues on the level of historical interpretation, i.e., historiographical rather than historical uncertainty.
10:45 – 11:00 Break
11:00 – 12:30 Session 6 Chair: Manfred Thaller
11:00 – 11:30 Fairouz Zendaoui, Ecole Nationale Supérieure d'Informatique, Alger: „Quantifying and Representing Uncertainty of Historical Information“
The digital humanities are a field of research, teaching and engineering at the crossroads of computer science and the arts, literature, and the human and social sciences. Historical disciplines increasingly turn to digital tools, especially databases. Current interests and efforts focus on the representation of historical knowledge in order to facilitate the diffusion, sharing and exploitation of collective knowledge. Simplifying and structuring qualitatively complex knowledge, and quantifying it in a certain way to make it reusable and easily accessible, are all aspects that are not new to historians. Computer science is currently approaching a solution to some of these issues, or at least making it easier to work with historical data.
11:30 – 12:00 Umberto Straccia, Italian National Research Council, „Fuzziness in Semantic Web languages“
I present the state of the art in representing and reasoning with fuzzy knowledge in Semantic Web languages such as the triple languages RDF/RDFS, the conceptual languages of the OWL 2 family, and rule languages.
12:00 – 12:30 Francesca Lisi, University of Bari: „Representing Fuzzy Quantified Sentences in OWL 2“
Fuzzy quantifiers, such as "many" and "most", are used to express imprecise properties of fuzzy information granules. Examples of fuzzy quantified sentences are the following:
During the talk I will report on a method for representing both kinds of fuzzy quantified sentences in OWL 2 ontologies.
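One standard way to evaluate such sentences, which the talk's OWL 2 encoding may or may not follow, is Zadeh's proportion-based semantics for relative quantifiers. The following Python sketch uses an invented piecewise-linear membership function for "most" and illustrative membership degrees.

```python
def mu_most(p):
    """Fuzzy quantifier 'most': 0 below a proportion of 0.3,
    1 above 0.8, linear in between (thresholds are assumptions)."""
    if p <= 0.3:
        return 0.0
    if p >= 0.8:
        return 1.0
    return (p - 0.3) / 0.5

def truth_most(x_degrees, y_degrees):
    """Truth of 'most X are Y' as mu_most applied to the proportion
    sum(min(x, y)) / sum(x) of membership degrees (Zadeh-style)."""
    num = sum(min(x, y) for x, y in zip(x_degrees, y_degrees))
    den = sum(x_degrees)
    return mu_most(num / den)

# "Most tall people are heavy", with illustrative degrees per person:
tall = [1.0, 0.8, 0.6, 0.2]
heavy = [0.9, 0.8, 0.3, 0.1]
print(truth_most(tall, heavy))
```

The sentence's truth value is itself a degree in [0, 1], which is what makes such quantified statements candidates for representation in a fuzzy extension of OWL 2.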
12:30 – 12:45 Discussions
12:45 – 13:45 Lunch Break
13:45 – 15:30 Session 7 Chair: Cristina Vertan
13:45 – 14:15 Fernando Bobillo, University of Zaragoza: „Fuzzy ontology reasoning and applications“
Fuzzy ontologies have proved to be very useful in many application domains. One of the reasons for their success is the availability of fuzzy ontology reasoners, i.e., software tools that are able to discover implicit knowledge that can be derived from the axioms of a fuzzy ontology. In this talk, we will first examine the fuzzy ontology reasoner fuzzyDL and then overview some important examples of applications using fuzzy ontologies.
The first objective of this talk is to provide an overview of fuzzyDL, which is probably the most mature fuzzy ontology reasoner. We will discuss the fuzzy ontology elements that can be represented, as fuzzyDL supports fuzzy extensions of OWL 2 constructors and axioms, but also some traditional fuzzy logic operators, such as aggregation or defuzzification. We will also overview the supported reasoning tasks (some of them specific to the fuzzy case) and their mutual relationships. The different ways to interact with the tool will be reviewed, as different languages (a native syntax and Fuzzy OWL 2) and interfaces (terminal mode, graphical interface, and a Java API) are supported. Finally, we will provide some key implementation details (such as the underlying reasoning algorithm) and analyze some notable differences with other fuzzy ontology reasoners (paying special attention to DeLorean).
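As a generic illustration of one of the operations mentioned above, defuzzification, here is a minimal centre-of-gravity sketch in Python. This is textbook fuzzy logic, not fuzzyDL's native syntax or API, and the membership function is invented.

```python
def centroid_defuzzify(mu, domain):
    """Centre-of-gravity defuzzification: the crisp value is the
    average of the domain weighted by membership degree."""
    num = sum(x * mu(x) for x in domain)
    den = sum(mu(x) for x in domain)
    return num / den

def triangle(x):
    """Symmetric triangular fuzzy set centred on 30 with width 10."""
    return max(0.0, 1.0 - abs(x - 30) / 10)

# Discretise the domain 15.0 .. 45.0 in steps of 0.1:
xs = [x / 10 for x in range(150, 451)]
print(round(centroid_defuzzify(triangle, xs), 2))  # 30.0 (symmetric set)
```

Because the fuzzy set is symmetric about 30, the centroid coincides with its centre; for skewed sets the defuzzified value shifts towards the heavier side, which is why the choice of defuzzification operator matters in a reasoner.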
The second objective of this talk is to show some notable use cases of fuzzy ontologies in real-world problems. In particular, we will focus on some recent developments in which the speaker has been involved. This includes recommender systems (where the use of linguistic variables allows users to submit flexible queries), human gait recognition (where fuzzy sets make it possible to deal with the imprecision of the sensors that capture the motion), and blockchain smart contracts (where fuzzy ontologies make it possible to obtain partial agreements between two or more parties).
14:15 – 14:45 Jesús Medina, University of Cadiz, DIGital FORensics: evidence Analysis via intelligent Systems and Practices (DigForASP). COST Action 17124. Goals and intermediate achievements.
COST Action DigForASP (digforasp.uca.es) focuses on fostering synergy between security forces, related agencies, institutions of the European Union and neighbouring countries, associations, and companies in the field, in order to establish a solid network for introducing new technologies based on Artificial Intelligence and Automated Reasoning into the digital analysis of evidence, improving processes and obtaining more direct and efficient results. In this talk, we will present the main goals and the current achievements obtained by different members of the Action and the research team Mathematics for Computational Intelligence Systems (M·CIS, harmonic.uca.es/mcis).
14:45 – 15:15 Franziska Diehr, Free University Berlin: „Modelling vagueness in the deciphering of Classic Mayan hieroglyphs - A criteria-based approach for the qualitative assessment of reading proposals“
The project ‘Text Database and Dictionary of Classic Mayan’ aims at creating a machine-readable corpus of all Maya texts and compiling a dictionary on this basis. The characteristics of this complex writing system pose particular challenges to research, resulting in contradictory and ambiguous deciphering hypotheses. With our approach, we present a system for the qualitative evaluation of reading proposals that is integrated into a digital Sign Catalogue for Mayan hieroglyphs, establishing a novel concept for sign systematisation and classification. In the presentation we focus in particular on the modelling process and thus emphasize the role of knowledge representation in digital humanities research.
15:15 – 15:30 Discussions
15:30 – 15:45 Break
15:45 – 16:15 Final Discussions
16:15 – Closing Session