Linked Open Data for Literary History

Extracting, Modeling, Linking and Querying Data on the French Enlightenment Novel

Christof Schöch
(Trier University, Germany)

The International Conference for the Study of the Novel
Cluj-Napoca

21 Jun 2024

Introduction

Thanks

Prof. Adrian Tudurachi and the Institute of Linguistics and Literary History “Sextil Pușcariu” of the Romanian Academy.

The Ministry for Research, Education and Culture of Rhineland-Palatinate, Germany, for funding this research (Mining and Modeling Text, 2019-2023).

Thanks to all the project contributors: Maria Hinzmann, Matthias Bremm, Tinghui Duan, Anne Klee, Johanna Konstanciak, Julia Röttgermann and many others.

Overview

1 – Introduction
2 – Linked Open Data for Literary History
3 – Mining: Information Extraction
4 – Modeling: Linked Open Data
5 – Results: Networked Database
6 – Conclusion

Linked Open Data for Literary History

Two Modes of Data (and Digital Humanities)

Qualitative DH:
- Datasets are typically small, curated, heavily annotated, flawless, specialized (‘smart data’)
- Prototype: digital scholarly editions, e.g. Goethe, Faust Edition
Quantitative DH:
- Datasets are typically large, scraped, unannotated, with errors and biases, generic (‘big data’)
- Prototype: Large Language Models trained on lots of text, e.g. BERT

Third Way: Bigger Smarter Data

Background: What is Machine Learning?

Fundamentally, ML involves detecting relations between features and labels
- Features: simple properties that we can observe in texts (e.g.: word forms)
- Labels: more complex phenomena that are relevant to our research (e.g.: direct speech or metaphors)
We use this approach primarily for information retrieval
- We start with a text collection
- We may annotate part of the data with labels
- Then train a new model, or use an existing model
- Evaluate the performance of the model on the annotated data
- And then derive labels from unannotated text
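
The workflow above can be sketched in a few lines of plain Python. This is only a toy illustration, not the project's actual pipeline: it assumes a very simple Naive Bayes classifier, and the sentences, the label names (`direct_speech`, `narration`) and all function names are invented for the example. Features are word forms and a few punctuation marks; the label is whether a sentence contains direct speech.

```python
import math
from collections import Counter, defaultdict

MARKS = '«»"?!'


def features(sentence):
    """Features: lower-cased word forms, plus quotation and question marks."""
    tokens = []
    for raw in sentence.lower().split():
        word = raw.strip(MARKS + ".,;:")
        if word:
            tokens.append(word)
        tokens.extend(ch for ch in raw if ch in MARKS)
    return tokens


def train(examples):
    """Count features per label (multinomial Naive Bayes, add-one smoothing)."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for sentence, label in examples:
        label_counts[label] += 1
        word_counts[label].update(features(sentence))
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, label_counts, vocab


def predict(model, sentence):
    """Return the label with the highest log-probability for the sentence."""
    word_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    scores = {}
    for label in label_counts:
        n = sum(word_counts[label].values())
        score = math.log(label_counts[label] / total)
        for tok in features(sentence):
            score += math.log((word_counts[label][tok] + 1) / (n + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)


def accuracy(model, examples):
    """Share of annotated examples for which the prediction matches the label."""
    return sum(predict(model, s) == y for s, y in examples) / len(examples)


# Step 1-2: a (tiny, invented) text collection, partly annotated with labels.
annotated = [
    ("« Venez », dit-elle.", "direct_speech"),
    ("« Partons ! » dit-il.", "direct_speech"),
    ("« Où est-il ? » demanda-t-elle.", "direct_speech"),
    ("Le château était silencieux.", "narration"),
    ("Elle traversa le jardin.", "narration"),
    ("Il lut la lettre.", "narration"),
]
held_out = [
    ("« Restez », dit-il.", "direct_speech"),
    ("Elle ferma la porte.", "narration"),
]

model = train(annotated)               # Step 3: train a model
score = accuracy(model, held_out)      # Step 4: evaluate on annotated data
label = predict(model, "« Adieu ! »")  # Step 5: label unannotated text
print(score, label)
```

In practice, the held-out set would be far larger and the model would typically come from an established library; the point here is only the sequence of steps.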

Background: What is Linked Open Data?

Linked Open Data: Multilingualism

Background: What is Literary History?

Goals of literary history
- Collecting and documenting knowledge of literary history
- Providing explanations for the development of literature
Organizational principles
- Nations, periods, movements/currents, genres
- Authors and works: themes, forms, relationships
- Similarities and differences, continuities and change

Literary History in Linked Open Data

Building blocks
- Subjects, including persons (authors, etc.) and works (primary texts, etc.)
- Objects, e.g. themes, locations, protagonists, literary genre, etc.
- Predicates, as required, including: author_of, about, influenced_by, etc.
- Qualifications, e.g. Source (with type, date, URL)
Some exemplary statement types
- Bibliographic: [person] author_of [work]
- Content-based: [work] about [theme]
- Formal: [work] narrative_form [type]
- Relations: [author] influenced_by [work]
- and many more.
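
As a minimal sketch of these building blocks, the following Python code represents statements as subject-predicate-object triples with an optional source qualification. The class and function names are assumptions made for illustration and are not the project's data model; the theme value and the source record (including its URL) are placeholders.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Source:
    """Qualification recording where a statement comes from."""
    type: str
    date: str
    url: str


@dataclass(frozen=True)
class Statement:
    """One subject-predicate-object triple, optionally qualified by a source."""
    subject: str
    predicate: str
    obj: str
    source: Optional[Source] = None


# Hypothetical source record; type, date and URL are placeholders.
handbook = Source(type="handbook chapter", date="2000",
                  url="https://example.org/chapter")

statements = [
    # Bibliographic: [person] author_of [work]
    Statement("Jean-Jacques Rousseau", "author_of", "Julie ou la Nouvelle Héloïse"),
    # Formal: [work] narrative_form [type]
    Statement("Julie ou la Nouvelle Héloïse", "narrative_form", "epistolary"),
    # Content-based: [work] about [theme], here qualified by a source
    Statement("Julie ou la Nouvelle Héloïse", "about", "love", source=handbook),
]


def objects_of(statements, subject, predicate):
    """Return all objects asserted for a given subject and predicate."""
    return [s.obj for s in statements
            if s.subject == subject and s.predicate == predicate]


print(objects_of(statements, "Julie ou la Nouvelle Héloïse", "narrative_form"))
```

Keeping the source on the statement itself, rather than on the work, is what later allows several sources to make different claims about the same work.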

Our project’s key idea: Wikidata for Literary History

Idea: Create a “Wikidata for the history of literature”
- Literary history information system
- LOD-based, with explorative interface and SPARQL endpoint
- Approach: “atomization” of historical knowledge into individual statements
- Linking with other knowledge systems (taxonomies, standard data, knowledge bases)
- Key values: human and machine readable, open, collaborative, multilingual
Compared to Wikidata
- Focused on one domain (French novel, 1751-1800)
- Better coverage / higher density of information for this domain
- Development of a systematic ontology
- Much smaller: 300k vs. 1.5 billion statements

The project ‘Mining and Modeling Text’

Mining and Modeling Text (MiMoText) brought literary history, machine learning and linked open data together.
The project was funded by the regional government of Rhineland-Palatinate from 2019 to 2023.
It was interdisciplinary, involving computer science, literary studies, legal studies, computational linguistics and digital humanities.
Its key goal was to use computational methods to build a knowledge base about the French novel of the Enlightenment.
More generally, it also set out to experiment with the idea of “Linked Open Data for Literary History”.
To this end, the project collected literary-historical knowledge, made it machine-readable and linked it using LOD.
We used three different sources of information for this purpose:
- Bibliographic metadata, in particular the “Bibliographie du genre romanesque français, 1751-1800” by Martin, Mylne and Frautschi
- Knowledge from literary history, especially overview chapters from handbooks
- And characteristics of primary texts, in our case based on a corpus of 200 French novels from the period 1751-1800
Work objectives
- Automatically extract relevant information from these sources
- Model this information as LOD and link it together as much as possible
- Analyze this information to learn more about the literature of the period, but also about literary historiography
- This also means transforming heterogeneous sources into a homogeneous data set
- It also means explicitly modeling everything, and this is where controlled vocabularies, taxonomies, ontologies and their implementation and use come into play.

Mining: Information Extraction

Pillar 1: Bibliographie du genre romanesque français

Pillar 2: primary literature (novels)

Corpus of 200 French novels (1751-1800)
Encoding: in XML-TEI, with metadata, according to ELTeC schema
Methods of analysis: Topic modeling, NER, stylometry, etc.
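
To illustrate what working with TEI-encoded novels can look like, here is a minimal Python sketch that reads metadata and paragraph text from a TEI document using only the standard library. The sample document is a heavily simplified stand-in (it is not schema-valid and does not reproduce the ELTeC header), and the function names are invented for this example.

```python
import xml.etree.ElementTree as ET

# The TEI namespace; element names below follow general TEI conventions.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Simplified, hand-written sample; real corpus files carry much richer headers.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>Candide, ou l'Optimisme</title>
        <author>Voltaire</author>
      </titleStmt>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <div type="chapter">
        <p>Texte du premier chapitre.</p>
      </div>
    </body>
  </text>
</TEI>"""


def read_metadata(root):
    """Extract title and author from the title statement of the TEI header."""
    title = root.find(".//tei:titleStmt/tei:title", TEI_NS)
    author = root.find(".//tei:titleStmt/tei:author", TEI_NS)
    return {"title": title.text, "author": author.text}


def read_paragraphs(root):
    """Return the plain text of every paragraph in the body."""
    return ["".join(p.itertext()).strip()
            for p in root.iterfind(".//tei:body//tei:p", TEI_NS)]


root = ET.fromstring(SAMPLE)
metadata = read_metadata(root)
paragraphs = read_paragraphs(root)
word_count = sum(len(p.split()) for p in paragraphs)
print(metadata, word_count)
```

The extracted plain text is the kind of input that methods such as topic modeling or NER would then operate on.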

Pillar 3: Scholarly Literature

Annotation guidelines => Manual annotations (using INCEpTION)
Linking of INCEpTION with MiMoTextBase and Wikidata => disambiguation
Creation of statements about authors and works (genres, themes, etc.)
Machine Learning based on the annotated training data

Modeling: Data Modeling

Modular Data Model

Module 1: Theme
Module 2: Space
Module 3: Narrative form
Module 4: Literary work
Module 5: Author
Module 6: Mapping
Module 7: Referencing
Module 8: Versioning & publication
Module 9: Terminology
Module 10: Bibliography
Module 11: Scholarly literature

Example: The module on themes

Example: The module on narrative location

Meta-Statements

Linking with Wikidata for ‘federated queries’
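
A federated query typically relies on the SPARQL 1.1 `SERVICE` keyword: one part of the pattern is matched in the local database, the other part is sent to a remote endpoint such as Wikidata. The Python sketch below only builds such a query string. The linking property IRI is a hypothetical placeholder (the actual identifier has to be looked up in the project's ontology), while `wdt:P569` is Wikidata's "date of birth" property.

```python
WIKIDATA_ENDPOINT = "https://query.wikidata.org/sparql"


def federated_query(match_property_iri):
    """Build a query joining local author items with Wikidata birth dates.

    match_property_iri is the (hypothetical) local property that links an
    item in the project database to its Wikidata counterpart.
    """
    return f"""
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT ?author ?wikidataItem ?birthDate WHERE {{
  # Local part: which Wikidata item corresponds to this author?
  ?author <{match_property_iri}> ?wikidataItem .
  # Remote part: evaluated by the Wikidata query service.
  SERVICE <{WIKIDATA_ENDPOINT}> {{
    ?wikidataItem wdt:P569 ?birthDate .
  }}
}}
LIMIT 20
"""


# Placeholder IRI, for illustration only.
query = federated_query("https://example.org/prop/direct/exactMatch")
print(query)
```

The join works only because both sides agree on the Wikidata item identifier, which is why the linking step matters.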

Result: Networked Database

The MiMoTextBase

SPARQL endpoint

MiMoTextBase: Query for themes in novels
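
A query for works about a given theme could be prepared and its results processed roughly as follows. This is a sketch under assumptions: the property and theme IRIs and the endpoint URL are placeholders (the real identifiers and service address should be taken from the project's tutorial), the returned row is invented, and the request is only constructed, not sent. The Accept header and the JSON layout follow the standard SPARQL protocol and results format.

```python
import json
import urllib.parse
import urllib.request

# Placeholder IRIs: real property and theme identifiers must be looked up.
QUERY = """
SELECT ?work ?workLabel WHERE {
  ?work <https://example.org/prop/about> <https://example.org/entity/theme_travel> .
  ?work <http://www.w3.org/2000/01/rdf-schema#label> ?workLabel .
}
LIMIT 10
"""


def build_request(endpoint, query):
    """Prepare (but do not send) a SPARQL GET request asking for JSON results."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"})


def rows(result_json):
    """Flatten the SPARQL JSON results format into a list of plain dicts."""
    data = json.loads(result_json)
    return [{var: cell["value"] for var, cell in binding.items()}
            for binding in data["results"]["bindings"]]


# Canned, invented response so that the sketch runs without network access.
SAMPLE_RESPONSE = json.dumps({
    "head": {"vars": ["work", "workLabel"]},
    "results": {"bindings": [
        {"work": {"type": "uri", "value": "https://example.org/entity/work_1"},
         "workLabel": {"type": "literal", "value": "Candide"}},
    ]},
})

request = build_request("https://example.org/sparql", QUERY)
result_rows = rows(SAMPLE_RESPONSE)
print(result_rows)
```

To run the query for real, the prepared request would be passed to `urllib.request.urlopen` and the response body handed to `rows`.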

Some sample queries: simple queries

Example queries: visualizations

Sample queries: networked and federated

Sample queries: comparative queries

Conclusion

Opportunities & challenges

Opportunities
- Linking heterogeneous data from different types of sources
- Modeling, collecting and comparing contradictory statements
- Transparency in knowledge production (sources)
Challenges
- Lack of consensus on relevant statement types in the discipline
- Complexity reduction (triple structure)
- Interoperability (tension ‘Wikiverse’ vs. OWL standard)

Lessons Learned

Federated queries
- Central element of the LOD vision
- => Making it happen is not trivial (data model, infrastructure)
Modeling meta statements
- Very important: perspectives / statements, not facts
- => Very different approaches in different technical contexts
Exchange across communities
- Literary Studies vs. Digital Humanities vs. Wikiverse
- => is essential but needs more development
There is still so much to do!
- => We are continuing this effort in a new project called
  ‘Linked Open Data in the Humanities’ (LODinG)

Many thanks for your kind attention
Vă mulțumim pentru atenție

Slides: https://dhtrier.quarto.pub/cluj

Further resources

Tutorial: https://docs.mimotext.uni-trier.de
SPARQL endpoint: https://query.mimotext.uni-trier.de
MiMoTextBase: https://data.mimotext.uni-trier.de
MiMoText Ontology: https://github.com/MiMoText/ontology
Reference publication: ‘Smart Modelling for Literary History’ (Schöch et al. 2022)
Overview of visuals: mimotext.github.io/MiMoTextBase_Tutorial/visualizations.html

References

Hinzmann, Maria, Matthias Bremm, Tinghui Duan, Johanna Konstanciak, Julia Röttgermann, Moritz Steffes, Christof Schöch, and Joëlle Weis. 2024. “Patterns in modeling and querying a knowledge graph for literary history [preprint].” In Pattern Theory in Language and Communication, edited by Sabine Arndt-Lappe, Milena Belosevic, Peter Maurer, Claudine Moulin, Achim Rettinger, and Sören Stumpf. Trier: TCLC. https://doi.org/10.5281/zenodo.12080340.

Martin, Angus, Vivienne G. Mylne, and Richard Frautschi. 1977. Bibliographie du genre romanesque français, 1751-1800. Mansell.

Röttgermann, Julia. 2024. “The Collection of Eighteenth-Century French Novels 1751-1800.” Journal of Open Humanities Data 10 (1): 31. https://doi.org/10.5334/johd.201.

Schöch, Christof. 2013. “Big? Smart? Clean? Messy? Data in the Humanities.” Journal of Digital Humanities 2 (3): 1–19. https://journalofdigitalhumanities.org/2-3/big-smart-clean-messy-data-in-the-humanities/.

Schöch, Christof, Maria Hinzmann, Julia Röttgermann, Katharina Dietz, and Anne Klee. 2022. “Smart Modelling for Literary History.” International Journal of Humanities and Arts Computing 16 (1): 78–93. https://doi.org/10.3366/ijhac.2022.0278.
