Linked Open Data for Literary History

Extracting, Modeling, Linking and Querying Data on the French Enlightenment Novel



Christof Schöch
(Trier University, Germany)

The International Conference for the Study of the Novel
Cluj-Napoca

21 Jun 2024

Introduction

Thanks



Prof. Adrian Tudurachi and the Institute of Linguistics and Literary History Sextil Pușcariu / Academy of the Romanian People’s Republic.



The Ministry for Research, Education and Culture of Rhineland-Palatinate, Germany, for funding this research (Mining and Modeling Text, 2019-2023).



Thanks to all the project contributors: Maria Hinzmann, Matthias Bremm, Tinghui Duan, Anne Klee, Johanna Konstanciak, Julia Röttgermann and many others.

Overview

  • 1 – Introduction
  • 2 – Linked Open Data for Literary History
  • 3 – Mining: Information Extraction
  • 4 – Modeling: Data Modeling
  • 5 – Result: Networked Database
  • 6 – Conclusion

Linked Open Data for Literary History

Two Modes of Data (and Digital Humanities)

  • Qualitative DH:
    • Datasets are typically small, curated, heavily annotated, flawless, specialized (‘smart data’)
    • Prototype: digital scholarly editions, e.g. Goethe, Faust Edition
  • Quantitative DH:
    • Datasets are typically large, scraped, unannotated, with errors and biases, generic (‘big data’)
    • Prototype: Large Language Models trained on lots of text, e.g. BERT

Third Way: Bigger Smarter Data

Background: What is Machine Learning?

  • Fundamentally, ML involves detecting relations between features and labels
    • Features: simple properties that we can observe in texts (e.g. word forms)
    • Labels: more complex phenomena that are relevant to our research (e.g. direct speech or metaphors)
  • We use this approach primarily for information extraction
    • We start with a text collection
    • We may annotate part of the data with labels
    • Then train a new model, or use an existing model
    • Evaluate the performance of the model on the annotated data
    • And then derive labels from unannotated text
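
The workflow above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual pipeline: it assumes a small Naive Bayes classifier over word forms, and the sentences, the label names (direct_speech, narration) and the function names are all invented for the example.

```python
import math
from collections import Counter, defaultdict


def features(text):
    """Features: lower-cased word forms, with guillemets as separate tokens."""
    spaced = text.lower().replace("«", " « ").replace("»", " » ")
    return spaced.split()


def train(examples):
    """Count word forms per label (multinomial Naive Bayes)."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(features(text))
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, label_counts, vocab


def predict(model, text):
    """Return the most probable label, using add-one smoothing."""
    word_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    scores = {}
    for label, n in label_counts.items():
        denom = sum(word_counts[label].values()) + len(vocab)
        score = math.log(n / total)
        for w in features(text):
            if w in vocab:  # word forms unseen in training are ignored
                score += math.log((word_counts[label][w] + 1) / denom)
        scores[label] = score
    return max(scores, key=scores.get)


def accuracy(model, examples):
    """Share of annotated examples for which the predicted label is correct."""
    return sum(predict(model, t) == l for t, l in examples) / len(examples)


# Hypothetical annotated data: part is used for training, part for evaluation.
annotated = [
    ("« Venez , dit-elle , je vous attends »", "direct_speech"),
    ("« Je pars demain » , répondit le comte", "direct_speech"),
    ("Le comte quitta Paris au matin", "narration"),
    ("Elle attendit longtemps dans le jardin", "narration"),
]
held_out = [
    ("« Je vous aime » , dit le marquis", "direct_speech"),
    ("Le marquis quitta le jardin", "narration"),
]

model = train(annotated)
score = accuracy(model, held_out)
# Derive a label for a sentence that has not been annotated.
new_label = predict(model, "« Partez » , dit-elle")
```

In practice the annotated set would be far larger and the model more capable, but the steps stay the same: annotate, train, evaluate on held-out annotations, then label the remaining text.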

Background: What is Linked Open Data?

Linked Open Data: Multilingualism

Background: What is Literary History?

  • Goals of literary history
    • Collecting and documenting knowledge of literary history
    • Providing explanations for the development of literature
  • Organizational principles
    • Nations, periods, movements/currents, genres
    • Authors and works: themes, forms, relationships
    • Similarities and differences, continuities and change

Literary History in Linked Open Data

  • Building blocks
    • Subjects, including persons (authors, etc.) and works (primary texts, etc.)
    • Objects, e.g. themes, locations, protagonists, literary genre, etc.
    • Predicates, as required, including: author_of, about, influenced_by, etc.
    • Qualifications, e.g. Source (with type, date, URL)
  • Some exemplary statement types
    • Bibliographic: [person] author_of [work]
    • Content-based: [work] about [theme]
    • Formal: [work] narrative_form [type]
    • Relations: [author] influenced_by [work]
    • and many more.
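
As a rough illustration of these building blocks, statements can be represented as subject-predicate-object triples with an optional source qualification. The sketch below is an assumption-laden toy: the class name, the match helper and the example values are illustrative and are not taken from the project's database.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Statement:
    subject: str
    predicate: str
    obj: str
    source: Optional[str] = None  # qualification: where the claim comes from


# Illustrative values only, chosen to mirror the statement types listed above.
statements = [
    Statement("Voltaire", "author_of", "Candide", source="bibliography"),
    Statement("Candide", "about", "travel", source="topic model"),
    Statement("Candide", "about", "philosophy", source="scholarly literature"),
    Statement("Candide", "narrative_form", "heterodiegetic", source="bibliography"),
]


def match(statements, subject=None, predicate=None, obj=None):
    """Return statements matching a triple pattern; None acts as a wildcard."""
    return [
        s for s in statements
        if (subject is None or s.subject == subject)
        and (predicate is None or s.predicate == predicate)
        and (obj is None or s.obj == obj)
    ]


themes = [s.obj for s in match(statements, subject="Candide", predicate="about")]
```

Matching a pattern with wildcards is, in miniature, what a SPARQL query does against a triple store.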

Our project’s key idea: Wikidata for Literary History

  • Idea: Create a “Wikidata for the history of literature”
    • Literary history information system
    • LOD-based, with explorative interface and SPARQL endpoint
    • Approach: “atomization” of literary-historical knowledge into individual statements
    • Linking with other knowledge systems (taxonomies, authority files, knowledge bases)
    • Key values: human and machine readable, open, collaborative, multilingual
  • Compared to Wikidata
    • Focused on one domain (French novel, 1750-1800)
    • Better coverage / higher density of information for this domain
    • Development of a systematic ontology
    • Much smaller: 300k vs. 1.5 billion statements

The project ‘Mining and Modeling Text’

Mining: Information Extraction

Pillar 1: Bibliographie du genre romanesque français

Pillar 2: primary literature (novels)

  • Corpus of 200 French novels (1750-1800)
  • Encoding: in XML-TEI, with metadata, according to ELTeC schema
  • Methods of analysis: Topic modeling, NER, stylometry, etc.
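
To give an idea of how metadata can be read from a TEI-encoded novel, here is a small sketch using Python's standard library. The sample document is a heavily simplified, hypothetical fragment (it is not schema-valid ELTeC), and the function name is invented; only the TEI namespace is standard.

```python
import xml.etree.ElementTree as ET

# The TEI namespace is standard; everything in SAMPLE is a simplified stand-in.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>Candide, ou l'Optimisme</title>
        <author>Voltaire</author>
      </titleStmt>
    </fileDesc>
  </teiHeader>
  <text><body><div type="chapter"><p>Texte du chapitre.</p></div></body></text>
</TEI>"""


def read_metadata(xml_string):
    """Extract title, author and chapter count from a TEI document."""
    root = ET.fromstring(xml_string)
    title = root.findtext(".//tei:titleStmt/tei:title", namespaces=TEI_NS)
    author = root.findtext(".//tei:titleStmt/tei:author", namespaces=TEI_NS)
    chapters = root.findall(".//tei:div[@type='chapter']", TEI_NS)
    return {"title": title, "author": author, "chapters": len(chapters)}


meta = read_metadata(SAMPLE)
# The header metadata maps directly onto a bibliographic statement.
triple = (meta["author"], "author_of", meta["title"])
```

The same pattern extends to other header fields, which is one way in which corpus metadata can feed statements into a knowledge graph.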

Pillar 3: Scholarly Literature

  • Annotation guidelines => Manual annotations (using INCEpTION)
  • Linking of INCEpTION with MiMoTextBase and Wikidata => disambiguation
  • Creation of statements about authors and works (genres, themes, etc.)
  • Machine Learning based on the annotated training data

Modeling: Data Modeling

Modular Data Model

  • Module 1: Theme
  • Module 2: Space
  • Module 3: Narrative form
  • Module 4: Literary work
  • Module 5: Author
  • Module 6: Mapping
  • Module 7: Referencing
  • Module 8: Versioning & publication
  • Module 9: Terminology
  • Module 10: Bibliography
  • Module 11: Scholarly literature

Example: The module on themes

Example: The module on narrative location

Meta-Statements
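
One way to picture meta-statements is to attach a source to every claim and then look for claims that disagree. The following sketch assumes a plain dictionary representation; the novel, the sources, the field names and the idea of a list of single-valued predicates are all hypothetical simplifications.

```python
from collections import defaultdict

# Hypothetical claims: each statement records where it was stated.
claims = [
    {"subject": "Novel A", "predicate": "narrative_form",
     "object": "epistolary", "stated_in": "Bibliography"},
    {"subject": "Novel A", "predicate": "narrative_form",
     "object": "autodiegetic", "stated_in": "Scholarly article X"},
    {"subject": "Novel A", "predicate": "about",
     "object": "love", "stated_in": "Topic model"},
    {"subject": "Novel A", "predicate": "about",
     "object": "travel", "stated_in": "Scholarly article X"},
]


def contested(claims, single_valued=("narrative_form",)):
    """Group sources by claimed value and keep the cases where sources disagree.

    Only predicates assumed to take a single value are checked, since
    several themes for one novel do not contradict each other.
    """
    grouped = defaultdict(lambda: defaultdict(list))
    for c in claims:
        if c["predicate"] in single_valued:
            key = (c["subject"], c["predicate"])
            grouped[key][c["object"]].append(c["stated_in"])
    return {
        key: dict(values)
        for key, values in grouped.items()
        if len(values) > 1
    }


disagreements = contested(claims)
```

Keeping both values together with their sources, rather than picking one, reflects the view that the database records statements and perspectives, not facts.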

Linking with Wikidata for ‘federated queries’
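
A federated query combines a pattern evaluated locally with a pattern delegated to a remote endpoint through the SPARQL SERVICE keyword. The sketch below only builds such a query as a string and does not send it. The Wikidata endpoint URL and the property wdt:P19 (place of birth) are real; the local matching property IRI is a placeholder, since the actual identifiers used in the project's database are not given here.

```python
WIKIDATA_ENDPOINT = "https://query.wikidata.org/sparql"


def build_federated_query(match_property_iri, limit=10):
    """Build a SPARQL query that follows local links into Wikidata.

    match_property_iri is the (hypothetical) local property that links
    a local author item to its Wikidata counterpart.
    """
    return f"""PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT ?author ?wikidataAuthor ?birthPlace WHERE {{
  ?author <{match_property_iri}> ?wikidataAuthor .
  SERVICE <{WIKIDATA_ENDPOINT}> {{
    ?wikidataAuthor wdt:P19 ?birthPlace .
  }}
}}
LIMIT {int(limit)}"""


# Placeholder IRI, for illustration only.
query = build_federated_query("https://example.org/prop/direct/exact_match", limit=5)
```

The query only works if the local data model stores explicit links to Wikidata items and if the local endpoint is allowed to call the remote service, which is where data modeling and infrastructure both come into play.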

Result: Networked Database

The MiMoTextBase

SPARQL endpoint

MiMoTextBase: Query for themes in novels

Sample queries: simple queries

Sample queries: visualizations

Sample queries: networked and federated

Sample queries: comparative queries

Conclusion

Opportunities & challenges

  • Opportunities
    • Linking heterogeneous data from different types of sources
    • Modeling, collecting and comparing contradictory statements
    • Transparency in knowledge production (sources)
  • Challenges
    • Lack of consensus on relevant statement types in the discipline
    • Complexity reduction (triple structure)
    • Interoperability (tension ‘Wikiverse’ vs. OWL standard)

Lessons Learned

  • Federated queries
    • Central element of the LOD vision
    • => Making it happen is not trivial (data model, infrastructure)
  • Modeling meta statements
    • Very important: perspectives / statements, not facts
    • => Very different approaches in different technical contexts
  • Exchange across communities
    • Literary Studies vs. Digital Humanities vs. Wikiverse
    • => is essential but needs more development
  • There is still so much to do!
    • => We are continuing this effort in a new project called
      ‘Linked Open Data in the Humanities’ (LODinG)


Many thanks for your kind attention
Vă mulțumim pentru atenție

Slides: https://dhtrier.quarto.pub/cluj

Further resources

References


Hinzmann, Maria, Matthias Bremm, Tinghui Duan, Johanna Konstanciak, Julia Röttgermann, Moritz Steffes, Christof Schöch, and Joëlle Weis. 2024. “Patterns in modeling and querying a knowledge graph for literary history [preprint].” In Pattern Theory in Language and Communication, edited by Sabine Arndt-Lappe, Milena Belosevic, Peter Maurer, Claudine Moulin, Achim Rettinger, and Sören Stumpf. Trier: TCLC. https://doi.org/10.5281/zenodo.12080340.
Martin, Angus, Vivienne G. Mylne, and Richard Frautschi. 1977. Bibliographie du genre romanesque français, 1751-1800. Mansell.
Röttgermann, Julia. 2024. “The Collection of Eighteenth-Century French Novels 1751-1800.” Journal of Open Humanities Data 10 (1): 31. https://doi.org/10.5334/johd.201.
Schöch, Christof. 2013. “Big? Smart? Clean? Messy? Data in the Humanities.” Journal of Digital Humanities 2 (3): 1–19. https://journalofdigitalhumanities.org/2-3/big-smart-clean-messy-data-in-the-humanities/.
Schöch, Christof, Maria Hinzmann, Julia Röttgermann, Katharina Dietz, and Anne Klee. 2022. “Smart Modelling for Literary History.” International Journal of Humanities and Arts Computing 16 (1): 78–93. https://doi.org/10.3366/ijhac.2022.0278.