Linked Open Data for Literary History




Christof Schöch
(Trier University, Germany)

Digital Humanities Training Day 2024, FAU Erlangen-Nürnberg

22 Nov 2024



Introduction







Overview

  • 1 – Introduction
  • 2 – TDM + LOD + LitHist = ?
  • 3 – Pilot: Mining and Modeling Text
  • 4 – Example: Epistolarity and Libertinage
  • 5 – Conclusion
  • 6 – Appendix: Sample Queries

TDM + LOD + LitHist = ?

Three Modes of Data (and Digital Humanities)

  • Qualitative DH
    • Datasets are typically small, curated, heavily annotated, flawless, specialized (‘smart data’)
    • Prototype: digital scholarly editions, e.g. Goethe, Faust Edition
  • Quantitative DH
    • Datasets are typically large, scraped, unannotated, with errors and biases, generic (‘big data’)
    • Prototype: Topic Modeling on a Project Gutenberg corpus
  • Third way for DH
    • Bigger, smarter data in the humanities.
    • One approach: Text Mining + Linked Open Data

What is TDM / Machine Learning?

  • Fundamentally, ML involves detecting relations between features and labels
    • Features: simple properties that we can observe in texts (e.g.: word forms)
    • Labels: more complex phenomena that are relevant to our research (e.g.: direct speech or metaphors)
  • We use this approach primarily for information retrieval
    • We start with a text collection
    • We may annotate part of the data with labels
    • Then train a new model, or use an existing model
    • Evaluate the performance of the model on the annotated data
    • And then derive labels from unannotated text
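
A minimal sketch of the workflow above, assuming a tiny naive Bayes classifier written in plain Python. The sentences, the two labels (direct_speech, narration) and all function names are invented for illustration; a real project would use far more annotated data and an established library.

```python
import math
from collections import Counter, defaultdict

def features(sentence):
    """Features: simple observable properties, here lower-cased word forms and quotation marks."""
    return sentence.lower().replace("«", " « ").replace("»", " » ").split()

def train(annotated):
    """Count how often each feature co-occurs with each label."""
    label_counts = Counter()
    feature_counts = defaultdict(Counter)
    for sentence, label in annotated:
        label_counts[label] += 1
        feature_counts[label].update(features(sentence))
    return label_counts, feature_counts

def predict(model, sentence):
    """Return the label with the highest naive Bayes log score."""
    label_counts, feature_counts = model
    vocab = {f for counts in feature_counts.values() for f in counts}
    total = sum(label_counts.values())
    scores = {}
    for label in label_counts:
        n = sum(feature_counts[label].values())
        score = math.log(label_counts[label] / total)
        for f in features(sentence):
            # Laplace smoothing so that unseen features do not rule out a label
            score += math.log((feature_counts[label][f] + 1) / (n + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

def accuracy(model, annotated):
    """Share of annotated sentences for which the predicted label is correct."""
    hits = sum(predict(model, s) == label for s, label in annotated)
    return hits / len(annotated)

# Step 1-2: a (toy) text collection, part of it annotated with labels
training = [
    ("« Come here », she said .", "direct_speech"),
    ("« I will write to you », he said .", "direct_speech"),
    ("The carriage left the town at dawn .", "narration"),
    ("The letter arrived three days later .", "narration"),
]
held_out = [
    ("« Write to me », she said .", "direct_speech"),
    ("The carriage arrived at dawn .", "narration"),
]

# Step 3-4: train a model, then evaluate it on annotated data not used for training
model = train(training)
print("accuracy on held-out data:", accuracy(model, held_out))

# Step 5: derive a label for unannotated text
print(predict(model, "« Leave the town », he said ."))
```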

What is Literary History?

  • Goals of literary history
    • Collect and document knowledge about literature over time
    • Provide explanations for the ‘evolution’ of literature
    • Contextualize literature, document reception of literature
  • Organizational principles
    • Nations, periods, movements/currents, genres
    • Authors and works: themes, forms, setting, plot, etc.
    • Similarities and differences, continuities and change

What is Linked Open Data?

Linked Open Data + Literary History

  • Building blocks for statements
    • Subjects, including persons (author, etc.) and works (primary text, scholarly literature, etc.)
    • Objects, including works, but also themes, locations, protagonists, literary genre, etc.
    • Predicates, as required, including: author_of, about, sameAs, etc.
    • Qualifications, e.g.: source (with type, date, URL)
  • Some exemplary statement types
    • bibliographic: [person] author_of [work]
    • content-related: [work] about [theme]
    • formal: [work] narrative_form [type]
    • and many more.
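
The building blocks above can be sketched as a small data structure. This is only an illustration, assuming Python dataclasses; the class names, field names and the source placeholder are hypothetical and do not reproduce the project's actual data model.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Source:
    """Qualification of a statement: where the claim comes from."""
    type: str
    date: Optional[str] = None
    url: Optional[str] = None

@dataclass(frozen=True)
class Statement:
    """One 'atom' of knowledge: subject, predicate, object, plus an optional source."""
    subject: str
    predicate: str
    obj: str
    source: Optional[Source] = None

    def as_triple(self):
        return (self.subject, self.predicate, self.obj)

# Placeholder source, invented for this example
src = Source(type="scholarly literature (placeholder)")

statements = [
    # bibliographic: [person] author_of [work]
    Statement("Choderlos de Laclos", "author_of", "Les Liaisons dangereuses", src),
    # content-related: [work] about [theme]
    Statement("Les Liaisons dangereuses", "about", "libertinage", src),
    # formal: [work] narrative_form [type]
    Statement("Les Liaisons dangereuses", "narrative_form", "epistolary", src),
]

def about(statements, theme):
    """Subjects of all content-related statements that name the given theme."""
    return [s.subject for s in statements if s.predicate == "about" and s.obj == theme]

print(about(statements, "libertinage"))
```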

Aim: Wikidata for literary history

  • Approach: “atomization” of historical knowledge
  • Key values:
    • Networked
    • Open, human and machine readable data
    • Collaborative
    • Multilingual

Mining and Modeling Text

Overview of Mining and Modeling Text

Pillar 1: Bibliographie du genre romanesque français

Pillar 2: primary literature (novels)

  • Corpus of 200 French novels (1750-1800)
  • Encoding: in XML-TEI, with metadata, according to ELTeC schema
  • Methods of analysis: Topic modeling, NER, stylometry, etc.

Pillar 3: Scholarly Literature

  • Corpus of chapters from literary histories about the French Eighteenth-Century novel
  • Annotation guidelines => Manual annotations (using INCEpTION)
  • Linking of INCEpTION with MiMoTextBase and Wikidata => disambiguation
  • Creation of statements about authors and works (genres, themes, etc.)
  • Machine Learning based on the annotated training data

Modeling in LOD

  • Module 1: Theme
  • Module 2: Space
  • Module 3: Narrative form
  • Module 4: Literary work
  • Module 5: Author
  • Module 6: Mapping
  • Module 7: Referencing
  • Module 8: Versioning & publication
  • Module 9: Terminology
  • Module 10: Bibliography
  • Module 11: Scholarly literature

Example: The module on themes

Linking with Wikidata for ‘federated queries’
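
A hedged sketch of what a federated query could look like. It assumes that places in the local knowledge graph carry a link to the matching Wikidata item, so that a SPARQL 1.1 SERVICE clause can fetch coordinates (Wikidata property P625, coordinate location) from the Wikidata endpoint. The two local property IRIs are placeholders, not the real MiMoTextBase identifiers, and the helper only builds the query text without sending it.

```python
WIKIDATA_ENDPOINT = "https://query.wikidata.org/sparql"

def federated_location_query(location_prop, match_prop, limit=50):
    """Build a SPARQL query that joins local statements with Wikidata coordinates.

    location_prop and match_prop are full IRIs in angle brackets; the values
    used below are hypothetical placeholders.
    """
    return f"""
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT ?work ?place ?wikidataPlace ?coordinates WHERE {{
  ?work {location_prop} ?place .
  ?place {match_prop} ?wikidataPlace .
  SERVICE <{WIKIDATA_ENDPOINT}> {{
    ?wikidataPlace wdt:P625 ?coordinates .
  }}
}}
LIMIT {limit}
""".strip()

query = federated_location_query(
    location_prop="<https://example.org/prop/direct/narrative_location>",  # placeholder
    match_prop="<https://example.org/prop/direct/exact_match>",            # placeholder
)
print(query)
```

The first two triple patterns are answered by the local endpoint; only the pattern inside SERVICE is delegated to Wikidata, which is why the link between local and Wikidata items is the key modeling step.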

Result: The MiMoTextBase

SPARQL endpoint

Example: Epistolarity and Libertinage

The issue: Is there a libertine epistolary novel after 1782?

  • Is there libertinage in the epistolary novel after 1782?
    • van Crugten-André (1997): “the epistolary genre is hardly represented in the libertine novel after Laclos [=1782]”.
    • Benoît Melançon (2004): article on the “late libertine epistolary novel”, which discusses 8 libertine epistolary novels from after 1782.
  • Consensus that the “libertine epistolary novel” is a subgenre
  • But lack of clarity on its exact temporal extension

Clarification? Data from the MiMoTextBase

  • Number of novels for 1782-1800 => 647 (Q1)
  • Of those, number of epistolary novels => 91 (Q2)
  • What are their topics? => (Q3, bubble chart)
  • Of those, number about libertinage? => 1 (Q4a)
  • With looser definition of epistolary? => 3 (Q4b)
  • Number of libertine non-epistolary novels? => 14 (Q5)
  • Where do these novels take place? (Q6, federated!)
  • Compared to Melançon
    • Some are not marked ‘epistolary’
    • Some are not marked ‘libertine’
    • Two are missing
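
The logic of this sequence of queries (narrow by period, then by narrative form, then by theme, then compare with a scholar's list) can be sketched on in-memory records. All records and identifiers (W1 to W5, W9) are invented toy data, not results from the MiMoTextBase; the real queries are SPARQL queries against the knowledge graph.

```python
# Toy records, invented for illustration only
works = [
    {"id": "W1", "year": 1783, "form": "epistolary", "themes": {"libertinage"}},
    {"id": "W2", "year": 1785, "form": "epistolary", "themes": {"sentimentalism"}},
    {"id": "W3", "year": 1790, "form": "heterodiegetic", "themes": {"libertinage"}},
    {"id": "W4", "year": 1779, "form": "epistolary", "themes": {"libertinage"}},
    {"id": "W5", "year": 1799, "form": "autodiegetic", "themes": {"travel"}},
]

def in_period(works, start, end):
    return [w for w in works if start <= w["year"] <= end]

def with_form(works, form):
    return [w for w in works if w["form"] == form]

def about(works, theme):
    return [w for w in works if theme in w["themes"]]

# Successive narrowing, analogous to Q1, Q2, Q4 and Q5
period = in_period(works, 1782, 1800)
epistolary = with_form(period, "epistolary")
libertine_epistolary = about(epistolary, "libertinage")
libertine_other = [w for w in about(period, "libertinage") if w["form"] != "epistolary"]

def compare(reference_ids, works):
    """Check a scholar's list of libertine epistolary novels against the records."""
    by_id = {w["id"]: w for w in works}
    report = {"missing": [], "not_epistolary": [], "not_libertine": []}
    for ref in reference_ids:
        work = by_id.get(ref)
        if work is None:
            report["missing"].append(ref)
            continue
        if work["form"] != "epistolary":
            report["not_epistolary"].append(ref)
        if "libertinage" not in work["themes"]:
            report["not_libertine"].append(ref)
    return report

print(len(period), len(epistolary), len(libertine_epistolary), len(libertine_other))
print(compare(["W1", "W2", "W3", "W9"], period))
```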

Query 3: Topics of epistolary novels post-1782

Query 6: Narrative location of libertine novels post-1782

Conclusion

Challenges

  • Necessary complexity reduction (triple structure) inherent in modeling
  • Lack of consensus on relevant statement types in the discipline
  • Federation: central part of the vision of LOD; not easy to achieve
  • Sustainability: our Wikibase vs. Wikidata vs. RDF on Zenodo

Opportunities

  • Linking heterogeneous data from different types of sources
    (e.g. semantic encoding bridges granularity differences)
  • Modeling, collecting and comparing contradictory statements
    (we model perspectives, not facts)
  • Transparency in knowledge production (sources of statements are included)
  • Multilingual data for a multilingual world (identifier vs. labels)
  • Linking datasets and reusing existing resources (avoid redundancy)
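
To illustrate the idea of modeling perspectives rather than facts: when every statement keeps its source, contradictory claims can be stored side by side and surfaced later. The sketch below uses invented claims about a placeholder work W1 from two placeholder sources; the tuple layout and the function name are assumptions made for this example.

```python
from collections import defaultdict

# (subject, predicate, object, source); all values are invented placeholders
claims = [
    ("W1", "narrative_form", "epistolary", "Source A"),
    ("W1", "narrative_form", "mixed", "Source B"),
    ("W1", "about", "libertinage", "Source A"),
    ("W1", "about", "libertinage", "Source B"),
]

def contested(claims):
    """Return subject/predicate pairs for which sources give different values."""
    by_key = defaultdict(lambda: defaultdict(set))
    for subject, predicate, obj, source in claims:
        by_key[(subject, predicate)][obj].add(source)
    return {
        key: {obj: sorted(sources) for obj, sources in values.items()}
        for key, values in by_key.items()
        if len(values) > 1  # more than one distinct value means disagreement
    }

print(contested(claims))
```

Nothing is overwritten here: both values remain in the data, each attributed to its source, and agreement (as for the theme) simply shows up as one value with several sources.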

Outlook: LOD for other domains in the Humanities




Many thanks for your kind attention 😸

Further resources

References


Hinzmann, Maria, Matthias Bremm, Tinghui Duan, Julia Konstanciak, Julia Röttgermann, Moritz Steffes, Christof Schöch, and Joëlle Weis. 2024. “Patterns in modeling and querying a knowledge graph for literary history [preprint].” In Pattern Theory in Language and Communication, edited by Sabine Arndt-Lappe, Milena Belosevic, Peter Maurer, Claudine Moulin, Achim Rettinger, and Sören Stumpf. Trier: TCLC. https://doi.org/10.5281/zenodo.12080340.
Martin, Angus, Vivienne G. Mylne, and Richard Frautschi. 1977. Bibliographie du genre romanesque français, 1751-1800. Mansell.
Röttgermann, Julia. 2024. “The Collection of Eighteenth-Century French Novels 1751-1800.” Journal of Open Humanities Data 10 (1): 31. https://doi.org/10.5334/johd.201.
Schöch, Christof. 2013. “Big? Smart? Clean? Messy? Data in the Digital Humanities.” Journal of Digital Humanities 2 (3): 1–19. https://journalofdigitalhumanities.org/2-3/big-smart-clean-messy-data-in-the-humanities/.
Schöch, Christof, Maria Hinzmann, Julia Röttgermann, Katharina Dietz, and Anne Klee. 2022. “Smart Modelling for Literary History.” International Journal of Humanities and Arts Computing 16 (1): 78–93. https://doi.org/10.3366/ijhac.2022.0278.

Appendix: Sample queries

Some sample queries: simple queries

Example queries: visualizations

Sample queries: networked and federated

Sample queries: comparative queries