Mining and Modeling Text: Leveraging Machine Learning and Linked Open Data to Investigate the French Enlightenment Novel




Prof. Dr. Christof Schöch
Trier University, Germany

Institute for Georgian Literature – Tbilisi State University – Georgia

13 Mar 2025



Introduction: ML + LOD + LitHist = ?







Overview

  • 1 – Introduction: ML + LOD + LitHist = ?
  • 2 – Mining and Modeling Text
  • 3 – Example: Epistolarity and Libertinage
  • 4 – Conclusion
  • 5 – Appendix: Sample Queries

Three Modes of Data (and Digital Humanities)

  • Qualitative DH
    • Datasets are typically small, curated, heavily annotated, flawless, specialized (‘smart data’)
    • Prototype: digital scholarly editions, e.g. Goethe, Faust Edition
  • Quantitative DH
    • Datasets are typically large, scraped, unannotated, with errors and biases, generic (‘big data’)
    • Prototype: Topic Modeling on a Project Gutenberg corpus
  • Third way for DH
    • Bigger, smarter data in the humanities.
    • One approach: Machine Learning + Linked Open Data

Literary History 101

  • Goals of literary history
    • Collect and document knowledge about literature over time
    • Contextualize literature, document reception of literature
    • Provide explanations for the ‘evolution’ of literature
  • Organizational principles
    • Nations, periods, movements/currents, genres, authors
    • Authors and works: themes, forms, setting, plot, etc.
    • Similarities and differences, continuities and change

Machine Learning for Literary History

  • Fundamentally, ML involves detecting relations between features and labels
    • Features: simple properties that we can observe in texts (e.g.: word forms)
    • Labels: more complex phenomena that are relevant to our research (e.g.: direct speech or themes)
  • We use this approach primarily for literary information retrieval
    • We start with a text collection
    • We may annotate part of the data and train a new model
    • Or we may use an existing model (or unsupervised approaches)
    • Evaluate the performance of the model on the annotated data
    • And then derive labels from unannotated text
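
The workflow above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the project's actual pipeline: the sentences are invented toy data, the two labels (direct_speech, narration) and all function names are hypothetical, and a small Naive Bayes classifier stands in for whatever model is really used. Features are word forms plus one flag for quotation marks.

```python
import math
from collections import Counter, defaultdict

def features(sentence):
    """Features: simple observable properties (word forms, presence of a quotation mark)."""
    tokens = [t.strip('.,;:!?"') for t in sentence.lower().split()]
    feats = [t for t in tokens if t]
    if '"' in sentence:
        feats.append("HAS_QUOTE")
    return feats

def train(examples):
    """Count how often each feature co-occurs with each label."""
    label_counts = Counter()
    feature_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in examples:
        label_counts[label] += 1
        for f in features(text):
            feature_counts[label][f] += 1
            vocabulary.add(f)
    return label_counts, feature_counts, vocabulary

def predict(model, text):
    """Return the most probable label (Naive Bayes with add-one smoothing)."""
    label_counts, feature_counts, vocabulary = model
    total = sum(label_counts.values())
    scores = {}
    for label, n in label_counts.items():
        denom = sum(feature_counts[label].values()) + len(vocabulary)
        score = math.log(n / total)
        for f in features(text):
            if f in vocabulary:  # ignore features never seen in training
                score += math.log((feature_counts[label][f] + 1) / denom)
        scores[label] = score
    return max(scores, key=scores.get)

# Step 1-2: a (toy, invented) text collection, part of it annotated by hand.
annotated = [
    ('"I will write to you tomorrow," she said.', "direct_speech"),
    ('"Do not leave me," he cried.', "direct_speech"),
    ('"Where is the letter?" asked the count.', "direct_speech"),
    ("The carriage arrived at the castle in the evening.", "narration"),
    ("She walked through the garden for an hour.", "narration"),
    ("The count had lived in Paris for many years.", "narration"),
]
held_out = [
    ('"Come back soon," she said.', "direct_speech"),
    ("The letter arrived in the evening.", "narration"),
]
unannotated = [
    '"Write to me," he said.',
    "The count walked through the garden.",
]

# Step 3: train a model on the annotated part.
model = train(annotated)

# Step 4: evaluate on annotated data that was not used for training.
correct = sum(predict(model, text) == label for text, label in held_out)
accuracy = correct / len(held_out)

# Step 5: derive labels for unannotated text.
predicted = [predict(model, text) for text in unannotated]
print(accuracy, predicted)
```

With real data, the held-out set would be much larger and the evaluation would report precision and recall per label rather than a single accuracy value.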

What is Linked Open Data?

ML + LOD + Literary History

  • Machine Learning:
    • Generate information about literary history from texts
    • Use various sources and methods
  • Linked Open Data:
    • Model this information semantically (triples)
    • Merge multiple results into a shared data model
  • Literary History:
    • Browse, query, visualize the data
    • “atomization” of the historical knowledge
    • but: generated, sourced, contextual
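
To make the idea of sourced triples concrete, here is a minimal Python sketch. It assumes an in-memory list instead of a real triple store, and every identifier (work:W1, theme:libertinage, the predicate and source names) is hypothetical, not taken from the MiMoTextBase. Each statement is a subject-predicate-object triple that also records where it came from, so results from several methods can be merged and still be told apart.

```python
from collections import namedtuple

# A statement is a triple plus its provenance (hypothetical field names).
Statement = namedtuple("Statement", ["subject", "predicate", "obj", "source"])

def merge(*result_sets):
    """Merge several result sets into one store, dropping exact duplicates."""
    store, seen = [], set()
    for results in result_sets:
        for statement in results:
            if statement not in seen:
                seen.add(statement)
                store.append(statement)
    return store

def query(store, subject=None, predicate=None, obj=None):
    """Return all statements matching the given pattern (None = wildcard)."""
    return [
        s for s in store
        if (subject is None or s.subject == subject)
        and (predicate is None or s.predicate == predicate)
        and (obj is None or s.obj == obj)
    ]

# Toy results from three kinds of sources (all values invented).
ml_results = [
    Statement("work:W1", "about", "theme:libertinage", "topic_modeling"),
]
bibliography_results = [
    Statement("work:W1", "narrative_form", "form:epistolary", "bibliography"),
    Statement("work:W1", "publication_year", "1785", "bibliography"),
]
scholarly_results = [
    Statement("work:W1", "about", "theme:passion", "scholarly_literature"),
    Statement("work:W1", "about", "theme:libertinage", "scholarly_literature"),
]

store = merge(ml_results, bibliography_results, scholarly_results)

# Which themes are attributed to W1, and who says the work is libertine?
themes_w1 = query(store, subject="work:W1", predicate="about")
sources_for_libertinage = sorted(
    s.source for s in query(store, obj="theme:libertinage")
)
print(len(store), sources_for_libertinage)
```

Keeping the source inside each statement is what lets two methods assert the same theme without one overwriting the other.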

Mining and Modeling Text

Overview of Mining and Modeling Text

Pillar 1: Bibliographie du genre romanesque français

Pillar 2: primary literature (novels)

  • Corpus of 200 French novels (1750-1800)
  • Encoding: in XML-TEI, with metadata, according to ELTeC schema
  • Methods of analysis: Topic modeling, NER, stylometry, etc.

Pillar 3: Scholarly Literature

  • Corpus of chapters from literary histories about the French Eighteenth-Century novel
  • Annotation guidelines => Manual annotations (using INCEpTION)
  • Linking of INCEpTION with MiMoTextBase and Wikidata => disambiguation
  • Creation of statements about authors and works (genres, themes, etc.)
  • Machine Learning based on the annotated training data

Modeling in LOD

  • Module 1: Theme
  • Module 2: Space
  • Module 3: Narrative form
  • Module 4: Literary work
  • Module 5: Author
  • Module 6: Mapping
  • Module 7: Referencing
  • Module 8: Versioning & publication
  • Module 9: Terminology
  • Module 10: Bibliography
  • Module 11: Scholarly literature

Example: The module on themes

Linking with Wikidata for ‘federated queries’

Result: The MiMoTextBase

SPARQL endpoint

Example: Epistolarity and Libertinage

The issue: Is there a libertine epistolary novel after 1782?

  • Is there libertinage in the epistolary novel after 1782?
    • van Crugten-André (1997): “the epistolary genre is hardly represented in the libertine novel after Laclos [=1782]”.
    • Benoît Melançon (2004): article on the “late libertine epistolary novel”, which discusses eight libertine epistolary novels published after 1782.
  • Consensus on the fact that the “libertine epistolary novel” is a subgenre
  • But lack of clarity on its exact temporal extension

Clarification? Data from the MiMoTextBase

Query 1: Novels 1782–1800

Query 2: Epistolary novels 1782–1800

Query 3: Libertine novels 1782–1800

Query 4: Libertine epistolary novels 1782–1800

Query 5: Broader topic: libertinage, crime or passion

Query 6: Broader form: described as “lettre(s)”

Outcomes

  • Of course, it is all a question of definitions
  • However, our modeling allows us to ‘modulate’ definitions
    • Epistolarity: semantic category vs. string ‘lettre(s)’
    • Libertinage: libertinage, passion, crime (gallantry?)
  • As a result, 2 or 7 or 10 instances of post-1782 “libertine epistolary novel”
  • In addition: we did find that 2 relevant novels from Melançon were missing in MiMoText
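
How 'modulating' a definition changes the result set can be sketched as follows. This is an illustration on invented toy records, not MiMoTextBase data: the titles, identifiers, field names and the resulting counts are all hypothetical and do not reproduce the figures above. Epistolarity is tested either through a semantic category or, more broadly, through the string ‘lettre(s)’ in the title; libertinage is tested against a narrower or wider set of themes.

```python
import re

# Invented toy records; 'form' plays the role of the semantic category.
novels = [
    {"id": "N1", "year": 1783, "title": "Lettres imaginaires",
     "form": "epistolary", "themes": {"libertinage"}},
    {"id": "N2", "year": 1785, "title": "Lettres d'exemple",
     "form": None, "themes": {"passion"}},
    {"id": "N3", "year": 1790, "title": "Histoire d'exemple",
     "form": "epistolary", "themes": {"crime"}},
    {"id": "N4", "year": 1780, "title": "Lettres anciennes",
     "form": "epistolary", "themes": {"libertinage"}},
    {"id": "N5", "year": 1795, "title": "Mémoires d'exemple",
     "form": None, "themes": {"libertinage"}},
]

def is_epistolary(novel, broad=False):
    """Strict: semantic category. Broad: also accept 'lettre(s)' in the title."""
    if novel["form"] == "epistolary":
        return True
    return broad and re.search(r"\blettres?\b", novel["title"], re.IGNORECASE) is not None

def is_libertine(novel, themes):
    """A novel qualifies if it shares at least one theme with the chosen set."""
    return bool(novel["themes"] & themes)

def select(novels, themes, broad_form=False, start=1782, end=1800):
    """Return the ids of novels matching the chosen definitions and period."""
    return [
        n["id"] for n in novels
        if start <= n["year"] <= end
        and is_epistolary(n, broad=broad_form)
        and is_libertine(n, themes)
    ]

narrow_themes = {"libertinage"}
wide_themes = {"libertinage", "passion", "crime"}

strict = select(novels, narrow_themes)
broad_topic = select(novels, wide_themes)
broad_both = select(novels, wide_themes, broad_form=True)
print(strict, broad_topic, broad_both)
```

The same records yield different answers depending on which definition is switched on, which is the sense in which the answer is 'a question of definitions'.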

Conclusion

Challenges

  • Necessary complexity reduction (triple structure) inherent in modeling
  • Lack of consensus on relevant statement types in the discipline
  • Federation: central part of the vision of LOD; very useful, not easy to achieve
  • Sustainability: our Wikibase vs. Wikidata vs. RDF on Zenodo

Opportunities

  • Linking heterogeneous data from different types of sources
    (e.g. semantic encoding bridges granularity differences)
  • Modeling, collecting and comparing contradictory statements
    and nuanced concepts (we model perspectives, not facts)
  • Transparency in knowledge production (sources of statements are included)
  • Multilingual data for a multilingual world (identifier vs. labels)
  • Linking datasets and reusing existing resources (avoid redundancy)

Outlook: ML+LOD for other domains in the Humanities

Closing word: ML/LLMs and LOD/KGs

  • ML/LLMs and LOD/KGs may appear as polar opposites
    • KGs: Careful and explicit semantic modeling
    • LLMs: Statistical approach to language and knowledge generation
  • Rather, we should see their synergy and complementarity
    • ML/LLMs for semantically-aware knowledge extraction
    • KGs for knowledge representation and querying
    • KG provides cumulated context to any analysis result




Many thanks for your kind attention

Further resources

References


Hinzmann, Maria, Matthias Bremm, Tinghui Duan, Julia Konstanciak, Julia Röttgermann, Moritz Steffes, Christof Schöch, and Joëlle Weis. 2024. “Patterns in modeling and querying a knowledge graph for literary history [preprint].” In Pattern Theory in Language and Communication, edited by Sabine Arndt-Lappe, Milena Belosevic, Peter Maurer, Claudine Moulin, Achim Rettinger, and Sören Stumpf. Trier: TCLC. https://doi.org/10.5281/zenodo.12080340.
Martin, Angus, Vivienne G. Mylne, and Richard Frautschi. 1977. Bibliographie du genre romanesque français, 1751-1800. London: Mansell.
Röttgermann, Julia. 2024. “The Collection of Eighteenth-Century French Novels 1751-1800.” Journal of Open Humanities Data 10 (1): 31. https://doi.org/10.5334/johd.201.
Schöch, Christof. 2013. “Big? Smart? Clean? Messy? Data in the Digital Humanities.” Journal of Digital Humanities 2 (3): 1–19. https://journalofdigitalhumanities.org/2-3/big-smart-clean-messy-data-in-the-humanities/.
Schöch, Christof, Maria Hinzmann, Julia Röttgermann, Katharina Dietz, and Anne Klee. 2022. “Smart Modelling for Literary History.” International Journal of Humanities and Arts Computing 16 (1): 78–93. https://doi.org/10.3366/ijhac.2022.0278.

Appendix: Sample queries

Some sample queries: simple queries

Example queries: visualizations

Sample queries: networked and federated

Sample queries: comparative queries