Mining and Modeling Text:
Leveraging Machine Learning and Linked Open Data
to Investigate the French Enlightenment Novel




Prof. Dr. Christof Schöch
Trier University, Germany

Universität Stuttgart

17 Jan 2025



Introduction: ML + LOD + LitHist







Overview

  • 1 – Introduction: ML + LOD + LitHist
  • 2 – Mining and Modeling Text
  • 3 – Example: Epistolarity and Libertinage
  • 4 – Conclusion

Three Modes of Data (and Digital Humanities)

  • Qualitative DH
    • Data is typically small, curated, heavily annotated, flawless, specialized (‘smart data’)
    • Prototype: digital scholarly editions, e.g. Goethe, Faustedition
  • Quantitative DH
    • Data is typically large, scraped, unannotated, with errors and biases, generic (‘big data’)
    • Prototype: Topic Modeling on a Hathi Trust corpus
  • Third way for DH
    • Bigger, smarter data in the humanities
    • Not just a compromise: bring scale to the details
    • One approach: Machine Learning + Linked Open Data

ML + LOD + Literary History

  • Machine Learning
    • Generate information about authors and works from sources
    • Use multiple kinds of sources and ML approaches
  • Linked Open Data
    • Model this information semantically (triples)
    • Statements of the form subject-predicate-object
    • Merge results into a shared data model
  • Literary History
    • An “atomization” of the historical knowledge
    • Statements, not facts: generated, sourced, contextual
    • Browse, query, visualize the data
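
A minimal sketch of the subject-predicate-object idea, assuming plain Python tuples stand in for RDF triples. All identifiers (`ex:novel_1`, `ex:about`, ...) and source names are hypothetical and not taken from the MiMoTextBase; the point is only to show how statements from different sources can be merged into one shared collection and then queried by pattern.

```python
from typing import NamedTuple

class Statement(NamedTuple):
    """One subject-predicate-object statement plus the source that generated it."""
    subject: str
    predicate: str
    obj: str
    source: str

# Hypothetical statements from two kinds of sources (all identifiers are invented).
from_bibliography = [
    Statement("ex:novel_1", "ex:narrative_form", "ex:epistolary", "bibliography"),
]
from_topic_model = [
    Statement("ex:novel_1", "ex:about", "ex:travel", "topic_model"),
    Statement("ex:novel_2", "ex:about", "ex:love", "topic_model"),
]

def merge(*batches):
    """Merge batches into one shared list, dropping exact duplicates, keeping order."""
    seen = set()
    merged = []
    for batch in batches:
        for st in batch:
            if st not in seen:
                seen.add(st)
                merged.append(st)
    return merged

def match(statements, subject=None, predicate=None, obj=None):
    """Return statements matching a pattern; None acts as a wildcard."""
    return [
        st for st in statements
        if (subject is None or st.subject == subject)
        and (predicate is None or st.predicate == predicate)
        and (obj is None or st.obj == obj)
    ]

# The topic-model batch is passed twice to show that duplicates are dropped.
graph = merge(from_bibliography, from_topic_model, from_topic_model)
about_novel_1 = match(graph, subject="ex:novel_1")
```

Because every statement carries its source, the merged collection stays a set of sourced statements rather than a list of bare facts.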

Mining and Modeling Text

Overview of Mining and Modeling Text

Pillar 1: Bibliographie du genre romanesque français

Pillar 2: Primary literature (200 novels, 1750–1800)

  • Full text: obtained partly by double-keying, partly by OCR with a model for OCR4all
  • Encoding: in XML-TEI, with metadata, according to ELTeC schema
  • Methods of analysis: Topic modeling, NER, sentiment, stylometry, etc.

Pillar 3: Scholarly Literature

  • Corpus of chapters from literary histories about the French Eighteenth-Century novel
  • Annotation guidelines => Manual annotations (using INCEpTION)
  • Linking of INCEpTION with MiMoTextBase and Wikidata => disambiguation
  • Creation of statements about authors and works (genres, themes, etc.)
  • Machine Learning based on the annotated training data

Modeling in LOD

  • Module 1: Theme
  • Module 2: Space
  • Module 3: Narrative form
  • Module 4: Literary work
  • Module 5: Author
  • Module 6: Mapping
  • Module 7: Referencing
  • Module 8: Versioning & publication
  • Module 9: Terminology
  • Module 10: Bibliography
  • Module 11: Scholarly literature

Example: The module on themes

Result: The MiMoTextBase

SPARQL endpoint

Linking with Wikidata for ‘federated queries’

Query with federation: MiMoTextBase + Wikidata
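
A hedged sketch of what a federated query can look like, built here as a plain string in Python and not sent to any server. The Wikidata endpoint URL and `wdt:P569` (date of birth) are real; the local endpoint URL, the `ex:` prefix, `ex:Author` and `ex:exactMatch` are hypothetical stand-ins and do not reproduce the actual MiMoTextBase data model.

```python
# Wikidata's public SPARQL endpoint (real).
WIKIDATA_ENDPOINT = "https://query.wikidata.org/sparql"
# Hypothetical placeholder: the query would be sent to the local endpoint.
LOCAL_ENDPOINT = "https://example.org/sparql"

def build_federated_query(remote_endpoint: str, limit: int = 10) -> str:
    """Build a SPARQL query joining local author items with Wikidata birth dates.

    ex:Author and ex:exactMatch are invented stand-ins for the local data model.
    wdt:P569 is Wikidata's "date of birth" property.
    """
    return f"""
PREFIX ex: <https://example.org/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>

SELECT ?author ?wikidataItem ?birthDate WHERE {{
  # Local part: authors and their link to the matching Wikidata item
  ?author a ex:Author ;
          ex:exactMatch ?wikidataItem .
  # Remote part: evaluated by the remote endpoint
  SERVICE <{remote_endpoint}> {{
    ?wikidataItem wdt:P569 ?birthDate .
  }}
}}
LIMIT {limit}
""".strip()

query = build_federated_query(WIKIDATA_ENDPOINT, limit=5)
print(query)
```

The `SERVICE` clause is what makes the query federated: the outer pattern is matched against the local knowledge graph, the inner one against the remote graph, and the shared variable `?wikidataItem` joins the two.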

Example: Epistolarity and Libertinage

The issue: Is there a libertine epistolary novel after 1782?

  • Is there libertinage in the epistolary novel after 1782?
    • van Crugten-André (1997): “the epistolary genre is hardly represented in the libertine novel after Laclos [=1782]”.
    • Benoît Melançon (2004): article on the “late libertine epistolary novel”, which discusses 8 libertine epistolary novels published after 1782.
  • Consensus on the privileged relationship between libertinage and epistolarity
  • But lack of clarity on its exact temporal extension

Clarification? Data from the MiMoTextBase

Query 1: Novels 1782–1800

Query 2: Epistolary novels 1782–1800

Query 3: Libertine novels 1782–1800

Query 4: Libertine epistolary novels 1782–1800

Query 5: Broader topic: libertinage, eroticism or passion

Query 6: Broader form: described as “lettre(s)”

Outcomes

  • Of course, it is all a question of definitions
  • However, our modeling allows us to ‘modulate’ definitions
    • Epistolarity: semantic category vs. string ‘lettre(s)’
    • Libertinage: libertinage, eroticism, passion (crime? gallantry?)
  • As a result: 2, 5 or 8 instances of the post-1782 “libertine epistolary novel”, depending on the definitions chosen
  • In addition: 2 relevant novels discussed by Melançon turned out to be missing from MiMoText
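
The idea of 'modulating' definitions can be sketched with a few toy records. Every title, year and tag below is invented for illustration, so the resulting counts say nothing about the MiMoTextBase; the sketch only shows how a narrow and a broad definition of form and theme select different sets of novels from the same data.

```python
# Toy records: every title, year and tag below is invented for illustration.
novels = [
    {"title": "Novel A", "year": 1784, "form": "epistolary", "themes": {"libertinage"}},
    {"title": "Lettres de B", "year": 1790, "form": None, "themes": {"eroticism"}},
    {"title": "Novel C", "year": 1795, "form": "epistolary", "themes": {"passion"}},
    {"title": "Novel D", "year": 1779, "form": "epistolary", "themes": {"libertinage"}},
    {"title": "Novel E", "year": 1788, "form": "memoir", "themes": {"libertinage"}},
]

NARROW_THEMES = {"libertinage"}
BROAD_THEMES = {"libertinage", "eroticism", "passion"}

def is_epistolary(novel, broad=False):
    """Narrow: semantic category only. Broad: also accept 'lettre' in the title."""
    if novel["form"] == "epistolary":
        return True
    return broad and "lettre" in novel["title"].lower()

def select(novels, themes, broad_form=False, start=1782, end=1800):
    """Titles in the period that match the chosen definitions of form and theme."""
    return [
        n["title"] for n in novels
        if start <= n["year"] <= end
        and is_epistolary(n, broad=broad_form)
        and n["themes"] & themes
    ]

narrow = select(novels, NARROW_THEMES)                      # narrow theme, narrow form
broad_theme = select(novels, BROAD_THEMES)                  # broad theme, narrow form
broad_both = select(novels, BROAD_THEMES, broad_form=True)  # broad theme, broad form
```

Each widening of a definition can only add novels to the selection, which is why the answer to the research question comes as a range and not as a single number.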

Conclusion

Challenges

  • Necessary complexity reduction (triple structure) inherent in modeling
  • Lack of consensus on relevant statement types in the discipline
  • Federation: central part of the vision of LOD; very useful, not easy to achieve
  • Sustainability: our Wikibase vs. Wikidata vs. RDF on Zenodo

Opportunities

  • Linking heterogeneous data from different types of sources
    (e.g. semantic encoding bridges granularity differences)
  • Modeling, collecting and comparing contradictory statements
    and nuanced concepts (we model perspectives, not facts)
  • Transparency in knowledge production (sources of statements are included)
  • Multilingual data for a multilingual world (identifier vs. labels)
  • Linking datasets and reusing existing resources (avoid redundancy)
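
Two of these opportunities, sketched under stated assumptions: statements are kept together with their sources so that diverging claims can be collected and compared, and labels are attached per language to language-neutral identifiers. All identifiers, sources and labels are invented for illustration.

```python
from collections import defaultdict

# Invented statements: (subject, predicate, object, source).
statements = [
    ("ex:novel_1", "ex:genre", "ex:libertine_novel", "literary_history_X"),
    ("ex:novel_1", "ex:genre", "ex:sentimental_novel", "literary_history_Y"),
    ("ex:novel_1", "ex:narrative_form", "ex:epistolary", "bibliography"),
    ("ex:novel_1", "ex:narrative_form", "ex:epistolary", "literary_history_X"),
]

# Identifiers are language-neutral; labels are attached per language.
labels = {
    "ex:libertine_novel": {"en": "libertine novel", "fr": "roman libertin"},
    "ex:sentimental_novel": {"en": "sentimental novel", "fr": "roman sentimental"},
}

def diverging_statements(statements):
    """Group by (subject, predicate); keep groups where sources give different objects."""
    grouped = defaultdict(lambda: defaultdict(list))
    for subj, pred, obj, source in statements:
        grouped[(subj, pred)][obj].append(source)
    return {
        key: dict(by_obj)
        for key, by_obj in grouped.items()
        if len(by_obj) > 1
    }

def label(identifier, lang):
    """Look up a label, falling back to the bare identifier."""
    return labels.get(identifier, {}).get(lang, identifier)

conflicts = diverging_statements(statements)
for (subj, pred), by_obj in conflicts.items():
    for obj, sources in by_obj.items():
        print(subj, pred, label(obj, "fr"), "according to", ", ".join(sources))
```

Nothing is resolved or discarded here: both genre attributions remain in the data, each traceable to the source that made it.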

Outlook: ML+LOD for other domains in the Humanities

  • Korrespondenzen der Frühromantik (TCDH)
  • Mapping Digital Humanities (DH)
  • Historical Wine Labels (LODinG)

Closing word: ML/LLMs and LOD/KGs

  • ML/LLMs and LOD/KGs may appear as polar opposites
    • KGs: Careful and explicit semantic modeling
    • LLMs: Statistical approach to language and knowledge generation
  • Rather, we should see their synergy and complementarity
    • ML/LLMs for semantically-aware knowledge extraction
    • KGs for knowledge representation and querying
    • KG provides cumulative context to any analysis




Many thanks for your kind attention

Further resources

References


Hinzmann, Maria, Matthias Bremm, Tinghui Duan, Julia Konstanciak, Julia Röttgermann, Moritz Steffes, Christof Schöch, and Joëlle Weis. 2024. “Patterns in modeling and querying a knowledge graph for literary history [preprint].” In Pattern Theory in Language and Communication, edited by Sabine Arndt-Lappe, Milena Belosevic, Peter Maurer, Claudine Moulin, Achim Rettinger, and Sören Stumpf. Trier: TCLC. https://doi.org/10.5281/zenodo.12080340.
Martin, Angus, Vivienne G. Mylne, and Richard Frautschi. 1977. Bibliographie du genre romanesque français, 1751-1800. Mansell.
Röttgermann, Julia. 2024. “The Collection of Eighteenth-Century French Novels 1751-1800.” Journal of Open Humanities Data 10 (1): 31. https://doi.org/10.5334/johd.201.
Schöch, Christof. 2013. “Big? Smart? Clean? Messy? Data in the Humanities.” Journal of Digital Humanities 2 (3): 1–19. https://journalofdigitalhumanities.org/2-3/big-smart-clean-messy-data-in-the-humanities/.
Schöch, Christof, Maria Hinzmann, Julia Röttgermann, Katharina Dietz, and Anne Klee. 2022. “Smart Modelling for Literary History.” International Journal of Humanities and Arts Computing 16 (1): 78–93. https://doi.org/10.3366/ijhac.2022.0278.

Appendix: Bonus slides and sample queries

Literary History 101

  • Goals of literary history
    • Collect and document knowledge about literature over time
    • Contextualize literature, document reception of literature
    • Provide explanations for the ‘evolution’ of literature
  • Organizational principles
    • Nations, periods, movements/currents, genres, authors
    • Authors and works: themes, forms, characters, setting, plot, etc.
    • Similarities and differences, continuities and change

Machine Learning for Literary History

  • Fundamentally, ML involves detecting relations between simple and complex properties of texts
    • Simple properties (features): directly observable in texts (e.g. word forms)
    • Complex phenomena (classes): relevant to our research questions (e.g. direct speech or themes)
  • We use this approach primarily for literary information retrieval
    • We may annotate part of our texts and train a new model
    • Or we may use an existing model (or unsupervised approaches)
    • Evaluate the performance of the model on the annotated data
    • And then identify complex phenomena in unannotated text
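
The workflow above (annotate, train, evaluate, apply) can be sketched in a few lines. This is a toy illustration only: the sentences and theme labels are invented, and Naive Bayes is chosen for brevity, not because it is the method used in the project.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    """Simple properties (features): lowercased word forms."""
    return text.lower().split()

def train(examples):
    """Train a multinomial Naive Bayes model from (text, label) pairs."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, label_counts, vocab

def predict(model, text):
    """Return the most probable label (class), using add-one smoothing."""
    word_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    scores = {}
    for label in label_counts:
        n_words = sum(word_counts[label].values())
        score = math.log(label_counts[label] / total)
        for w in tokenize(text):
            score += math.log((word_counts[label][w] + 1) / (n_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

# Step 1: a small, manually annotated sample (invented sentences and labels).
annotated = [
    ("the ship sailed across the sea", "travel"),
    ("we left the harbour at dawn", "travel"),
    ("the road to paris was long", "travel"),
    ("her heart was full of tender love", "love"),
    ("he wrote of his passion and love", "love"),
    ("a tender kiss sealed their love", "love"),
]
train_set = annotated[:2] + annotated[3:5]
held_out = [annotated[2], annotated[5]]

# Step 2: train a model on part of the annotated data.
model = train(train_set)

# Step 3: evaluate on annotated data the model has not seen.
correct = sum(predict(model, text) == label for text, label in held_out)
accuracy = correct / len(held_out)

# Step 4: apply the model to unannotated text.
predicted_theme = predict(model, "they sailed at dawn")
```

With real corpora, the held-out evaluation in step 3 is what tells us how far the labels assigned in step 4 can be trusted before they are turned into statements.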

Linked Open Data for Literary History

Some sample queries: simple queries

Example queries: visualizations

Sample queries: networked and federated

Sample queries: comparative queries