Mining and Modeling Text:
Leveraging Machine Learning and Linked Open Data
to Investigate the French Enlightenment Novel

Prof. Dr. Christof Schöch
Trier University, Germany

Universität Stuttgart

17 Jan 2025

Introduction: ML + LOD + LitHist

Overview

1 – Introduction: ML + LOD + LitHist
2 – Mining and Modeling Text
3 – Example: Epistolarity and Libertinage
4 – Conclusion

Three Modes of Data (and Digital Humanities)

Qualitative DH
- Data is typically small, curated, heavily annotated, flawless, specialized (‘smart data’)
- Prototype: digital scholarly editions, e.g. Goethe, Faustedition
Quantitative DH
- Data is typically large, scraped, unannotated, with errors and biases, generic (‘big data’)
- Prototype: Topic Modeling on a Hathi Trust corpus
Third way for DH
- Bigger, smarter data in the humanities.
- Not just a compromise: bring scale to the details
- One approach: Machine Learning + Linked Open Data

ML + LOD + Literary History

Machine Learning
- Generate information about authors and works from sources
- Use multiple kinds of sources and ML approaches
Linked Open Data
- Model this information semantically (triples)
- Statements of the form subject-predicate-object
- Merge results into a shared data model
Literary History
- An “atomization” of literary-historical knowledge
- Statements, not facts: generated, sourced, contextual
- Browse, query, visualize the data
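
Illustration: the idea of "statements, not facts" can be sketched in a few lines of Python. This is a minimal sketch, not the MiMoText data model: the class `Statement`, the helper `claims_by_source`, the work "Novel_A" and the source names are all hypothetical and chosen for illustration only.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Statement:
    """One claim in subject-predicate-object form, kept with its source."""
    subject: str
    predicate: str
    obj: str
    source: str  # where the claim comes from (hypothetical names below)


# Hypothetical statements; the work, values and sources are invented.
statements = [
    Statement("Novel_A", "about", "libertinage", "topic model"),
    Statement("Novel_A", "about", "passion", "scholarly literature"),
    Statement("Novel_A", "narrative form", "epistolary", "bibliography"),
]


def claims_by_source(statements, subject, predicate):
    """Collect every value claimed for (subject, predicate), with its sources."""
    claims = defaultdict(set)
    for s in statements:
        if s.subject == subject and s.predicate == predicate:
            claims[s.obj].add(s.source)
    return dict(claims)


themes = claims_by_source(statements, "Novel_A", "about")
print(themes)
```

Because every value stays attached to its source, two sources that disagree about a theme simply appear side by side instead of overwriting each other.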

Mining and Modeling Text

Overview of Mining and Modeling Text

The project “Mining and Modeling Text” (MiMoText) brings together literary history, machine learning and linked open data.
It draws on three different sources of information:
- Bibliographic metadata
- Characteristics of primary texts, in our case based on a corpus of 200 French novels from the period 1751-1800
- Knowledge from literary history, especially overview chapters from handbooks
Work objectives
- Automatically extract relevant information from these sources
- Model this information as LOD and interlink it as much as possible
- Analyze this information to learn more about the literature of the period
- This means transforming heterogeneous sources into a varied but compatible data set
- It also means modeling everything explicitly, which is where controlled vocabularies, taxonomies and ontologies, and their implementation and use, come into play

Pillar 1: Bibliographie du genre romanesque français

Pillar 2: Primary literature (200 novels, 1751–1800)

Partly: double-keying, model for OCR4all, full text
Encoding: in XML-TEI, with metadata, according to ELTeC schema
Methods of analysis: Topic modeling, NER, sentiment, stylometry, etc.

Pillar 3: Scholarly Literature

Corpus of chapters from literary histories about the French Eighteenth-Century novel
Annotation guidelines => Manual annotations (using INCEpTION)
Linking of INCEpTION with MiMoTextBase and Wikidata => disambiguation
Creation of statements about authors and works (genres, themes, etc.)
Machine Learning based on the annotated training data

Modeling in LOD

Module 1: Theme
Module 2: Space
Module 3: Narrative form
Module 4: Literary work
Module 5: Author
Module 6: Mapping
Module 7: Referencing
Module 8: Versioning & publication
Module 9: Terminology
Module 10: Bibliography
Module 11: Scholarly literature

Example: The module on themes

Result: The MiMoTextBase

SPARQL endpoint

Linking with Wikidata for ‘federated queries’

Query with federation: MiMoText + Wikidata
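
Illustration: in SPARQL, federation is expressed with the SERVICE keyword, which sends part of a query to a second endpoint. The underlying idea, following a shared identifier from one graph into another, can be sketched in plain Python. All identifiers, predicates and values below are hypothetical; the second list merely stands in for an external graph such as Wikidata.

```python
# Local knowledge graph as (subject, predicate, object) triples.
# All identifiers and values are hypothetical.
local_triples = [
    ("work:1", "author", "author:1"),
    ("work:2", "author", "author:2"),
    ("author:1", "same as", "ext:Q_A"),
    ("author:2", "same as", "ext:Q_B"),
]

# External knowledge graph, standing in for Wikidata.
external_triples = [
    ("ext:Q_A", "place of birth", "Paris"),
    ("ext:Q_B", "place of birth", "Amiens"),
]


def objects(triples, subject, predicate):
    """Return all objects of triples matching the subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def birthplaces_of_authors(local, external):
    """For each work: author -> external identifier -> place of birth."""
    result = {}
    for work, predicate, author in local:
        if predicate != "author":
            continue
        for ext_id in objects(local, author, "same as"):
            for place in objects(external, ext_id, "place of birth"):
                result[work] = place
    return result


places = birthplaces_of_authors(local_triples, external_triples)
print(places)
```

The join only works because the local graph stores the external identifier of each author, which is what the linking with Wikidata provides.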

Example: Epistolarity and Libertinage

The issue: Is there a libertine epistolary novel after 1782?

Is there libertinage in the epistolary novel after 1782?
- van Crugten-André (1997): “the epistolary genre is hardly represented in the libertine novel after Laclos [=1782]”.
- Benoît Melançon (2004): article on the “late libertine epistolary novel”, which discusses 8 libertine epistolary novels published after 1782.
Consensus on the privileged relationship between libertinage and epistolarity
But lack of clarity on its exact temporal extension

Clarification? Data from the MiMoTextBase

Query 1: Novels 1782–1800

Query 2: Epistolary novels 1782–1800

Query 3: Libertine novels 1782–1800

Query 4: Libertine epistolary novels 1782–1800

Query 5: Broader topic: libertinage, eroticism or passion

Query 6: Broader form: described as “lettre(s)”

Outcomes

Of course, it is all a question of definitions
However, our modeling allows us to ‘modulate’ definitions
- Epistolarity: semantic category vs. string ‘lettre(s)’
- Libertinage: libertinage, eroticism, passion (crime? gallantry?)
Depending on the definition chosen: 2, 5 or 8 instances of the post-1782 “libertine epistolary novel”
In addition: 2 relevant novels discussed by Melançon turned out to be missing from MiMoText
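
Illustration: "modulating" a definition amounts to changing the filter conditions of a query while leaving the data untouched. The following Python sketch uses four invented novels and hypothetical helper functions; its counts do not reproduce the figures from the MiMoTextBase.

```python
# Invented toy records; titles, years, forms and themes are hypothetical.
novels = [
    {"title": "Novel A", "year": 1784, "form": "epistolary",
     "themes": {"libertinage"}},
    {"title": "Lettres de B", "year": 1790, "form": None,
     "themes": {"eroticism"}},
    {"title": "Novel C", "year": 1795, "form": "epistolary",
     "themes": {"passion"}},
    {"title": "Novel D", "year": 1779, "form": "epistolary",
     "themes": {"libertinage"}},
]

NARROW_THEMES = {"libertinage"}
BROAD_THEMES = {"libertinage", "eroticism", "passion"}


def is_epistolary(novel, broad=False):
    """Narrow: semantic category. Broad: also the string 'lettre' in the title."""
    if novel["form"] == "epistolary":
        return True
    return broad and "lettre" in novel["title"].lower()


def count_matches(novels, themes, broad_form, start=1782, end=1800):
    """Count novels in the period that satisfy both definitions."""
    return sum(
        1
        for n in novels
        if start <= n["year"] <= end
        and is_epistolary(n, broad_form)
        and n["themes"] & themes
    )


narrow = count_matches(novels, NARROW_THEMES, broad_form=False)
broad_theme = count_matches(novels, BROAD_THEMES, broad_form=False)
broad_both = count_matches(novels, BROAD_THEMES, broad_form=True)
print(narrow, broad_theme, broad_both)
```

Each widening of a definition adds candidates, so the answer to "how many?" is always relative to an explicit, inspectable choice.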

Conclusion

Challenges

Necessary complexity reduction (triple structure) inherent in modeling
Lack of consensus on relevant statement types in the discipline
Federation: central part of the vision of LOD; very useful, not easy to achieve
Sustainability: our Wikibase vs. Wikidata vs. RDF on Zenodo

Opportunities

Linking heterogeneous data from different types of sources
(e.g. semantic encoding bridges granularity differences)
Modeling, collecting and comparing contradictory statements
and nuanced concepts (we model perspectives, not facts)
Transparency in knowledge production (sources of statements are included)
Multilingual data for a multilingual world (identifier vs. labels)
Linking datasets and reusing existing resources (avoid redundancy)

Outlook: ML+LOD for other domains in the Humanities

Korrespondenzen der Frühromantik (TCDH)
Mapping Digital Humanities (DH)
Historical Wine Labels (LODinG)

Closing word: ML/LLMs and LOD/KGs

ML/LLMs and LOD/KGs may appear as polar opposites
- KGs: Careful and explicit semantic modeling
- LLMs: Statistical approach to language and knowledge generation
Rather, we should see their synergy and complementarity
- ML/LLMs for semantically-aware knowledge extraction
- KGs for knowledge representation and querying
- KGs provide cumulative context for any analysis

Many thanks for your kind attention

Further resources

Tutorial: https://docs.mimotext.uni-trier.de
SPARQL endpoint: https://query.mimotext.uni-trier.de
MiMoTextBase: https://data.mimotext.uni-trier.de
MiMoText Ontology: https://github.com/MiMoText/ontology
Reference publication: ‘Smart Modelling for Literary History’, 2022.

References

Hinzmann, Maria, Matthias Bremm, Tinghui Duan, Julia Konstanciak, Julia Röttgermann, Moritz Steffes, Christof Schöch, and Joëlle Weis. 2024. “Patterns in modeling and querying a knowledge graph for literary history [preprint].” In Pattern Theory in Language and Communication, edited by Sabine Arndt-Lappe, Milena Belosevic, Peter Maurer, Claudine Moulin, Achim Rettinger, and Sören Stumpf. Trier: TCLC. https://doi.org/10.5281/zenodo.12080340.

Martin, Angus, Vivienne G. Mylne, and Richard Frautschi. 1977. Bibliographie du genre romanesque français, 1751-1800. Mansell.

Röttgermann, Julia. 2024. “The Collection of Eighteenth-Century French Novels 1751-1800.” Journal of Open Humanities Data 10 (1): 31. https://doi.org/10.5334/johd.201.

Schöch, Christof. 2013. “Big? Smart? Clean? Messy? Data in the Digital Humanities.” Journal of Digital Humanities 2 (3): 1–19. https://journalofdigitalhumanities.org/2-3/big-smart-clean-messy-data-in-the-humanities/.

Schöch, Christof, Maria Hinzmann, Julia Röttgermann, Katharina Dietz, and Anne Klee. 2022. “Smart Modelling for Literary History.” International Journal of Humanities and Arts Computing 16 (1): 78–93. https://doi.org/10.3366/ijhac.2022.0278.

Appendix: Bonus slides and sample queries

Literary History 101

Goals of literary history
- Collect and document knowledge about literature over time
- Contextualize literature, document reception of literature
- Provide explanations for the ‘evolution’ of literature
Organizational principles
- Nations, periods, movements/currents, genres, authors
- Authors and works: themes, forms, characters, setting, plot, etc.
- Similarities and differences, continuities and change

Machine Learning for Literary History

Fundamentally, ML involves detecting relations between simple and complex properties of texts
- Simple properties (features): directly observable in texts (e.g. word forms)
- Complex phenomena (classes): relevant to our research (e.g. direct speech or themes)
We use this approach primarily for literary information retrieval
- We may annotate part of our texts and train a new model
- Or we may use an existing model (or unsupervised approaches)
- Evaluate the performance of the model on the annotated data
- And then identify complex phenomena in unannotated text
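
Illustration: the workflow above (annotate, train, evaluate, apply) can be sketched with a tiny Naive Bayes classifier in plain Python. This is a toy example under stated assumptions, not the project's pipeline: the sentences, the labels "speech" and "narration", and the functions `features`, `train` and `predict` are all invented for illustration.

```python
import math
from collections import Counter


def features(sentence):
    """Simple observable properties: lower-cased word forms and some marks."""
    tokens = []
    for raw in sentence.lower().split():
        word = raw.strip('.,;:!?«»"')
        if word:
            tokens.append(word)
        tokens.extend(ch for ch in raw if ch in '«»!?"')
    return tokens


def train(examples):
    """Count features per class (multinomial Naive Bayes)."""
    word_counts, class_counts = {}, Counter()
    for sentence, label in examples:
        class_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(features(sentence))
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, class_counts, vocab


def predict(model, sentence):
    """Return the class with the highest add-one-smoothed log probability."""
    word_counts, class_counts, vocab = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        score = math.log(class_counts[label] / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for tok in features(sentence):
            score += math.log((word_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented annotated sentences: one part for training, one for evaluation.
training = [
    ("« Venez ici ! » dit-elle.", "speech"),
    ("« Que voulez-vous ? » demanda le comte.", "speech"),
    ("Le comte quitta la ville le lendemain.", "narration"),
    ("Elle resta seule dans le jardin.", "narration"),
]
held_out = [
    ("« Partez ! » dit le comte.", "speech"),
    ("La ville resta calme.", "narration"),
]

model = train(training)
correct = sum(predict(model, s) == label for s, label in held_out)
accuracy = correct / len(held_out)
new_label = predict(model, "« Restez ici ! » dit-elle.")  # unannotated text
print(accuracy, new_label)
```

With so little data the evaluation is of course not meaningful; the point is only the order of the steps: train on annotated text, evaluate on held-out annotations, then label unannotated text.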

Linked Open Data for Literary History

How does “Linked Open Data” actually work?
There are subjects, predicates and objects
Subjects are entities such as authors or works
Objects can be other entities (authors, works), but also topics, locations, values, strings, etc.
Predicates link subject and object, like a verb
This results in very simple statements, so-called “triples”
However, they are modeled “semantically”, which means the meaning of all classes and predicates is formally modeled, and usually linked to authority files.
And this also means it is inherently multilingual: every class, every property has an identifier, but also labels in multiple languages.
The potential of LOD only unfolds with a large number of simple statements
This seemingly simple structure is very powerful, because it is so generic
To illustrate this power, one could liken it to chess: simple rules with lots of possibilities
All the information we extract from our texts through text and data mining is expressed in this way.
A knowledge graph is created, which may become part of the so-called “semantic web”
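
Illustration: a minimal Python sketch of triples built from language-independent identifiers, with labels in several languages kept separately. The identifiers (Q1, P1, ...), the labels and the functions `match` and `render` are hypothetical and do not correspond to actual MiMoTextBase or Wikidata entries.

```python
# Triples use language-independent identifiers (all hypothetical).
triples = [
    ("Q1", "P1", "Q10"),  # work Q1 -- author --> person Q10
    ("Q1", "P2", "Q20"),  # work Q1 -- about --> theme Q20
    ("Q2", "P2", "Q20"),  # work Q2 -- about --> theme Q20
]

# Labels are attached to identifiers, one per language.
labels = {
    "Q1": {"en": "Novel A", "fr": "Roman A"},
    "Q2": {"en": "Novel B", "fr": "Roman B"},
    "Q10": {"en": "Author X", "fr": "Auteur X"},
    "Q20": {"en": "libertinage", "fr": "libertinage"},
    "P1": {"en": "author", "fr": "auteur"},
    "P2": {"en": "about", "fr": "a pour thème"},
}


def match(triples, s=None, p=None, o=None):
    """Return triples matching a pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def render(triple, lang):
    """Show a triple with labels in one language, falling back to the ID."""
    return " ".join(labels.get(x, {}).get(lang, x) for x in triple)


about_theme = match(triples, p="P2", o="Q20")
for t in about_theme:
    print(render(t, "en"), "|", render(t, "fr"))
```

The same three identifiers yield an English or a French reading depending only on the label lookup, which is what makes the data multilingual without duplicating any statement.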