Mining and Modeling Text: Leveraging Machine Learning and Linked Open Data to Investigate the French Enlightenment Novel

Prof. Dr. Christof Schöch
Trier University, Germany

Institute for Georgian Literature – Tbilisi State University – Georgia

13 Mar 2025

Introduction: ML + LOD + LitHist = ?

Overview

1 – Introduction: ML + LOD + LitHist = ?
2 – Mining and Modeling Text
3 – Example: Epistolarity and Libertinage
4 – Conclusion
5 – Annex: Sample Queries

Three Modes of Data (and Digital Humanities)

Qualitative DH
- Datasets are typically small, curated, heavily annotated, flawless, specialized (‘smart data’)
- Prototype: digital scholarly editions, e.g. Goethe, Faust Edition
Quantitative DH
- Datasets are typically large, scraped, unannotated, with errors and biases, generic (‘big data’)
- Prototype: Topic Modeling on a Project Gutenberg corpus
Third way for DH
- Bigger, smarter data in the humanities.
- One approach: Machine Learning + Linked Open Data

Literary History 101

Goals of literary history
- Collect and document knowledge about literature over time
- Contextualize literature, document reception of literature
- Provide explanations for the ‘evolution’ of literature
Organizational principles
- Nations, periods, movements/currents, genres, authors
- Authors and works: themes, forms, setting, plot, etc.
- Similarities and differences, continuities and change

Machine Learning for Literary History

Fundamentally, ML involves detecting relations between features and labels
- Features: simple properties that we can observe in texts (e.g.: word forms)
- Labels: more complex phenomena that are relevant to our research (e.g.: direct speech or themes)
We use this approach primarily for literary information retrieval
- We start with a text collection
- We may annotate part of the data and train a new model
- Or we may use an existing model (or unsupervised approaches)
- Evaluate the performance of the model on the annotated data
- And then derive labels from unannotated text
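
The workflow above (annotate, train, evaluate, then label unannotated text) can be sketched in a few lines. This is a minimal illustration and not the project's actual pipeline: the French sentences are invented, the two labels are placeholders, and the tiny Naive Bayes classifier merely stands in for whatever model one would really use.

```python
import math
from collections import Counter


def features(sentence):
    """Features: lower-cased word forms, with quotation marks and commas as separate tokens."""
    text = sentence.lower()
    for mark in ("«", "»", ","):
        text = text.replace(mark, f" {mark} ")
    return text.split()


def train(annotated):
    """Count how often each feature co-occurs with each label."""
    counts, totals = {}, Counter()
    for sentence, label in annotated:
        counts.setdefault(label, Counter()).update(features(sentence))
        totals[label] += 1
    return counts, totals


def predict(model, sentence):
    """Return the most probable label (Naive Bayes with add-one smoothing)."""
    counts, totals = model
    vocabulary = {word for c in counts.values() for word in c}
    best_label, best_score = None, -math.inf
    for label in sorted(counts):
        n_tokens = sum(counts[label].values())
        score = math.log(totals[label] / sum(totals.values()))
        for word in features(sentence):
            score += math.log((counts[label][word] + 1) / (n_tokens + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented, manually "annotated" examples (assumption: two labels only).
training_data = [
    ("« Je vous aime », dit-elle.", "direct_speech"),
    ("« Venez demain », répondit le comte.", "direct_speech"),
    ("Elle quitta la ville au matin.", "narration"),
    ("Le comte écrivit une longue lettre.", "narration"),
]
evaluation_data = [
    ("« Partez », dit le comte.", "direct_speech"),
    ("Elle écrivit au matin.", "narration"),
]

model = train(training_data)

# Evaluate on held-out annotated data before trusting the model.
correct = sum(predict(model, s) == gold for s, gold in evaluation_data)
accuracy = correct / len(evaluation_data)

# Only then derive labels for unannotated text.
new_label = predict(model, "« Adieu », dit-elle.")
print(accuracy, new_label)
```

In practice the annotated sample would be far larger, and the evaluation would report precision and recall per label rather than a single accuracy figure.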

What is Linked Open Data?

How does “Linked Open Data” actually work?
There are subjects, predicates and objects
Subjects are entities such as authors or works
Objects are other entities, so authors and works; but also topics, locations, values, strings, etc.
Predicates link subject and object, like a verb
This results in very simple statements, so-called “triples”
However, they are modeled “semantically”, which means the meaning of all classes and predicates is formally modeled, and usually linked to authority files.
And this also means it is inherently multilingual: every class, every property has an identifier, but also labels in multiple languages.
The potential of LOD only unfolds with a large number of simple statements
This seemingly simple structure is very powerful, because it is so generic
To illustrate this power, one could liken it to chess: simple rules with lots of possibilities
All the information we identify, using text and data mining, from our texts, is expressed in this way.
A knowledge graph is created, which may become part of the so-called “semantic web”
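
To make the triple structure concrete, here is a minimal sketch of a triple store in plain Python. All identifiers (`work:1`, `prop:author`, and so on) and all labels are hypothetical placeholders, not actual MiMoTextBase or Wikidata identifiers; a real knowledge graph would use stable URIs and an RDF library or a Wikibase instance.

```python
# Each statement is a (subject, predicate, object) triple of identifiers.
triples = {
    ("work:1", "prop:author", "person:1"),
    ("work:1", "prop:about", "theme:1"),
    ("work:1", "prop:narrative_form", "form:1"),
    ("work:2", "prop:author", "person:1"),
    ("work:2", "prop:about", "theme:2"),
}

# Identifiers are language-neutral; human-readable labels exist per language.
labels = {
    "prop:about": {"en": "about", "fr": "thème", "de": "Thema"},
    "theme:1": {"en": "travel", "fr": "voyage", "de": "Reise"},
    "theme:2": {"en": "education", "fr": "éducation", "de": "Bildung"},
}


def match(pattern, store):
    """Return all triples fitting a pattern; None acts as a wildcard."""
    return sorted(
        triple
        for triple in store
        if all(p is None or p == value for p, value in zip(pattern, triple))
    )


def label(identifier, lang):
    """Look up a label in the given language, falling back to the identifier."""
    return labels.get(identifier, {}).get(lang, identifier)


# Everything stated about one work:
about_work_1 = match(("work:1", None, None), triples)

# All works by one author, found by following a single predicate:
works_by_person_1 = [s for s, _, _ in match((None, "prop:author", "person:1"), triples)]

# The same statement rendered in two languages:
for lang in ("en", "fr"):
    print(label("prop:about", lang), "->", label("theme:1", lang))
```

The generic pattern matcher is the point of the sketch: one simple structure supports very different questions, which is why many small statements add up to something powerful.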

ML + LOD + Literary History

Machine Learning:
- Generate information about literary history from texts
- Use various sources and methods
Linked Open Data:
- Model this information semantically (triples)
- Merge multiple results into a shared data model
Literary History:
- Browse, query, visualize the data
- “atomization” of the historical knowledge
- but: generated, sourced, contextual

Mining and Modeling Text

Overview of Mining and Modeling Text

“Mining and Modeling Text” brought literary history, machine learning and linked open data together.
We use three different sources of information for this purpose
- Bibliographic metadata
- Characteristics of primary texts, in our case based on a corpus of 200 French novels from the period 1751-1800
- Knowledge from literary history, especially overview chapters from handbooks
Work objectives
- Automatically extract relevant information from these sources
- Model this information as LOD and link it together as much as possible
- Analyze this information to learn more about the literature of the period
- This also means transforming heterogeneous sources into a varied, but compatible data set
- It also means explicitly modeling everything, and this is where controlled vocabularies, taxonomies, ontologies and their implementation and use come into play.

Pillar 1: Bibliographie du genre romanesque français

Pillar 2: primary literature (novels)

Corpus of 200 French novels (1751-1800)
Encoding: in XML-TEI, with metadata, according to ELTeC schema
Methods of analysis: Topic modeling, NER, stylometry, etc.

Pillar 3: Scholarly Literature

Corpus of chapters from literary histories about the French Eighteenth-Century novel
Annotation guidelines => Manual annotations (using INCEpTION)
Linking of INCEpTION with MiMoTextBase and Wikidata => disambiguation
Creation of statements about authors and works (genres, themes, etc.)
Machine Learning based on the annotated training data

Modeling in LOD

Module 1: Theme
Module 2: Space
Module 3: Narrative form
Module 4: Literary work
Module 5: Author
Module 6: Mapping
Module 7: Referencing
Module 8: Versioning & publication
Module 9: Terminology
Module 10: Bibliography
Module 11: Scholarly literature

Example: The module on themes

Linking with Wikidata for ‘federated queries’

Result: The MiMoTextBase

SPARQL endpoint

Example: Epistolarity and Libertinage

The issue: Is there a libertine epistolary novel after 1782?

Is there libertinage in the epistolary novel after 1782?
- van Crugten-André (1997): “the epistolary genre is hardly represented in the libertine novel after Laclos [=1782]”.
- Benoît Melançon (2004): article on the “late libertine epistolary novel”, where he speaks about 8 libertine epistolary novels from after 1782.
Consensus on the fact that the “libertine epistolary novel” is a subgenre
But lack of clarity on its exact temporal extension

Clarification? Data from the MiMoTextBase

Query 1: Novels 1782–1800

Query 2: Epistolary novels 1782–1800

Query 3: Libertine novels 1782–1800

Query 4: Libertine epistolary novels 1782–1800

Query 5: Broader topic: libertinage, crime or passion

Query 6: Broader form: described as “lettre(s)”

Outcomes

Of course, it is all a question of definitions
However, our modeling allows us to ‘modulate’ definitions
- Epistolarity: semantic category vs. string ‘lettre(s)’
- Libertinage: libertinage, passion, crime (gallantry?)
As a result, 2 or 7 or 10 instances of post-1782 “libertine epistolary novel”
In addition, we found that 2 relevant novels discussed by Melançon were missing from MiMoText
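
What it means to 'modulate' definitions can be sketched as follows. The records below are invented toy data, not MiMoTextBase content, and the counts they produce bear no relation to the figures reported above; the field names and the `select` helper are likewise assumptions made for illustration. In the project itself, the equivalent filters are expressed as SPARQL queries.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Novel:
    title: str
    year: int
    form: str          # semantic category assigned in the data model
    themes: frozenset  # themes attributed to the novel


# Invented records for illustration only.
novels = [
    Novel("Lettres de A à B", 1785, "epistolary", frozenset({"libertinage"})),
    Novel("Lettres d'une voyageuse", 1790, "other", frozenset({"passion"})),
    Novel("Mémoires de C", 1788, "memoir", frozenset({"libertinage"})),
    Novel("Correspondance de D", 1795, "epistolary", frozenset({"crime"})),
    Novel("Lettres de E", 1770, "epistolary", frozenset({"libertinage"})),
    Novel("Histoire de F", 1799, "other", frozenset({"travel"})),
]

LETTRE = re.compile(r"\blettres?\b", re.IGNORECASE)


def by_category(novel):
    """Epistolarity as a semantic category."""
    return novel.form == "epistolary"


def by_title(novel):
    """Epistolarity as the string 'lettre(s)' in the title."""
    return LETTRE.search(novel.title) is not None


def select(records, themes, is_epistolary, start=1782, end=1800):
    """Titles in the period that satisfy the chosen form and theme definitions."""
    return [
        n.title
        for n in records
        if start <= n.year <= end and is_epistolary(n) and n.themes & themes
    ]


strict = select(novels, {"libertinage"}, by_category)
broad_theme = select(novels, {"libertinage", "passion", "crime"}, by_category)
broad_both = select(
    novels,
    {"libertinage", "passion", "crime"},
    lambda n: by_category(n) or by_title(n),
)
print(len(strict), len(broad_theme), len(broad_both))
```

Each definition is an explicit, swappable parameter, so the effect of widening the theme or the form criterion on the result set is directly visible.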

Conclusion

Challenges

Necessary complexity reduction (triple structure) inherent in modeling
Lack of consensus on relevant statement types in the discipline
Federation: central part of the vision of LOD; very useful, not easy to achieve
Sustainability: our Wikibase vs. Wikidata vs. RDF on Zenodo

Opportunities

Linking heterogeneous data from different types of sources
(e.g. semantic encoding bridges granularity differences)
Modeling, collecting and comparing contradictory statements
and nuanced concepts (we model perspectives, not facts)
Transparency in knowledge production (sources of statements are included)
Multilingual data for a multilingual world (identifier vs. labels)
Linking datasets and reusing existing resources (avoid redundancy)
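
The idea of modeling perspectives rather than facts can be illustrated with statements that carry their source. This is a minimal sketch with invented statements and hypothetical names (`Statement`, `perspectives`, `divergent`); in a Wikibase-style data model, the same role is played by references attached to each statement.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Statement:
    subject: str
    predicate: str
    value: str
    source: str  # where the claim comes from


# Invented statements; the three source types mirror the three pillars.
statements = [
    Statement("work:1", "genre", "epistolary novel", "bibliography"),
    Statement("work:1", "genre", "memoir novel", "handbook chapter"),
    Statement("work:1", "theme", "libertinage", "topic model"),
    Statement("work:1", "theme", "libertinage", "handbook chapter"),
]


def perspectives(items):
    """Group claims by (subject, predicate), recording which sources support each value."""
    grouped = defaultdict(lambda: defaultdict(set))
    for s in items:
        grouped[(s.subject, s.predicate)][s.value].add(s.source)
    return {key: dict(values) for key, values in grouped.items()}


def divergent(items):
    """Keys for which sources give more than one value (candidates for closer reading)."""
    return sorted(key for key, values in perspectives(items).items() if len(values) > 1)


views = perspectives(statements)
print(divergent(statements))
print(views[("work:1", "theme")])
```

Nothing is overwritten when sources disagree: both genre claims are kept side by side with their provenance, and agreement between independent sources (here, on the theme) becomes visible as well. Several values for one predicate need not be a contradiction, so the output is a list for review rather than an error report.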

Outlook: ML+LOD for other domains in the Humanities

Closing word: ML/LLMs and LOD/KGs

ML/LLMs and LOD/KGs may appear as polar opposites
- KGs: Careful and explicit semantic modeling
- LLMs: Statistical approach to language and knowledge generation
Rather, we should see their synergy and complementarity
- ML/LLMs for semantically-aware knowledge extraction
- KGs for knowledge representation and querying
- KGs provide cumulative context for any analysis result

Many thanks for your kind attention

Further resources

Tutorial: https://docs.mimotext.uni-trier.de
SPARQL endpoint: https://query.mimotext.uni-trier.de
MiMoTextBase: https://data.mimotext.uni-trier.de
MiMoText Ontology: https://github.com/MiMoText/ontology
Reference publication: ‘Smart Modelling for Literary History’, 2022.

References

Hinzmann, Maria, Matthias Bremm, Tinghui Duan, Julia Konstanciak, Julia Röttgermann, Moritz Steffes, Christof Schöch, and Joëlle Weis. 2024. “Patterns in modeling and querying a knowledge graph for literary history [preprint].” In Pattern Theory in Language and Communication, edited by Sabine Arndt-Lappe, Milena Belosevic, Peter Maurer, Claudine Moulin, Achim Rettinger, and Sören Stumpf. Trier: TCLC. https://doi.org/10.5281/zenodo.12080340.

Martin, Angus, Vivienne G. Mylne, and Richard Frautschi. 1977. Bibliographie du genre romanesque français, 1751-1800. Mansell.

Röttgermann, Julia. 2024. “The Collection of Eighteenth-Century French Novels 1751-1800.” Journal of Open Humanities Data 10 (1): 31. https://doi.org/10.5334/johd.201.

Schöch, Christof. 2013. “Big? Smart? Clean? Messy? Data in the Digital Humanities.” Journal of Digital Humanities 2 (3): 1–19. https://journalofdigitalhumanities.org/2-3/big-smart-clean-messy-data-in-the-humanities/.

Schöch, Christof, Maria Hinzmann, Julia Röttgermann, Katharina Dietz, and Anne Klee. 2022. “Smart Modelling for Literary History.” International Journal of Humanities and Arts Computing 16 (1): 78–93. https://doi.org/10.3366/ijhac.2022.0278.