Bigger Smarter Data

Extracting, Modeling and Linking Data for Literary History




Christof Schöch
(Trier University, Germany)

Korea University
Seoul, South Korea

23 May 2024

Introduction

Thanks



Seung-eun Lee of Korea University / Department of Korean Language and Literature / Humanities Utmost Sharing System, and Byungjun Kim, KAIST, on behalf of KADH (Korean Association for Digital Humanities).



The Ministry for Research, Education and Culture of Rhineland-Palatinate, Germany, for funding this research (Mining and Modeling Text, 2019-2023)



Thanks to all the project contributors: Maria Hinzmann, Matthias Bremm, Tinghui Duan, Anne Klee, Johanna Konstanciak, Julia Röttgermann and many others.

Overview

  • 1 – Introduction
  • 2 – Bigger Smarter Data and Linked Open Data
  • 3 – Mining: Information Extraction
  • 4 – Modeling: Linked Open Data
  • 5 – Result: Queryable Database
  • 6 – Conclusion
  • Hello and welcome everyone.
  • It is great to be here!
  • My topic today is a vision of Digital Humanities that tries to be both bigger and smarter, without choosing between the two.
  • To do so, I will show what “Linked Open Data” is,
  • and how LOD can be one way of achieving “bigger smarter data”.
  • Specifically, I will draw on the way we have used LOD in a project that creates a new way of engaging with literary history.
  • I hope it will be interesting to you whether you are a literary historian, a scholar of history, or just curious about Digital Humanities.
  • First, I will explain what I mean by “bigger smarter data” and by linked open data.
  • Then I will show how we used text mining to collect information on French literary history.
  • And how we used LOD to model this data and link it together, as well as link it to other resources.
  • At the end, I will show what kinds of questions can be addressed using the resulting data.

Bigger Smarter Data:
Linked Open Data

Three Modes of Data (and Digital Humanities)

  • Qualitative DH:
    • Datasets are typically small, curated, heavily annotated, flawless, specialized (‘smart data’)
    • Prototype: digital scholarly editions, e.g. Goethe, Faust Edition
  • Quantitative DH:
    • Datasets are typically large, scraped, unannotated, with errors and biases, generic (‘big data’)
    • Prototype: Large Language Models trained on lots of text, e.g. ChatGPT

Reference: (Schöch 2013).

Third Way: Bigger Smarter Data

  • In my opinion, there is a third way, which I described quite some time ago as “bigger smarter data”.
  • It’s not just a compromise: somewhat bigger, but still reasonably clean data;
  • It’s an attempt to bring scale (quantitative) to the details (qualitative),
  • Primarily using two technologies:
    • text mining or machine learning (for information extraction)
    • and linked open data (for data modeling and querying)

Reference: (Schöch 2013).

Background: What is Machine Learning?

  • Fundamentally, ML involves detecting relations between features and labels
    • Features that we can observe in data
    • Labels, classes, or values that are relevant to our research
  • We use this approach primarily for information extraction
    • We start from a text collection
    • We may annotate part of the data
    • Then train a new model, or use an existing model
    • Evaluate the performance of the model on the annotated data
    • And then derive labels, classes, or values from the unannotated texts (a minimal sketch follows below)
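
To make this workflow concrete, here is a minimal sketch in Python using scikit-learn. The toy texts and theme labels are invented for illustration; this is not project data or the project’s actual pipeline.

    # Minimal sketch of the annotate-train-evaluate-apply workflow above.
    # Texts and labels are invented toy data, not MiMoText project data.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import classification_report
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline

    # The annotated part of the collection: texts paired with labels.
    texts = ["a tale of love and marriage", "travels through the provinces",
             "a libertine's love letters", "a sentimental journey to Italy"]
    labels = ["love", "travel", "love", "travel"]

    # Hold out part of the annotated data to evaluate the model.
    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, test_size=0.5, random_state=0, stratify=labels)

    # Observable features (TF-IDF word weights) are related to labels.
    model = make_pipeline(TfidfVectorizer(), LogisticRegression())
    model.fit(X_train, y_train)
    print(classification_report(y_test, model.predict(X_test)))

    # Finally, derive labels for the unannotated rest of the collection.
    print(model.predict(["a voyage through the south of France"]))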

Background: What is Linked Open Data?

  • How does “Linked Open Data” actually work?
  • There are subjects, predicates and objects
  • Subjects are entities such as authors or works
  • Objects are other entities, but also topics, locations, etc.
  • Predicates link subject and object, like a verb
  • This results in very simple statements
  • The potential of LOD only unfolds with a large number of simple statements
  • This seemingly simple structure is very powerful, because it is so generic
  • To illustrate this power, one could liken it to chess: simple rules with lots of possibilities
  • All the information that we extract from our texts using text and data mining is expressed in this way.
  • A knowledge graph is created, which becomes part of the so-called “semantic web” (a minimal sketch follows below).

Source: Wikidata and datajournalism.com/read/longreads/the-promise-of-wikidata.
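
To illustrate how such statements look in practice, here is a minimal Python sketch using rdflib; the namespace and all identifiers are hypothetical placeholders, not actual MiMoText or Wikidata IDs.

    # Minimal sketch of subject-predicate-object statements with rdflib.
    # The namespace and all identifiers are hypothetical placeholders.
    from rdflib import Graph, Literal, Namespace

    EX = Namespace("https://example.org/")  # hypothetical namespace
    g = Graph()

    # Three simple statements: [subject] [predicate] [object].
    g.add((EX.Voltaire, EX.author_of, EX.Candide))
    g.add((EX.Candide, EX.about, Literal("optimism")))
    g.add((EX.Candide, EX.narrative_location, EX.Lisbon))

    # Many such simple statements add up to a queryable knowledge graph.
    print(g.serialize(format="turtle"))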

Linked Open Data: Multilingualism

  • In LOD, all information is modeled semantically
  • That means each “unit of meaning” has an identifier
  • And it has a label and a description
  • More precisely: many labels and descriptions, in many languages
  • So LOD and Wikidata are inherently multilingual
  • That’s a big asset in a global scholarly world (a small query sketch follows below)
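
As a small illustration, the labels of an entity can be retrieved in several languages from Wikidata’s public SPARQL endpoint; in this sketch, Q9068 is assumed to be the Wikidata item for Voltaire.

    # Retrieve an entity's labels in several languages from Wikidata.
    # Q9068 is assumed to be the Wikidata item for Voltaire.
    from SPARQLWrapper import SPARQLWrapper, JSON

    sparql = SPARQLWrapper("https://query.wikidata.org/sparql",
                           agent="lod-example/0.1 (educational sketch)")
    sparql.setQuery("""
    SELECT ?lang ?label WHERE {
      wd:Q9068 rdfs:label ?label .
      BIND(LANG(?label) AS ?lang)
      FILTER(?lang IN ("ko", "en", "fr", "de"))
    }
    """)
    sparql.setReturnFormat(JSON)
    for row in sparql.query().convert()["results"]["bindings"]:
        print(row["lang"]["value"], row["label"]["value"])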

Background: What is Literary History?

  • Goals of literary history
    • Collecting and documenting knowledge of literary history
    • Providing explanations for the development of literature
  • Organizational principles
    • Nations, periods, movements/currents, genres
    • Authors and works: themes, forms, relationships
    • Similarities and differences, continuities and change
  • Functions
    • Describe and document literary history
    • Explain literary development

Literary History in Linked Open Data

  • Building blocks
    • Subjects, including persons (author, etc.) and works (primary text, scholarly literature, etc.)
    • Objects, including works, but also themes, locations, protagonists, literary genre, etc.
    • Predicates, as required, including: author_of, about, sameAs, etc.
    • Qualifications, e.g.: Source (with type, date, URL)
  • Some exemplary statement types
    • Bibliographic: [person] author_of [work]
    • Content-based: [work] about [theme]
    • Formal: [work] narrative_form [type]
    • and many more (a query sketch follows below).
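
As a hedged sketch, two of these statement types could be combined in a single query pattern as below; the prop: prefix and property names are placeholders modeled on the list above, not the actual MiMoText ontology identifiers.

    # Hypothetical SPARQL pattern combining two statement types;
    # prop: is a placeholder prefix, not the real MiMoText ontology.
    query = """
    PREFIX prop: <https://example.org/prop/>
    SELECT ?person ?work ?theme WHERE {
      ?person prop:author_of ?work .   # bibliographic statement
      ?work   prop:about     ?theme .  # content-based statement
    }
    """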

Wikidata for Literary History

  • Idea: Create a “Wikidata for the history of literature”
    • Literary history information system
    • LOD-based, with explorative interface and SPARQL endpoint
    • An approach based on the “atomization” of historical knowledge
    • Linking with other knowledge systems (taxonomies, authority data, knowledge bases)
    • Key values: human and machine readable, open, collaborative, multilingual
  • Compared to Wikidata
    • Focused on one domain (French novel, 1750-1800)
    • Better coverage / higher density of information for this domain
    • Development of a systematic ontology
    • Much smaller: 300k vs. 1.5 billion statements
  • Digital: algorithmic methods; available online; readable by humans and machines
  • Multilingual: through semantic modeling
  • Collaborative: necessarily interdisciplinary; networked with other resources
  • Open: freely available software; all data and resources freely available; openness already during the research process

The project ‘Mining and Modeling Text’

  • Mining and Modeling Text brought literary history, machine learning and linked open data together.
  • Mining and Modeling was a project funded by the regional government from 2019 to 2023
  • This was an interdisciplinary project involving computer science, literary studies, legal studies, computational linguistics, and digital humanities
  • The key goal of this project was to use computational methods to build a knowledge base about the French novel of the Enlightenment
  • But also, more generally, to experiment with the idea of “Linked Open Data for Literary History”
  • The MiMoText project has pursued this goal of collecting literary-historical knowledge, making it machine-readable and networking it using LOD
  • We use three different sources of information for this purpose
    • Bibliographic metadata; in particular the “Bibliographie du genre romanesque français, 1751-1800” by Martin, Mylne and Frautschi
    • Knowledge from literary history, especially overview chapters from handbooks
    • And characteristics of primary texts, in our case based on a corpus of 200 French novels from the period 1751-1800
  • Work objectives
    • Automatically extract relevant information from these sources
    • Model this information as LOD and link it together as much as possible
    • Analyze this information to learn more about the literature of the period, but also about literary historiography.
    • This also means transforming heterogeneous sources into a homogeneous data set
    • It also means explicitly modeling everything, and this is where controlled vocabularies, taxonomies, ontologies and their implementation and use come into play.

More information: (Schöch et al. 2022) and mimotext.uni-trier.de

Mining: Information Extraction

Pillar 1: Bibliographie du genre romanesque français

Reference: (Martin, Mylne, and Frautschi 1977).

Pillar 2: primary literature (novels)

  • Corpus of 200 French novels (1750-1800)
  • Encoding: in XML-TEI, with metadata, according to ELTeC schema
  • Methods of analysis: topic modeling, NER, stylometry, etc. (a topic modeling sketch follows below)

Collection of Eighteenth-Century French novels (1750-1800), ed. Julia Röttgermann, github.com/MiMoText/roman18. See also (Röttgermann 2024).
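
As an illustration of one of these methods, here is a minimal topic modeling sketch with gensim; the toy documents merely stand in for the 200-novel corpus, and this is not the project’s actual pipeline.

    # Minimal topic modeling sketch with gensim (toy data, not the corpus).
    from gensim import corpora, models

    docs = [["amour", "mariage", "coeur"],
            ["voyage", "paris", "province"],
            ["amour", "lettres", "coeur"]]

    dictionary = corpora.Dictionary(docs)        # map words to integer IDs
    bow = [dictionary.doc2bow(d) for d in docs]  # bag of words per document

    lda = models.LdaModel(bow, num_topics=2, id2word=dictionary,
                          random_state=0)
    for topic_id, words in lda.print_topics():
        print(topic_id, words)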

Pillar 3: Scholarly Literature

  • Annotation guidelines => Manual annotations (using INCEpTION)
  • Linking of INCEpTION with MiMoTextBase and Wikidata => disambiguation
  • Creation of statements about authors and works (genres, themes, etc.)
  • Machine Learning based on the annotated training data

Modeling: Data Modeling

Modular Data Model

  • Module 1: Theme
  • Module 2: Space
  • Module 3: Narrative form
  • Module 4: Literary work
  • Module 5: Author
  • Module 6: Mapping
  • Module 7: Referencing
  • Module 8: Versioning & publication
  • Module 9: Terminology
  • Module 10: Bibliography
  • Module 11: Scholarly literature

Repository: github.com/MiMoText/ontology.

Example: The module on themes

Example: The module on narrative location

Meta-Statements

Linking with Wikidata for ‘federated queries’

Result: Queryable Database

The MiMoTextBase

  • Because all statements are available in our Wikibase instance, we can now also search and query this knowledge network
  • I will first show a few examples from the explorative view, similar to a wiki.
  • Author pages: not a lot of information, but an “exact_match” property (!) linking to Wikidata
  • Work pages: much more information, both as short texts and formalized and standardized according to the data model
  • This information comes from various sources, so there are also meta-statements
  • You can also search here, but not in a very precise way => SPARQL

See: data.mimotext.uni-trier.de

SPARQL endpoint

  • The Wikibase Query Service allows very flexible queries
  • There are also a lot of visualization options that can be used directly without further development work
  • Our Wikibase instance now contains well over 300,000 statements, so the possibilities are quite broad.
  • However, you need to familiarize yourself with the query language.
  • That’s why we also offer a tutorial and lots of sample queries (a minimal Python sketch follows below).

SPARQL = SPARQL Protocol and RDF Query Language
See: query.mimotext.uni-trier.de
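
The endpoint can also be queried programmatically; here is a minimal Python sketch. The endpoint URL below is an assumption derived from the query service address; the tutorial documents the actual access details.

    # Query the MiMoTextBase SPARQL endpoint from Python.
    # The endpoint URL is an assumption; see docs.mimotext.uni-trier.de.
    from SPARQLWrapper import SPARQLWrapper, JSON

    sparql = SPARQLWrapper("https://query.mimotext.uni-trier.de/sparql",
                           agent="lod-example/0.1 (educational sketch)")
    sparql.setQuery("SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10")
    sparql.setReturnFormat(JSON)
    for row in sparql.query().convert()["results"]["bindings"]:
        print(row["s"]["value"], row["p"]["value"], row["o"]["value"])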

MiMoTextBase: Query for themes in novels

  • Every piece of information we capture is identified using authority data
  • When using Wikidata, this means labels and definitions are available in multiple languages
  • For our own taxonomies, we have focused on French, German and English
  • When performing queries, we can mobilise this multilingual potential
  • The above is a query for the dominant themes in the 200 novels of our corpus;
  • We can easily switch between languages for labeling the graph (see the sketch below)

Live SPARQL query: query.mimotext.uni-trier.de
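
A hedged sketch of such a query follows; the theme property is a placeholder, but the wikibase:label service shown is the standard Wikibase mechanism for switching label languages.

    # Count novels per theme, labeled in the first available language from
    # the list; mmprop:about is a placeholder, not the real property ID.
    query = """
    PREFIX mmprop: <https://example.org/prop/direct/>
    SELECT ?theme ?themeLabel (COUNT(?novel) AS ?n) WHERE {
      ?novel mmprop:about ?theme .
      SERVICE wikibase:label { bd:serviceParam wikibase:language "ko,en,fr,de". }
    }
    GROUP BY ?theme ?themeLabel
    ORDER BY DESC(?n)
    """
    # Reordering "ko,en,fr,de" switches the language used for graph labels.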

Sample queries: simple queries

  • List of novels with information from BGRF
  • Number of works per author (first 25)
  • Themes of novels, in French and in English

Sample queries: visualizations

  • Number of novels per year
  • Narrative forms over time (decades)
  • Book history: print formats over time (5 years)

Sample queries: networked and federated

  • Link with catalogue data from the French National Library (using the BnF ID)
  • Narrative locations of novels (map)
  • Authors by birth year, with portraits
  • Alternative author names from the Wikidata infobox
  • Network of influences between authors (using ‘influenced by’)
  • Querying MiMoText from Wikidata (it works both ways)
  • Novels and basic information, from Wikidata (a federated-query sketch follows below)
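
A hedged sketch of such a federated query follows; the exact-match property ID is a placeholder, while wdt:P569 (date of birth) and the SERVICE clause are real Wikidata features.

    # Federated query: follow an exact-match link from MiMoTextBase to
    # Wikidata and fetch each author's date of birth there.
    # mmprop:exact_match is a placeholder ID; wdt:P569 is real.
    query = """
    PREFIX mmprop: <https://example.org/prop/direct/>
    PREFIX wdt: <http://www.wikidata.org/prop/direct/>
    SELECT ?author ?wikidataItem ?birth WHERE {
      ?author mmprop:exact_match ?wikidataItem .
      SERVICE <https://query.wikidata.org/sparql> {
        ?wikidataItem wdt:P569 ?birth .
      }
    }
    """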

Sample queries: comparative queries

  • Themes from topic modeling compared to themes in BGRF
  • Themes from BGRF vs. Topic Modeling (in one query)

Conclusion

Opportunities & challenges

  • Opportunities
    • Linking heterogeneous data from different types of sources
    • Modeling, collecting and comparing contradictory statements
    • Transparency in knowledge production (sources)
  • Challenges
    • Lack of consensus on relevant statement types in the discipline
    • Complexity reduction (triple structure)
    • Interoperability (tension ‘Wikiverse’ vs. OWL standard)

Lessons Learned

  • Federated queries
    • Central element of the LOD vision
    • => Making it happen is not trivial (data model, infrastructure)
  • Modeling meta statements
    • Very important: perspectives / statements, not facts
    • => Very different approaches in different technical contexts
  • Exchange across communities
    • Literary Studies vs. Digital Humanities vs. Wikiverse
    • => is essential but needs more development
  • There is still so much to do!
    • => We are continuing this effort in a new project called
      ‘Linked Open Data in the Humanities’ (LODinG)


Many thanks for your kind attention

Further resources

  • Tutorial: https://docs.mimotext.uni-trier.de
  • SPARQL endpoint: https://query.mimotext.uni-trier.de
  • MiMoTextBase: https://data.mimotext.uni-trier.de
  • MiMoText Ontology: https://github.com/MiMoText/ontology
  • Reference publication: ‘Smart Modelling for Literary History’ (Schöch et al. 2022)
  • Overview of visuals: mimotext.github.io/MiMoTextBase_Tutorial/visualizations.html

References


Martin, Angus, Vivienne G. Mylne, and Richard Frautschi. 1977. Bibliographie du genre romanesque français, 1751-1800. London: Mansell.
Röttgermann, Julia. 2024. “The Collection of Eighteenth-Century French Novels 1751-1800.” Journal of Open Humanities Data 10 (1): 31. https://doi.org/10.5334/johd.201.
Schöch, Christof. 2013. “Big? Smart? Clean? Messy? Data in the Digital Humanities.” Journal of Digital Humanities 2 (3): 1–19. https://journalofdigitalhumanities.org/2-3/big-smart-clean-messy-data-in-the-humanities/.
Schöch, Christof, Maria Hinzmann, Julia Röttgermann, Katharina Dietz, and Anne Klee. 2022. “Smart Modelling for Literary History.” International Journal of Humanities and Arts Computing 16 (1): 78–93. https://doi.org/10.3366/ijhac.2022.0278.

https://dhtrier.quarto.pub/ku – CC BY
