Genre Analysis in Computational Literary Studies

The First Ten Years




Christof Schöch
(Trier Center for Digital Humanities, Trier University, Germany)

Rewriting Literary History With Algorithms
University of Illinois at Chicago

15 Nov 2024

Opening

Opening up paths…

Overview

  1. Opening
  2. CLiGS #1: French Drama
  3. CLiGS #2: Spanish Novels
  4. Mining and Modeling Text
  5. Zeta and Company
  6. Conclusion

CLiGS #1: French Drama

What was CLiGS all about?

  • An early-career research group: 5 people, 5 years
  • Topic: Computational analysis of literary subgenres
  • Domain: French and Spanish-language literature
  • Outcomes:
    • Several corpora of novels, and the basis for ELTeC
    • Plenty of papers, one edited volume, two books
    • Three people got their Ph.D. (!)

My own first steps: Subgenres of French drama

  • Domain: French drama 1630–1779
  • Corpus: 391 plays: comedies, tragedies, tragicomedies
  • Question: Where does tragicomedy stand, relative to tragedy and comedy?
  • Methods: Topic Modeling, clustering, classification

Key finding: tragicomedy is a kind of tragedy

  • This is the result of a principal component analysis (PCA), used here to show how the plays group together
  • The probabilities of 60 topics serve as the features
  • Tragicomedy (green) overlaps much more with tragedy (blue) than with comedy (red).
  • So thematically, tragicomedy is a kind of tragedy rather than a kind of comedy.

PCA for clustering with topic probabilities as features
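
A minimal sketch of the idea, not the code used in the study: plays are represented by their topic probabilities, projected onto two principal components, and the genre centroids are compared. The six plays, the four topics and all probability values are invented for illustration; the function names are hypothetical. The PCA is done by power iteration in plain Python to keep the example self-contained; in practice a library such as scikit-learn would be the obvious choice.

```python
import math

# Invented topic probabilities for six plays and four topics (illustration only).
plays = [
    ("comedy",      [0.60, 0.25, 0.10, 0.05]),
    ("comedy",      [0.55, 0.30, 0.05, 0.10]),
    ("tragedy",     [0.05, 0.10, 0.50, 0.35]),
    ("tragedy",     [0.10, 0.05, 0.45, 0.40]),
    ("tragicomedy", [0.15, 0.10, 0.40, 0.35]),
    ("tragicomedy", [0.10, 0.20, 0.40, 0.30]),
]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def pca_project(rows, n_components=2, iterations=200):
    """Project rows onto their leading principal components (power iteration)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    components = []
    for _component in range(n_components):
        # The start vector must not be constant: each row sums to 1, so the
        # constant direction carries no variance at all.
        v = normalize([float(j + 1) for j in range(d)])
        for _step in range(iterations):
            v = normalize([dot(row, v) for row in cov])
        eigenvalue = dot(v, [dot(row, v) for row in cov])
        components.append(v)
        # Deflation: remove the component just found before looking for the next.
        cov = [[cov[i][j] - eigenvalue * v[i] * v[j] for j in range(d)]
               for i in range(d)]
    return [[dot(c, comp) for comp in components] for c in centered]

def centroid(points):
    return [sum(p[k] for p in points) / len(points) for k in range(len(points[0]))]

genres = [genre for genre, _ in plays]
coords = pca_project([probs for _, probs in plays])
centroids = {
    g: centroid([c for c, label in zip(coords, genres) if label == g])
    for g in set(genres)
}
dist_to_tragedy = math.dist(centroids["tragicomedy"], centroids["tragedy"])
dist_to_comedy = math.dist(centroids["tragicomedy"], centroids["comedy"])
print(f"tragicomedy-tragedy: {dist_to_tragedy:.3f}")
print(f"tragicomedy-comedy:  {dist_to_comedy:.3f}")
```

With these invented values, the tragicomedy centroid lies much closer to tragedy than to comedy, which is the pattern the scatterplot shows for the real corpus.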

CLiGS #2: Spanish Novels

Work by José Calvo Tello

  • Domain: Spanish novels ca. 1880–1936
  • Corpus: CoNSSA, with 358 novels in XML-TEI
  • Approach:
    • very rich metadata
    • full text with annotations
    • classification and clustering

Key finding: a new subgenre!

  • Two clusters of texts that appear to be candidates for new subgenres
  • They don’t map to authorship, period, or generation
  • But they do share many features at different levels
  • They have never been recognized as subgenres
  • Proposed labels:
    • “literary fiction” (cluster 50, weak signal)
    • “bucolic novel” (cluster 217, strong signal)

Mining and Modeling Text

What is MiMoText about?

  • Domain: French Eighteenth-Century Novel (1751–1800)
  • Goal: “Wikidata for French Literary History”
  • Corpus
    • Bibliographie du genre romanesque français (1977)
    • 200 French novels in XML-TEI
    • 25 works of literary history
  • Approach
    • Mining: Information Extraction
    • Modeling: Linked Open Data

The issue? Epistolary novel and libertinage

  • Is there libertinage in the epistolary novel after 1782?
    • van Crugten-André (1997): le genre épistolaire “est assez peu représenté dans le roman du libertinage après Laclos”.
    • Benoît Melançon (2004): article on the “roman par lettres libertin […] tardif”, where he speaks of a “coïncidence structurale entre épistolarité et libertinage”.
  • Consensus on the fact that the “roman libertin par lettres” is a subgenre
  • But lack of clarity on its exact temporal extension

Clarification? Data from the MiMoTextBase

  • How many novels are there for 1782–1800? => 647 (Q1)
  • How many of them are epistolary novels? => 83 (Q2)
  • What are their topics? => see right (Q3)
  • How many of those have a topic related to libertinage? => 1 (Aline et Valcour) (Q4)
  • How many non-epistolary novels have these topics? => 13 (Q5)
  • How many epistolary novels from 1782 or earlier have these topics? => 2 (Q6)
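
The logic of the six questions can be sketched as a chain of filters. This is a sketch under stated assumptions: it does not query the MiMoTextBase, the six records are invented, and the `Novel` class, the form labels and the topic labels are hypothetical. It also assumes that Q5 refers to the same period as Q1.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass(frozen=True)
class Novel:
    title: str
    year: int
    form: str            # hypothetical labels, e.g. "epistolary", "third-person"
    topics: frozenset    # hypothetical topic labels

# Invented records (illustration only, not taken from the MiMoTextBase).
novels = [
    Novel("Novel A", 1779, "epistolary", frozenset({"libertinage", "love"})),
    Novel("Novel B", 1785, "epistolary", frozenset({"virtue", "family"})),
    Novel("Novel C", 1795, "epistolary", frozenset({"libertinage", "travel"})),
    Novel("Novel D", 1790, "third-person", frozenset({"libertinage"})),
    Novel("Novel E", 1798, "first-person", frozenset({"travel"})),
    Novel("Novel F", 1760, "third-person", frozenset({"virtue"})),
]

LIBERTINAGE_TOPICS = {"libertinage"}

def has_libertinage(novel):
    return bool(novel.topics & LIBERTINAGE_TOPICS)

# Q1: novels published 1782-1800
q1 = [n for n in novels if 1782 <= n.year <= 1800]
# Q2: of those, the epistolary novels
q2 = [n for n in q1 if n.form == "epistolary"]
# Q3: topics of these epistolary novels
q3 = Counter(topic for n in q2 for topic in n.topics)
# Q4: of the epistolary novels, those with a libertinage topic
q4 = [n for n in q2 if has_libertinage(n)]
# Q5: non-epistolary novels of the same period with a libertinage topic
q5 = [n for n in q1 if n.form != "epistolary" and has_libertinage(n)]
# Q6: epistolary novels from 1782 or earlier with a libertinage topic
q6 = [n for n in novels
      if n.year <= 1782 and n.form == "epistolary" and has_libertinage(n)]

for label, result in [("Q1", q1), ("Q2", q2), ("Q4", q4), ("Q5", q5), ("Q6", q6)]:
    print(label, len(result), [n.title for n in result])
print("Q3", dict(q3))
```

Each answer depends on how form and topic were recorded for each novel, which is why missing or debatable metadata directly changes the counts.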

Clarification or failure? What is going on?

How is this possible when Melançon cites 8 relevant novels in his article?

  • 1 is from 1805 (outside our scope)
  • 2 have a “mixed” form in MMTB (first person with letters)
  • 3 are epistolary, but not libertinage in MMTB (debatable?)
  • 2 are missing from our database (!!)

Zeta and Company

What is/was the Zeta project about?

  • Domain: French novel ca. 1950–1999
  • Goals: Model, implement, evaluate and use keyness measures
  • Corpora
    • Full corpus: currently ~1550 novels
    • Balanced corpus: 320 novels 1970–1999
    • Four groups: crime, sentimental, science fiction, literary fiction
  • Methods
    • Multiple keyness measures
    • Qualitative and quantitative evaluation
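
As an illustration of what a keyness measure computes, here is a minimal sketch of one common formulation of Zeta: the share of target segments that contain a word minus the share of comparison segments that contain it. The segments and words below are invented, and the function names are hypothetical; this is not the project's implementation, which covers multiple measures.

```python
from collections import Counter

def document_proportions(segments):
    """Share of segments in which each word occurs at least once."""
    counts = Counter()
    for segment in segments:
        counts.update(set(segment))  # count each word once per segment
    return {word: count / len(segments) for word, count in counts.items()}

def zeta(target_segments, comparison_segments):
    """Zeta score per word: target proportion minus comparison proportion."""
    dp_target = document_proportions(target_segments)
    dp_comparison = document_proportions(comparison_segments)
    vocabulary = set(dp_target) | set(dp_comparison)
    return {
        word: dp_target.get(word, 0.0) - dp_comparison.get(word, 0.0)
        for word in vocabulary
    }

# Invented toy segments (in practice: fixed-length chunks of each novel).
literary = [["hiver", "mer", "silence"], ["hiver", "ville"], ["été", "silence"]]
crime = [["meurtre", "police", "ville"], ["meurtre", "nuit"],
         ["police", "ville", "hiver"]]

scores = zeta(literary, crime)
for word, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{word:10s} {score:+.2f}")
```

Scores range from -1 to 1: words near 1 occur consistently in the target group and rarely in the comparison group, and words near -1 show the opposite pattern.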

Approach: Qualitative evaluation of keyness measures

  • Build qualitative subgenre profiles based on reading scholarly literature on the genres: setting, protagonists, themes, style, etc.
  • Generate lists of distinctive words for each subgenre
  • Annotate the lists by matching each word to a category in the subgenre profiles (if possible)
  • Compare the number of words that can be matched between subgenres and measures
  • As expected, “literary fiction” (blanche) is harder to grasp than the popular genres
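
The comparison step can be sketched as a simple match rate per subgenre and measure. The annotations below are invented for illustration, and the category names and the `match_rate` helper are hypothetical; `None` marks a word that annotators could not match to any category in the profile.

```python
# Invented annotations: keyword -> profile category, or None if unmatched.
annotations = {
    ("crime", "zeta"): {
        "meurtre": "themes",
        "police": "protagonists",
        "voiture": None,
        "nuit": "setting",
    },
    ("literary", "zeta"): {
        "hiver": None,
        "été": None,
        "silence": "style",
        "saison": None,
    },
}

def match_rate(annotated_words):
    """Share of distinctive words that could be matched to a profile category."""
    matched = sum(1 for category in annotated_words.values() if category is not None)
    return matched / len(annotated_words)

rates = {key: match_rate(words) for key, words in annotations.items()}
unmatched = {
    key: [word for word, category in words.items() if category is None]
    for key, words in annotations.items()
}

for (subgenre, measure), rate in rates.items():
    print(f"{subgenre:10s} {measure:6s} {rate:.2f}  unmatched: {unmatched[(subgenre, measure)]}")
```

A low match rate can mean that the measure produces less interpretable words, or that the profile is missing a category; the list of unmatched words helps to tell the two apart.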

An unexpected detail: “seasons” in literary fiction

  • See: “hiver, été, printemps, saison”
  • Zeta keywords not matched to the profile
  • They do appear to be interpretable!
  • However, annotators did not find a matching item in the categories
  • Candidate: theme “time and history”
  • That literary fiction has a central concern with seasons is a new detail

Conclusion

Did we perform any rewritings? Well…

  • Tragicomedy: a statistical clarification / confirmation
  • Bucolic fiction: an actual, overlooked subgenre
  • Epistolary libertinage: between clarification and failure
  • Literary fiction: overlooked aspect or lack of annotators’ ingenuity?

What can we learn from this?

  • Each of these findings, big or small, appears useful to some extent
  • And they are not possible without three things:
    • Large, well-designed corpora
    • Rich metadata about each novel
    • Methods whose reliability we can trust
  • Better corpora, metadata and methods => better results
  • And that’s what we’re going to work on in the next ten years!




Thank you for your kind attention!

References

Calvo Tello, José. 2021. The Novel in the Spanish Silver Age. A Digital Analysis of Genre using Machine Learning. Bielefeld: transcript.
Dalen-Oskam, Karina van. 2023. The Riddle of Literary Quality: A Computational Approach. Amsterdam: Amsterdam University Press.
Du, Keli, Julia Dudar, and Christof Schöch. 2022. “Evaluation of Measures of Distinctiveness. Classification of Literary Texts on the Basis of Distinctive Words.” Journal of Computational Literary Studies 1 (1). https://doi.org/10.48694/jcls.102.
Hinzmann, Maria, Matthias Bremm, Tinghui Duan, Anne Klee, Johanna Konstanciak, Julia Röttgermann, Christof Schöch, and Joëlle Weis. 2024. “Patterns in Modeling and Querying a Knowledge Graph for Literary History [Preprint].” October 25, 2024. https://doi.org/10.5281/zenodo.12080340.
Jockers, Matthew L. 2013. Macroanalysis: Digital Methods and Literary History. Urbana: University of Illinois Press.
Paige, Nicholas D. 2020. Technologies of the Novel: Quantitative Data and the Evolution of Literary Systems. New York: Cambridge University Press.
Schöch, Christof. 2017. “Topic Modeling Genre: An Exploration of French Classical and Enlightenment Drama.” Digital Humanities Quarterly 11 (2): §1–53. http://www.digitalhumanities.org/dhq/vol/11/2/000291/000291.html.
Schöch, Christof, Maria Hinzmann, Julia Röttgermann, Katharina Dietz, and Anne Klee. 2022. “Smart Modelling for Literary History.” International Journal of Humanities and Arts Computing 16 (1): 78–93. https://doi.org/10.3366/ijhac.2022.0278.
Underwood, Ted. 2019. Distant Horizons: Digital Evidence and Literary Change. Chicago: The University of Chicago Press.