Towards Computational Comparative Literary Studies

Addressing the Challenges of Multilingualism

Christof Schöch
(Trier University, Germany)

KEASTWEST Conference 2024
Dongguk University, Seoul, South Korea

25 May 2024


Multilingualism and me

Computational Comparative Literary Studies?

  • Literary Studies,
  • but: Computational
  • and: Comparative


  • Challenges of Multilingualism
    • Corpus Building: The Diversity Paradox
    • Data Modeling: Linked Open Data
    • Text Analysis: Multilingual Stylometry
  • Conclusion

(1) Corpus Building:
The Diversity Paradox

The COST Action ‘Distant Reading for European Literary History’

The ‘European Literary Text Collection’ (ELTeC)

A closer look: corpus composition in ELTeC

(Figures: corpus composition of ELTeC English and of ELTeC Romanian)

The Diversity Paradox

  • ELTeC design goals: enable meaningful cross-language investigations
    • Balance with respect to key text characteristics (text length, author gender, prestige)
    • Inclusivity with respect to language-based literary traditions
  • Consequence: the ‘diversity paradox’
    • If the criteria are too loose, balance is compromised (many, but invalid, corpora)
    • If the criteria are too strict, inclusivity is compromised (valid, but few, corpora)
    • In both cases, meaningful cross-language investigations are impossible
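
The trade-off above can be made concrete with a small sketch. It assumes one balance criterion (the share of texts by female authors) and a tolerance around a target value. The language labels, shares, target, and tolerances are all invented for illustration and are not ELTeC statistics.

```python
# Hypothetical candidate corpora: share of texts by female authors per language.
# All values are invented placeholders, not ELTeC figures.
CANDIDATES = {
    "lang_A": 0.48,
    "lang_B": 0.35,
    "lang_C": 0.20,
    "lang_D": 0.08,
}
TARGET_SHARE = 0.50  # assumed balance target


def admitted(candidates, target, tolerance):
    """Return the languages whose share lies within `tolerance` of `target`."""
    return sorted(
        lang for lang, share in candidates.items()
        if abs(share - target) <= tolerance
    )


def max_imbalance(candidates, langs):
    """Largest gap in the criterion between any two admitted corpora."""
    shares = [candidates[lang] for lang in langs]
    return max(shares) - min(shares) if shares else 0.0


# Strict criterion: corpora are comparable, but few languages qualify.
strict = admitted(CANDIDATES, TARGET_SHARE, tolerance=0.05)
# Loose criterion: many languages qualify, but they differ widely.
loose = admitted(CANDIDATES, TARGET_SHARE, tolerance=0.45)

print(len(strict), max_imbalance(CANDIDATES, strict))
print(len(loose), max_imbalance(CANDIDATES, loose))
```

With these toy values, the strict setting admits a single corpus, which rules out comparison, while the loose setting admits all four but with a wide gap between them.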

(2) Data Modeling:
Linked Open Data

The project ‘Mining and Modeling Text’

Linked Open Data: Simple Statements (S-P-O)
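
A subject-predicate-object statement can be sketched as a three-part record. The example below is a minimal in-memory illustration, not the MiMoText data model: the identifiers (`novel_1`, `has_author`, `author_X`, and so on) are hypothetical placeholders, and a real Linked Open Data resource would use URIs for them.

```python
from typing import NamedTuple


class Statement(NamedTuple):
    """One simple statement: subject, predicate, object."""
    subject: str
    predicate: str
    obj: str


# Placeholder statements; all identifiers are hypothetical.
STATEMENTS = [
    Statement("novel_1", "has_author", "author_X"),
    Statement("novel_1", "about", "theme_travel"),
    Statement("novel_2", "has_author", "author_X"),
]


def match(statements, subject=None, predicate=None, obj=None):
    """Return statements matching the pattern; None acts as a wildcard."""
    return [
        s for s in statements
        if (subject is None or s.subject == subject)
        and (predicate is None or s.predicate == predicate)
        and (obj is None or s.obj == obj)
    ]


# Which subjects are linked to author_X via has_author?
by_author_x = match(STATEMENTS, predicate="has_author", obj="author_X")
print([s.subject for s in by_author_x])
```

Because every statement has the same shape, statements from different sources and in different languages can be combined in one list and queried with the same pattern.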

MiMoText Base: Query for themes in novels
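
A query for themes in novels amounts to selecting the statements that link a novel to a theme and then aggregating them. In practice such a query would presumably be written in a graph query language such as SPARQL. The sketch below only imitates the idea in plain Python over toy triples, and its identifiers and values are hypothetical placeholders rather than MiMoText data.

```python
from collections import Counter

# Toy triples (subject, predicate, object); all values are invented placeholders.
TRIPLES = [
    ("novel_1", "about", "theme_love"),
    ("novel_1", "about", "theme_travel"),
    ("novel_2", "about", "theme_love"),
    ("novel_2", "publication_year", "1760"),
    ("novel_3", "about", "theme_travel"),
    ("novel_3", "about", "theme_love"),
]


def theme_counts(triples, predicate="about"):
    """Count how many distinct novels are linked to each theme."""
    # A set removes duplicate (novel, theme) pairs before counting.
    pairs = {(s, o) for s, p, o in triples if p == predicate}
    return Counter(theme for _, theme in pairs)


counts = theme_counts(TRIPLES)
for theme, n_novels in counts.most_common():
    print(theme, n_novels)
```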

(3) Text Analysis:
Multilingual Stylometry

High-profile stylometry cases

William Shakespeare:
Craig and Kinney (2009)

Molière and Corneille:
Cafiero and Camps (2019)

Elena Ferrante:
Tuzzi and Cortelazzo (2018)

Galbraith / Rowling:
Juola (2015)

Multilingual stylometry?
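
For orientation, authorship attribution is often based on a distance between word-frequency profiles. The sketch below implements one widely used measure, Burrows's Delta, in a minimal form. The slides do not state which measure the cited studies or the work in progress use, so this is an illustration of the general technique only. The texts are toy strings, and the whitespace tokenization is an assumption that does not hold equally well for all languages, which is one reason multilingual stylometry is difficult.

```python
from collections import Counter
from statistics import mean, pstdev


def relative_freqs(text, vocab):
    """Relative frequency of each vocabulary word in a whitespace-tokenized text."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    return [counts[word] / len(tokens) for word in vocab]


def burrows_delta(corpus, unknown, n_features=5):
    """Return {label: Delta distance} between each known text and `unknown`.

    corpus: dict mapping a label (e.g. an author) to a text.
    """
    # 1. Features: the most frequent words across the known texts.
    all_tokens = Counter()
    for text in corpus.values():
        all_tokens.update(text.lower().split())
    vocab = [word for word, _ in all_tokens.most_common(n_features)]

    # 2. Relative frequencies per known text.
    profiles = {label: relative_freqs(text, vocab) for label, text in corpus.items()}

    # 3. z-score parameters per feature, estimated on the known texts.
    columns = list(zip(*profiles.values()))
    means = [mean(col) for col in columns]
    stds = [pstdev(col) or 1.0 for col in columns]  # avoid division by zero

    def z(profile):
        return [(f - m) / s for f, m, s in zip(profile, means, stds)]

    # 4. Delta: mean absolute difference of z-scores.
    z_unknown = z(relative_freqs(unknown, vocab))
    return {
        label: mean(abs(a - b) for a, b in zip(z(profile), z_unknown))
        for label, profile in profiles.items()
    }


# Toy usage: the disputed text shares its word profile with candidate "A".
known = {"A": "the the the the and of", "B": "of of of of and the"}
distances = burrows_delta(known, unknown="the the the the and of")
print(min(distances, key=distances.get))
```

The smallest distance points to the most similar candidate. Since the features are word forms of a single language, the measure compares texts within one language, not across languages.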

Some early results

More information: Dudar et al. (in progress).


Summary of findings

  • Good, multilingual corpora are rare (and hard to build)
  • Linked Open Data is a huge opportunity for multilingual data modeling
  • Text analysis is still primarily multi-lingual rather than cross-lingual

Lessons learned

  • Multilingual research is multicultural research
  • Computational Comparative Literary Studies requires multiple competencies
  • Nobody can learn everything: we need interdisciplinary collaboration
  • Let’s learn from each other: CLS from CL, and CL from CLS

Thank you for your kind attention!


Cafiero, Florian, and Jean-Baptiste Camps. 2019. “Why Molière Most Likely Did Write His Plays.” Science Advances 5 (11): eaax5489.
Craig, Hugh, and Arthur F. Kinney. 2009. Shakespeare, Computers, and the Mystery of Authorship. Cambridge University Press.
Dudar, Julia, Evgeniia Fileva, Artjoms Šeļa, and Christof Schöch. In progress. “Multilingual Stylometry: The Influence of Corpus Composition and Language on the Performance of Authorship Attribution Using Corpora from the European Literary Text Collection (ELTeC).” Venue to be confirmed.
Juola, Patrick. 2015. “The Rowling Case: A Proposed Standard Protocol for Authorship Attribution.” Digital Scholarship in the Humanities 30 (suppl. 1): 100–113.
Schöch, Christof, Maria Hinzmann, Julia Röttgermann, Katharina Dietz, and Anne Klee. 2022. “Smart Modelling for Literary History.” International Journal of Humanities and Arts Computing 16 (1): 78–93.
Schöch, Christof, Roxana Patras, Tomaž Erjavec, and Diana Santos. 2021. “Creating the European Literary Text Collection (ELTeC): Challenges and Perspectives.” Modern Languages Open, no. 1: 25.
Tuzzi, Arjuna, and Michele A. Cortelazzo, eds. 2018. Drawing Elena Ferrante’s Profile: Workshop Proceedings. Padova: Padova UP.