Towards Computational Comparative
Literary Studies

Addressing the Challenges of Multilingualism

Christof Schöch
(Trier University, Germany)

KEASTWEST Conference 2024
Dongguk University, Seoul, South Korea

25 May 2024

Introduction

Multilingualism and me

Trier University

Trier Center for Digital Humanities

Alliance of Digital Humanities Organizations

Multilingualism is a topic that is really close to my heart,
and it is highly relevant to my current key roles:
- At Trier University, I use computational methods to investigate French literature, in a German context, mostly publishing in English; and Trier lies close to the borders with France, Luxembourg and Belgium, so in a multilingual setting.
- At the TCDH, we have a broad portfolio of projects on materials in German, English, French and other languages, whether in the original or in translation.
- In the context of ADHO as an increasingly global organization, multilingualism plays a role in almost everything that we do.
For this reason, I am very happy to be able to speak about multilingualism in Digital Humanities;
more specifically, about the challenges of multilingualism when attempting to practice what could be called Digital Comparative Literature, or perhaps more precisely: “Computational Comparative Literary Studies”.

Computational Comparative Literary Studies?

What could CCLS be?
- Literary Studies
- but: Computational (using digital data and methods)
- and: Comparative (transnational, transmedial)
Many challenges for convergence
- requires multiple areas of expertise
- significant challenges of multilingualism

Comparative Literature is well-established, of course, and I am not here to tell you what it is;
Computational Literary Studies are emerging (and have a tradition going back to the 1950s);
but the intersection of these two strands of research has so far been quite limited. So the question is: what is holding us back in this area, and what can we do to address these hindrances?
I think there are multiple factors at play:
- a mutual lack of knowledge about the other field, for sure;
- also the fact that for CCLS, one needs to acquire a very wide range of expertise;
- but one factor is certainly that Comparative Literature typically deals with materials in multiple languages that emerge from multiple cultural and historical contexts;
- and that Computational Literary Studies relies on data and tools that are very much biased towards single-language, English-focused applications.
That is why I believe that the challenges of multilingualism need to be at the heart of any attempt to bring Computation and Comparison together in Literary Studies.

Three attempts at CCLS

Corpus Building: The Diversity Paradox
Data Modeling: Linked Open Data
Text Analysis: Multilingual Stylometry

(1) Corpus Building:
The Diversity Paradox

The COST Action ‘Distant Reading for European Literary History’

The ‘European Literary Text Collection’ (ELTeC)

A closer look: corpus composition in ELTeC

English ELTeC corpus

Romanian ELTeC corpus

The Diversity Paradox

ELTeC design goals: enable meaningful cross-language investigations
- Balance with respect to key text characteristics
  (text length, author gender, prestige)
- Inclusivity with respect to language-based literary traditions
Consequence: the ‘diversity paradox’
- If the criteria are too loose, balance is compromised
  (many, but invalid, corpora)
- If the criteria are too strict, inclusivity is compromised
  (valid, but few, corpora)
- In both cases, meaningful cross-language investigations are impossible
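
The trade-off can be made concrete with a minimal sketch. Everything in it is an assumption for illustration: the corpus profiles, the target proportions and the tolerance values are made up, and the check is not the actual ELTeC eligibility procedure. It only shows that a strict tolerance admits few corpora, while a loose one admits corpora that are far from balanced.

```python
from dataclasses import dataclass


@dataclass
class CorpusProfile:
    """Hypothetical summary of one language corpus (made-up figures)."""
    language: str
    share_female_authors: float  # proportion of novels by female authors
    share_short_novels: float    # proportion of novels in the shortest length class


def meets_criteria(profile, target, tolerance):
    """True if both proportions lie within `tolerance` of the target values."""
    return (
        abs(profile.share_female_authors - target["female"]) <= tolerance
        and abs(profile.share_short_novels - target["short"]) <= tolerance
    )


def eligible_corpora(profiles, target, tolerance):
    """Return the languages whose corpora pass the balance check."""
    return [p.language for p in profiles if meets_criteria(p, target, tolerance)]


# Illustrative values only, not taken from ELTeC.
PROFILES = [
    CorpusProfile("lang_a", 0.50, 0.33),
    CorpusProfile("lang_b", 0.30, 0.40),
    CorpusProfile("lang_c", 0.10, 0.70),
]
TARGET = {"female": 0.50, "short": 0.33}

strict = eligible_corpora(PROFILES, TARGET, tolerance=0.05)  # balanced, but few corpora
loose = eligible_corpora(PROFILES, TARGET, tolerance=0.45)   # many corpora, poorly balanced

print("strict:", strict)
print("loose:", loose)
```

With the strict setting only one of the three hypothetical corpora remains; with the loose setting all three pass, although one of them has very few female authors and mostly short novels.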

(2) Data Modeling:
Linked Open Data

The project ‘Mining and Modeling Text’

Linked Open Data: Simple Statements

Linked Open Data: Multilingualism

MiMoText Base: Query for themes in novels

(3) Text Analysis:
Multilingual Stylometry

High-profile cases of stylometric authorship attribution

William Shakespeare:
Craig and Kinney (2009)

Molière and Corneille:
Cafiero and Camps (2019)

Elena Ferrante:
Tuzzi and Cortelazzo (2018)

Galbraith / Rowling:
Juola (2015)

Multilingual stylometry?

translation ↦ ↧ original	fra	eng	hun	ukr
fra	fra-fra	fra-eng	fra-hun	fra-ukr
eng	eng-fra	eng-eng	eng-hun	eng-ukr
hun	hun-fra	hun-eng	hun-hun	hun-ukr
ukr	ukr-fra	ukr-eng	ukr-hun	ukr-ukr

Using corpora from the European Literary Text Collection (ELTeC)
Translated entirely into the other languages using DeepL
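
The matrix above can be generated programmatically. The following is a minimal sketch, assuming the cell labels follow the pattern original-translation as in the table; the function and variable names are hypothetical and not taken from the project's code.

```python
LANGUAGES = ["fra", "eng", "hun", "ukr"]


def build_condition_matrix(languages):
    """Map each original language to its row of 'original-translation' labels."""
    return {
        original: [f"{original}-{translation}" for translation in languages]
        for original in languages
    }


def translated_conditions(matrix):
    """Return the off-diagonal cells, i.e. those where a translation is involved."""
    return [
        label
        for row in matrix.values()
        for label in row
        if label.split("-")[0] != label.split("-")[1]
    ]


matrix = build_condition_matrix(LANGUAGES)
translated = translated_conditions(matrix)

# Print the matrix as tab-separated rows, one per original language.
for original, row in matrix.items():
    print(original, *row, sep="\t")
print(len(translated), "conditions involve a translation")
```

For four languages this yields sixteen cells: four on the diagonal, where the corpus stays in its original language, and twelve where it is translated into another language.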

Some first results

More information: Dudar et al. (in progress).

Full interactive showcase

Conclusion

Take-home message

Good, multilingual corpora are rare and hard to build
Linked Open Data is a huge opportunity for multilingual data modeling
Text analysis is still primarily multi-lingual rather than cross-lingual
(but multilingual LLMs are in the process of changing that)

Lessons learned

Multilingual research is multicultural research
‘Computational Comparative Literary Studies’ requires multiple competencies
Nobody can learn everything: we need interdisciplinary collaboration
Let’s learn from each other: Computational and Comparative Literary Studies

Thank you for your kind attention!

References

Cafiero, Florian, and Jean-Baptiste Camps. 2019. “Why Molière Most Likely Did Write His Plays.” Science Advances 5 (11): eaax5489. https://doi.org/10.1126/sciadv.aax5489.

Craig, Hugh, and Arthur F. Kinney. 2009. Shakespeare, Computers, and the Mystery of Authorship. Cambridge University Press.

Dudar, Julia, Evgeniia Fileva, Artjoms Šeļa, and Christof Schöch. in progress. “Multilingual Stylometry: The Influence of Corpus Composition and Language on the Performance of Authorship Attribution Using Corpora from the European Literary Text Collection (ELTeC).” Tbc, in progress.

Juola, Patrick. 2015. “The Rowling Case: A Proposed Standard Protocol for Authorship Attribution.” Digital Scholarship in the Humanities 30 (suppl. 1): 100–113. https://doi.org/10.1093/llc/fqv040.

Schöch, Christof, Maria Hinzmann, Julia Röttgermann, Katharina Dietz, and Anne Klee. 2022. “Smart Modelling for Literary History.” International Journal of Humanities and Arts Computing 16 (1): 78–93. https://doi.org/10.3366/ijhac.2022.0278.

Schöch, Christof, Roxana Patras, Tomaž Erjavec, and Diana Santos. 2021. “Creating the European Literary Text Collection (ELTeC): Challenges and Perspectives.” Modern Languages Open, no. 1: 25. https://doi.org/10.3828/mlo.v0i0.364.

Tuzzi, Arjuna, and Michele A. Cortelazzo, eds. 2018. Drawing Elena Ferrante’s Profile: Workshop Proceedings. Padova: Padova UP.