Multilingual Stylometry

The influence of language, translation and corpus composition on the performance of authorship attribution using corpora from the European Literary Text Collection (ELTeC)


Evgeniia Fileva, Julia Havrylash, Christof Schöch, Artjoms Šeļa

Trier University, Germany | Czech Academy of Sciences

Session on ‘Digital Comparative Literature’
2025 ICLA Congress, Seoul, South Korea

30 Jul 2025

Overview

  • 1 – Context: CLS INFRA
  • 2 – Topic: Authorship Attribution
  • 3 – Earlier Work
  • 4 – Dataset: ELTeC
  • 5 – Method: Classification Task
  • 6 – Results: Language and translation
  • 7 – Preliminary results: Corpus Composition
  • 8 – Next step: Corpus for Subsampling
  • 9 – Conclusion

Context:
CLS INFRA

Computational Literary Studies Infrastructure

  • Horizon 2020 project funded by the European Commission (2020–2025)
  • Focus on datasets, tools, methods, training, and community building for CLS
  • Aim: Further develop this long-standing and dynamic area within Digital Humanities
  • Related initiative: Journal of Computational Literary Studies (jcls.io)
  • Work area: Methodological considerations, e.g. Survey of Methods in CLS (Schöch, Dudar, and Fileva 2023)

The problem:
Authorship Attribution

Stylometric Authorship Attribution: Basic idea

  • Authors have deep-seated habits of language use
  • These habits manifest in a ‘key profile’ of over- and underuse of words (see Evert et al. 2017)
  • Texts can be represented as vectors based on word frequencies
  • Distances between texts in a multidimensional vector space represent similarity of texts
  • This can be one important source of information on authorship of disputed or anonymous texts
  • Various methods of authorship attribution rely on these assumptions (see Schöch, Dudar, and Fileva 2023)
  • High-profile cases: Shakespeare, Rowling, Molière/Corneille, Ferrante (see Tuzzi and Cortelazzo 2018), etc.
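As an illustration, the vector-and-distance idea above can be sketched in a few lines, using Burrows-style z-scored frequencies and Manhattan distance. The toy texts and the 5-word feature set are invented for illustration, not drawn from ELTeC.

```python
# Minimal sketch: texts as most-frequent-word (MFW) vectors,
# compared via z-scored frequencies and Manhattan distance (Burrows-style).
from collections import Counter
import math

texts = {
    "A1": "the cat sat on the mat and the dog sat too",
    "A2": "the dog and the cat sat on the mat",
    "B1": "a bird flew over a tree while a fox ran",
}

# Relative frequencies of the corpus-wide most frequent words
tokens = {name: t.split() for name, t in texts.items()}
mfw = [w for w, _ in Counter(w for ts in tokens.values() for w in ts).most_common(5)]
freqs = {name: [ts.count(w) / len(ts) for w in mfw] for name, ts in tokens.items()}

def delta(a, b):
    """z-score each MFW dimension across the corpus, then average absolute differences."""
    dist = 0.0
    for i in range(len(mfw)):
        col = [freqs[n][i] for n in texts]
        mean = sum(col) / len(col)
        sd = math.sqrt(sum((x - mean) ** 2 for x in col) / len(col)) or 1.0
        dist += abs((freqs[a][i] - mean) / sd - (freqs[b][i] - mean) / sd)
    return dist / len(mfw)

print(delta("A1", "A2") < delta("A1", "B1"))  # the same-"author" pair is closer
```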

Multiple factors for stylometric similarity of texts

  • Language of texts (fundamentally)
  • Authorship
  • Translation
  • Time period
  • Genre / text type
  • Formal aspects
  • Corpus composition

Open questions: To what extent does each of these factors influence stylometric authorship attribution? How do they interact?

Earlier Work

Rybicki and Eder 2011: “Deeper Delta across Genres and Languages”

  • They investigated feature selection and distance measures
  • Multiple corpora in various languages and genres
  • Explored impact of feature selection and language on authorship attribution
  • But: language and corpus composition vary simultaneously
  • Challenge: Impact of each of these factors on attribution accuracy remains unclear

Data:
Derived from ELTeC

Corpus source and composition

  • Source of data
    • European Literary Text Collection (ELTeC) (see Schöch et al. 2021)
    • Time Period: 1840–1920, 12 languages
  • Subcorpora for this study
    • Languages: English, French, Hungarian, Ukrainian
    • 3 novels each by 8–10 authors per subcorpus
    • Varying corpus sizes due to differences in text length

Data preparation

  • Translation
    • Automated using DeepL Pro (July–Sept. 2023)
    • Translated each corpus into 3 other languages
    • Result: 4 originals + 12 translations = 16 corpora
  • Metadata collection
    • Basics: Authors, publication year, word count
    • Additionally: subgenres, narrative perspective
  • Linguistic annotation (all corpora)
    • Using spaCy 3.7
    • Lemmas and part-of-speech

Method:
Classification Task

Authorship classification task

  • Basic idea
    • Manipulate language independently of corpus composition
      => need for translated corpora
    • Language varies, while corpus composition is held stable
  • Classification method
    • Support Vector Machines (SVM)
    • Weight of each word / dimension is learned to best separate authors
  • Approach
    • Leave-one-out classification
    • Performance evaluation using accuracy and Cohen’s Kappa
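A minimal sketch of this setup, assuming scikit-learn's `LinearSVC`, `LeaveOneOut`, and `cohen_kappa_score`; the six miniature "novels" and two "authors" are synthetic stand-ins, not the ELTeC data.

```python
# Hedged sketch: linear SVM over word-count vectors,
# leave-one-out evaluation, accuracy and Cohen's kappa.
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.metrics import accuracy_score, cohen_kappa_score

texts = [
    "the river ran cold and the night was long",
    "the night ran long and the river was cold",
    "cold river long night and the was the",
    "bright sun happy garden flowers bloom in spring",
    "flowers bloom bright in the happy spring garden",
    "spring garden sun flowers happy bright bloom",
]
labels = ["A", "A", "A", "B", "B", "B"]  # two synthetic "authors"

# Word-frequency vectors; each held-out text is predicted
# from a model trained on the remaining five.
X = CountVectorizer(max_features=100).fit_transform(texts)
pred = cross_val_predict(LinearSVC(), X, labels, cv=LeaveOneOut())

print(accuracy_score(labels, pred), cohen_kappa_score(labels, pred))
```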

Features and sample sizes

  • Features
    • Units: Word / Lemma / POS / Character
    • N-grams: 1–3 (words) / 1–5 (characters)
  • Sample sizes
    • 5,000–10,000 words + full novels
    • 10,000–50,000 characters
  • Vector length
    • 50–2,000 most frequent features
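The word- and character-n-gram features can be illustrated with a minimal pure-Python extractor; the sample sentence is invented and the study's actual pipeline may differ.

```python
# Sketch of the feature units: word n-grams and character n-grams
# drawn from a text sample.
def ngrams(seq, n):
    """All contiguous n-grams of a sequence (words or characters)."""
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]

sample = "the quick brown fox"
words = sample.split()

print(ngrams(words, 2))              # word bigrams
print(ngrams(list(sample), 5)[:3])   # first three character 5-grams
```

Counting these units across a corpus and keeping the top 50–2,000 by frequency yields the feature vectors used for classification.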

Results:
Language and translation

Performance across corpora

  • Differences between languages
    (in the original)
  • Original texts outperform translations
  • Variable loss in performance with translations

  • Parameters: word unigrams; samples of 10,000 words and full novels; averaged over 100–1,000 most frequent words.

Results per language

  • English
    • originals 69%
    • translations much lower (52–61%)
  • French
    • originals 71%
    • translations moderately lower (61–65%)
  • Ukrainian
    • originals 76%
    • translations much lower (48–57%)
  • Hungarian
    • originals 87%
    • translations clearly lower (63–71%)

Showcase: interactive visualization

  • Purpose: Explore the many detailed results at your leisure
  • Heatmap: mean accuracy (color) per corpus, across parameters
  • Parameters
    • X-axis (fixed): most frequent features
    • Y-axis (fixed): sample size
    • Data shown for: selected feature type, n-grams

Showcase: key observations

  • Sample size:
    longer samples are better
  • Number of features:
    more is not better
  • Features: Usually very good performance for:
    • word or lemma unigrams
    • character 5-grams
  • For each corpus / language, different parameters are best

Preliminary results:
Corpus composition

Goal: Estimate the attribution difficulty for a given corpus

  • Work so far:
    • Language (and translation) are two key factors
    • Main limitation: use of machine translation (homogenizing factor)
    • And: Corpus composition across languages varies in our corpora
  • Preliminary results from next step:
    • Investigate influence of corpus composition in existing data
    • Factors: subgenre, narrative form, time period

Preliminary results (1)

  • Metadata vector for each novel (d=7)
    • publication year
    • subgenre
    • narrative perspective
  • => Metadata-based distances between texts
  • Relationship to lexicon-based distances
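One way to realize such a 7-dimensional metadata vector is one scaled year dimension plus one-hot encodings of subgenre and narrative perspective (1 + 3 + 3 = 7). The category inventories and the Euclidean distance below are assumptions for illustration, not the study's exact scheme.

```python
# Hypothetical metadata vector (d=7) and metadata-based distance.
import math

def encode(novel, subgenres, perspectives):
    vec = [(novel["year"] - 1840) / 80]  # publication year scaled to ~[0, 1]
    vec += [1.0 if novel["subgenre"] == g else 0.0 for g in subgenres]
    vec += [1.0 if novel["perspective"] == p else 0.0 for p in perspectives]
    return vec

def dist(a, b):
    """Euclidean distance between two metadata vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

subgenres = ["historical", "social", "adventure"]        # assumed categories
perspectives = ["first-person", "third-person", "mixed"]  # assumed categories

n1 = {"year": 1850, "subgenre": "historical", "perspective": "third-person"}
n2 = {"year": 1900, "subgenre": "social", "perspective": "third-person"}
print(dist(encode(n1, subgenres, perspectives), encode(n2, subgenres, perspectives)))
```

Such pairwise distances can then be averaged per corpus and compared with lexicon-based distances or classification accuracy.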

Preliminary results (2)

  • Average metadata distances
  • Including for translated corpora
  • Relationship to classification accuracy

Next step:
Corpus for subsampling

Aims and assumptions

  • Aims:
    • Go beyond metadata distance
    • Investigate relationship between authorship and other factors
  • Assumptions
    • In a corpus where authorship correlates strongly with other factors, attribution is easier
    • In a corpus where other factors vary independently of authorship, attribution is more difficult

Corpus design and approach

  • 1600 French novels, 1960–1989
  • Multiple authors represented with many novels
  • Multiple authors who wrote in multiple subgenres over several decades
  • Generate subcorpora with stronger or weaker correlation between authorship and subgenre, time period, and narrative perspective
  • Evaluate classification performance in each case
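The subsampling idea can be sketched as follows, with synthetic novels standing in for the 1600-novel corpus: one subcorpus where subgenre is confounded with authorship (expected to be easier) and one where the two are crossed (expected to be harder).

```python
# Sketch of controlled subsampling: vary the correlation between
# authorship and subgenre. Novels here are synthetic stand-ins.
import random

novels = [{"author": a, "subgenre": g, "id": f"{a}-{g}-{i}"}
          for a in ["A", "B"] for g in ["crime", "romance"] for i in range(5)]

def subsample(confounded, k=3, seed=0):
    rng = random.Random(seed)
    if confounded:
        # each author appears in exactly one subgenre
        cells = [("A", "crime"), ("B", "romance")]
    else:
        # both authors span both subgenres
        cells = [(a, g) for a in ["A", "B"] for g in ["crime", "romance"]]
    pick = []
    for a, g in cells:
        pool = [n for n in novels if n["author"] == a and n["subgenre"] == g]
        pick += rng.sample(pool, k)
    return pick

easy = subsample(confounded=True)    # authorship and subgenre aligned
hard = subsample(confounded=False)   # authorship and subgenre crossed
print(len(easy), len(hard))
```

Classification accuracy on such subcorpora would then indicate how much each confounded factor contributes to apparent attribution performance.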

Conclusion

Main contribution

  • Authorship attribution: Performed with ‘traditional’ machine-learning methods
  • Machine translation: Performed with neural networks / large language models (‘AI’)
  • New insights into the relationship between language, features, and attribution performance

Limitations and next steps

  • Main limitations
    • Homogenizing effect of machine translation
    • Only one corpus per language (corpus-specific effects?)
  • Next steps
    • Repeat same experiments with more corpora
    • Repeat experiments with authentic translations (but translation corpora are rare)
    • Systematically vary corpus composition

Thank you!

References


Evert, Stefan, Fotis Jannidis, Thomas Proisl, Steffen Pielström, Thorsten Vitt, Christof Schöch, and Isabella Reger. 2017. “Understanding and Explaining Distance Measures for Authorship Attribution.” Digital Scholarship in the Humanities 32 (suppl_2). https://doi.org/10.1093/llc/fqx023.
Rybicki, Jan, and Maciej Eder. 2011. “Deeper Delta Across Genres and Languages: Do We Really Need the Most Frequent Words?” Literary and Linguistic Computing.
Schöch, Christof, Julia Dudar, and Evgeniia Fileva, eds. 2023. Survey of Methods in Computational Literary Studies. https://doi.org/10.5281/zenodo.7892112.
Schöch, Christof, Tomaž Erjavec, Roxana Patras, and Diana Santos. 2021. “Creating the European Literary Text Collection: Challenges and Opportunities.” Modern Languages Open.
Schöch, Christof, Evgeniia Fileva, Julia Havrylash, and Artjoms Šeļa. 2024. “Multilingual Stylometry.” In Computational Humanities Research 2024: Proceedings.
Tuzzi, Arjuna, and Michele A. Cortelazzo, eds. 2018. Drawing Elena Ferrante’s Profile. https://www.padovauniversitypress.it/en/publications/9788869381300.