Keyness in Computational Literary Studies: History, Definitions and Evaluation



Christof Schöch
With Keli Du, Julia Dudar, Julia Röttgermann, Julian Schröter
(Trier University, Germany)



Untangling Associations: Advances in keyword and collocation analysis
Université Paul Valéry, Montpellier, 22 Sept. 2023.

Overview

  • # 1 – History: Keyness in CLS
  • # 2 – Definitions: Keyness / Distinctiveness
  • # 3 – Evaluation: Classification Task
  • # 4 – Conclusion: Findings and Outlook

History: Keyness in CLS

Burrows’ Zeta in Authorship Attribution

  • The key publication is: John Burrows, “All the way through” (Burrows 2007)
  • He proposed to use Zeta in the context of authorship attribution
  • Zeta is calculated as the difference of the document frequencies of a feature in two contrasting sets of documents, where the documents are segments of full texts.
  • Zeta: “A simple measure of [an author’s] consistency in the use of each word-type.” (=> dispersion)
  • Focuses “on a single author and seek[s] to identify which of many texts are most likely to be his or hers.” (=> authorship attribution)
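
The calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used by Burrows or in the project: it assumes that "document frequency" means the proportion of segments containing the word, and the function names and the toy segments are hypothetical.

```python
def segment(tokens, size):
    """Split a token list into consecutive segments of `size` tokens.

    A trailing remainder shorter than `size` is dropped (an assumption).
    """
    return [tokens[i:i + size] for i in range(0, len(tokens) - size + 1, size)]


def document_proportion(word, segments):
    """Share of segments in which `word` occurs at least once."""
    if not segments:
        return 0.0
    return sum(1 for seg in segments if word in seg) / len(segments)


def zeta(word, target_segments, comparison_segments):
    """Difference of the two document proportions, ranging from -1 to 1."""
    return (document_proportion(word, target_segments)
            - document_proportion(word, comparison_segments))


# Toy data: two segments per group (purely illustrative).
target = [["sword", "king"], ["king", "love"]]
comparison = [["love", "ring"], ["ring", "ship"]]

scores = {w: zeta(w, target, comparison) for w in ("king", "love", "ring")}
```

In this toy case a word present in every target segment and in no comparison segment scores 1, a word equally dispersed in both groups scores 0, and a word found only in the comparison group scores -1.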

Zeta for Authorship Attribution: Shakespeare (Craig and Kinney 2009)

Further uses and discussion of Keyness

  • Uses and discussion of Zeta:
  • Uses of other keyness measures:
    • using Antconc (log-likelihood)
    • or TXM (spécificité)

Keyness for gender (Weidman and O’Sullivan 2018)

Scatterplot of segments by male and female authors, by percentage of markers and anti-markers, for three literary periods.

Keyness for Genre (Schöch 2018)

PCA plot using 50 Zeta-based keywords. Comedies in red, tragedies in blue, tragi-comedies in green.

Zeta and Company (Schöch et al. 2018)

  • Project funded by the German Research Foundation (DFG, 2020-2023)
  • Domain of application: popular subgenres of the 20th-century French novel
  • Inspirations: John Burrows (Burrows 2007), Jefrey Lijffijt (Lijffijt et al. 2014), MOTIFS (e.g. Kraif and Tutin 2017), Phraséorom (e.g. Diwersy et al. 2021), dispersion (Gries 2008)
  • Fundamental aim: Enable scholars in CLS to make educated choices about what keyness measure to use
  • Also: Bridge the gap between CCL and CLS
  • Activities: modeling, implementing, evaluating and using statistical measures of comparison of two groups of texts.

Definitions: Keyness / Distinctiveness

Traditional definition of keyness

  • Purely quantitative sense: A keyword is “a word which occurs with unusual frequency […] [in a document or corpus] by comparison with a reference corpus”. (Scott 1997)
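
One common way to operationalize "unusual frequency" is the log-likelihood ratio mentioned earlier in connection with AntConc. The sketch below assumes the simple two-cell form of the statistic (observed vs. expected frequency in each corpus). The function and parameter names are hypothetical, and actual tools may use a different variant.

```python
import math


def log_likelihood(freq_target, freq_ref, size_target, size_ref):
    """Two-cell log-likelihood (G2) for one word in a target and a reference corpus.

    freq_*: occurrences of the word; size_*: total tokens in each corpus.
    """
    total = size_target + size_ref
    combined = freq_target + freq_ref
    # Expected frequencies if the word were equally frequent in both corpora.
    expected_target = size_target * combined / total
    expected_ref = size_ref * combined / total

    g2 = 0.0
    for observed, expected in ((freq_target, expected_target),
                               (freq_ref, expected_ref)):
        if observed > 0:  # a zero count contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    return 2 * g2
```

The score is 0 when the relative frequencies are identical and grows as they diverge. Because it relies on frequency alone, it does not reflect how evenly the word is dispersed across documents.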

What is Distinctiveness? (Schröter et al. 2021)

  • (A) Logical vs. statistical sense
    • Purely logical: A feature is distinctive of corpus A if its presence in a document D is a necessary and sufficient condition for D to belong to A and not to B.
    • Statistical: A feature is distinctive of corpus A if it is true that, the higher its keyness in document D, the higher the probability that D is an instance of A and not of B.
  • (B) Salient vs. agnostic
    • Salient: A feature is distinctive iff it is noticed by readers (for confirming or violating their expectations)
    • Agnostic: A feature can be distinctive without being salient in the above-mentioned sense.
  • (C) Qualitative vs. no qualitative content
    • Qualitative content: A feature is distinctive iff it expresses e.g. aboutness or stylistic character (=> interpretability)
    • No qualitative content: A feature can be key regardless of qualitative content (=> discriminatory power)

Measures in Zeta and Company (Du et al. 2022)

Evaluation: Classification Task

Evaluation Task: Genre Classification (Du, Dudar, and Schöch 2022)

  • Downstream classification task: “How reliably can a machine learning classifier, based on words identified using a given measure of distinctiveness, identify the subgenre of a novel when provided only with a short segment of that novel?”
  • Basic setup
    • 4 classifiers
    • Different numbers of keywords (N)
    • Textual units are 5000-word segments
    • 10-fold cross-validation (90/10 split of segments)
    • Baseline: random selection of N words
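
The scaffolding around the classifiers can be sketched as follows. This is an illustrative sketch rather than the code used in the study: it covers only the random baseline, the representation of a segment as relative frequencies of the N selected words (itself an assumption about the feature representation), and the 90/10 fold construction. The classifier itself is left out, and all function names are hypothetical.

```python
import random
from collections import Counter


def random_baseline(vocabulary, n, seed=0):
    """Baseline feature set: N words drawn at random from the vocabulary."""
    return random.Random(seed).sample(sorted(vocabulary), n)


def featurize(segment_tokens, keywords):
    """Represent a segment by the relative frequency of each selected word."""
    if not segment_tokens:
        return [0.0 for _ in keywords]
    counts = Counter(segment_tokens)
    return [counts[word] / len(segment_tokens) for word in keywords]


def k_fold_indices(n_items, k=10, seed=0):
    """Yield (train, test) index lists; with k=10 each test fold holds about 10%."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, folds[i]


# Usage: 20 hypothetical segments give ten 18/2 splits.
splits = list(k_fold_indices(20, k=10))
```

Replacing `random_baseline` with the top N words ranked by a given keyness measure, while keeping everything else fixed, makes the classification score comparable across measures.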

Results #1 (Du, Dudar, and Schöch 2022)

Classification performance on the French corpus (1980s) with four classifiers, depending on the measure of distinctiveness and the setting of 𝑁.

Results #2 (Du, Dudar, and Schöch 2022)

Distribution of classification performance on the 1980s French corpus with N = 10 using Multinomial Naive Bayes

Conclusion: Findings and Outlook

What have we found out so far?

  • Definition
    • Keyness or distinctiveness, as a concept, can be defined in different ways
    • A match between a certain understanding of distinctiveness and a specific statistical operationalization can be established using a suitable method of evaluation.
  • Evaluation
    • Dispersion-based keyness measures show the best performance in a subgenre classification task, especially when the number of features is small
    • Such measures also tend to select medium-frequency words that are highly interpretable (=> salient, qualitative)

What are the next steps?

  • Perform further experiments, using synthetic texts and test tokens with pre-determined frequency- and/or dispersion-based contrasts
  • Perform an application study that aims to match keywords to generic traits derived from research on popular subgenres (qualitative reference for qualitative understanding of distinctiveness)
  • Add measures: dispersion + LLR (Egbert and Biber 2019), measure based on DPnofreq (Gries 2021), LRC (Evert 2022)
  • Move on to more complex features: multi-word expressions and semantic features (next project phase, 2024-2026)
  • Find a strategy for handling a multi-dimensional approach to keyness (multiple meanings, multiple measures), e.g. along the lines proposed by Gries (2019)

Thank you!

References

Barber, Ros. 2021. “Big Data or Not Enough? Zeta Test Reliability and the Attribution of Henry VI.” Digital Scholarship in the Humanities 36 (3): 542–64. https://doi.org/10.1093/llc/fqaa041.
Burrows, John. 2007. “All the Way Through: Testing for Authorship in Different Frequency Strata.” Literary and Linguistic Computing 22 (1): 27–47. https://doi.org/10.1093/llc/fqi067.
Craig, Hugh, and Arthur F. Kinney. 2009. Shakespeare, Computers, and the Mystery of Authorship. Cambridge University Press.
Diwersy, Sascha, Laetitia Gonon, Vannina Goossens, Olivier Kraif, Iva Novakova, Julie Sorba, and Ilaria Vidotto. 2021. “La phraséologie du roman contemporain dans les corpus et les applications de la PhraseoBase.” Corpus, no. 22. https://doi.org/10.4000/corpus.6101.
Du, Keli, Julia Dudar, Cora Rok, and Christof Schöch. 2021. “Zeta & Eta: An Exploration and Evaluation of Two Dispersion-based Measures of Distinctiveness.” In Proceedings of the Conference on Computational Humanities Research 2021, edited by Maud Ehrmann, Folgert Karsdorp, Melvin Wevers, Tara Lee Andrews, Manuel Burghardt, Mike Kestemont, Enrique Manjavacas, Michael Piotrowski, and Joris van Zundert, 2989:181–94. CEUR Workshop Proceedings. Amsterdam, the Netherlands: CEUR.
———. 2022. “Kontrastive Textanalyse mit pydistinto - Ein Python-Paket zur Nutzung unterschiedlicher Distinktivitätsmaße.” Potsdam: Zenodo. https://doi.org/10.5281/zenodo.6327967.
Du, Keli, Julia Dudar, and Christof Schöch. 2022. “Evaluation of Measures of Distinctiveness. Classification of Literary Texts on the Basis of Distinctive Words.” Journal of Computational Literary Studies 1 (1). https://doi.org/10.48694/jcls.102.
Egbert, Jesse, and Doug Biber. 2019. “Incorporating Text Dispersion into Keyword Analyses.” Corpora 14 (1): 77–104. https://doi.org/10.3366/cor.2019.0162.
Evert, Stephanie. 2022. “Measuring Keyness.” In Book of Abstracts of the Digital Humanities 2022. Tokyo: ADHO. https://doi.org/10.17605/OSF.IO/CY6MW.
Gries, Stefan Th. 2008. “Dispersions and Adjusted Frequencies in Corpora.” International Journal of Corpus Linguistics 13 (4): 403–37. https://doi.org/10.1075/ijcl.13.4.02gri.
———. 2019. “15 Years of Collostructions: Some Long Overdue Additions/Corrections (to/of Actually All Sorts of Corpus-Linguistics Measures).” International Journal of Corpus Linguistics 24 (3): 385–412. https://doi.org/10.1075/ijcl.00011.gri.
———. 2021. “What Do (Most of) Our Dispersion Measures Measure (Most)? Dispersion?” November. https://doi.org/10.1075/jsls.21029.gri.
Hoover, David L. 2010. “Teasing Out Authorship and Style with t-Tests and Zeta.” In Digital Humanities Conference. London.
———. 2022. “Zeta Revisited.” Digital Scholarship in the Humanities 37 (4): 1002–21. https://doi.org/10.1093/llc/fqab095.
Kraif, Olivier, and Agnès Tutin. 2017. “Des motifs séquentiels aux motifs hiérarchiques : l’apport des arbres lexico-syntaxiques récurrents pour le repérage des routines discursives.” Corpus, no. 17 (January). https://doi.org/10.4000/corpus.2889.
Lijffijt, Jefrey, Terttu Nevalainen, Tanja Säily, Panagiotis Papapetrou, Kai Puolamäki, and Heikki Mannila. 2014. “Significance Testing of Word Frequencies in Corpora.” Digital Scholarship in the Humanities 31 (2): 374–97. https://doi.org/10.1093/llc/fqu064.
Rizvi, Pervez. 2019a. “An Improvement to Zeta.” Digital Scholarship in the Humanities 34 (2): 419–22. https://doi.org/10.1093/llc/fqy039.
———. 2019b. “The Interpretation of Zeta Test Results.” Digital Scholarship in the Humanities 34 (2): 401–18. https://doi.org/10.1093/llc/fqy038.
———. 2022. “The Interpretation of Zeta Test Results: A Supplement.” Digital Scholarship in the Humanities 37 (4): 1172–78. https://doi.org/10.1093/llc/fqac011.
Schöch, Christof. 2018. “Zeta für die kontrastive Analyse literarischer Texte. Theorie, Implementierung, Fallstudie.” In Quantitative Ansätze in den Literatur- und Geisteswissenschaften. Systematische und historische Perspektiven, edited by Toni Bernhart, Sandra Richter, Marcus Lepper, Marcus Willand, and Andrea Albrecht, 77–94. Berlin: de Gruyter.
Schöch, Christof, Daniel Schlör, Albin Zehe, Henning Gebhard, Martin Becker, and Andreas Hotho. 2018. “Burrows’ Zeta: Exploring and Evaluating Variants and Parameters.” In Book of Abstracts of the Digital Humanities Conference. Mexico City: ADHO.
Schröter, Julian, Keli Du, Julia Dudar, Cora Rok, and Christof Schöch. 2021. “From Keyness to Distinctiveness – Triangulation and Evaluation in Computational Literary Studies.” Journal of Literary Theory 15 (1-2): 81–108. https://doi.org/10.1515/jlt-2021-2011.
Scott, Mike. 1997. “PC Analysis of Key Words And Key Key Words.” System 25 (2): 233–45. https://doi.org/10.1016/S0346-251X(97)00011-0.
Weidman, Sean G., and James O’Sullivan. 2018. “The Limits of Distinctive Words: Re-evaluating Literature’s Gender Marker Debate.” Digital Scholarship in the Humanities 33 (2): 374–90.

Bonus slides

Measures with references (Du, Dudar, and Schöch 2022)

All corpora (Du, Dudar, and Schöch 2022)

Correlation between measures (Du, Dudar, and Schöch 2022)

Keyness in stylo: genre (A.C. Doyle)