Legal aspects and scholarly requirements regarding TDM in the digital humanities: current developments in Germany

French-German Meeting
on Copyrighted Works in Digital Libraries

Christof Schöch

Trier Center for Digital Humanities
Trier University, Germany

2023-12-14

Introduction

Overview

Background
Derived Text Formats
- Earlier proposals
- Defining DTFs
- Evaluating DTFs
Conclusion

Background

The TDM Exception: UrhG §60d

The key provision in the German Copyright Act (Urheberrechtsgesetz, UrhG) is §60d
Derived from the ‘Directive on Copyright in the Digital Single Market’
Essentially, it allows using in-copyright works for TDM
- in the context of non-commercial research
- for research itself and for quality assurance
- including cleaning, enriching, structuring the data
- on the condition of lawful access to the materials
Crucially, it does not allow sharing data with researchers outside the project team.

URL: UrhG §60d

Details: Raue (2021).

The NFDI Consortium Text+

NFDI = National Research Data Infrastructure
Text+ is one of 30 NFDI consortia; it focuses on:
- Scholarly Digital Editions
- (Text and Speech) Collections
- Lexicographical Resources
Within Collections, DNB, IDS and TCDH are working on TDM for in-copyright texts
- Standardize “derived text formats”
- Ascertain their legal status
- Evaluate their usefulness for research

URL: textplus.org

Precursors of DTFs

Early Proposal: “Corpus Masking”

Proposal from 2007
Intended for sharing TreeBanks
Basic idea:
- mask some information (word forms)
- retain other information (annotations)
- enable sharing despite copyright or other legal restrictions

See: Rehm et al. (2007).
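
The basic idea can be illustrated with a minimal sketch. It assumes a toy treebank represented as (word form, POS tag) pairs; the function name `mask_word_forms` and the placeholder `_` are hypothetical choices for illustration and are not taken from Rehm et al. (2007).

```python
# Hypothetical treebank-style tokens: (word form, POS tag) pairs.
tokens = [("The", "DET"), ("cat", "NOUN"), ("sleeps", "VERB"), (".", "PUNCT")]


def mask_word_forms(annotated_tokens, placeholder="_"):
    """Replace every word form with a placeholder and keep the annotation."""
    return [(placeholder, tag) for _form, tag in annotated_tokens]


masked = mask_word_forms(tokens)
print(masked)
```

The annotation layer survives unchanged, while the protected word forms are no longer present in the shared data.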

High-Profile Case: Google NGrams

See: Haber (2012).

HTRC Extracted Features Dataset

The dataset (v2.0) is quite large; it has data on
- 17.1 million volumes
- 6 billion pages
- 2.9 trillion tokens
Contains non-consumptive features
- with per-volume metadata
- on a per-page basis
- number of lines (and empty lines)
- part-of-speech tagged term token counts
- in a pretty technical JSON format

See: Jett et al. (2020)
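
To make the notion of page-level features concrete, here is a simplified sketch of how such features could be computed from a single page of plain text. The field names (`line_count`, `empty_line_count`, `token_count`, `term_counts`) are hypothetical and do not reproduce the actual HTRC JSON schema; POS tagging is left out, and tokenization is a naive whitespace split.

```python
from collections import Counter


def page_features(page_text):
    """Compute simple per-page features; the page text itself is not returned."""
    lines = page_text.split("\n")
    tokens = page_text.lower().split()  # naive whitespace tokenization
    return {
        "line_count": len(lines),
        "empty_line_count": sum(1 for line in lines if not line.strip()),
        "token_count": len(tokens),
        "term_counts": dict(Counter(tokens)),
    }


page = "It was a dark night.\n\nA dark and stormy night."
features = page_features(page)
print(features["line_count"], features["empty_line_count"], features["token_count"])
```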

Defining DTFs

A working definition

The basic idea behind derived text formats is essentially the following: It is based on collections of copyright-protected full texts […] to which an institution has legal access. These text collections are transformed into so-called derived text formats through the application of processing routines, which essentially represent both targeted information enrichment (for example through linguistic annotation) and information reduction (for example through the deletion of word forms or the removal of sequence information).
The derived text formats are designed in such a way that the texts in the form then available no longer fall within the scope of copyright on the one hand, but on the other hand still allow the application of the most diverse quantitative analyses of the texts possible. […] Such datasets can be stored without restrictions, used in research, published and reused by third parties. In addition to reuse and publication, the creation of derived text formats is also possible without permission, provided that there is legal access to the original material.

Source: Schöch et al. (2020); see also: Grisse (2020), Jotzo (2020).

Aspects of text to be transformed

Information enrichment through annotation
- Structural information: sentence, chapter, scene, act boundaries
- Linguistic annotation: lemma, POS, named entities, etc.
- Word embedding vectors; etc.
Information reduction
- Masking / replacement of word forms
- Randomization of word order
- Summarization: frequency / dispersion information
Parameters (examples)
- Segment size for randomization (10, 100, 1000?)
- Proportion of masked words (10%, 40%, 80%?)

Example: The Term-Document-Matrix
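
A term-document matrix keeps only how often each term occurs in each document and discards all sequence information. The following is a minimal sketch using only the standard library; the two toy documents, the whitespace tokenization, and the function name `term_document_matrix` are assumptions for illustration.

```python
from collections import Counter


def term_document_matrix(docs):
    """Return (vocabulary, matrix): one row per document, one column per term."""
    counts = {name: Counter(text.lower().split()) for name, text in docs.items()}
    vocab = sorted(set().union(*counts.values()))
    matrix = {name: [c[term] for term in vocab] for name, c in counts.items()}
    return vocab, matrix


docs = {
    "doc1": "the cat sat on the mat",
    "doc2": "the dog sat",
}
vocab, matrix = term_document_matrix(docs)
print(vocab)
for name, row in matrix.items():
    print(name, row)
```

Each row is a bag of words: the original word order cannot be read off the counts.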

Example: Segment-Wise Randomization with Annotation
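
Segment-wise randomization can be sketched as follows: the annotated token sequence is cut into consecutive segments of a fixed size, and the tokens are shuffled within each segment. This is an illustrative sketch, not the project's implementation; the function name `randomize_segments`, the (word form, POS tag) representation, and the fixed seed are assumptions.

```python
import random


def randomize_segments(annotated_tokens, segment_size=10, seed=0):
    """Shuffle tokens within consecutive segments of `segment_size` tokens."""
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    shuffled = []
    for start in range(0, len(annotated_tokens), segment_size):
        segment = list(annotated_tokens[start:start + segment_size])
        rng.shuffle(segment)
        shuffled.extend(segment)
    return shuffled


# Hypothetical annotated text: 25 (word form, POS tag) pairs.
tokens = [(f"w{i}", "X") for i in range(25)]
dtf = randomize_segments(tokens, segment_size=10)
print(dtf[:10])
```

Every token stays inside its original segment, so frequencies per segment are preserved while the word order within each segment is lost. Larger segments remove more sequence information.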

Example: Selective Replacement by POS
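
Selective replacement by POS can be sketched like this: tokens whose POS tag belongs to a chosen set are replaced by the tag itself, and all other word forms are kept. The tag set (nouns, verbs, adjectives), the tag names, and the function name `replace_by_pos` are assumptions made for this example.

```python
def replace_by_pos(annotated_tokens, pos_to_replace=frozenset({"NOUN", "VERB", "ADJ"})):
    """Replace the word form by its POS tag for the selected parts of speech."""
    return [tag if tag in pos_to_replace else form for form, tag in annotated_tokens]


# Hypothetical annotated sentence: (word form, POS tag) pairs.
sentence = [("The", "DET"), ("old", "ADJ"), ("cat", "NOUN"), ("sleeps", "VERB"), (".", "PUNCT")]
replaced = replace_by_pos(sentence)
print(" ".join(replaced))
```

In this variant the word order and the function words remain intact, while the content words are reduced to their grammatical category.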

Evaluating DTFs

DTFs for Sentiment Analysis: Idea

Key idea: Sentiment Analysis using LLMs requires fine-tuning
Does fine-tuning work with DTFs?
Method:
- Use DistilBERT for Sentiment Analysis
- Fine-tune with an in-domain corpus in the original form and as DTF
- Compare the results
Result: Performance drops surprisingly late, e.g. only when more than 50% of word forms are replaced by POS (see graph below)
Conclusion: DTFs are useful for fine-tuning LLMs

DTFs for Sentiment Analysis: Results

Source: Du and Schöch (2023).

DTFs for Topic Modeling

Master's thesis in DH (Martin Kocula)
Topic Modeling with English novels
Evaluation: topic coherence (Palmetto)
Original text compared to:
- Term-document-matrix
- Segment-wise randomization of word order
- Selective replacement of tokens
Results: distribution of topic coherence over 20 runs
Conclusion: Topic modeling works fine with randomization and replacement, but not with the term-document matrix.

DTFs for stylometric authorship attribution: Idea

Authorship attribution task on benchmark corpora
- French Novels (ELTeC-fra)
- German Drama (GerDraCor)
- English-language historical prose (Royal Society Corpus)
Derived Text Formats tested:
- Replacement / masking of word tokens
- 0%, 10%, …, 90%, 100%
Results: Performance drops moderately until 40%, then more drastically (see below)
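
The series of masking levels can be generated with a short sketch like the following. It assumes that a random subset of token positions is replaced by a placeholder; how tokens were actually selected in the experiments is not stated here, so the random selection, the placeholder `_`, and the function name `mask_proportion` are illustrative assumptions.

```python
import random


def mask_proportion(tokens, proportion, placeholder="_", seed=0):
    """Replace a given proportion of tokens, chosen at random, by a placeholder."""
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    n_mask = round(len(tokens) * proportion)
    masked_positions = set(rng.sample(range(len(tokens)), n_mask))
    return [placeholder if i in masked_positions else tok for i, tok in enumerate(tokens)]


tokens = "it was the best of times it was the worst of times".split()
levels = [p / 10 for p in range(11)]  # 0%, 10%, ..., 100%
variants = {level: mask_proportion(tokens, level) for level in levels}
for level, variant in variants.items():
    print(f"{level:.0%}", " ".join(variant))
```

Each variant can then be passed to the same attribution pipeline, so that performance can be compared across masking levels.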

DTFs for stylometric authorship attribution: Results

Conclusion

Key take-aways

Derived Text Formats are useful for TDM on in-copyright materials
There is a need for standardization, which requires:
- Systematic definition of formats
- Check for copyright-safety
- Evaluation of usefulness for research
Good progress on this in the NFDI consortium Text+

References

Du, Keli, and Christof Schöch. 2023. “Understanding the Impact of Two Derived Text Formats on DistilBERT-based Binary Sentiment Classification.” In Computational Humanities Conference 2023. Paris.

Grisse, Karina. 2020. “Nutzbarmachung urheberrechtlich geschützter Textbestände für die Forschung durch Dritte: Rechtliche Bedingungen und Möglichkeiten.” RuZ - Recht und Zugang 1 (2): 143–59. https://doi.org/10.5771/2699-1284-2020-2-143.

Haber, Peter. 2012. “Zeitgeschichte und Digital Humanities.” Docupedia-Zeitgeschichte. https://doi.org/10.14765/ZZF.DOK.2.269.V1.

Jett, Jacob, Boris Capitanu, Deren Kudeki, Timothy Cole, Yuerong Hu, Peter Organisciak, Ted Underwood, Eleanor Dickson Koehl, Ryan Dubnicek, and J. Stephen Downie. 2020. “The HathiTrust Research Center Extracted Features Dataset (2.0).” HathiTrust Research Center. https://doi.org/10.13012/R2TE-C227.

Jotzo, Florian. 2020. “Der Schutz großer Textbestände nach dem UrhG: Die Nutzbarmachung fremder Textbestände für die Forschung.” RuZ - Recht und Zugang 1 (2): 128–42. https://doi.org/10.5771/2699-1284-2020-2-128.

Raue, Benjamin. 2021. “Die Freistellung von Datenanalysen durch die neuen Text und Data Mining-Schranken (§§ 44b, 60d UrhG).” ZUM, no. 10: 793–802. https://beck-online.beck.de/Bcid/Y-300-Z-ZUM-B-2021-S-793-N-1.

Rehm, Georg, Andreas Witt, Heike Zinsmeister, and Johannes Dellert. 2007. “Corpus Masking: Legally Bypassing Licensing Restrictions for the Free Distribution of Text Collections.” In Digital Humanities 2007: Conference Abstracts.

Schöch, Christof, Frédéric Döhl, Achim Rettinger, Evelyn Gius, Peer Trilcke, Peter Leinen, Fotis Jannidis, Maria Hinzmann, and Jörg Röpke. 2020. “Abgeleitete Textformate: Text und Data Mining mit urheberrechtlich geschützten Textbeständen.” Zeitschrift für digitale Geisteswissenschaften (ZfdG) 5. https://doi.org/10.17175/2020_006.

Danke / Thank you / Merci !

Contact: schoech@uni-trier.de
Social media: fedihum.org/@christof