Legal aspects and scholarly requirements regarding TDM in the digital humanities: current developments in Germany

French-German Meeting
on Copyrighted Works in Digital Libraries

Christof Schöch

Trier Center for Digital Humanities
Trier University, Germany

2023-12-14

Introduction

Overview

  • Background
  • Derived Text Formats
    • Earlier proposals
    • Defining DTFs
    • Evaluating DTFs
  • Conclusion

Background

The TDM Exception: UrhG §60d

  • The key provision in German copyright law (Urheberrechtsgesetz, UrhG) is §60d
  • Derived from the ‘Directive on Copyright in the Digital Single Market’
  • Essentially, it allows using in-copyright works for TDM
    • in the context of non-commercial research
    • for research itself and for quality assurance
    • including cleaning, enriching, structuring the data
    • on the condition of lawful access to the materials
  • Crucially, it does not allow sharing data with researchers outside the project team.

URL: UrhG §60d

Details: Raue (2021).

The NFDI Consortium Text+

  • NFDI = National Research Data Infrastructure
  • Text+ is one of 30 NFDI consortia; focused on:
    • Scholarly Digital Editions
    • (Text and Speech) Collections
    • Lexicographical Resources
  • Within Collections, DNB, IDS and TCDH are working on TDM for in-copyright texts
    • Standardize “derived text formats”
    • Ascertain their legal status
    • Evaluate their usefulness for research

Precursors of DTFs

Early Proposal: “Corpus Masking”

  • Proposal from 2007
  • Intended for sharing treebanks
  • Basic idea:
    • mask some information (word forms)
    • retain other information (annotations)
    • enable sharing despite copyright or other legal restrictions
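The basic idea can be illustrated with a minimal sketch. It assumes tokens are stored as (word form, POS tag) pairs; this representation, the function name `mask_corpus` and the placeholder string are hypothetical and do not reproduce the procedure of Rehm et al. (2007).

```python
from typing import List, Tuple

# Hypothetical token representation: (word form, POS tag)
Token = Tuple[str, str]

def mask_corpus(tokens: List[Token], placeholder: str = "___") -> List[Token]:
    """Replace every word form with a placeholder; keep the annotation."""
    return [(placeholder, pos) for _form, pos in tokens]

sentence = [("The", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")]
masked = mask_corpus(sentence)
print(masked)
```

The annotation layer (here only POS tags) stays available for analysis, while the original wording can no longer be read from the masked data.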


See: Rehm et al. (2007).

High-Profile Case: Google NGrams

See: Haber (2012).

HTRC Extracted Features Dataset

  • The dataset (v2.0) is quite large, with data on
    • 17.1 million volumes
    • 6 billion pages
    • 2.9 trillion tokens
  • Contains non-consumptive features
    • with per-volume metadata
    • on a per-page basis
    • number of lines (and empty lines)
    • part-of-speech-tagged token counts
    • in a pretty technical JSON format
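To give an impression of what per-page features of this kind look like, here is a rough sketch. It assumes a page is given as a list of lines, each a list of (token, POS) pairs; the function `page_features` and the field names are hypothetical and do not reproduce the actual HTRC JSON schema.

```python
from collections import Counter
from typing import Dict, List, Tuple

Line = List[Tuple[str, str]]  # hypothetical: one line of (token, POS) pairs

def page_features(page_lines: List[Line]) -> Dict:
    """Reduce one page to counts; word order within the page is discarded."""
    counts: Counter = Counter()
    for line in page_lines:
        for form, pos in line:
            counts[(form.lower(), pos)] += 1
    return {
        # Field names are illustrative only, not the HTRC schema
        "line_count": len(page_lines),
        "empty_line_count": sum(1 for line in page_lines if not line),
        "token_pos_counts": {
            f"{form}/{pos}": n for (form, pos), n in sorted(counts.items())
        },
    }

page = [[("The", "DET"), ("cat", "NOUN")], [], [("the", "DET")]]
feats = page_features(page)
print(feats)
```

Because only counts per page are kept, the running text cannot be read from the features, which is the sense of "non-consumptive" here.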

See: Jett et al. (2020)

Defining DTFs

A working definition

The basic idea behind derived text formats is essentially the following: It is based on collections of copyright-protected full texts […] to which an institution has legal access. These text collections are transformed into so-called derived text formats through the application of processing routines, which essentially represent both targeted information enrichment (for example through linguistic annotation) and information reduction (for example through the deletion of word forms or the removal of sequence information).
The derived text formats are designed in such a way that the texts in the form then available no longer fall within the scope of copyright on the one hand, but on the other hand still allow the application of the most diverse quantitative analyses of the texts possible. […] Such datasets can be stored without restrictions, used in research, published and reused by third parties. In addition to reuse and publication, the creation of derived text formats is also possible without permission, provided that there is legal access to the original material.

Source: Schöch et al. (2020); see also: Grisse (2020), Jotzo (2020).

Aspects of text to be transformed

  • Information enrichment through annotation
    • Structural information: sentence, chapter, scene, act boundaries
    • Linguistic annotation: lemma, POS, named entities, etc.
    • Word embedding vectors; etc.
  • Information reduction
    • Masking / replacement of word forms
    • Randomization of word order
    • Summarization: frequency / dispersion information
  • Parameters (examples)
    • Segment size for randomization (10, 100, 1000?)
    • Proportion of masked words (10%, 40%, 80%?)

Example: The Term-Document-Matrix