Modeling biographical trajectories
of overseas-educated Chinese as Linked Open Data
利用關聯開放數據建模受過海外教育的華人傳記




Christof Schöch / Queenie K.H. Lam (林韵香)
with contributions by Damir Padieu
Trier Center for Digital Humanities, Trier University, Germany

The Fifth NCCU – UTR Joint Conference
National Chengchi University, Taipei, Taiwan

29 Mar 2025

Introduction / Overview

Overview

  1. Overseas-Educated Chinese
  2. Related Work
  3. Prior work by our group
  4. Linked Open Data
  5. Our Data Model
  6. Conclusion

Preamble / disclaimer / acknowledgements

Overseas-educated Chinese

Overseas-educated Chinese: Definition(s)

  • Narrow definition
    • People born in the territory of today’s PRC who traveled outside China for an extended period of time to enrol in one or several Higher Education Institutions, and permanently returned to China afterwards.
  • Broad definition (as used in our primary source)
    • People who are ethnically Chinese who received their education outside China, and later made a significant contribution to culture, education or politics in China.

Objectives and Research Questions

  • Objectives
    • Longitudinal study: impact of student mobility on China 1847–1993
    • Include less prominent people: focus on networks, not individuals
    • Use appropriate digital methods to enhance our understanding of the data
    • Link existing datasets to German information sources and contexts
  • Research Questions
    • Where in China did Chinese studying overseas come from?
    • Where did they receive education?
    • What was their motivation to study overseas?
    • What subjects did they study, what degrees did they obtain?
    • What was their impact or contribution after returning to China?

Main sources

  • 新中国留学归国学人大辞典 / New China’s Overseas-Educated Returnee Scholars
    • Hubei Education Press, 1993
    • approx. 7,000 entries
  • 中国留学生大辞典 / Overseas-Educated Chinese Biographical Dictionary (OECBD)
    • Coverage: 1847–1978
    • Nanjing University Press, 1999
    • approx. 4,000 entries
  • Further sources
    • 民国人物大辞典 / Republic of China Biographical Dictionary (1991)
    • Biographical Dictionary of Chinese Women (Routledge, 2003)

Sources of authority data (selection!)

  • Wikidata: A hub for basic data and identifiers
  • ETER Project: Data on Higher Education Institutions in Europe
  • ISCED: International Standard Classification of Education (for fields of study, levels of study)
  • EUROSTAT: NUTS (regions and cities within Europe, including historical data on countries and territories); types of mobility

MHDB / IISMCC at Academia Sinica

  • IISMCC: Integrated Information System on Modern and Contemporary Characters
  • MHDB: Modern History Databases
  • Developed by the Institute of Modern History, Academia Sinica

Elites, Networks and Power in modern China

  • funded as ERC Advanced Grant (2022–2027)
  • led by Christian Henriot (historian, Aix-Marseille University)
  • Time period: 1830–1949
  • Geographical focus: Beijing/Tianjin, Guangzhou/Hong Kong, greater Shanghai
  • Technology: various, including HEURIST database
  • But: no authority data, no API access

Prior work by our group

From image to text: scanning and OCR

  • OCR Tool: Tesseract (open source, CLI, Chinese model)
  • Some post-correction required: punctuation, mismatched characters, etc.
  • Also: split text into individual entries

From text to data: OpenRefine

  • Named entities: Extract name(s), destination country, etc.
  • Data types: Recognize numbers as years/dates, etc.
  • Reconcile: match with Wikidata IDs
  • Geocode: get geo-coordinates for place names
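
The "recognize numbers as years" step can be illustrated outside OpenRefine with a few lines of Python. This is only a minimal sketch and not the OpenRefine workflow itself: the sample sentence is invented for illustration, and the pattern assumes four-digit years between 1800 and 1999 written in Arabic numerals.

```python
import re

# Assumption: years are four-digit Arabic numerals between 1800 and 1999
# and are not embedded in longer digit sequences.
YEAR_PATTERN = re.compile(r"(?<!\d)(1[89]\d{2})(?!\d)")


def extract_years(text):
    """Return all year-like numbers found in a biography text, in order."""
    return [int(year) for year in YEAR_PATTERN.findall(text)]


# Invented sample sentence, not taken from any dictionary entry.
sample = "1907年赴德国留学,1911年回国。"
print(extract_years(sample))
```

Years written in Chinese numerals or as reign dates would need additional handling that this sketch does not attempt.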

Families and provinces: Network Analysis

  • 梁 (Liang family): blue (= Jiangsu province)
  • 顾 (Gu family): pink (= Guangdong province)

Multi-country mobility pilot study

  • Focus on individuals with multi-country mobility (1847–1978, n = 381)
  • Key findings:
    • Great diversity in fields of study among Chinese students
    • Japan as a springboard for mobility to the West (particularly to Germany)
    • Pattern: first France, then Russia (from inner China; Work-Study Movement 1912–1927)
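
Patterns such as "first Japan, then Germany" can be found by counting transitions between consecutive destination countries. The following is a minimal sketch with made-up trajectories; the person identifiers, the data, and the function names are assumptions for illustration and do not reproduce the pilot study's data or code.

```python
from collections import Counter

# Hypothetical trajectories: ordered destination countries per person.
TRAJECTORIES = {
    "person-a": ["Japan", "Germany"],
    "person-b": ["France", "Russia"],
    "person-c": ["Japan", "Germany", "France"],
    "person-d": ["United States"],
}


def multi_country(trajectories):
    """Keep only people who studied in more than one country."""
    return {p: c for p, c in trajectories.items() if len(set(c)) > 1}


def transition_counts(trajectories):
    """Count how often country A is directly followed by country B."""
    counts = Counter()
    for countries in trajectories.values():
        counts.update(zip(countries, countries[1:]))
    return counts


mobile = multi_country(TRAJECTORIES)
for (origin, destination), n in transition_counts(mobile).most_common():
    print(f"{origin} -> {destination}: {n}")
```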

Next step: Linked Open Data

Principles of LOD

  • Information as “triples”: subject, predicate, object
  • Semantic modeling: entities and properties
  • Use of authority data and ontologies
  • Result: Knowledge network
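
As a minimal sketch of the triple idea, statements can be written as (subject, predicate, object) tuples and queried by pattern. The `ex:` identifiers and property names below are invented placeholders, not the project's actual vocabulary, and a real setup would use an RDF store or Wikibase rather than a Python list.

```python
# Hypothetical triples; identifiers and property names are placeholders.
TRIPLES = [
    ("ex:person1", "ex:educatedAt", "ex:UniversityOfLeipzig"),
    ("ex:person1", "ex:residence", "ex:France"),
    ("ex:person2", "ex:educatedAt", "ex:UniversityOfLeipzig"),
    ("ex:UniversityOfLeipzig", "ex:locatedIn", "ex:Germany"),
]


def match(triples, s=None, p=None, o=None):
    """Return all triples matching the pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


# Who was educated at the University of Leipzig?
for subject, _, _ in match(TRIPLES, p="ex:educatedAt", o="ex:UniversityOfLeipzig"):
    print(subject)
```

Because objects of one triple can be subjects of another (here, the university), the statements form a network rather than a table.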

Benefits of LOD

  • Flexibility (cf. relational databases)
  • Multilingualism (via identifier + labels)
  • Federation (query multiple databases at once)
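
Multilingualism "via identifier + labels" means that statements refer to a language-neutral identifier, and each language only adds a label to it. A minimal sketch, assuming invented `ex:` identifiers and a simple fallback rule that is not taken from the project:

```python
# Hypothetical identifiers with labels in several languages.
LABELS = {
    "ex:France": {"en": "France", "zh": "法国", "de": "Frankreich"},
    "ex:Psychology": {"en": "Psychology", "zh": "心理学", "de": "Psychologie"},
}


def label(identifier, lang, fallback="en"):
    """Return the label in the requested language, else the fallback
    language, else the bare identifier."""
    names = LABELS.get(identifier, {})
    return names.get(lang) or names.get(fallback) or identifier


print(label("ex:France", "zh"))
print(label("ex:Psychology", "fr"))
```

Adding a further interface language then only requires adding labels; the statements themselves stay unchanged.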

Our Data Model

Aims of the data model

  1. Semantic modeling: explicit ontology of meaningful classes (entities) and relationships (properties).
  2. Linked data: richly interlinked, both within the dataset and with relevant external resources or authority files.
  3. Knowledge provenance: each piece of information is attributed to a source, which has its own perspective.
  4. Fuzzy modeling: information is often incomplete, uncertain, and fuzzy to varying degrees.

Entities (classes)

  • person (e.g. 刀培然)
  • country (e.g. France)
  • city (e.g. Lyon)
  • HEI (e.g. University of Leipzig)
  • field (e.g. Psychology)

Properties

  • Identifiers (authority data)
    • OECBD ID (identifier in the OECBD)
    • Wikidata ID (identifier in Wikidata)
    • ISCED ID (identifier in ISCED)
  • Names
    • legal name(s)
    • translated / transliterated names
    • alternative names (e.g. pen name)
  • Others
    • family relationships (parents, siblings, etc.)
    • event (e.g.: person, event, birth)
    • residence (e.g.: person, residence, France)
    • education (e.g.: person, education, Sorbonne)
    • contributions (e.g. leadership of institution)

Qualifiers

  • date (e.g. for birth, marriage, death – EDTF, fuzzy)
  • duration (from EDTF to EDTF – fuzzy)
  • location (province or city within country, depending on granularity of source)
  • coordinate location (e.g. cities: lat, long)
  • field (within education at HEI – string and ISCED)
  • level of study (within education at HEI)
  • reference (source of information)
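
How a statement, its qualifiers, and a fuzzy EDTF date fit together can be sketched as follows. This is an illustration under stated assumptions, not the project's implementation: the `Statement` class, the `ex:person1` identifier, and all values are hypothetical, and `year_bounds` handles only a small year-level subset of EDTF (plain years, `X` placeholders, the `?`, `~`, `%` markers, and `/` intervals).

```python
import re
from dataclasses import dataclass, field

# Year-level EDTF subset: 1908, 190X, 19XX, optionally followed by ? ~ or %.
_YEAR = re.compile(r"^(\d{4}|\d{3}X|\d{2}XX)([?~%]?)$")


def year_bounds(edtf):
    """Return (earliest, latest) year covered by a year-level EDTF string."""
    if "/" in edtf:  # interval: take the start of the first part, the end of the second
        start, end = edtf.split("/", 1)
        return year_bounds(start)[0], year_bounds(end)[1]
    match = _YEAR.match(edtf)
    if match is None:
        raise ValueError(f"unsupported EDTF value: {edtf!r}")
    digits = match.group(1)
    return int(digits.replace("X", "0")), int(digits.replace("X", "9"))


def is_fuzzy(edtf):
    """True if the value is marked uncertain, approximate, or unspecified."""
    return any(ch in edtf for ch in "?~%X")


@dataclass
class Statement:
    """One property value for a person, with qualifiers and a source."""
    subject: str
    prop: str
    value: str
    qualifiers: dict = field(default_factory=dict)
    reference: str = ""


# Illustrative values only, not taken from any dictionary entry.
example = Statement(
    subject="ex:person1",
    prop="education",
    value="University of Leipzig",
    qualifiers={"duration": "1908?/1911", "field": "Psychology"},
    reference="OECBD",
)
print(year_bounds(example.qualifiers["duration"]), is_fuzzy(example.qualifiers["duration"]))
```

Keeping the original EDTF string and deriving the bounds on demand preserves the uncertainty recorded in the source while still allowing queries by time span.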

Example: entry for 蔡元培 / Cai Yuanpei

  • basic information (name, IDs, events)
  • residence information (historical entities)
  • education information (HEI, location, field)
  • contribution information (leadership roles)
  • bilingual database

Conclusion

Benefits and challenges of the LOD approach

  • Already apparent benefits
    • Easy to provide multilingual data and interface
    • All information has external identifiers and definitions
    • Federation: lots of information is maintained elsewhere (e.g. Wikidata)
  • Already obvious challenges
    • Not easy to identify and formalize the contributions of people
    • Tricky to locate events and places within shifting historical contexts
    • Data from relevant online sources not always accessible and reusable




Thank you! 谢谢!

References

Hinzmann, Maria, Matthias Bremm, Tinghui Duan, Julia Konstanciak, Julia Röttgermann, Moritz Steffes, Christof Schöch, and Joëlle Weis. 2024. “Patterns in modeling and querying a knowledge graph for literary history [preprint].” In Pattern Theory in Language and Communication, edited by Sabine Arndt-Lappe, Milena Belosevic, Peter Maurer, Claudine Moulin, Achim Rettinger, and Sören Stumpf. Trier: TCLC. https://doi.org/10.5281/zenodo.12080340.
Lam, Queenie K. H., and Damir Padieu. 2023. “Tracing Life Trajectories of Overseas Educated Chinese Students. A Network Analysis of Biographical Data 1847-1991.” In ENIS Conference. Tbilisi, Georgia: European Network on International Student Mobility. https://www.enisnetwork.com/event-details/2024-conference-of-cost-action-ca20115-enis.
Schöch, Christof, Maria Hinzmann, Katharina Dietz, Julia Röttgermann, and Anne Klee. 2020. “Smart Modeling for Digital Literary History.” In Smart Data ╳ Digital Humanities: 11th International Conference of Digital Archives and Digital Humanities (DADH2020). Taipei, Taiwan: Academia Sinica.
Schöch, Christof, Maria Hinzmann, Julia Röttgermann, Katharina Dietz, and Anne Klee. 2022. “Smart Modelling for Literary History.” International Journal of Humanities and Arts Computing 16 (1): 78–93. https://doi.org/10.3366/ijhac.2022.0278.
Zhao, Fudie. 2023. “A Systematic Review of Wikidata in Digital Humanities Projects.” Digital Scholarship in the Humanities 38 (2): 852–74. https://doi.org/10.1093/llc/fqac083.