Quantitative Semantics
Topic Modeling of English and Georgian Texts




Prof. Dr. Christof Schöch
Trier University, Germany

Institute for Georgian Literature – Tbilisi State University – Georgia

13 Mar 2025

Overview

  1. Introduction
  2. Distributional Semantics – Principles and Methods
  3. What are Word Embeddings?
  4. What is Topic Modeling? Examples
  5. Topic Models – the Theory
  6. A Topic Modeling pipeline
  7. First steps doing Topic Modeling
  8. Advanced issues in Topic Modeling
  9. Wrapping up

Introduction

About this workshop

  • Context, examples, theory, demo, hands-on for Topic Modeling
  • Python-based, but not a Python workshop
    (“read and run” code, rather than write code)
  • Learning goal: you understand how a Topic Model is created and can run your own Topic Modeling Pipeline
  • Download code and sample datasets:
    https://github.com/dh-trier/topicmodeling

A Topic Modeling pipeline

Introductions to Python

  • A motivating and extensive video tutorial: Mosh, Python Tutorial for Beginners: https://www.youtube.com/watch?v=_uQrJ0TkZlc
  • A hands-on, interactive tutorial: Folgert Karsdorp, Python Programming for the Humanities, https://www.karsdorp.io/python-course/
  • A useful introductory book: Allen B. Downey, Think Python, 2nd edition: https://greenteapress.com/wp/think-python-2e/

About myself

  • Professor of Digital Humanities
  • Not a computer scientist, not a statistician
  • French literary scholar by training
  • Interests in corpus building and quantitative text analysis
  • see: https://christof-schoech.de/en

About you: raise your hand if…

  • … you are a literary scholar
  • … you are a historian
  • … you are a sociologist
  • … you are a (computational / corpus) linguist
  • … you are a computer scientist
  • … you are a digital humanist
  • … you are a librarian
  • … you consider yourself to be a local

Distributional Semantics: Principles and Methods

Basic intuition about distributional semantics

  • “Her friend’s …… was located on the second floor of the house.”
    • “apartment” !
    • “room” !
    • “balcony” ?
    • “cat” ??
    • “shark” ???

What does this example tell us?

  • We are able to rank the likelihood of these words in the given context
  • We use world knowledge, but also linguistic competency, for this
  • Computers can learn this too, based on cooccurrence patterns
  • That’s how distributional semantics works!

Basic idea

  • The meaning of words depends on their context
    “You shall know a word by the company it keeps” (Firth, 1957)
  • Words frequently appearing in similar contexts have similar meanings
  • Words that can appear in very similar, specific contexts have similar grammatical functions

Two applications of this idea

  • Topic Modeling
  • Word Embeddings

What are Word Embeddings?

Information Retrieval: Vector Space Model

  • Each document has a certain place in a vector space
  • That place is determined by the keywords that appear in the document
  • Each word is a dimension in the vector space
  • Documents with shared vocabulary end up in the same area of the vector space (a minimal sketch follows below)
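
A quick way to get a feel for this is to build a small term-document matrix yourself. The sketch below is not part of the workshop materials; it assumes a recent scikit-learn installation and uses three made-up documents.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the inspector entered the dark apartment",
    "the detective searched the apartment for clues",
    "waves rolled softly onto the sunny beach",
]

vectorizer = CountVectorizer()
dtm = vectorizer.fit_transform(docs)        # documents x terms matrix
print(vectorizer.get_feature_names_out())   # the dimensions (= terms)
print(cosine_similarity(dtm))               # the two crime-like documents end up closest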

Information Retrieval: Vector Space Model

Words in vector space

Example: French Wikipedia Model

  • 1.8 million articles, 750 million words
  • transform the term-document matrix into a dense matrix
  • “low-dimensional”, dense representation
  • skip-gram model, 300 dimensions
  • vector semantics: geometric relations = semantic relations (a training sketch follows below)
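
A minimal training sketch with Gensim (assumptions: gensim >= 4.0; “sentences.txt” is a hypothetical file with one tokenized, lemma/POS-tagged sentence per line, not the actual Wikipedia dump used for this model):

from gensim.models import Word2Vec

with open("sentences.txt", encoding="utf-8") as infile:
    sentences = [line.split() for line in infile]

# sg=1 selects the skip-gram architecture; 300 dimensions as above
model = Word2Vec(sentences, vector_size=300, sg=1, window=5, min_count=5)
model.wv.save("wikipedia-fr.kv")   # keep only the word vectors for querying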

Similar Words Query

Query:   ['poésie_nom', 10]
Result:  poétique_adj     0.841
         poème_nom        0.790
         prose_nom        0.733
         littérature_nom  0.715
         poète_nom        0.704
         poétique_nom     0.701
         poésie_nam       0.700
         anthologie_nom   0.695
         littéraire_adj   0.655
         sonnet_nom       0.651

(authentic data, Wikipedia model)

Similarity Query

Query: ['prose_nom', 'littérature_nom']
Result: 0.511518681366

Query: ['poésie_nom', 'littérature_nom']
Result: 0.714615326722

(authentic data, Wikipedia model)
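
Both query types can be reproduced with Gensim’s KeyedVectors. A minimal sketch, assuming a saved vector file named “wikipedia-fr.kv” (a hypothetical name, not the workshop’s actual model):

from gensim.models import KeyedVectors

wv = KeyedVectors.load("wikipedia-fr.kv")

# Similar-words query: the ten nearest neighbours of "poésie_nom"
for word, score in wv.most_similar("poésie_nom", topn=10):
    print(f"{word:<20}{score:.3f}")

# Similarity query: cosine similarity between two words
print(wv.similarity("prose_nom", "littérature_nom"))
print(wv.similarity("poésie_nom", "littérature_nom"))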

Evaluation

  • Method: a “find-the-wrong-word” task (see the sketch below)
  • Lists of similar words:
    • vert, bleu, jaune, rouge, orange
    • billet, monnaie, portemonnaie, payement
  • Generate lists with an error
    • vert, bleu, monnaie, jaune, rouge
  • Wikipedia model: 90% accuracy in finding the error
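
Gensim ships a ready-made version of this task, doesnt_match(), which returns the word furthest from the mean vector of the list. A minimal sketch (the model file name is hypothetical, and in practice the words would need the model’s lemma/POS tagging):

from gensim.models import KeyedVectors

wv = KeyedVectors.load("wikipedia-fr.kv")
print(wv.doesnt_match(["vert", "bleu", "monnaie", "jaune", "rouge"]))
# a good model should print: monnaie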

Axes of meaning

Axes of meaning: concrete vs. abstract

Axis query

Axis: [["bonheur", "joie"],          # positive
       ["malheur", "tristesse"]]     # negative

Query:   ange
Result:  0.0875

Query:   monstre
Result:  -0.1407

(authentic data)
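
One way to implement such an axis (a sketch, not the original implementation): take the difference between the mean vectors of the positive and the negative word group, then measure the cosine similarity of a query word with that axis. Model file name and word forms are assumptions, as above.

import numpy as np
from gensim.models import KeyedVectors

wv = KeyedVectors.load("wikipedia-fr.kv")

def axis_score(positive, negative, query):
    # the "axis of meaning" points from the negative pole to the positive pole
    axis = wv[positive].mean(axis=0) - wv[negative].mean(axis=0)
    vec = wv[query]
    return float(np.dot(axis, vec) / (np.linalg.norm(axis) * np.linalg.norm(vec)))

print(axis_score(["bonheur", "joie"], ["malheur", "tristesse"], "ange"))
print(axis_score(["bonheur", "joie"], ["malheur", "tristesse"], "monstre"))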

Time for questions!

References

  • Goldberg, Yoav, and Omer Levy. “word2vec Explained: deriving Mikolov et al.’s negative-sampling word-embedding method”. arXiv.org, 2014. http://arxiv.org/abs/1402.3722.
  • Heuser, Ryan. “Word Vectors in the Eighteenth Century”. In: Digital Humanities 2017: Conference Abstracts, 256–60. Montréal: McGill University & Université de Montréal, 2017.
  • Mikolov, Tomas, Kai Chen, Greg Corrado, and Jeffrey Dean. “Efficient Estimation of Word Representations in Vector Space”. arXiv.org, 2013. http://arxiv.org/abs/1301.3781.
  • Pennington, Jeffrey, Richard Socher, and Christopher D. Manning. “GloVe: Global Vectors for Word Representation”. In: Proceedings of EMNLP 2014.
  • Turney, Peter D., and Patrick Pantel. “From Frequency to Meaning: Vector Space Models of Semantics”. Journal of Artificial Intelligence Research 37 (2010): 141–88. https://arxiv.org/abs/1003.1141.
  • Widdows, Dominic. Geometry and Meaning. CSLI Lecture Notes, no. 172. Stanford, CA: CSLI Publications, 2004.

Bonus slides

CBOW Model

Projection

Relation of words in semantic dimensions

Comparing models (novels vs. Wikipedia)

What is Topic Modeling?

(a) Some fundamentals

Topic Modeling: basic idea

  • Works on the basis of (large) collections of documents
  • Each document is understood as a mixture of topics
  • The purpose is to discover thematic trends and patterns
  • Discovered through generative probabilistic modeling

Usage scenarios

  • Information Retrieval: Search not for individual terms, but themes / semantic fields
  • Recommender Systems: Recommend similar journal articles etc. to users
  • Exploration of text collections: what is an email or newspaper corpus about?
  • Research questions from literary studies, cultural studies, history of ideas: topics across authors, genres, time periods

Exploratory Visualization

Existing Studies

  • Cameron Blevins: “Topic Modeling Martha Ballard’s Diary” (2010): diary
  • Ted Underwood and Andrew Goldstone (2012): “What can topic models of PMLA teach us…”: history of a discipline
  • Lisa Rhody, “Topic Modeling and Figurative Language” (2012): ekphrasis in poetry
  • Matthew Jockers, Macroanalysis (2013): novel, nationality, gender
  • Ben Schmidt: “Typical TV episodes” (2014): TV shows; temporal development
  • Christof Schöch, “Topic Modeling Genre” (2017): drama, subgenres

(b) A topic model for French crime fiction

Text collection: 840 French Novels

Crime fiction (prototypical)

  • Long, narrative, fictional prose (=novel)
  • Character inventory: investigators, criminals, suspects, witnesses, victims
  • Plot: violent crime, rational elucidation
  • Setting: urban space
  • => Hypotheses regarding possible topics

Topic and subgenre

Topic 10: detective, inspector, police. Distinctive of crime fiction (content & statistics, p < α = 0.01)

Topic and subgenre

Topic 49: death, crime, to kill. Distinctive of crime fiction (content & statistics, p < α = 0.01)

Topic and subgenre

Topic 47: door, room, to open

Topic and subgenre

Topic 26: beach, sand, sun. Distinctive of non-crime fiction (p < α = 0.001)

Topics over text segments

Topic 2: judge, prison, lawyer/attorney. Statistically significant (crime fiction) for segment pairs (1,4), (4,5), etc.

Topics over text segments

Topic 33: black, hair, eyes, wear, eye, face. Statistically significant: crime fiction, all segment pairs except (2,3); non-crime fiction, (1,3) and (2,5)

Overall results

  • A large proportion of the topics is statistically distinctive: crime fiction (31/80), non-crime fiction (21/80)
  • Topics are not just themes, but also narrative motifs, descriptive elements, character sets
  • Textual progression: only a few topics have significant trends
  • Overall: we can detect thematic trends in 840 novels without reading (all of) them!

Time for questions

Bonus slides: visualizations

Topics and subgenres: topic 3

Topics and authors: topic 3

Topics / subgenres heatmap

topic clustering

(top 50 topics, cosine/weighted)

topic clustering (detail)

(top 50 topics, cosine/weighted)

Topics over decades

Topics and authors: clustering

topic-work bimodal network

topic-work bimodal network (detail)

topics over text progression

topics over text progression

topics by genre and text progression

PCA based on topic scores (subgenres)

Topic Modeling: Theory

(a) What does a topic model look like?

On a practical level

  • A topic is a group of words with some (semantic) relation (e.g., a common theme, motif, etc.)
  • Each topic is made up of words of varying importance and relevance to the topic
  • Each document is made up of several topics in various proportions

On a technical level

  • A topic model is an abstract representation of all topics and documents in a collection
  • A topic is a probability distribution over words
  • A document is a probability distribution over topics
  • The Dirichlet distribution (in LDA) describes the topic mixture distribution of the model

Words in topic distribution

(Each word has a score in each topic; here ordered by topic/rank)

Topics in document distribution

(Each topic has a score in each document; ordered by document)
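
A tiny, self-contained Gensim sketch of both distributions, using four made-up documents (the numbers it prints will differ from the figures above):

from gensim.corpora import Dictionary
from gensim.models import LdaModel

docs = [
    ["inspector", "murder", "police", "suspect"],
    ["beach", "sun", "sand", "holiday"],
    ["police", "crime", "murder", "inspector"],
    ["sun", "beach", "sea", "holiday"],
]
dictionary = Dictionary(docs)
bows = [dictionary.doc2bow(doc) for doc in docs]
lda = LdaModel(bows, id2word=dictionary, num_topics=2, passes=50, random_state=1)

# a topic is a probability distribution over words
for word, prob in lda.show_topic(0, topn=5):
    print(f"{word:<12}{prob:.3f}")

# a document is a probability distribution over topics
print(lda.get_document_topics(bows[0], minimum_probability=0.0))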

(b) How is a Topic Model created?

Some relevant ideas

  • The most widespread implementation uses ‘Latent Dirichlet Allocation’ (LDA)
  • Follows the “bag-of-words” model: word order is ignored
  • No semantic knowledge / dictionary / WordNet etc. is used; language-independent
  • Based on distributional semantics: “You shall know a word by the company it keeps” (John Firth, 1957)
  • Discovers words which frequently occur together or in similar contexts (=topics)
  • Infers how important each word is in each topic
  • Infers how important each topic is in each document

Generative, inverted, iterative

“A topic model is a generative model for documents: it specifies a simple probabilistic procedure by which documents can be generated. To make a new document, one chooses a distribution over topics. Then, for each word in that document, one chooses a topic at random according to this distribution, and draws a word from that topic. Standard statistical techniques can be used to invert this process, inferring the set of topics that were responsible for generating a collection of documents.” (Steyvers and Griffiths 2006)

Inference problem: observed data

Inferred, latent model

Bayesian Statistics

“The computational problem of inferring the hidden topic structure from the documents is the problem of computing the posterior distribution, the conditional distribution of the hidden variables, given the documents.”

(David Blei, “Probabilistic Topic Models”, 2012)

Inference task

p(Z, φ, θ | w, α, β)

  • Compute the probability p of the latent variables…
    • Z = assignments of each word in each document to a topic
    • φ (phi) = distribution over words (for each topic)
    • θ (theta) = distribution over topics (for each document)
  • …given our observed variables (input data) and parameters
    • w = the data, i.e. the words in each document
    • α = parameter of the Dirichlet prior for topics per document
    • β = parameter of the Dirichlet prior for words per topic (see the gensim mapping below)
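
A minimal sketch of how these symbols surface in Gensim’s LdaModel (the two toy documents are made up; in the workshop pipeline, corpus and dictionary come from the preprocessing steps):

from gensim.corpora import Dictionary
from gensim.models import LdaModel

texts = [["crime", "police", "inspector"], ["beach", "sun", "sand"]]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]   # w: the observed words

lda = LdaModel(
    corpus=corpus,
    id2word=dictionary,
    num_topics=2,      # the number of topics (a free choice)
    alpha="auto",      # α: Dirichlet prior on topics per document
    eta="auto",        # β: Dirichlet prior on words per topic (Gensim calls it eta)
    passes=10,
)
print(lda.get_topics().shape)                # φ: one word distribution per topic
print(lda.get_document_topics(corpus[0]))    # θ: topic distribution of document 0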

Dirichlet distributions

They describe the topic mixture distributions of the model. Shown here: several possible distributions over three topics.

The starting point of LDA

  • We have the documents with their words (e.g. as a word/document frequency matrix)
  • We are looking for the word distributions per topic, the topic distributions per document, and the topic assignment of each word
  • Both distributions depend on each other (if a topic’s word distribution changes, the documents’ topic distributions change as well)
  • And both distributions need to fit with the original documents

The generative model behind LDA

  • For each topic, there is a distribution over words
  • For each document, there is a distribution over topics
  • For each word in each document:
    • We sample a topic from the topic distribution of that document
    • We sample a word from the word distribution of that topic
  • This would only work if we already had the distributions, which we don’t (a toy sketch of the generative process follows below)
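
A toy sketch of this generative story with NumPy, using made-up sizes (5 topics, a 1,000-word vocabulary, one 80-word document):

import numpy as np

rng = np.random.default_rng(0)
n_topics, vocab_size, doc_length = 5, 1000, 80
alpha, beta = 0.1, 0.01

phi = rng.dirichlet([beta] * vocab_size, size=n_topics)   # one word distribution per topic
theta = rng.dirichlet([alpha] * n_topics)                 # topic distribution of one document

words = []
for _ in range(doc_length):
    z = rng.choice(n_topics, p=theta)       # sample a topic for this word slot
    w = rng.choice(vocab_size, p=phi[z])    # sample a word (id) from that topic
    words.append(w)
print(words[:10])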

Random initialization

  • For each document, we generate a random distribution over topics
  • For each topic, we generate a random distribution over words
  • For each word in each document:
    • Sample a topic from the topic distribution
    • Sample a word from the word distribution of that topic
  • Now we have a model; but we know it’s most likely wrong (=low confidence)

Inference: iterative approximation

  • Using the observed data and our (random, mostly wrong) initial model, we can iteratively improve the model
  • One among several methods: Gibbs sampling (see the sketch after this list)
    • For one word in one document, remove its current topic assignment
    • Based on the topic assignments of the other words in that document, and on how often this word is assigned to each topic across the corpus, assign a new topic to the word
    • Do this in a way that optimizes the model in line with the Dirichlet priors (mixture of topics per document, mixture of words per topic)
    • Update the overall model according to this new assignment
  • Repeat until your time runs out or your evaluation task says it’s ok to stop
  • See also: Luis Serrano, “Gibbs Sampling”, https://www.youtube.com/watch?v=BaM1uiCpj_E, 2020
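
A toy sketch of a single (collapsed) Gibbs sampling update for one word token, assuming an implementation keeps the usual count matrices up to date; this is illustrative, not the workshop’s or MALLET’s actual code.

import numpy as np

def resample_topic(d, w, z_old, n_dk, n_kw, n_k, alpha, beta, rng):
    """One Gibbs update for word type w at one position in document d.
    n_dk[d, k]: tokens in document d assigned to topic k
    n_kw[k, w]: tokens of word type w assigned to topic k (corpus-wide)
    n_k[k]:     total tokens assigned to topic k"""
    vocab_size = n_kw.shape[1]
    # 1. remove the existing topic assignment of this token
    n_dk[d, z_old] -= 1
    n_kw[z_old, w] -= 1
    n_k[z_old] -= 1
    # 2. full conditional: p(z = k | rest) ∝ (n_dk + α) · (n_kw + β) / (n_k + V·β)
    p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + vocab_size * beta)
    z_new = rng.choice(len(p), p=p / p.sum())
    # 3. add the new assignment back into the counts
    n_dk[d, z_new] += 1
    n_kw[z_new, w] += 1
    n_k[z_new] += 1
    return z_new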

Time for questions

Further Reading: Theory

Introductory articles

  • Blei, David M. (2012). “Probabilistic topic models”. In: Communications of the ACM, 55(4): 77–84. http://www.cs.princeton.edu/~blei/papers/Blei2012.pdf
  • Steyvers, M. and Griffiths, T. (2006). “Probabilistic Topic Models”. In: Landauer, T. et al. (eds), Latent Semantic Analysis: A Road to Meaning. Laurence Erlbaum.

Video lectures

  • Jordan Boyd-Graber, “Topic Models”, YouTube.com, 2015. https://www.youtube.com/watch?v=yK7nN3FcgUs
  • David Blei, “Topic Models”, Videolectures.net, 2012. http://videolectures.net/mlss09uk_blei_tm/

Bonus slides

Latent Dirichlet Allocation: plate notation

Latent Dirichlet Allocation: plate notation

  • N = number of words in document d
  • M = number of documents
  • α (alpha): Dirichlet prior (hyperparameter: sparse / smooth distribution of topics)
  • β (beta): Dirichlet prior (hyperparameter: sparse / smooth distribution of words)
  • θ (theta): distribution over topics (for each document; latent variable)
  • ϕ (phi): distribution over words (for each topic; latent variable)
  • z = assignments of words to topics (latent variable)
  • w = words in a document (observed variable)

A Topic Modeling pipeline

A Topic Modeling pipeline

Some parameters

  • Preprocessing: text segmentation, lemmatization, feature selection
  • Modeling: number of topics, number of iterations, etc.
  • Evaluation: model quality measure
  • Postprocessing: level of metadata / text linkage
  • Visualization: many options

The pipeline in Python

  • Preprocessing: NLTK / TextBlob
  • Corpus ingest, modeling, evaluation: Gensim
  • Postprocessing: pandas
  • Visualization: pyLDAvis, seaborn, wordcloud, etc. (a minimal end-to-end pipeline sketch follows below)
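
A minimal end-to-end sketch with NLTK and Gensim, not the workshop’s actual scripts: it assumes a folder “corpus/” with one plain-text file per document and the NLTK “punkt” and “stopwords” data installed.

import glob
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# preprocessing: tokenize, lowercase, remove stopwords and non-words
stops = set(stopwords.words("english"))
texts = []
for path in sorted(glob.glob("corpus/*.txt")):
    with open(path, encoding="utf-8") as infile:
        tokens = [t.lower() for t in word_tokenize(infile.read())]
    texts.append([t for t in tokens if t.isalpha() and t not in stops])

# corpus ingest: dictionary, feature selection, bag-of-words representation
dictionary = Dictionary(texts)
dictionary.filter_extremes(no_below=5, no_above=0.5)
corpus = [dictionary.doc2bow(text) for text in texts]

# modeling and a first look at the results
lda = LdaModel(corpus, id2word=dictionary, num_topics=20, passes=10)
for topic_id, words in lda.print_topics(num_topics=5, num_words=8):
    print(topic_id, words)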

Implementations of Topic Modeling

  • MALLET: Java-based, LDA, command line, fast, very good results, no visuals
  • Gensim: Python-based, LDA, scripts, slower, less convincing results, nice visuals
  • BERTopic: Python-based, BERT + clustering, fast, good results, some visuals

Time for questions

Tutorials

First steps doing Topic Modeling

Some starting points

Getting ready

  • Launch the Python IDE (recommended: VS Codium; alternatively Geany, Spyder, PyCharm)
  • Please download or clone the “tm-simple” repository linked above
  • Let’s all run the test script again.
  • Has everyone got the “OK”s when running the test script?

The workshop data

  • datasets/
  • results/
  • scripts/

The script architecture

  • each step in the pipeline (input-output) is one module
  • each module consists of several functions
  • a “main” function coordinates these functions
  • the “run_pipeline.py” script coordinates the modules
  • NB: each module reads and writes data

A closer look at “run_pipeline1.py”

  • Imports
  • Files and Folders
  • Parameters
  • Functions
  • Coordinating function

Step by step: preprocessing

  • Open “preprocessing.py” with Codium
  • Note the parameters
  • Note the file structure
  • Note the flow of the data (input/output)
  • Run it from “run_pipeline1.py”

Running the pipeline one by one

  • preprocessing
  • build_corpus
  • modeling
  • postprocessing
  • make_overview

Practice

Exercise 1: run “run_pipeline1.py”

  • Use the small “hkpress-test” dataset
  • Decide on your own parameters
  • Run the entire pipeline (step by step or in one go)
  • What error messages do you get, if any?
  • What kind of results do you get?

Exercise 2: Adapt the commands

  • Continue using the “hkpress-test” corpus
  • Decide on a new “identifier” for your model
  • Do one of the following (your choice)
    • Modify the stopword list (in: preprocessing.py)
    • Use a different number of topics (in: run_pipeline1.py)
  • Inspect the results and write down any changes you notice

More issues in Topic Modeling

Activity 1: More visualizations

  • Word clouds (“wordles”): module “make_wordle” (in “run_pipeline2.py”); see the sketch below
  • Topic probability distribution heatmaps (“make_heatmaps”); depends on “metadata.csv”
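
A minimal word-cloud sketch with the wordcloud and matplotlib packages (not the “make_wordle” module itself), continuing from the hypothetical pipeline sketch shown earlier, where lda is the trained Gensim model:

import matplotlib.pyplot as plt
from wordcloud import WordCloud

topic_id = 0
freqs = dict(lda.show_topic(topic_id, topn=40))   # word -> probability in this topic
wc = WordCloud(width=800, height=400, background_color="white")
wc.generate_from_frequencies(freqs)

plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.title(f"Topic {topic_id}")
plt.show()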

Activity 2: Model evaluation

  • “evaluation.py” (in “run_pipeline2.py”)
    • overall model coherence (best: c_v)
    • individual topic coherence (c_v)
  • various measures of model quality
  • many types of evaluation (beyond the code here); a minimal coherence sketch follows below
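
A minimal c_v coherence sketch with Gensim’s CoherenceModel (not the “evaluation.py” module itself), continuing from the hypothetical pipeline sketch shown earlier, where lda, texts and dictionary already exist:

from gensim.models import CoherenceModel

cm = CoherenceModel(model=lda, texts=texts, dictionary=dictionary, coherence="c_v")
print("overall model coherence (c_v):", cm.get_coherence())
print("per-topic coherence (c_v):", cm.get_coherence_per_topic())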

Topic Coherence Measures

Activity 3 / Discussion: bring your own corpus

  • If there is time left…
  • Does anyone have a collection of many short English-language texts?
  • How would you go about running a model for it?
  • What if your text collection is neither English nor French?

Wrapping up

Summary of what we have covered

  • A bit of background on distributional semantics
  • An idea of what a topic model consists of
  • An intuition of how topic models are inferred
  • Some avenues for interpreting and visualizing topic models
  • The overall workflow required for topic modeling
  • How to use Python for topic modeling

A few things not covered here

  • Implementation details of Gibbs Sampling
  • Precursors of LDA: LSA, pLSA, NNMF, etc.
  • Variants of LDA: hierarchical, labeled, dynamic, etc.
  • Evaluation strategies: human evaluation, external, internal

Your questions and projects

  • What kind of projects / text collections do you have?
  • What kind of research questions do you have?
  • What do you think topic modeling could tell you?




Thank you! | დიდი მადლობა [didi madɫoba]