Quantitative Semantics
Topic Modeling of English and Georgian Texts




Prof. Dr. Christof Schöch
Trier University, Germany

Institute for Georgian Literature – Tbilisi State University – Georgia

13 Mar 2025

Overview

  1. Introduction
  2. Distributional Semantics – Principles and Methods
  3. What are Word Embeddings?
  4. What is Topic Modeling? Examples
  5. Topic Models – the Theory
  6. A Topic Modeling pipeline
  7. First steps doing Topic Modeling
  8. Advanced issues in Topic Modeling
  9. Wrapping up

Introduction

About this workshop

  • Context, examples, theory, demo, hands-on for Topic Modeling
  • Python-based, but not a Python workshop
    (“read and run” code, rather than write code)
  • Learning goal: you understand how a Topic Model is created and can run your own Topic Modeling Pipeline
  • Download code and sample datasets:
    https://github.com/dh-trier/topicmodeling

A Topic Modeling pipeline

Introductions to Python

  • A motivating and extensive video tutorial: Mosh, Python Tutorial for Beginners: https://www.youtube.com/watch?v=_uQrJ0TkZlc
  • A hands-on, interactive tutorial: Folgert Karsdorp, Python Programming for the Humanities, https://www.karsdorp.io/python-course/
  • A useful introductory book: Allen B. Downey, Think Python, 2nd edition: https://greenteapress.com/wp/think-python-2e/

About myself

  • Professor of Digital Humanities
  • Not a computer scientist, not a statistician
  • French literary scholar by training
  • Interests in corpus building and quantitative text analysis
  • see: https://christof-schoech.de/en

About you: raise your hand if…

  • … you are a literary scholar
  • … you are a historian
  • … you are a sociologist
  • … you are a (computational / corpus) linguist
  • … you are a computer scientist
  • … you are a digital humanist
  • … you are a librarian
  • … you consider yourself to be a local

Distributional Semantics: Principles and Methods

Basic intuition about distributional semantics

  • “Her friend’s …… was located on the second floor of the house.”
    • “apartment” !
    • “room” !
    • “balcony” ?
    • “cat” ??
    • “shark” ???

What does this example tell us?

  • We are able to rank the likelihood of these words in the given context
  • We use world knowledge, but also linguistic competency, for this
  • Computers can learn this too, based on cooccurrence patterns
  • That’s how distributional semantics works!

Basic idea

  • The meaning of words depends on their context
    “You shall know a word by the company it keeps” (Firth, 1957)
  • Words frequently appearing in similar contexts have similar meanings
  • Words that can appear in very similar, specific contexts have similar grammatical functions

Two applications of this idea

  • Topic Modeling
  • Word Embeddings

What are Word Embeddings?

Information Retrieval: Vector Space Model

  • Each document has a certain place in a vector space
  • That place is determined by the keywords that appear in the document
  • Each word is a dimension in the vector space
  • Documents with shared vocabulary end up in the same area of the vector space (a minimal sketch follows below)
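
A quick way to get a feel for this is to build a small term-document matrix yourself. The sketch below is not part of the workshop materials; it assumes a recent scikit-learn installation and uses three made-up documents.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the inspector entered the dark apartment",
    "the detective searched the apartment for clues",
    "waves rolled softly onto the sunny beach",
]

vectorizer = CountVectorizer()
dtm = vectorizer.fit_transform(docs)        # documents x terms matrix
print(vectorizer.get_feature_names_out())   # the dimensions (= terms)
print(cosine_similarity(dtm))               # the two crime-like documents end up closest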

Information Retrieval: Vector Space Model

Words in vector space

Example: French Wikipedia Model

  • 1.8 million articles, 750 million words
  • transform the term-document matrix into a dense matrix
  • “low-dimensional”, dense representation
  • skip-gram model, 300 dimensions
  • vector semantics: geometric relations = semantic relations (a training sketch follows below)
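
A minimal training sketch with Gensim (assumptions: gensim >= 4.0; “sentences.txt” is a hypothetical file with one tokenized, lemma/POS-tagged sentence per line, not the actual Wikipedia dump used for this model):

from gensim.models import Word2Vec

with open("sentences.txt", encoding="utf-8") as infile:
    sentences = [line.split() for line in infile]

# sg=1 selects the skip-gram architecture; 300 dimensions as above
model = Word2Vec(sentences, vector_size=300, sg=1, window=5, min_count=5)
model.wv.save("wikipedia-fr.kv")   # keep only the word vectors for querying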

Similar Words Query

Query:   ['poésie_nom', 10]
Result:  poétique_adj     0.841
         poème_nom        0.790
         prose_nom        0.733
         littérature_nom  0.715
         poète_nom        0.704
         poétique_nom     0.701
         poésie_nam       0.700
         anthologie_nom   0.695
         littéraire_adj   0.655
         sonnet_nom       0.651

(authentic data, Wikipedia model)

Similarity Query

Query: ['prose_nom', 'littérature_nom']
Result: 0.511518681366

Query: ['poésie_nom', 'littérature_nom']
Result: 0.714615326722

(authentic data, Wikipedia model)
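
Both query types can be reproduced with Gensim’s KeyedVectors. A minimal sketch, assuming a saved vector file named “wikipedia-fr.kv” (a hypothetical name, not the workshop’s actual model):

from gensim.models import KeyedVectors

wv = KeyedVectors.load("wikipedia-fr.kv")

# Similar-words query: the ten nearest neighbours of "poésie_nom"
for word, score in wv.most_similar("poésie_nom", topn=10):
    print(f"{word:<20}{score:.3f}")

# Similarity query: cosine similarity between two words
print(wv.similarity("prose_nom", "littérature_nom"))
print(wv.similarity("poésie_nom", "littérature_nom"))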

Evaluation

  • Method: a “find-the-wrong-word” task (see the sketch below)
  • Lists of similar words:
    • vert, bleu, jaune, rouge, orange
    • billet, monnaie, portemonnaie, payement
  • Generate lists with an error
    • vert, bleu, monnaie, jaune, rouge
  • Wikipedia model: 90% accuracy in finding the error
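
Gensim ships a ready-made version of this task, doesnt_match(), which returns the word furthest from the mean vector of the list. A minimal sketch (the model file name is hypothetical, and in practice the words would need the model’s lemma/POS tagging):

from gensim.models import KeyedVectors

wv = KeyedVectors.load("wikipedia-fr.kv")
print(wv.doesnt_match(["vert", "bleu", "monnaie", "jaune", "rouge"]))
# a good model should print: monnaie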

Axes of meaning

Axes of meaning: concrete vs. abstract

Axis query

Axis: [["bonheur", "joie"],          # positive
       ["malheur", "tristesse"]]     # negative

Query:   ange
Result:  0.0875

Query:   monstre
Result:  -0.1407

(authentic data)
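
One way to implement such an axis (a sketch, not the original implementation): take the difference between the mean vectors of the positive and the negative word group, then measure the cosine similarity of a query word with that axis. Model file name and word forms are assumptions, as above.

import numpy as np
from gensim.models import KeyedVectors

wv = KeyedVectors.load("wikipedia-fr.kv")

def axis_score(positive, negative, query):
    # the "axis of meaning" points from the negative pole to the positive pole
    axis = wv[positive].mean(axis=0) - wv[negative].mean(axis=0)
    vec = wv[query]
    return float(np.dot(axis, vec) / (np.linalg.norm(axis) * np.linalg.norm(vec)))

print(axis_score(["bonheur", "joie"], ["malheur", "tristesse"], "ange"))
print(axis_score(["bonheur", "joie"], ["malheur", "tristesse"], "monstre"))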

Time for questions!

References

  • Goldberg, Yoav, and Omer Levy. “word2vec Explained: deriving Mikolov et al.’s negative-sampling word-embedding method”. arXiv.org, 2014. http://arxiv.org/abs/1402.3722.
  • Heuser, Ryan. “Word Vectors in the Eighteenth Century”. In: Digital Humanities 2017: Conference Abstracts, 256–60. Montréal: McGill University & Université de Montréal, 2017.
  • Mikolov, Tomas, Kai Chen, Greg Corrado, and Jeffrey Dean. “Efficient Estimation of Word Representations in Vector Space”. arXiv.org, 2013. http://arxiv.org/abs/1301.3781.
  • Pennington, Jeffrey, Richard Socher, and Christopher D. Manning. “GloVe: Global Vectors for Word Representation”. In: Proceedings of EMNLP 2014.
  • Turney, Peter D., and Patrick Pantel. “From Frequency to Meaning: Vector Space Models of Semantics”. Journal of Artificial Intelligence Research 37 (2010): 141–88. https://arxiv.org/abs/1003.1141.
  • Widdows, Dominic. Geometry and Meaning. CSLI Lecture Notes, no. 172. Stanford, CA: CSLI Publications, 2004.

Bonus slides

CBOW Model

Projection

Relation of words in semantic dimensions

Comparing models (novels vs. Wikipedia)

What is Topic Modeling?

(a) Some fundamentals

Topic Modeling: basic idea

  • Works on the basis of (large) collections of documents
  • Each document is understood as a mixture of topics
  • The purpose is to discover thematic trends and patterns
  • Discovered through generative probabilistic modeling

Usage scenarios

  • Information Retrieval: Search not for individual terms, but themes / semantic fields
  • Recommender Systems: Recommend similar journal articles etc. to users
  • Exploration of text collections: what is an email or newspaper corpus about?
  • Research questions from literary studies, cultural studies, history of ideas: topics across authors, genres, time periods

Exploratory Visualization

Existing Studies

  • Cameron Blevins: “Topic Modeling Martha Ballard’s Diary” (2010): diary
  • Ted Underwood and Andrew Goldstone (2012): “What can topic models of PMLA teach us…”: history of a discipline
  • Lisa Rhody, “Topic Modeling and Figurative Language” (2012): ekphrasis in poetry
  • Matthew Jockers, Macroanalysis (2013): novel, nationality, gender
  • Ben Schmidt: “Typical TV episodes” (2014): TV shows; temporal development
  • Christof Schöch, “Topic Modeling Genre” (2017): drama, subgenres

(b) A topic model for French crime fiction

Text collection: 840 French Novels

Crime fiction (prototypical)

  • Long, narrative, fictional prose (=novel)
  • Character inventory: investigators, criminals, suspects, witnesses, victims
  • Plot: violent crime, rational elucidation
  • Setting: urban space
  • => Hypotheses regarding possible topics

Topic and subgenre

Topic 10: detective, inspector, police. Distinctive of crime fiction (content & statistics, p < α = 0.01)

Topic and subgenre

Topic 49: death, crime, to kill. Distinctive of crime fiction (content & statistics, p < α = 0.01)

Topic and subgenre

Topic 47: door, room, to open

Topic and subgenre

Topic 26: beach, sand, sun. Distinctive of non-crime fiction (p < α = 0.001)

Topics over text segments

Topic 2: judge, prison, lawyer/attorney. Statistically significant (crime fiction) for segment pairs (1,4), (4,5), etc.

Topics over text segments

Topic 33: black, hair, eyes, wear, eye, face. Statistically significant: crime fiction, all segment pairs except (2,3); non-crime fiction, (1,3) and (2,5)

Overall results

  • A large proportion of the topics is statistically distinctive: crime fiction (31/80), non-crime fiction (21/80)
  • Topics are not just themes, but also narrative motifs, descriptive elements, character sets
  • Textual progression: only a few topics have significant trends
  • Overall: we can detect thematic trends in 840 novels without reading (all of) them!

Time for questions

Bonus slides: visualizations

Topics and subgenres: topic 3

Topics and authors: topic 3

Topics / subgenres heatmap

topic clustering

(top 50 topics, cosine/weighted)

topic clustering (detail)

(top 50 topics, cosine/weighted)

Topics over decades

Topics and authors: clustering

topic-work bimodal network

topic-work bimodal network (detail)

topics over text progression

topics over text progression

topics by genre and text progression

PCA based on topic scores (subgenres)

Topic Modeling: Theory

(a) What does a topic model look like?

On a practical level

  • A topic is a group of words with some (semantic) relation (e.g., a common theme, motif, etc.)
  • Each topic is made up of words of varying importance and relevance to the topic
  • Each document is made up of several topics in various proportions

On a technical level

  • A topic model is an abstract representation of all topics and documents in a collection
  • A topic is a probability distribution over words
  • A document is a probability distribution over topics
  • The Dirichlet distribution (in LDA) describes the topic mixture distribution of the model

Words in topic distribution

(Each word has a score in each topic; here ordered by topic/rank)

Topics in document distribution

(Each topic has a score in each document; ordered by document)
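
A tiny, self-contained Gensim sketch of both distributions, using four made-up documents (the numbers it prints will differ from the figures above):

from gensim.corpora import Dictionary
from gensim.models import LdaModel

docs = [
    ["inspector", "murder", "police", "suspect"],
    ["beach", "sun", "sand", "holiday"],
    ["police", "crime", "murder", "inspector"],
    ["sun", "beach", "sea", "holiday"],
]
dictionary = Dictionary(docs)
bows = [dictionary.doc2bow(doc) for doc in docs]
lda = LdaModel(bows, id2word=dictionary, num_topics=2, passes=50, random_state=1)

# a topic is a probability distribution over words
for word, prob in lda.show_topic(0, topn=5):
    print(f"{word:<12}{prob:.3f}")

# a document is a probability distribution over topics
print(lda.get_document_topics(bows[0], minimum_probability=0.0))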

(b) How is a Topic Model created?

Some relevant ideas

  • The most widespread implementation uses ‘Latent Dirichlet Allocation’ (LDA)
  • Follows the “bag-of-words” model: word order is ignored
  • No semantic knowledge / dictionary / WordNet etc. is used; language-independent
  • Based on distributional semantics: “You shall know a word by the company it keeps” (John Firth, 1957)
  • Discovers words which frequently occur together or in similar contexts (=topics)
  • Infers how important each word is in each topic
  • Infers how important each topic is in each document

Generative, inverted, iterative

“A topic model is a generative model for documents: it specifies a simple probabilistic procedure by which documents can be generated. To make a new document, one chooses a distribution over topics. Then, for each word in that document, one chooses a topic at random according to this distribution, and draws a word from that topic. Standard statistical techniques can be used to invert this process, inferring the set of topics that were responsible for generating a collection of documents.” (Steyvers and Griffiths 2006)

Inference problem: observed data

Inferred, latent model

Bayesian Statistics

“The computational problem of inferring the hidden topic structure from the documents is the problem of computing the posterior distribution, the conditional distribution of the hidden variables, given the documents.”

(David Blei, “Probabilistic Topic Models”, 2012)

Inference task

p(Z, φ, θ | w, α, β)

  • Compute the probability p of the latent variables…
    • Z = assignments of each word in each document to a topic
    • φ (phi) = distribution over words (for each topic)
    • θ (theta) = distribution over topics (for each document)
  • …given our observed variables (input data) and parameters
    • w = the data, i.e. the words in each document
    • α = parameter of the Dirichlet prior for topics per document
    • β = parameter of the Dirichlet prior for words per topic (see the gensim mapping below)
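
A minimal sketch of how these symbols surface in Gensim’s LdaModel (the two toy documents are made up; in the workshop pipeline, corpus and dictionary come from the preprocessing steps):

from gensim.corpora import Dictionary
from gensim.models import LdaModel

texts = [["crime", "police", "inspector"], ["beach", "sun", "sand"]]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]   # w: the observed words

lda = LdaModel(
    corpus=corpus,
    id2word=dictionary,
    num_topics=2,      # the number of topics (a free choice)
    alpha="auto",      # α: Dirichlet prior on topics per document
    eta="auto",        # β: Dirichlet prior on words per topic (Gensim calls it eta)
    passes=10,
)
print(lda.get_topics().shape)                # φ: one word distribution per topic
print(lda.get_document_topics(corpus[0]))    # θ: topic distribution of document 0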

Dirichlet distributions

They describe the topic mixture distributions of the model. Shown here: several possible distributions over three topics.

The starting point of LDA

  • We have the documents with their words (e.g. as a word/document frequency matrix)
  • We are looking for the word distributions per topic, the topic distributions per document, and the topic assignment of each word
  • Both distributions depend on each other (if a topic’s word distribution changes, the documents’ topic distributions change as well)
  • And both distributions need to fit with the original documents

The generative model behind LDA

  • For each topic, there is a distribution over words
  • For each document, there is a distribution over topics
  • For each word in each document:
    • We sample a topic from the topic distribution of that document
    • We sample a word from the word distribution of that topic
  • This would only work if we already had the distributions, which we don’t (a toy sketch of the generative process follows below)
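
A toy sketch of this generative story with NumPy, using made-up sizes (5 topics, a 1,000-word vocabulary, one 80-word document):

import numpy as np

rng = np.random.default_rng(0)
n_topics, vocab_size, doc_length = 5, 1000, 80
alpha, beta = 0.1, 0.01

phi = rng.dirichlet([beta] * vocab_size, size=n_topics)   # one word distribution per topic
theta = rng.dirichlet([alpha] * n_topics)                 # topic distribution of one document

words = []
for _ in range(doc_length):
    z = rng.choice(n_topics, p=theta)       # sample a topic for this word slot
    w = rng.choice(vocab_size, p=phi[z])    # sample a word (id) from that topic
    words.append(w)
print(words[:10])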

Random initialization

  • For each document, we generate a random distribution over topics
  • For each topic, we generate a random distribution over words
  • For each word in each document:
    • Sample a topic from the topic distribution
    • Sample a word from the word distribution of that topic
  • Now we have a model; but we know it’s most likely wrong (=low confidence)

Inference: iterative approximation

  • Using the observed data and our (random, mostly wrong) initial model, we can iteratively improve the model
  • One among several methods: Gibbs sampling (see the sketch after this list)
    • For one word in one document, remove its current topic assignment
    • Based on the topic assignments of the other words in that document, and on how often this word is assigned to each topic across the corpus, assign a new topic to the word
    • Do this in a way that optimizes the model in line with the Dirichlet priors (mixture of topics per document, mixture of words per topic)
    • Update the overall model according to this new assignment
  • Repeat until your time runs out or your evaluation task says it’s ok to stop
  • See also: Luis Serrano, “Gibbs Sampling”, https://www.youtube.com/watch?v=BaM1uiCpj_E, 2020
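
A toy sketch of a single (collapsed) Gibbs sampling update for one word token, assuming an implementation keeps the usual count matrices up to date; this is illustrative, not the workshop’s or MALLET’s actual code.

import numpy as np

def resample_topic(d, w, z_old, n_dk, n_kw, n_k, alpha, beta, rng):
    """One Gibbs update for word type w at one position in document d.
    n_dk[d, k]: tokens in document d assigned to topic k
    n_kw[k, w]: tokens of word type w assigned to topic k (corpus-wide)
    n_k[k]:     total tokens assigned to topic k"""
    vocab_size = n_kw.shape[1]
    # 1. remove the existing topic assignment of this token
    n_dk[d, z_old] -= 1
    n_kw[z_old, w] -= 1
    n_k[z_old] -= 1
    # 2. full conditional: p(z = k | rest) ∝ (n_dk + α) · (n_kw + β) / (n_k + V·β)
    p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + vocab_size * beta)
    z_new = rng.choice(len(p), p=p / p.sum())
    # 3. add the new assignment back into the counts
    n_dk[d, z_new] += 1
    n_kw[z_new, w] += 1
    n_k[z_new] += 1
    return z_new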

Time for questions

Further Reading: Theory

Introductory articles

  • Blei, David M. (2012). “Probabilistic topic models”. In: Communications of the ACM, 55(4): 77–84. http://www.cs.princeton.edu/~blei/papers/Blei2012.pdf
  • Steyvers, M. and Griffiths, T. (2006). “Probabilistic Topic Models”. In: Landauer, T. et al. (eds), Latent Semantic Analysis: A Road to Meaning. Laurence Erlbaum.

Video lectures

  • Jordan Boyd-Graber, “Topic Models”, YouTube.com, 2015. https://www.youtube.com/watch?v=yK7nN3FcgUs
  • David Blei, “Topic Models”, Videolectures.net, 2012. http://videolectures.net/mlss09uk_blei_tm/

Bonus slides

Latent Dirichlet Allocation: plate notation

Latent Dirichlet Allocation: plate notation

  • N = number of words in document d
  • M = number of documents
  • α (alpha): Dirichlet prior (hyperparameter: sparse / smooth distribution of topics)
  • β (beta): Dirichlet prior (hyperparameter: sparse / smooth distribution of words)
  • θ (theta): distribution over topics (for each document; latent variable)
  • ϕ (phi): distribution over words (for each topic; latent variable)
  • z = assignments of words to topics (latent variable)
  • w = words in a document (observed variable)

A Topic Modeling pipeline

A Topic Modeling pipeline

Some parameters

  • Preprocessing: text segmentation, lemmatization, feature selection
  • Modeling: number of topics, number of iterations, etc.
  • Evaluation: model quality measure
  • Postprocessing: level of metadata / text linkage
  • Visualization: many options

The pipeline in Python

  • Preprocessing: NLTK / TextBlob
  • Corpus ingest, modeling, evaluation: Gensim
  • Postprocessing: pandas
  • Visualization: pyLDAvis, seaborn, wordcloud, etc. (a minimal end-to-end pipeline sketch follows below)
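
A minimal end-to-end sketch with NLTK and Gensim, not the workshop’s actual scripts: it assumes a folder “corpus/” with one plain-text file per document and the NLTK “punkt” and “stopwords” data installed.

import glob
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# preprocessing: tokenize, lowercase, remove stopwords and non-words
stops = set(stopwords.words("english"))
texts = []
for path in sorted(glob.glob("corpus/*.txt")):
    with open(path, encoding="utf-8") as infile:
        tokens = [t.lower() for t in word_tokenize(infile.read())]
    texts.append([t for t in tokens if t.isalpha() and t not in stops])

# corpus ingest: dictionary, feature selection, bag-of-words representation
dictionary = Dictionary(texts)
dictionary.filter_extremes(no_below=5, no_above=0.5)
corpus = [dictionary.doc2bow(text) for text in texts]

# modeling and a first look at the results
lda = LdaModel(corpus, id2word=dictionary, num_topics=20, passes=10)
for topic_id, words in lda.print_topics(num_topics=5, num_words=8):
    print(topic_id, words)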

Implementations of Topic Modeling

  • MALLET: Java-based, LDA, command line, fast, very good results, no visuals
  • Gensim: Python-based, LDA, scripts, slower, less convincing results, nice visuals
  • BERTopic: Python-based, BERT + clustering, fast, good results, some visuals

Time for questions

Tutorials

First steps doing Topic Modeling

Some starting points

Getting ready

  • Launch the Python IDE (recommended: VS Codium; alternatively Geany, Spyder, PyCharm)
  • Please download or clone the “tm-simple” repository linked above
  • Let’s all run the test script again.
  • Has everyone got the “OK”s when running the test script?

The workshop data

  • datasets/
  • results/
  • scripts/

The script architecture

  • each step in the pipeline (input-output) is one module
  • each module consists of several functions
  • a “main” function coordinates these functions
  • the “run_pipeline.py” script coordinates the modules
  • NB: each module reads and writes data

A closer look at “run_pipeline1.py”

  • Imports
  • Files and Folders
  • Parameters
  • Functions
  • Coordinating function

Step by step: preprocessing

  • Open “preprocessing.py” with Codium
  • Note the parameters
  • Note the file structure
  • Note the flow of the data (input/output)
  • Run it from “run_pipeline1.py”

Running the pipeline one by one

  • preprocessing
  • build_corpus
  • modeling
  • postprocessing
  • make_overview

Practice

Exercise 1: run “run_pipeline1.py”

  • Use the small “hkpress-test” dataset
  • Decide on your own parameters
  • Run the entire pipeline (step by step or in one go)
  • What error messages do you get, if any?
  • What kind of results do you get?

Exercise 2: Adapt the commands

  • Continue using the “hkpress-test” corpus
  • Decide on a new “identifier” for your model
  • Do one of the following (your choice)
    • Modify the stopword list (in: preprocessing.py)
    • Use a different number of topics (in: run_pipeline1.py)
  • Inspect the results and write down any changes you notice

More issues in Topic Modeling

Activity 1: More visualizations

  • Word clouds (“wordles”): module “make_wordle” (in “run_pipeline2.py”); see the sketch below
  • Topic probability distribution heatmaps (“make_heatmaps”); depends on “metadata.csv”
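
A minimal word-cloud sketch with the wordcloud and matplotlib packages (not the “make_wordle” module itself), continuing from the hypothetical pipeline sketch shown earlier, where lda is the trained Gensim model:

import matplotlib.pyplot as plt
from wordcloud import WordCloud

topic_id = 0
freqs = dict(lda.show_topic(topic_id, topn=40))   # word -> probability in this topic
wc = WordCloud(width=800, height=400, background_color="white")
wc.generate_from_frequencies(freqs)

plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.title(f"Topic {topic_id}")
plt.show()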

Activity 2: Model evaluation

  • “evaluation.py” (in “run_pipeline2.py”)
    • overall model coherence (best: c_v)
    • individual topic coherence (c_v)
  • various measures of model quality
  • many types of evaluation (beyond the code here); a minimal coherence sketch follows below
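
A minimal c_v coherence sketch with Gensim’s CoherenceModel (not the “evaluation.py” module itself), continuing from the hypothetical pipeline sketch shown earlier, where lda, texts and dictionary already exist:

from gensim.models import CoherenceModel

cm = CoherenceModel(model=lda, texts=texts, dictionary=dictionary, coherence="c_v")
print("overall model coherence (c_v):", cm.get_coherence())
print("per-topic coherence (c_v):", cm.get_coherence_per_topic())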

Topic Coherence Measures

Activity 3 / Discussion: bring your own corpus

  • If there is time left…
  • Does anyone have a collection of many short English-language texts?
  • How would you go about running a model for it?
  • What if your text collection is neither English nor French?

Wrapping up

Summary of what we have covered

  • A bit of background on distributional semantics
  • An idea of what a topic model consists of
  • An intuition of how topic models are inferred
  • Some avenues for interpreting and visualizing topic models
  • The overall workflow required for topic modeling
  • How to use Python for topic modeling

A few things not covered here

  • Implementation details of Gibbs Sampling
  • Precursors of LDA: LSA, pLSA, NNMF, etc.
  • Variants of LDA: hierarchical, labeled, dynamic, etc.
  • Evaluation strategies: human evaluation, external, internal

Your questions and projects

  • What kind of projects / text collections do you have?
  • What kind of research questions do you have?
  • What do you think topic modeling could tell you?




Thank you! | დიდი მადლობა [didi madɫoba]