Large Language Models and Digital Humanities (LALAMDAH)

Christof Schöch

2023-10-05

Session 1: Introduction to the topic

Thematic overview: key questions

  1. How do LLMs work?
  2. How have LLMs developed?
  3. What can we do with LLMs in DH?
  4. How can we use LLMs ourselves?

Very brief summary of Wolfram 2023

  • What are the essential building blocks of LLMs?
    • Data: Large amounts of text (billions of words from the web and from books)
    • Computing power: Sufficient amounts of computing power (GPU clusters + heaps of energy)
    • Neural networks: principle of input, hidden, output layers of neurons with weights
    • Word Embeddings: principle of learned, low-dimensional numerical representation of words
    • Transformers / attention: for efficient training over longer dependencies

Questions on Wolfram 2023

ChatGPT’s characteristics, limitations, improvements (1)

  • Question: When ChatGPT does something like writing an essay and adds a word at a time, where does ChatGPT get those words from? From the web?
  • Answer
    • The training corpus determines the vocabulary
    • Words != tokens; tokens are some basic words plus many subwords (like prefixes and suffixes)
    • Subword tokens are why ChatGPT can create new words
    • The training corpus is based on a lot of text from the web and from scanned books.
    • The key process in ChatGPT is selecting the most fitting next word at each step.
    • It doesn’t just copy fitting passages from stored text.
    • Rather, it has a representation of the tokens that helps assess which word fits best next.

ChatGPT’s characteristics, limitations, improvements (2)

  • Question: Is the web-crawled training data for ChatGPT specially selected, or are there certain criteria to be met in order to be used as training data?
  • Answer
    • It is not publicly documented what exactly went into the training corpus
    • Some of the sources are: websites (news sites, blogs, forums); books (fiction, non-fiction); Common Crawl (dataset of scraped text); Wikipedia, Twitter, Reddit.
    • GPT-3 has been trained on 570 gigabytes of text; that’s about 300 billion words (equivalent to ~3 million novels)

ChatGPT’s characteristics, limitations, improvements (3)

  • Question: ChatGPT can do other things besides simulating human language. It can, for example, access databases and write code. How is this done compared to generating language?
  • Answer
    • In its default version, it cannot access databases.
    • It generates code just like it generates prose, by having learned what typical sequences of programming code look like.

ChatGPT’s characteristics, limitations, improvements (4)

  • Question: What are the primary limitations of ChatGPT?
  • Answer
    • Limited to relatively short-range dependencies within a text.
    • Lack of access to information available online.
    • Lack of access to structured information, e.g. knowledge graphs, whether built-in or queryable online.
    • Lack of capability to perform logical reasoning.
    • Lack of capability to perform mathematical calculations.
    • Training data ends in 2021.

ChatGPT’s characteristics, limitations, improvements (5)

  • Question: GPT seems to be able to produce human-like text without any difficulties, although it can sometimes be quite inaccurate about certain topics. Do you think that in the near future, a better version of GPT can pass the Turing Test with even more improved abilities?
  • Answer:
    • Improvements are likely to be quite drastic in the next few years
    • They could come from better access to unstructured information available for live lookup on the internet
    • Or they could come from integration with knowledge bases containing factual information, such as Wikidata

ChatGPT’s characteristics, limitations, improvements (6)

  • Question: ChatGPT fails at logic tasks (and mathematical operations). Would it be possible to combine Large Language Models with conventional computing methods to remedy this?
  • Answer
    • I think the fundamental idea of Stephen Wolfram is to do exactly that: combine LLMs with a logical computing module such as Wolfram|Alpha
    • There used to be a plugin for ChatGPT that appears to have done that; it is currently unavailable afaik
    • Alternatively, ChatGPT could learn to turn unstructured prose into structured data (such as LOD) and then reason on this, and turn its response into prose for output.

Practical aspects (1)

  • Question: Are there currently any large language models that can be run on personal computers? Can you recommend any?
  • Answer
    • There are many! We will get to know some of them during the course.
    • We will train models ourselves, starting with simple Word Embedding Models (using Gensim)
    • To run LLMs locally, one option is the transformers library, which uses models available on huggingface.co (see the sketch below)
    • Another source of LLMs is LLaMA, a set of models that can be freely downloaded and used
    • Using an API, it is possible to use models from OpenAI (such as ChatGPT!)
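
For orientation, a minimal sketch of the transformers option mentioned above: it downloads a small open model from the Hugging Face Hub (distilgpt2 is chosen here purely as an example) and generates text locally.

    from transformers import pipeline

    # Download a small open model and run text generation on the local machine
    generator = pipeline("text-generation", model="distilgpt2")
    result = generator("Digital humanities is", max_new_tokens=30)
    print(result[0]["generated_text"])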

Practical aspects (2)

  • Question: How does “temperature” work? I.e., does a temperature of 0.8 equal a chance of 80% that lower-ranked words are used? What determines which lower-ranked words are being used?
  • Answer
    • AFAIK, temperature is a hyperparameter set between 0 and 1 (when using the API, or in the Chat version)
    • A low value (0.1-0.3) means there is no deviation from the top most likely words: predictable patterns, literal meaning, consistent output
    • A high value (0.6-0.9) means there is a lot of deviation from the top most likely words: creative combinations, metaphorical meaning, varying output
    • You can try this yourself in ChatGPT; just instruct it to use a certain temperature setting in your prompt.
    • For reliable, consistent, working code, a lower temperature is probably best
    • For inventive prose or creative ideas, a higher temperature will probably be best
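
A small illustrative sketch (not OpenAI’s actual implementation) of what temperature does: the logits are divided by the temperature before the softmax, so a low temperature sharpens the distribution towards the top-ranked token, while a high temperature flattens it and gives lower-ranked tokens a real chance. The logit values below are invented.

    import numpy as np

    def sample_next_token(logits, temperature):
        # Rescale the logits by the temperature, then turn them into probabilities
        scaled = np.array(logits) / temperature
        probs = np.exp(scaled - scaled.max())
        probs = probs / probs.sum()
        rng = np.random.default_rng()
        return rng.choice(len(logits), p=probs)

    logits = [4.0, 3.0, 1.0, 0.5]            # four candidate tokens (toy values)
    print(sample_next_token(logits, 0.2))    # almost always picks token 0
    print(sample_next_token(logits, 0.9))    # noticeably more variety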

Neural networks / deep learning (1)

  • Question: How should we understand and interpret the (3D) images that depict how neural nets work?
  • Answer
    • Let’s have a look at it together
    • There are two dimensions to the plot: x and y (any point has a value for x and y)
    • There are three outputs: -1, 0, +1 (one for each of three regions in the plot)
    • We are looking for a function that gives the right output for each input

Word Embeddings (1)

  • Question: The article mentioned that embeddings are created in a way that words with similar meanings are close to each other in the vector space. Do embeddings take word ambiguity into consideration and if so, how is it done?
  • Answer
    • Simple, static word embeddings do not do this; each type gets one vector, which is a mix of all of its uses.
    • More complex, contextual word embeddings do this: each token gets its own vector, depending on its type and its context;
    • In such a word embedding, there might be multiple clusters of tokens with similar meaning, each corresponding to one sense of the type.

Word Embeddings (2)

  • Question: What is a good number of dimensions for word embeddings?

Questions for next time

  • Neural network: technical implementation
    • Activation functions of neurons: e.g. ReLU
    • Backpropagation, loss function, stochastic gradient descent, local/global minima
    • The “attention” mechanism in BERT or GPT
    • etc.

Session 2 (Nov. 6, 2023):
Neural networks

Just as a teaser

Overview of the session

  • Summarize and discuss Chollet, chapter 1
  • Research task: comparative terminology
  • Input and discussion on gradient descent
  • Try out sample code from Chollet, chapter 2

Your key insights

  • Development: designing rules and features > learning rules > also learning features
  • Difference between “shallow” neural networks and “deep” neural networks
  • Predecessors of DL, e.g. neural networks, decision trees, kernel methods (for me: use of gradient-based optimization in gradient boosting, like later in DL)
  • DL became popular because of increased data availability (WWW) and technological developments (GPUs), not (primarily) because of new ideas
  • Are artificial neural networks modeled on how the brain works or not? Wolfram says yes, Chollet says no.

Chollet: AI > ML > DL

Alternative: CS > ML > DL > AI

Some definitions

The study is to proceed on the basis of the conjecture that every aspect of learning or any other feature of intelligence can in principle be so precisely described that a machine can be made to simulate it. (John McCarthy, 1956)

AI can be described as the effort to automate intellectual tasks normally performed by humans. (Chollet)

The field of artificial intelligence, or AI, is concerned with not just understanding but also building intelligent entities—machines that can compute how to act effectively and safely in a wide variety of novel situations. (Russell and Norvig, 1995/2021)

The quest for “artificial flight” succeeded when engineers and inventors stopped imitating birds and started using wind tunnels and learning about aerodynamics. (Russell and Norvig, 1995/2021)

Chollet: Programming vs. Machine Learning

Feature engineering for traditional ML (example)

  • Basic idea: feature engineering in traditional ML <=> representation learning in DL
  • Example from DH:
    • Topic: Direct speech recognition in novels
    • Data: 40 French 19th-century novels
    • Question: What is the relationship between direct speech and subgenre?
    • Method: Traditional Machine Learning with feature engineering
    • Separate presentation: 10.5281/zenodo.10072385
  • Same issue using DL / Transformer architecture: Byszuk et al., “Detecting Direct Speech in Multilingual Collection of 19th-century Novels”, 2020: aclanthology.org/2020.lt4hala-1.15/

Chollet: Various types of ML algorithms (classifiers)

  • Several frequently-used classifiers
    • Naive Bayes
    • Logistic Regression
    • Decision Trees
    • Support Vector Machines
    • Ensemble methods, e.g.: Gradient Boosting
  • Key questions
    • How to choose the right classifier for a given problem? Experience with respect to the dataset, its size, the features, and the goal of the project.

Classifiers: Naive Bayes

  • Founded on Bayesian probability theory (prior, conditional, posterior probabilities)
  • Naive, because it assumes each feature is independent from the others (unlikely)
  • Training the classifier based on a dataset:
    • First, calculates the base probabilities of the target classes (prior probabilities)
    • Then, calculates the probabilities for each class given each feature (proportion or Gaussian distribution: mean, std)
  • Using the classifier:
    • For any data item and its feature values, the class probabilities can now be calculated
    • These and the base probability are combined for each class
    • The class with the highest value is the prediction
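
A minimal scikit-learn sketch of the steps just described, using Gaussian Naive Bayes; the dataset is a standard toy dataset chosen only for illustration.

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import GaussianNB

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Training estimates the prior probabilities and, per class, the mean and
    # standard deviation of each feature (Gaussian assumption)
    clf = GaussianNB()
    clf.fit(X_train, y_train)
    print("prior probabilities:", clf.class_prior_)
    print("test accuracy:", clf.score(X_test, y_test))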

Decision Trees

  • Particularly for classification problems
  • Splits the dataset into hierarchically-organized branches depending on feature values
  • Proceeds step by step:
    • Find the feature with the largest influence on the result
    • For this feature, find the best decision boundary (loss function!)
    • Then, for each branch, find the next most important feature and its best decision boundary
  • Avoid risk of overfitting
    • By limiting the depth of the tree
    • Or by setting a minimum number of samples per resulting leaf
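
A minimal scikit-learn sketch illustrating the two overfitting controls mentioned above (tree depth and leaf size); the dataset is chosen only for illustration.

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier, export_text

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # max_depth limits the depth of the tree, min_samples_leaf the size of the leaves
    clf = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=0)
    clf.fit(X_train, y_train)
    print(export_text(clf))            # the learned features and decision boundaries
    print("test accuracy:", clf.score(X_test, y_test))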

SVM

  • The classifier finds a decision boundary between the classes
  • The decision boundary is a hyperplane separating the classes
  • The “support vectors” are the points that define the boundary
  • The aim is to maximize the margin (width of the decision boundary)
  • There are linear and non-linear versions of SVM
    • Linear: simple, with interpretable feature weights
    • Non-linear: using the “kernel trick”: more powerful, less interpretable

  • Here, the additional dimension is calculated as x^2
  • Source: Kaggle
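
A minimal scikit-learn sketch contrasting a linear SVM with a kernelized one on data that is not linearly separable; dataset and parameters are purely illustrative.

    from sklearn.datasets import make_circles
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    # Two concentric circles: not separable by a straight line in 2D
    X, y = make_circles(n_samples=300, factor=0.4, noise=0.1, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    linear_svm = SVC(kernel="linear").fit(X_train, y_train)
    rbf_svm = SVC(kernel="rbf").fit(X_train, y_train)   # kernel trick: implicit extra dimensions

    print("linear SVM accuracy:", linear_svm.score(X_test, y_test))
    print("RBF-kernel SVM accuracy:", rbf_svm.score(X_test, y_test))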

Ensemble method: Gradient Boosting

  • A type of ensemble method (combine multiple classifiers)
  • For regression or classification
  • Chain of multiple “weak learners” that are strong when combined
  • Requires a loss function (differentiable => gradient)
  • The gradient tells us how the next function could be improved (boosted)
  • At each step, the next learner is trained on the mistakes of the previous learners
  • Each model is the previous model + an improvement reducing the loss
  • Very flexible: learners and loss functions can be selected as needed
  • Standard: gradient boosted decision trees (usually work well)
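
A minimal scikit-learn sketch of gradient-boosted decision trees; the hyperparameters shown are common defaults, not recommendations.

    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import train_test_split

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # A chain of shallow trees ("weak learners"); each new tree is fitted to the
    # gradient of the loss of the current ensemble
    clf = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3)
    clf.fit(X_train, y_train)
    print("test accuracy:", clf.score(X_test, y_test))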

Use case: classifier selection

  • Source: Du et al., “Evaluation of Measures of Distinctiveness”, JCLS, 2022. DOI: 10.48694/jcls.102.

Chollet: Representation learning by coordinate change

  • Before the transformation, slope and offset are needed to separate the data
  • After the transformation, a simple rule (value of x) can separate the data (y doesn’t matter)
  • The raw features are used but transformed automatically (no feature engineering)

Chollet: Neural Network architecture

Conceptual Break

  • Live research task: comparative definitions of two key terms
  • Summarize the key similarities, differences, and the relationship between two of the following terms (different participants should choose different combinations): machine learning, neural networks, artificial intelligence, representation learning, word embeddings, (large) language models, deep learning, knowledge graphs.
  • We’ll discuss your findings.

Question: Do artificial NN work like the brain?

  • aNN have been inspired by the brain
  • There are some similarities, but also many differences
  • Even when biological neurons are modeled, and the output matches the brain’s neurons, the mechanisms are usually quite different
  • Useful summary: Adan, “Do-neural-networks-really-work-like-neurons”, 2018 (see Readings)
  • Basic unit is the neuron
  • Neuron: inputs, computation, output
  • Many inputs: sensory input or from other neurons
  • Non-linear activation function
  • Many outputs: to many other neurons
  • More complex weighting mechanisms
  • Much more complex architecture
  • No backpropagation through gradient descent with differentials
  • Learns much faster from few examples (~ pre-training?)
  • Much more energy-efficient!

Practical Task

  • The basis for this task is Chollet, Chapter 2
  • Use the code you find there to build a simple classifier for the MNIST dataset of numbers
  • What is the performance on the training and the test set that you achieve?
  • For the curious:
    • How would you reduce the amount of training data, and what happens if you do?
    • What if you modify the size of the input layer, making it smaller (or larger) than 512?
    • How would you go about reducing the size of the images? What happens if you do that?
    • Would you say the classifier is robust or fragile with respect to such interference?
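
For orientation, a condensed sketch of the kind of Keras code used in Chollet, chapter 2 (the book’s notebook remains the authoritative version; layer size and training settings follow the book).

    from tensorflow import keras
    from tensorflow.keras import layers

    (train_images, train_labels), (test_images, test_labels) = keras.datasets.mnist.load_data()
    train_images = train_images.reshape((60000, 28 * 28)).astype("float32") / 255
    test_images = test_images.reshape((10000, 28 * 28)).astype("float32") / 255

    model = keras.Sequential([
        layers.Dense(512, activation="relu"),
        layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="rmsprop",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    model.fit(train_images, train_labels, epochs=5, batch_size=128)

    test_loss, test_acc = model.evaluate(test_images, test_labels)
    print("test accuracy:", test_acc)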

Session 3 (13 Nov., 2023):
Gradient Descent

Overview of the session

  1. Gradient descent: insights and questions
  2. Implementing a simple differential
  3. Overview of application studies collected so far

(1) Gradient Descent: insights and questions

Key insights from Chollet, chapter 2

  • Gradient descent is the optimization technique that powers modern neural networks.
  • I understood that the Gradient Descent is a technique to find the smallest possible values for the loss function step by step which helps to determine the right weights of a neural network.
  • We can think of gradient descent as a person who is at the top of a mountain. The person looks all around to see where s/he should take a step in order to get to the bottom in the fastest and easiest way. So s/he takes a huge or small step (here we refer to the alpha rate) and moves in a direction (here we refer to the derivative of the slope). This process is repeated again and again until s/he reaches the bottom (a local minimum), and just like this, gradient descent finally optimizes the model.
  • Gradient descent is one of the most common algorithms in the machine learning field. It is used to find the best parameters w and b of a model, and in this way it optimizes the model. In the formula of gradient descent, partial derivatives and the alpha rate are used. The alpha rate determines the size of the model’s step. The bigger the step size, the harder it will be to find the best parameters.
  • In neural networks, gradients are used to update the model’s weights to minimize losses. By calculating the gradient of the loss with respect to the weights, people can determine how to adjust the parameters to reduce the loss. When using gradient descent, the model parameters are updated in the opposite direction of the gradient at each iteration. The step size is an important factor that controls the magnitude of each update.
  • Gradient descent minimizes the loss function by calculating the derivative of the loss function with respect to the intercept.
  • Stochastic gradient descent computes the gradient by randomly sampling only one sample in each iteration. Multiple samples can be sampled randomly and uniformly in each iteration to form a mini batch, and then this mini batch can be used to compute the gradient. The learning rate of mini batch stochastic gradient descent can decay itself during the iteration.
  • The sections about backpropagation, including composition graphs and for- and backward passes, were quite interesting as I strangely have never heard of it in-depth before. What made it more interesting were the visualizations that came with the theory of backpropagation, which, in fact, made the computations “under the surface” clearer.

Question: Generality of Gradient Descent

  • Question: Assuming that gradient descent is one of the most common algorithms for optimizing a machine learning model, is it applicable to most of the machine learning models that we use, or are there other methods that are much better than gradient descent in certain situations?
  • Answer
    • It is widely applicable, but used mostly in NN architectures
    • There are alternative methods of reducing the loss of a model / classifier (e.g. XXXXX)
    • There are ML algorithms that don’t need such a loss calculation (e.g. Naive Bayes)

Question: Learning Rate

  • Question: How can we decide on the best learning rate at the beginning of training? Do we have to choose a random learning rate and then change it until we find the best one if it doesn’t work? / How do you find the “ideal” learning rate that doesn’t get stuck in a local minimum or diverge?
  • Answer
    • Learning rate vs. step size
    • Estimating a good learning rate
    • Adjusting the learning rate

Performance of Gradient Descent

  • Question: What factors affect the performance of gradient descent? What challenges are faced during optimisation? / Are there cases where Gradient Descent might be suboptimal and another optimization algorithm might work better? What if there are multiple local minima […], will gradient descent find the global minimum or will it get stuck at a local minimum?
  • Answer
    • One theoretical issue is indeed local minima. In practice, due to the many dimensions of typical datasets used in DL, this is not an issue.
    • Another possible issue is the step size / learning rate: it should be neither too small nor too large.

Gradient Descent in Practice

  • Question: In the video and text it is shown how GD is calculated on a mathematical level. How is gradient descent used in practice?
  • Answer
    • In practice, it is easy: It is applied automatically in the training process by the ML library of choice
    • The user just sets the relevant parameters: what loss function to use, what optimizer to use, what learning rate to use, how to adjust the learning rate.

Optimizer: Stochastic Gradient Descent

  • The starting points are random weights
  • The goal is to minimize the loss
  • This is done by adjusting all of the weights
  • The “optimizer” determines how the weights need to be updated
  • It works because all functions in NN are smooth and continuous (i.e., differentiable)
  • Multidimensional differentials are called gradients
  • The process is therefore called gradient descent
  • The whole process is backpropagation, mathematically, and learning, conceptually
  • Just one input dimension (x) and one output dimension (y)
  • Linear regression: w0 (slope) and w1 (intercept)
  • Start with random values for slope and intercept
  • Calculate the loss (e.g. the sum of squared errors)
  • Use the derivative to get the slope (or gradient) of the loss function
  • Parameter: ‘step size’ (fixed or flexible, e.g.: step size x ‘learning rate’)
  • Change w0 (slope) and w1 (intercept) according to the derivative;
  • Repeat until stopping condition: e.g. minimal step size, maximum number of steps
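
A minimal NumPy sketch of exactly this one-dimensional linear-regression example; the toy data, learning rate and stopping condition are invented for illustration.

    import numpy as np

    # Toy data: y = 2x + 1 plus some noise
    rng = np.random.default_rng(42)
    x = rng.uniform(0, 10, 100)
    y = 2 * x + 1 + rng.normal(0, 1, 100)

    w0, w1 = 0.0, 0.0              # slope and intercept, arbitrary starting values
    learning_rate = 0.01
    for step in range(1000):       # stopping condition: maximum number of steps
        error = (w0 * x + w1) - y
        loss = np.mean(error ** 2)            # mean squared error
        grad_w0 = 2 * np.mean(error * x)      # derivative of the loss wrt the slope
        grad_w1 = 2 * np.mean(error)          # derivative of the loss wrt the intercept
        w0 -= learning_rate * grad_w0
        w1 -= learning_rate * grad_w1

    print(f"slope={w0:.3f}, intercept={w1:.3f}, loss={loss:.3f}")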

  • Similar principle as gradient boosting
  • Source: Russell and Norvig 2021, chapter 19.6.

  • Loss function as landscape
  • Source: Andrew Ng, ML 1.2.5

(2) Practical task: Differentials

(3) Application Studies

  • Task: “Please perform a literature search using relevant services (such as Google Scholar and/or Semantic Scholar) to identify scholarly papers (journal articles or conference papers) that describe applications of Large Language Models to a research question from the Humanities. Identify at least 5 relevant papers.”
  • Results so far: See wiki page “Applications of LLMs in DH” in StudIP.
  • Questions
    • How did you proceed to find these papers?
    • What sources or databases did you consult?
    • What keywords or phrases did you use to search?
    • What definition of “Digital Humanities” did you assume?
    • Which of the papers do you think we should read?
  • Task: find a few more papers with a specific focus on DH

Session 4 (20. Nov. 2023):
Word Embedding Models

Overview

  1. Introduction to Word Embeddings: https://christofs.github.io/wem/trier.html#/
  2. Discussion of Mikolov et al., “Efficient Estimation of Word Representations in Vector Space”, 2013
  3. Practical part: using Gensim to train, inspect and use word embedding models

Mikolov-Paper: Key insights

  1. Different architectures perform differently on different types of problems, with e.g. Skip-gram performing well on semantic problems and CBOW being slightly stronger on syntactic tasks.
  2. There are diminishing returns in accuracy improvement beyond a certain point, emphasizing the need to balance vector dimensionality and training data.
  3. One epoch of skip-gram or bag of words training with a high vector dimensionality not only takes less computational time but also increases accuracy compared to training with multiple epochs or other models like LSA or LDA.
  4. A unique and interesting application of the models proposed in the paper is odd-word-out tasks.
  5. I hadn’t heard as much about the Skip-Gram model as I had about the bag of words model.
  6. Key goal was to reduce training complexity, in order to be able to train on larger datasets, and increase quality in this way.

Mikolov paper: model architectures

Vector arithmetic

  • Question: Could you please provide more details on the methodology used to perform vector arithmetic and determine word similarities in the semantic and syntactic tasks?
  • Answer:
    • The key approach here is to measure the similarity of the vectors, e.g. using the cosine between the two vectors: the larger the cosine similarity of two vectors, the more similar the two words represented by the vectors are likely to be.
    • The vectors are normalized to unit length; then the cosine of the angle is calculated. Identical vectors get a value of 1, orthogonal ones a value of 0.

Cosine similarity

  • Cosine similarity of two vectors = cosine of the angle between the vectors
  • Cosine = dot product of the vectors / product of their lengths
  • Dot product = sum of the products of the two values, in each dimension
  • Length = square root of the dot product of the vector by itself
  • Example: see the code sketch below
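
A minimal NumPy sketch of exactly this computation; the two 3-dimensional vectors are invented toy values.

    import numpy as np

    def cosine_similarity(a, b):
        # dot product of the vectors divided by the product of their lengths
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    king = np.array([0.8, 0.3, 0.1])    # toy 3-dimensional "word vectors"
    queen = np.array([0.7, 0.4, 0.2])
    print(cosine_similarity(king, queen))   # close to 1 = very similar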

Polysemy

  • Question: The text indicates the criteria for evaluating word embeddings, requiring the closest word to be considered a correct response. How should we determine which meaning of a polysemous word (i.e., a word with more than one meaning) is more appropriate to consider as the closest answer?
  • Answer
    • Static word embeddings (as in the Mikolov paper) do not consider the polysemy of a word other than representing an average of all its meanings.
    • Only contextual word embeddings can represent the different meanings of a word form in differing contexts.

Alternative models

  • Question: Quote from Mikolov et al.: “We decided to explore simpler models that might not be able to represent the data as precisely as neural networks, but can possibly be trained on much more data efficiently.” Does the author mean that his models cannot be regarded as a NN and if so why not? What are the models then?
  • Answer
    • The difference is the depth of the network;
    • The classic word2vec models are “shallow” neural networks with only one hidden layer

Data models for text beyond BOW

  • Question: Are there any established models, besides e.g. BoW, that researchers use now or try to outperform?
  • Answer
    • BoW is a text-level representation: text as words with their frequencies
    • A useful alternative model to BoW would be n-gram models
    • word2vec is a word-level representation: each word as a vector
    • So BoW and word2vec can be combined
    • That’s what happens in sent2vec or doc2vec models: some way to aggregate (“summarize”) the word vectors into a vector representing the sentence or document

Accuracy level of best models today

  • Question: What is the accuracy of the “best” model today and what did the implementer(s) do different in regards to word representation?
  • Answer:
    • GPT-4 (as used in ChatGPT) is certainly a contender for a strong model
    • How? More data, more parameters, more training time, more feedback used
    • Multimodal model (text and image)
    • Accuracy is task-dependent

Practical part

  1. Very quick introduction to Gensim
  2. Demo of the Gensim code to
    1. prepare a corpus
    2. train a model
    3. inspect a model
    4. query a model
  3. Try it out with another corpus and/or other settings
    1. Corpus: Gutenberg1, Doyle2, etc.
    2. Minimum word frequency
    3. Dimensionality
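
For orientation, a minimal Gensim sketch of the training and querying steps; the file name, corpus preparation and parameter values are placeholders, and the course notebook remains the reference.

    from gensim.models import Word2Vec
    from gensim.utils import simple_preprocess

    # One tokenized sentence (or line) per list entry
    with open("corpus.txt", encoding="utf-8") as infile:
        sentences = [simple_preprocess(line) for line in infile]

    model = Word2Vec(
        sentences=sentences,
        vector_size=100,   # dimensionality of the word vectors
        window=5,          # context window size
        min_count=5,       # minimum word frequency
        sg=1,              # 1 = skip-gram, 0 = CBOW
    )
    model.save("corpus.w2v")
    print(model.wv.most_similar("house", topn=10))   # nearest neighbours of a word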

Session 5 (27. Nov. 2023):
The “attention” mechanism

Overview

  • Key insights and questions about the “Attention” paper
  • An introduction to performing DH tasks with ChatGPT

Key insights

  • The key innovations are the positional encoding and the “self-attention mechanism”, which make it possible to model long-distance dependencies.
  • The clever integration of these multi-head attention mechanisms in the Transformer facilitates effective information interaction between the encoder and decoder, as well as within each layer of the model. This enables the model to capture intricate relationships in the input sequence and generate corresponding output sequences.
  • The innovation primarily aims at more efficient computation; here, enabling parallelization and hence faster training. Performance improves significantly as well. (CS)
  • There is relatively little theoretical justification, but they appear to have tried a lot of architectures; see contributor note: “Niki designed, implemented, tuned and evaluated countless model variants […].” (CS)
  • Historical hardware: They trained the models on “one machine with 8 NVIDIA P100 GPUs” for 12 hours (small model) to 3.5 days (large model). At launch in 2016, that machine would have cost about 50,000 USD. A P100 (~5,000 USD) was able to do 9.4 TFLOPs, compared to a current RTX 4090 (~2,000 EUR), which can do 82 TFLOPs (FLOPs = floating point operations per second; T = tera = 10^12). (CS)

Attention paper: questions

  • What is special about contextual word embeddings?
  • What is an encoder/decoder architecture?
  • How does the attention mechanism work?
  • Are there examples of successful applications of the Transformer model in NLP tasks?
  • What exactly happens in the second attention-head layer in the output block, compared to the normal attention layer that is placed before it? In other words, in a translation task, how do the words from the different languages get used together in this multi-head attention layer of the output?

Contextual Word Embeddings

  • Static word embeddings calculate one vector for each type in a vocabulary
  • Contextual word embeddings calculate one vector for each token, depending on its particular context
  • Static word embeddings encode a lexicon, whereas contextual word embeddings encode a corpus

Encoder / Decoder Architecture

  • Input is a sequence, output is a sequence: sequence-to-sequence modeling
  • Encoder: Takes words one by one, encoded as embeddings; encodes a hidden state at each step, outputs the hidden state
  • Decoder: Takes the hidden state, generates words one by one, based on the hidden state and the previously-generated words

Traditional Encoder / Decoder (Google)

Encoder / Decoder with Attention (Google)

Traditional Encoder / Decoder: details (Google)

Encoder / Decoder with Attention: details (Google)

Transformer architecture (from Vaswani et al. 2017)

Attention mechanism (from Vaswani et al. 2017)

Attention

  • The challenge is how to condition a word on a large context
  • If all earlier word vectors are retained, they need to be combined somehow
  • This could be done by addition, averaging, concatenation, or even TF-IDF (weighting)
  • What attention does is create a dynamic weighting scheme: for every word, the weights shift
  • For every word, there is an attention vector over the context
  • For one particular output word (the query, Q), attention weights are computed against the inputs in the sequence (the keys, K); these weights determine how much of each input’s content (the values, V) flows into the output
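
A minimal NumPy sketch of the scaled dot-product attention from Vaswani et al. 2017, with invented toy vectors; the learned Q/K/V projections, masking and multi-head mechanism are omitted.

    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    def scaled_dot_product_attention(Q, K, V):
        d_k = K.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)     # how well each query matches each key
        weights = softmax(scores)           # dynamic weighting over the context
        return weights @ V, weights         # weighted sum of the values

    rng = np.random.default_rng(0)
    Q = rng.normal(size=(4, 8))   # 4 tokens, 8-dimensional vectors (toy values)
    K = rng.normal(size=(4, 8))
    V = rng.normal(size=(4, 8))
    output, weights = scaled_dot_product_attention(Q, K, V)
    print(weights.round(2))       # each row sums to 1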

Self-Attention

Applications of Transformers / Attention / BERT in NLP

  • Translation between languages
  • Sequence annotation, like NER
  • Question answering: question to answer
  • Text generation: prompt and response

Resources

Performing DH research tasks with ChatGPT!?

  • Tasks
    • Use ChatGPT to help you map an excerpt from a Wikipedia article about a historical person using the KML Viewer at https://kmlviewer.nsspot.net/.
    • Use ChatGPT to analyze an excerpt from a novel (taken e.g. from Wikisource) and determine what parts of a given paragraph are direct speech. Use a suitable output format for this.
    • Use ChatGPT to write a fairy tale in the manner (both regarding content and style) of a famous author.
    • Think of an additional task that would be relevant to your own academic interests and try to implement it using ChatGPT.
  • Be prepared to report on your task, approach, difficulties and results in the discussion phase.

Some ideas regarding prompt engineering for ChatGPT

  1. General advice: be clear and be precise, provide context and examples.
  2. Adopt a persona / role: Ask ChatGPT to adopt a particular perspective or persona for this problem.
  3. Target audience: Ask it to aim at a specific target audience with its response.
  4. Step-by-step: Ask it to break the problem down into smaller steps, and name those steps.
  5. Temperature: You can tell ChatGPT (version 4) to use a specific temperature setting for its response
  6. Tone: Specify the level of formality, simplicity, and register you need the answer to have.
  7. Instructions: Usually, positive prompts (do this) work better than negative prompts (don’t do that)
  8. Examples: Provide examples or even training data for ChatGPT to consider
  9. Chained prompts: Ask for some task to be completed, then continue with another prompt based on the output.
  10. Input format: Experiment with asking questions, making a statement, or giving instructions for your task.
  11. Output format: Ask for the answer to be given in a specific data format, like JSON or CSV.
  12. Take a breath: Ask it to “take a deep breath and relax” before starting
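
An invented example prompt combining several of these techniques (persona, target audience, step-by-step, output format):

    You are an expert in 19th-century French literature. Explain to first-year
    students, step by step, how direct speech is typically marked in novels of
    the period. Then summarize your answer as CSV with the columns
    feature,description,example.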

Session 6 (4 Dec. 2023):
More on Attention

Overview

  • Your results from using ChatGPT
  • Reading and discussing the “Illustrated Transformer”

Using ChatGPT

Schmitz

Spielberg

Version 1

Version 2

Zhang

Liu

The Illustrated Transformer

Session 7 (11 Dec. 2023):
MacBERTh paper

Overview

  • Understanding the MacBERTh paper
  • First steps in implementing parts of the paper

The MacBERTh paper

  • Motivation and context: LLMs, NLP, DH
  • Which models are used?
  • Which evaluation tasks are defined?
  • How does the evaluation strategy work in each case?
  • What is the evaluation dataset in each case?
  • What are the results?

A first look at model, data and code

Session 8 (Dec. 18):
Running MacBERTh

Overview

  • Testing the evaluation from the README
  • Testing the training from sentence-periodization
  • Next paper: de la Rosa et al., “ALBERTI”, 2023.

Session 10 (Jan. 8, 2024):
The ALBERTI paper

Overview

  • Read and discuss the ALBERTI paper
  • Explore ways in which we can reuse the model(s) and dataset(s)
    • Run the masked word task on poetry
    • Other things?
  • Using spaCy to annotate and train a model

Discussion questions

  • What is the goal of the paper?
  • How did the authors proceed?
  • What kind of a dataset did they use?
  • What evaluations did they perform, with what results?

What parts of the paper can we try out?

Practical part (1): masked word task

Practical part (2): Using spaCy

  • Installation (library + some models)
  • Standard annotation of a text
  • Creating training data
  • Performing training
  • Using a newly trained model

General introduction

  • See: https://spacy.io/
  • Friendly interface for Python
  • Models: small, large, transformers
  • Kinds of annotations: tokens, lemmas, pos, ner, etc.
  • Many languages: almost 100
  • Our use-case: Named Entity Recognition (NER)
  • Goal: fine-tune a model; https://spacy.io/usage/training/

Generate training data (1): Annotation

  • Run the annotation pipeline (here, focus on NER)
  • Save the resulting annotation as a tab-separated IOB file.
    • I = token is within a NE
    • O = token is not a NE
    • B = token is the first token of a NE
  • Convert the IOB file to spacy’s binary format using “convert”
  • https://spacy.io/api/cli#convert
  • Example: python3 -m spacy convert dickens-hard.iob train -s -n 1000
  • Move the results to the right folders: one file each into “train” and “dev”
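
One possible way to script the annotation and export steps (model and file names are placeholders; the exact IOB layout that spacy convert accepts should be double-checked against the spaCy documentation):

    import spacy

    nlp = spacy.load("en_core_web_sm")                 # a standard pretrained pipeline
    with open("dickens-hard.txt", encoding="utf-8") as infile:
        doc = nlp(infile.read())

    with open("dickens-hard.iob", "w", encoding="utf-8") as outfile:
        for sent in doc.sents:
            for token in sent:
                # token.ent_iob_ is "B", "I" or "O"; attach the label for B/I tokens
                tag = "O" if token.ent_iob_ == "O" else f"{token.ent_iob_}-{token.ent_type_}"
                outfile.write(f"{token.text}\t{tag}\n")
            outfile.write("\n")                        # blank line between sentences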

Modify / improve the annotations

  • Purpose: Provide training and evaluation data (train + dev!)
  • Several possible goals:
    • Improve the overall NER accuracy for a specific corpus
    • Improve the NER accuracy for a specific label (e.g. work, org)
    • Add NER categories that are not present (e.g. literary work, artefact, building)
  • This is manual work: go through the IOB files and modify them
    • Either: Improve / correct the automatic annotations
    • Or: Add annotations for new category

Generate a config.cfg

  • Use the config widget: https://spacy.io/usage/training/#config
  • run init script: python3 -m spacy init fill-config data/base_config.cfg data/config.cfg
  • When you have your train and dev data, add paths to the correct places in the config.cfg
    • corpora.dev
    • corpora.train
    • training
  • Set other parameters
    • max_steps = (higher is better but takes longer)
    • eval_frequency = (lower is more informative)

Train a new model with the training data

  • Run the spacy train command with the config.cfg file
  • Example: python3 -m spacy train config.cfg -o model --verbose
  • If you have a GPU: python3 -m spacy train config.cfg -o model --verbose --gpu-id 0
  • Be patient…!

Use the new model for annotating a text

  • You can simply replace the model name with the path to the model-best folder
  • Example: nlp = spacy.load("models/output/model-best")
  • Everything else can be as usual
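
A slightly fuller sketch of the same idea; the path follows the example above, and the sample sentence and output are invented.

    import spacy

    # Load the fine-tuned model from its output folder
    nlp = spacy.load("models/output/model-best")

    doc = nlp("Alice took the little golden key and opened the door.")
    for ent in doc.ents:
        print(ent.text, ent.label_)    # e.g. words tagged with the new OBJ label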

Questions for the example: new “OBJ” label

  • Does the model use the “OBJ” label?
  • How many times, compared to other labels?
  • Does it apply it only to words from the training data or also to new words, i.e.: Does it generalize?
  • Does it make many obvious mistakes?
  • Are the other labels somehow affected?

Answers

  • Yes!
  • Quite a lot of times: 257 times in one novel, compared to 341 times for PERSON and 212 times for CARDINAL
  • Yes, it does generalize! (See examples)
  • Yes, it makes some mistakes
  • Yes, the other labels are affected (very negatively, not sure why)

Training process

Labels found

Annotated words for OBJ category

Counter({'door': 14, 'bottle': 9, 'gloves': 9, 'key': 7, 'book': 6, 'table': 6, 'window': 6, 'pictures': 4, 'waistcoat': 4, 'candle': 4, 'chimney': 4, 'jar': 3, 'telescope': 3, 'box': 3, 'fan': 3, 'pocket': 3, 'thimble': 3, 'windows': 3, 'cabinet': 3, 'egg': 3, 'brick': 3, 'watch': 2, 'cupboards': 2, 'rope': 2, 'cakes': 2, 'crockery': 2, 'booth': 2, 'pair': 2, 'pockets': 2, 'daisy': 1, 'maps': 1, 'pegs': 1, 'shelves': 1, 'saucer': 1, 'lamps': 1, 'doors': 1, 'locks': 1, 'curtain': 1, 'lock': 1, 'fountains': 1, 'doorway': 1, 'telescopes': 1, 'paper': 1, 'knife': 1, 'legs': 1, 'cake': 1, 'spades': 1, 'bed': 1, 'plate': 1, 'lesson': 1, 'cartwheels': 1, 'chimney?—Nay': 1, 'fireplace': 1, 'cart': 1, 'neckcloth': 1, 'multiplication': 1, 'cannon': 1, 'muzzle': 1, 'carpet': 1, 'tables': 1, 'chairs': 1, 'carpets': 1, 'pianoforte': 1, 'board': 1, 'account': 1, 'clamps': 1, 'girders': 1, 'brushes': 1, 'brooms': 1, 'flag': 1, 'fountain': 1, 'eyeglass': 1, 'balloon': 1, 'speaking': 1, 'pigsty': 1, 'tongs': 1, 'glasses': 1, 'dial': 1, 'steeple': 1, 'kettle': 1, 'story': 1, 'hat': 1, 'ladder': 1, 'cabinets': 1, 'appliances': 1, 'slate': 1, 'handkerchief': 1, 'penknife': 1, 'piston': 1, 'bell': 1, 'birdcage': 1, 'bells': 1, 'cap': 1, 'lights': 1})

Found words for OBJ category

Counter({'door': 22, 'window': 13, 'knife': 7, 'hand': 6, 'lamp': 6, 'pocket': 4, 'candle': 3, 'doorway': 3, 'glass': 3, 'England': 3, 'rope': 3, 'papers': 2, 'book': 2, 'lamps': 2, 'page': 2, 'lantern': 2, 'stairs': 2, 'timber': 2, 'stair': 2, 'case': 2, 'Andamans': 2, 'wooden': 2, 'leg': 2, 'bottle': 1, 'mantelpiece': 1, 'Frenchman': 1, 'post': 1, 'keyhole': 1, 'clothes': 1, 'books': 1, 'drawer': 1, 'Vauxhall': 1, 'Thames': 1, 'kitchen': 1, 'wire': 1, 'facts': 1, 'London': 1, 'cupboards': 1, 'carafe': 1, 'lids': 1, 'hinges': 1, 'Number': 1, 'Bishopgate': 1, 'detective': 1, 'foot': 1, 'handkerchief': 1, 'wall': 1, 'stockings': 1, 'beads': 1, 'oven': 1, 'chambers': 1, 'criminals': 1, 'ring': 1, 'Andaman': 1, 'blinds': 1, 'test': 1, 'West': 1, "boat's": 1, 'piece': 1, 'Islander': 1, 'lid': 1, 'barrier': 1, 'handcuffs': 1, 'bracelets': 1, 'flames': 1, 'quarter': 1, 'troopers': 1, 'pair': 1, 'four': 1, 'Feringhee': 1, 'Englishman': 1, 'plunder': 1, 'East': 1, 'mail': 1, 'scoundrel': 1, 'cocaine': 1})

Session 10: ALBERTI

Overview

  • Our various tests with training a stanza classifier using BERT and/or ALBERTI
  • Optionally: A look at training a sequence labeling / NER model using spaCy (see slides for previous session)
  • Next steps in class

Training the stanza classifier

  • Basic process
    • Transform the CSV to a DataFrame, then to the Hugging Face Datasets format (train/test)
    • Tokenize the stanzas and get tensor representation
    • Train (= fine-tune) the model as a sequence classifier and evaluate performance
  • More information
    • For details of data, code and output, see folder alberti in the lalamdah repository
    • Training time approximately 15 minutes (on AMD Ryzen 7 with RTX 3060)
    • Some screenshots below
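
For orientation, a hedged sketch of such a fine-tuning pipeline using the Hugging Face Trainer; the column names, checkpoint name and hyperparameters are assumptions, and the code in the alberti folder of the lalamdah repository remains the reference.

    import numpy as np
    import pandas as pd
    from datasets import Dataset
    from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                              Trainer, TrainingArguments)

    # Assumed CSV with a "text" column (stanza) and an integer "label" column (stanza type)
    df = pd.read_csv("stanzas.csv")
    dataset = Dataset.from_pandas(df).train_test_split(test_size=0.2)

    model_name = "bert-base-multilingual-cased"   # or the ALBERTI checkpoint from the Hub
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    def tokenize(batch):
        return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

    dataset = dataset.map(tokenize, batched=True)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name, num_labels=df["label"].nunique())

    def compute_metrics(eval_pred):
        logits, labels = eval_pred
        return {"accuracy": (np.argmax(logits, axis=-1) == labels).mean()}

    args = TrainingArguments(output_dir="stanza-clf", num_train_epochs=5,
                             per_device_train_batch_size=16)
    trainer = Trainer(model=model, args=args, train_dataset=dataset["train"],
                      eval_dataset=dataset["test"], compute_metrics=compute_metrics)
    trainer.train()
    print(trainer.evaluate())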

Output from the process of fine-tuning the classifier using BERT

Output from the process of fine-tuning the classifier using ALBERTI

GPU being very busy while training

Performance comparison (accuracy)

                     BERT    ALBERTI
Paper                0.619   0.636
My test (5 epochs)   0.412   0.588

Results from first test. Both models probably need to be trained longer.

Output from the process of fine-tuning the classifier using BERT (2nd run)

Output from the process of fine-tuning the classifier using ALBERTI (2nd run)

Performance comparison (accuracy)

                      BERT    ALBERTI
Paper                 0.619   0.636
My test (15 epochs)   0.665   0.659

Session 11 (22. Jan. 2024):
Redewiedergabe Project

Overview

Lead questions to discuss about the paper

  • What are the most relevant points regarding previous research?
  • How did the authors proceed, generally speaking?
  • How can the (extensive) error analysis be summarized?
  • What is the key insight you took away from the paper?

Lead questions to discuss about the repositories

  • Corpus
    • What is the corpus design?
    • How is the annotation structured?
    • Is there anything that might be challenging to us?
  • Tagger
    • What are the requirements?
    • What do we need to do to replicate the paper?
    • What variations on the paper make sense?
    • What do we need to do to annotate our own texts?
    • What aspects could be challenging for us?

Plans for our next meeting

  • https://github.com/redewiedergabe/tagger
  • Apply one RW model to a German narrative text
    • Define virtual environment and install requirements there
    • Use the FLAIR model for direct speech
    • Download a suitable model
    • Execute the script rwtagger.py
    • Find a text to annotate (Gutenberg, Deutsches Textarchiv, TextGrid Digitale Bibliothek)
    • First, use predict mode
    • Correct part of the output manually
    • Use this in test mode
    • Compare our performance to the paper’s results

RW Protocol (Linux)

  1. Start at: https://github.com/redewiedergabe/tagger
  2. Download or clone the repository to a folder called tagger
  3. Download models (e.g. FLAIR and BERT for direct)
  4. Unzip the models to the folder tagger/rwtagger/models
  5. Install virtualenv for Python: pip install virtualenv
  6. Install pyenv for easy installation and switching between Python versions: instructions
  7. Set up a virtual environment in the folder tagger using: virtualenv rwvenv
  8. Activate virtual environment: source rwvenv/local/bin/activate
  9. Install all required packages in their correct version
  10. Python 3.7.5 (!!) using pyenv: pyenv install 3.7.5
  11. Switch to Python 3.7.5: pyenv local 3.7.5
  12. pytorch: pip3.7 install torch==1.10.1 # available for Python 3.7.5
  13. pip3.7 install torchvision==0.12.0 # Hopefully right version
  14. pip3.7 install torchaudio==0.11.0 # Hopefully right version
  15. pip3.7 install flair==0.10
  16. pip3.7 install pandas==1.3.5
  17. pip3.7 install nltk==3.6.7
  18. pip3.7 install pytorch_transformers==1.2.0
  19. pip3.7 install openpyxl==3.0.9 # Optional
  20. Start a Python 3.7 interpreter with python3.7; there, run import nltk and nltk.download("punkt"); then exit with quit()
  21. Place a plain text file in the folder tagger/rwtagger/plain
  22. Create the folder tagged-pred also in tagger/rwtagger
  23. From terminal, in folder tagger/rwtagger, run: python3.7 rwtagger.py plain tagged-pred -t direct -conf
  24. Inspect the resulting TSV file in the folder tagged-pred
  25. Correct any mistakes you may find and rename the column direct_pred to just direct
  26. Save the corrected file to a new folder gold within rwtagger
  27. Create a folder tagged-test
  28. Run: python3.7 rwtagger.py -m test gold tagged-test -t direct -conf
  29. Inspect the results in tagged-test, including the scores in results_stats
  30. Once all of this works, you are ready to experiment with further models and settings.