Large Language Models and Digital Humanities (LALAMDAH)

Christof Schöch

2023-10-05

Session 1: Introduction to the topic

Thematic overview: key questions

  1. How do LLMs work?
  2. How have LLMs developed?
  3. What can we do with LLMs in DH?
  4. How can we use LLMs ourselves?

Very brief summary of Wolfram 2023

  • What are the essential building blocks of LLMs?
    • Data: Large amounts of text (billions of words from the web and from books)
    • Computing power: Sufficient amounts of computing power (GPU clusters + heaps of energy)
    • Neural networks: principle of input, hidden, output layers of neurons with weights
    • Word Embeddings: principle of learned, low-dimensional numerical representation of words
    • Transformers / attention: for efficient training over longer dependencies

Questions on Wolfram 2023

ChatGPT’s characteristics, limitations, improvements (1)

  • Question: When ChatGPT does something like writing an essay and adds a word at a time, where does ChatGPT get those words from? From the web?
  • Answer
    • The training corpus determines the vocabulary
    • Words != tokens; tokens are some basic words plus many subwords (like prefixes and suffixes)
    • Subword tokens are why ChatGPT can create new words
    • The training corpus is based on a lot of text from the web and from scanned books.
    • The key process in ChatGPT is selecting the most fitting next word at each step.
    • It doesn’t just copy fitting passages from stored text.
    • Rather, it has a representation of the tokens that helps assess which word fits best next.

ChatGPT’s characteristics, limitations, improvements (2)

  • Question: Is the web-crawled training data for ChatGPT specially selected, or are there certain criteria to be met in order to be used as training data?
  • Answer
    • It is not publicly documented what exactly went into the training corpus
    • Some of the sources are: websites (news sites, blogs, forums); books (fiction, non-fiction); Common Crawl (dataset of scraped text); Wikipedia, Twitter, Reddit.
    • GPT-3 has been trained on 570 gigabytes of text; that’s about 300 billion words (equivalent to ~3 million novels)

ChatGPT’s characteristics, limitations, improvements (3)

  • Question: ChatGPT can do other things besides simulating human language. It can, for example, access databases and write code. How is this done compared to generating language?
  • Answer
    • In its default version, it cannot access databases.
    • It generates code just like it generates prose, by having learned what typical sequences of programming code look like.

ChatGPT’s characteristics, limitations, improvements (4)

  • Question: What are the primary limitations of ChatGPT?
  • Answer
    • Limited to relatively short-range dependencies within a text.
    • Lack of access to information available online.
    • Lack of access to structured information, e.g. knowledge graphs, whether built-in or queryable online.
    • Lack of capability to perform logical reasoning.
    • Lack of capability to perform mathematical calculations.
    • Training data ends in 2021.

ChatGPT’s characteristics, limitations, improvements (5)

  • Question: GPT seems to be able to produce human-like text without any difficulties, although it can sometimes be quite inaccurate about certain topics. Do you think that in the near future, a better version of GPT can pass the Turing Test with even more improved abilities?
  • Answer:
    • Improvements are likely to be quite drastic in the next few years
    • They could come from better access to unstructured information available for live lookup on the internet
    • Or they could come from integration with knowledge bases containing factual information, such as Wikidata

ChatGPT’s characteristics, limitations, improvements (6)

  • Question: ChatGPT fails at logic tasks (and mathematical operations). Would it be possible to combine Large Language Models with conventional computing methods to remedy this?
  • Answer
    • I think the fundamental idea of Stephen Wolfram is to do exactly that: combine LLMs with a logical computing module such as Wolfram|Alpha
    • There used to be a plugin for ChatGPT that appears to have done that; it is currently unavailable afaik
    • Alternatively, ChatGPT could learn to turn unstructured prose into structured data (such as LOD) and then reason on this, and turn its response into prose for output.

Practical aspects (1)

  • Question: Are there currently any large language models that can be run on personal computers? Can you recommend any?
  • Answer
    • There are many! We will get to know some of them during the course.
    • We will train models ourselves, starting with simple Word Embedding Models (using Gensim)
    • To run LLMs locally, one option is the transformers library, which uses models available on huggingface.co (see the sketch below)
    • Another source of LLMs is LLaMA, a set of models that can be freely downloaded and used
    • Using an API, it is possible to use models from OpenAI (such as ChatGPT!)
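
For orientation, a minimal sketch of the transformers option mentioned above: it downloads a small open model from the Hugging Face Hub (distilgpt2 is chosen here purely as an example) and generates text locally.

    from transformers import pipeline

    # Download a small open model and run text generation on the local machine
    generator = pipeline("text-generation", model="distilgpt2")
    result = generator("Digital humanities is", max_new_tokens=30)
    print(result[0]["generated_text"])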

Practical aspects (2)

  • Question: How does “temperature” work? I.e., does a temperature of 0.8 equal a chance of 80% that lower-ranked words are used? What determines which lower-ranked words are being used?
  • Answer
    • AFAIK, temperature is a hyperparameter set between 0 and 1 (when using the API, or in the Chat version)
    • A low value (0.1-0.3) means there is no deviation from the top most likely words: predictable patterns, literal meaning, consistent output
    • A high value (0.6-0.9) means there is a lot of deviation from the top most likely words: creative combinations, metaphorical meaning, varying output
    • You can try this yourself in ChatGPT; just instruct it to use a certain temperature setting in your prompt.
    • For reliable, consistent, working code, a lower temperature is probably best
    • For inventive prose or creative ideas, a higher temperature will probably be best
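
A small illustrative sketch (not OpenAI’s actual implementation) of what temperature does: the logits are divided by the temperature before the softmax, so a low temperature sharpens the distribution towards the top-ranked token, while a high temperature flattens it and gives lower-ranked tokens a real chance. The logit values below are invented.

    import numpy as np

    def sample_next_token(logits, temperature):
        # Rescale the logits by the temperature, then turn them into probabilities
        scaled = np.array(logits) / temperature
        probs = np.exp(scaled - scaled.max())
        probs = probs / probs.sum()
        rng = np.random.default_rng()
        return rng.choice(len(logits), p=probs)

    logits = [4.0, 3.0, 1.0, 0.5]            # four candidate tokens (toy values)
    print(sample_next_token(logits, 0.2))    # almost always picks token 0
    print(sample_next_token(logits, 0.9))    # noticeably more variety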

Neural networks / deep learning (1)

  • Question: How should we understand and interpret the (3D) images that depict how neural nets work?
  • Answer
    • Let’s have a look at it together
    • There are two dimensions to the plot: x and y (any point has a value for x and y)
    • There are three outputs: -1, 0, +1 (one for each of three regions in the plot)
    • We are looking for a function that gives the right output for each input

Word Embeddings (1)

  • Question: The article mentioned that embeddings are created in a way that words with similar meanings are close to each other in the vector space. Do embeddings take word ambiguity into consideration and if so, how is it done?
  • Answer
    • Simple, static word embeddings do not do this; each type gets one vector, which is a mix of all of its uses.
    • More complex, contextual word embeddings do this: each token gets its own vector, depending on its type and its context;
    • In such a word embedding, there might be multiple clusters of tokens with similar meaning, each corresponding to one sense of the type.

Word Embeddings (2)

  • Question: What is a good number of dimensions for word embeddings?

Questions for next time

  • Neural network: technical implementation
    • Activation functions of neurons: e.g. ReLU
    • Backpropagation, loss function, stochastic gradient descent, local/global minima
    • The “attention” mechanism in BERT or GPT
    • etc.

Session 2 (Nov. 6, 2023):
Neural networks

Just as a teaser

Overview of the session

  • Summarize and discuss Chollet, chapter 1
  • Research task: comparative terminology
  • Input and discussion on gradient descent
  • Try out sample code from Chollet, chapter 2

Your key insights

  • Development: designing rules and features > learning rules > also learning features
  • Difference between “shallow” neural networks and “deep” neural networks
  • Predecessors of DL, e.g. neural networks, decision trees, kernel methods (for me: use of gradient-based optimization in gradient boosting, like later in DL)
  • DL became popular because of increased data availability (WWW) and technological developments (GPUs), not (primarily) because of new ideas
  • Are artificial neural networks modeled on how the brain works or not? Wolfram says yes, Chollet says no.

Chollet: AI > ML > DL

Alternative: CS > ML > DL > AI

Some definitions

The study is to proceed on the basis of the conjecture that every aspect of learning or any other feature of intelligence can in principle be so precisely described that a machine can be made to simulate it. (John McCarthy, 1956)

AI can be described as the effort to automate intellectual tasks normally performed by humans. (Chollet)

The field of artificial intelligence, or AI, is concerned with not just understanding but also building intelligent entities—machines that can compute how to act effectively and safely in a wide variety of novel situations. (Russell and Norvig, 1995/2021)

The quest for “artificial flight” succeeded when engineers and inventors stopped imitating birds and started using wind tunnels and learning about aerodynamics. (Russell and Norvig, 1995/2021)

Chollet: Programming vs. Machine Learning

Feature engineering for traditional ML (example)

  • Basic idea: feature engineering in traditional ML <=> representation learning in DL
  • Example from DH:
    • Topic: Direct speech recognition in novels
    • Data: 40 French 19th-century novels
    • Question: What is the relationship between direct speech and subgenre?
    • Method: Traditional Machine Learning with feature engineering
    • Separate presentation: 10.5281/zenodo.10072385
  • Same issue using DL / Transformer architecture: Byszuk et al., “Detecting Direct Speech in Multilingual Collection of 19th-century Novels”, 2020: aclanthology.org/2020.lt4hala-1.15/

Chollet: Various types of ML algorithms (classifiers)

  • Several frequently-used classifiers
    • Naive Bayes
    • Logistic Regression
    • Decision Trees
    • Support Vector Machines
    • Ensemble methods, e.g.: Gradient Boosting
  • Key questions
    • How to choose the right classifier for a given problem? Experience with respect to the dataset, its size, the features, and the goal of the project.

Classifiers: Naive Bayes

  • Founded on Bayesian probability theory (prior, conditional, posterior probabilities)
  • Naive, because it assumes each feature is independent from the others (unlikely)
  • Training the classifier based on a dataset:
    • First, calculates the base probabilities of the target classes (prior probabilities)
    • Then, calculates the probabilities for each class given each feature (proportion or Gaussian distribution: mean, std)
  • Using the classifier:
    • For any data item and its feature values, the class probabilities can now be calculated
    • These and the base probability are combined for each class
    • The class with the highest value is the prediction
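
A minimal scikit-learn sketch of the steps just described, using Gaussian Naive Bayes; the dataset is a standard toy dataset chosen only for illustration.

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import GaussianNB

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Training estimates the prior probabilities and, per class, the mean and
    # standard deviation of each feature (Gaussian assumption)
    clf = GaussianNB()
    clf.fit(X_train, y_train)
    print("prior probabilities:", clf.class_prior_)
    print("test accuracy:", clf.score(X_test, y_test))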

Decision Trees

  • Particularly for classification problems
  • Splits the dataset into hierarchically-organized branches depending on feature values
  • Proceeds step by step:
    • Find the feature with the largest influence on the result
    • For this feature, find the best decision boundary (loss function!)
    • Then, for each branch, find the next most important feature and its best decision boundary
  • Avoid risk of overfitting
    • By limiting the depth of the tree
    • Or by setting a minimum number of samples per resulting leaf
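
A minimal scikit-learn sketch illustrating the two overfitting controls mentioned above (tree depth and leaf size); the dataset is chosen only for illustration.

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier, export_text

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # max_depth limits the depth of the tree, min_samples_leaf the size of the leaves
    clf = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=0)
    clf.fit(X_train, y_train)
    print(export_text(clf))            # the learned features and decision boundaries
    print("test accuracy:", clf.score(X_test, y_test))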

SVM

  • The classifier finds a decision boundary between the classes
  • The decision boundary is a hyperplane separating the classes
  • The “support vectors” are the points that define the boundary
  • The aim is to maximize the margin (width of the decision boundary)
  • There are linear and non-linear versions of SVM
    • Linear: simple, with interpretable feature weights
    • Non-linear: using the “kernel trick”: more powerful, less interpretable

  • Here, the additional dimension is calculated as x^2
  • Source: Kaggle
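
A minimal scikit-learn sketch contrasting a linear SVM with a kernelized one on data that is not linearly separable; dataset and parameters are purely illustrative.

    from sklearn.datasets import make_circles
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    # Two concentric circles: not separable by a straight line in 2D
    X, y = make_circles(n_samples=300, factor=0.4, noise=0.1, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    linear_svm = SVC(kernel="linear").fit(X_train, y_train)
    rbf_svm = SVC(kernel="rbf").fit(X_train, y_train)   # kernel trick: implicit extra dimensions

    print("linear SVM accuracy:", linear_svm.score(X_test, y_test))
    print("RBF-kernel SVM accuracy:", rbf_svm.score(X_test, y_test))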

Ensemble method: Gradient Boosting

  • A type of ensemble method (combine multiple classifiers)
  • For regression or classification
  • Chain of multiple “weak learners” that are strong when combined
  • Requires a loss function (differentiable => gradient)
  • The gradient tells us how the next function could be improved (boosted)
  • At each step, the next learner is trained on the mistakes of the previous learners
  • Each model is the previous model + an improvement reducing the loss
  • Very flexible: learners and loss functions can be selected as needed
  • Standard: gradient boosted decision trees (usually work well)
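
A minimal scikit-learn sketch of gradient-boosted decision trees; the hyperparameters shown are common defaults, not recommendations.

    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import train_test_split

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # A chain of shallow trees ("weak learners"); each new tree is fitted to the
    # gradient of the loss of the current ensemble
    clf = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3)
    clf.fit(X_train, y_train)
    print("test accuracy:", clf.score(X_test, y_test))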

Use case: classifier selection

  • Source: Du et al., “Evaluation of Measures of Distinctiveness”, JCLS, 2022. DOI: 10.48694/jcls.102.

Chollet: Representation learning by coordinate change

  • Before the transformation, slope and offset are needed to separate the data
  • After the transformation, a simple rule (value of x) can separate the data (y doesn’t matter)
  • The raw features are used but transformed automatically (no feature engineering)

Chollet: Neural Network architecture

Conceptual Break

  • Live research task: comparative definitions of two key terms
  • Summarize the key similarities, differences, and the relationship between two of the following terms (different participants should choose different combinations): machine learning, neural networks, artificial intelligence, representation learning, word embeddings, (large) language models, deep learning, knowledge graphs.
  • We’ll discuss your findings.

Question: Do artificial NN work like the brain?

  • aNN have been inspired by the brain
  • There are some similarities, but also many differences
  • Even when biological neurons are modeled, and the output matches the brain’s neurons, the mechanisms are usually quite different
  • Useful summary: Adan, “Do-neural-networks-really-work-like-neurons”, 2018 (see Readings)
  • Basic unit is the neuron
  • Neuron: inputs, computation, output
  • Many inputs: sensory input or from other neurons
  • Non-linear activation function
  • Many outputs: to many other neurons
  • More complex weighting mechanisms
  • Much more complex architecture
  • No backpropagation through gradient descent with differentials
  • Learns much faster from few examples (~ pre-training?)
  • Much more energy-efficient!

Practical Task

  • The basis for this task is Chollet, Chapter 2
  • Use the code you find there to build a simple classifier for the MNIST dataset of numbers
  • What is the performance on the training and the test set that you achieve?
  • For the curious:
    • How would you reduce the amount of training data, and what happens if you do?
    • What if you modify the size of the input layer, making it smaller (or larger) than 512?
    • How would you go about reducing the size of the images? What happens if you do that?
    • Would you say the classifier is robust or fragile with respect to such interference?
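
For orientation, a condensed sketch of the kind of Keras code used in Chollet, chapter 2 (the book’s notebook remains the authoritative version; layer size and training settings follow the book).

    from tensorflow import keras
    from tensorflow.keras import layers

    (train_images, train_labels), (test_images, test_labels) = keras.datasets.mnist.load_data()
    train_images = train_images.reshape((60000, 28 * 28)).astype("float32") / 255
    test_images = test_images.reshape((10000, 28 * 28)).astype("float32") / 255

    model = keras.Sequential([
        layers.Dense(512, activation="relu"),
        layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="rmsprop",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    model.fit(train_images, train_labels, epochs=5, batch_size=128)

    test_loss, test_acc = model.evaluate(test_images, test_labels)
    print("test accuracy:", test_acc)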

Session 3 (13 Nov., 2023):
Gradient Descent

Overview of the session

  1. Gradient descent: insights and questions
  2. Implementing a simple differential
  3. Overview of application studies collected so far

(1) Gradient Descent: insights and questions

Key insights from Chollet, chapter 2

  • Gradient descent is the optimization technique that powers modern neural networks.
  • I understood that the Gradient Descent is a technique to find the smallest possible values for the loss function step by step which helps to determine the right weights of a neural network.
  • We can think of gradient descent as a person who is at the top of a mountain. The person looks all around to see where s/he should take a step in order to get to the bottom in the fastest and easiest way. So s/he takes a huge or small step (here we refer to the alpha rate) and moves in a direction (here we refer to the derivative of the slope). This process is repeated again and again until s/he reaches the bottom (a local minimum), and just like this, gradient descent finally optimizes the model.
  • Gradient descent is one of the most common algorithms in the machine learning field. It is used to find the best parameters w and b of a model, and in this way it optimizes the model. In the formula of gradient descent, partial derivatives and the alpha rate are used. The alpha rate determines the size of the model’s step. The bigger the step size, the harder it will be to find the best parameters.
  • In neural networks, gradients are used to update the model’s weights to minimize losses. By calculating the gradient of the loss with respect to the weights, people can determine how to adjust the parameters to reduce the loss. When using gradient descent, the model parameters are updated in the opposite direction of the gradient at each iteration. The step size is an important factor that controls the magnitude of each update.
  • Gradient descent minimizes the loss function by calculating the derivative of the loss function with respect to the intercept.
  • Stochastic gradient descent computes the gradient by randomly sampling only one sample in each iteration. Multiple samples can be sampled randomly and uniformly in each iteration to form a mini batch, and then this mini batch can be used to compute the gradient. The learning rate of mini batch stochastic gradient descent can decay itself during the iteration.
  • The sections about backpropagation, including composition graphs and for- and backward passes, were quite interesting as I strangely have never heard of it in-depth before. What made it more interesting were the visualizations that came with the theory of backpropagation, which, in fact, made the computations “under the surface” clearer.

Question: Generality of Gradient Descent

  • Question: Assuming that gradient descent is one of the most common algorithms for optimizing a machine learning model, is it applicable to most of the machine learning models that we use, or are there other methods that are much better than gradient descent in certain situations?
  • Answer
    • It is widely applicable, but used mostly in NN architectures
    • There are alternative methods of reducing the loss of a model / classifier (e.g. XXXXX)
    • There are ML algorithms that don’t need such a loss calculation (e.g. Naive Bayes)

Question: Learning Rate

  • Question: How can we decide on the best learning rate at the beginning of training? Do we have to choose a random learning rate and then change it until we find the best one if it doesn’t work? / How do you find the “ideal” learning rate that doesn’t get stuck in a local minimum or diverge?
  • Answer
    • Learning rate vs. step size
    • Estimating a good learning rate
    • Adjusting the learning rate

Performance of Gradient Descent

  • Question: What factors affect the performance of gradient descent? What challenges are faced during optimisation? / Are there cases where Gradient Descent might be suboptimal and another optimization algorithm might work better? What if there are multiple local minima […], will gradient descent find the global minimum or will it get stuck at a local minimum?
  • Answer
    • One theoretical issue is indeed local minima. In practice, due to the many dimensions of typical datasets used in DL, this is not an issue.
    • Another possible issue is the step size / learning rate: it should be neither too small nor too large.

Gradient Descent in Practice

  • Question: In the video and text it is shown how GD is calculated on a mathematical level. How is gradient descent used in practice?
  • Answer
    • In practice, it is easy: It is applied automatically in the training process by the ML library of choice
    • The user just sets the relevant parameters: what loss function to use, what optimizer to use, what learning rate to use, how to adjust the learning rate.

Optimizer: Stochastic Gradient Descent

  • The starting points are random weights
  • The goal is to minimize the loss
  • This is done by adjusting all of the weights
  • The “optimizer” determines how the weights need to be updated
  • It works because all functions in NN are smooth and continuous (i.e., differentiable)
  • Multidimensional differentials are called gradients
  • The process is therefore called gradient descent
  • The whole process is backpropagation, mathematically, and learning, conceptually
  • Just one input dimension (x) and one output dimension (y)
  • Linear regression: w0 (slope) and w1 (intercept)
  • Start with random values for slope and intercept
  • Calculate the loss (e.g. the sum of squared errors)
  • Use the derivative to get the slope (or gradient) of the loss function
  • Parameter: ‘step size’ (fixed or flexible, e.g.: step size x ‘learning rate’)
  • Change w0 (slope) and w1 (intercept) according to the derivative;
  • Repeat until stopping condition: e.g. minimal step size, maximum number of steps
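
A minimal NumPy sketch of exactly this one-dimensional linear-regression example; the toy data, learning rate and stopping condition are invented for illustration.

    import numpy as np

    # Toy data: y = 2x + 1 plus some noise
    rng = np.random.default_rng(42)
    x = rng.uniform(0, 10, 100)
    y = 2 * x + 1 + rng.normal(0, 1, 100)

    w0, w1 = 0.0, 0.0              # slope and intercept, arbitrary starting values
    learning_rate = 0.01
    for step in range(1000):       # stopping condition: maximum number of steps
        error = (w0 * x + w1) - y
        loss = np.mean(error ** 2)            # mean squared error
        grad_w0 = 2 * np.mean(error * x)      # derivative of the loss wrt the slope
        grad_w1 = 2 * np.mean(error)          # derivative of the loss wrt the intercept
        w0 -= learning_rate * grad_w0
        w1 -= learning_rate * grad_w1

    print(f"slope={w0:.3f}, intercept={w1:.3f}, loss={loss:.3f}")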

  • Similar principle as gradient boosting
  • Source: Russell and Norvig 2021, chapter 19.6.

  • Loss function as landscape
  • Source: Andrew Ng, ML 1.2.5

(2) Practical task: Differentials

(3) Application Studies

  • Task: “Please perform a literature search using relevant services (such as Google Scholar and/or Semantic Scholar) to identify scholarly papers (journal articles or conference papers) that describe applications of Large Language Models to a research question from the Humanities. Identify at least 5 relevant papers.”
  • Results so far: See wiki page “Applications of LLMs in DH” in StudIP.
  • Questions
    • How did you proceed to find these papers?
    • What sources or databases did you consult?
    • What keywords or phrases did you use to search?
    • What definition of “Digital Humanities” did you assume?
    • Which of the papers do you think we should read?
  • Task: find a few more papers with a specific focus on DH

Session 4 (20. Nov. 2023):
Word Embedding Models

Overview

  1. Introduction to Word Embeddings: https://christofs.github.io/wem/trier.html#/
  2. Discussion of Mikolov et al., “Efficient Estimation of Word Representations in Vector Space”, 2013
  3. Practical part: using Gensim to train, inspect and use word embedding models

Mikolov-Paper: Key insights

  1. Different architectures perform differently on different types of problems, with e.g. Skip-gram performing well on semantic problems and CBOW being slightly stronger on syntactic tasks.
  2. There are diminishing returns in accuracy improvement beyond a certain point, emphasizing the need to balance vector dimensionality and training data.
  3. One epoch of skip-gram or bag of words training with a high vector dimensionality not only takes less computational time but also increases accuracy compared to training with multiple epochs or other models like LSA or LDA.
  4. A unique and interesting application of the models proposed in the paper is odd-word-out tasks.
  5. I hadn’t heard as much about the Skip-Gram model as I had about the bag of words model.
  6. Key goal was to reduce training complexity, in order to be able to train on larger datasets, and increase quality in this way.

Mikolov paper: model architectures

Vector arithmetic

  • Question: Could you please provide more details on the methodology used to perform vector arithmetic and determine word similarities in the semantic and syntactic tasks?
  • Answer:
    • The key approach here is to measure the similarity of the vectors, e.g. using the cosine between the two vectors: the larger the cosine similarity of two vectors, the more similar the two words represented by the vectors are likely to be.
    • The vectors are normalized to unit length; then the cosine of the angle is calculated. Identical vectors get a value of 1, orthogonal ones a value of 0.

Cosine similarity

  • Cosine similarity of two vectors = cosine of the angle between the vectors
  • Cosine = dot product of the vectors / product of their lengths
  • Dot product = sum of the products of the two values, in each dimension
  • Length = square root of the dot product of the vector by itself
  • Example: see the code sketch below
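
A minimal NumPy sketch of exactly this computation; the two 3-dimensional vectors are invented toy values.

    import numpy as np

    def cosine_similarity(a, b):
        # dot product of the vectors divided by the product of their lengths
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    king = np.array([0.8, 0.3, 0.1])    # toy 3-dimensional "word vectors"
    queen = np.array([0.7, 0.4, 0.2])
    print(cosine_similarity(king, queen))   # close to 1 = very similar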

Polysemy

  • Question: The text indicates the criteria for evaluating word embeddings, requiring the closest word to be considered a correct response. How should we determine which meaning of a polysemous word (i.e., a word with more than one meaning) is more appropriate to consider as the closest answer?
  • Answer
    • Static word embeddings (as in the Mikolov paper) do not consider the polysemy of a word other than representing an average of all its meanings.
    • Only contextual word embeddings can represent the different meanings of a word form in differing contexts.

Alternative models

  • Question: Quote from Mikolov et al.: “We decided to explore simpler models that might not be able to represent the data as precisely as neural networks, but can possibly be trained on much more data efficiently.” Does the author mean that his models cannot be regarded as a NN and if so why not? What are the models then?
  • Answer
    • The difference is the depth of the network;
    • The classic word2vec models are “shallow” neural networks with only one hidden layer

Data models for text beyond BOW

  • Question: Are there any established models, besides e.g. BoW, that researchers use now or try to outperform?
  • Answer
    • BoW is a text-level representation: text as words with their frequencies
    • A useful alternative model to BoW would be n-gram models
    • word2vec is a word-level representation: each word as a vector
    • So BoW and word2vec can be combined
    • That’s what happens in sent2vec or doc2vec models: some way to aggregate (“summarize”) the word vectors into a vector representing the sentence or document

Accuracy level of best models today

  • Question: What is the accuracy of the “best” model today and what did the implementer(s) do different in regards to word representation?
  • Answer:
    • GPT-4 (as used in ChatGPT) is certainly a contender for a strong model
    • How? More data, more parameters, more training time, more feedback used
    • Multimodal model (text and image)
    • Accuracy is task-dependent

Practical part

  1. Very quick introduction to Gensim
  2. Demo of the Gensim code to
    1. prepare a corpus
    2. train a model
    3. inspect a model
    4. query a model
  3. Try it out with another corpus and/or other settings
    1. Corpus: Gutenberg1, Doyle2, etc.
    2. Minimum word frequency
    3. Dimensionality
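
For orientation, a minimal Gensim sketch of the training and querying steps; the file name, corpus preparation and parameter values are placeholders, and the course notebook remains the reference.

    from gensim.models import Word2Vec
    from gensim.utils import simple_preprocess

    # One tokenized sentence (or line) per list entry
    with open("corpus.txt", encoding="utf-8") as infile:
        sentences = [simple_preprocess(line) for line in infile]

    model = Word2Vec(
        sentences=sentences,
        vector_size=100,   # dimensionality of the word vectors
        window=5,          # context window size
        min_count=5,       # minimum word frequency
        sg=1,              # 1 = skip-gram, 0 = CBOW
    )
    model.save("corpus.w2v")
    print(model.wv.most_similar("house", topn=10))   # nearest neighbours of a word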

Session 5 (27. Nov. 2023):
The “attention” mechanism

Overview

  • Key insights and questions about the “Attention” paper
  • An introduction to performing DH tasks with ChatGPT

Key insights

  • The key innovations are the positional encoding and the “self-attention mechanism”, which make it possible to model long-distance dependencies.
  • The clever integration of these multi-head attention mechanisms in the Transformer facilitates effective information interaction between the encoder and decoder, as well as within each layer of the model. This enables the model to capture intricate relationships in the input sequence and generate corresponding output sequences.
  • The innovation primarily aims at more efficient computation; here, enabling parallelization and hence faster training. Performance improves significantly as well. (CS)
  • There is relatively little theoretical justification, but they appear to have tried a lot of architectures; see contributor note: “Niki designed, implemented, tuned and evaluated countless model variants […].” (CS)
  • Historical hardware: They trained the models on “one machine with 8 NVIDIA P100 GPUs” for 12 hours (small model) to 3.5 days (large model). At launch in 2016, that machine would have cost about 50,000 USD. A P100 (~5,000 USD) was able to do 9.4 TFLOPs, compared to a current RTX 4090 (~2,000 EUR), which can do 82 TFLOPs (FLOPs = floating point operations per second; T = tera = 10^12). (CS)

Attention paper: questions

  • What is special about contextual word embeddings?
  • What is an encoder/decoder architecture?
  • How does the attention mechanism work?
  • Are there examples of successful applications of the Transformer model in NLP tasks?
  • What exactly happens in the second attention-head layer in the output block, compared to the normal attention layer that is placed before it? In other words, in a translation task, how do the words from the different languages get used together in this multi-head attention layer of the output?

Contextual Word Embeddings

  • Static word embeddings calculate one vector for each type in a vocabulary
  • Contextual word embeddings calculate one vector for each token, depending on its particular context
  • Static word embeddings encode a lexicon, whereas contextual word embeddings encode a corpus

Encoder / Decoder Architecture

  • Input is a sequence, output is a sequence: sequence-to-sequence modeling
  • Encoder: Takes words one by one, encoded as embeddings; encodes a hidden state at each step, outputs the hidden state
  • Decoder: Takes the hidden state, generates words one by one, based on the hidden state and the previously-generated words

Traditional Encoder / Decoder (Google)

Encoder / Decoder with Attention (Google)

Traditional Encoder / Decoder: details (Google)

Encoder / Decoder with Attention: details (Google)

Transformer architecture (from Vaswani et al. 2017)

Attention mechanism (from Vaswani et al. 2017)

Attention

  • The challenge is how to condition a word on a large context
  • If all earlier word vectors are retained, they need to be combined somehow
  • This could be done by addition, averaging, concatenation, or even TF-IDF (weighting)
  • What attention does is create a dynamic weighting scheme: for every word, the weights shift
  • For every word, there is an attention vector over the context
  • For one particular output word (the query, Q), attention weights are computed against the inputs in the sequence (the keys, K); these weights determine how much of each input’s content (the values, V) flows into the output
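
A minimal NumPy sketch of the scaled dot-product attention from Vaswani et al. 2017, with invented toy vectors; the learned Q/K/V projections, masking and multi-head mechanism are omitted.

    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    def scaled_dot_product_attention(Q, K, V):
        d_k = K.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)     # how well each query matches each key
        weights = softmax(scores)           # dynamic weighting over the context
        return weights @ V, weights         # weighted sum of the values

    rng = np.random.default_rng(0)
    Q = rng.normal(size=(4, 8))   # 4 tokens, 8-dimensional vectors (toy values)
    K = rng.normal(size=(4, 8))
    V = rng.normal(size=(4, 8))
    output, weights = scaled_dot_product_attention(Q, K, V)
    print(weights.round(2))       # each row sums to 1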

Self-Attention

Applications of Transformers / Attention / BERT in NLP

  • Translation between languages
  • Sequence annotation, like NER
  • Question answering: question to answer
  • Text generation: prompt and response

Resources

Performing DH research tasks with ChatGPT!?

  • Tasks
    • Use ChatGPT to help you map an excerpt from a Wikipedia article about a historical person using the KML Viewer at https://kmlviewer.nsspot.net/.
    • Use ChatGPT to analyze an excerpt from a novel (taken e.g. from Wikisource) and determine what parts of a given paragraph are direct speech. Use a suitable output format for this.
    • Use ChatGPT to write a fairy tale in the manner (both regarding content and style) of a famous author.
    • Think of an additional task that would be relevant to your own academic interests and try to implement it using ChatGPT.
  • Be prepared to report on your task, approach, difficulties and results in the discussion phase.

Some ideas regarding prompt engineering for ChatGPT

  1. General advice: be clear and be precise, provide context and examples.
  2. Adopt a persona / role: Ask ChatGPT to adopt a particular perspective or persona for this problem.
  3. Target audience: Ask it to aim at a specific target audience with its response.
  4. Step-by-step: Ask it to break the problem down into smaller steps, and name those steps.
  5. Temperature: You can tell ChatGPT (version 4) to use a specific temperature setting for its response
  6. Tone: Specify the level of formality, simplicity, and register you need the answer to have.
  7. Instructions: Usually, positive prompts (do this) work better than negative prompts (don’t do that)
  8. Examples: Provide examples or even training data for ChatGPT to consider
  9. Chained prompts: Ask for some task to be completed, then continue with another prompt based on the output.
  10. Input format: Experiment with asking questions, making a statement, or giving instructions for your task.
  11. Output format: Ask for the answer to be given in a specific data format, like JSON or CSV.
  12. Take a breath: Ask it to “take a deep breath and relax” before starting
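
An invented example prompt combining several of these techniques (persona, target audience, step-by-step, output format):

    You are an expert in 19th-century French literature. Explain to first-year
    students, step by step, how direct speech is typically marked in novels of
    the period. Then summarize your answer as CSV with the columns
    feature,description,example.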

Session 6 (4 Dec. 2023):
More on Attention

Overview

  • Your results from using ChatGPT
  • Reading and discussing the “Illustrated Transformer”

Using ChatGPT

Schmitz

Spielberg

Version 1

Version 2

Zhang

Liu

The Illustrated Transformer

Session 7 (11 Dec. 2023):
MacBERTh paper

Overview

  • Understanding the MacBERTh paper
  • First steps in implementing parts of the paper

The MacBERTh paper

  • Motivation and context: LLMs, NLP, DH
  • Which models are used?
  • Which evaluation tasks are defined?
  • How does the evaluation strategy work in each case?
  • What is the evaluation dataset in each case?
  • What are the results?

A first look at model, data and code

Session 8 (Dec. 18):
Running MacBERTh

Overview

  • Testing the evaluation from the README
  • Testing the training from sentence-periodization
  • Next paper: de la Rosa et al., “ALBERTI”, 2023.

Session 10 (Jan. 8, 2024):
The ALBERTI paper

Overview

  • Read and discuss the ALBERTI paper
  • Explore ways in which we can reuse the model(s) and dataset(s)
    • Run the masked word task on poetry
    • Other things?
  • Using spaCy to annotate and train a model

Discussion questions

  • What is the goal of the paper?
  • How did the authors proceed?
  • What kind of a dataset did they use?
  • What evaluations did they perform, with what results?

What parts of the paper can we try out?

Practical part (1): masked word task

Practical part (2): Using spaCy

  • Installation (library + some models)
  • Standard annotation of a text
  • Creating training data
  • Performing training
  • Using a newly trained model

General introduction

  • See: https://spacy.io/
  • Friendly interface for Python
  • Models: small, large, transformers
  • Kinds of annotations: tokens, lemmas, pos, ner, etc.
  • Many languages: almost 100
  • Our use-case: Named Entity Recognition (NER)
  • Goal: fine-tune a model; https://spacy.io/usage/training/

Generate training data (1): Annotation

  • Run the annotation pipeline (here, focus on NER)
  • Save the resulting annotation as a tab-separated IOB file.
    • I = token is within a NE
    • O = token is not a NE
    • B = token is the first token of a NE
  • Convert the IOB file to spacy’s binary format using “convert”
  • https://spacy.io/api/cli#convert
  • Example: python3 -m spacy convert dickens-hard.iob train -s -n 1000
  • Move the results to the right folders: one file each into “train” and “dev”
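
One possible way to script the annotation and export steps (model and file names are placeholders; the exact IOB layout that spacy convert accepts should be double-checked against the spaCy documentation):

    import spacy

    nlp = spacy.load("en_core_web_sm")                 # a standard pretrained pipeline
    with open("dickens-hard.txt", encoding="utf-8") as infile:
        doc = nlp(infile.read())

    with open("dickens-hard.iob", "w", encoding="utf-8") as outfile:
        for sent in doc.sents:
            for token in sent:
                # token.ent_iob_ is "B", "I" or "O"; attach the label for B/I tokens
                tag = "O" if token.ent_iob_ == "O" else f"{token.ent_iob_}-{token.ent_type_}"
                outfile.write(f"{token.text}\t{tag}\n")
            outfile.write("\n")                        # blank line between sentences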

Modify / improve the annotations

  • Purpose: Provide training and evaluation data (train + dev!)
  • Several possible goals:
    • Improve the overall NER accuracy for a specific corpus
    • Improve the NER accuracy for a specific label (e.g. work, org)
    • Add NER categories that are not present (e.g. literary work, artefact, building)
  • This is manual work: go through the IOB files and modify them
    • Either: Improve / correct the automatic annotations
    • Or: Add annotations for new category

Generate a config.cfg

  • Use the config widget: https://spacy.io/usage/training/#config
  • run init script: python3 -m spacy init fill-config data/base_config.cfg data/config.cfg
  • When you have your train and dev data, add paths to the correct places in the config.cfg
    • corpora.dev
    • corpora.train
    • training
  • Set other parameters
    • max_steps = (higher is better but takes longer)
    • eval_frequency = (lower is more informative)

Train a new model with the training data

  • Run the spacy train command with the config.cfg file
  • Example: python3 -m spacy train config.cfg -o model --verbose
  • If you have a GPU: python3 -m spacy train config.cfg -o model --verbose --gpu-id 0
  • Be patient…!

Use the new model for annotating a text

  • You can simply replace the model name with the path to the model-best folder
  • Example: nlp = spacy.load("models/output/model-best")
  • Everything else can be as usual
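
A slightly fuller sketch of the same idea; the path follows the example above, and the sample sentence and output are invented.

    import spacy

    # Load the fine-tuned model from its output folder
    nlp = spacy.load("models/output/model-best")

    doc = nlp("Alice took the little golden key and opened the door.")
    for ent in doc.ents:
        print(ent.text, ent.label_)    # e.g. words tagged with the new OBJ label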

Questions for the example: new “OBJ” label

  • Does the model use the “OBJ” label?
  • How many times, compared to other labels?
  • Does it apply it only to words from the training data or also to new words, i.e.: Does it generalize?
  • Does it make many obvious mistakes?
  • Are the other labels somehow affected?

Answers

  • Yes!
  • Quite a lot of times: 257 times in one novel, compared to 341 times for PERSON and 212 times for CARDINAL
  • Yes, it does generalize! (See examples)
  • Yes, it makes some mistakes
  • Yes, the other labels are affected (very negatively, not sure why)

Training process

Labels found

Annotated words for OBJ category

Counter({'door': 14, 'bottle': 9, 'gloves': 9, 'key': 7, 'book': 6, 'table': 6, 'window': 6, 'pictures': 4, 'waistcoat': 4, 'candle': 4, 'chimney': 4, 'jar': 3, 'telescope': 3, 'box': 3, 'fan': 3, 'pocket': 3, 'thimble': 3, 'windows': 3, 'cabinet': 3, 'egg': 3, 'brick': 3, 'watch': 2, 'cupboards': 2, 'rope': 2, 'cakes': 2, 'crockery': 2, 'booth': 2, 'pair': 2, 'pockets': 2, 'daisy': 1, 'maps': 1, 'pegs': 1, 'shelves': 1, 'saucer': 1, 'lamps': 1, 'doors': 1, 'locks': 1, 'curtain': 1, 'lock': 1, 'fountains': 1, 'doorway': 1, 'telescopes': 1, 'paper': 1, 'knife': 1, 'legs': 1, 'cake': 1, 'spades': 1, 'bed': 1, 'plate': 1, 'lesson': 1, 'cartwheels': 1, 'chimney?—Nay': 1, 'fireplace': 1, 'cart': 1, 'neckcloth': 1, 'multiplication': 1, 'cannon': 1, 'muzzle': 1, 'carpet': 1, 'tables': 1, 'chairs': 1, 'carpets': 1, 'pianoforte': 1, 'board': 1, 'account': 1, 'clamps': 1, 'girders': 1, 'brushes': 1, 'brooms': 1, 'flag': 1, 'fountain': 1, 'eyeglass': 1, 'balloon': 1, 'speaking': 1, 'pigsty': 1, 'tongs': 1, 'glasses': 1, 'dial': 1, 'steeple': 1, 'kettle': 1, 'story': 1, 'hat': 1, 'ladder': 1, 'cabinets': 1, 'appliances': 1, 'slate': 1, 'handkerchief': 1, 'penknife': 1, 'piston': 1, 'bell': 1, 'birdcage': 1, 'bells': 1, 'cap': 1, 'lights': 1})

Found words for OBJ category

Counter({'door': 22, 'window': 13, 'knife': 7, 'hand': 6, 'lamp': 6, 'pocket': 4, 'candle': 3, 'doorway': 3, 'glass': 3, 'England': 3, 'rope': 3, 'papers': 2, 'book': 2, 'lamps': 2, 'page': 2, 'lantern': 2, 'stairs': 2, 'timber': 2, 'stair': 2, 'case': 2, 'Andamans': 2, 'wooden': 2, 'leg': 2, 'bottle': 1, 'mantelpiece': 1, 'Frenchman': 1, 'post': 1, 'keyhole': 1, 'clothes': 1, 'books': 1, 'drawer': 1, 'Vauxhall': 1, 'Thames': 1, 'kitchen': 1, 'wire': 1, 'facts': 1, 'London': 1, 'cupboards': 1, 'carafe': 1, 'lids': 1, 'hinges': 1, 'Number': 1, 'Bishopgate': 1, 'detective': 1, 'foot': 1, 'handkerchief': 1, 'wall': 1, 'stockings': 1, 'beads': 1, 'oven': 1, 'chambers': 1, 'criminals': 1, 'ring': 1, 'Andaman': 1, 'blinds': 1, 'test': 1, 'West': 1, "boat's": 1, 'piece': 1, 'Islander': 1, 'lid': 1, 'barrier': 1, 'handcuffs': 1, 'bracelets': 1, 'flames': 1, 'quarter': 1, 'troopers': 1, 'pair': 1, 'four': 1, 'Feringhee': 1, 'Englishman': 1, 'plunder': 1, 'East': 1, 'mail': 1, 'scoundrel': 1, 'cocaine': 1})

Session 10: ALBERTI

Overview

  • Our various tests with training a stanza classifier using BERT and/or ALBERTI
  • Optionally: A look at training a sequence labeling / NER model using spaCy (see slides for previous session)
  • Next steps in class

Training the stanza classifier

  • Basic process
    • Transform the CSV to a DataFrame, then to the Hugging Face Datasets format (train/test)
    • Tokenize the stanzas and get tensor representation
    • Train (= fine-tune) the model as a sequence classifier and evaluate performance
  • More information
    • For details of data, code and output, see folder alberti in the lalamdah repository
    • Training time approximately 15 minutes (on AMD Ryzen 7 with RTX 3060)
    • Some screenshots below
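
For orientation, a hedged sketch of such a fine-tuning pipeline using the Hugging Face Trainer; the column names, checkpoint name and hyperparameters are assumptions, and the code in the alberti folder of the lalamdah repository remains the reference.

    import numpy as np
    import pandas as pd
    from datasets import Dataset
    from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                              Trainer, TrainingArguments)

    # Assumed CSV with a "text" column (stanza) and an integer "label" column (stanza type)
    df = pd.read_csv("stanzas.csv")
    dataset = Dataset.from_pandas(df).train_test_split(test_size=0.2)

    model_name = "bert-base-multilingual-cased"   # or the ALBERTI checkpoint from the Hub
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    def tokenize(batch):
        return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

    dataset = dataset.map(tokenize, batched=True)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name, num_labels=df["label"].nunique())

    def compute_metrics(eval_pred):
        logits, labels = eval_pred
        return {"accuracy": (np.argmax(logits, axis=-1) == labels).mean()}

    args = TrainingArguments(output_dir="stanza-clf", num_train_epochs=5,
                             per_device_train_batch_size=16)
    trainer = Trainer(model=model, args=args, train_dataset=dataset["train"],
                      eval_dataset=dataset["test"], compute_metrics=compute_metrics)
    trainer.train()
    print(trainer.evaluate())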

Output from the process of fine-tuning the classifier using BERT

Output from the process of fine-tuning the classifier using ALBERTI

GPU being very busy while training

Performance comparison (accuracy)

                     BERT    ALBERTI
Paper                0.619   0.636
My test (5 epochs)   0.412   0.588

Results from first test. Both models probably need to be trained longer.

Output from the process of fine-tuning the classifier using BERT (2nd run)

Output from the process of fine-tuning the classifier using ALBERTI (2nd run)

Performance comparison (accuracy)

                      BERT    ALBERTI
Paper                 0.619   0.636
My test (15 epochs)   0.665   0.659

Session 11 (22. Jan. 2024):
Redewiedergabe Project

Overview

Lead questions to discuss about the paper

  • What are the most relevant points regarding previous research?
  • How did the authors proceed, generally speaking?
  • How can the (extensive) error analysis be summarized?
  • What is the key insight you took away from the paper?

Lead questions to discuss about the repositories

  • Corpus
    • What is the corpus design?
    • How is the annotation structured?
    • Is there anything that might be challenging to us?
  • Tagger
    • What are the requirements?
    • What do we need to do to replicate the paper?
    • What variations on the paper make sense?
    • What do we need to do to annotate our own texts?
    • What aspects could be challenging for us?

Plans for our next meeting

  • https://github.com/redewiedergabe/tagger
  • Apply one RW model to a German narrative text
    • Define virtual environment and install requirements there
    • Use the FLAIR model for direct speech
    • Download a suitable model
    • Execute the script rwtagger.py
    • Find a text to annotate (Gutenberg, Deutsches Textarchiv, TextGrid Digitale Bibliothek)
    • First, use predict mode
    • Correct part of the output manually
    • Use this in test mode
    • Compare our performance to the paper’s results

RW Protocol (Linux)

  1. Start at: https://github.com/redewiedergabe/tagger
  2. Download or clone the repository to a folder called tagger
  3. Download models (e.g. FLAIR and BERT for direct)
  4. Unzip the models to the folder tagger/rwtagger/models
  5. Install virtualenv for Python: pip install virtualenv
  6. Install pyenv for easy installation and switching between Python versions: instructions
  7. Set up a virtual environment in the folder tagger using: virtualenv rwvenv
  8. Activate virtual environment: source rwvenv/local/bin/activate
  9. Install all required packages in their correct version
  10. Python 3.7.5 (!!) using pyenv: pyenv install 3.7.5
  11. Switch to Python 3.7.5: pyenv local 3.7.5
  12. pytorch: pip3.7 install torch==1.10.1 # available for Python 3.7.5
  13. pip3.7 install torchvision==0.12.0 # Hopefully right version
  14. pip3.7 install torchaudio==0.11.0 # Hopefully right version
  15. pip3.7 install flair==0.10
  16. pip3.7 install pandas==1.3.5
  17. pip3.7 install nltk==3.6.7
  18. pip3.7 install pytorch_transformers==1.2.0
  19. pip3.7 install openpyxl==3.0.9 # Optional
  20. Start a Python 3.7 interpreter with python3.7; there, run import nltk and nltk.download("punkt"); then exit with quit()
  21. Place a plain text file in the folder tagger/rwtagger/plain
  22. Create the folder tagged-pred also in tagger/rwtagger
  23. From terminal, in folder tagger/rwtagger, run: python3.7 rwtagger.py plain tagged-pred -t direct -conf
  24. Inspect the resulting TSV file in the folder tagged-pred
  25. Correct any mistakes you may find and rename the column direct_pred to just direct
  26. Save the corrected file to a new folder gold within rwtagger
  27. Create a folder tagged-test
  28. Run: python3.7 rwtagger.py -m test gold tagged-test -t direct -conf
  29. Inspect the results in tagged-test, including the scores in results_stats
  30. Once all of this works, you are ready to experiment with further models and settings.