Topic Modelling

Original song: Duck Tales Theme (Disney)

YouTube link: https://www.youtube.com/watch?v=nqZ_Cb2slBw

Need some help with distant reading? 1
Topic modelling!
“Race cars, lasers, aeroplanes”
That’s a topic 2

It works for fishery 3
And legal history 4
Topic modelling!
Woo-hoo!
Texts are combinations of N topics 5
Woo-hoo!
Find out the proportions with statistics

D-D-Dirichlet allocation 6
Is useful for this situation
How to use it?
Read the ProgHist lesson 7
Woo-hoo!

MALLET gives you keywords in a table
Woo-hoo!
Which you must interpret and hand-label 8
Woo-hoo!

Then run again, with different N 9
To cross-check
Woo-hoo!

Notes

  1. “Distant reading ... is set in opposition to the form of literary analysis known as close reading. Instead of focusing on textual minutiae, distant reading focuses on the generalities of a text or texts, often via computational means.”
    Scott B. Weingart, Susan Grunewald, Matthew Lincoln et al. (eds.). The Digital Humanities Literacy Guidebook. Carnegie Mellon University, updated April 03, 2022. https://cmu-lib.github.io/dhlg/topics/#topic_distantreading
  2. Topic modelling calculates the probability that each word in a text belongs to a topic. A common form of output is the list of top keywords for each topic.
  3. Osmar J Luiz, et al, “Trait-based ecology of fishes: A quantitative assessment of literature trends and knowledge gaps using topic modelling”, Fish and Fisheries (2019) 20, 1100–1110 https://onlinelibrary.wiley.com/doi/abs/10.1111/faf.12399
  4. P. Grajzl and P. Murrell, “Using Topic-Modeling in Legal History, with an Application to Pre-Industrial English Case Law on Finance”. Law and History Review (2022), 40(2), 189–228. https://doi.org/10.1017/S0738248022000153
  5. Topic modelling assumes that each text in a corpus is made up of a mixture of N possible topics. The number N must be chosen by the modeller. Different values for N will result in different topics.
  6. Latent Dirichlet Allocation (LDA) is a popular algorithm for topic modelling. It was introduced in David M. Blei, Andrew Y. Ng, and Michael I. Jordan. “Latent dirichlet allocation”. Journal of Machine Learning Research (2003) 3, 993–1022. https://dl.acm.org/doi/10.5555/944919.944937
  7. The Programming Historian (@ProgHist on Twitter) has a tutorial on how to use MALLET, a popular topic modelling tool. It is not necessary to fully understand the LDA algorithm to use MALLET. Shawn Graham, Scott Weingart, and Ian Milligan, “Getting Started with Topic Modeling and MALLET,” Programming Historian 1 (2012), https://doi.org/10.46430/phen0017.
  8. Each topic in a topic model needs manual evaluation to determine whether it is a coherent topic (not a statistical artifact) and, if so, what it represents.
  9. The optimal number of topics depends on the size of the corpus and on the research question. Although there are statistical methods for choosing a number, it is usually simpler to run a topic model with several different numbers of topics and compare the output. See e.g. Maria Antoniak, Topic Modeling for the People (2022) https://maria-antoniak.github.io/2022/07/27/topic-modeling-for-the-people.html#test-different-numbers-of-topics
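
The workflow described in notes 5, 8 and 9 (choose N, fit a model, read the keywords, then try a different N) can be sketched in a few lines of Python. This is a toy collapsed Gibbs sampler for LDA written for illustration only; it is not MALLET and not the method of the ProgHist lesson. The four-document corpus, the function names (fit_lda, top_keywords, proportions) and the parameter values (alpha, beta, n_iter) are all assumptions made up for this example.

```python
import random
from collections import Counter


def fit_lda(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA (illustrative, not MALLET).

    docs is a list of token lists. Returns (doc_topic, topic_word):
    per-document topic counts and per-topic word counts.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * n_topics for _ in docs]
    topic_word = [Counter() for _ in range(n_topics)]
    topic_total = [0] * n_topics

    # Start by assigning every word token to a random topic.
    assignments = []
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(n_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(zs)

    # Repeatedly resample each token's topic given all other assignments.
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    return doc_topic, topic_word


def top_keywords(topic_word, n=3):
    """The n most frequent words assigned to each topic."""
    return [[w for w, c in counts.most_common(n) if c > 0]
            for counts in topic_word]


def proportions(doc_topic, alpha=0.1):
    """Smoothed topic proportions for each document (each row sums to 1)."""
    rows = []
    for counts in doc_topic:
        total = sum(counts) + alpha * len(counts)
        rows.append([(c + alpha) / total for c in counts])
    return rows


# Invented toy corpus: two "fishery" texts and two "legal history" texts.
docs = [
    "fish trait ecology fish reef trait".split(),
    "reef fish ecology trait fish".split(),
    "court law finance case law court".split(),
    "case law court finance case".split(),
]

# Run again with a different N and compare the keyword tables by eye.
for n in (2, 3):
    doc_topic, topic_word = fit_lda(docs, n_topics=n)
    print(f"N = {n}")
    for k, words in enumerate(top_keywords(topic_word)):
        print(f"  topic {k}: {', '.join(words)}")
```

The printed keyword lists are what the modeller would then interpret and hand-label. On a corpus this small the topics are unstable, so the sketch only shows the mechanics and says nothing about what a sensible N looks like for real data.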