(Related posts: An intro to topic models for text analysis, Making sense of topic models, Overcoming the limitations of topic models with a semi-supervised approach, Interpreting and validating topic models and How keyword oversampling can help with text analysis)
In my final post in this series, I’ll discuss the surprising lessons that we learned from a 2018 Pew Research Center study in which we attempted to use topic models to measure themes in a set of open-ended survey responses about the sources of meaning in Americans’ lives.
But first, a quick recap of the steps we took to carry out that study. To begin, we trained semi-supervised topic models on our data, iteratively refining lists of anchor words until the topics looked coherent. We then interpreted the topics by giving them labels and classified samples of responses ourselves using those labels, ultimately arriving at a set of 24 topics that our researchers could label reliably. Our final step was to compare the topic models' decisions against our own: if the models agreed well enough with our researchers, we hoped to use them to classify our entire dataset of survey responses automatically instead of having to label more responses ourselves.
The results were somewhat disappointing. While the topic models performed well for most topics, they failed to reach acceptable agreement with our researchers (a Cohen's kappa of at least 0.7) on six of our 24 final topics. Despite our best efforts to make them as coherent and interpretable as possible using a semi-supervised approach with custom anchor word lists, the models' output simply did not align with our own interpretation of those six topics.
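For readers unfamiliar with the metric, Cohen's kappa measures how often two raters agree beyond what chance alone would produce. A minimal sketch of the calculation in plain Python is below; the example labels are made up for illustration and are not our actual survey data:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Proportion of items where the two raters agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[c] * counts_b.get(c, 0) for c in counts_a) / n**2
    return (observed - expected) / (1 - expected)

# Hypothetical per-response labels for one topic (1 = topic present).
human = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
model = [1, 0, 1, 0, 0, 0, 1, 1, 1, 0]
print(round(cohens_kappa(human, model), 2))  # prints 0.6
```

In this toy example the raters agree on 8 of 10 responses, but because chance agreement is 0.5, kappa works out to only 0.6 — below the 0.7 threshold we required, even though raw agreement looks high.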