An up-to-date publication for TOPCAT is in preparation. In the meantime, if you use TOPCAT, please cite the following in any reports, presentations, or publications:
@misc{Resnik_TOPCAT_Topic-Oriented_Protocol_2024,
  author    = {Resnik, Philip and Ma, Bolei and Hoyle, Alexander and Goel, Pranav and Sarkar, Rupak and Gearing, Maeve and Bruce, Carol and Haensch, Anna-Carolina and Kreuter, Frauke},
  booktitle = {Sixth Workshop on Natural Language Processing and Computational Social Science (NLP+CSS 2024)},
  editor    = {Card, Dallas and Field, Anjalie and Hovy, Dirk and Keith, Katherine},
  month     = jun,
  publisher = {Association for Computational Linguistics},
  title     = {{TOPCAT: Topic-Oriented Protocol for Content Analysis of Text – A Preliminary Study}},
  url       = {https://aclanthology.org/2024.nlpcss-1.0/},
  note      = {Poster},
  year      = {2024}
}
Before installing TOPCAT, you need:
- A working MALLET installation. Follow the directions at Shawn Graham, Scott Weingart, and Ian Milligan, "Getting Started with Topic Modeling and MALLET," Programming Historian 1 (2012), https://doi.org/10.46430/phen0017. (A minimal verification sketch follows this list.)
- conda, which is used in Step 2 to create the TOPCAT environment.
- git, which is used in Step 1 to clone the repository.
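As a rough illustration of the MALLET prerequisite, a setup might look like the sketch below. The download location, version, and paths are assumptions; follow the Programming Historian guide above for your platform, and remember the resulting directory, since it becomes malletdir in your TOPCAT configuration.

```bash
# Illustrative sketch only; follow the Programming Historian directions above.
# Assumes you have downloaded mallet-2.0.8.zip and have a Java runtime installed.
unzip mallet-2.0.8.zip -d ~/

# Running the mallet command with no arguments should list the available MALLET
# commands, confirming that the installation works:
~/mallet-2.0.8/bin/mallet

# Use this directory as the malletdir value in your TOPCAT config.ini later.
```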
Step 1: Clone the repository

```bash
git clone https://github.com/psresnik/topcat.git
cd topcat
```

Step 2: Create conda environment

```bash
# Create the topcat environment (includes all dependencies)
conda env create -f code/topcat.yml
```

Step 3: Install spaCy language model

```bash
# Activate the topcat environment and download the English language model
conda activate topcat
python -m spacy download en_core_web_sm
```

Step 4: Set up configuration

```bash
# Copy the template configuration file
cp templates/config_template.ini config.ini
# Edit config.ini and update relevant variables for your system and analysis
```

Note that you can give your local, analysis-specific configuration file a name other than config.ini; see Step 5 for how to pass that name explicitly.

Step 5: Validate installation

```bash
# Activate the topcat environment and test that everything is working
conda activate topcat
python validate_installation.py
```

If you chose a name other than config.ini for your local, analysis-specific configuration file, you can call the validation code this way instead:

```bash
python validate_installation.py --config <your_config_file>
```

If validation passes, you're ready to use TOPCAT!
TOPCAT uses a Python driver that reads parameters from a configuration file (default is config.ini).
Key parameters you'll typically need to edit:
| Parameter | Description |
|---|---|
| topcatdir | Directory containing this TOPCAT repository |
| malletdir | Directory containing your MALLET installation |
| rootdir | Directory where analysis output files will be created |
| csv | Full path to your CSV file containing the documents to analyze |
| textcol | Column number containing your text documents (1-indexed: first column = 1) |
| modelname | Name for your analysis (used in output filenames) |
| granularities | Space-separated topic model sizes to try, e.g. 10 20 30 |
Advanced parameters (usually don't need to change):
| Parameter | Description |
|---|---|
| stoplist | Stopwords file (defaults to MALLET's English stoplist) |
| numiterations | MALLET training iterations (default: 1000) |
| maxdocs | Maximum documents per topic in curation materials (default: 100) |
| seed | Random seed for reproducible results (default: 13) |
| debug | Enable debug mode (default: false) |
For the granularities parameter, choose topic model sizes based on your dataset size. See Guidance on Topic Model Granularity below for recommendations.
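For reference, a filled-in configuration might look like the sketch below. The paths, values, and section header are illustrative assumptions; start from templates/config_template.ini, which reflects the actual expected structure.

```ini
# Illustrative config.ini; all paths and values here are placeholders.
[DEFAULT]
topcatdir     = /home/me/topcat
malletdir     = /home/me/mallet-2.0.8
rootdir       = /home/me/topcat/analysis/out
csv           = /home/me/data/comments.csv
textcol       = 3
modelname     = fda_comments
granularities = 10 20 30
```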
When debug = true in your configuration file, TOPCAT will automatically overwrite existing model directories from previous runs. This allows for easy re-running during development and testing. However, be aware that:
- Re-running the same analysis will replace all previous results
- Each topic granularity (10, 20, 30, etc.) has separate directories, so they won't interfere with each other
- Consider setting debug = false to prevent accidental overwrites
The TOPCAT pipeline performs the following steps:
- Extract and clean documents from your CSV file
- Apply NLP preprocessing with spaCy (tokenization, phrase detection, stopword removal)
- Train topic models using MALLET for each specified granularity
- Generate human curation materials (Excel files, PDF word clouds)
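To give a feel for what the spaCy preprocessing step does, here is a minimal illustrative sketch (not TOPCAT's actual code); it shows tokenization, lowercasing, and stopword removal, and omits the phrase detection that TOPCAT also performs.

```python
# Illustrative only: a rough approximation of spaCy-based preprocessing.
import spacy

nlp = spacy.load("en_core_web_sm")  # the model installed in Step 3

def preprocess(text: str) -> str:
    doc = nlp(text)
    # Keep lowercased alphabetic tokens that are not stopwords
    tokens = [tok.text.lower() for tok in doc if tok.is_alpha and not tok.is_stop]
    return " ".join(tokens)

print(preprocess("The FDA requested public comments about the vaccine authorization."))
```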
To run TOPCAT:
```bash
# Test your configuration first with dry-run mode
python code/driver.py --dry-run --config config.ini

# Run the full analysis
python code/driver.py --config config.ini
# or simply (config.ini is the default):
python code/driver.py

# Safety option: exit if output directories already exist
python code/driver.py --output-safe --config config.ini
```

What to expect:
- Processing time: depends on dataset size and number of topics
- Progress indicators: You'll see preprocessing progress and MALLET progress updates
- Output: Files will be created in your configured output directory
Safety options:
- --output-safe: Exit if output directories already exist (safer behavior for production runs)
- Default behavior: Will overwrite existing directories in debug mode, exit in production mode
In the output directory you specified (rootdir in your configuration file), you will find one subdirectory per granularity listed in granularities. In each subdirectory you will find the following three files, which are used during the human curation process.
| Output file | Description |
|---|---|
| GRANULARITY_categories.xlsx | Top-words bar-chart and top documents for each topic |
| GRANULARITY_clouds.pdf | Cloud representation for each topic |
| GRANULARITY_alldocs.xlsx | Document-topic distribution with one document per row (in the text column) |
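If you want to inspect these outputs programmatically rather than in Excel, something like the following sketch works. The subdirectory name, file name, and column names here are assumptions, so check the headers in your actual output files first.

```python
# Illustrative sketch (not part of TOPCAT): load the document-topic spreadsheet
# for a 20-topic model and rank documents by their weight on one topic.
import pandas as pd

# Path assumes rootdir contains a per-granularity subdirectory; adjust to match
# your configuration and actual output layout.
df = pd.read_excel("analysis/out/20/20_alldocs.xlsx")
print(df.columns.tolist())  # check the real column names before going further

# Hypothetical column names: "text" for the document, "topic_5" for one topic's weight
top_docs = df.sort_values("topic_5", ascending=False).head(10)
print(top_docs[["text", "topic_5"]])
```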
In the example directory, you'll find a smaller (2K documents) dataset and a larger (10K documents) dataset sampled from public comments that were submitted to the U.S. Food and Drug Administration (FDA) in response to a 2021 request for public comments about emergency use authorization for a child COVID-19 vaccine. Note that some comments can contain upsetting language.
(Some research using these broader public comments was published in Alexander Hoyle, Rupak Sarkar, Pranav Goel, and Philip Resnik. 2023. Natural Language Decompositions of Implicit Content Enable Better Text Representations. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 13188–13214, Singapore. Association for Computational Linguistics. However, note that neither of these datasets exactly matches the data used in that paper.)
By default (as specified in templates/config.ini), the configuration will run on the 10K dataset. You can also modify your config to use the 2K dataset instead.
To run the example:
```bash
python code/driver.py --config config.ini
# or simply (config.ini is the default):
python code/driver.py
```

This will process the example dataset and create topic models with granularities of 10, 20, and 30 topics (as specified in the default configuration).
Expected outputs:
- Processing time: ~5 minutes for the 10K example dataset on a 2021 M1 Mac
- Output location: In your configured output directory (default: analysis/out/)
- Files created: Excel files and PDF word clouds for human curation
Validation: You can compare your results with the reference output in the example/ directory. Results won't be identical due to the randomness in topic modeling, but topic themes should be similar.
Troubleshooting: If you encounter issues, see INSTALL_TROUBLESHOOTING.md for solutions to common problems.
Note: The original comments are publicly available here. Some comments may contain upsetting language or content.
See these instructions for model selection.
There are two steps in model curation.
Independent coding scheme creation. First, two independent analysts familiar with the subject matter (which we often refer to as subject matter experts or SMEs) go through the process for reviewing and labeling categories in these instructions. This can be viewed as having the SMEs independently engage in coding scheme/category creation guided by the bottom-up topic model analysis.
Creating a consensus coding scheme. Second, analysts look at the two independently-created sets of categories, following these instructions in order to arrive at a consensus set of categories. These can be two other SMEs, or it can be the SMEs who worked independently in the previous step. (Note: the consensus instructions have not yet been updated to be consistent with the most recent versions of file names, etc.)
The end result of this curation process is a set of categories and descriptions that have been guided via an automatic, scalable process that is bottom-up and thus minimizes human bias, while still retaining human quality control.
It is often useful to select a set of good examples for codes in a coding scheme. This is straightforward using the files already created by the TOPCAT process. In the materials used for human curation, each automatically created topic was accompanied by a set of its "top" documents. These can be considered a set of ranked candidates for verbatims for the code created using that topic.
Topic models require you to specify in advance the number of categories you would like to automatically create, which we will refer to as the granularity of the model; in the literature this value is conventionally referred to as K.
The best granularity varies from analysis to analysis, and at present there are no fully reliable methods to optimize that number for any given collection of text (although we're working on that). For now, the TOPCAT approach involves running multiple models at different granularities and an efficient human-centered process for selecting which one is the best starting point for more detailed curation.
We generally recommend creating three (or at most five) models with different granularities. These are the heuristics we generally follow:
- For a document collection with fewer than 500 documents, we would typically try K=5,10,15, though note that LDA may or may not produce anything of use at all for collections that small.
- For 500 to 1,000 documents: K=10,15,20 or 10,20,30
- For 1,000 to 10,000 documents: K=15,20,40 or 20,30,50
- For 10,000 to 200,000 documents: K=75,100,150
These recommendations are anecdotally consistent with what we have heard from a number of other frequent topic model practitioners. Crucially, the human curation process reduces the burden to view any particular model size as optimal; in general we tend to err mildly on the side of more rather than fewer topics, since our process permits less-good topics to be discarded, and fine-grained topics can be merged under a single label and description.
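In practice these heuristics translate into a single granularities line in your configuration file; the sketch below simply restates the recommendations above (keep only the line matching your corpus size uncommented).

```ini
# Fewer than 500 documents:
# granularities = 5 10 15
# 500 to 1,000 documents:
# granularities = 10 20 30
# 1,000 to 10,000 documents:
# granularities = 20 30 50
# 10,000 to 200,000 documents:
granularities = 75 100 150
```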