Topic-Oriented Protocol for Content Analysis of Text (TOPCAT)

Citation

An up-to-date publication describing TOPCAT is in preparation. In the meantime, if you use TOPCAT, please cite the following in any reports, presentations, or publications:

@misc{Resnik_TOPCAT_Topic-Oriented_Protocol_2024,
  author = {Resnik, Philip and Ma, Bolei and Hoyle, Alexander and Goel, Pranav and Sarkar, Rupak and Gearing, Maeve and Bruce, Carol and Haensch, Anna-Carolina and Kreuter, Frauke},
  booktitle = {Sixth Workshop on Natural Language Processing and Computational Social Science (NLP+CSS 2024)},
  editor = {Card, Dallas and Field, Anjalie and Hovy, Dirk and Keith, Katherine},
  month = jun,
  publisher = {Association for Computational Linguistics},
  title = {{TOPCAT: Topic-Oriented Protocol for Content Analysis of Text – A Preliminary Study}},
  url = {https://aclanthology.org/2024.nlpcss-1.0/},
  note = "Poster",
  year = {2024}
}

The software

Installing MALLET

Follow the directions at Shawn Graham, Scott Weingart, and Ian Milligan, "Getting Started with Topic Modeling and MALLET," Programming Historian 1 (2012), https://doi.org/10.46430/phen0017.

Installing the TOPCAT code

Prerequisites

Before installing TOPCAT, you need:

  • conda or miniconda: if you don't have conda installed, install Miniconda or the full Anaconda distribution

Installation Steps

Step 1: Clone the repository

git clone https://github.com/psresnik/topcat.git
cd topcat

Step 2: Create conda environment

# Create the topcat environment (includes all dependencies)
conda env create -f code/topcat.yml

Step 3: Install spaCy language model

# Activate the topcat environment and download English language model
conda activate topcat
python -m spacy download en_core_web_sm
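
You can quickly confirm the model is available with a one-off check (run inside the activated topcat environment):

import spacy

# Raises OSError if the model is missing or was not downloaded correctly
nlp = spacy.load("en_core_web_sm")
print("spaCy model loaded:", nlp.meta["name"])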

Step 4: Set up configuration

# Copy the template configuration file
cp templates/config_template.ini config.ini

# Edit config.ini and update relevant variables for your system and analysis

Note that you can give your local configuration file a name other than config.ini; if you do, pass that name to the scripts below using the --config option.

Step 5: Validate installation

# Activate the topcat environment and test that everything is working
conda activate topcat
python validate_installation.py

If you chose a name other than config.ini for your local, analysis-specific configuration file, you can call the validation code this way instead:

python validate_installation.py --config <your_config_file>

If validation passes, you're ready to use TOPCAT!

Configuration

TOPCAT uses a Python driver that reads parameters from a configuration file (default is config.ini).

Key parameters you'll typically need to edit:

Parameter Description
topcatdir Directory containing this TOPCAT repository
malletdir Directory containing your MALLET installation
rootdir Directory where analysis output files will be created
csv Full path to your CSV file containing documents to analyze
textcol Column number containing your text documents (1-indexed: first column = 1)
modelname Name for your analysis (used in output filenames)
granularities Space-separated topic model sizes to try, e.g., 10 20 30

Advanced parameters (usually don't need to change):

Parameter Description
stoplist Stopwords file (defaults to MALLET's English stoplist)
numiterations MALLET training iterations (default: 1000)
maxdocs Maximum documents per topic in curation materials (default: 100)
seed Random seed for reproducible results (default: 13)
debug Enable debug mode (default: false)

For the granularities parameter, choose topic model sizes based on your dataset size. See Guidance on Topic Model Granularity below for recommendations.
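
As an illustration of the format the driver reads, here is a minimal configparser sketch. The section name [topcat] and the sample values are assumptions for illustration only; treat templates/config_template.ini as the authoritative reference for the real section and key names.

import configparser

# Illustrative sample; the [topcat] section name and values are assumptions
sample = """
[topcat]
topcatdir = /path/to/topcat
malletdir = /path/to/mallet
rootdir = /path/to/analysis
csv = /path/to/documents.csv
textcol = 2
modelname = my_analysis
granularities = 10 20 30
debug = false
"""

config = configparser.ConfigParser()
config.read_string(sample)  # for a real run: config.read("config.ini")

params = config["topcat"]
granularities = [int(k) for k in params["granularities"].split()]
print(params["modelname"], granularities)  # my_analysis [10, 20, 30]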

⚠️ Important Note about Re-running Analyses:

When debug = true in your configuration file, TOPCAT will automatically overwrite existing model directories from previous runs. This allows for easy re-running during development and testing. However, be aware that:

  • Re-running the same analysis will replace all previous results
  • Each topic granularity (10, 20, 30, etc.) has separate directories, so they won't interfere with each other
  • Consider setting debug = false to prevent accidental overwrites

Running the driver

The TOPCAT pipeline performs the following steps:

  • Extract and clean documents from your CSV file
  • Apply NLP preprocessing with spaCy (tokenization, phrase detection, stopword removal); a conceptual sketch follows this list
  • Train topic models using MALLET for each specified granularity
  • Generate human curation materials (Excel files, PDF word clouds)
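
To make the preprocessing step concrete, here is a minimal Python sketch. This is not the actual driver code; in particular, the phrase-detection strategy shown (underscore-joining spaCy noun chunks) is an illustrative assumption.

import spacy

# Illustrative sketch only; not the actual TOPCAT preprocessing code
nlp = spacy.load("en_core_web_sm")

def preprocess(text):
    doc = nlp(text)
    # Keep alphabetic, non-stopword tokens, lowercased
    tokens = [t.text.lower() for t in doc if t.is_alpha and not t.is_stop]
    # One possible phrase-detection strategy: join multiword noun chunks
    phrases = ["_".join(c.text.lower().split()) for c in doc.noun_chunks if len(c) > 1]
    return tokens + phrases

print(preprocess("The FDA requested public comments about emergency use authorization."))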

To run TOPCAT:

# Test your configuration first with dry-run mode
python code/driver.py --dry-run --config config.ini

# Run the full analysis
python code/driver.py --config config.ini
# or simply (config.ini is the default):
python code/driver.py

# Safety option: exit if output directories already exist
python code/driver.py --output-safe --config config.ini

What to expect:

  • Processing time: depends on dataset size and number of topics
  • Progress indicators: You'll see preprocessing progress and MALLET progress updates
  • Output: Files will be created in your configured output directory

Safety options:

  • --output-safe: Exit if output directories already exist (safer behavior for production runs)
  • Default behavior: Will overwrite existing directories in debug mode, exit in production mode

What the automatic processing produces

In the output directory specified in your configuration (rootdir), you will find one subdirectory per granularity listed in granularities. In each subdirectory you will find the following three files, which are used during the human curation process.

Output file Description
GRANULARITY_categories.xlsx Top-words bar-chart and top documents for each topic
GRANULARITY_clouds.pdf Cloud representation for each topic
GRANULARITY_alldocs.xlsx Document-topic distribution with one document per row (in the text column)
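
Once a run finishes, you can inspect these spreadsheets programmatically. A minimal sketch, assuming pandas and openpyxl are available and a K=20 model was trained (the analysis/out/20/ path follows the default output location and the naming pattern above; adjust for your configuration):

import pandas as pd

# One row per document; columns include the text column and topic proportions
df = pd.read_excel("analysis/out/20/20_alldocs.xlsx")
print(df.shape)
print(list(df.columns)[:5])  # inspect the column layout before further analysis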

Example run

In the example directory, you'll find a smaller (2K documents) dataset and a larger (10K documents) dataset sampled from public comments that were submitted to the U.S. Food and Drug Administration (FDA) in response to a 2021 request for public comments about emergency use authorization for a child COVID-19 vaccine. Note that some comments can contain upsetting language.

(Some research using these broader public comments was published in Alexander Hoyle, Rupak Sarkar, Pranav Goel, and Philip Resnik. 2023. Natural Language Decompositions of Implicit Content Enable Better Text Representations. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 13188–13214, Singapore. Association for Computational Linguistics. Note, however, that neither of these datasets exactly matches the data used in that paper.)

By default (as specified in templates/config_template.ini), the configuration runs on the 10K dataset. You can also modify your config to use the 2K dataset instead.

To run the example:

python code/driver.py --config config.ini
# or simply (config.ini is the default):
python code/driver.py

This will process the example dataset and create topic models with granularities of 10, 20, and 30 topics (as specified in the default configuration).

Expected outputs:

  • Processing time: ~5 minutes for the 10K example dataset on a 2021 M1 Mac
  • Output location: In your configured output directory (default: analysis/out/)
  • Files created: Excel files and PDF word clouds for human curation

Validation: You can compare your results with the reference output in the example/ directory. Results won't be identical due to the randomness in topic modeling, but topic themes should be similar.

Troubleshooting: If you encounter issues, see INSTALL_TROUBLESHOOTING.md for solutions to common problems.

Note: The original comments are publicly available here. Some comments may contain upsetting language or content.

The human process

Selecting a model as the starting point for human curation

See these instructions for model selection.

Curating the model to build a coding scheme

There are two steps in model curation.

Independent coding scheme creation. First, two independent analysts familiar with the subject matter (which we often refer to as subject matter experts or SMEs) go through the process for reviewing and labeling categories in these instructions. This can be viewed as having the SMEs independently engage in coding scheme/category creation guided by the bottom-up topic model analysis.

Creating a consensus coding scheme. Second, analysts look at the two independently created sets of categories, following these instructions, in order to arrive at a consensus set of categories. This can be done by two other SMEs, or by the same SMEs who worked independently in the previous step. (Note: the consensus instructions have not yet been updated to be consistent with the most recent versions of file names, etc.)

The end result of this curation process is a set of categories and descriptions guided by an automatic, scalable, bottom-up process that minimizes human bias while still retaining human quality control.

Obtaining representative documents ("verbatims") for a code

It is often useful to select a set of good examples for codes in a coding scheme. This is straightforward using the files already created by the TOPCAT process. In the materials used for human curation, each automatically created topic was accompanied by a set of its "top" documents. These can be considered a set of ranked candidates for verbatims for the code created using that topic.
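
If you prefer to pull ranked candidates programmatically rather than from the curation materials, a hypothetical sketch along these lines can work. The column names "text" and "topic_7" are assumptions, so check the actual headers in your GRANULARITY_alldocs.xlsx first.

import pandas as pd

df = pd.read_excel("analysis/out/20/20_alldocs.xlsx")
# Rank documents by their weight on one topic; the highest-weighted documents
# are the candidate verbatims for the code built from that topic
candidates = df.sort_values("topic_7", ascending=False).head(10)
print(candidates[["text", "topic_7"]])  # column names are assumptions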

Guidance on topic model granularity

Topic models require you to specify in advance the number of categories you would like to automatically create, which we will refer to as the granularity of the model; in the literature this value is conventionally referred to as K.

The best granularity varies from analysis to analysis, and at present there are no fully reliable methods to optimize that number for any given collection of text (although we're working on that). For now, the TOPCAT approach involves running multiple models at different granularities and an efficient human-centered process for selecting which one is the best starting point for more detailed curation.

We generally recommend creating three (or at most five) models with different granularities. These are the heuristics we generally follow:

  • Fewer than 500 documents: we would typically try K=5,10,15, though note that LDA may or may not produce anything of use at all for collections that small

  • 500 to 1,000 documents: K=10,15,20 or 10,20,30

  • 1,000 to 10,000 documents: K=15,20,40 or 20,30,50

  • 10,000 to 200,000 documents: K=75,100,150

These recommendations are anecdotally consistent with what we have heard from a number of other frequent topic model practitioners. Crucially, the human curation process reduces the need to treat any particular model size as optimal; in general we tend to err mildly on the side of more rather than fewer topics, since our process permits less-good topics to be discarded, and fine-grained topics can be merged under a single label and description.
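
For convenience, the heuristics above can be encoded as a small helper. This is not part of TOPCAT itself, just a restatement of the rules of thumb.

def suggest_granularities(num_docs):
    """Suggest topic-model sizes (K) to try, following the heuristics above."""
    if num_docs < 500:
        return [5, 10, 15]     # LDA may produce nothing useful at this scale
    if num_docs <= 1000:
        return [10, 20, 30]    # or [10, 15, 20]
    if num_docs <= 10000:
        return [20, 30, 50]    # or [15, 20, 40]
    return [75, 100, 150]      # for 10,000 to 200,000 documents

print(suggest_granularities(2000))  # [20, 30, 50]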
