Topic-Oriented Protocol for Content Analysis of Text (TOPCAT)

Citation

An up-to-date publication describing TOPCAT is in preparation. In the meantime, if you use TOPCAT, please cite the following in any reports, presentations, or publications:

@misc{Resnik_TOPCAT_Topic-Oriented_Protocol_2024,
  author = {Resnik, Philip and Ma, Bolei and Hoyle, Alexander and Goel, Pranav and Sarkar, Rupak and Gearing, Maeve and Bruce, Carol and Haensch, Anna-Carolina and Kreuter, Frauke},
  booktitle = {Sixth Workshop on Natural Language Processing and Computational Social Science (NLP+CSS 2024)},
  editor = {Card, Dallas and Field, Anjalie and Hovy, Dirk and Keith, Katherine},
  month = jun,
  publisher = {Association for Computational Linguistics},
  title = {{TOPCAT: Topic-Oriented Protocol for Content Analysis of Text – A Preliminary Study}},
  url = {https://aclanthology.org/2024.nlpcss-1.0/},
  note = "Poster",
  year = {2024}
}

The software

Installing MALLET

Follow the directions at Shawn Graham, Scott Weingart, and Ian Milligan, "Getting Started with Topic Modeling and MALLET," Programming Historian 1 (2012), https://doi.org/10.46430/phen0017.

Installing the TOPCAT code

Prerequisites

Before installing TOPCAT, you need:

  • conda or miniconda: if you don't have conda installed, install Miniconda or the full Anaconda distribution

Installation Steps

Step 1: Clone the repository

git clone https://github.com/psresnik/topcat.git
cd topcat

Step 2: Create conda environment

# Create the topcat environment (includes all dependencies)
conda env create -f code/topcat.yml

Step 3: Install spaCy language model

# Activate the topcat environment and download English language model
conda activate topcat
python -m spacy download en_core_web_sm
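
You can quickly confirm the model is available with a one-off check (run inside the activated topcat environment):

import spacy

# Raises OSError if the model is missing or was not downloaded correctly
nlp = spacy.load("en_core_web_sm")
print("spaCy model loaded:", nlp.meta["name"])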

Step 4: Set up configuration

# Copy the template configuration file
cp templates/config_template.ini config.ini

# Edit config.ini and update relevant variables for your system and analysis

Note that you can give your local configuration file a name other than config.ini; if you do, pass that name to the scripts below using the --config option.

Step 5: Validate installation

# Activate the topcat environment and test that everything is working
conda activate topcat
python validate_installation.py

If you chose a name other than config.ini for your local, analysis-specific configuration file, you can call the validation code this way instead:

python validate_installation.py --config <your_config_file>

If validation passes, you're ready to use TOPCAT!

Configuration

TOPCAT uses a Python driver that reads parameters from a configuration file (default is config.ini).

Key parameters you'll typically need to edit:

Parameter Description
topcatdir Directory containing this TOPCAT repository
malletdir Directory containing your MALLET installation
rootdir Directory where analysis output files will be created
csv Full path to your CSV file containing documents to analyze
textcol Column number containing your text documents (1-indexed: first column = 1)
modelname Name for your analysis (used in output filenames)
granularities Space-separated topic model sizes to try, e.g., 10 20 30

Advanced parameters (usually don't need to change):

Parameter Description
stoplist Stopwords file (defaults to MALLET's English stoplist)
numiterations MALLET training iterations (default: 1000)
maxdocs Maximum documents per topic in curation materials (default: 100)
seed Random seed for reproducible results (default: 13)
debug Enable debug mode (default: false)

For the granularities parameter, choose topic model sizes based on your dataset size. See Guidance on Topic Model Granularity below for recommendations.
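
As an illustration of the format the driver reads, here is a minimal configparser sketch. The section name [topcat] and the sample values are assumptions for illustration only; treat templates/config_template.ini as the authoritative reference for the real section and key names.

import configparser

# Illustrative sample; the [topcat] section name and values are assumptions
sample = """
[topcat]
topcatdir = /path/to/topcat
malletdir = /path/to/mallet
rootdir = /path/to/analysis
csv = /path/to/documents.csv
textcol = 2
modelname = my_analysis
granularities = 10 20 30
debug = false
"""

config = configparser.ConfigParser()
config.read_string(sample)  # for a real run: config.read("config.ini")

params = config["topcat"]
granularities = [int(k) for k in params["granularities"].split()]
print(params["modelname"], granularities)  # my_analysis [10, 20, 30]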

⚠️ Important Note about Re-running Analyses:

When debug = true in your configuration file, TOPCAT will automatically overwrite existing model directories from previous runs. This allows for easy re-running during development and testing. However, be aware that:

  • Re-running the same analysis will replace all previous results
  • Each topic granularity (10, 20, 30, etc.) has separate directories, so they won't interfere with each other
  • Consider setting debug = false to prevent accidental overwrites

Running the driver

The TOPCAT pipeline performs the following steps:

  • Extract and clean documents from your CSV file
  • Apply NLP preprocessing with spaCy (tokenization, phrase detection, stopword removal); a conceptual sketch follows this list
  • Train topic models using MALLET for each specified granularity
  • Generate human curation materials (Excel files, PDF word clouds)
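
To make the preprocessing step concrete, here is a minimal Python sketch. This is not the actual driver code; in particular, the phrase-detection strategy shown (underscore-joining spaCy noun chunks) is an illustrative assumption.

import spacy

# Illustrative sketch only; not the actual TOPCAT preprocessing code
nlp = spacy.load("en_core_web_sm")

def preprocess(text):
    doc = nlp(text)
    # Keep alphabetic, non-stopword tokens, lowercased
    tokens = [t.text.lower() for t in doc if t.is_alpha and not t.is_stop]
    # One possible phrase-detection strategy: join multiword noun chunks
    phrases = ["_".join(c.text.lower().split()) for c in doc.noun_chunks if len(c) > 1]
    return tokens + phrases

print(preprocess("The FDA requested public comments about emergency use authorization."))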

To run TOPCAT:

# Test your configuration first with dry-run mode
python code/driver.py --dry-run --config config.ini

# Run the full analysis
python code/driver.py --config config.ini
# or simply (config.ini is the default):
python code/driver.py

# Safety option: exit if output directories already exist
python code/driver.py --output-safe --config config.ini

What to expect:

  • Processing time: depends on dataset size and number of topics
  • Progress indicators: You'll see preprocessing progress and MALLET progress updates
  • Output: Files will be created in your configured output directory

Safety options:

  • --output-safe: Exit if output directories already exist (safer behavior for production runs)
  • Default behavior: Will overwrite existing directories in debug mode, exit in production mode

What the automatic processing produces

In the output directory specified in your configuration (rootdir), you will find one subdirectory per granularity listed in granularities. In each subdirectory you will find the following three files, which are used during the human curation process.

Output file Description
GRANULARITY_categories.xlsx Top-words bar-chart and top documents for each topic
GRANULARITY_clouds.pdf Cloud representation for each topic
GRANULARITY_alldocs.xlsx Document-topic distribution with one document per row (in the text column)
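
Once a run finishes, you can inspect these spreadsheets programmatically. A minimal sketch, assuming pandas and openpyxl are available and a K=20 model was trained (the analysis/out/20/ path follows the default output location and the naming pattern above; adjust for your configuration):

import pandas as pd

# One row per document; columns include the text column and topic proportions
df = pd.read_excel("analysis/out/20/20_alldocs.xlsx")
print(df.shape)
print(list(df.columns)[:5])  # inspect the column layout before further analysis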

Example run

In the example directory, you'll find a smaller (2K documents) dataset and a larger (10K documents) dataset sampled from public comments that were submitted to the U.S. Food and Drug Administration (FDA) in response to a 2021 request for public comments about emergency use authorization for a child COVID-19 vaccine. Note that some comments can contain upsetting language.

(Some research using these broader public comments was published in Alexander Hoyle, Rupak Sarkar, Pranav Goel, and Philip Resnik. 2023. Natural Language Decompositions of Implicit Content Enable Better Text Representations. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 13188–13214, Singapore. Association for Computational Linguistics. Note, however, that neither of these datasets exactly matches the data used in that paper.)

By default (as specified in templates/config_template.ini), the configuration runs on the 10K dataset. You can also modify your config to use the 2K dataset instead.

To run the example:

python code/driver.py --config config.ini
# or simply (config.ini is the default):
python code/driver.py

This will process the example dataset and create topic models with granularities of 10, 20, and 30 topics (as specified in the default configuration).

Expected outputs:

  • Processing time: ~5 minutes for the 10K example dataset on a 2021 M1 Mac
  • Output location: In your configured output directory (default: analysis/out/)
  • Files created: Excel files and PDF word clouds for human curation

Validation: You can compare your results with the reference output in the example/ directory. Results won't be identical due to the randomness in topic modeling, but topic themes should be similar.

Troubleshooting: If you encounter issues, see INSTALL_TROUBLESHOOTING.md for solutions to common problems.

Note: The original comments are publicly available here. Some comments may contain upsetting language or content.

The human process

Selecting a model as the starting point for human curation

See these instructions for model selection.

Curating the model to build a coding scheme

There are two steps in model curation.

Independent coding scheme creation. First, two independent analysts familiar with the subject matter (which we often refer to as subject matter experts or SMEs) go through the process for reviewing and labeling categories in these instructions. This can be viewed as having the SMEs independently engage in coding scheme/category creation guided by the bottom-up topic model analysis.

Creating a consensus coding scheme. Second, analysts look at the two independently created sets of categories, following these instructions, in order to arrive at a consensus set of categories. This can be done by two other SMEs, or by the same SMEs who worked independently in the previous step. (Note: the consensus instructions have not yet been updated to be consistent with the most recent versions of file names, etc.)

The end result of this curation process is a set of categories and descriptions guided by an automatic, scalable, bottom-up process that minimizes human bias while still retaining human quality control.

Obtaining representative documents ("verbatims") for a code

It is often useful to select a set of good examples for codes in a coding scheme. This is straightforward using the files already created by the TOPCAT process. In the materials used for human curation, each automatically created topic was accompanied by a set of its "top" documents. These can be considered a set of ranked candidates for verbatims for the code created using that topic.
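
If you prefer to pull ranked candidates programmatically rather than from the curation materials, a hypothetical sketch along these lines can work. The column names "text" and "topic_7" are assumptions, so check the actual headers in your GRANULARITY_alldocs.xlsx first.

import pandas as pd

df = pd.read_excel("analysis/out/20/20_alldocs.xlsx")
# Rank documents by their weight on one topic; the highest-weighted documents
# are the candidate verbatims for the code built from that topic
candidates = df.sort_values("topic_7", ascending=False).head(10)
print(candidates[["text", "topic_7"]])  # column names are assumptions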

Guidance on topic model granularity

Topic models require you to specify in advance the number of categories you would like to automatically create, which we will refer to as the granularity of the model; in the literature this value is conventionally referred to as K.

The best granularity varies from analysis to analysis, and at present there are no fully reliable methods to optimize that number for any given collection of text (although we're working on that). For now, the TOPCAT approach involves running multiple models at different granularities and an efficient human-centered process for selecting which one is the best starting point for more detailed curation.

We generally recommend creating three (or at most five) models with different granularities. These are the heuristics we generally follow:

  • Fewer than 500 documents: we would typically try K=5,10,15, though note that LDA may or may not produce anything of use at all for collections that small

  • 500 to 1,000 documents: K=10,15,20 or 10,20,30

  • 1,000 to 10,000 documents: K=15,20,40 or 20,30,50

  • 10,000 to 200,000 documents: K=75,100,150

These recommendations are anecdotally consistent with what we have heard from a number of other frequent topic model practitioners. Crucially, the human curation process reduces the need to treat any particular model size as optimal; in general we tend to err mildly on the side of more rather than fewer topics, since our process permits less-good topics to be discarded, and fine-grained topics can be merged under a single label and description.
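
For convenience, the heuristics above can be encoded as a small helper. This is not part of TOPCAT itself, just a restatement of the rules of thumb.

def suggest_granularities(num_docs):
    """Suggest topic-model sizes (K) to try, following the heuristics above."""
    if num_docs < 500:
        return [5, 10, 15]     # LDA may produce nothing useful at this scale
    if num_docs <= 1000:
        return [10, 20, 30]    # or [10, 15, 20]
    if num_docs <= 10000:
        return [20, 30, 50]    # or [15, 20, 40]
    return [75, 100, 150]      # for 10,000 to 200,000 documents

print(suggest_granularities(2000))  # [20, 30, 50]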
