This project investigates how gendered and racialized stereotypes are produced, reinforced, or contested through the metadata of online pornographic content. By analyzing a large-scale dataset of video titles and tags from PornHub, the goal is to examine how classificatory systems, such as tags and titles, participate in the construction and circulation of stereotypes. In this way, we can better understand how stereotypes are not only embedded in video content but also reinforced through digital infrastructures and linguistic classifications.
This project analyzes metadata from over 250,000 PornHub videos published between 2008 and 2024, focusing on titles and tags as classificatory tools that shape content visibility and access. The dataset also includes metadata such as:
- Publication date
- View counts
- Upvotes
- Production information
- Actor details
- Categories
Standard text preprocessing steps were applied to ensure data quality and consistency:
- Tokenization of titles and tags
- Lowercasing of text
- Removal of stopwords, special characters, and non-informative tokens
- Deduplication of near-identical entries
These steps ensured that the analysis focused on linguistically meaningful content.
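The preprocessing steps above can be sketched as follows. This is a minimal illustration using only the standard library, not the project's actual pipeline: the stopword list, the function names `preprocess` and `deduplicate`, and the choice to treat two entries as near-identical when their normalized token sequences match are all assumptions.

```python
import re

# Small illustrative stopword list (assumption; the project may use a fuller one).
STOPWORDS = {"a", "an", "and", "the", "of", "in", "on", "with", "to", "for"}


def preprocess(text):
    """Lowercase, tokenize on alphanumeric runs, and drop stopwords."""
    # Lowercasing first means the pattern only needs to match a-z and digits;
    # special characters and punctuation are discarded by the tokenizer.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


def deduplicate(texts):
    """Keep the first entry for each distinct normalized token sequence."""
    seen = set()
    unique = []
    for text in texts:
        key = tuple(preprocess(text))
        # Skip entries that normalize to nothing or repeat an earlier entry.
        if key and key not in seen:
            seen.add(key)
            unique.append(text)
    return unique


if __name__ == "__main__":
    print(preprocess("The Quick, Brown FOX!"))
    print(deduplicate(["The Red Car", "red car!!", "Blue car"]))
```

Matching on the normalized token sequence only catches entries that differ in case, punctuation, or stopwords; a fuzzier similarity measure would be needed to catch looser duplicates.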
We adapted short-text NLP techniques to analyze the structural and semantic properties of the metadata:
- Part-of-Speech (POS) Tagging: Identifies grammatical roles and recurring syntactic patterns.
- Dependency Parsing: Examines how gendered and racialized terms are positioned within phrases, revealing structural relationships.
- Bigram and Trigram Modeling: Captures frequently occurring word sequences and stereotypical expressions.
- Co-occurrence Analysis: Tracks patterns between gendered and racialized terms across time and content categories.
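As a rough illustration of the n-gram and co-occurrence techniques listed above, the sketch below counts word sequences and cross-group term pairs over already-tokenized titles. The function names, the placeholder tokens, and the two term groups are hypothetical; POS tagging and dependency parsing would require an NLP library and are not shown here.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-token sequences in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_ngrams(docs, n):
    """Count n-grams across a collection of tokenized documents."""
    counts = Counter()
    for tokens in docs:
        counts.update(ngrams(tokens, n))
    return counts


def cooccurrence(docs, group_a, group_b):
    """Count documents in which a term from group_a appears with a term from group_b."""
    counts = Counter()
    for tokens in docs:
        present = set(tokens)  # each pair is counted once per document
        for a in present & group_a:
            for b in present & group_b:
                counts[(a, b)] += 1
    return counts


if __name__ == "__main__":
    # Placeholder tokens standing in for preprocessed titles (assumption).
    docs = [["alpha", "beta", "gamma"], ["alpha", "beta", "delta"]]
    print(count_ngrams(docs, 2).most_common(3))
    print(cooccurrence(docs, {"alpha"}, {"gamma", "delta"}))
```

To track patterns over time or by category, the same counting could be applied separately to the documents in each publication year or category.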
Python >=3.9 and <3.13 is required. The Python environment can be isolated using venv:

    python -m venv venv
    source venv/bin/activate
    pip install -r requirements.txt

We expect the data to be in a single CSV file, with each row describing a single video and columns containing metadata such as categories, upvotes, downvotes, and views.
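A file in this layout could be read along the following lines. This is only a sketch: the column names (`title`, `categories`, `upvotes`, `downvotes`, `views`), the semicolon-separated category field, and the function name `load_videos` are assumptions for illustration, so adjust them to match the actual header of your CSV.

```python
import csv
import io

# Columns assumed to hold integer counts (hypothetical names).
NUMERIC_COLUMNS = ("upvotes", "downvotes", "views")


def load_videos(handle):
    """Read one video per CSV row and convert non-empty count columns to int."""
    rows = []
    for row in csv.DictReader(handle):
        for col in NUMERIC_COLUMNS:
            if row.get(col):  # missing or empty values are left unchanged
                row[col] = int(row[col])
        rows.append(row)
    return rows


# In-memory sample standing in for the real file (made-up values).
sample = io.StringIO(
    "title,categories,upvotes,downvotes,views\n"
    "example title,cat_a;cat_b,10,2,1500\n"
)
videos = load_videos(sample)

if __name__ == "__main__":
    print(videos[0])
```

For a file on disk, pass an open handle instead, for example `open(path, newline="", encoding="utf-8")`, where `path` points to your CSV.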
