GitHub - ptypes-nlesc/stereotype-map: Mapping videos and predefined stereotypes using word embeddings

💡 Motivation

This project investigates how gendered and racialized stereotypes are produced, reinforced, or contested through the metadata of online pornographic content. By analyzing a large-scale dataset of video titles and tags from PornHub, it examines how classificatory systems, such as tags and titles, participate in the construction and circulation of stereotypes. This helps show that stereotypes are not only embedded in video content but also reinforced through digital infrastructures and linguistic classifications.

🧠 Methodology Overview

This project analyzes metadata from over 250,000 PornHub videos published between 2008 and 2024, focusing on titles and tags as classificatory tools that shape content visibility and access. The dataset also includes additional metadata such as:

Publication date
View counts
Upvotes
Production information
Actor details
Categories

🔧 Preprocessing

Standard text preprocessing steps were applied to ensure data quality and consistency:

Tokenization of titles and tags
Lowercasing of text
Removal of stopwords, special characters, and non-informative tokens
Deduplication of near-identical entries

These steps ensured that the analysis focused on linguistically meaningful content.
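
The steps above can be sketched with the standard library alone. This is a minimal illustration, not the project's actual pipeline: the function names `preprocess` and `deduplicate`, the tiny `STOPWORDS` set, and the choice to treat entries with the same set of tokens as near-identical are all assumptions made for the example.

```python
import re

# Hypothetical, deliberately tiny stopword list; a real run would use a fuller one.
STOPWORDS = {"a", "an", "and", "the", "of", "in", "on", "with", "for", "to"}


def preprocess(text):
    """Lowercase, replace special characters with spaces, tokenize, drop stopwords."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return [tok for tok in text.split() if tok not in STOPWORDS]


def deduplicate(token_lists):
    """Keep the first entry for each distinct sorted tuple of tokens.

    Assumption: two entries count as near-identical when they contain the
    same tokens regardless of order. Other similarity criteria are possible.
    """
    seen = set()
    unique = []
    for tokens in token_lists:
        key = tuple(sorted(tokens))
        if key not in seen:
            seen.add(key)
            unique.append(tokens)
    return unique


# Neutral placeholder titles, used only to show the mechanics.
titles = ["The Quick, Brown Fox!", "the quick brown fox", "A Lazy Dog in the Sun"]
cleaned = deduplicate([preprocess(t) for t in titles])
print(cleaned)
```

The first two placeholder titles reduce to the same tokens after cleaning, so only one of them survives deduplication.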

🧩 NLP Techniques

We adapted short-text NLP techniques to analyze the structural and semantic properties of the metadata:

Part-of-Speech (POS) Tagging
Identifies grammatical roles and recurring syntactic patterns.
Dependency Parsing
Examines how gendered and racialized terms are positioned within phrases, revealing structural relationships.
Bigram and Trigram Modeling
Captures frequently occurring word sequences and stereotypical expressions.
Co-occurrence Analysis
Tracks patterns between gendered and racialized terms across time and content categories.
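
The n-gram and co-occurrence steps can be illustrated with a short sketch over already-tokenized entries. The helper names (`ngrams`, `count_ngrams`, `cooccurrence`) and the two placeholder term groups are hypothetical and use neutral toy data; the project's own lexicons and implementation may differ. POS tagging and dependency parsing need an NLP library and are not shown here.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-token sequences in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_ngrams(docs, n):
    """Count n-grams across a collection of tokenized entries."""
    counts = Counter()
    for tokens in docs:
        counts.update(ngrams(tokens, n))
    return counts


def cooccurrence(docs, terms_a, terms_b):
    """Count entries in which a term from group A appears with a term from group B."""
    counts = Counter()
    for tokens in docs:
        present = set(tokens)
        for a in present & terms_a:
            for b in present & terms_b:
                counts[(a, b)] += 1
    return counts


# Neutral toy data standing in for tokenized titles or tags.
docs = [
    ["red", "small", "car"],
    ["red", "small", "bike"],
    ["blue", "large", "car"],
]
# Hypothetical term groups; in the project these would be the two lexicons of interest.
group_a = {"red", "blue"}
group_b = {"small", "large"}

bigram_counts = count_ngrams(docs, 2)
trigram_counts = count_ngrams(docs, 3)
pair_counts = cooccurrence(docs, group_a, group_b)
print(bigram_counts.most_common(1))
print(pair_counts)
```

Tracking patterns across time or content categories would amount to running the same counts on subsets of entries grouped by publication year or category.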

Requirements

Python >=3.9 and <3.13. The Python environment can be isolated using venv:

python -m venv venv
source venv/bin/activate
pip install -r requirements.txt

Data setup

The data is expected to be in a single CSV file, with each row describing a single video and columns containing metadata such as categories, upvotes, downvotes, and views.
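
A quick way to check that a file matches this layout is sketched below. The exact column names are not stated here, so `title`, `tags`, `categories`, `upvotes`, `downvotes`, and `views` are assumptions, as is the helper name `load_videos`; adjust them to the actual header of your file.

```python
import csv
import io

# Hypothetical column names; replace with the header used in your CSV.
REQUIRED_COLUMNS = {"title", "tags", "categories", "upvotes", "downvotes", "views"}
NUMERIC_COLUMNS = ("upvotes", "downvotes", "views")


def load_videos(handle):
    """Read one video per row and convert the numeric columns to int."""
    reader = csv.DictReader(handle)
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    rows = []
    for row in reader:
        for col in NUMERIC_COLUMNS:
            row[col] = int(row[col])
        rows.append(row)
    return rows


# In-memory placeholder standing in for the real file.
sample = io.StringIO(
    "title,tags,categories,upvotes,downvotes,views\n"
    "example title,tag1;tag2,cat1,10,2,1500\n"
)
videos = load_videos(sample)
print(videos[0]["views"])
```

For a file on disk, the same function can be called as `load_videos(open(path, newline="", encoding="utf-8"))`, where `path` points to your CSV.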
