ptypes-nlesc/stereotype-map

💡 Motivation

This project investigates how gendered and racialized stereotypes are produced, reinforced, or contested through the metadata of online pornographic content. By analyzing a large-scale dataset of video titles and tags from PornHub, it examines how classificatory systems such as tags and titles participate in the construction and circulation of stereotypes. This helps show that stereotypes are not only embedded in video content but also reinforced through digital infrastructures and linguistic classifications.

🧠 Methodology Overview

This project analyzes metadata from over 250,000 PornHub videos published between 2008 and 2024, focusing on titles and tags as classificatory tools that shape content visibility and access. The dataset also includes additional metadata such as:

  • Publication date
  • View counts
  • Upvotes
  • Production information
  • Actor details
  • Categories

🔧 Preprocessing

Standard text preprocessing steps were applied to ensure data quality and consistency:

  • Tokenization of titles and tags
  • Lowercasing of text
  • Removal of stopwords, special characters, and non-informative tokens
  • Deduplication of near-identical entries

These steps ensured that the analysis focused on linguistically meaningful content.
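
The steps above can be sketched with the standard library alone. This is a minimal illustration, not the project's actual pipeline: the stopword list, the token pattern, and the function names `preprocess` and `deduplicate` are assumptions, and "near-identical" is approximated here as "identical after normalization".

```python
import re

# Assumption: a tiny illustrative stopword list; the project's real list is not documented here.
STOPWORDS = {"a", "an", "and", "the", "of", "in", "with", "to"}

# Assumption: tokens are runs of lowercase letters and digits; everything else is a separator.
TOKEN_RE = re.compile(r"[a-z0-9]+")


def preprocess(text):
    """Lowercase the text, tokenize it, and drop stopwords."""
    tokens = TOKEN_RE.findall(text.lower())
    return [t for t in tokens if t not in STOPWORDS]


def deduplicate(texts):
    """Keep the first entry for each distinct normalized token sequence."""
    seen = set()
    unique = []
    for text in texts:
        key = tuple(preprocess(text))
        if key and key not in seen:  # skip entries that are empty after cleaning
            seen.add(key)
            unique.append(list(key))
    return unique


titles = ["The Quick Brown Fox!!", "the quick, brown fox", "A Lazy Dog"]
cleaned = deduplicate(titles)
print(cleaned)
```

The first two titles differ only in case and punctuation, so they collapse to a single entry after normalization. A fuzzier notion of near-duplication (for example, edit distance) would need a different key.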


🧩 NLP Techniques

We adapted short-text NLP techniques to analyze the structural and semantic properties of the metadata:

  • Part-of-Speech (POS) Tagging
    Identifies grammatical roles and recurring syntactic patterns.

  • Dependency Parsing
    Examines how gendered and racialized terms are positioned within phrases, revealing structural relationships.

  • Bigram and Trigram Modeling
    Captures frequently occurring word sequences and stereotypical expressions.

  • Co-occurrence Analysis
    Tracks patterns between gendered and racialized terms across time and content categories.
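
As a rough illustration of the last two techniques, the sketch below counts n-grams and within-document co-occurrences using only the standard library. The function names, the toy documents, and the set of terms of interest are hypothetical placeholders, not the project's code or vocabulary.

```python
from collections import Counter
from itertools import combinations


def ngrams(tokens, n):
    """Return all contiguous n-token sequences in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_ngrams(docs, n):
    """Count n-grams over a collection of tokenized documents."""
    counts = Counter()
    for tokens in docs:
        counts.update(ngrams(tokens, n))
    return counts


def cooccurrence(docs, terms_of_interest):
    """Count how often pairs of selected terms appear in the same document."""
    counts = Counter()
    for tokens in docs:
        # Sort so each pair is always stored in the same order.
        present = sorted(set(tokens) & terms_of_interest)
        counts.update(combinations(present, 2))
    return counts


# Placeholder documents: each list stands for one preprocessed title.
docs = [
    ["red", "small", "car"],
    ["red", "small", "bike"],
    ["blue", "small", "car"],
]
bigrams = count_ngrams(docs, 2)
trigrams = count_ngrams(docs, 3)
pairs = cooccurrence(docs, {"red", "blue", "car"})
print(bigrams.most_common(2))
print(pairs)
```

To follow patterns across time or content categories, the same counters could be computed per group (for example, one `Counter` per publication year) and then compared.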

Requirements

Python >=3.9 and <3.13. The Python environment can be isolated using venv:

python -m venv venv
source venv/bin/activate
pip install -r requirements.txt

Data setup

We expect the data to be in a single CSV file, with each row describing a single video and columns containing metadata such as categories, upvotes, downvotes, and views.
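
A minimal sketch of reading such a file with the standard `csv` module follows. The exact column names, the `;` separator inside the categories field, and the helper name `load_videos` are assumptions for illustration; adjust them to match the real file.

```python
import csv
import io

# Assumption: hypothetical header and rows; the real file's columns may differ.
SAMPLE_CSV = """title,categories,upvotes,downvotes,views
Example title one,cat_a;cat_b,10,2,1500
Example title two,cat_b,5,1,300
"""

# Assumption: these columns hold integer counts.
NUMERIC_COLUMNS = ("upvotes", "downvotes", "views")


def load_videos(handle):
    """Read one video per row, converting counts to int and splitting categories."""
    rows = []
    for row in csv.DictReader(handle):
        for col in NUMERIC_COLUMNS:
            row[col] = int(row[col])
        # Assumption: multiple categories are separated by ';' within the field.
        row["categories"] = row["categories"].split(";")
        rows.append(row)
    return rows


videos = load_videos(io.StringIO(SAMPLE_CSV))
print(videos[0]["title"], videos[0]["views"])
```

For a file on disk, the same function can be given an open file handle, for example `open(path, newline="", encoding="utf-8")`.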

Examples


Documentation
