Urban Dictionary Archive – A continuously scraped dataset of slang definitions from Urban Dictionary, automatically updated every 5-15 minutes via GitHub Actions.
This repository maintains two complementary data storage formats:
- Daily archives (`data/`): chronological files containing all entries fetched on a specific date. Each file contains an array of definition objects.
- Alphabetical files (`dictionary/`): entries grouped by the first letter of the word. Each file is structured as a JSON object with words as keys and arrays of definitions as values.
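As a rough sketch, the two layouts differ like this (the field values below are made up purely for illustration):

```python
# data/2024-01-15.json -- a daily archive: a flat list of definition objects
daily_archive = [
    {"defid": 12345678, "word": "rizz", "definition": "...", "example": "...", "written_on": "..."},
    {"defid": 12345679, "word": "yeet", "definition": "...", "example": "...", "written_on": "..."},
]

# dictionary/R.json -- an alphabetical file: each word maps to a list of its definitions
alphabetical_file = {
    "rizz": [
        {"defid": 12345678, "word": "rizz", "definition": "...", "example": "...", "written_on": "..."},
    ],
}
```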
Each definition entry contains the following fields:
```json
{
  "defid": 12345678,
  "word": "rizz",
  "definition": "Slang for charisma; ability to attract.",
  "example": "He's got mad rizz.",
  "written_on": "2025-09-09T21:31:00.000Z"
}
```

- `defid`: Unique identifier for the definition (used for deduplication)
- `word`: The slang term being defined
- `definition`: The definition of the word
- `example`: Usage example of the word
- `written_on`: Timestamp when the definition was originally submitted
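For reference, the same schema expressed as a Python type hint (an illustrative sketch; the repository does not necessarily define such a type):

```python
from typing import TypedDict

class Definition(TypedDict):
    """One definition entry as stored in this archive."""
    defid: int          # unique identifier, used for deduplication
    word: str           # the slang term being defined
    definition: str     # the definition text
    example: str        # a usage example
    written_on: str     # ISO 8601 timestamp of the original submission
```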
To find all entries from a specific date:
```bash
# View entries from January 15, 2024
cat data/2024-01-15.json
```
To find all words starting with a specific letter:
```bash
# View all words starting with 'R'
cat dictionary/R.json
```
To look up a specific word:
```bash
# Look in the R.json file
jq '.rizz' dictionary/R.json
```
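The same lookups can be done programmatically; a minimal Python sketch using only the standard library:

```python
import json

# All entries fetched on January 15, 2024
with open("data/2024-01-15.json") as f:
    daily_entries = json.load(f)
print(f"{len(daily_entries)} entries collected on 2024-01-15")

# All definitions of "rizz" from the alphabetical file
with open("dictionary/R.json") as f:
    r_words = json.load(f)
for entry in r_words.get("rizz", []):
    print(entry["defid"], entry["definition"])
```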
Data collection details:
- Source: Urban Dictionary Random API (https://api.urbandictionary.com/v0/random)
- Frequency: Every 5-15 minutes via GitHub Actions
- Deduplication: Entries are deduplicated by `defid` to ensure data cleanliness
- Error Handling: API failures are handled gracefully with retry logic
- Storage: Dual storage system for both chronological and alphabetical access
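The retry behaviour described above might look roughly like the following sketch. This is not the actual fetch_ud.py code; the function name, retry counts, and use of the requests library are assumptions, and the `"list"` key reflects the usual shape of Urban Dictionary API responses:

```python
import time
import requests  # assumed to be available via requirements.txt

RANDOM_API = "https://api.urbandictionary.com/v0/random"

def fetch_random_batch(retries: int = 3, backoff: float = 2.0) -> list[dict]:
    """Fetch one batch of random definitions, retrying on transient failures."""
    for attempt in range(1, retries + 1):
        try:
            resp = requests.get(RANDOM_API, timeout=10)
            resp.raise_for_status()
            return resp.json().get("list", [])   # the API wraps results in a "list" array
        except requests.RequestException:
            if attempt == retries:
                raise                            # give up after the last attempt
            time.sleep(backoff * attempt)        # simple linear backoff before retrying
    return []
```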
To run the collector locally:
- Clone the repository:
```bash
git clone https://github.com/yourusername/UrbanArchive.git
cd UrbanArchive
```
- Install dependencies:
```bash
pip install -r requirements.txt
```
- Run the fetcher:
```bash
python fetch_ud.py
```
Statistics are automatically updated with each data collection run.
- Data is collected every 5-15 minutes via GitHub Actions
- Each run fetches 50 batches with ~10 entries per batch
- Automatic deduplication prevents duplicate entries
- All activity is logged to the `logs/` directory (not committed to the repo)
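A single collection run, as described by the bullets above, could be structured along the following lines. This is an illustrative sketch, not the actual fetch_ud.py: the `collect_once` helper, the file-writing logic, and the handling of edge cases (e.g. words that do not start with a letter) are simplified assumptions, and the fetcher is passed in so the earlier `fetch_random_batch` sketch can be reused:

```python
import json
from datetime import date
from pathlib import Path
from typing import Callable

def collect_once(fetch_batch: Callable[[], list[dict]], batches: int = 50) -> None:
    """Fetch ~50 batches, deduplicate by defid, and update both storage formats."""
    new_entries: dict[int, dict] = {}
    for _ in range(batches):
        for entry in fetch_batch():
            new_entries.setdefault(entry["defid"], entry)   # dedup within the run

    # Daily archive: data/YYYY-MM-DD.json holds a flat list of the day's entries.
    daily_path = Path("data") / f"{date.today():%Y-%m-%d}.json"
    daily = json.loads(daily_path.read_text()) if daily_path.exists() else []
    seen = {e["defid"] for e in daily}
    daily.extend(e for defid, e in new_entries.items() if defid not in seen)
    daily_path.parent.mkdir(exist_ok=True)
    daily_path.write_text(json.dumps(daily, indent=2))

    # Alphabetical files: dictionary/<LETTER>.json maps each word to its definitions.
    for entry in new_entries.values():
        letter_path = Path("dictionary") / f"{entry['word'][0].upper()}.json"
        by_word = json.loads(letter_path.read_text()) if letter_path.exists() else {}
        defs = by_word.setdefault(entry["word"], [])
        if entry["defid"] not in {d["defid"] for d in defs}:
            defs.append(entry)
        letter_path.parent.mkdir(exist_ok=True)
        letter_path.write_text(json.dumps(by_word, indent=2))
```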
Data quality:
- Deduplication: No duplicate entries, based on `defid`
- Continuous Growth: New entries added every 5-15 minutes
- Dual Access: Both chronological (daily) and alphabetical organization
- JSON Validation: All data is validated before storage
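A validation step in that spirit might look like this minimal sketch (the actual checks performed by the repository may differ):

```python
REQUIRED_FIELDS = {"defid", "word", "definition", "example", "written_on"}

def is_valid_entry(entry: object) -> bool:
    """Accept only well-formed definition objects before they are stored."""
    if not isinstance(entry, dict):
        return False
    if not REQUIRED_FIELDS <= entry.keys():          # every required field must be present
        return False
    return isinstance(entry["defid"], int) and all(
        isinstance(entry[field], str)
        for field in ("word", "definition", "example", "written_on")
    )
```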
This is an automated data collection project. The main ways to contribute are:
- Improving the fetching script (`fetch_ud.py`)
- Enhancing data processing or storage formats
- Adding data analysis tools or utilities
- Reporting issues with the automation
This project is open source. The collected data comes from Urban Dictionary's public API.
This archive contains user-generated content from Urban Dictionary. The definitions and examples may contain explicit language, offensive terms, or inappropriate content. This repository is for research and archival purposes only.