Divergent Discourses Corpus Analysis Tools

Comprehensive analysis tools for the Divergent Discourses Tibetan Newspaper Corpus (1950-1965).

About the Project

The Divergent Discourses project studies the role of narrative and discourse in the perpetuation of antagonisms in the Tibet-China dispute during the formative period of the 1950s and 1960s. These tools analyze a corpus of 16,718 pages from 16 Tibetan-language newspapers published between 1950 and 1965.

Features

✨ Complete Corpus Analysis

Automatic extraction of metadata from standardized filenames
Comprehensive statistics on newspapers, issues, and pages
Year-by-year coverage analysis

📊 Excel-Ready Exports

Single consolidated CSV optimized for pivot tables
Multi-sheet Excel workbook with pre-formatted data
23 comprehensive columns for flexible analysis

🏛️ Library Holdings Tracking

Detailed provenance information
Holdings by library, newspaper, and year
Multi-source issue identification

📉 Missing Issues Analysis

Automatic publication frequency detection
Gap identification and quantification
Completeness percentage estimates

📈 Multiple Output Formats

JSON for programmatic access
CSV for spreadsheet analysis
Text reports for documentation
Excel workbooks for visualization

Quick Start

Installation

pip install git+https://github.com/Divergent-Discourses/DDC_CorpusStatistics.git

Or install from source:

git clone https://github.com/Divergent-Discourses/DDC_CorpusStatistics.git
cd DDC_CorpusStatistics
pip install -e .

Basic Usage

# Complete analysis with all statistics
dd-analyze /path/to/corpus

# Export for Excel pivot tables (recommended!)
dd-excel-export /path/to/corpus

# Generate detailed reports
dd-reports /path/to/corpus

Python API

from dd_corpus_tools import NewspaperCorpusAnalyzer

analyzer = NewspaperCorpusAnalyzer('/path/to/corpus')
analyzer.scan_corpus()
analyzer.print_summary_statistics()
analyzer.export_to_json('statistics.json')
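
The structure of statistics.json is not described in this README, so a cautious first step is to load the file and list its top-level keys before relying on any of them. The following is a minimal sketch using only the standard library; the helper name `summarize_json` is hypothetical and not part of dd_corpus_tools.

```python
import json
from pathlib import Path


def summarize_json(path):
    """Return {top-level key: type name} for a JSON file whose root is an object."""
    with Path(path).open(encoding="utf-8") as handle:
        data = json.load(handle)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")
    return {key: type(value).__name__ for key, value in data.items()}


if __name__ == "__main__":
    # 'statistics.json' is the file name passed to export_to_json() above.
    target = Path("statistics.json")
    if target.exists():
        for key, kind in summarize_json(target).items():
            print(f"{key}: {kind}")
```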

File Naming Convention

The tools expect files to follow this pattern:

XXX_YYYY_MM_DD_ppp_LL_abcd.ext

Where:

XXX: 3-letter newspaper code (e.g., TID, QTN, TIM)
YYYY_MM_DD: Publication date
ppp: Page number (3 digits with leading zeros)
LL: 2-letter library code
abcd: Optional shelfmark
.ext: File extension (.jpg, .png, .tif, .pdf)

Example: TID_1964_01_09_001_SB_Zsn128162MR.jpg
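
To make the convention concrete, here is a minimal sketch of a parser for such filenames. The regular expression is inferred from the field descriptions above and is an assumption; the pattern the tools actually use may be stricter or looser. The names `FILENAME_RE`, `PageFile`, and `parse_filename` are hypothetical.

```python
import re
from datetime import date
from typing import NamedTuple, Optional

# Assumed pattern: code, date, page, library, optional shelfmark, extension.
FILENAME_RE = re.compile(
    r"^(?P<code>[A-Z]{3})_"
    r"(?P<year>\d{4})_(?P<month>\d{2})_(?P<day>\d{2})_"
    r"(?P<page>\d{3})_"
    r"(?P<library>[A-Z]{2})"
    r"(?:_(?P<shelfmark>[^.]+))?"
    r"\.(?P<ext>jpg|png|tif|pdf)$"
)


class PageFile(NamedTuple):
    code: str
    published: date
    page: int
    library: str
    shelfmark: Optional[str]
    ext: str


def parse_filename(name):
    """Return a PageFile, or None if the name does not match or the date is invalid."""
    m = FILENAME_RE.match(name)
    if m is None:
        return None
    try:
        published = date(int(m["year"]), int(m["month"]), int(m["day"]))
    except ValueError:
        # Well-formed digits that do not form a real calendar date.
        return None
    return PageFile(
        m["code"], published, int(m["page"]), m["library"], m["shelfmark"], m["ext"]
    )


print(parse_filename("TID_1964_01_09_001_SB_Zsn128162MR.jpg"))
```

Returning `None` for a non-matching name keeps the sketch short; a caller that needs to know why a name failed would want per-field checks instead.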

Documentation

User Guide - Comprehensive usage instructions
Excel Pivot Tables Guide - Detailed Excel analysis tutorial
API Reference - Python API documentation
Quick Reference - Quick command reference
Installation Guide - Detailed installation instructions

Newspaper Codes

Code	Newspaper
CWN	Central Weekly News
DTF	Defend Tibet's Freedom
FRD	Freedom
GDN	Ganze Daily
GTN	Gyantse News
KDN	Kangding News
MJN	Minjiang News
NIB	News in Brief
QTN	Qinghai Tibetan News
SGN	South Gansu News
TDP	Tibet Daily Pictorial
TID	Tibet Daily
TIF	Tibetan Freedom
TIM	Tibet Mirror
XNX	South-West Institute for Nationalities
ZYX	Central Institute for Nationalities

Command-Line Tools

dd-analyze

Complete corpus analysis with detailed statistics.

dd-analyze /path/to/corpus

Generates:

Console output with complete statistics
corpus_statistics.json - All statistics in JSON format

Features:

Complete newspaper list with sources
Library holdings by newspaper, year, and issue
Missing issues estimates
Year-by-year coverage
Statistical summaries
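
One plausible way to arrive at a missing-issues estimate is to take the median gap between consecutive issue dates as the publication interval, then compare the number of issues held with the number that interval implies for the covered period. The sketch below illustrates that idea only; it is not taken from dd-analyze, and the function names are hypothetical.

```python
from datetime import date
from statistics import median


def estimate_frequency(issue_dates):
    """Guess the publication interval in days from the median gap between issues."""
    ordered = sorted(set(issue_dates))
    if len(ordered) < 2:
        return None
    gaps = [(later - earlier).days for earlier, later in zip(ordered, ordered[1:])]
    return median(gaps)


def estimate_completeness(issue_dates):
    """Return (expected_issues, held_issues, percent_held) for the covered span."""
    ordered = sorted(set(issue_dates))
    interval = estimate_frequency(ordered)
    if not interval:
        return None
    span_days = (ordered[-1] - ordered[0]).days
    held = len(ordered)
    # Never expect fewer issues than are actually held.
    expected = max(held, int(span_days // interval) + 1)
    return expected, held, round(100 * held / expected, 1)


# A weekly paper with one issue (22 January) missing.
weekly = [date(1960, 1, 1), date(1960, 1, 8), date(1960, 1, 15), date(1960, 1, 29)]
print(estimate_completeness(weekly))  # (5, 4, 80.0)
```

The median is used rather than the mean so that a few long gaps do not inflate the estimated interval. Issues missing before the first or after the last held date cannot be detected this way.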

dd-excel-export

Excel-optimized export for pivot table analysis.

dd-excel-export /path/to/corpus

Generates:

corpus_pivot_table_data.csv - Single CSV with all data (23 columns)
corpus_analysis_workbook.xlsx - Multi-sheet Excel workbook (requires openpyxl)

Perfect for:

Creating pivot tables and charts in Excel
Interactive data exploration
Custom cross-tabulations
Data visualization

See the Excel Pivot Tables Guide for detailed usage.

dd-analyze-advanced

Extended analysis with additional exports and checks.

dd-analyze-advanced /path/to/corpus

Generates:

corpus_statistics.json - Complete statistics
corpus_detailed.csv - Page-level data
corpus_issues.csv - Issue-level data
library_holdings.csv - Library holdings
missing_issues.csv - Missing issues summary

Additional Features:

Page-level completeness checking
Duplicate page detection
Monthly statistics
Temporal gap analysis (>60 days)
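
Page-level completeness checking and duplicate detection can be illustrated for a single issue as follows. This is a sketch under two stated assumptions, not the tool's implementation: pages are numbered from 1, and the highest page number found is the last page of the issue. The function name `check_issue_pages` is hypothetical.

```python
from collections import Counter


def check_issue_pages(page_numbers):
    """Return (missing, duplicates) for one issue, given its page numbers.

    Pages absent from the end of an issue cannot be detected, because the
    highest number seen is taken to be the last page.
    """
    counts = Counter(page_numbers)
    if not counts:
        return [], []
    last_page = max(counts)
    missing = [n for n in range(1, last_page + 1) if n not in counts]
    duplicates = sorted(n for n, seen in counts.items() if seen > 1)
    return missing, duplicates


print(check_issue_pages([1, 2, 2, 4]))  # ([3], [2])
```

A page number that appears twice is not necessarily an error: the same page may be held by two libraries, so a real check would probably compare library codes before flagging a duplicate.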

dd-reports

Generate detailed text reports.

dd-reports /path/to/corpus

Generates:

library_holdings_report.txt - Comprehensive library holdings
missing_issues_report.txt - Detailed missing issues analysis

Use Cases:

Documentation
Sharing with non-technical collaborators
Detailed provenance tracking

dd-validate

Validate filename compliance.

dd-validate /path/to/corpus

Features:

Checks all files against expected pattern
Reports invalid filenames with specific errors
Suggests corrections for common issues
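
The kind of per-field check that can produce specific error messages is sketched below. The rules are inferred from the File Naming Convention section and are assumptions; the messages, the function names, and the directory walk are illustrative and do not reproduce dd-validate's actual output. Suggesting corrections is not attempted here.

```python
from datetime import date
from pathlib import Path

ALLOWED_EXTENSIONS = {".jpg", ".png", ".tif", ".pdf"}


def filename_errors(name):
    """Return a list of readable problems; an empty list means the name passes."""
    errors = []
    path = Path(name)
    if path.suffix.lower() not in ALLOWED_EXTENSIONS:
        errors.append(f"unexpected extension {path.suffix!r}")
    parts = path.stem.split("_")
    if len(parts) < 6:
        errors.append(f"expected at least 6 underscore-separated fields, found {len(parts)}")
        return errors
    code, year, month, day, page, library = parts[:6]
    if not (len(code) == 3 and code.isalpha() and code.isupper()):
        errors.append(f"newspaper code {code!r} is not 3 uppercase letters")
    try:
        if (len(year), len(month), len(day)) != (4, 2, 2):
            raise ValueError("wrong field width")
        date(int(year), int(month), int(day))
    except ValueError:
        errors.append(f"invalid date {year}_{month}_{day}")
    if not (len(page) == 3 and page.isdigit()):
        errors.append(f"page {page!r} is not 3 digits")
    if not (len(library) == 2 and library.isalpha() and library.isupper()):
        errors.append(f"library code {library!r} is not 2 uppercase letters")
    return errors


def validate_directory(root):
    """Map each non-compliant file under root to its list of problems."""
    report = {}
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            problems = filename_errors(path.name)
            if problems:
                report[str(path)] = problems
    return report


print(filename_errors("tid_1964_02_30_1_SB.jpeg"))
```

Collecting every problem for a name, instead of stopping at the first, lets a single pass over the corpus report everything that needs renaming.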

Output Files

Consolidated Export (Recommended for Excel)

dd-excel-export /path/to/corpus

Generates:

corpus_pivot_table_data.csv - Single CSV with 23 columns including:
- Date, Year, Month, Day, Quarter, Decade
- Newspaper_Code, Newspaper_Name, Region, Publisher_Type
- Administrative_Level, Province, Publication_Type
- Pages_In_Issue, Completeness_Pct
- Has_Missing_Pages, Has_Duplicate_Pages, Is_Complete_Issue
- Primary_Library, All_Libraries, Num_Libraries
- Estimated_Frequency, Avg_Gap_Days
corpus_analysis_workbook.xlsx - Multi-sheet Excel workbook with:
- Issues_Data (main pivot table sheet)
- Newspapers_Summary
- Library_Holdings
- Yearly_Statistics
- Missing_Issues
- Issue_Completeness
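
The consolidated CSV can also be cross-tabulated without Excel. The sketch below counts rows per newspaper and year using the `Newspaper_Code` and `Year` columns listed above. It assumes one row per issue, which the `Pages_In_Issue` column suggests but this README does not state outright; the function name is hypothetical.

```python
import csv
import os
from collections import Counter


def issues_per_newspaper_year(csv_path):
    """Count rows per (Newspaper_Code, Year); assumes one row per issue."""
    counts = Counter()
    # 'utf-8-sig' reads plain UTF-8 and also tolerates a byte-order mark,
    # which CSV files prepared for Excel sometimes carry.
    with open(csv_path, newline="", encoding="utf-8-sig") as handle:
        for row in csv.DictReader(handle):
            counts[(row["Newspaper_Code"], row["Year"])] += 1
    return counts


if __name__ == "__main__":
    source = "corpus_pivot_table_data.csv"
    if os.path.exists(source):
        for (code, year), n in sorted(issues_per_newspaper_year(source).items()):
            print(code, year, n)
```

Year values are kept as strings, exactly as they appear in the file; convert them with `int()` if numeric sorting or arithmetic is needed.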

Advanced Analysis

dd-analyze-advanced /path/to/corpus

Generates:

corpus_statistics.json - Complete statistics in JSON
corpus_detailed.csv - Every page with metadata
corpus_issues.csv - Every issue with completeness flags
library_holdings.csv - Library holdings in tabular format
missing_issues.csv - Missing issues estimates by newspaper

Text Reports

dd-reports /path/to/corpus

Generates:

library_holdings_report.txt - Detailed holdings breakdown
missing_issues_report.txt - Gap analysis with estimates

Requirements

Python 3.6 or higher
No required dependencies for basic functionality
Optional: openpyxl for Excel export (.xlsx files)

pip install openpyxl  # Optional, for Excel export

Python API Examples

Basic Statistics

from dd_corpus_tools import NewspaperCorpusAnalyzer

analyzer = NewspaperCorpusAnalyzer('/path/to/corpus')
analyzer.scan_corpus()

# Get totals
total_newspapers = len(analyzer.data)
total_issues = sum(len(dates) for dates in analyzer.issues.values())
total_pages = sum(analyzer.pages_by_newspaper.values())

print(f"Newspapers: {total_newspapers}")
print(f"Issues: {total_issues}")
print(f"Pages: {total_pages}")

Query Specific Newspaper

newspaper = 'TID'  # Tibet Daily

if newspaper in analyzer.data:
    issues = len(analyzer.issues[newspaper])
    pages = analyzer.pages_by_newspaper[newspaper]
    years = sorted(analyzer.data[newspaper].keys())
    
    print(f"{newspaper}:")
    print(f"  Issues: {issues}")
    print(f"  Pages: {pages}")
    print(f"  Years: {min(years)}-{max(years)}")

Library Holdings

for library in sorted(analyzer.libraries.keys()):
    total_pages = sum(analyzer.libraries[library].values())
    # 'code' is a newspaper code; the name avoids confusion with the common numpy alias 'np'
    total_issues = sum(len(analyzer.library_issues[library][code])
                       for code in analyzer.library_issues[library])
    newspapers = len(analyzer.libraries[library])
    
    print(f"{library}: {newspapers} newspapers, {total_issues} issues, {total_pages} pages")

Export Everything

from dd_corpus_tools import ConsolidatedExporter

exporter = ConsolidatedExporter('/path/to/corpus')
exporter.scan_corpus()
exporter.export_all()  # Creates CSV + Excel workbook

More examples are available in examples/example_usage.py.

Contributing

We welcome contributions! Please see CONTRIBUTING.md for guidelines.

Development Setup

# Clone repository
git clone https://github.com/Divergent-Discourses/DDC_CorpusStatistics.git
cd DDC_CorpusStatistics

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install in development mode with the dev and excel extras
# (quoted so that shells such as zsh do not expand the brackets)
pip install -e ".[dev,excel]"

# Run tests
pytest

# Format code
black dd_corpus_tools/

Citation

If you use these tools in your research, please cite:

@software{dd_corpus_tools,
  title = {Divergent Discourses Corpus Analysis Tools},
  author = {{Divergent Discourses Project}},
  year = {2025},
  url = {https://github.com/Divergent-Discourses/DDC_CorpusStatistics},
  note = {Tools for analyzing Tibetan newspaper corpus (1950-1965)}
}

For the corpus itself:

@article{erhard2025divergent,
  title = {The Divergent Discourses Corpus: A Digital Collection of Early Tibetan Newspapers from the 1950s and 1960s},
  author = {Erhard, Franz Xaver},
  journal = {Revue d'Etudes Tibétaines},
  number = {74},
  pages = {44--80},
  year = {2025},
  month = {February}
}

Acknowledgments

This project is part of the Divergent Discourses research project, funded by:

Deutsche Forschungsgemeinschaft (DFG) - Project number 508232945
Arts and Humanities Research Council (AHRC) - Project reference AH/X001504/1

Contact

Project Website: https://research.uni-leipzig.de/diverge/
GitHub Issues: https://github.com/Divergent-Discourses/DDC_CorpusStatistics/issues

Related Projects

Divergent Discourses Corpus - Access the digitized newspapers
Divergent Discourses Project - Main research project

Version History

See CHANGELOG.md for detailed version history.

Support

Documentation: https://github.com/Divergent-Discourses/DDC_CorpusStatistics/tree/main/docs
Examples: https://github.com/Divergent-Discourses/DDC_CorpusStatistics/tree/main/examples
Issues: https://github.com/Divergent-Discourses/DDC_CorpusStatistics/issues
Discussions: Use GitHub Issues for questions and discussions
