Comprehensive analysis tools for the Divergent Discourses Tibetan Newspaper Corpus (1950-1965).
The Divergent Discourses project studies the role of narrative and discourse in the perpetuation of antagonisms in the Tibet-China dispute during the formative period of the 1950s and 1960s. These tools analyze a corpus of 16,718 pages from 16 Tibetan-language newspapers published between 1950 and 1965.
✨ Complete Corpus Analysis
- Automatic extraction of metadata from standardized filenames
- Comprehensive statistics on newspapers, issues, and pages
- Year-by-year coverage analysis
📊 Excel-Ready Exports
- Single consolidated CSV optimized for pivot tables
- Multi-sheet Excel workbook with pre-formatted data
- 23 comprehensive columns for flexible analysis
🏛️ Library Holdings Tracking
- Detailed provenance information
- Holdings by library, newspaper, and year
- Multi-source issue identification
📉 Missing Issues Analysis
- Automatic publication frequency detection
- Gap identification and quantification
- Completeness percentage estimates
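As a rough sketch of how such a completeness estimate could work (the package's actual method may differ — the function name and the frequency heuristic here are illustrative assumptions):

```python
from datetime import date

def completeness_pct(issue_dates, frequency_days=1):
    """Estimate completeness as found issues / expected issues.

    Assumes one issue every `frequency_days` days between the first
    and last observed issue -- a simplification of what the tools do.
    """
    if not issue_dates:
        return 0.0
    first, last = min(issue_dates), max(issue_dates)
    expected = (last - first).days // frequency_days + 1
    return round(100.0 * len(issue_dates) / expected, 1)

# A weekly paper missing one of five expected issues:
dates = [date(1955, 1, 3), date(1955, 1, 10), date(1955, 1, 24), date(1955, 1, 31)]
print(completeness_pct(dates, frequency_days=7))  # → 80.0
```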
📈 Multiple Output Formats
- JSON for programmatic access
- CSV for spreadsheet analysis
- Text reports for documentation
- Excel workbooks for visualization
```bash
pip install git+https://github.com/Divergent-Discourses/DDC_CorpusStatistics.git
```

Or install from source:

```bash
git clone https://github.com/Divergent-Discourses/DDC_CorpusStatistics.git
cd DDC_CorpusStatistics
pip install -e .
```

```bash
# Complete analysis with all statistics
dd-analyze /path/to/corpus

# Export for Excel pivot tables (recommended!)
dd-excel-export /path/to/corpus

# Generate detailed reports
dd-reports /path/to/corpus
```

```python
from dd_corpus_tools import NewspaperCorpusAnalyzer

analyzer = NewspaperCorpusAnalyzer('/path/to/corpus')
analyzer.scan_corpus()
analyzer.print_summary_statistics()
analyzer.export_to_json('statistics.json')
```

The tools expect files to follow this pattern:
```
XXX_YYYY_MM_DD_ppp_LL_abcd.ext
```

Where:

- `XXX`: 3-letter newspaper code (e.g., TID, QTN, TIM)
- `YYYY_MM_DD`: Publication date
- `ppp`: Page number (3 digits with leading zeros)
- `LL`: 2-letter library code
- `abcd`: Optional shelfmark
- `.ext`: File extension (`.jpg`, `.png`, `.tif`, `.pdf`)

Example: `TID_1964_01_09_001_SB_Zsn128162MR.jpg`
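A minimal parser for this convention might look like the following; the regex and group names are illustrative, not the package's actual API:

```python
import re

# Hypothetical parser for the filename convention described above.
FILENAME_RE = re.compile(
    r"^(?P<code>[A-Z]{3})"                               # 3-letter newspaper code
    r"_(?P<year>\d{4})_(?P<month>\d{2})_(?P<day>\d{2})"  # publication date
    r"_(?P<page>\d{3})"                                  # page number, zero-padded
    r"_(?P<library>[A-Z]{2})"                            # 2-letter library code
    r"(?:_(?P<shelfmark>[A-Za-z0-9]+))?"                 # optional shelfmark
    r"\.(?P<ext>jpg|png|tif|pdf)$",
    re.IGNORECASE,
)

m = FILENAME_RE.match("TID_1964_01_09_001_SB_Zsn128162MR.jpg")
print(m.group("code"), m.group("year"), m.group("page"), m.group("library"))
# → TID 1964 001 SB
```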
- User Guide - Comprehensive usage instructions
- Excel Pivot Tables Guide - Detailed Excel analysis tutorial
- API Reference - Python API documentation
- Quick Reference - Quick command reference
- Installation Guide - Detailed installation instructions
| Code | Newspaper |
|---|---|
| CWN | Central Weekly News |
| DTF | Defend Tibet's Freedom |
| FRD | Freedom |
| GDN | Ganze Daily |
| GTN | Gyantse News |
| KDN | Kangding News |
| MJN | Minjiang News |
| NIB | News in Brief |
| QTN | Qinghai Tibetan News |
| SGN | South Gansu News |
| TDP | Tibet Daily Pictorial |
| TID | Tibet Daily |
| TIF | Tibetan Freedom |
| TIM | Tibet Mirror |
| XNX | South-West Institute for Nationalities |
| ZYX | Central Institute for Nationalities |
Complete corpus analysis with detailed statistics.
```bash
dd-analyze /path/to/corpus
```

Generates:

- Console output with complete statistics
- `corpus_statistics.json` - All statistics in JSON format
Features:
- Complete newspaper list with sources
- Library holdings by newspaper, year, and issue
- Missing issues estimates
- Year-by-year coverage
- Statistical summaries
Excel-optimized export for pivot table analysis.
```bash
dd-excel-export /path/to/corpus
```

Generates:

- `corpus_pivot_table_data.csv` - Single CSV with all data (23 columns)
- `corpus_analysis_workbook.xlsx` - Multi-sheet Excel workbook (requires openpyxl)
Perfect for:
- Creating pivot tables and charts in Excel
- Interactive data exploration
- Custom cross-tabulations
- Data visualization
See the Excel Pivot Tables Guide for detailed usage.
Extended analysis with additional exports and checks.
```bash
dd-analyze-advanced /path/to/corpus
```

Generates:

- `corpus_statistics.json` - Complete statistics
- `corpus_detailed.csv` - Page-level data
- `corpus_issues.csv` - Issue-level data
- `library_holdings.csv` - Library holdings
- `missing_issues.csv` - Missing issues summary
Additional Features:
- Page-level completeness checking
- Duplicate page detection
- Monthly statistics
- Temporal gap analysis (>60 days)
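The frequency detection and gap analysis above might be sketched like this (the median-gap thresholds and function name are assumptions, not the package's actual implementation; only the >60-day gap threshold comes from the text above):

```python
from datetime import date
from statistics import median

def estimate_frequency(issue_dates):
    """Guess publication frequency from the median gap between issues,
    and flag temporal gaps longer than 60 days."""
    ordered = sorted(issue_dates)
    gaps = [(b - a).days for a, b in zip(ordered, ordered[1:])]
    if not gaps:
        return "unknown", []
    m = median(gaps)
    if m <= 1.5:
        label = "daily"
    elif m <= 10:
        label = "weekly"
    else:
        label = "monthly or less"
    big_gaps = [g for g in gaps if g > 60]  # temporal gaps worth reporting
    return label, big_gaps

# Four weekly issues in March, then a long silence until June:
dates = [date(1958, 3, d) for d in (1, 8, 15, 22)] + [date(1958, 6, 1)]
print(estimate_frequency(dates))  # → ('weekly', [71])
```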
Generate detailed text reports.
```bash
dd-reports /path/to/corpus
```

Generates:

- `library_holdings_report.txt` - Comprehensive library holdings
- `missing_issues_report.txt` - Detailed missing issues analysis
Use Cases:
- Documentation
- Sharing with non-technical collaborators
- Detailed provenance tracking
Validate filename compliance.
```bash
dd-validate /path/to/corpus
```

Features:
- Checks all files against expected pattern
- Reports invalid filenames with specific errors
- Suggests corrections for common issues
```bash
dd-excel-export /path/to/corpus
```

Generates:

- `corpus_pivot_table_data.csv` - Single CSV with 23 columns including:
  - Date, Year, Month, Day, Quarter, Decade
  - Newspaper_Code, Newspaper_Name, Region, Publisher_Type
  - Administrative_Level, Province, Publication_Type
  - Pages_In_Issue, Completeness_Pct
  - Has_Missing_Pages, Has_Duplicate_Pages, Is_Complete_Issue
  - Primary_Library, All_Libraries, Num_Libraries
  - Estimated_Frequency, Avg_Gap_Days
- `corpus_analysis_workbook.xlsx` - Multi-sheet Excel workbook with:
  - Issues_Data (main pivot table sheet)
  - Newspapers_Summary
  - Library_Holdings
  - Yearly_Statistics
  - Missing_Issues
  - Issue_Completeness
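If you prefer the standard library over Excel, the exported CSV can be cross-tabulated directly. A small sketch, using column names from the list above (the sample rows are invented for illustration):

```python
import csv
import io
from collections import Counter

# Invented sample standing in for corpus_pivot_table_data.csv;
# only the column names are taken from the export's documented schema.
sample = """Newspaper_Code,Year,Pages_In_Issue
TID,1964,4
TID,1964,4
QTN,1958,2
"""

# Count issues per (newspaper, year) -- the same cross-tabulation a
# pivot table would produce.
counts = Counter()
for row in csv.DictReader(io.StringIO(sample)):
    counts[(row["Newspaper_Code"], row["Year"])] += 1

print(counts[("TID", "1964")])  # → 2
```

For a real run, replace `io.StringIO(sample)` with `open('corpus_pivot_table_data.csv')`.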
```bash
dd-analyze-advanced /path/to/corpus
```

Generates:

- `corpus_statistics.json` - Complete statistics in JSON
- `corpus_detailed.csv` - Every page with metadata
- `corpus_issues.csv` - Every issue with completeness flags
- `library_holdings.csv` - Library holdings in tabular format
- `missing_issues.csv` - Missing issues estimates by newspaper
```bash
dd-reports /path/to/corpus
```

Generates:

- `library_holdings_report.txt` - Detailed holdings breakdown
- `missing_issues_report.txt` - Gap analysis with estimates
- Python 3.6 or higher
- No required dependencies for basic functionality
- Optional: `openpyxl` for Excel export (`.xlsx` files)

```bash
pip install openpyxl  # Optional, for Excel export
```

```python
from dd_corpus_tools import NewspaperCorpusAnalyzer

analyzer = NewspaperCorpusAnalyzer('/path/to/corpus')
analyzer.scan_corpus()

# Get totals
total_newspapers = len(analyzer.data)
total_issues = sum(len(dates) for dates in analyzer.issues.values())
total_pages = sum(analyzer.pages_by_newspaper.values())
print(f"Newspapers: {total_newspapers}")
print(f"Issues: {total_issues}")
print(f"Pages: {total_pages}")
```

```python
newspaper = 'TID'  # Tibet Daily
if newspaper in analyzer.data:
    issues = len(analyzer.issues[newspaper])
    pages = analyzer.pages_by_newspaper[newspaper]
    years = sorted(analyzer.data[newspaper].keys())
    print(f"{newspaper}:")
    print(f"  Issues: {issues}")
    print(f"  Pages: {pages}")
    print(f"  Years: {min(years)}-{max(years)}")
```

```python
for library in sorted(analyzer.libraries.keys()):
    total_pages = sum(analyzer.libraries[library].values())
    total_issues = sum(len(analyzer.library_issues[library][np])
                       for np in analyzer.library_issues[library])
    newspapers = len(analyzer.libraries[library])
    print(f"{library}: {newspapers} newspapers, {total_issues} issues, {total_pages} pages")
```

```python
from dd_corpus_tools import ConsolidatedExporter

exporter = ConsolidatedExporter('/path/to/corpus')
exporter.scan_corpus()
exporter.export_all()  # Creates CSV + Excel workbook
```

More examples in examples/example_usage.py.
We welcome contributions! Please see CONTRIBUTING.md for guidelines.
```bash
# Clone repository
git clone https://github.com/Divergent-Discourses/DDC_CorpusStatistics.git
cd DDC_CorpusStatistics

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install in development mode with dev dependencies
pip install -e .[dev,excel]

# Run tests
pytest

# Format code
black dd_corpus_tools/
```

If you use these tools in your research, please cite:
```bibtex
@software{dd_corpus_tools,
  title = {Divergent Discourses Corpus Analysis Tools},
  author = {{Divergent Discourses Project}},
  year = {2025},
  url = {https://github.com/Divergent-Discourses/DDC_CorpusStatistics},
  note = {Tools for analyzing Tibetan newspaper corpus (1950-1965)}
}
```

For the corpus itself:

```bibtex
@article{erhard2025divergent,
  title = {The Divergent Discourses Corpus: A Digital Collection of Early Tibetan Newspapers from the 1950s and 1960s},
  author = {Erhard, Franz Xaver},
  journal = {Revue d'Etudes Tibétaines},
  number = {74},
  pages = {44--80},
  year = {2025},
  month = {February}
}
```

This project is part of the Divergent Discourses research project, funded by:
- Deutsche Forschungsgemeinschaft (DFG) - Project number 508232945
- Arts and Humanities Research Council (AHRC) - Project reference AH/X001504/1
- Project Website: https://research.uni-leipzig.de/diverge/
- GitHub Issues: https://github.com/Divergent-Discourses/DDC_CorpusStatistics/issues
- Divergent Discourses Corpus - Access the digitized newspapers
- Divergent Discourses Project - Main research project
See CHANGELOG.md for detailed version history.
- Documentation: https://github.com/Divergent-Discourses/DDC_CorpusStatistics/tree/main/docs
- Examples: https://github.com/Divergent-Discourses/DDC_CorpusStatistics/tree/main/examples
- Issues: https://github.com/Divergent-Discourses/DDC_CorpusStatistics/issues
- Discussions: Use GitHub Issues for questions and discussions