diff --git a/DART/README.md b/DART/README.md new file mode 100644 index 0000000..c3fb99d --- /dev/null +++ b/DART/README.md @@ -0,0 +1,62 @@ +# DART/ + +## Purpose + +DART (Data Access Research Tool) validation, analysis, and visualization outputs from LIS Key Figures. Contains scripts for comparing LISSY microdata results with official DART tables and producing median income plots. + +## Contents + +- `dart_validation.py` - Validates LISSY results against DART median income tables +- `plot_dart_tables.py` - Generates visualizations from DART CSV tables +- `Methodological_Notes.md` - LIS Key Figures methodology (population coverage, income concepts, equivalence scales) +- `Methodological_Remarks.md` - Extended methodological documentation +- `dart-table_*.csv` - DART reference tables (DHI median, poverty rates) +- `dart_*_plot.png` - Generated visualizations +- **MIMA/** - Median Income Moving Average (MIMA) workflow (detailed README inside) + +## Quick start + +**Validate LISSY vs DART:** +```bash +python DART/dart_validation.py +# Outputs: dart_dhi_median_validation.csv, dart_dhi_median_error_facts.txt +``` + +**Plot DART tables:** +```bash +python DART/plot_dart_tables.py +# Outputs: PNG plots in DART/ +``` + +**Run MIMA workflow:** +```bash +python compute_mima.py \ + --ma-number 5 \ + --countries "Canada,Germany,Luxembourg,United Kingdom,United States" \ + --start-year 1985 --end-year 2021 \ + --input-path "xlsxConverted/csvFiles/dart-med-pop_decomp-dhi.csv" \ + --output-path "DART" +# Outputs: DART/MIMA/csv/ and DART/MIMA/visualizations/ +``` + +See `DART/MIMA/README.md` for full MIMA documentation. + +## Conventions + +- CSV tables use countries as rows, years as columns +- Scripts run from repository root (not from DART/ directory) +- Validation scripts compare LISSY outputs to DART tables and report error statistics + +## Privacy & Secrets + +No microdata is stored here - only aggregated tables and validation outputs. LISSY jobs must be run separately on the LIS remote server.
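Because the DART tables are stored wide (countries as rows, years as columns) while validation and plotting code generally wants long data, reshaping is usually the first step. A minimal pandas sketch — the `country` header, column names, and values here are illustrative, not the exact DART layout:

```python
import io

import pandas as pd

# Hypothetical excerpt of a DART-style table: countries as rows, years as columns.
csv_text = """country,2018,2019,2020
Canada,30500,31000,31200
Germany,28900,29300,29100
"""

wide = pd.read_csv(io.StringIO(csv_text))

# Melt to long format (country, year, value) for joins against LISSY outputs.
long = wide.melt(id_vars="country", var_name="year", value_name="median_dhi")
long["year"] = long["year"].astype(int)

print(long.sort_values(["country", "year"]))
```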
+ +## Related Folders + +- **LISSY/DART_Validation/** - Alternative DART validation using R +- **xlsxConverted/csvFiles/** - Source DART tables in CSV format +- **compute_mima.py** (root) - MIMA computation script + +## Maintainers + +DART tables sourced from [LIS DART Portal](https://www.lisdatacenter.org/data-access/dart/). diff --git a/LISSY/DART_Validation/README.md b/LISSY/DART_Validation/README.md new file mode 100644 index 0000000..2d5f304 --- /dev/null +++ b/LISSY/DART_Validation/README.md @@ -0,0 +1,61 @@ +# LISSY/DART_Validation/ + +## Purpose + +Validates LISSY microdata analysis results against official DART aggregated tables. Compares median income and poverty rates computed from LIS microdata (via LISSY jobs) with published DART Key Figures. + +## Contents + +- `validate_lissy_vs_dart.py` - Python validation script comparing LISSY vs DART for DHI/MHI metrics +- `R_code_steps.md` - Detailed R code methodology documentation (DART compliance steps) +- `lissy_pop_median_*.csv` - LISSY job outputs (median income and poverty rates, PPP-adjusted) +- `dart_table_*.csv` - DART reference tables (downloaded from LIS DART portal) +- `comparison_*.png` - Scatter plots showing LISSY vs DART agreement +- `error_moments_*.csv` - Error statistics (mean, std, skew, kurtosis) by country + +## Quick start + +**Run validation:** +```bash +cd LISSY/DART_Validation +python validate_lissy_vs_dart.py +``` + +**Outputs:** +- `comparison_*.png` - Visual comparisons (scatter plots with 45° line) +- `error_moments_*.csv` - Statistical summaries of discrepancies + +## Inputs Required + +1. **DART tables** (already present): + - `dart_table_dhi_median.csv`, `dart_table_dhi_pr.csv` (DHI median/poverty rate) + - `dart_table_mhi_median.csv`, `dart_table_mhi_pr.csv` (MHI median/poverty rate) + +2. 
**LISSY outputs** (run LISSY jobs separately): + - `lissy_pop_median_dhi_ppp_median_85-21.csv` + - `lissy_pop_median_mhi_ppp_median_85-21.csv` + + These files must be generated by running R/Stata jobs on the LISSY remote system (see `LISSY/Tutorial/` for how to submit jobs). + +## Conventions + +- LISSY outputs use long format (country, year, value columns) +- DART tables use wide format (countries as rows, years as columns) +- Validation compares PPP-adjusted values (2017 USD) +- Error moments help identify systematic biases or noisy countries + +## Privacy & Secrets + +**Important:** LISSY jobs access LIS microdata under strict privacy rules. Do NOT commit microdata to this repo. Only aggregated outputs (medians, poverty rates) are stored here. + +See `LISSY/README.md` for LISSY registration and job submission guidelines. + +## Related Folders + +- **DART/** - Alternative Python validation (`dart_validation.py`) +- **LISSY/Tutorial/** - LISSY onboarding and syntax examples +- **DART/Methodological_Notes.md** - DART computation methodology + +## Maintainers + +Validation pipeline for ensuring LISSY job outputs match published DART figures. diff --git a/LISSY/MIMA5/README.md b/LISSY/MIMA5/README.md new file mode 100644 index 0000000..3bc6e04 --- /dev/null +++ b/LISSY/MIMA5/README.md @@ -0,0 +1,60 @@ +# LISSY/MIMA5/ + +## Purpose + +MIMA5 (5-year Moving Average of Median Income) poverty rate analysis and visualizations. Contains LISSY job outputs computing poverty rates anchored to the 5-year moving average of median income, comparing DHI and MHI across countries. 
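The anchoring idea can be sketched in a few lines. This is an illustrative example only — the median series below is hypothetical, and the real MIMA5 values are computed on LISSY from microdata:

```python
import pandas as pd

# Hypothetical annual median-income series for one country (values illustrative).
medians = pd.Series(
    [30000, 30500, 31000, 30800, 31500, 32000, 32400],
    index=range(2015, 2022),
    name="median_dhi",
)

# MIMA5: 5-year moving average of the median; the first value appears
# once five years of data are available.
mima5 = medians.rolling(window=5).mean()

# A 50%-of-MIMA5 poverty line, as used by the *_50pr outputs.
poverty_line = 0.5 * mima5
print(poverty_line.dropna())
```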
+ +## Contents + +- `plotting_mima5_pr.py` - Generates 4 plots comparing poverty rates and MIMA5 trends +- `lissy_mima5_*.csv` - LISSY outputs (MIMA5-based poverty rates for DHI/MHI) +- `lissy_CPI_mima5_*.csv` - CPI-adjusted MIMA5 poverty rates +- `*.png` - Visualizations (poverty rates and MIMA5 time series) +- **OLD/** - Archived outputs from previous runs + +## Quick start + +**Generate plots from existing CSV files:** +```bash +cd LISSY/MIMA5 +python plotting_mima5_pr.py +``` + +**Outputs:** +- `mima5_dhi_50pp_pr.png` - DHI poverty rate (50% of MIMA5) +- `mima5_mhi_50pp_pr.png` - MHI poverty rate (50% of MIMA5) +- `mima5_dhi.png` - MIMA5 DHI time series +- `mima5_mhi.png` - MIMA5 MHI time series +- CPI-adjusted variants: `CPI_mima5_*.png` + +## Inputs Required + +**LISSY job outputs** (must be generated separately on LISSY): +- `lissy_mima5_dhi_50pr.csv` - DHI poverty rate @ 50% MIMA5 +- `lissy_mima5_mhi_50pr.csv` - MHI poverty rate @ 50% MIMA5 +- `lissy_CPI_mima5_dhi_50pr.csv` - CPI-adjusted DHI +- `lissy_CPI_mima5_mhi_50pr.csv` - CPI-adjusted MHI + +Run R/Stata jobs on LISSY to compute these (see `LISSY/Tutorial/` for syntax). + +## Conventions + +- CSV files use long format: `country, year, pr, mima5` +- Plots fix country colors: Canada (green), Germany (red), UK (orange), US (blue) +- Outputs are PNG format (300 DPI recommended for publication) + +## Privacy & Secrets + +Only aggregated poverty rates and medians are stored. **Do NOT commit LIS microdata.** + +LISSY jobs must be submitted via the [LISSY web interface](https://www.lisdatacenter.org/data-access/lissy/) or email. See `LISSY/README.md` for registration. + +## Related Folders + +- **DART/MIMA/** - MIMA computation workflow (using DART tables, not LISSY microdata) +- **METIS-LIS/mima_indicator.md** - MIMA methodology documentation +- **compute_mima.py** (root) - Python MIMA workflow for DART data + +## Maintainers + +MIMA5 analysis for poverty persistence research using LIS microdata. 
diff --git a/LISSY/README.md b/LISSY/README.md new file mode 100644 index 0000000..4965f1a --- /dev/null +++ b/LISSY/README.md @@ -0,0 +1,73 @@ +# LISSY/ + +## Purpose + +LISSY (LIS remote-execution system) documentation, tutorials, and outputs. This folder contains onboarding materials, validation scripts, and analysis results from LIS microdata jobs. + +## Contents + +- **Tutorial/** - LISSY onboarding, syntax examples, and exercises (comprehensive README inside) +- **DART_Validation/** - Validates LISSY job outputs against DART aggregated tables +- **MIMA5/** - MIMA5 poverty rate analysis and visualizations + +## Quick start + +**New to LISSY?** Start with the Tutorial: +```bash +# Read the tutorial README +cat LISSY/Tutorial/README.md + +# Browse R and Stata syntax examples +ls LISSY/Tutorial/Exercises_syntax_files-R-Part_II/ +``` + +**Validate your LISSY results:** +```bash +cd LISSY/DART_Validation +python validate_lissy_vs_dart.py +``` + +**Plot MIMA5 poverty rates:** +```bash +cd LISSY/MIMA5 +python plotting_mima5_pr.py +``` + +## What is LISSY? + +LISSY is a remote-execution system that allows researchers to access [LIS](https://www.lisdatacenter.org/) and [LWS](https://www.lisdatacenter.org/data-access/lws/) microdata while adhering to privacy restrictions. Researchers submit statistical programs (R, SAS, SPSS, Stata) through a web-based interface, and LISSY returns aggregated results. + +## How to Register + +[Register for LISSY access](https://www.lisdatacenter.org/data-access/lissy/) (1-year access, renewable annually). + +## Privacy & Secrets + +**Critical:** LIS microdata is confidential. NEVER commit microdata to this repository. 
+ +- Submit jobs via the [LISSY web interface](https://www.lisdatacenter.org/data-access/lissy/) +- Only commit **aggregated outputs** (tables, plots, summary statistics) +- Individual-level data violates LIS terms of use +- See [LIS Privacy Policy](https://www.lisdatacenter.org/about-lis/terms-of-use/) + +## Onboarding Resources + +- **Tutorial/** folder in this repo (syntax examples, exercises) +- [LIS Self-Teaching Materials](https://www.lisdatacenter.org/resources/self-teaching/) +- [METIS Documentation Portal](https://www.lisdatacenter.org/frontend) +- [LIS FAQ](https://www.lisdatacenter.org/resources/faq/) +- Contact: [usersupport@lisdatacenter.org](mailto:usersupport@lisdatacenter.org) + +## Citation + +All papers using LIS microdata must be submitted to the LIS Working Paper series before publication. See [General Policies](https://www.lisdatacenter.org/working-papers/#general). + +## Related Folders + +- **METIS-LIS/** - LIS codebooks and variable documentation +- **DART/** - DART validation using aggregated tables (no microdata) +- **compute_mima.py** (root) - MIMA workflow using DART tables + +## Maintainers + +Documentation and examples sourced from [LIS Cross-National Data Center](https://www.lisdatacenter.org/). diff --git a/METIS-LIS/README.md b/METIS-LIS/README.md new file mode 100644 index 0000000..07277ca --- /dev/null +++ b/METIS-LIS/README.md @@ -0,0 +1,41 @@ +# METIS-LIS/ + +## Purpose + +Documentation and metadata for LIS (Luxembourg Income Study) datasets, including codebooks, MIMA indicator definitions, and wave/date mappings. 
+ +## Contents + +- `codebook.pdf` - LIS variable codebook (names, definitions, codes) +- `mima_indicator.md` - MIMA (Median Income Moving Average) indicator methodology +- `waves-and-dates.md` - LIS data collection waves and reference dates + +## Quick start + +**View codebook:** +```bash +open METIS-LIS/codebook.pdf # macOS +xdg-open METIS-LIS/codebook.pdf # Linux +``` + +**Review MIMA methodology:** +```bash +cat METIS-LIS/mima_indicator.md +``` + +## Conventions + +- Files are reference documentation (read-only) +- PDF codebook is the authoritative source for LIS variable definitions +- Markdown files provide concise summaries for quick reference + +## Related Resources + +- [LIS METIS Portal](https://www.lisdatacenter.org/frontend) - Full online documentation +- [LIS Database](https://www.lisdatacenter.org/) - Official LIS homepage +- **DART/MIMA/** - Implementation of MIMA methodology +- **LISSY/** - Remote execution system for LIS microdata + +## Maintainers + +Documentation sourced from [LIS Cross-National Data Center](https://www.lisdatacenter.org/). diff --git a/README.md b/README.md index d59f8ee..df1ff19 100644 --- a/README.md +++ b/README.md @@ -27,3 +27,29 @@ This project serves to: * **Inform Policy Evaluation:** Offer policymakers a tool for evidence-based assessment of poverty alleviation programs and policies. * **Enable Comparative Studies:** Facilitate cross-national and cross-temporal comparisons of poverty and the effectiveness of different policy interventions. * **Promote Data-Driven Decision-Making:** Support strategic decisions regarding resource allocation and policy design in the global effort to reduce poverty. + +## Repository Map + +Navigate the repository structure using the links below. Each folder contains a README with purpose, quick start commands, and conventions. 
+ +| Folder | Description | Link | +|--------|-------------|------| +| **DART/** | DART validation, MIMA workflow, and methodological notes | [DART/README.md](DART/README.md) | +| **LISSY/** | LISSY remote-execution system documentation, tutorials, and outputs | [LISSY/README.md](LISSY/README.md) | +| **LISSY/Tutorial/** | LISSY onboarding materials and syntax examples (R, Stata) | [LISSY/Tutorial/README.md](LISSY/Tutorial/README.md) | +| **LISSY/DART_Validation/** | Validation pipeline comparing LISSY vs DART results | [LISSY/DART_Validation/README.md](LISSY/DART_Validation/README.md) | +| **LISSY/MIMA5/** | MIMA5 poverty rate analysis and visualizations | [LISSY/MIMA5/README.md](LISSY/MIMA5/README.md) | +| **METIS-LIS/** | LIS codebooks, MIMA indicator docs, and wave/date mappings | [METIS-LIS/README.md](METIS-LIS/README.md) | +| **analysis/** | Parent folder for analytical pipelines | [analysis/README.md](analysis/README.md) | +| **analysis/data-availability/** | Submatrix analysis for optimal country-year panels | [analysis/data-availability/README.md](analysis/data-availability/README.md) | +| **scripts/** | Utility scripts (HTML-to-Markdown converter) | [scripts/README.md](scripts/README.md) | +| **xlsxFiles/** | Source Excel data files (DART tables, codebooks) | [xlsxFiles/README.md](xlsxFiles/README.md) | +| **xlsxConverted/** | Auto-generated CSV/JSON/Markdown outputs | [xlsxConverted/README.md](xlsxConverted/README.md) | +| **docs/** | Project documentation and reference materials | [docs/README.md](docs/README.md) | + +### Key Entry Points + +- **New to LIS/LISSY?** Start with [LISSY/Tutorial/README.md](LISSY/Tutorial/README.md) +- **Run MIMA workflow:** See [DART/MIMA/README.md](DART/MIMA/README.md) and `compute_mima.py` +- **Validate DART data:** See [DART/README.md](DART/README.md) and [LISSY/DART_Validation/README.md](LISSY/DART_Validation/README.md) +- **Analyze data availability:** See 
[analysis/data-availability/README.md](analysis/data-availability/README.md) diff --git a/analysis/README.md b/analysis/README.md new file mode 100644 index 0000000..5af4c34 --- /dev/null +++ b/analysis/README.md @@ -0,0 +1,41 @@ +# analysis/ + +## Purpose + +Parent folder for analytical pipelines and research modules. Each subfolder contains a self-contained analysis with its own README, requirements, and outputs. + +## Contents + +- **data-availability/** - Submatrix analysis finding optimal country-year panels in OECD income data + +## Quick start + +Navigate to specific analysis folders for detailed instructions: + +```bash +# Run data availability analysis +cd analysis/data-availability +python run.py +``` + +See individual folder READMEs for requirements, inputs, and outputs. + +## Conventions + +- Each analysis subfolder is **self-contained** with its own `requirements.txt` +- Analysis scripts run from the **repository root** (not from analysis/) +- Use Python virtual environments to isolate dependencies +- Large outputs (plots, JSON results) may be gitignored - check folder READMEs for regeneration steps + +## Adding New Analyses + +1. Create subfolder: `analysis/my-analysis/` +2. Add `README.md` documenting purpose, inputs, outputs, and commands +3. Add `requirements.txt` if Python dependencies are needed +4. Include example run command in README +5. 
Update this parent README to list the new analysis + +## Related Folders + +- **xlsxConverted/csvFiles/** - Common input source for many analyses +- **DART/** - DART-specific analysis and validation diff --git a/analysis/data-availability/README.md b/analysis/data-availability/README.md index a3fd595..a1964aa 100644 --- a/analysis/data-availability/README.md +++ b/analysis/data-availability/README.md @@ -31,7 +31,7 @@ with columns: A consolidated summary is generated at: - analysis/data-availability/summary.md -## How to run +## Quick start From repository root: @@ -42,12 +42,31 @@ pip install -r analysis/data-availability/requirements.txt python analysis/data-availability/run.py ``` -Results will be written to analysis/data-availability/results/, multi-row CSVs to analysis/data-availability/results/multiple-rows/, and a summary to analysis/data-availability/summary.md. +**Outputs** (gitignored - regenerate locally): +- `analysis/data-availability/results/*.json` - Algorithm results +- `analysis/data-availability/results/multiple-rows/*.csv` - Multi-row algorithm outputs +- `analysis/data-availability/summary.md` - Consolidated summary + +**GitHub Actions:** Workflow available at `.github/workflows/data-availability-analysis.yml` ## Algorithms -- Greedy longest-streak (Phase 1): Sort countries by their longest consecutive-ones streak p_k^m. Start with the largest p^u and collect countries whose p_k^m exactly equals p^u. On the first non-match, record the row and shrink p^u by intersecting with the new country's p_k^m. Repeat. Phase 2 repeats on the dataset with the Phase 1 first-row countries removed. -- Greedy pivot coverage: Same mechanics as above, but each recorded pivot interval includes all countries that fully cover the interval (not only those whose longest streak equals it). -- Best consecutive window: Searches all year intervals [l, r] and picks the interval with the largest number of fully covered countries (ties broken by longer interval). 
-- Fixed L-year windows: Picks the best L-year moving window overall; offset-restricted variants limit start positions to l % L in {offset}. -- Max biclique: Exact backtracking search to find an arbitrary set of years and countries forming a full rectangle of ones, maximizing area (num_countries × length). Limited by a small time budget with pruning. +- **Greedy longest-streak (Phase 1):** Sort countries by longest consecutive-ones streak. Start with largest and collect matching countries. On mismatch, record row and shrink. Phase 2 repeats with Phase 1 first-row countries removed. +- **Greedy pivot coverage:** Same mechanics, but records all countries covering the pivot interval (not just those with matching longest streak). +- **Best consecutive window:** Searches all year intervals `[l, r]` for maximum country coverage (ties broken by longer interval). +- **Fixed L-year windows:** Best L-year moving window; offset-restricted variants limit start positions. +- **Max biclique:** Exact backtracking search for arbitrary year sets maximizing `num_countries × length` (time-limited with pruning). + +## Conventions + +- Input CSV must have `countries` column + year columns (4-digit years) +- Outputs are JSON/CSV format +- Large result files are gitignored (see `.gitignore`) + +## Privacy & Secrets + +No sensitive data - analysis uses publicly available DART aggregated tables. + +## Maintainers + +Analytical module for finding optimal cross-country longitudinal panels. diff --git a/docs/README.md b/docs/README.md index 3c89f43..6688d6f 100644 --- a/docs/README.md +++ b/docs/README.md @@ -1 +1,40 @@ -Put html file here. +# docs/ + +## Purpose + +Documentation and reference materials for the Poverty Project repository. 
+ +## Contents + +Currently minimal; intended for project-wide documentation such as: +- Contributor guides +- Data access instructions +- Analysis methodology overviews +- Onboarding materials + +## Quick start + +Place HTML documentation files here for automatic conversion to Markdown via the `scripts/convert-html-to-md.sh` workflow. + +**Add new documentation:** +```bash +# Place HTML file in docs/ +cp mydoc.html docs/ + +# Commit and push - automatic conversion to Markdown will run +git add docs/mydoc.html +git commit -m "docs: add new documentation" +git push +``` + +## Conventions + +- Use descriptive filenames +- Prefer Markdown (`.md`) over HTML when possible +- Link to external LIS documentation rather than duplicating content + +## Related Folders + +- **METIS-LIS/** - LIS-specific codebooks and metadata +- **LISSY/Tutorial/** - LISSY onboarding and usage examples +- **scripts/** - HTML-to-Markdown converter diff --git a/scripts/README.md b/scripts/README.md new file mode 100644 index 0000000..ed16344 --- /dev/null +++ b/scripts/README.md @@ -0,0 +1,35 @@ +# scripts/ + +## Purpose + +Utility scripts for repository automation and maintenance. + +## Contents + +- `convert-html-to-md.sh` - Automated HTML-to-Markdown converter using pandoc + +## Quick start + +The HTML converter runs automatically via GitHub Actions when HTML files are added to the repository. 
+ +**Manual execution:** +```bash +# Ensure pandoc is installed +sudo apt-get install pandoc # Linux +# brew install pandoc # macOS + +# Run converter +./scripts/convert-html-to-md.sh +``` + +## Conventions + +- Scripts are executable (`chmod +x`) +- Include usage comments at the top of each script +- Set environment variables via GitHub Actions workflows (see `.github/workflows/html-to-md.yml`) + +## GitHub Actions Integration + +- **Workflow**: `.github/workflows/html-to-md.yml` +- **Trigger**: Push events with HTML files +- **Output**: Markdown files committed automatically with `[skip html-to-md]` marker diff --git a/xlsxConverted/README.md b/xlsxConverted/README.md new file mode 100644 index 0000000..754b199 --- /dev/null +++ b/xlsxConverted/README.md @@ -0,0 +1,41 @@ +# xlsxConverted/ + +## Purpose + +Auto-generated output directory containing converted formats (CSV, JSON, Markdown) from Excel files in `xlsxFiles/`. + +## Contents + +- **csvFiles/** - CSV format (for R/Python/Stata analysis) +- **jsonFiles/** - JSON format (for web/API consumption) +- **mdFiles/** - Markdown tables (for documentation) + +## Quick start + +**Regenerate all conversions:** +```bash +python convert_excel.py --input-dir xlsxFiles --output-dir xlsxConverted +``` + +Or trigger via GitHub Actions workflow: `.github/workflows/convert-excel.yml` + +## Conventions + +- **Do not edit files manually** - they are auto-generated +- Files mirror the structure and naming of source Excel files +- Outputs are committed to the repository for reproducibility + +## Privacy & Secrets + +No sensitive data - all files are derived from public LIS/DART tables. + +## Regenerating Outputs + +If source Excel files are updated: +1. Run `python convert_excel.py` locally, or +2. 
Trigger the GitHub Actions workflow manually via Actions tab + +## Related Folders + +- **xlsxFiles/** - Source Excel files +- **analysis/data-availability/** - Consumes CSV files from this folder diff --git a/xlsxFiles/README.md b/xlsxFiles/README.md new file mode 100644 index 0000000..1d524db --- /dev/null +++ b/xlsxFiles/README.md @@ -0,0 +1,34 @@ +# xlsxFiles/ + +## Purpose + +Source data files in Excel format (`.xlsx`) used as input for analysis workflows and conversion pipelines. + +## Contents + +- `dart-*.xlsx` - DART median income tables by decomposition type (DHI, MHI, MHIT) and category (population, household type, urban/rural, etc.) +- `codebook.xlsx` - Variable definitions and metadata +- `our-lis-documentation*.xlsx` - LIS dataset documentation and availability matrices +- `variables-definition*.xlsx` - Variable mappings and definitions + +## Quick start + +These files are **read-only inputs**. Do not modify directly. + +**Convert to CSV/JSON/MD:** +```bash +python convert_excel.py --input xlsxFiles/<file>.xlsx --output xlsxConverted +``` + +Or use the automated workflow (`.github/workflows/convert-excel.yml`). + +## Conventions + +- Files are version-controlled (committed to repo) +- Use descriptive, structured names that encode metric, category, and income concept (e.g. `dart-med-pop_decomp-dhi.xlsx`) +- Large Excel files (>100MB) should be excluded via `.gitignore` + +## Related Folders + +- **xlsxConverted/** - Generated outputs (CSV, JSON, Markdown) +- **DART/** - Analysis using these data files
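The xlsx-to-CSV step that feeds `xlsxConverted/` can be sketched with pandas. `convert_workbook` below is a hypothetical helper, not the actual `convert_excel.py` interface; it only illustrates the sheet-per-CSV mirroring described above:

```python
from pathlib import Path

import pandas as pd


def convert_workbook(xlsx_path: Path, out_dir: Path) -> list[Path]:
    """Write each sheet of an Excel workbook to CSV, mirroring the source name."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    # sheet_name=None loads every sheet into a {name: DataFrame} dict.
    for sheet, frame in pd.read_excel(xlsx_path, sheet_name=None).items():
        out = out_dir / f"{xlsx_path.stem}-{sheet}.csv"
        frame.to_csv(out, index=False)
        written.append(out)
    return written
```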