3 changes: 3 additions & 0 deletions .github/workflows/tests.yml
@@ -22,12 +22,15 @@ jobs:
      uses: astral-sh/setup-uv@v6
      with:
        python-version: ${{ matrix.python-version }}
+       enable-cache: false

    - name: Install dependencies
      run: uv sync --all-groups

    - name: Run unit tests
      run: uv run pytest tests/ -n auto -m "not integration" -v
+     env:
+       ODA_READER_CACHE_DIR: ${{ runner.temp }}/oda_cache

    - name: Integration Tests
      if: github.event_name == 'pull_request' && matrix.python-version == '3.12' && matrix.os == 'ubuntu-latest'
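The `ODA_READER_CACHE_DIR` variable added above redirects the cache during test runs. A minimal sketch of the same override for local use, assuming the library reads the variable the way this workflow implies:

```python
import os

# Assumption: ODA Reader honors ODA_READER_CACHE_DIR as the workflow suggests;
# set it before importing so the cache lands in a throwaway directory.
os.environ["ODA_READER_CACHE_DIR"] = "/tmp/oda_cache"

from oda_reader import download_dac1

# First call populates /tmp/oda_cache; identical re-runs hit the cache.
data = download_dac1(start_year=2022, end_year=2022)
```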
3 changes: 3 additions & 0 deletions CHANGELOG.md
@@ -1,5 +1,8 @@
# Changelog for oda_reader

+## 1.3.1 (2025-06-27)
+- Improves cache management for very large files. Introduces tests and improves documentation.

## 1.3.0 (2025-06-16)
- Improves cache management.

81 changes: 8 additions & 73 deletions docs/docs/advanced.md
@@ -45,9 +45,9 @@ OECD occasionally changes dataflow versions (schema updates). ODA Reader handles

When a dataflow version returns 404 (not found), ODA Reader automatically:

-1. Tries the configured version (e.g., `1.0`)
-2. If 404, retries with `0.9`
-3. Continues decrementing: `0.8`, `0.7`, `0.6`
+1. Tries the configured version (e.g., `1.5`)
+2. If 404, retries with `1.4`
+3. Continues decrementing: `1.3`, `1.2`, `1.1`
4. Returns data from the first successful version (up to 5 attempts)

This means your code keeps working even when OECD makes breaking schema changes.
@@ -58,9 +58,9 @@ This means your code keeps working even when OECD makes breaking schema changes.
from oda_reader import download_dac1

# ODA Reader will automatically try:
-# 1.0 -> 404
-# 0.9 -> 404
-# 0.8 -> Success! Returns data with version 0.8
+# 1.5 -> 404
+# 1.4 -> 404
+# 1.3 -> Success! Returns data with version 1.3
data = download_dac1(start_year=2022, end_year=2022)
```
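For intuition, a minimal sketch of this retry pattern; `fetch` and the other names below are illustrative, not ODA Reader internals:

```python
from requests.exceptions import HTTPError

def download_with_fallback(fetch, start_version: str = "1.5", attempts: int = 5):
    """Try successively older dataflow versions until one returns data."""
    version = float(start_version)
    for _ in range(attempts):
        try:
            return fetch(f"{version:.1f}")  # "1.5", then "1.4", ...
        except HTTPError as err:
            if err.response is None or err.response.status_code != 404:
                raise  # only a 404 triggers the fallback
            version = round(version - 0.1, 1)
    raise RuntimeError(f"No working dataflow version found in {attempts} attempts")
```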

@@ -71,11 +71,11 @@ You'll see a message indicating which version succeeded.
You can specify an exact dataflow version:

```python
-# Force use of version 0.8
+# Force use of version 1.3
data = download_dac1(
    start_year=2022,
    end_year=2022,
-    dataflow_version="0.8"
+    dataflow_version="1.3"
)
```

@@ -177,26 +177,6 @@ combined = pd.merge(
- Column names and codes must align
- Filter carefully to avoid double-counting

-## Custom Schema Handling
-
-If you need custom schema translation beyond built-in options:
-
-### Access Raw Data and Translate Manually
-
-```python
-# Get raw API data
-data = download_dac1(
-    start_year=2022,
-    end_year=2022,
-    pre_process=False,
-    dotstat_codes=False
-)
-
-# Apply custom transformations
-data = data.rename(columns={'DONOR': 'donor_custom'})
-data['donor_custom'] = data['donor_custom'].map(my_custom_mapping)
-```

### Load Schema Mapping Files

```python
@@ -234,51 +214,6 @@ def get_crs_data():
    return pd.read_parquet("/data/crs_full.parquet")
```

-### Refresh Strategy
-
-```python
-from pathlib import Path
-from datetime import datetime, timedelta
-
-def refresh_if_old(file_path, max_age_days=7):
-    """Re-download if file is older than max_age_days"""
-    path = Path(file_path)
-
-    if not path.exists():
-        print("File doesn't exist, downloading...")
-        bulk_download_crs(save_to_path=file_path)
-        return
-
-    file_age = datetime.now() - datetime.fromtimestamp(path.stat().st_mtime)
-
-    if file_age > timedelta(days=max_age_days):
-        print(f"File is {file_age.days} days old, refreshing...")
-        bulk_download_crs(save_to_path=file_path)
-    else:
-        print(f"File is recent ({file_age.days} days old), using cached version")
-
-# Use in pipeline
-refresh_if_old("/data/crs_full.parquet", max_age_days=7)
-crs_data = pd.read_parquet("/data/crs_full.parquet")
-```
-
-### Memory-Efficient Aggregation
-
-```python
-# Process bulk CRS in chunks, aggregate results
-sector_totals = {}
-
-for chunk in bulk_download_crs(as_iterator=True):
-    # Aggregate by sector
-    sector_sums = chunk.groupby('purpose_code')['usd_commitment'].sum()
-
-    # Accumulate
-    for sector, amount in sector_sums.items():
-        sector_totals[sector] = sector_totals.get(sector, 0) + amount
-
-print(f"Total sectors: {len(sector_totals)}")
-```

## Debugging Tips

### Enable Verbose Logging
33 changes: 0 additions & 33 deletions docs/docs/bulk-downloads.md
@@ -71,8 +71,6 @@ bulk_download_crs(
)
```

-The reduced version omits some descriptive columns but retains all flow amounts and key dimensions.

## Memory-Efficient Processing with Iterators

For very large files, process in chunks to avoid loading the entire dataset into memory:
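The chunked example that follows in the full docs is elided from this diff. A sketch of the pattern, reusing the `as_iterator=True` flag and the column names that appear in the aggregation example elsewhere in these docs:

```python
import pandas as pd
from oda_reader import bulk_download_crs

# Stream the bulk CRS file chunk by chunk, keeping only health-sector rows
# (purpose codes starting with 121/122) so the full file never sits in memory.
health_chunks = []
for chunk in bulk_download_crs(as_iterator=True):
    mask = chunk["purpose_code"].astype(str).str.startswith(("121", "122"))
    health_chunks.append(chunk.loc[mask])

health = pd.concat(health_chunks, ignore_index=True)
print(f"Health-sector rows kept: {len(health):,}")
```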
@@ -206,37 +204,6 @@ Bulk downloads already have:

See [Schema Translation](schema-translation.md) for detailed comparison.

-## Combining Bulk and API Downloads
-
-You can mix approaches:
-
-```python
-# Download full CRS as bulk file
-crs_full = bulk_download_crs()
-
-# Use API for recent updates or specific queries
-crs_recent = download_crs(
-    start_year=2023,
-    end_year=2023,
-    filters={"donor": "USA"}
-)
-
-# Combine if schemas match
-# (you may need to harmonize column names first)
-```
-
-## Performance Comparison
-
-Approximate times (varies by network speed and OECD server load):
-
-| Method | Dataset Size | Time |
-|--------|-------------|------|
-| API download (filtered) | 10,000 rows | 10-30 seconds |
-| API download (large query) | 100,000 rows | 2-5 minutes |
-| Bulk download CRS | ~2 million rows | 1-2 minutes |
-| Bulk + iterator (filter) | Process 2 million rows | 2-5 minutes |
-
-Bulk downloads are consistently fast regardless of query complexity, while API times vary significantly with query size.

## Troubleshooting

19 changes: 14 additions & 5 deletions docs/docs/datasets.md
@@ -7,23 +7,25 @@ ODA Reader provides access to five datasets covering official development assist
| Dataset | What It Contains | Use When |
|---------|------------------|----------|
| **DAC1** | Aggregate flows by donor | Analyzing overall ODA trends, donor performance |
-| **DAC2a** | Bilateral flows by donor-recipient | Recipient-level analysis, who gives to whom |
+| **DAC2a** | Bilateral flows by donor-recipient | Recipient-level analysis |
| **CRS** | Project-level microdata | Sector analysis, project details, activity-level data |
| **Multisystem** | Multilateral system usage | Analyzing multilateral channels and contributions |
-| **AidData** | Chinese development finance | Non-DAC donor analysis, Chinese aid flows |
+| **AidData** | Chinese development finance | Chinese aid flows |

## DAC1: Aggregate Flows

**What it contains**: Total ODA and OOF by donor, aggregated across all recipients and sectors. This is the highest-level view of development assistance.

**Key dimensions**:

- Donor (bilateral donors and multilateral organizations)
- Measure type (ODA, OOF, grants, loans, etc.)
- Flow type (commitments, disbursements, grant equivalents)
- Price base (current or constant prices)
- Unit measure (USD millions, national currency, etc.)

**Use when**:

- You need donor-level totals
- Analyzing overall ODA trends over time
- Comparing donor performance
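The dataset's own example is elided from this diff. A hedged sketch of a typical DAC1 query follows; the filter codes are assumptions to verify with `get_available_filters("dac1")`:

```python
from oda_reader import download_dac1

# Donor-level ODA totals for the United States, 2018-2023.
oda_usa = download_dac1(
    start_year=2018,
    end_year=2023,
    filters={"donor": "USA"},
)
print(oda_usa.head())
```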
@@ -56,12 +58,14 @@ oda_constant = download_dac1(
**What it contains**: Bilateral ODA and OOF flows broken down by both donor and recipient country. Shows who gives to whom.

**Key dimensions**:

- Donor (bilateral donors)
- Recipient (receiving countries and regions)
- Measure type (bilateral ODA, imputed multilateral, etc.)
- Price base (current or constant)

**Use when**:

- Analyzing flows to specific recipient countries
- Understanding bilateral relationships
- Studying geographic distribution of aid
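As with DAC1, the file's example is elided here. A sketch under assumptions; the recipient code is illustrative and should be checked with `get_available_filters("dac2a")`:

```python
from oda_reader import download_dac2a

# Bilateral flows from France to Uganda ("UGA" is an assumed recipient code).
fra_uga = download_dac2a(
    start_year=2019,
    end_year=2022,
    filters={"donor": "FRA", "recipient": "UGA"},
)
```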
@@ -95,6 +99,7 @@ germany_eastafrica = download_dac2a(
**What it contains**: Individual project and activity-level data with detailed information about each development assistance activity. This is the most granular dataset.

**Key dimensions**:

- Donor
- Recipient
- Sector (purpose codes at various levels of detail)
@@ -104,6 +109,7 @@ germany_eastafrica = download_dac2a(
- Microdata flag (True for project-level, False for semi-aggregates)

**Use when**:

- You need project-level details (descriptions, amounts, sectors)
- Analyzing sector-specific flows
- Understanding implementation channels
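A hedged sketch of a project-level CRS query; the `microdata` keyword is an assumption inferred from the dimension list above, not a confirmed parameter:

```python
from oda_reader import download_crs

# Project-level rows for basic health (purpose code "12220").
basic_health_projects = download_crs(
    start_year=2022,
    end_year=2022,
    filters={"sector": "12220"},
    microdata=True,  # assumption: exposed as a download_crs parameter
)
```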
@@ -153,13 +159,15 @@ semi_agg = download_crs(
**What it contains**: Data on how DAC members use the multilateral aid system, including core contributions to multilateral organizations and earmarked funding.

**Key dimensions**:

- Donor
- Recipient (multilateral organizations)
- Channel (specific multilateral organizations)
- Flow type (commitments, disbursements)
- Measure type

**Use when**:

- Analyzing multilateral contributions
- Understanding core vs. earmarked funding
- Studying specific multilateral channels (World Bank, UN agencies, etc.)
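A hedged sketch of a multisystem query. The channel code "44002" (World Bank IDA) appears in the filtering docs; treat any other codes as lookups rather than givens:

```python
from oda_reader import download_multisystem

# Contributions channelled through the World Bank's IDA.
ida = download_multisystem(
    start_year=2020,
    end_year=2023,
    filters={"channel": "44002"},
)
```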
@@ -191,16 +199,17 @@ ida_contributions = download_multisystem(
**What it contains**: Project-level data on Chinese development finance activities, compiled by AidData. Covers official finance from China that may not be reported to the OECD.

**Key dimensions**:

- Commitment year
- Recipient country
- Sector
- Project descriptions
- Flow amounts and types

**Use when**:

- Analyzing Chinese development finance
-- Comparing traditional DAC donors with China
-- Studying non-DAC donor activities
+- Comparing DAC donors with China

**Example**:

@@ -213,7 +222,7 @@
# AidData is downloaded as bulk file, filtered by year after download
```

-**Note**: AidData comes from Excel files, not the OECD API. It uses a different schema than DAC datasets.
+**Note**: AidData comes from Excel files published on the AidData website, not from the OECD API. It uses a different schema than DAC datasets.

## Discovering Available Filters

8 changes: 6 additions & 2 deletions docs/docs/filtering.md
@@ -101,12 +101,13 @@ multisystem_filters = get_available_filters("multisystem")
### DAC1 and DAC2a

Common dimensions:

- `donor` - Donor country (ISO3 codes like "USA", "GBR", "FRA")
- `recipient` - Recipient country or region (DAC2a only)
- `measure` - Type of flow (ODA, OOF, grants, loans, etc.)
- `flow_type` - Commitments, disbursements, net flows, etc.
- `price_base` - "V" for current prices, "Q" for constant prices
- `unit_measure` - "USD" for US dollars, "XDC" for national currency
- `unit_measure` - "USD" for US dollars

**Example**: Get net ODA disbursements in constant prices:
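The code for this example is elided from the diff. A sketch under stated assumptions: `price_base="Q"` comes from the list above, while the `flow_type` code is hypothetical and must be looked up:

```python
from oda_reader import download_dac1

# Net ODA disbursements in constant prices (hypothetical flow_type code).
net_oda_constant = download_dac1(
    start_year=2015,
    end_year=2022,
    filters={
        "flow_type": "1160",  # hypothetical code for net disbursements
        "price_base": "Q",    # constant prices
    },
)
```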

@@ -127,6 +128,7 @@
### CRS (Creditor Reporting System)

CRS has additional dimensions:

- `sector` - Purpose codes (5-digit codes like "12220" for basic health)
- `channel` - Implementing organization (government, NGO, multilateral, etc.)
- `modality` - Grant, loan, equity, etc.
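A sketch combining these CRS-specific dimensions; the sector code "12220" comes from these docs, while the modality code is an assumption:

```python
from oda_reader import download_crs

# Basic health activities delivered as project-type interventions.
basic_health_grants = download_crs(
    start_year=2021,
    end_year=2022,
    filters={
        "sector": "12220",  # basic health
        "modality": "C01",  # assumed code for project-type interventions
    },
)
```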
@@ -178,6 +180,7 @@ The `_T` suffix means "total" - it aggregates across that dimension to avoid dou
### Multisystem

Multisystem tracks multilateral contributions:

- `donor` - Contributing country
- `channel` - Specific multilateral organization (e.g., "44002" for World Bank IDA)
- `flow_type` - Commitments, disbursements
@@ -214,7 +217,8 @@ print(data['measure'].unique()) # See all measure codes

3. **Use trial and error**: Download a small query and examine column values
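For instance, a small sketch of that trial-and-error loop, assuming the default .Stat-style column names:

```python
from oda_reader import download_dac1

# Pull a single year, then inspect the distinct codes each dimension contains.
sample = download_dac1(start_year=2022, end_year=2022)
for col in ["donor", "measure", "flow_type"]:
    print(col, sample[col].unique()[:10])
```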

-**Note**: Codes differ between API schema and .Stat schema. By default, ODA Reader returns .Stat codes. See [Schema Translation](schema-translation.md) for details.
+**Note**: Codes differ between the API schema and the .Stat schema. When making API calls, you must use the API schema codes. However, by default, ODA Reader returns .Stat codes. See [Schema Translation](schema-translation.md) for details.

## Empty Filters

6 changes: 2 additions & 4 deletions docs/docs/getting-started.md
@@ -13,7 +13,7 @@ pip install oda-reader
Or using uv (recommended for faster installs):

```bash
-uv pip install oda-reader
+uv add oda-reader
```

That's it! ODA Reader and its dependencies (pandas, requests, pyarrow, etc.) are now installed.
@@ -121,6 +121,4 @@ Now that you've downloaded your first datasets, explore:

**Query is slow**: First-time queries can take 10-30 seconds as ODA Reader fetches from OECD's API. Subsequent identical queries are instant due to caching.

-**Rate limit errors**: By default, ODA Reader limits to 20 requests per 60 seconds. This should prevent rate limit errors. If you see them, your cache might have been cleared. Wait a minute and retry.
-
-**Import errors**: Make sure you installed with dependencies: `pip install oda-reader` (not just `oda_reader`).
+**Rate limit errors**: By default, ODA Reader limits to 20 requests per hour. This should prevent rate limit errors. If you see them, your cache might have been cleared. Wait and retry.
11 changes: 6 additions & 5 deletions docs/docs/index.md
@@ -1,12 +1,12 @@
# ODA Reader

-**Programmatic access to OECD DAC data without the headaches**
+**Programmatic access to OECD DAC data**

-Working with OECD Development Assistance Committee (DAC) data is frustrating. You need to navigate multiple datasets (DAC1, DAC2a, CRS), understand complex SDMX API syntax, manage rate limits, and reconcile different schema versions. The OECD doesn't provide any first-party Python library to help.
+Working with OECD Development Assistance Committee (DAC) data can be frustrating. You need to navigate multiple datasets (DAC1, DAC2a, CRS, and more), understand complex SDMX API syntax, manage tight rate limits, and reconcile different schema versions. The OECD doesn't provide any first-party Python library to help.

-Worse, the OECD has a habit of introducing undocumented schema changes, breaking link URLs, and making format changes without notice. What works today might break tomorrow, making it extremely difficult to build robust data pipelines for research and analysis.
+Unfortunately, the OECD has a habit of introducing undocumented schema changes, breaking link URLs, and making format changes without notice. What works today might break tomorrow, making it very difficult to build robust data pipelines for research and analysis.

-ODA Reader eliminates these headaches. It provides a unified Python interface that handles complexity for you: automatic version fallbacks when schemas change, consistent APIs across datasets, smart caching to reduce dependency on flaky endpoints, and schema translation between API and legacy formats.
+ODA Reader eliminates these headaches. It provides a unified Python interface that handles the complexity for you: automatic search for the latest schema version, consistent APIs across datasets, smart caching to reduce dependency on flaky endpoints, and schema translation between the Data Explorer API and OECD.Stat formats.

**Key features**:

@@ -15,7 +15,8 @@ ODA Reader eliminates these headaches. It provides a unified Python interface th
- **Bulk download large files** with memory-efficient streaming for the full CRS (1GB+)
- **Automatic rate limiting** and caching to work within API constraints
- **Schema translation** between Data Explorer API and OECD.Stat formats
-- **Version fallback** automatically retries with older schema versions when OECD makes breaking changes
+- **Version fallback** automatically searches for the most recent working schema version, since dataflow versions can change unexpectedly with new data releases

**Built for researchers, analysts, and developers** who need reliable, programmatic access to ODA data without fighting infrastructure.
