Draft
100 commits
ac679de
first pass at scraping the fields in the UI
rdhyee Nov 9, 2023
b812b80
latest progress in working with iSamples API
rdhyee Dec 1, 2023
26347b0
current logic of how to adapt pysolr to query /thing/select
rdhyee Dec 14, 2023
f6bfc82
debugging pysolr get vs post query to /thing/select
rdhyee Dec 21, 2023
4870018
current state of my iSamples work on 2024.01.24 before I make major c…
rdhyee Jan 24, 2024
e2b6fc3
simple use Jupyter widgets
rdhyee Jan 26, 2024
c8a390b
a pass at getting this to run as a Docker container and also on mybinder
rdhyee Feb 15, 2024
1ab522c
add a first draft of a Python client for bulk data handling in the iS…
rdhyee Apr 4, 2024
4171f00
add ipytree to requirements.in
rdhyee Jul 2, 2024
6561ffe
a little start to clearning up isbclient.py
rdhyee Jul 8, 2024
cae3890
add a record_count method to IsbClient2
rdhyee Jul 9, 2024
717044a
adapted the facets function for IsbClient2
rdhyee Jul 9, 2024
6260351
adapted pivot for IsbClient2
rdhyee Jul 9, 2024
1863eec
Using Claude Sonnet 3.5 + back and forth from RY to produce a tutoria…
rdhyee Jul 11, 2024
4288730
add new dependencies
rdhyee Jul 11, 2024
694f456
new version of tutorial with using polars and reading into pandas -- …
rdhyee Jul 11, 2024
531ed59
reaching the limits of the Claude-assisted tutorial generation for ge…
rdhyee Jul 12, 2024
9759d2b
first version of trying to analyze the geoparquet files coming out of…
rdhyee Jul 12, 2024
533a867
update the version of minimal-notebook and removing jupytext in requi…
rdhyee Jul 18, 2024
01ea3f5
add code to install requirements if in google colab
rdhyee Jul 18, 2024
9779564
catchup: 2024.08.29
rdhyee Aug 29, 2024
0d08e4b
refactoring to make package pip installable
rdhyee Sep 11, 2024
2563fb4
setting up basic package structure for isamples_client
rdhyee Sep 16, 2024
f2bfc12
add git+https://github.com/rdhyee/isamples-python.git@exploratory#egg…
rdhyee Sep 16, 2024
0ea6998
ooops forgot to add pyproject.toml to the repo
rdhyee Sep 16, 2024
f069370
installing poetry not working yet -- but a stepping stone towards it …
rdhyee Oct 3, 2024
2b33382
runs until the end...but permission problem
rdhyee Oct 3, 2024
8f12090
clean up Dockerfile
rdhyee Oct 4, 2024
436bf23
changes to try to use poetry to install dependencies
rdhyee Oct 9, 2024
68b90dc
seems like we can use poetry now to install dependencies in google colab
rdhyee Oct 9, 2024
6fb8c19
rearranging files to go to examples dir
rdhyee Oct 10, 2024
6d9f74e
commit more files and modify .gitignore to not track poetry.lock
rdhyee Oct 10, 2024
ae38557
clear cell outputs
rdhyee Oct 11, 2024
3e2aa3b
fix the path of the src to add to sys.path
rdhyee Oct 11, 2024
42d3f48
try to add fixes for google colab
rdhyee Oct 25, 2024
86e3cb9
you can't put isamples-client as a dependency of itself in pproject.toml
rdhyee Oct 25, 2024
3c2231e
if in colab, run
rdhyee Oct 25, 2024
d757fe2
ironing out more bugs in record_counts.ipynb
rdhyee Oct 25, 2024
aaa2a44
some code formatting and linting corrections
rdhyee Dec 4, 2024
166c6d2
put a docstring for src/isamples_client/__init__.py
rdhyee Dec 5, 2024
ff5af0b
some more formatting and cleanup of code
rdhyee Dec 11, 2024
1a53b73
change default search to get the facet counts
rdhyee Jan 20, 2025
a3ca359
fixing one issue at a time -- so I'm checking this in case there is s…
rdhyee Feb 14, 2025
0a12313
some preliminary work on Eric's parquet files
rdhyee Feb 18, 2025
2e8d56f
catchup: 2025.02.24
rdhyee Feb 24, 2025
6b94d3c
added mysql to dockerfile
rdhyee Feb 24, 2025
f456abe
starter sample code of geoparquet
rdhyee May 1, 2025
b20218a
capature the current state before asking for major refactor of geopar…
rdhyee Jun 13, 2025
0f567fa
working version of geoparquet -- rough cut of some interactivity
rdhyee Jun 14, 2025
689687d
put code to grab the data file from Zenodo if the file is not available.
rdhyee Jun 19, 2025
06f3dd0
incorporating some documentation of geoparquet exploration
rdhyee Jun 26, 2025
68fce27
first draft of isample-archive.ipynb to do a simple duckdb calculatio…
rdhyee Jul 14, 2025
c6d9228
show the efficiencies of using duckdb to compute on a remote geoparqu…
rdhyee Jul 14, 2025
482f45d
more complicated analyses of geoparquet
rdhyee Jul 14, 2025
83d6309
Add comprehensive CLAUDE.md for development guidance
rdhyee Aug 5, 2025
6cc0cbc
Add comprehensive repository documentation and organize examples
rdhyee Sep 5, 2025
2faebbe
Update .gitignore for Node.js and development environment
rdhyee Sep 5, 2025
ea385df
Add Node.js configuration for examples/basic directory
rdhyee Sep 5, 2025
84c1bfd
Add cross-repository alignment strategy and coordination
rdhyee Sep 5, 2025
ebf93ab
Add comprehensive OpenContext parquet analysis documentation
rdhyee Sep 23, 2025
200a5da
Fix parquet analysis queries and move .qmd to website repo
rdhyee Sep 23, 2025
d54885d
Add Ibis integration and fix OpenContext parquet analysis queries
rdhyee Sep 23, 2025
1b6de7d
Distinguish generic PQG framework from OpenContext-specific implement…
rdhyee Sep 24, 2025
d68dd9e
Update notebook formatting for better JSON structure
rdhyee Sep 24, 2025
d88be49
Final notebook updates for PQG framework distinction
rdhyee Sep 24, 2025
2dae302
Clarify iSamples model is domain-agnostic, not archaeology-specific
rdhyee Oct 1, 2025
4e5bbc3
Fix terminology: Use MaterialSampleRecord instead of Sample in Path c…
rdhyee Oct 3, 2025
1eb7603
Add comprehensive Path 1/Path 2 documentation and Eric's query analysis
rdhyee Oct 3, 2025
bacbd4a
Clarify Path 1 and Path 2 as complementary geographic granularities
rdhyee Oct 6, 2025
dbd2d83
Add mathematical proof that Path 1 and Path 2 are the ONLY paths
rdhyee Oct 6, 2025
3dc6ea8
Fix Lonboard 0.12+ API breaking change and performance issues
rdhyee Oct 8, 2025
e9415e9
Add jupytext workflow documentation and tooling
rdhyee Oct 15, 2025
f7662eb
Update notebooks and add jupysql dependencies
rdhyee Oct 16, 2025
73c83d1
Clean up repository: Add seaborn, ignore data files, clear notebook o…
rdhyee Nov 11, 2025
d5dc75d
Add PQG demonstration notebook and integration planning
rdhyee Nov 13, 2025
9e525fc
Add typed edge demo notebook and pqg dependency
rdhyee Nov 14, 2025
41c853c
Add narrow vs wide schema learning notebook
rdhyee Dec 4, 2025
39bc3be
Update parquet file paths to ~/Data/iSample/pqg_refining/
rdhyee Dec 10, 2025
f1f325c
Add frontend bundle v2 generator and schema comparison analysis
rdhyee Dec 11, 2025
f068c28
Make schema_comparison.ipynb portable for mybinder.org
rdhyee Dec 11, 2025
4efda7c
Add visualization and focus site exploration to schema_comparison not…
rdhyee Jan 13, 2026
ec80226
Update Section 13 to use pqg APIs with narrow format
rdhyee Jan 13, 2026
128557c
Add enhanced popup with full sample details in Lonboard visualization
rdhyee Jan 13, 2026
d06fd2c
Enhance PKAP and Poggio site visualizations with rich popup data
rdhyee Jan 13, 2026
dc558a6
Add Section 12: Eric's OpenContext vs Zenodo data quality analysis
rdhyee Jan 13, 2026
bb88d78
Add Section 14: Stephen's SESAR parquet comparison
rdhyee Jan 14, 2026
597c5c1
Update notebook with execution outputs for Stephen's SESAR comparison
rdhyee Jan 14, 2026
1bb87d1
Add source API clients for OpenContext, SESAR, GEOME, and Smithsonian
rdhyee Jan 15, 2026
44d9ab5
Fix identifier resolution and API query issues
rdhyee Jan 15, 2026
ac421e7
Add iSamples Interactive Explorer notebook
rdhyee Jan 15, 2026
adee63b
Add dynamic viewport loading to iSamples Explorer
rdhyee Jan 16, 2026
da8cb94
Add bidirectional selection sync between map and table
rdhyee Jan 16, 2026
6a96d47
Fix lonboard compatibility for older versions
rdhyee Jan 16, 2026
c860df9
Add Binder configuration with pinned lonboard version
rdhyee Jan 16, 2026
e518b2f
Fix Binder pandas/pyarrow compatibility
rdhyee Jan 16, 2026
4dc00a3
Pin minimum versions for lonboard, pandas, pyarrow in pyproject.toml
rdhyee Jan 16, 2026
fad889b
Fix viewport bounding box calculation for edge clipping
rdhyee Jan 16, 2026
4e7995a
Add fulltext search with weighted ranking
rdhyee Jan 16, 2026
4c7ed2e
Disable adaptive sampling when searching
rdhyee Jan 16, 2026
eec186a
Update schema_comparison.ipynb execution counts
rdhyee Jan 16, 2026
12 changes: 12 additions & 0 deletions .gitattributes
@@ -0,0 +1,12 @@
# Jupyter notebook handling with jupytext
#
# Strategy: Pair .ipynb with .py companions
# - .py files are git-diffable and Claude Code friendly
# - .ipynb files contain outputs for local use
# - Only .py files are meaningfully diffed in PRs

# Use jupytext for notebook diffs (if available)
*.ipynb diff=jupytext

# Filter for showing clean diffs
# (requires: git config diff.jupytext.command 'jupytext --to md --set-formats - -o -')
27 changes: 26 additions & 1 deletion .gitignore
@@ -99,7 +99,7 @@ ipython_config.py
# This is especially recommended for binary packages to ensure reproducibility, and is more
# commonly ignored for libraries.
# https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
-#poetry.lock
+poetry.lock

# pdm
# Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
@@ -158,3 +158,28 @@ cython_debug/
# and can be added to the global gitignore or merged into this file. For a more nuclear
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
#.idea/

# mac
.DS_Store

# Node.js dependencies
node_modules/
npm-debug.log*
yarn-debug.log*
yarn-error.log*

# Claude Code temporary files
.claude/

# Office temporary files
~$*.xlsx
~$*.docx
~$*.pptx


# DuckDB temporary storage
.tmp/
duckdb_temp_storage*.tmp

# Large data files (use remote parquet instead)
*.parquet
160 changes: 160 additions & 0 deletions CLAUDE.md
@@ -0,0 +1,160 @@
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Development Commands

### Python Environment Management
- **Poetry** is the primary dependency manager (`pyproject.toml` manages dependencies)
- Install dependencies: `poetry install`
- Install with examples dependencies: `poetry install --with examples`
- Activate virtual environment: `poetry shell`
- Run Python scripts: `poetry run python <script.py>`

### Testing
- Run Python tests: `poetry run pytest tests/`
- Run single test: `poetry run pytest tests/test_isbclient.py::test_field_names`
- Test files are in `tests/` directory

### Playwright Testing (Web Scraping)
- Playwright tests located in `playwright/tests/`
- Run Playwright tests: `cd playwright && npx playwright test`
- View test reports: `cd playwright && npx playwright show-report`
- Configuration: `playwright/playwright.config.js`

### Docker Development
- Build and run Jupyter environment: `./run_docker.sh [port]`
- Default port is 8890, custom port can be specified as first argument
- Dockerfile creates a Jupyter environment with all dependencies installed

## Current Status & Issues ⚠️

**IMPORTANT**: As of September 2025, the iSamples central API at `https://central.isample.xyz/isamples_central/` is offline. This affects all three client classes below. The repository is transitioning to **offline-first geoparquet workflows** - see examples in `examples/basic/geoparquet.ipynb` and `examples/basic/isample-archive.ipynb` for working patterns.

## Architecture Overview

### Core Python Client (`src/isamples_client/`)
The main Python package provides three client classes for interacting with the iSamples API:

1. **`IsbClient`** (`isbclient.py:232-339`): Basic HTTP client using httpx
- Direct API interaction with `/thing/select` endpoint
- Methods: `field_names()`, `record_count()`, `facets()`, `pivot()`

2. **`IsbClient2`** (`isbclient.py:341-586`): Enhanced Solr client using pysolr
- Extends IsbClient with more sophisticated search capabilities
- Supports complex filter queries (`_fq_from_kwargs()`)
- Default search parameters in `default_search_params()`
- Faceting and pivot table functionality

3. **`ISamplesBulkHandler`** (`isbclient.py:588-683`): Bulk data operations
- Handles authenticated exports of large datasets
- Methods: `create_download()`, `get_status()`, `download_file()`
- Loads bulk data into pandas DataFrames

### Key Configuration Constants
- `ISB_SERVER`: Default iSamples API endpoint
- `FL_DEFAULT`: Default field list for search results
- `FACET_FIELDS_DEFAULT`: Default faceting fields
- `MAJOR_FIELDS`: UI field mappings
- `ISAMPLES_SOURCES`: Available data sources (SESAR, OPENCONTEXT, GEOME, SMITHSONIAN)

### Examples Structure
- **`examples/basic/`**: Basic API usage examples and Jupyter notebooks
- **`examples/spatial/`**: Geospatial data analysis with geoparquet, DuckDB
- **`examples/opencontext/`**: OpenContext-specific examples
- **`javascript/`**: JavaScript/Node.js integration examples

### Jupyter Notebook Integration
Heavy emphasis on Jupyter notebook examples for data exploration:
- Interactive data analysis with pandas, xarray
- Geospatial analysis using geopandas, folium, cartopy
- **Lonboard WebGL visualization**: High-performance point cloud rendering
- **DuckDB integration**: Efficient remote parquet processing via HTTP range requests
- **API-independent workflows**: Examples that work without central API access

#### Notebook Editing & Version Control Tools
**For Claude Code and Git Workflows**:

1. **jupytext pairing** (recommended for active development):
- Pair `.ipynb` with `.py` companions: `~/bin/nb_pair.sh notebook.ipynb`
- Edit `.py` files to avoid token limits (no outputs in source)
- Auto-sync changes between `.ipynb` ↔ `.py`
- Commit `.py` files for clean git diffs
- See: `JUPYTEXT_WORKFLOW.md` for full guide

2. **nb_source_diff.py** (for quick diffs):
- Diff notebooks without output noise: `nb-diff notebook.ipynb HEAD`
- Use for one-off comparisons or unpaired notebooks
- Tool location: `~/bin/nb_source_diff.py`

**Quick Reference**: See `QUICKREF_NOTEBOOKS.md` for command cheatsheet

**Recommended Workflow**:
- When Claude Code hits token limits on `.ipynb` files → Edit the `.py` companion instead
- Pair new notebooks immediately: `~/bin/nb_pair.sh notebook.ipynb`
- **Commit BOTH files** to git (`.ipynb` for outputs, `.py` for clean diffs)
- Review `.py` diffs for code changes, `.ipynb` for output changes
- Sync after Claude edits: `~/bin/nb_pair.sh --sync notebook.ipynb`

### Dependencies Architecture
- **Core dependencies**: httpx, requests, pandas, xarray, pysolr
- **Spatial analysis**: geopandas, duckdb, polars, ibis-framework, shapely
- **Visualization**: matplotlib, folium, cartopy, ipyleaflet, lonboard
- **Jupyter ecosystem**: ipywidgets, ipydatagrid, sidecar

## Development Patterns

### Search Parameter Building
The codebase uses a sophisticated parameter building system:
- `_fq_from_kwargs()` builds Solr filter queries from keyword arguments
- Uses `multidict.MultiDict` for handling multiple values for same parameter
- Supports date range queries, source filtering, and complex boolean logic
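
The parameter building described above can be sketched roughly as follows. This is not the repository's `_fq_from_kwargs()`; the helper name `fq_from_kwargs`, the OR-joining of multiple values, and the `label` field are assumptions made for illustration. A plain list of `(key, value)` tuples stands in for `multidict.MultiDict` so the sketch needs only the standard library.

```python
from urllib.parse import urlencode


def fq_from_kwargs(**filters):
    """Hypothetical helper: turn keyword arguments into repeated Solr ``fq`` params.

    Returns a list of ("fq", clause) tuples, so one key can repeat,
    which is the property a multi-valued dict provides.
    """
    params = []
    for field, value in filters.items():
        if isinstance(value, str):
            clause = f'{field}:"{value}"'
        else:
            # Assumption: several values for one field are OR-ed in one clause.
            joined = " OR ".join(f'"{v}"' for v in value)
            clause = f"{field}:({joined})"
        params.append(("fq", clause))
    return params


# Field names here are illustrative, not taken from the iSamples schema.
params = [("q", "*:*")] + fq_from_kwargs(source=["SESAR", "GEOME"], label="basalt")
query_string = urlencode(params)  # repeated "fq" keys are preserved in order
print(query_string)
```

Keeping each filter as its own `fq` entry, rather than folding everything into `q`, matches how Solr expects filter queries to be passed.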

### Error Handling and Logging
- Uses Python `logging` module (configured at INFO level)
- Request URLs are logged for debugging
- HTTP status codes checked with appropriate error raising

### Monkey Patching for Large Queries
- `monkey_patch_select()` modifies pysolr to handle large queries via POST
- `SWITCH_TO_POST` threshold (10000 bytes) determines GET vs POST usage
- Critical for handling complex search queries that exceed URL limits
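
A minimal sketch of the GET-versus-POST decision, assuming the threshold is compared against the byte length of the full URL (the source states only the 10000-byte figure, not what is measured). The function name `choose_method` and the `example.org` URL are placeholders; the actual `monkey_patch_select()` patches pysolr and is not reproduced here.

```python
from urllib.parse import urlencode

SWITCH_TO_POST = 10000  # bytes, the threshold quoted above


def choose_method(base_url, params, threshold=SWITCH_TO_POST):
    """Hypothetical helper: pick GET or POST from the encoded request size.

    Returns (method, url, body). For POST the query moves into the body.
    """
    query = urlencode(params, doseq=True)
    url = f"{base_url}?{query}"
    # Assumption: the threshold applies to the whole URL, measured in bytes.
    if len(url.encode("utf-8")) > threshold:
        return "POST", base_url, query
    return "GET", url, None


small = choose_method("https://example.org/thing/select", {"q": "*:*"})
large = choose_method("https://example.org/thing/select", {"q": "x" * 20000})
print(small[0], large[0])
```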

## Known Issues & Troubleshooting

### API Connectivity Issues
- **Central API offline**: If you see connection errors to `https://central.isample.xyz/isamples_central/`, the API is currently offline
- **Workaround**: Use the geoparquet examples in `examples/basic/geoparquet.ipynb` and `examples/basic/isample-archive.ipynb` which work without API access
- **Alternative data sources**: The examples demonstrate accessing iSamples data via Zenodo archives and remote parquet files

### Lonboard Visualization Issues

**⚠️ CRITICAL: Lonboard 0.12+ API Breaking Change**

Lonboard 0.12+ changed how map initialization works. The old `zoom` and `center` parameters cause `TypeError`.

**OLD (BROKEN)**:
```python
viz(result, map_kwargs={"zoom": 1, "center": {"lat": 0, "lon": 0}})
```

**NEW (CORRECT for 0.12+)**:
```python
viz(result, map_kwargs={"view_state": {"zoom": 1, "latitude": 0, "longitude": 0}})
```

**Key changes**:
- `zoom` moves inside `view_state`
- `center: {lat, lon}` is replaced by flat `latitude` and `longitude` keys, also inside `view_state`
- Dynamic updates: `m.set_view_state(longitude=..., latitude=..., zoom=...)`
- Animation: `m.fly_to(...)`
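
For notebooks that still carry old-style arguments, the key changes above amount to a small dictionary rewrite. The helper below is a hypothetical convenience, not part of Lonboard; it only reshapes the `map_kwargs` dict in the way the OLD and NEW examples show.

```python
def upgrade_map_kwargs(map_kwargs):
    """Hypothetical helper: rewrite pre-0.12 style map kwargs into ``view_state`` form."""
    # Copy everything except the two keys that changed.
    upgraded = {k: v for k, v in map_kwargs.items() if k not in ("zoom", "center")}
    view_state = dict(upgraded.get("view_state", {}))
    if "zoom" in map_kwargs:
        view_state["zoom"] = map_kwargs["zoom"]
    center = map_kwargs.get("center")
    if center is not None:
        # center: {lat, lon} becomes flat latitude / longitude keys.
        view_state["latitude"] = center["lat"]
        view_state["longitude"] = center["lon"]
    if view_state:
        upgraded["view_state"] = view_state
    return upgraded


old_kwargs = {"zoom": 1, "center": {"lat": 0, "lon": 0}}
new_kwargs = upgrade_map_kwargs(old_kwargs)
print(new_kwargs)
```

The result can then be passed as `viz(result, map_kwargs=new_kwargs)`. Kwargs already in the new form pass through unchanged.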

**Other considerations**:
- **Memory usage**: Always use `LIMIT` clauses when visualizing parquet data (e.g., `LIMIT 100000`)
- **Performance**: For 6M+ row datasets, querying without LIMIT can cause 5+ minute hangs
- **CRS warnings**: "No CRS exists on data" warnings are expected and can be ignored if lon/lat are WGS84
- **Deprecated**: The `con` parameter to `viz()` is deprecated in newer versions
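
One way to make the `LIMIT` advice hard to forget is to wrap every visualization query before running it. The sketch below is a hypothetical string helper, not a function from this repository; the parquet file name and column names in the usage line are placeholders.

```python
def with_limit(sql, limit=100_000):
    """Hypothetical helper: wrap a query so it returns at most ``limit`` rows."""
    if limit <= 0:
        raise ValueError("limit must be a positive integer")
    # Drop surrounding whitespace and any trailing semicolon before nesting.
    inner = sql.strip().rstrip(";").strip()
    return f"SELECT * FROM ({inner}) AS capped LIMIT {int(limit)}"


capped_sql = with_limit(
    "SELECT longitude, latitude FROM read_parquet('samples.parquet');"
)
print(capped_sql)
```

The wrapped string can be handed to DuckDB in place of the original query. Wrapping rather than appending `LIMIT` keeps the helper safe for queries that already end in `ORDER BY` or contain their own inner `LIMIT`.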

### Environment Setup
- **Node.js conflicts**: Multiple `package.json` files exist; use `poetry install --with examples` for Python dependencies
- **Jupyter extensions**: Some notebooks require ipywidgets and sidecar extensions for full functionality