[WIP] merging the current exploratory work into the main exploratory repo #1

rdhyee · 2024-09-19T01:46:00Z

Work in progress.

- Updated CLAUDE.md with prominent warning about view_state syntax change
- Fixed record_counts.ipynb cell 80:
  * Changed map_kwargs from old zoom/center to new view_state format
  * Added LIMIT 100000 to prevent loading 6M+ rows (was causing 5+ min hangs)
- Added geoparquet0.ipynb as working reference implementation

Lonboard 0.12+ requires:
  map_kwargs={"view_state": {"zoom": 1, "latitude": 0, "longitude": 0}}
Instead of:
  map_kwargs={"zoom": 1, "center": {"lat": 0, "lon": 0}}

Performance fix prevents timeout issues when visualizing large parquet datasets.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
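The kwargs change quoted in this commit message can be sketched as a small conversion helper. This is only an illustration: `to_view_state` is a hypothetical name, it is not part of lonboard, and it only reshapes a dictionary from the old zoom/center form into the view_state form shown above.

```python
def to_view_state(map_kwargs: dict) -> dict:
    """Reshape old-style zoom/center kwargs into the view_state form.

    Hypothetical helper for illustration; it does not import or call lonboard.
    """
    if "view_state" in map_kwargs:
        return dict(map_kwargs)  # already in the newer form, return a copy

    # Carry over any unrelated keys untouched.
    converted = {k: v for k, v in map_kwargs.items() if k not in ("zoom", "center")}

    center = map_kwargs.get("center", {})
    view_state = {}
    if "zoom" in map_kwargs:
        view_state["zoom"] = map_kwargs["zoom"]
    if "lat" in center:
        view_state["latitude"] = center["lat"]
    if "lon" in center:
        view_state["longitude"] = center["lon"]

    converted["view_state"] = view_state
    return converted


old_kwargs = {"zoom": 1, "center": {"lat": 0, "lon": 0}}
new_kwargs = to_view_state(old_kwargs)
```

A helper like this would let older notebook cells keep their existing dictionaries while passing the converted form on to the map call.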

Created comprehensive jupytext pairing setup for better notebook version control:

1. JUPYTEXT_WORKFLOW.md - Full guide with workflows, troubleshooting, examples
2. QUICKREF_NOTEBOOKS.md - Quick reference card and command cheatsheet
3. .gitattributes - Git configuration for notebook handling
4. Updated CLAUDE.md - Added notebook workflow guidance for future sessions

Key benefits:
- Pair .ipynb with .py companions for clean git diffs
- Edit .py files in Claude Code to avoid token limits on large notebooks
- Commit both files: .ipynb for outputs, .py for clean code diffs
- Auto-sync changes between paired files

Helper script location: ~/bin/nb_pair.sh

Related tools:
- nb_source_diff.py for one-off diffs without outputs
- jupytext pairing for permanent workflow

🤖 Generated with Claude Code (https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

Changes:
- Updated examples/basic/geoparquet0.ipynb with execution outputs
- Updated examples/basic/oc_parquet_analysis.ipynb
- Updated examples/basic/oc_parquet_analysis_enhanced.ipynb with latest analysis
- Added jupysql, duckdb-engine, toml to dependencies

New dependencies support SQL magic commands in notebooks for better DuckDB integration and interactive queries.

🤖 Generated with Claude Code (https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

…utputs

Changes:
- Add seaborn to pyproject.toml dependencies
- Update .gitignore: DuckDB temp files and parquet data files
- Clear output cells from oc_parquet_analysis_enhanced.ipynb (best practice for VCS)

Cleanup: Removed 51GB of DuckDB temporary storage files from .tmp/

- pqg_demo.ipynb: Working notebook demonstrating PQG library with OpenContext data
  * Shows single node retrieval, relationship expansion (max_depth)
  * Compares PQG vs SQL for various graph operations
  * Includes decision matrix for when to use each approach
  * All examples tested and working with 11.6M record parquet file
- PQG_INTEGRATION_PLAN.md: Strategic plan for integrating PQG into analysis workflows
  * 4-phase implementation strategy
  * Decision matrix for PQG vs SQL tradeoffs
  * Analysis of 100-cell notebook patterns (29 CTEs, 11 functions)
  * Hybrid approach: PQG for clarity, SQL for performance
- SESSION_SUMMARY.md: Complete session documentation from Nov 11 work
  * PR #4 merged (schema migration to INTEGER row_ids)
  * PR #5 updated (documentation + Copilot fixes)
  * Repository cleanup (51GB freed)
  * Ready for next phase: pushing PQG to its limits

Next: Use pqg_demo.ipynb as foundation to explore PQG capabilities and identify enhancement opportunities for contributing back to pqg library.

🤖 Generated with Claude Code (https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

Changes:
- Completely rewrote pqg_demo.ipynb to showcase new typed edge functionality
- Demonstrates all 14 iSamples edge types from PQG PR #6
- Added pqg dependency to pyproject.toml (using test branch)
- Updated click dependency to >=8.1.3 for pqg compatibility
- Notebook uses local oc_isamples_pqg.parquet file (691MB, gitignored)

New notebook features:
- Edge type discovery and statistics
- Type-safe edge queries (MSR_PRODUCED_BY, MSR_KEYWORDS, etc.)
- Multi-hop graph traversal with typed edges
- Edge validation against iSamples schema
- Performance comparison: typed edges vs raw SQL
- Material type and keyword analysis examples

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

Comprehensive Jupyter notebook explaining PQG schema formats:
- Part 1-6: Narrow vs Wide schema comparison with examples
- Part 7: Data snapshot analysis (June vs December 2025)
  - Entity count changes over time
  - Sample overlap analysis (99.9995% stable)
  - New samples categorization (lithics, botanical specimens)
  - Vocabulary enrichment (151 new concepts)
  - Keyword enrichment examples

Includes performance benchmarks showing 2-3x query speedup with wide schema and 60% file size reduction.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

Moved Eric's OpenContext PQG parquet files to a central location for easier tracking across notebooks and sessions.

Files moved:
- oc_isamples_pqg.parquet (691MB, narrow format)
- oc_isamples_pqg_wide.parquet (275MB, wide format)

Updated paths in:
- SESSION_SUMMARY.md
- geoparquet.ipynb
- oc_parquet_analysis.ipynb
- oc_parquet_analysis_enhanced.ipynb
- pgp.ipynb
- pqg_demo.ipynb
- narrow_vs_wide_schema.ipynb

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- scripts/generate_frontend_bundle.py: v2.0 bundle generator with:
  - H3 resolutions 4-7 for continental to local zoom
  - Source-partitioned samples (SESAR, OPENCONTEXT, GEOME, SMITHSONIAN)
  - SHA256 integrity hashes in manifest
  - Search-optimized agent index (lowercased, deduped)
  - summary.parquet for instant first paint (<5MB)
  - Enhanced manifest with load_order and schema info
- tests/test_frontend_bundle.py: 10 validation tests for bundle integrity
- examples/basic/schema_comparison.ipynb: Benchmark Export vs Narrow vs Wide formats with query patterns for map, facets, agents, reverse lookup
- isamples_schema_review.md: Architecture analysis and recommendations

Bundle v2 output: ~/Data/iSample/frontend_bundle_v2/ (628 MB total)
Critical path to first paint: ~2.5 MB

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Add smart path resolution: checks local files first, then cache, then downloads
- Auto-detect environment: /tmp/pqgfiles for mybinder, ~/Data/iSample/pqg_cache for others
- Support ISAMPLES_CACHE_DIR env var override
- Add USE_REMOTE option to query remote parquet via HTTP (no download)
- Add DOWNLOAD_MISSING option to control download behavior
- Update cells to use path_available() helper for URL/Path compatibility
- Add portability documentation to notebook header

Works out-of-the-box on:
- Raymond's laptop (uses existing local files)
- mybinder.org (downloads to /tmp/pqgfiles)
- Other users (downloads to ~/Data/iSample/pqg_cache)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
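The lookup order described in this commit (local files, then cache, then download) might look roughly like the sketch below. The function names, the `on_binder` flag, and the `download` callback are assumptions made for illustration; how the notebook actually detects mybinder is not stated in the commit message.

```python
import os
from pathlib import Path
from typing import Callable, Iterable, Mapping, Optional


def default_cache_dir(env: Optional[Mapping[str, str]] = None, on_binder: bool = False) -> Path:
    """Pick a cache directory: env override, then the mybinder path, then the home path."""
    env = os.environ if env is None else env
    override = env.get("ISAMPLES_CACHE_DIR")
    if override:
        return Path(override)
    if on_binder:  # assumption: the caller decides whether this is mybinder
        return Path("/tmp/pqgfiles")
    return Path.home() / "Data" / "iSample" / "pqg_cache"


def resolve_parquet(
    filename: str,
    local_dirs: Iterable[Path],
    cache_dir: Path,
    download: Optional[Callable[[str, Path], None]] = None,
    download_missing: bool = True,
) -> Optional[Path]:
    """Return the first usable path for filename, or None if nothing is available."""
    # 1. Existing local files win.
    for directory in local_dirs:
        candidate = Path(directory) / filename
        if candidate.exists():
            return candidate
    # 2. Then a previously cached copy.
    cached = Path(cache_dir) / filename
    if cached.exists():
        return cached
    # 3. Finally download into the cache, if allowed.
    if download_missing and download is not None:
        cached.parent.mkdir(parents=True, exist_ok=True)
        download(filename, cached)  # hypothetical callback that writes the file
        return cached
    return None
```

Passing the downloader in as a callback keeps the resolution order testable without network access.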

…ebook

- Add Section 9: Lonboard WebGL visualization with color-coded source collections
- Add Section 10: Cesium browser visualization reference
- Add Section 11: Focus site exploration for PKAP (Cyprus) and Poggio Civitate (Tuscany)
- Add material analysis using official iSamples vocabulary (material_hierarchy.json)
- Archive 9 obsolete/duplicate notebooks to examples/basic/archive/ and examples/spatial/archive/

The focus sites provide concrete, relatable subsets for demonstrating queries:
- PKAP: 34.987°N, 33.708°E (archaeological project in Cyprus)
- Poggio Civitate: 43.15°N, 11.40°E (Etruscan site in Tuscany)

Related: isamplesorg/pqg#10 (vocabulary labels in Wide/Narrow formats)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Replace raw SQL examples with proper pqg library usage
- Use PQG class and TypedEdgeQueries for graph traversal
- Demonstrate edge type inference and discovery APIs
- Document narrow vs wide format differences
- Reference GitHub issues #11 (unified API) and #12 (OC project data)

The pqg library provides full query support for narrow format:
- PQG(connection, source_path) for loading graphs
- TypedEdgeQueries.get_edges_by_type() for typed queries
- get_edge_types_by_subject/object() for discovery
- infer_edge_type() for SPO -> edge type mapping

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Section 9.2 now extracts nested fields for richer click popups:
- materials: Material categories (anthropogenicmetal, rock, etc.)
- context: Sampled feature context
- object_type: Sample object type
- site_name: Sampling site name
- keywords: Associated keywords
- description: Truncated sample description
- curation, registrant: Additional metadata

Uses DuckDB list_transform() and array_to_string() to flatten nested STRUCT arrays into readable comma-separated strings.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Site-specific maps now query enhanced fields (materials, context, object_type, site_name, keywords) so clicking a point shows full sample details instead of just basic properties.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Compares Eric's OC parquet files with Zenodo files to document:
- Entity count differences (Eric's files have 66K more samples)
- Field population gaps (project 100% vs 0%, coords 0% vs 99.5%)
- IdentifiedConcept/Agent entities present in Eric's files but missing in Zenodo's
- Coordinate storage strategy differences (shared vs 1:1)

See pqg issue #13 for full analysis and merge recommendations.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Prep for meeting with Stephen Richard to discuss SESAR parquet file (s3://sesar-parquet/sesar_samples.parquet) vs Zenodo wide format.

Comparison cells analyze:
- Schema alignment (both have 49 columns - matches well)
- Entity type distribution differences
- IdentifiedConcept duplication issue (~77K rows, ~51K unique PIDs)
- SampleRelation count difference (3.8M vs 501K)
- Agent/MaterialSampleCuration extraction gaps

Key discussion points documented for meeting.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Includes outputs from running Section 14 comparison cells showing:
- Schema alignment verification
- Entity type distribution comparison
- IdentifiedConcept duplication analysis results
- SampleRelation content analysis

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

This adds Python client libraries for direct access to the 4 iSamples data sources, enabling data verification and enrichment workflows.

New files:
- src/isamples_client/sources/ - API client package
  - base.py: BaseSourceClient abstract class, SampleRecord dataclass
  - opencontext.py: OpenContextClient for archaeological data
  - sesar.py: SESARClient for geological samples (IGSN)
  - geome.py: GEOMEClient for genomic/biological samples
  - smithsonian.py: SmithsonianClient for museum collections
- tests/test_sources/test_clients.py - 16 unit tests (all passing)
- examples/basic/source_correlation.py - Cross-source correlation notebook

Features:
- Unified SampleRecord dataclass for cross-source comparison
- Common interface: search(), get_sample(), get_samples_by_location()
- Context manager support for proper resource cleanup
- Iterator pattern for memory-efficient large result sets

Authentication:
- OpenContext, GEOME, SESAR: No auth needed for read operations
- Smithsonian: Requires free API key from api.data.gov

Dependencies:
- Added dateparser for flexible date parsing

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
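The shape of the common interface listed under Features could be sketched as below. Only the names `BaseSourceClient`, `SampleRecord`, `search()`, `get_sample()`, and `get_samples_by_location()` come from the commit message; the dataclass fields, the method signatures (a bounding box is assumed for the location query), and the `InMemoryClient` stand-in are illustrative guesses, not the package's actual code.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Iterator, List, Optional


@dataclass
class SampleRecord:
    # Field names are assumptions for illustration, not the package's schema.
    identifier: str
    source: str
    label: str = ""
    latitude: Optional[float] = None
    longitude: Optional[float] = None


class BaseSourceClient(ABC):
    """Common interface plus context-manager cleanup."""

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()
        return False  # never swallow exceptions

    def close(self) -> None:
        """Release sessions or connections; a no-op by default."""

    @abstractmethod
    def search(self, query: str) -> Iterator[SampleRecord]: ...

    @abstractmethod
    def get_sample(self, identifier: str) -> Optional[SampleRecord]: ...

    @abstractmethod
    def get_samples_by_location(
        self, min_lat: float, max_lat: float, min_lon: float, max_lon: float
    ) -> Iterator[SampleRecord]: ...


class InMemoryClient(BaseSourceClient):
    """Hypothetical stand-in backed by a list instead of an HTTP API."""

    def __init__(self, records: List[SampleRecord]):
        self._records = records
        self.closed = False

    def close(self) -> None:
        self.closed = True

    def search(self, query):
        needle = query.lower()
        for record in self._records:  # generator: yields matches lazily
            if needle in record.label.lower():
                yield record

    def get_sample(self, identifier):
        return next((r for r in self._records if r.identifier == identifier), None)

    def get_samples_by_location(self, min_lat, max_lat, min_lon, max_lon):
        for r in self._records:
            if r.latitude is None or r.longitude is None:
                continue
            if min_lat <= r.latitude <= max_lat and min_lon <= r.longitude <= max_lon:
                yield r
```

Because `search()` and `get_samples_by_location()` are generators, a caller can stop iterating early without materializing a full result set, which is the point of the iterator pattern mentioned above.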

GEOME:
- Fix geo query to use Lucene range syntax [min TO max]
- Query Event entity for coordinates (not Sample)
- Update _parse_record to handle Event entity fields

OpenContext:
- Add ARK identifier resolution via n2t.net
- Handle dc-terms:isPartOf as list (not dict)
- Support multiple identifier formats (ARK, UUID, URL)

SESAR:
- Add new app.geosamples.org JSON endpoint with DOI format
- Handle inconsistent API responses (sometimes HTML)
- Add _parse_app_json_record for new JSON format

Example notebook:
- Use local parquet files when available
- Avoid Cloudflare R2 rate limiting

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
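As an illustration of the Lucene range syntax mentioned for GEOME, a bounding-box filter can be assembled from two inclusive `[min TO max]` ranges. The helper names and the default field names `decimalLatitude` and `decimalLongitude` are placeholders assumed here; the fields the client actually queries on the Event entity are not given in the commit message.

```python
def lucene_range(field: str, low: float, high: float) -> str:
    """Build an inclusive Lucene range clause, e.g. field:[1 TO 2]."""
    if low > high:
        raise ValueError("low must not exceed high")
    return f"{field}:[{low} TO {high}]"


def bbox_query(
    min_lat: float,
    max_lat: float,
    min_lon: float,
    max_lon: float,
    lat_field: str = "decimalLatitude",   # placeholder field name
    lon_field: str = "decimalLongitude",  # placeholder field name
) -> str:
    """Combine latitude and longitude ranges into one query string."""
    return (
        f"{lucene_range(lat_field, min_lat, max_lat)} AND "
        f"{lucene_range(lon_field, min_lon, max_lon)}"
    )


query = bbox_query(34, 36, 33, 34)
```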

Phase 1 of the UX plan: cohesive Jupyter demo with:
- lonboard WebGL map (handles 500K+ points)
- ipydatagrid interactive table
- Sample card on row selection
- Source dropdown filter
- Adjustable sample count slider (up to 500K per source)
- Balanced sampling across all 4 data sources

Uses DuckDB queries on wide parquet with direct lat/lon columns (no graph traversal needed for basic display).

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

New features for responsive map exploration:
- Viewport Mode toggle: auto-reload data on pan/zoom
- View state observer with 500ms debounce
- Loading spinner indicator during data fetch
- Adaptive sampling based on zoom level:
  - World (zoom<2): 10K/source
  - Continent (2-5): 25K/source
  - Country (5-8): 50K/source
  - Region (8-12): 100K/source
  - Local (>12): user slider value
- Bounding box queries filter parquet to visible extent

Uses lonboard's traitlets observe() pattern for view_state changes.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
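The zoom tiers listed in this commit can be written as a small lookup function. This is a sketch under one stated assumption: the commit message does not say which tier owns a boundary value such as zoom 5 or 8, so the lower bound of each tier is treated as inclusive here, and zoom 12 stays in the Region tier because Local is given as >12. `samples_per_source` is a hypothetical name.

```python
def samples_per_source(zoom: float, slider_value: int) -> int:
    """Map a zoom level to a per-source sample budget.

    Tier edges follow the commit message; boundary ownership is an assumption.
    """
    if zoom < 2:        # World
        return 10_000
    if zoom < 5:        # Continent
        return 25_000
    if zoom < 8:        # Country
        return 50_000
    if zoom <= 12:      # Region
        return 100_000
    return slider_value  # Local: defer to the user's slider


budgets = [samples_per_source(z, slider_value=500_000) for z in (1, 3, 6, 10, 14)]
```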

- Click map dot → highlights corresponding table row + shows card
- Click table row → recenters map on that point (preserves zoom)
- Uses lonboard layer.selected_index observer for map clicks
- Uses ipydatagrid selections observer for table clicks
- syncing_selection flag prevents infinite callback loops
- Refactored to unified select_sample() function

Note: Table auto-scroll to selected row not implemented due to ipydatagrid limitation (no scrollToRow API exposed to Python).

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
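The role of the syncing_selection flag can be shown with a toy model that uses no widget library. In this sketch, programmatic updates to either side re-fire that side's observer, which is the situation that would otherwise loop forever. The class and its attributes are hypothetical; only the names `syncing_selection` and `select_sample` come from the commit message.

```python
class SelectionSync:
    """Toy stand-in for a linked map/table pair; not lonboard or ipydatagrid code."""

    def __init__(self):
        self.syncing_selection = False
        self.map_index = None
        self.table_index = None
        self.handled = []  # selections that were actually processed

    # Observers: in a notebook these would be triggered by the widgets.
    def on_map_change(self, index):
        self.select_sample(index, origin="map")

    def on_table_change(self, index):
        self.select_sample(index, origin="table")

    def _set_map(self, index):
        self.map_index = index
        self.on_map_change(index)  # a programmatic write re-fires the observer

    def _set_table(self, index):
        self.table_index = index
        self.on_table_change(index)

    def select_sample(self, index, origin):
        if self.syncing_selection:
            return  # this event was caused by our own write; ignore it
        self.syncing_selection = True
        try:
            self.handled.append((origin, index))
            if origin == "map":
                self.map_index = index
                self._set_table(index)
            else:
                self.table_index = index
                self._set_map(index)
        finally:
            self.syncing_selection = False  # always release the guard


sync = SelectionSync()
sync.on_map_change(7)
```

Resetting the flag in a `finally` block keeps the guard from sticking if an update raises partway through.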

Remove _height parameter from Map() constructor - not supported in older lonboard versions (e.g., on Binder).

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Creates binder/requirements.txt to ensure Binder uses lonboard>=0.10.0 which supports newer features like _height parameter.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Add pyarrow>=12.0.0 and pandas>=2.0.0 to fix ArrowDtype error in lonboard's auto_downcast function.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Align with binder/requirements.txt for consistency:
- lonboard >= 0.10.0 (supports _height parameter)
- pandas >= 2.0.0 (proper pyarrow backend support)
- pyarrow >= 12.0.0 (ArrowDtype compatibility)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Increase buffer_factor from 1.2 to 1.5 for more margin
- Add aspect_ratio parameter (1.5) to account for wide maps
- Add Mercator stretch correction (1/cos(lat)) for higher latitudes

At ~40°N latitude, the longitude buffer is now ~2.9x larger, preventing points from being clipped on the left/right margins.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
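The ~2.9x figure is consistent with multiplying the three factors together: 1.5 × 1.5 / cos(40°) ≈ 2.94. The sketch below assumes that the factors combine as a simple product, which the commit message does not state explicitly; the function name and the latitude clamp are also assumptions added for illustration.

```python
import math


def longitude_buffer_multiplier(
    lat_deg: float,
    buffer_factor: float = 1.5,
    aspect_ratio: float = 1.5,
    max_lat: float = 85.0,  # assumed clamp so the 1/cos term stays finite near the poles
) -> float:
    """Combined longitude padding: buffer * aspect ratio * Mercator stretch."""
    lat = max(-max_lat, min(max_lat, lat_deg))
    mercator_stretch = 1.0 / math.cos(math.radians(lat))
    return buffer_factor * aspect_ratio * mercator_stretch


at_equator = longitude_buffer_multiplier(0.0)
at_40_north = longitude_buffer_multiplier(40.0)
```

At the equator the stretch term is 1, so the multiplier reduces to the product of buffer_factor and aspect_ratio.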

- Search box filters by label, description, and place_name fields
- Weighted scoring: label (10 pts) > description (5 pts) > place (3 pts)
- Results sorted by score, displayed in new "score" column
- Search works with viewport mode (searches within current view)
- Clear button to reset search and reload all samples
- Reorganized controls into two rows for better layout

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
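A minimal sketch of the weighted scoring described above, assuming a case-insensitive substring match and a score equal to the sum of the weights of the matching fields. The matching rule and the function names are assumptions; the commit message only gives the field weights and says results are sorted by score.

```python
from typing import Dict, List

# Weights taken from the commit message; "place" is read as the place_name field.
WEIGHTS: Dict[str, int] = {"label": 10, "description": 5, "place_name": 3}


def score_record(record: dict, term: str) -> int:
    """Sum the weights of every field that contains the search term."""
    needle = term.strip().lower()
    if not needle:
        return 0
    return sum(
        weight
        for field, weight in WEIGHTS.items()
        if needle in str(record.get(field) or "").lower()
    )


def rank_records(records: List[dict], term: str) -> List[dict]:
    """Keep matching records, add a score column, sort best first."""
    scored = [dict(r, score=score_record(r, term)) for r in records]
    hits = [r for r in scored if r["score"] > 0]
    return sorted(hits, key=lambda r: r["score"], reverse=True)


samples = [
    {"label": "Pottery sherd", "description": "found near basalt outcrop", "place_name": "Cyprus"},
    {"label": "Basalt core", "description": "drilled sample", "place_name": "Hawaii"},
    {"label": "Shell", "description": "", "place_name": None},
]
ranked = rank_records(samples, "basalt")
```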

When a search term is active, use the slider value directly instead of zoom-based adaptive sampling. This ensures consistent search results regardless of zoom level - zooming out no longer reduces result count.

Adaptive sampling still applies when browsing without a search term.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

rdhyee added 30 commits November 9, 2023 13:24

first pass at scraping the fields in the UI

ac679de

latest progress in working with iSamples API

b812b80

current logic of how to adapt pysolr to query /thing/select

26347b0

debugging pysolr get vs post query to /thing/select

f6bfc82

current state of my iSamples work on 2024.01.24 before I make major cleanup.

4870018

simple use Jupyter widgets

e2b6fc3

a pass at getting this to run as a Docker container and also on mybinder

c8a390b

add a first draft of a Python client for bulk data handling in the iSamples API

1ab522c

add ipytree to requirements.in

4171f00

a little start to cleaning up isbclient.py

6561ffe

add a record_count method to IsbClient2

cae3890

adapted the facets function for IsbClient2

717044a

adapted pivot for IsbClient2

6260351

Using Claude Sonnet 3.5 + back and forth from RY to produce a tutorial notebook on geoparquet and duckdb

1863eec

add new dependencies

4288730

new version of tutorial with using polars and reading into pandas -- the calculation of the distances of the cities are screwy though.

694f456

reaching the limits of the Claude-assisted tutorial generation for geoparquet_duckdb_tutorial.md

531ed59

first version of trying to analyze the geoparquet files coming out of the isample export

9759d2b

update the version of minimal-notebook and removing jupytext in requirements.in

533a867

add code to install requirements if in google colab

01ea3f5

catchup: 2024.08.29

9779564

refactoring to make package pip installable

0d08e4b

setting up basic package structure for isamples_client

2563fb4

add git+https://github.com/rdhyee/isamples-python.git@exploratory#egg=isamples_client to requirements.in

f2bfc12

ooops forgot to add pyproject.toml to the repo

0ea6998

installing poetry not working yet -- but a stepping stone towards it working in Dockerfile

f069370

runs until the end...but permission problem

2b33382

clean up Dockerfile

8f12090

changes to try to use poetry to install dependencies

436bf23

seems like we can use poetry now to install dependencies in google colab

68b90dc

rdhyee and others added 30 commits October 8, 2025 15:01

Fix lonboard compatibility for older versions

6a96d47

Add Binder configuration with pinned lonboard version

c860df9

Fix Binder pandas/pyarrow compatibility

e518b2f

Update schema_comparison.ipynb execution counts

eec186a
