[WIP] merging the current exploratory work into the main exploratory repo #1
Draft: rdhyee wants to merge 100 commits into isamplesorg:main from rdhyee:exploratory
…l notebook on geoparquet and duckdb
…the calculation of the distances of the cities are screwy though.
…oparquet_duckdb_tutorial.md
… the isample export
…working in Dockerfile
- Updated CLAUDE.md with prominent warning about view_state syntax change
- Fixed record_counts.ipynb cell 80:
* Changed map_kwargs from old zoom/center to new view_state format
* Added LIMIT 100000 to prevent loading 6M+ rows (was causing 5+ min hangs)
- Added geoparquet0.ipynb as working reference implementation
Lonboard 0.12+ requires:
map_kwargs={"view_state": {"zoom": 1, "latitude": 0, "longitude": 0}}
Instead of:
map_kwargs={"zoom": 1, "center": {"lat": 0, "lon": 0}}
Performance fix prevents timeout issues when visualizing large parquet datasets.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
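The `map_kwargs` change described in this commit can be illustrated with a small conversion helper. This is a minimal sketch in plain Python: `to_view_state_kwargs` is a hypothetical name, not part of lonboard, and it only reshapes the dictionary in the way the commit message shows.

```python
def to_view_state_kwargs(map_kwargs: dict) -> dict:
    """Reshape old-style zoom/center kwargs into the view_state form.

    Hypothetical helper for illustration; it does not call lonboard.
    """
    if "view_state" in map_kwargs:
        return dict(map_kwargs)  # already in the new form

    # Keep unrelated keys, drop the two old-style keys.
    converted = {k: v for k, v in map_kwargs.items() if k not in ("zoom", "center")}

    center = map_kwargs.get("center", {})
    view_state = {}
    if "zoom" in map_kwargs:
        view_state["zoom"] = map_kwargs["zoom"]
    if "lat" in center:
        view_state["latitude"] = center["lat"]
    if "lon" in center:
        view_state["longitude"] = center["lon"]

    converted["view_state"] = view_state
    return converted


old_kwargs = {"zoom": 1, "center": {"lat": 0, "lon": 0}}
new_kwargs = to_view_state_kwargs(old_kwargs)
print(new_kwargs)
```

The result can then be passed as `map_kwargs=new_kwargs` wherever the notebook previously passed the old dictionary.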
Created comprehensive jupytext pairing setup for better notebook version control:
1. JUPYTEXT_WORKFLOW.md - Full guide with workflows, troubleshooting, examples
2. QUICKREF_NOTEBOOKS.md - Quick reference card and command cheatsheet
3. .gitattributes - Git configuration for notebook handling
4. Updated CLAUDE.md - Added notebook workflow guidance for future sessions
Key benefits:
- Pair .ipynb with .py companions for clean git diffs
- Edit .py files in Claude Code to avoid token limits on large notebooks
- Commit both files: .ipynb for outputs, .py for clean code diffs
- Auto-sync changes between paired files
Helper script location: ~/bin/nb_pair.sh
Related tools:
- nb_source_diff.py for one-off diffs without outputs
- jupytext pairing for permanent workflow
🤖 Generated with Claude Code (https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Changes:
- Updated examples/basic/geoparquet0.ipynb with execution outputs
- Updated examples/basic/oc_parquet_analysis.ipynb
- Updated examples/basic/oc_parquet_analysis_enhanced.ipynb with latest analysis
- Added jupysql, duckdb-engine, toml to dependencies
New dependencies support SQL magic commands in notebooks for better DuckDB integration and interactive queries.
🤖 Generated with Claude Code (https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
…utputs
Changes:
- Add seaborn to pyproject.toml dependencies
- Update .gitignore: DuckDB temp files and parquet data files
- Clear output cells from oc_parquet_analysis_enhanced.ipynb (best practice for VCS)
Cleanup: Removed 51GB of DuckDB temporary storage files from .tmp/
- pqg_demo.ipynb: Working notebook demonstrating PQG library with OpenContext data
  * Shows single node retrieval, relationship expansion (max_depth)
  * Compares PQG vs SQL for various graph operations
  * Includes decision matrix for when to use each approach
  * All examples tested and working with 11.6M record parquet file
- PQG_INTEGRATION_PLAN.md: Strategic plan for integrating PQG into analysis workflows
  * 4-phase implementation strategy
  * Decision matrix for PQG vs SQL tradeoffs
  * Analysis of 100-cell notebook patterns (29 CTEs, 11 functions)
  * Hybrid approach: PQG for clarity, SQL for performance
- SESSION_SUMMARY.md: Complete session documentation from Nov 11 work
  * PR #4 merged (schema migration to INTEGER row_ids)
  * PR #5 updated (documentation + Copilot fixes)
  * Repository cleanup (51GB freed)
  * Ready for next phase: pushing PQG to its limits
Next: Use pqg_demo.ipynb as foundation to explore PQG capabilities and identify enhancement opportunities for contributing back to pqg library.
🤖 Generated with Claude Code (https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Changes:
- Completely rewrote pqg_demo.ipynb to showcase new typed edge functionality
- Demonstrates all 14 iSamples edge types from PQG PR #6
- Added pqg dependency to pyproject.toml (using test branch)
- Updated click dependency to >=8.1.3 for pqg compatibility
- Notebook uses local oc_isamples_pqg.parquet file (691MB, gitignored)
New notebook features:
- Edge type discovery and statistics
- Type-safe edge queries (MSR_PRODUCED_BY, MSR_KEYWORDS, etc.)
- Multi-hop graph traversal with typed edges
- Edge validation against iSamples schema
- Performance comparison: typed edges vs raw SQL
- Material type and keyword analysis examples
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Comprehensive Jupyter notebook explaining PQG schema formats:
- Parts 1-6: Narrow vs Wide schema comparison with examples
- Part 7: Data snapshot analysis (June vs December 2025)
  * Entity count changes over time
  * Sample overlap analysis (99.9995% stable)
  * New samples categorization (lithics, botanical specimens)
  * Vocabulary enrichment (151 new concepts)
  * Keyword enrichment examples
Includes performance benchmarks showing 2-3x query speedup with wide schema and 60% file size reduction.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Moved Eric's OpenContext PQG parquet files to a central location for easier tracking across notebooks and sessions.
Files moved:
- oc_isamples_pqg.parquet (691MB, narrow format)
- oc_isamples_pqg_wide.parquet (275MB, wide format)
Updated paths in:
- SESSION_SUMMARY.md
- geoparquet.ipynb
- oc_parquet_analysis.ipynb
- oc_parquet_analysis_enhanced.ipynb
- pgp.ipynb
- pqg_demo.ipynb
- narrow_vs_wide_schema.ipynb
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- scripts/generate_frontend_bundle.py: v2.0 bundle generator with:
  * H3 resolutions 4-7 for continental to local zoom
  * Source-partitioned samples (SESAR, OPENCONTEXT, GEOME, SMITHSONIAN)
  * SHA256 integrity hashes in manifest
  * Search-optimized agent index (lowercased, deduped)
  * summary.parquet for instant first paint (<5MB)
  * Enhanced manifest with load_order and schema info
- tests/test_frontend_bundle.py: 10 validation tests for bundle integrity
- examples/basic/schema_comparison.ipynb: Benchmark Export vs Narrow vs Wide formats with query patterns for map, facets, agents, reverse lookup
- isamples_schema_review.md: Architecture analysis and recommendations
Bundle v2 output: ~/Data/iSample/frontend_bundle_v2/ (628 MB total)
Critical path to first paint: ~2.5 MB
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
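The "SHA256 integrity hashes in manifest" item can be sketched with the standard library alone. This is not the code in generate_frontend_bundle.py: the function names and the manifest keys (`files`, `sha256`, `bytes`) are assumptions chosen for illustration.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import List


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large parquet files are not read into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(bundle_dir: Path) -> dict:
    """Record a hash and size for every file in the bundle (hypothetical layout)."""
    files = {}
    for path in sorted(bundle_dir.rglob("*")):
        if path.is_file() and path.name != "manifest.json":
            rel = path.relative_to(bundle_dir).as_posix()
            files[rel] = {"sha256": sha256_of(path), "bytes": path.stat().st_size}
    return {"files": files}


def verify_manifest(bundle_dir: Path, manifest: dict) -> List[str]:
    """Return the names of files whose current hash differs from the manifest."""
    return [
        name
        for name, meta in manifest["files"].items()
        if sha256_of(bundle_dir / name) != meta["sha256"]
    ]


# Demo on a throwaway directory with a placeholder file.
bundle = Path(tempfile.mkdtemp())
(bundle / "summary.parquet").write_bytes(b"placeholder bytes")
manifest = build_manifest(bundle)
print(verify_manifest(bundle, manifest))
```

A consumer of the bundle could run a check like `verify_manifest` after download and refuse to load any file that appears in the returned list.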
- Add smart path resolution: checks local files first, then cache, then downloads
- Auto-detect environment: /tmp/pqgfiles for mybinder, ~/Data/iSample/pqg_cache for others
- Support ISAMPLES_CACHE_DIR env var override
- Add USE_REMOTE option to query remote parquet via HTTP (no download)
- Add DOWNLOAD_MISSING option to control download behavior
- Update cells to use path_available() helper for URL/Path compatibility
- Add portability documentation to notebook header
Works out-of-the-box on:
- Raymond's laptop (uses existing local files)
- mybinder.org (downloads to /tmp/pqgfiles)
- Other users (downloads to ~/Data/iSample/pqg_cache)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
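The resolution order described in this commit (local files, then cache, then download) might look roughly like the sketch below. The function names, the use of `BINDER_SERVICE_HOST` to detect a Binder session, and the point at which the remote-URL option is consulted are all assumptions, not details taken from the notebook.

```python
import os
import tempfile
import urllib.request
from pathlib import Path
from typing import Callable, Iterable, Mapping, Optional, Union


def default_cache_dir(env: Optional[Mapping[str, str]] = None) -> Path:
    """Pick a cache directory: env override, then Binder, then the home directory."""
    env = os.environ if env is None else env
    override = env.get("ISAMPLES_CACHE_DIR")
    if override:
        return Path(override)
    # Assumption: a Binder session is recognised by this environment variable.
    if "BINDER_SERVICE_HOST" in env:
        return Path("/tmp/pqgfiles")
    return Path.home() / "Data" / "iSample" / "pqg_cache"


def resolve_parquet(
    filename: str,
    url: str,
    local_dirs: Iterable[Path],
    cache_dir: Path,
    use_remote: bool = False,
    download_missing: bool = True,
    downloader: Optional[Callable[[str, str], object]] = None,
) -> Union[Path, str]:
    """Return a local Path if one exists, else a URL or a freshly downloaded file."""
    for directory in local_dirs:  # 1. existing local files
        candidate = Path(directory) / filename
        if candidate.exists():
            return candidate
    cached = Path(cache_dir) / filename  # 2. previously cached copy
    if cached.exists():
        return cached
    if use_remote:  # 3. query over HTTP, no download
        return url
    if not download_missing:
        raise FileNotFoundError(f"{filename} not found locally and downloads are disabled")
    cached.parent.mkdir(parents=True, exist_ok=True)  # 4. download into the cache
    (downloader or urllib.request.urlretrieve)(url, str(cached))
    return cached


# Demo with a stand-in downloader so no network access is needed.
demo_root = Path(tempfile.mkdtemp())
demo_url = "https://example.org/demo.parquet"  # placeholder URL
downloads = []


def fake_download(url: str, target: str) -> None:
    downloads.append(url)
    Path(target).write_bytes(b"demo")


first = resolve_parquet("demo.parquet", demo_url, [demo_root / "local"],
                        demo_root / "cache", downloader=fake_download)
second = resolve_parquet("demo.parquet", demo_url, [demo_root / "local"],
                         demo_root / "cache", downloader=fake_download)
print(first, second, downloads)
```

Returning either a `Path` or a URL string is why a helper such as the commit's `path_available()` is useful: callers need one check that works for both kinds of value.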
…ebook
- Add Section 9: Lonboard WebGL visualization with color-coded source collections
- Add Section 10: Cesium browser visualization reference
- Add Section 11: Focus site exploration for PKAP (Cyprus) and Poggio Civitate (Tuscany)
- Add material analysis using official iSamples vocabulary (material_hierarchy.json)
- Archive 9 obsolete/duplicate notebooks to examples/basic/archive/ and examples/spatial/archive/
The focus sites provide concrete, relatable subsets for demonstrating queries:
- PKAP: 34.987°N, 33.708°E (archaeological project in Cyprus)
- Poggio Civitate: 43.15°N, 11.40°E (Etruscan site in Tuscany)
Related: isamplesorg/pqg#10 (vocabulary labels in Wide/Narrow formats)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Replace raw SQL examples with proper pqg library usage
- Use PQG class and TypedEdgeQueries for graph traversal
- Demonstrate edge type inference and discovery APIs
- Document narrow vs wide format differences
- Reference GitHub issues #11 (unified API) and #12 (OC project data)
The pqg library provides full query support for narrow format:
- PQG(connection, source_path) for loading graphs
- TypedEdgeQueries.get_edges_by_type() for typed queries
- get_edge_types_by_subject/object() for discovery
- infer_edge_type() for SPO -> edge type mapping
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Section 9.2 now extracts nested fields for richer click popups:
- materials: Material categories (anthropogenicmetal, rock, etc.)
- context: Sampled feature context
- object_type: Sample object type
- site_name: Sampling site name
- keywords: Associated keywords
- description: Truncated sample description
- curation, registrant: Additional metadata
Uses DuckDB list_transform() and array_to_string() to flatten nested STRUCT arrays into readable comma-separated strings.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Site-specific maps now query enhanced fields (materials, context, object_type, site_name, keywords) so clicking a point shows full sample details instead of just basic properties.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Compares Eric's OC parquet files with Zenodo files to document:
- Entity count differences (Eric has 66K more samples)
- Field population gaps (project 100% vs 0%, coords 0% vs 99.5%)
- IdentifiedConcept/Agent presence in Eric but missing in Zenodo
- Coordinate storage strategy differences (shared vs 1:1)
See pqg issue #13 for full analysis and merge recommendations.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Prep for meeting with Stephen Richard to discuss SESAR parquet file (s3://sesar-parquet/sesar_samples.parquet) vs Zenodo wide format.
Comparison cells analyze:
- Schema alignment (both have 49 columns - matches well)
- Entity type distribution differences
- IdentifiedConcept duplication issue (~77K rows, ~51K unique PIDs)
- SampleRelation count difference (3.8M vs 501K)
- Agent/MaterialSampleCuration extraction gaps
Key discussion points documented for meeting.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Includes outputs from running Section 14 comparison cells showing:
- Schema alignment verification
- Entity type distribution comparison
- IdentifiedConcept duplication analysis results
- SampleRelation content analysis
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
This adds Python client libraries for direct access to the 4 iSamples data sources, enabling data verification and enrichment workflows.
New files:
- src/isamples_client/sources/ - API client package
  * base.py: BaseSourceClient abstract class, SampleRecord dataclass
  * opencontext.py: OpenContextClient for archaeological data
  * sesar.py: SESARClient for geological samples (IGSN)
  * geome.py: GEOMEClient for genomic/biological samples
  * smithsonian.py: SmithsonianClient for museum collections
- tests/test_sources/test_clients.py - 16 unit tests (all passing)
- examples/basic/source_correlation.py - Cross-source correlation notebook
Features:
- Unified SampleRecord dataclass for cross-source comparison
- Common interface: search(), get_sample(), get_samples_by_location()
- Context manager support for proper resource cleanup
- Iterator pattern for memory-efficient large result sets
Authentication:
- OpenContext, GEOME, SESAR: No auth needed for read operations
- Smithsonian: Requires free API key from api.data.gov
Dependencies:
- Added dateparser for flexible date parsing
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
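The shape of the common interface listed under "Features" can be sketched as follows. Only the class names `SampleRecord` and `BaseSourceClient` and the three method names come from the commit message; the record fields, method signatures, and the `InMemoryClient` subclass are hypothetical and exist only to make the sketch runnable.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Iterator, List, Optional


@dataclass
class SampleRecord:
    """Unified record; these particular fields are assumed for illustration."""
    identifier: str
    source: str
    label: str = ""
    latitude: Optional[float] = None
    longitude: Optional[float] = None


class BaseSourceClient(ABC):
    """Common interface with context-manager support (signatures are assumed)."""

    def __enter__(self) -> "BaseSourceClient":
        return self

    def __exit__(self, exc_type, exc, tb) -> bool:
        self.close()
        return False  # do not suppress exceptions

    def close(self) -> None:
        """Release resources such as HTTP sessions; nothing to do by default."""

    @abstractmethod
    def search(self, query: str) -> Iterator[SampleRecord]: ...

    @abstractmethod
    def get_sample(self, identifier: str) -> Optional[SampleRecord]: ...

    @abstractmethod
    def get_samples_by_location(
        self, lat: float, lon: float, radius_deg: float
    ) -> Iterator[SampleRecord]: ...


class InMemoryClient(BaseSourceClient):
    """Toy client over a list, standing in for a real API-backed source."""

    def __init__(self, records: List[SampleRecord]) -> None:
        self.records = records
        self.closed = False

    def close(self) -> None:
        self.closed = True

    def search(self, query: str) -> Iterator[SampleRecord]:
        # Generators keep memory use flat for large result sets.
        return (r for r in self.records if query.lower() in r.label.lower())

    def get_sample(self, identifier: str) -> Optional[SampleRecord]:
        return next((r for r in self.records if r.identifier == identifier), None)

    def get_samples_by_location(self, lat, lon, radius_deg):
        return (
            r for r in self.records
            if r.latitude is not None and r.longitude is not None
            and abs(r.latitude - lat) <= radius_deg
            and abs(r.longitude - lon) <= radius_deg
        )


demo_records = [
    SampleRecord("demo:1", "DEMO", "Basalt core", 34.987, 33.708),
    SampleRecord("demo:2", "DEMO", "Pottery sherd", 43.15, 11.40),
]
with InMemoryClient(demo_records) as client:
    hits = [r.identifier for r in client.search("basalt")]
    nearby = [r.identifier for r in client.get_samples_by_location(43.0, 11.5, 0.5)]
print(hits, nearby, client.closed)
```

Because every source returns the same record type, cross-source comparison reduces to comparing `SampleRecord` fields rather than source-specific JSON.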
GEOME:
- Fix geo query to use Lucene range syntax [min TO max]
- Query Event entity for coordinates (not Sample)
- Update _parse_record to handle Event entity fields
OpenContext:
- Add ARK identifier resolution via n2t.net
- Handle dc-terms:isPartOf as list (not dict)
- Support multiple identifier formats (ARK, UUID, URL)
SESAR:
- Add new app.geosamples.org JSON endpoint with DOI format
- Handle inconsistent API responses (sometimes HTML)
- Add _parse_app_json_record for new JSON format
Example notebook:
- Use local parquet files when available
- Avoid Cloudflare R2 rate limiting
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Phase 1 of the UX plan: cohesive Jupyter demo with:
- lonboard WebGL map (handles 500K+ points)
- ipydatagrid interactive table
- Sample card on row selection
- Source dropdown filter
- Adjustable sample count slider (up to 500K per source)
- Balanced sampling across all 4 data sources
Uses DuckDB queries on wide parquet with direct lat/lon columns (no graph traversal needed for basic display).
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
New features for responsive map exploration:
- Viewport Mode toggle: auto-reload data on pan/zoom
- View state observer with 500ms debounce
- Loading spinner indicator during data fetch
- Adaptive sampling based on zoom level:
  * World (zoom<2): 10K/source
  * Continent (2-5): 25K/source
  * Country (5-8): 50K/source
  * Region (8-12): 100K/source
  * Local (>12): user slider value
- Bounding box queries filter parquet to visible extent
Uses lonboard's traitlets observe() pattern for view_state changes.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
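The zoom tiers listed in this commit translate into a small lookup function. A minimal sketch: the function name is hypothetical, and because the message does not say which tier owns the boundary values, this version assumes each lower bound is inclusive and that zoom 12 still counts as "Region".

```python
def samples_per_source(zoom: float, slider_value: int) -> int:
    """Map a zoom level to a per-source sample budget (boundary handling is assumed)."""
    if zoom < 2:        # World
        return 10_000
    if zoom < 5:        # Continent
        return 25_000
    if zoom < 8:        # Country
        return 50_000
    if zoom <= 12:      # Region
        return 100_000
    return slider_value  # Local: defer to the user's slider


for z in (1, 3.5, 6, 10, 14):
    print(z, samples_per_source(z, slider_value=250_000))
```

An observer on the map's view state would call a function like this after the debounce interval and pass the result into the bounding-box query as its row limit.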
- Click map dot → highlights corresponding table row + shows card
- Click table row → recenters map on that point (preserves zoom)
- Uses lonboard layer.selected_index observer for map clicks
- Uses ipydatagrid selections observer for table clicks
- syncing_selection flag prevents infinite callback loops
- Refactored to unified select_sample() function
Note: Table auto-scroll to selected row not implemented due to ipydatagrid limitation (no scrollToRow API exposed to Python).
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
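The role of the `syncing_selection` flag can be shown without any widgets. In the sketch below, plain attributes stand in for the lonboard layer and the ipydatagrid table, and the programmatic updates deliberately re-trigger the opposite handler to mimic observers firing. Only the names `syncing_selection` and `select_sample` come from the commit; everything else is an assumed simplification.

```python
class SelectionSync:
    """Toy model of two widgets whose selection observers trigger each other."""

    def __init__(self) -> None:
        self.syncing_selection = False
        self.map_selected = None
        self.table_selected = None
        self.handled = 0  # how many times real work was done

    # Handlers that the widget observers would call.
    def on_map_click(self, index: int) -> None:
        self.map_selected = index
        self.select_sample(index, origin="map")

    def on_table_click(self, index: int) -> None:
        self.table_selected = index
        self.select_sample(index, origin="table")

    def select_sample(self, index: int, origin: str) -> None:
        if self.syncing_selection:
            return  # this change came from our own update; do not echo it back
        self.syncing_selection = True
        try:
            self.handled += 1
            if origin != "table":
                self.on_table_click(index)  # simulates the table observer firing
            if origin != "map":
                self.on_map_click(index)    # simulates the map observer firing
        finally:
            self.syncing_selection = False  # always release, even on error


sync = SelectionSync()
sync.on_map_click(3)
print(sync.map_selected, sync.table_selected, sync.handled)
```

Without the guard, the map handler would update the table, whose observer would update the map, and so on until the recursion limit is hit; the `try`/`finally` keeps the flag from staying set if a handler raises.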
Remove _height parameter from Map() constructor - not supported in older lonboard versions (e.g., on Binder).
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Creates binder/requirements.txt to ensure Binder uses lonboard>=0.10.0 which supports newer features like _height parameter.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add pyarrow>=12.0.0 and pandas>=2.0.0 to fix ArrowDtype error in lonboard's auto_downcast function.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Align with binder/requirements.txt for consistency:
- lonboard >= 0.10.0 (supports _height parameter)
- pandas >= 2.0.0 (proper pyarrow backend support)
- pyarrow >= 12.0.0 (ArrowDtype compatibility)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Increase buffer_factor from 1.2 to 1.5 for more margin
- Add aspect_ratio parameter (1.5) to account for wide maps
- Add Mercator stretch correction (1/cos(lat)) for higher latitudes
At ~40°N latitude, the longitude buffer is now ~2.9x larger, preventing points from being clipped on the left/right margins.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
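One reading of the three factors that reproduces the "~2.9x" figure is a simple product: buffer_factor × aspect_ratio × 1/cos(lat). The sketch below works through that arithmetic. The function name and the latitude clamp are assumptions, and the notebook may combine the factors differently.

```python
import math


def longitude_buffer_multiplier(
    lat_deg: float, buffer_factor: float = 1.5, aspect_ratio: float = 1.5
) -> float:
    """Multiplier applied to the unbuffered longitude extent (assumed formulation)."""
    # Clamp latitude so 1/cos(lat) stays finite near the poles (assumed safeguard).
    clamped = max(-85.0, min(85.0, lat_deg))
    mercator_stretch = 1.0 / math.cos(math.radians(clamped))
    return buffer_factor * aspect_ratio * mercator_stretch


at_equator = longitude_buffer_multiplier(0.0)
at_40n = longitude_buffer_multiplier(40.0)
print(round(at_equator, 2), round(at_40n, 2))
```

At the equator the stretch term is 1, so the multiplier is just 1.5 × 1.5 = 2.25; at 40°N, dividing by cos(40°) ≈ 0.766 raises it to roughly 2.94, consistent with the figure quoted in the commit.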
- Search box filters by label, description, and place_name fields
- Weighted scoring: label (10 pts) > description (5 pts) > place (3 pts)
- Results sorted by score, displayed in new "score" column
- Search works with viewport mode (searches within current view)
- Clear button to reset search and reload all samples
- Reorganized controls into two rows for better layout
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
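The weighted scoring rule can be sketched over plain dictionaries. The weights and field names are those given in the commit; the matching rule (case-insensitive substring) and the choice to sum weights when several fields match are assumptions, as are the function names and sample rows.

```python
from typing import Dict, List

WEIGHTS = {"label": 10, "description": 5, "place_name": 3}


def score_record(record: Dict[str, str], term: str) -> int:
    """Sum the weights of every field that contains the term (assumed rule)."""
    needle = term.lower()
    return sum(
        weight
        for field, weight in WEIGHTS.items()
        if needle in (record.get(field) or "").lower()
    )


def search(records: List[Dict[str, str]], term: str) -> List[Dict[str, object]]:
    """Keep matching records, attach a score column, sort best first."""
    scored = [dict(r, score=score_record(r, term)) for r in records]
    hits = [r for r in scored if r["score"] > 0]
    return sorted(hits, key=lambda r: r["score"], reverse=True)


# Made-up rows for demonstration only.
rows = [
    {"label": "Sample A", "description": "basalt fragment", "place_name": "Cyprus"},
    {"label": "Basalt core", "description": "drilled core", "place_name": "Tuscany"},
    {"label": "Sample C", "description": "ceramic", "place_name": None},
    {"label": "Basalt tool", "description": "worked basalt", "place_name": "Basalt Hill"},
]
results = search(rows, "basalt")
print([(r["label"], r["score"]) for r in results])
```

In the notebook the same idea would more likely be expressed as a SQL `CASE` expression inside the DuckDB query, so that scoring happens alongside the viewport filter rather than in Python.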
When a search term is active, use the slider value directly instead of zoom-based adaptive sampling. This ensures consistent search results regardless of zoom level - zooming out no longer reduces result count.
Adaptive sampling still applies when browsing without a search term.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Work in progress...