@rdhyee rdhyee commented Sep 19, 2024

Work in progress.

rdhyee added 30 commits November 9, 2023 13:24
…the calculation of the distances between the cities is screwy, though.
rdhyee and others added 30 commits October 8, 2025 15:01
- Updated CLAUDE.md with prominent warning about view_state syntax change
- Fixed record_counts.ipynb cell 80:
  * Changed map_kwargs from old zoom/center to new view_state format
  * Added LIMIT 100000 to prevent loading 6M+ rows (was causing 5+ min hangs)
- Added geoparquet0.ipynb as working reference implementation

Lonboard 0.12+ requires:
  map_kwargs={"view_state": {"zoom": 1, "latitude": 0, "longitude": 0}}
Instead of:
  map_kwargs={"zoom": 1, "center": {"lat": 0, "lon": 0}}

Performance fix prevents timeout issues when visualizing large parquet datasets.
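
The kwargs change above can be applied mechanically. The sketch below is a hypothetical helper (`upgrade_map_kwargs` is not part of lonboard or of this repository) that rewrites the old `zoom`/`center` form into the `view_state` form quoted in this commit. It only manipulates dictionaries and assumes `center` uses `lat`/`lon` keys, as in the example.

```python
def upgrade_map_kwargs(map_kwargs: dict) -> dict:
    """Return a copy of map_kwargs using the view_state layout."""
    if "view_state" in map_kwargs:
        return dict(map_kwargs)  # already in the new form

    # Keep every unrelated key untouched.
    upgraded = {k: v for k, v in map_kwargs.items() if k not in ("zoom", "center")}

    view_state = {}
    if "zoom" in map_kwargs:
        view_state["zoom"] = map_kwargs["zoom"]
    center = map_kwargs.get("center") or {}
    if "lat" in center:
        view_state["latitude"] = center["lat"]
    if "lon" in center:
        view_state["longitude"] = center["lon"]

    if view_state:
        upgraded["view_state"] = view_state
    return upgraded


old = {"zoom": 1, "center": {"lat": 0, "lon": 0}}
new = upgrade_map_kwargs(old)
```

Calling the helper on kwargs that already contain `view_state` returns them unchanged, so it should be safe to run over notebooks that were partly migrated.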

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Created comprehensive jupytext pairing setup for better notebook version control:

1. JUPYTEXT_WORKFLOW.md - Full guide with workflows, troubleshooting, examples
2. QUICKREF_NOTEBOOKS.md - Quick reference card and command cheatsheet
3. .gitattributes - Git configuration for notebook handling
4. Updated CLAUDE.md - Added notebook workflow guidance for future sessions

Key benefits:
- Pair .ipynb with .py companions for clean git diffs
- Edit .py files in Claude Code to avoid token limits on large notebooks
- Commit both files: .ipynb for outputs, .py for clean code diffs
- Auto-sync changes between paired files
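
As a rough illustration of the pairing workflow, the sketch below builds the two jupytext CLI calls typically used for it: `--set-formats` to pair a notebook and `--sync` to bring both files up to date. The `py:percent` format is an assumption (the commit only says `.py` companions), and the function names are hypothetical. This is not the content of `~/bin/nb_pair.sh`.

```python
import subprocess


def pairing_commands(notebook: str) -> list:
    """Return the jupytext CLI calls that pair and then sync one notebook."""
    return [
        # Pair the .ipynb with a .py companion (percent format is assumed).
        ["jupytext", "--set-formats", "ipynb,py:percent", notebook],
        # Propagate whichever file changed most recently to its partner.
        ["jupytext", "--sync", notebook],
    ]


def pair_notebook(notebook: str) -> None:
    """Run the commands; requires jupytext to be installed and on PATH."""
    for command in pairing_commands(notebook):
        subprocess.run(command, check=True)


commands = pairing_commands("examples/basic/geoparquet0.ipynb")
```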

Helper script location: ~/bin/nb_pair.sh

Related tools:
- nb_source_diff.py for one-off diffs without outputs
- jupytext pairing for permanent workflow

🤖 Generated with Claude Code (https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Changes:
- Updated examples/basic/geoparquet0.ipynb with execution outputs
- Updated examples/basic/oc_parquet_analysis.ipynb
- Updated examples/basic/oc_parquet_analysis_enhanced.ipynb with latest analysis
- Added jupysql, duckdb-engine, toml to dependencies

New dependencies support SQL magic commands in notebooks for better
DuckDB integration and interactive queries.

🤖 Generated with Claude Code (https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
…utputs

Changes:
- Add seaborn to pyproject.toml dependencies
- Update .gitignore: DuckDB temp files and parquet data files
- Clear output cells from oc_parquet_analysis_enhanced.ipynb (best practice for VCS)

Cleanup: Removed 51GB of DuckDB temporary storage files from .tmp/
- pqg_demo.ipynb: Working notebook demonstrating PQG library with OpenContext data
  * Shows single node retrieval, relationship expansion (max_depth)
  * Compares PQG vs SQL for various graph operations
  * Includes decision matrix for when to use each approach
  * All examples tested and working with 11.6M record parquet file

- PQG_INTEGRATION_PLAN.md: Strategic plan for integrating PQG into analysis workflows
  * 4-phase implementation strategy
  * Decision matrix for PQG vs SQL tradeoffs
  * Analysis of 100-cell notebook patterns (29 CTEs, 11 functions)
  * Hybrid approach: PQG for clarity, SQL for performance

- SESSION_SUMMARY.md: Complete session documentation from Nov 11 work
  * PR #4 merged (schema migration to INTEGER row_ids)
  * PR #5 updated (documentation + Copilot fixes)
  * Repository cleanup (51GB freed)
  * Ready for next phase: pushing PQG to its limits

Next: Use pqg_demo.ipynb as foundation to explore PQG capabilities
and identify enhancement opportunities for contributing back to pqg library.

🤖 Generated with Claude Code (https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Changes:
- Completely rewrote pqg_demo.ipynb to showcase new typed edge functionality
- Demonstrates all 14 iSamples edge types from PQG PR #6
- Added pqg dependency to pyproject.toml (using test branch)
- Updated click dependency to >=8.1.3 for pqg compatibility
- Notebook uses local oc_isamples_pqg.parquet file (691MB, gitignored)

New notebook features:
- Edge type discovery and statistics
- Type-safe edge queries (MSR_PRODUCED_BY, MSR_KEYWORDS, etc.)
- Multi-hop graph traversal with typed edges
- Edge validation against iSamples schema
- Performance comparison: typed edges vs raw SQL
- Material type and keyword analysis examples

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Comprehensive Jupyter notebook explaining PQG schema formats:
- Part 1-6: Narrow vs Wide schema comparison with examples
- Part 7: Data snapshot analysis (June vs December 2025)
  - Entity count changes over time
  - Sample overlap analysis (99.9995% stable)
  - New samples categorization (lithics, botanical specimens)
  - Vocabulary enrichment (151 new concepts)
  - Keyword enrichment examples

Includes performance benchmarks showing 2-3x query speedup
with wide schema and 60% file size reduction.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Moved Eric's OpenContext PQG parquet files to a central location for
easier tracking across notebooks and sessions.

Files moved:
- oc_isamples_pqg.parquet (691MB, narrow format)
- oc_isamples_pqg_wide.parquet (275MB, wide format)

Updated paths in:
- SESSION_SUMMARY.md
- geoparquet.ipynb
- oc_parquet_analysis.ipynb
- oc_parquet_analysis_enhanced.ipynb
- pgp.ipynb
- pqg_demo.ipynb
- narrow_vs_wide_schema.ipynb

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- scripts/generate_frontend_bundle.py: v2.0 bundle generator with:
  - H3 resolutions 4-7 for continental to local zoom
  - Source-partitioned samples (SESAR, OPENCONTEXT, GEOME, SMITHSONIAN)
  - SHA256 integrity hashes in manifest
  - Search-optimized agent index (lowercased, deduped)
  - summary.parquet for instant first paint (<5MB)
  - Enhanced manifest with load_order and schema info
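
A minimal sketch of how a manifest with SHA256 hashes and a `load_order` could be produced and checked, using only the standard library. The manifest layout and the function names are assumptions for illustration. The actual structure written by `scripts/generate_frontend_bundle.py` is not shown in this commit.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large parquet files are never fully in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(bundle_dir: Path, load_order: list) -> dict:
    """Record size and hash for each file, in the order a client should load them."""
    files = {}
    for name in load_order:
        path = bundle_dir / name
        files[name] = {"bytes": path.stat().st_size, "sha256": sha256_of(path)}
    return {"version": "2.0", "load_order": list(load_order), "files": files}


def verify_manifest(bundle_dir: Path, manifest: dict) -> list:
    """Return the names of files whose current hash differs from the manifest."""
    return [
        name
        for name, meta in manifest["files"].items()
        if sha256_of(bundle_dir / name) != meta["sha256"]
    ]
```

Listing `summary.parquet` first in `load_order` would match the stated goal of a small critical path to first paint.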

- tests/test_frontend_bundle.py: 10 validation tests for bundle integrity

- examples/basic/schema_comparison.ipynb: Benchmark Export vs Narrow vs Wide
  formats with query patterns for map, facets, agents, reverse lookup

- isamples_schema_review.md: Architecture analysis and recommendations

Bundle v2 output: ~/Data/iSample/frontend_bundle_v2/ (628 MB total)
Critical path to first paint: ~2.5 MB

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add smart path resolution: checks local files first, then cache, then downloads
- Auto-detect environment: /tmp/pqgfiles for mybinder, ~/Data/iSample/pqg_cache for others
- Support ISAMPLES_CACHE_DIR env var override
- Add USE_REMOTE option to query remote parquet via HTTP (no download)
- Add DOWNLOAD_MISSING option to control download behavior
- Update cells to use path_available() helper for URL/Path compatibility
- Add portability documentation to notebook header

Works out-of-the-box on:
- Raymond's laptop (uses existing local files)
- mybinder.org (downloads to /tmp/pqgfiles)
- Other users (downloads to ~/Data/iSample/pqg_cache)
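
The resolution order described above (local files, then cache, then download) can be sketched as follows. This is an illustration, not the notebook's code. The function names are hypothetical, detecting mybinder through a `BINDER_*` environment variable is an assumption, and so is the point at which `USE_REMOTE` takes effect.

```python
import os
import urllib.request
from pathlib import Path
from typing import Callable, Mapping, Optional, Sequence, Union


def default_cache_dir(env: Optional[Mapping[str, str]] = None) -> Path:
    """Pick a cache directory: env override, then Binder, then home."""
    env = os.environ if env is None else env
    override = env.get("ISAMPLES_CACHE_DIR")
    if override:
        return Path(override).expanduser()
    # Assumption: a BINDER_* variable signals a mybinder.org session.
    if any(key.startswith("BINDER_") for key in env):
        return Path("/tmp/pqgfiles")
    return Path.home() / "Data" / "iSample" / "pqg_cache"


def _download(url: str, destination: Path) -> None:
    urllib.request.urlretrieve(url, str(destination))


def resolve_source(
    name: str,
    url: str,
    local_dirs: Sequence[Path],
    cache_dir: Path,
    use_remote: bool = False,
    download_missing: bool = True,
    download: Callable[[str, Path], None] = _download,
) -> Union[Path, str]:
    """Return a local Path when one exists, else a URL or a fresh download."""
    for directory in local_dirs:  # 1. existing local files win
        candidate = Path(directory) / name
        if candidate.exists():
            return candidate
    cached = Path(cache_dir) / name
    if cached.exists():  # 2. previously downloaded copy
        return cached
    if use_remote:  # 3. query over HTTP, no download
        return url
    if not download_missing:
        raise FileNotFoundError(f"{name} not found locally and downloads are disabled")
    cached.parent.mkdir(parents=True, exist_ok=True)
    download(url, cached)  # 4. fetch into the cache
    return cached
```

Because the result may be either a `Path` or a URL string, callers need a helper along the lines of the `path_available()` function mentioned above before testing for existence.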

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…ebook

- Add Section 9: Lonboard WebGL visualization with color-coded source collections
- Add Section 10: Cesium browser visualization reference
- Add Section 11: Focus site exploration for PKAP (Cyprus) and Poggio Civitate (Tuscany)
- Add material analysis using official iSamples vocabulary (material_hierarchy.json)
- Archive 9 obsolete/duplicate notebooks to examples/basic/archive/ and examples/spatial/archive/

The focus sites provide concrete, relatable subsets for demonstrating queries:
- PKAP: 34.987°N, 33.708°E (archaeological project in Cyprus)
- Poggio Civitate: 43.15°N, 11.40°E (Etruscan site in Tuscany)

Related: isamplesorg/pqg#10 (vocabulary labels in Wide/Narrow formats)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Replace raw SQL examples with proper pqg library usage
- Use PQG class and TypedEdgeQueries for graph traversal
- Demonstrate edge type inference and discovery APIs
- Document narrow vs wide format differences
- Reference GitHub issues #11 (unified API) and #12 (OC project data)

The pqg library provides full query support for narrow format:
- PQG(connection, source_path) for loading graphs
- TypedEdgeQueries.get_edges_by_type() for typed queries
- get_edge_types_by_subject/object() for discovery
- infer_edge_type() for SPO -> edge type mapping

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Section 9.2 now extracts nested fields for richer click popups:
- materials: Material categories (anthropogenicmetal, rock, etc.)
- context: Sampled feature context
- object_type: Sample object type
- site_name: Sampling site name
- keywords: Associated keywords
- description: Truncated sample description
- curation, registrant: Additional metadata

Uses DuckDB list_transform() and array_to_string() to flatten
nested STRUCT arrays into readable comma-separated strings.
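
To make the flattening concrete, here is a hedged sketch that builds the kind of SQL expression described. `list_transform()` and `array_to_string()` are DuckDB functions, but the column names (`materials`, `keywords`, `pid`), the struct field `label`, and the arrow lambda form are assumptions that should be checked against the actual schema and the installed DuckDB version.

```python
def flatten_expr(column: str, field: str = "label", sep: str = ", ") -> str:
    """SQL expression turning a LIST of STRUCTs into one delimited string."""
    return (
        f"array_to_string(list_transform({column}, x -> x.{field}), '{sep}') "
        f"AS {column}_text"
    )


def popup_query(parquet_path: str, columns=("materials", "keywords")) -> str:
    """SELECT statement that flattens each nested column for click popups."""
    flattened = ",\n       ".join(flatten_expr(c) for c in columns)
    return f"SELECT pid,\n       {flattened}\nFROM read_parquet('{parquet_path}')"


def flatten_in_python(items, field="label", sep=", "):
    """Plain-Python equivalent, handy for checking the expected popup text."""
    return sep.join(str(item[field]) for item in items or [])


# With duckdb installed, the query string could be run as:
#   duckdb.sql(popup_query("oc_isamples_pqg_wide.parquet")).df()
query = popup_query("oc_isamples_pqg_wide.parquet")
```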

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Site-specific maps now query enhanced fields (materials, context,
object_type, site_name, keywords) so clicking a point shows full
sample details instead of just basic properties.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Compares Eric's OC parquet files with Zenodo files to document:
- Entity count differences (Eric has 66K more samples)
- Field population gaps (project 100% vs 0%, coords 0% vs 99.5%)
- IdentifiedConcept/Agent presence in Eric but missing in Zenodo
- Coordinate storage strategy differences (shared vs 1:1)

See pqg issue #13 for full analysis and merge recommendations.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Prep for meeting with Stephen Richard to discuss SESAR parquet file
(s3://sesar-parquet/sesar_samples.parquet) vs Zenodo wide format.

Comparison cells analyze:
- Schema alignment (both have 49 columns - matches well)
- Entity type distribution differences
- IdentifiedConcept duplication issue (~77K rows, ~51K unique PIDs)
- SampleRelation count difference (3.8M vs 501K)
- Agent/MaterialSampleCuration extraction gaps
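
The duplication issue above amounts to comparing row counts with distinct PID counts per entity type. The sketch below shows that check on a tiny in-memory table using the standard-library `sqlite3` module so it runs anywhere. The table name `nodes` and the columns `pid` and `otype` are assumptions, and the same standard SQL should also work in DuckDB against the parquet file.

```python
import sqlite3

DUPLICATION_SQL = """
SELECT otype,
       COUNT(*)            AS n_rows,
       COUNT(DISTINCT pid) AS n_unique_pids
FROM nodes
GROUP BY otype
HAVING COUNT(*) > COUNT(DISTINCT pid)
ORDER BY otype
"""


def duplicated_types(conn: sqlite3.Connection) -> list:
    """Entity types that have more rows than distinct PIDs."""
    return conn.execute(DUPLICATION_SQL).fetchall()


# Toy data: one concept PID appears twice, sample PIDs are unique.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nodes (pid TEXT, otype TEXT)")
conn.executemany(
    "INSERT INTO nodes VALUES (?, ?)",
    [
        ("concept:1", "IdentifiedConcept"),
        ("concept:1", "IdentifiedConcept"),
        ("concept:2", "IdentifiedConcept"),
        ("sample:1", "MaterialSampleRecord"),
        ("sample:2", "MaterialSampleRecord"),
    ],
)
report = duplicated_types(conn)
```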

Key discussion points documented for meeting.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Includes outputs from running Section 14 comparison cells showing:
- Schema alignment verification
- Entity type distribution comparison
- IdentifiedConcept duplication analysis results
- SampleRelation content analysis

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
This adds Python client libraries for direct access to the 4 iSamples
data sources, enabling data verification and enrichment workflows.

New files:
- src/isamples_client/sources/ - API client package
  - base.py: BaseSourceClient abstract class, SampleRecord dataclass
  - opencontext.py: OpenContextClient for archaeological data
  - sesar.py: SESARClient for geological samples (IGSN)
  - geome.py: GEOMEClient for genomic/biological samples
  - smithsonian.py: SmithsonianClient for museum collections

- tests/test_sources/test_clients.py - 16 unit tests (all passing)
- examples/basic/source_correlation.py - Cross-source correlation notebook

Features:
- Unified SampleRecord dataclass for cross-source comparison
- Common interface: search(), get_sample(), get_samples_by_location()
- Context manager support for proper resource cleanup
- Iterator pattern for memory-efficient large result sets
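
The features listed above could fit together roughly as in the following sketch. The class and method names (`SampleRecord`, `BaseSourceClient`, `search()`, `get_sample()`, `get_samples_by_location()`) come from this commit, but every field, signature, and the `InMemoryClient` stand-in are assumptions made for illustration. The real definitions live in `src/isamples_client/sources/base.py`.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Iterator, Optional


@dataclass
class SampleRecord:
    """Source-neutral record; the field names here are hypothetical."""
    identifier: str
    source: str
    label: str = ""
    latitude: Optional[float] = None
    longitude: Optional[float] = None
    raw: dict = field(default_factory=dict)


class BaseSourceClient(ABC):
    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()  # returning None lets exceptions propagate

    def close(self) -> None:
        """Release network resources; a no-op by default."""

    @abstractmethod
    def search(self, query: str) -> Iterator[SampleRecord]: ...

    @abstractmethod
    def get_sample(self, identifier: str) -> Optional[SampleRecord]: ...

    @abstractmethod
    def get_samples_by_location(
        self, min_lat: float, max_lat: float, min_lon: float, max_lon: float
    ) -> Iterator[SampleRecord]: ...


class InMemoryClient(BaseSourceClient):
    """Toy client used only to exercise the interface."""

    def __init__(self, records):
        self.records = list(records)
        self.closed = False

    def close(self) -> None:
        self.closed = True

    def search(self, query):
        # Generators keep memory use flat for large result sets.
        return (r for r in self.records if query.lower() in r.label.lower())

    def get_sample(self, identifier):
        return next((r for r in self.records if r.identifier == identifier), None)

    def get_samples_by_location(self, min_lat, max_lat, min_lon, max_lon):
        for r in self.records:
            if r.latitude is None or r.longitude is None:
                continue
            if min_lat <= r.latitude <= max_lat and min_lon <= r.longitude <= max_lon:
                yield r


records = [
    SampleRecord("s1", "OPENCONTEXT", "Bronze pin", 43.15, 11.40),
    SampleRecord("s2", "SESAR", "Basalt core"),
]
with InMemoryClient(records) as client:
    hits = [r.identifier for r in client.search("basalt")]
    nearby = [r.identifier for r in client.get_samples_by_location(43, 44, 11, 12)]
```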

Authentication:
- OpenContext, GEOME, SESAR: No auth needed for read operations
- Smithsonian: Requires free API key from api.data.gov

Dependencies:
- Added dateparser for flexible date parsing

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
GEOME:
- Fix geo query to use Lucene range syntax [min TO max]
- Query Event entity for coordinates (not Sample)
- Update _parse_record to handle Event entity fields
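
For reference, a Lucene range clause has the form `field:[min TO max]`. The sketch below assembles a bounding-box query from two such clauses. The field names `decimalLatitude` and `decimalLongitude` and the function names are assumptions, and the exact fields GEOME expects on the Event entity should be checked against its API.

```python
def lucene_range(field: str, low: float, high: float) -> str:
    """Inclusive Lucene range clause, e.g. depth:[1 TO 5]."""
    return f"{field}:[{low} TO {high}]"


def bbox_query(
    min_lat: float,
    max_lat: float,
    min_lon: float,
    max_lon: float,
    lat_field: str = "decimalLatitude",   # assumed field name
    lon_field: str = "decimalLongitude",  # assumed field name
) -> str:
    """Combine latitude and longitude ranges into one query string."""
    if min_lat > max_lat or min_lon > max_lon:
        raise ValueError("minimum bounds must not exceed maximum bounds")
    return (
        f"{lucene_range(lat_field, min_lat, max_lat)} AND "
        f"{lucene_range(lon_field, min_lon, max_lon)}"
    )


query = bbox_query(34.5, 35.5, 33.0, 34.0)
```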

OpenContext:
- Add ARK identifier resolution via n2t.net
- Handle dc-terms:isPartOf as list (not dict)
- Support multiple identifier formats (ARK, UUID, URL)
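
Supporting several identifier formats usually starts with classifying the input. The following sketch shows one way to do that and to build an n2t.net resolver URL for an ARK. The function names and the regular expression are illustrative assumptions, not the client's actual implementation, and the example ARK uses the `99999` test NAAN.

```python
import re
import uuid

# "ark:" followed by an optional slash, a numeric NAAN, and a name.
ARK_RE = re.compile(r"^ark:/?(\d{5,}/\S+)$", re.IGNORECASE)


def classify_identifier(identifier: str) -> str:
    """Return 'ark', 'url', 'uuid', or 'unknown' for a raw identifier."""
    value = identifier.strip()
    if ARK_RE.match(value):
        return "ark"
    if value.lower().startswith(("http://", "https://")):
        return "url"
    try:
        uuid.UUID(value)
        return "uuid"
    except ValueError:
        return "unknown"


def ark_resolver_url(ark: str) -> str:
    """Build the n2t.net URL that redirects to the ARK's current location."""
    match = ARK_RE.match(ark.strip())
    if not match:
        raise ValueError(f"not an ARK identifier: {ark!r}")
    return f"https://n2t.net/ark:/{match.group(1)}"


url = ark_resolver_url("ark:/99999/fk4example")
```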

SESAR:
- Add new app.geosamples.org JSON endpoint with DOI format
- Handle inconsistent API responses (sometimes HTML)
- Add _parse_app_json_record for new JSON format

Example notebook:
- Use local parquet files when available
- Avoid Cloudflare R2 rate limiting

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Phase 1 of the UX plan: cohesive Jupyter demo with:
- lonboard WebGL map (handles 500K+ points)
- ipydatagrid interactive table
- Sample card on row selection
- Source dropdown filter
- Adjustable sample count slider (up to 500K per source)
- Balanced sampling across all 4 data sources

Uses DuckDB queries on wide parquet with direct lat/lon
columns (no graph traversal needed for basic display).

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
New features for responsive map exploration:
- Viewport Mode toggle: auto-reload data on pan/zoom
- View state observer with 500ms debounce
- Loading spinner indicator during data fetch
- Adaptive sampling based on zoom level:
  - World (zoom<2): 10K/source
  - Continent (2-5): 25K/source
  - Country (5-8): 50K/source
  - Region (8-12): 100K/source
  - Local (>12): user slider value
- Bounding box queries filter parquet to visible extent
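
The zoom tiers above translate into a small lookup function. In this sketch the boundary handling is an assumption, since the listed ranges overlap at 5 and 8: each boundary zoom is assigned to the more detailed tier, except 12, which stays in the Region tier because Local is stated as `>12`. The function name is hypothetical.

```python
def samples_per_source(zoom: float, slider_value: int) -> int:
    """Per-source sample limit for a given map zoom level."""
    if zoom < 2:
        return 10_000   # World
    if zoom < 5:
        return 25_000   # Continent
    if zoom < 8:
        return 50_000   # Country
    if zoom <= 12:
        return 100_000  # Region
    return slider_value  # Local: defer to the user's slider


limits = [samples_per_source(z, slider_value=500_000) for z in (1, 3, 6, 10, 14)]
```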

Uses lonboard's traitlets observe() pattern for view_state changes.
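
One way to implement a debounce like the 500 ms one mentioned above is to restart a timer on every event so that only the last event in a burst triggers a reload. The `Debouncer` class below is a hypothetical sketch using `threading.Timer`. How the notebook actually debounces, and the exact `observe()` wiring shown in the comment, are assumptions.

```python
import threading


class Debouncer:
    """Call `callback` only after `delay_s` seconds without a new event."""

    def __init__(self, delay_s: float, callback):
        self.delay_s = delay_s
        self.callback = callback
        self._timer = None
        self._lock = threading.Lock()

    def __call__(self, *args, **kwargs):
        with self._lock:
            if self._timer is not None:
                self._timer.cancel()  # drop the pending call from the burst
            self._timer = threading.Timer(self.delay_s, self.callback, args, kwargs)
            self._timer.daemon = True
            self._timer.start()


# Assumed wiring with a lonboard Map instance named `map_widget`:
#   reload = Debouncer(0.5, reload_visible_samples)
#   map_widget.observe(lambda change: reload(change["new"]), names="view_state")
```

The callback runs on a timer thread, so any widget updates it performs may need to be checked for thread safety in the notebook environment.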

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Click map dot → highlights corresponding table row + shows card
- Click table row → recenters map on that point (preserves zoom)
- Uses lonboard layer.selected_index observer for map clicks
- Uses ipydatagrid selections observer for table clicks
- syncing_selection flag prevents infinite callback loops
- Refactored to unified select_sample() function
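
The loop-prevention idea can be shown without any widgets. In the sketch below, `FakeWidget` is a hypothetical stand-in whose observers fire on every change, mimicking how updating the table in response to a map click would otherwise trigger the table observer and bounce back. The names `syncing_selection` and `select_sample` follow this commit. Everything else is assumed.

```python
class FakeWidget:
    """Stand-in for a widget whose observers fire on every value change."""

    def __init__(self):
        self.value = None
        self.observers = []

    def set(self, value):
        self.value = value
        for callback in self.observers:
            callback(value)


class SelectionSync:
    def __init__(self, map_widget: FakeWidget, table_widget: FakeWidget):
        self.map_widget = map_widget
        self.table_widget = table_widget
        self.syncing_selection = False
        self.handled = 0  # counts selections that were actually processed
        map_widget.observers.append(lambda i: self.select_sample(i, "map"))
        table_widget.observers.append(lambda i: self.select_sample(i, "table"))

    def select_sample(self, index: int, origin: str) -> None:
        if self.syncing_selection:
            return  # change was caused by our own update; ignore it
        self.syncing_selection = True
        try:
            self.handled += 1
            if origin != "table":
                self.table_widget.set(index)  # highlight the row
            if origin != "map":
                self.map_widget.set(index)    # recenter on the point
        finally:
            self.syncing_selection = False    # always release the guard


map_widget, table_widget = FakeWidget(), FakeWidget()
sync = SelectionSync(map_widget, table_widget)
map_widget.set(7)    # simulated click on a map dot
table_widget.set(3)  # simulated click on a table row
```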

Note: Table auto-scroll to selected row not implemented due to
ipydatagrid limitation (no scrollToRow API exposed to Python).

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Remove _height parameter from Map() constructor - not supported
in older lonboard versions (e.g., on Binder).

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Creates binder/requirements.txt to ensure Binder uses lonboard>=0.10.0,
which supports newer features such as the _height parameter.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add pyarrow>=12.0.0 and pandas>=2.0.0 to fix ArrowDtype error
in lonboard's auto_downcast function.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Align with binder/requirements.txt for consistency:
- lonboard >= 0.10.0 (supports _height parameter)
- pandas >= 2.0.0 (proper pyarrow backend support)
- pyarrow >= 12.0.0 (ArrowDtype compatibility)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Increase buffer_factor from 1.2 to 1.5 for more margin
- Add aspect_ratio parameter (1.5) to account for wide maps
- Add Mercator stretch correction (1/cos(lat)) for higher latitudes

At ~40°N latitude, the combined longitude buffer multiplier is now ~2.9x
(1.5 × 1.5 × 1/cos(40°) ≈ 2.94), preventing points from being clipped at
the left and right margins.
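
The figure quoted for 40°N is consistent with multiplying the three factors together. The sketch below works that out and applies it to a bounding box. Treating the factors as a simple product is an assumption inferred from the numbers, and the function names and the ±85° clamp are illustrative choices.

```python
import math


def lon_buffer_multiplier(
    lat_deg: float, buffer_factor: float = 1.5, aspect_ratio: float = 1.5
) -> float:
    """Longitude padding factor including the Mercator stretch 1/cos(lat)."""
    clamped = max(-85.0, min(85.0, lat_deg))  # avoid division by ~0 near the poles
    return buffer_factor * aspect_ratio / math.cos(math.radians(clamped))


def buffered_bbox(center_lat, center_lon, half_span_deg, buffer_factor=1.5, aspect_ratio=1.5):
    """Return (min_lon, min_lat, max_lon, max_lat) padded beyond the viewport."""
    lat_half = half_span_deg * buffer_factor
    lon_half = half_span_deg * lon_buffer_multiplier(center_lat, buffer_factor, aspect_ratio)
    return (
        max(-180.0, center_lon - lon_half),
        max(-90.0, center_lat - lat_half),
        min(180.0, center_lon + lon_half),
        min(90.0, center_lat + lat_half),
    )


multiplier_at_40n = lon_buffer_multiplier(40.0)
bbox = buffered_bbox(40.0, 0.0, half_span_deg=1.0)
```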

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Search box filters by label, description, and place_name fields
- Weighted scoring: label (10 pts) > description (5 pts) > place (3 pts)
- Results sorted by score, displayed in new "score" column
- Search works with viewport mode (searches within current view)
- Clear button to reset search and reload all samples
- Reorganized controls into two rows for better layout
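
A minimal sketch of the weighted scoring, using the point values listed above. It assumes case-insensitive substring matching and that scores from multiple matching fields add up. The notebook may match or combine fields differently, and the function names are hypothetical.

```python
# Points per field, taken from the commit message.
WEIGHTS = {"label": 10, "description": 5, "place_name": 3}


def score_record(record: dict, term: str) -> int:
    """Sum the weights of every field that contains the search term."""
    needle = term.casefold()
    return sum(
        weight
        for field_name, weight in WEIGHTS.items()
        if needle in (record.get(field_name) or "").casefold()
    )


def search(records: list, term: str) -> list:
    """Return matching records with a 'score' key, best matches first."""
    scored = [dict(record, score=score_record(record, term)) for record in records]
    hits = [record for record in scored if record["score"] > 0]
    return sorted(hits, key=lambda record: record["score"], reverse=True)


samples = [
    {"label": "Pottery sherd", "description": "Found near basalt wall", "place_name": "PKAP"},
    {"label": "Basalt core", "description": "Basalt sample", "place_name": None},
    {"label": "Bone fragment", "description": "", "place_name": "Poggio Civitate"},
]
results = search(samples, "basalt")
```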

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
When a search term is active, use the slider value directly instead of
zoom-based adaptive sampling. This ensures consistent search results
regardless of zoom level - zooming out no longer reduces result count.

Adaptive sampling still applies when browsing without a search term.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>