2025.11.04 #73

atambay37 · 2025-11-25T18:43:21Z

Project Restructuring and Pipeline Packaging: Don presented a major project update covering the restructuring of the pipeline into modular Python packages and the integration of Pixi for simplified orchestration. Peter and Mei took part as key stakeholders, contributing to testing and providing context.
- Pipeline Reorganization: Don explained that the project has been reorganized into three distinct Python packages (data models, pipeline, and web service), each with its own dependencies and testing pipeline. This modular approach simplifies management, versioning, and collaboration, and reduces merge conflicts and hard-coded path issues.
- Pixi Integration: The team now uses Pixi to orchestrate services such as PostgreSQL and Prefect, which streamlines deployment and management of the ETL pipeline. Don demonstrated how Pixi commands can spin up services, run migrations, and deploy flows, making local development and testing more efficient.
- Database Schema Separation: Don described the separation of the bio service database and the Prefect database due to schema conflicts during Alembic migrations, ensuring that each service maintains its own schema and migration history for stability and clarity.
- Testing and Documentation Updates: Each package now includes its own unit and integration tests, and Don has updated the README and agents markdown files to reflect the new structure. Peter noted that some old documentation is now outdated and will be revised to align with the new system.
- Action Items and Next Steps: Don requested Peter's feedback on the PR before merging, after which Vraj will address any merge conflicts and continue work on documentation. The team aims to have a consolidated documentation website ready for the next meeting.
Documentation Consolidation and Tooling: Vraj, with support from Don, is consolidating all project documentation into a single hosted site using MkDocs and Read the Docs, aiming to improve accessibility and maintainability for all users, including Mei and future contributors.
- Documentation Hosting: Vraj is working on a PR to consolidate documentation into a single website hosted by Read the Docs, which will include information from pipelines, README files, and other relevant resources, making it easier for users to find and update documentation.
- MkDocs Implementation: The team is using MkDocs as the tool for rendering documentation, which will provide a visually appealing and organized structure for the consolidated site.
- Agents Markdown Files: Don updated the agents markdown files, creating both a root-level file and package-specific files to support AI-assisted development and provide context for tools like Copilot and IDE integrations.
- Future Documentation Updates: Don indicated that Vraj will handle further documentation updates and resolve merge conflicts after the PR is merged, with the goal of having a comprehensive documentation site available by the next meeting.
ETL Pipeline and Prefect Flow Registration: Peter and Don discussed the process for registering Prefect flows and deployments in the updated ETL pipeline, clarifying that the workflow remains largely unchanged and is now more modular due to the new package structure.
- Flow Registration Process: Peter asked whether the process for registering Prefect flows and deployments had changed, and Don confirmed that explicit imports are still required but are now managed within the modular package structure, improving maintainability and IDE support.
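Why explicit imports matter can be illustrated with a small stand-in registry. This is a sketch only, assuming flows become known to the orchestrator when the Python module that defines them is imported. It does not use Prefect's API: `FLOW_REGISTRY`, `register_flow`, and `run_flow` are hypothetical names chosen for illustration.

```python
from typing import Callable, Dict

# Hypothetical registry standing in for an orchestrator's flow lookup.
FLOW_REGISTRY: Dict[str, Callable[..., object]] = {}


def register_flow(name: str) -> Callable[[Callable[..., object]], Callable[..., object]]:
    """Record the decorated function under `name` when its definition runs."""

    def decorator(func: Callable[..., object]) -> Callable[..., object]:
        FLOW_REGISTRY[name] = func
        return func

    return decorator


# In a real package this function would live in its own module and would
# only be registered once that module is explicitly imported.
@register_flow("ingest")
def ingest_flow(source: str) -> str:
    return f"ingested {source}"


def run_flow(name: str, *args: object) -> object:
    """Look up a registered flow by name and call it."""
    if name not in FLOW_REGISTRY:
        raise KeyError(f"flow {name!r} is not registered; was its module imported?")
    return FLOW_REGISTRY[name](*args)


result = run_flow("ingest", "samples.csv")
```

A flow whose module is never imported simply never appears in the registry, which is the failure mode explicit imports guard against.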
Dependency Management in Modular Packages: Peter asked Don how to manage dependencies within the new modular package structure. Don explained that dependencies should be added to the specific package's pyproject.toml file rather than to the root pixi.toml.
- Package-Specific Dependencies: Don clarified that when working within a specific package, such as the ETL pipeline, dependencies belong in that package's pyproject.toml file, while the root pixi.toml acts as an orchestrator for foundational dependencies such as the Python version.
LinkML Schema Integration and Data Model Conversion: Peter updated the team on efforts to use LinkML for schema management, highlighting challenges in converting LinkML YAML to SQLModel classes and discussing potential workarounds and future plans for data validation and collaboration.
- LinkML Conversion Challenges: Peter explained that LinkML currently converts schemas to SQLAlchemy or Pydantic base classes, but not directly to SQLModel, requiring manual adjustments for compatibility with SQLModel and proper metadata imports.
- Data Validation Strategies: Don suggested that if converting to SQLModel is too cumbersome, the team could proceed with SQLAlchemy and use LinkML's validation tools, noting that Pydantic integration is more critical for API input/output validation than for database ingestion.
- Collaboration Requirements: Peter emphasized the need to use LinkML for schema management to facilitate collaboration with external groups, such as the J Bay team, and plans to continue developing this approach for future PRs and meetings.
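One possible workaround for the missing SQLModel target is a small generator that emits SQLModel class source from the schema. The sketch below is an illustration under stated assumptions, not LinkML's own tooling: `EXAMPLE_CLASSES` is a hypothetical fragment shaped like the `classes` section of a LinkML YAML file (already loaded into a dict), and `render_sqlmodel_class` and `render_module` are invented helper names. It handles only a few scalar types and treats any identifier as an optional primary key.

```python
from typing import Any, Dict

# Mapping from a few LinkML built-in type names to Python annotations.
LINKML_TO_PYTHON = {
    "string": "str",
    "integer": "int",
    "float": "float",
    "boolean": "bool",
}

# Hypothetical schema fragment, shaped like the `classes` section of a LinkML file.
EXAMPLE_CLASSES: Dict[str, Dict[str, Any]] = {
    "Sample": {
        "attributes": {
            "id": {"range": "integer", "identifier": True},
            "name": {"range": "string", "required": True},
            "ph": {"range": "float"},
        }
    }
}


def render_sqlmodel_class(class_name: str, class_def: Dict[str, Any]) -> str:
    """Render one class definition as SQLModel source text."""
    lines = [f"class {class_name}(SQLModel, table=True):"]
    for attr_name, attr in class_def.get("attributes", {}).items():
        py_type = LINKML_TO_PYTHON.get(attr.get("range", "string"), "str")
        if attr.get("identifier"):
            # Identifier slots become the primary key.
            lines.append(
                f"    {attr_name}: Optional[{py_type}] = Field(default=None, primary_key=True)"
            )
        elif attr.get("required"):
            lines.append(f"    {attr_name}: {py_type}")
        else:
            lines.append(f"    {attr_name}: Optional[{py_type}] = None")
    return "\n".join(lines)


def render_module(classes: Dict[str, Dict[str, Any]]) -> str:
    """Render a whole module, including the imports the classes need."""
    header = "from typing import Optional\n\nfrom sqlmodel import Field, SQLModel\n"
    bodies = [render_sqlmodel_class(name, definition) for name, definition in classes.items()]
    return header + "\n\n" + "\n\n\n".join(bodies) + "\n"


generated_source = render_module(EXAMPLE_CLASSES)
```

The generated text would still need review for relationships, enums, and other features that this sketch ignores.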
Time Series and Spectral Data Storage Strategies: Peter raised questions about storing large time series and spectral data, discussing options for object storage in Google Cloud and database integration with Don and Tyler, who provided insights into using Parquet files, TimescaleDB, and hybrid approaches.
- Object Storage vs. Database: Peter proposed storing large time series data as Parquet or CSV files in Google Cloud Storage, with database pointers to these files, due to the inefficiency of storing massive datasets directly in Postgres.
- TimescaleDB Extension: Don recommended considering TimescaleDB, a Postgres extension optimized for time series data, which offers efficient querying and automated binning for front-end rendering, while also supporting REST API integration.
- Hybrid Storage Approach: Tyler suggested a hybrid: keep data in object storage for modeling workloads, and use ETL pipelines to load into the database only the time series data that it needs. This keeps both front-end rendering and machine learning workflows flexible.
- Future Considerations: The team agreed to continue evaluating storage strategies as data requirements become clearer, ensuring compatibility and scalability for future modeling and analysis needs.
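The pointer approach Peter proposed can be sketched as a small catalog table that stores only file locations and metadata, while the data itself stays in object storage. This is a minimal illustration: an in-memory SQLite database stands in for Postgres, and the table name, column names, and `gs://` URIs are all hypothetical.

```python
import sqlite3

# In-memory SQLite stands in for Postgres; all names here are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE timeseries_file (
        id INTEGER PRIMARY KEY,
        sample_id TEXT NOT NULL,
        storage_uri TEXT NOT NULL UNIQUE,
        file_format TEXT NOT NULL CHECK (file_format IN ('parquet', 'csv')),
        row_count INTEGER NOT NULL
    )
    """
)


def register_file(sample_id: str, storage_uri: str, file_format: str, row_count: int) -> int:
    """Insert a pointer row for one stored file and return its id."""
    cursor = conn.execute(
        "INSERT INTO timeseries_file (sample_id, storage_uri, file_format, row_count) "
        "VALUES (?, ?, ?, ?)",
        (sample_id, storage_uri, file_format, row_count),
    )
    conn.commit()
    return cursor.lastrowid


def files_for_sample(sample_id: str) -> list[str]:
    """Return the storage URIs registered for a sample, in insertion order."""
    rows = conn.execute(
        "SELECT storage_uri FROM timeseries_file WHERE sample_id = ? ORDER BY id",
        (sample_id,),
    ).fetchall()
    return [row[0] for row in rows]


register_file("sample-001", "gs://example-bucket/timeseries/sample-001/run-1.parquet", "parquet", 1_200_000)
register_file("sample-001", "gs://example-bucket/timeseries/sample-001/run-2.parquet", "parquet", 950_000)
uris = files_for_sample("sample-001")
```

The database rows stay small regardless of file size, so queries over the catalog remain cheap. Any subset that needs to be queried directly, for example for front-end rendering, could still be loaded into the database separately.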
AI-Assisted Development and Repository Structure: Don, Peter, Vraj, and Mei discussed the use of AI tools such as Copilot and Claude Code for development, the benefits of a monorepo structure for AI context, and strategies for integrating the front-end and back-end codebases.
- Copilot and Claude Code Usage: Don described using GitHub Copilot for code generation and automation, while Vraj and Mei explored alternatives such as Claude Code and OpenAI's Codex, comparing their strengths in implementation and architecture work.
- Monorepo Strategy: Don advocated for a monorepo structure to provide full context for AI tools, facilitating end-to-end understanding and reducing tool call complexity, with Tyler expressing openness to consolidating the front-end into the main repository.

Follow-up tasks:

Documentation Consolidation: Complete the consolidation of all project documentation into the Read the Docs hosted website and ensure all relevant materials are included. (Vraj)
Documentation Deployment: Stand up the consolidated documentation website once all necessary additions are made. (Vraj)
Project Cleanup: Review and clean up leftover or outdated files and documentation in the repository to align with the new project structure. (Peter, Don)
Documentation Update: Revise or delete outdated documentation within the pipeline to ensure alignment with the new system. (Peter)
PR Review and Merge: Send comments on the large PR to Don and, once Don has addressed them, approve the PR for merging. (Peter)
Documentation Enhancement: Add any additional documentation or comments needed based on PR review feedback before merging. (Don)
Documentation Website Update: Pull in the new changes after the PR merge, resolve any merge conflicts, and finalize the Read the Docs deployment. (Vraj)
Copilot Access Troubleshooting: Troubleshoot and attempt to obtain Copilot access using National Lab credentials, and share findings with the group. (Mei)
