refactor: 3-layer architecture + web monitoring & UX improvements #55
Open
BenjaminNavet wants to merge 23 commits into main from
…53)
- Add progress_callback and cancel_event params to CollectCollection (optional, defaults to None; CLI behavior unchanged)
- Add POST /pipelines/{job_id}/cancel endpoint with graceful shutdown
- Offload blocking collection/aggregation to thread pool via run_in_executor so the event loop stays responsive for polling
- Add Streamlit progress monitoring section with per-API stats, progress bar, and cancel button (polls every 2s)
- Fix bare except Exception to catch queue.Empty specifically
- Fix Zotero output string concatenation precedence bug

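The cancellation, progress, and thread-pool items above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the PR's code: `collect`, `run_job`, and `drain` are hypothetical names, and the real `CollectCollection` signature is not shown in this conversation. Only `asyncio`, `queue`, and `threading` from the standard library are used.

```python
import asyncio
import queue
import threading


def collect(items, progress_callback=None, cancel_event=None):
    """Hypothetical stand-in for the blocking collection step."""
    done = 0
    for _ in items:
        # Check the flag between units of work so a cancel stops gracefully.
        if cancel_event is not None and cancel_event.is_set():
            break
        done += 1  # a real collector would fetch one page or query here
        if progress_callback is not None:
            progress_callback(done, len(items))
    return done


async def run_job(items, progress_callback=None, cancel_event=None):
    """Run the blocking collector in the default thread pool."""
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(
        None, collect, items, progress_callback, cancel_event
    )


def drain(updates):
    """Return everything currently queued without blocking."""
    drained = []
    while True:
        try:
            drained.append(updates.get_nowait())
        except queue.Empty:  # only the "nothing left" case, not every Exception
            break
    return drained


if __name__ == "__main__":
    cancel = threading.Event()
    updates = queue.Queue()
    asyncio.run(run_job(list(range(3)), lambda d, t: updates.put((d, t)), cancel))
    print(drain(updates))
```

Because both new parameters default to None, a caller that passes neither gets the original behavior. In a design like this, a cancel endpoint would only need to call `cancel_event.set()`; the worker then stops at its next check instead of being killed mid-request.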

- Add enable_enrichment, enrichment_threshold, enrichment_limit fields to CollectionConfig Pydantic model
- Run enrich_with_hf.main() after aggregation when enrichment is enabled in run_collection_task(), offloaded to thread pool
- Add enrichment checkbox with conditional threshold/limit controls in Tab 1 (New Collection)
- Add standalone "Enrich with HuggingFace" section in Tab 3 (Filter & Export) with threshold slider, limit input, and enrich button calling scilex-enrich via subprocess

Replace single dropdown with individual expanders per API in the sidebar. Each shows a status indicator (✅ configured / ⬚ not). "Clear" button wipes all credentials for that API at once.
Use default values from enrich_with_hf.py (threshold=85, no limit). These parameters are too technical for the web UI.
These are output/enrichment tools, not collection sources. Simplify Data Sources section to 2 columns: Free APIs and Paid APIs.

- Wrap all configuration (API keys, output dir) in a single collapsible expander (open by default, can be closed)
- Show all APIs in a flat list with green dot (●) when configured, grey circle (○) when not, visible at a glance without expanding
- Each API has inline Save/Clear buttons with a divider between them

Add PubMedCentral and Istex to the free APIs list in both the frontend and the /available-apis endpoint. Remove HuggingFace and Zotero from /available-apis since they are not data sources. Now shows all 11 registered collectors: 8 free + 3 paid.
Enrichment is a pipeline step, not an export action. The checkbox in New Collection (Tab 1) is the correct place to enable it.
Add OpenAlex, PubMed (optional keys for higher rate limits) and CrossRef (mailto for polite pool) to the API keys configuration. Now lists all 9 APIs that accept credentials.

- Detect existing data on disk for the collection name
- Show info message explaining idempotent behavior
- Change button text to "Resume Collection" when partial data exists
- Improve cancelled state message to explain restart is safe

When partial data exists, show a "Start fresh" checkbox that deletes previous results before starting. Button adapts: "Resume Collection", "Start Fresh", or "Start Collection Pipeline" depending on context.
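The button behavior described above could be captured in one small pure function. This is a sketch only: `button_label` and its two flags are hypothetical names, and the mapping from context to label is inferred from the commit message, not taken from the PR's code.

```python
def button_label(has_partial_data: bool, start_fresh: bool) -> str:
    """Pick the action button text from the collection's on-disk state.

    Assumption: the "Start fresh" checkbox only matters when partial
    data exists, since that is the only case in which it is shown.
    """
    if has_partial_data and start_fresh:
        return "Start Fresh"
    if has_partial_data:
        return "Resume Collection"
    return "Start Collection Pipeline"
```

Keeping this decision out of the Streamlit layout code makes it easy to unit-test without rendering the UI.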
Rewrite 12 vague one-liner tooltips with clearer 2-3 line descriptions drawn from config file comments. Add 6 missing tooltips for Enable Base Filters, abstract length sliders, Allowed Publication Types, and API source selectors. Remove label_visibility="collapsed" from 3 multiselects so the help icon is actually visible.
… polish
Add real-time log capture, phase stepper, paper detail cards with DOI links, collection deletion (API + UI), and reorganize API keys by category. Quality filters moved into collapsible section. Includes ruff formatting across the codebase.

Extract the 2272-line aggregate_collect.py monolith into clean pipeline modules. Eliminate sys.argv hack and module-level side effects.
- Add scilex/config.py: centralized SciLExConfig with from_files() and from_dicts() constructors, replacing scattered config loading
- Add scilex/pipeline/ modules: tracker, text_filter, citation_filter, ranking, itemtype_filter, enrichment, post_filter, orchestrator
- Add pipeline/orchestrator.py: run_aggregation() and run_collection() as pure function calls, accepting SciLExConfig + AggregationOptions
- Replace sys.argv manipulation in scilex_api.py with orchestrator call
- Remove sys.path.insert hacks from web files
- Move module-level side effects (setup_logging, load_all_configs, print) inside main() in run_collection.py and aggregate_collect.py
- Centralize FORMAT_CONVERTERS registry in crawlers/aggregate.py
- Standardize logging via setup_logging() in all entry points
- Add shared post_filter.py for web UI/API filtering consolidation
- Slim aggregate_collect.py from 2272 to ~960 lines (thin CLI wrapper)

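The shape of this refactor can be illustrated with a small sketch: a config object with two constructors, an orchestrator-style function that takes everything as arguments, and a `main()` that owns the side effects. All names below (`ConfigSketch`, `OptionsSketch`, `run_aggregation_sketch`) and all field names are hypothetical stand-ins, not the PR's `SciLExConfig` or `AggregationOptions`. JSON is used only to keep the sketch within the standard library, since the project's config file format is not stated here, and the filtering logic is a toy.

```python
import json
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any


@dataclass(frozen=True)
class ConfigSketch:
    """Hypothetical stand-in for a centralized config object."""

    main: dict[str, Any] = field(default_factory=dict)
    api_keys: dict[str, Any] = field(default_factory=dict)

    @classmethod
    def from_dicts(cls, main, api_keys=None):
        # Copy the inputs so later caller mutations do not leak in.
        return cls(main=dict(main), api_keys=dict(api_keys or {}))

    @classmethod
    def from_files(cls, main_path, api_keys_path=None):
        main = json.loads(Path(main_path).read_text(encoding="utf-8"))
        keys = {}
        if api_keys_path is not None:
            keys = json.loads(Path(api_keys_path).read_text(encoding="utf-8"))
        return cls.from_dicts(main, keys)


@dataclass(frozen=True)
class OptionsSketch:
    """Hypothetical stand-in for per-run aggregation options."""

    min_abstract_words: int = 0


def run_aggregation_sketch(config, options, papers):
    """Toy aggregation: inputs arrive as arguments, never via sys.argv."""
    keywords = [k.lower() for k in config.main.get("keywords", [])]
    kept = []
    for paper in papers:
        abstract = paper.get("abstract", "")
        if len(abstract.split()) < options.min_abstract_words:
            continue
        if keywords and not any(k in abstract.lower() for k in keywords):
            continue
        kept.append(paper)
    return kept


def main():
    # Side effects (config loading, printing) live here, so importing
    # this module from a web API does nothing on its own.
    config = ConfigSketch.from_dicts({"keywords": ["graph"]})
    papers = [{"abstract": "A knowledge graph survey"}]
    print(len(run_aggregation_sketch(config, OptionsSketch(), papers)))


if __name__ == "__main__":
    main()
```

With this layout a CLI would typically build its config through `from_files()`, while a web API could build one from a request payload through `from_dicts()`, and both would call the same function.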
# Conflicts:
#	paper/paper.md
Summary
aggregate_collect.py (2272→960 lines) split into scilex/config.py, scilex/pipeline/ (7 modules), and thin CLI/web wrappers. Removal of the sys.argv hack and of module-level side effects.

Changes
- scilex/config.py: centralized SciLExConfig with from_files() / from_dicts()
- scilex/pipeline/: 7 modules (orchestrator, tracker, text_filter, citation_filter, ranking, itemtype_filter, enrichment, post_filter)
- scilex/webapi/scilex_api.py: new endpoints (delete collection, log streaming)
- scilex/webapi/web_interface.py: complete UI overhaul (monitoring, cards, tooltips)

Test plan
- uv run python -m pytest tests/: all tests pass

🤖 Generated with Claude Code