refactor: 3-layer architecture + web monitoring & UX improvements#55

Open

BenjaminNavet wants to merge 23 commits into main from dev

Conversation

@BenjaminNavet
Collaborator

Summary

  • 3-layer architecture: decomposed the aggregate_collect.py monolith (2272→960 lines) into scilex/config.py, scilex/pipeline/ (8 modules), and thin CLI/web wrappers. Eliminated the sys.argv hack and module-level side effects.
  • Web pipeline monitoring: real-time log capture, phase stepper, paper detail cards with DOI links, collection deletion (API + UI), API keys reorganized by category.
  • UX polish: rewrote 12 tooltips, added 6 missing tooltips, collapsible config panel, collection resume/restart handling.

Changes

  • scilex/config.py — centralized SciLExConfig with from_files() / from_dicts()
  • scilex/pipeline/ — 8 modules: orchestrator, tracker, text_filter, citation_filter, ranking, itemtype_filter, enrichment, post_filter
  • scilex/webapi/scilex_api.py — new endpoints (delete collection, log streaming)
  • scilex/webapi/web_interface.py — full UI overhaul (monitoring, cards, tooltips)
  • 41 files changed, +2849/−2018 lines
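The centralized config described above can be pictured as follows. This is a minimal sketch: SciLExConfig, from_files(), and from_dicts() come from the PR, but the specific fields and merge semantics shown here are illustrative assumptions.

```python
import json
from dataclasses import dataclass, field

@dataclass(frozen=True)
class SciLExConfig:
    """Single config object threaded through the pipeline (illustrative fields)."""
    collection_name: str = "default"
    apis: list = field(default_factory=list)
    output_dir: str = "output"

    @classmethod
    def from_dicts(cls, *dicts):
        # Later dicts override earlier ones; unknown keys are dropped.
        merged = {}
        for d in dicts:
            merged.update(d)
        known = {k: v for k, v in merged.items() if k in cls.__dataclass_fields__}
        return cls(**known)

    @classmethod
    def from_files(cls, *paths):
        # Load each file, then delegate to from_dicts for the merge.
        loaded = []
        for p in paths:
            with open(p) as fh:
                loaded.append(json.load(fh))
        return cls.from_dicts(*loaded)
```

The point of the two classmethod constructors is that both the CLI and the web API build the same immutable object, instead of each entry point loading config files with its own side effects.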

Test plan

  • uv run python -m pytest tests/ — all tests pass
  • Verify the web pipeline end to end (collect → aggregate → export)
  • Verify collection deletion via the UI
  • Verify that tooltips display on every form field

🤖 Generated with Claude Code

datalogism and others added 23 commits March 9, 2026 16:58
…53)

- Add progress_callback and cancel_event params to CollectCollection
  (optional, defaults to None — CLI behavior unchanged)
- Add POST /pipelines/{job_id}/cancel endpoint with graceful shutdown
- Offload blocking collection/aggregation to thread pool via
  run_in_executor so the event loop stays responsive for polling
- Add Streamlit progress monitoring section with per-API stats,
  progress bar, and cancel button (polls every 2s)
- Fix bare except Exception to catch queue.Empty specifically
- Fix Zotero output string concatenation precedence bug
- Add enable_enrichment, enrichment_threshold, enrichment_limit fields
  to CollectionConfig Pydantic model
- Run enrich_with_hf.main() after aggregation when enrichment is enabled
  in run_collection_task(), offloaded to thread pool
- Add enrichment checkbox with conditional threshold/limit controls
  in Tab 1 (New Collection)
- Add standalone "Enrich with HuggingFace" section in Tab 3
  (Filter & Export) with threshold slider, limit input, and enrich
  button calling scilex-enrich via subprocess
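The offloading-plus-cancellation pattern in this commit can be sketched as below. The function and endpoint names in the PR are real; the toy `collect` body and the callback wiring here are illustrative assumptions.

```python
import asyncio
import threading

def collect(progress_callback=None, cancel_event=None):
    """Blocking collection loop; both params default to None so CLI behavior is unchanged."""
    for i in range(5):
        if cancel_event is not None and cancel_event.is_set():
            return "cancelled"
        if progress_callback is not None:
            progress_callback(i + 1, 5)  # e.g. per-API stats for the Streamlit poller
    return "done"

async def run_collection_task():
    cancel = threading.Event()
    loop = asyncio.get_running_loop()
    # Offload the blocking call to the default thread pool so the event loop
    # stays free to answer the UI's 2-second poll requests.
    future = loop.run_in_executor(None, collect, print, cancel)
    # A POST /pipelines/{job_id}/cancel handler would call cancel.set() here,
    # letting the worker exit at the next loop iteration (graceful shutdown).
    return await future
```

Using a `threading.Event` rather than killing the thread is what makes the shutdown graceful: the worker checks the flag at safe points and returns normally.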
Replace single dropdown with individual expanders per API in the
sidebar. Each shows a status indicator (✅ configured / ⬚ not).
"Clear" button wipes all credentials for that API at once.
Use default values from enrich_with_hf.py (threshold=85, no limit).
These parameters are too technical for the web UI.
These are output/enrichment tools, not collection sources.
Simplify Data Sources section to 2 columns: Free APIs and Paid APIs.
- Wrap all configuration (API keys, output dir) in a single collapsible
  expander (open by default, can be closed)
- Show all APIs in a flat list with green dot (●) when configured,
  grey circle (○) when not — visible at a glance without expanding
- Each API has inline Save/Clear buttons with a divider between them
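The at-a-glance indicator described above reduces to a small helper. This is an illustrative sketch (the real UI renders these lines with Streamlit widgets); the credential store shown as a plain dict is an assumption.

```python
def api_status_line(name, credentials):
    """Green dot when any credential is stored for the API, grey circle otherwise."""
    dot = "●" if credentials.get(name) else "○"
    return f"{dot} {name}"

def clear_credentials(name, credentials):
    """'Clear' wipes every stored value for that API at once."""
    credentials.pop(name, None)
```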
Add PubMedCentral and Istex to the free APIs list in both the
frontend and the /available-apis endpoint. Remove HuggingFace
and Zotero from /available-apis since they are not data sources.
Now shows all 11 registered collectors: 8 free + 3 paid.
Enrichment is a pipeline step, not an export action. The checkbox
in New Collection (Tab 1) is the correct place to enable it.
Add OpenAlex, PubMed (optional keys for higher rate limits) and
CrossRef (mailto for polite pool) to the API keys configuration.
Now lists all 9 APIs that accept credentials.
- Detect existing data on disk for the collection name
- Show info message explaining idempotent behavior
- Change button text to "Resume Collection" when partial data exists
- Improve cancelled state message to explain restart is safe
When partial data exists, show a "Start fresh" checkbox that deletes
previous results before starting. Button adapts: "Resume Collection",
"Start Fresh", or "Start Collection Pipeline" depending on context.
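The adaptive button logic above is essentially a two-flag decision. A minimal sketch (the helper name is hypothetical; the three labels are the ones the commit describes):

```python
def collection_button_label(has_partial_data, start_fresh):
    """Pick the button text from on-disk state and the 'Start fresh' checkbox."""
    if not has_partial_data:
        return "Start Collection Pipeline"
    return "Start Fresh" if start_fresh else "Resume Collection"
```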
Rewrite 12 vague one-liner tooltips with clearer 2-3 line descriptions
drawn from config file comments. Add 6 missing tooltips for Enable Base
Filters, abstract length sliders, Allowed Publication Types, and API
source selectors. Remove label_visibility="collapsed" from 3 multiselects
so the help icon is actually visible.
… polish

Add real-time log capture, phase stepper, paper detail cards with DOI links, collection deletion (API + UI), and reorganize API keys by category. Quality filters moved into a collapsible section. Includes ruff formatting across the codebase.
Extract the 2272-line aggregate_collect.py monolith into clean pipeline
modules. Eliminate sys.argv hack and module-level side effects.

- Add scilex/config.py: centralized SciLExConfig with from_files() and
  from_dicts() constructors, replacing scattered config loading
- Add scilex/pipeline/ modules: tracker, text_filter, citation_filter,
  ranking, itemtype_filter, enrichment, post_filter, orchestrator
- Add pipeline/orchestrator.py: run_aggregation() and run_collection()
  as pure function calls, accepting SciLExConfig + AggregationOptions
- Replace sys.argv manipulation in scilex_api.py with orchestrator call
- Remove sys.path.insert hacks from web files
- Move module-level side effects (setup_logging, load_all_configs, print)
  inside main() in run_collection.py and aggregate_collect.py
- Centralize FORMAT_CONVERTERS registry in crawlers/aggregate.py
- Standardize logging via setup_logging() in all entry points
- Add shared post_filter.py for web UI/API filtering consolidation
- Slim aggregate_collect.py from 2272 to ~960 lines (thin CLI wrapper)
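The shape of the orchestrator call described above can be sketched like this. run_aggregation and AggregationOptions are named in the commit; the specific fields and the toy dedupe step are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class AggregationOptions:
    """Illustrative option bag passed alongside the config."""
    dedupe: bool = True
    export_format: str = "csv"

def run_aggregation(config, options):
    """Pure orchestration: no sys.argv reads, no module-level side effects."""
    papers = config.get("papers", [])
    if options.dedupe:
        papers = list(dict.fromkeys(papers))  # order-preserving dedupe
    return {"papers": papers, "format": options.export_format}
```

Because the web API and the CLI both call this one function with explicit arguments, the old pattern of rewriting sys.argv before importing the monolith disappears.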
# Conflicts:
#	paper/paper.md
