Conversation
LucasHJin
commented
Mar 10, 2026
- Added a CLAUDE.md and a PRD.md plan file to be referenced for deployment improvements
Pull request overview
Adds planning/documentation artifacts to guide deployment and future technical improvements for the Argus codebase.
Changes:
- Added a deployment-focused PRD with phased tickets and success criteria.
- Added a “Claude context” file capturing architecture, file map, conventions, and environment variables.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| plan/PRD.md | Introduces phased deployment/automation/schema/testing roadmap and success metrics. |
| plan/CLAUDE.md | Adds architectural/context reference for contributors (stack, structure, conventions, env vars). |
|| # | Ticket | Priority |
||---|--------|----------|
|| 1 | Create backend Dockerfile (FastAPI + uvicorn, multi-stage build) | P0 |
|| 2 | Create frontend Dockerfile (Vite build -> nginx static serve) | P0 |
|| 3 | Create `docker-compose.yml` with services: backend, frontend, postgres+pgvector, redis | P0 |
|| 4 | Add `.dockerignore` files (exclude `.env`, `node_modules`, `.venv`, `__pycache__`) | P0 |
The Markdown tables are written with a double leading pipe (e.g., || # | Ticket | Priority |), which renders as an extra empty first column in GitHub Markdown. Use a single leading pipe (| # | Ticket | Priority |) for each table row to avoid the blank column and keep column alignment consistent.
|| Component | Technology | Notes |
||-----------|-----------|-------|
|| Frontend | React 19 + TypeScript 5 + Vite 6 | SPA, no SSR |
|| Globe | react-globe.gl (three.js wrapper) | Heavy bundle — lazy-load |
|| Styling | Tailwind CSS 3 + CSS custom properties | Design tokens in `index.css` |
|| Backend | FastAPI + Uvicorn (Python 3.11+) | Async throughout |
The Tech Stack and other tables also start rows with || ..., which creates an unintended empty first column in GitHub Markdown. Switch these rows to a single leading | so the tables render with the expected number of columns.
│ ├── agent_service.py # Graph-RAG pipeline orchestration
│ ├── agent_tools.py # DB query tools (search, relate, detail, impact)
│ ├── gemini_client.py # Gemini API: synthesis, confidence, realtime analysis
│ ├── scraping_service.py # Polymarket + Kalshi + GDELT orchestrator
│ └── content_repository.py # (duplicate — also in ingestion/)
├── repositories/
│ └── content_repository.py # Market signal row persistence
├── embeddings/
services/content_repository.py is listed in the project tree, but there is no backend/app/services/content_repository.py in the repo. The existing implementations are backend/app/repositories/content_repository.py and backend/app/ingestion/content_repository.py, so the tree should be updated to match the actual paths (or the duplicate description moved to the correct locations).
### Backend
- **Async everywhere** — use `async def` for all route handlers and service methods
- **asyncpg** for DB access (not SQLAlchemy ORM)
- **Pydantic v2** for request/response models
- **Raw SQL** for queries (no ORM) — parameterize all user inputs with `$1, $2` syntax
- **Environment variables** via `python-dotenv` and `os.getenv()`
- **Scraper output** normalized via `row_format.make_content_row()` before DB insert
The Backend conventions say DB access is asyncpg-only and "Async everywhere", but several production routes are currently synchronous and use psycopg2 (e.g., backend/app/routers/content.py defines def get_content_points() / def get_content_arcs() and opens psycopg2.connect()). Please update this section to reflect the current mixed sync/async state, while still stating the desired direction for new work.
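To make the desired direction concrete, here is a minimal sketch of an async handler-style query using raw SQL with numbered placeholders. The route, table, and column names are assumptions for illustration, not actual Argus code:

```python
# Hypothetical sketch of the stated convention: async service code + raw SQL
# with numbered placeholders ($1..$4) so asyncpg binds user input server-side
# instead of interpolating it into the query string.

POINTS_QUERY = """
    SELECT id, title, latitude, longitude
    FROM content
    WHERE latitude BETWEEN $1 AND $2
      AND longitude BETWEEN $3 AND $4
"""

async def fetch_content_in_box(pool, south, north, west, east):
    # pool is assumed to be an asyncpg.Pool; conn.fetch() parameterizes
    # $1..$4, so no user value ever appears in the SQL text itself.
    async with pool.acquire() as conn:
        return await conn.fetch(POINTS_QUERY, south, north, west, east)
```

New work could follow this shape while the existing psycopg2-based sync routes are migrated incrementally.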
```
DATABASE_URL=postgresql+asyncpg://user:pass@host:5432/dbname # Required
GEMINI_API_KEY=... # Required for agent
GEMINI_MODEL=gemini-2.5-flash # Optional, default shown
OPENAI_API_KEY=... # Required for embeddings (until local model migration)
ACLED_API_KEY=... # Required for ACLED ingestion
```
The sample DATABASE_URL uses the SQLAlchemy-style postgresql+asyncpg://... scheme, but the codebase connects with psycopg2.connect(DATABASE_URL) in multiple places and with asyncpg.create_pool(dsn=settings.database_url) elsewhere; those drivers expect a libpq/asyncpg DSN like postgresql://... (without +asyncpg). Using postgresql+asyncpg:// will fail for psycopg2 and likely for asyncpg. Align this example with backend/.env.example (which uses postgresql://...).
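Since both psycopg2 and asyncpg expect a plain `postgresql://` DSN, one low-risk option, if the env var might still arrive in the SQLAlchemy-style form, is to normalize the scheme at startup. This is a sketch; the function name is made up:

```python
def normalize_dsn(url: str) -> str:
    """Strip a SQLAlchemy-style driver suffix (e.g. postgresql+asyncpg://)
    down to the plain postgresql:// scheme that psycopg2 and asyncpg expect."""
    scheme, sep, rest = url.partition("://")
    if sep and "+" in scheme:
        scheme = scheme.split("+", 1)[0]  # "postgresql+asyncpg" -> "postgresql"
    return scheme + sep + rest
```

Aligning the documented example with `backend/.env.example` is still the cleaner fix; the normalizer just guards against the wrong form slipping through.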
## Known Issues & Tech Debt

1. **Duplicate `content_repository.py`** — exists in both `repositories/` and `ingestion/`. Must consolidate.
2. **No shared DB pool** — `ingestion/db.py` has its own pool; other services use inline `asyncpg.connect()`. Need a single shared pool.
We should explicitly document that the pool's `max_size` needs to be set so we won't throw a "too many connections" error if there are too many users. I believe Postgres's default `max_connections` is 100.
|| 16 | Design `BaseScraper` ABC interface: `async fetch() -> list[NormalizedRow]` with built-in rate limiting, error handling, and dedup. Use `kalshi.py` rate limiter and `ingestion_service.py` error patterns as reference. | P0 |
|| 17 | Write new production scrapers implementing `BaseScraper` for each data source (determine which sources to keep/add based on product needs) | P0 |
|| 18 | Create unified scrape entrypoint (`run_daily_pipeline.py`) that runs all scrapers + embedding backfill in sequence | P0 |
|| 19 | Add cron scheduler (APScheduler in a separate container, or cron in docker-compose) | P0 |
We need a locking mechanism here. If `run_daily_pipeline.py` hangs, we don't want the next cron job to spawn a zombie process. We could use a Redis-based lock for this, but I need to do more research.
We could probably use locking in PostgreSQL for simplicity, but I agree about the cron overlap problem.
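If we go the Postgres route, session-level advisory locks give this almost for free: the next cron run calls `pg_try_advisory_lock`, gets `false` while the previous run still holds the lock, and exits instead of piling up. A sketch (the job name and the asyncpg usage are assumptions); the key derivation is pure Python and stable across runs:

```python
import hashlib

def advisory_lock_key(job_name: str) -> int:
    """Map a job name to a stable signed 64-bit key for pg_advisory_lock."""
    digest = hashlib.sha256(job_name.encode()).digest()
    return int.from_bytes(digest[:8], "big", signed=True)

async def run_exclusively(conn, job_name: str, job):
    # conn is assumed to be an asyncpg connection. pg_try_advisory_lock
    # returns false immediately if another session (e.g. a still-running
    # cron job) holds the key, so overlapping runs just skip this tick.
    key = advisory_lock_key(job_name)
    got_lock = await conn.fetchval("SELECT pg_try_advisory_lock($1)", key)
    if not got_lock:
        return False  # previous run still in progress
    try:
        await job()
        return True
    finally:
        await conn.execute("SELECT pg_advisory_unlock($1)", key)
```

A nice property of session-level advisory locks: if the pipeline process crashes, its connection dies and Postgres releases the lock automatically, so a failed run can't wedge the schedule the way a stale Redis key could.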
|| Backend | FastAPI + Uvicorn (Python 3.11+) | Async throughout |
|| Database | PostgreSQL 15+ with pgvector + pgcrypto | Extensions required |
|| AI Model | Google Gemini 2.5-flash | Structured JSON output |
|| Embeddings | OpenAI text-embedding-3-small (1536 dims) | Migration target: local sentence-transformers |
Why are we migrating to sentence-transformers? Is it better than OpenAI?
Cheaper embeddings, so we don't just waste money every day this is running. Unless you guys don't care, in which case we can stick with the OpenAI text-embedding model.
|| Method | Path | Purpose | AI Cost |
||--------|------|---------|---------|
|| GET | `/content/points` | All content with lat/lng (last 31 days) | None |
This is going to create an insanely huge JSON payload, and it will probably lag lower-end computers. Apparently react-globe.gl has built-in clustering. Why don't we make it so you need to zoom in a certain amount to actually see the details of a point, and when you're really zoomed out, points in the same area merge into one? We need to experiment with this, though, because I'm not sure whether it will look good.
Makes sense. What are your thoughts on server-side viewport filtering (i.e., pass additional parameters: GET /content/points?bbox=west,south,east,north&zoom=level) and then only rendering what is shown (front half + zoom)? When zoomed out we combine points, and show individual points when zoomed in. But I don't think there is clustering in react-globe.gl from what I can see.
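A sketch of what server-side filtering plus zoom-dependent merging could look like (all names here are hypothetical, and the grid size per zoom level would need tuning): parse the bbox query param, keep only points inside the viewport, then snap survivors to a grid whose cell size shrinks as zoom increases, merging points that land in the same cell.

```python
from collections import defaultdict

def parse_bbox(raw: str):
    """Parse a 'west,south,east,north' query param into floats."""
    west, south, east, north = (float(v) for v in raw.split(","))
    return west, south, east, north

def cluster_points(points, bbox, zoom: int):
    """points: iterable of (id, lat, lng) tuples. Returns one cluster per
    occupied grid cell in the viewport: a count, the centroid, and member ids."""
    west, south, east, north = bbox
    cell = 10.0 / (2 ** zoom)  # degrees per cell; halves at each zoom level
    cells = defaultdict(list)
    for pid, lat, lng in points:
        if south <= lat <= north and west <= lng <= east:
            cells[(int(lat // cell), int(lng // cell))].append((pid, lat, lng))
    return [
        {
            "count": len(members),
            "lat": sum(m[1] for m in members) / len(members),
            "lng": sum(m[2] for m in members) / len(members),
            "ids": [m[0] for m in members],
        }
        for members in cells.values()
    ]
```

This ignores antimeridian-crossing bboxes and does the filtering in Python; in practice the bbox filter would move into the SQL WHERE clause, with only the merging done app-side (or in SQL too).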
│ └── acled/
│ ├── acled_client.py # ACLED API client
│ └── acled_normalizer.py
└── scrapers/
Keep the scrapers for inspiration; marginally improve on them.
- `id` UUID PK (gen_random_uuid)
- `title`, `body`, `url` (UNIQUE)
- `latitude`, `longitude` (nullable floats)
- `image_url`, `s3_url` — DEPRECATED, to be dropped (Cloudinary/S3 no longer used)
We will be using S3 for media storage.
Yeah, from my understanding it was that we don't need both `image_url` and `s3_url`, because we can just use `image_url` for S3.
|| GET | `/content/arcs?threshold=0.7` | Similarity arcs via pgvector cosine | None |
|| GET | `/content/{id}` | Single content item detail | None |
|| POST | `/content/{id}/confidence-score` | Gemini credibility scoring (0.31-1.0) | 1 Gemini call |
|| POST | `/content/{id}/realtime-analysis` | Gemini + Google Search grounding | 1 Gemini call |
How much credit are we going to burn...
LMFAO yeah, low-key I think one of the challenges is just figuring out how to minimize credit burn.
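One cheap lever, sketched below but not wired to the real endpoints: a confidence score for a given content id doesn't need recomputing on every view, so caching Gemini results per (analysis type, content id) with a TTL caps spend at roughly one call per item per TTL window. The cache class and the `call_gemini` callable are hypothetical stand-ins:

```python
import time

class TTLCache:
    """Minimal in-memory TTL cache; Redis would be the multi-process version."""
    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self._store = {}

    def get(self, key):
        hit = self._store.get(key)
        if hit is None:
            return None
        value, expires_at = hit
        if time.monotonic() >= expires_at:
            del self._store[key]  # lazily evict expired entries
            return None
        return value

    def set(self, key, value):
        self._store[key] = (value, time.monotonic() + self.ttl)

# Usage sketch: only call Gemini on a cache miss.
scores = TTLCache(ttl_seconds=24 * 3600)

async def confidence_score(content_id: str, call_gemini):
    cached = scores.get(("confidence", content_id))
    if cached is not None:
        return cached  # no Gemini call, no credit burned
    result = await call_gemini(content_id)  # hypothetical expensive call
    scores.set(("confidence", content_id), result)
    return result
```

Since the endpoint already takes the content id in the path, this slots in without any API change; repeated views of the same point become free after the first call.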
[WIP] Add deployment planning documentation
Co-authored-by: Tpypan <155002057+Tpypan@users.noreply.github.com>
chore: add deployment issue board with all pre-deploy tasks
need to use issue board for cursor hackathon