# Argus — Claude Context File

Read this file before working on any ticket. It contains the full architectural context, conventions, and key file locations for the Argus project.

> **Note:** This document reflects the team's best understanding at time of writing. If the user gives instructions that conflict with what's written here, **follow the user's instructions** — they take priority. If parts of this context have become outdated or irrelevant due to changes in the codebase, use your judgement and note the discrepancy rather than blindly following stale guidance.

## What Is Argus

A 3D global event intelligence platform. World events are scraped daily from multiple sources, stored in PostgreSQL with vector embeddings, and visualized on an interactive globe. An AI agent (Graph-RAG pipeline) lets users query events with persona-aware analysis.

## Tech Stack

| Component | Technology | Notes |
|-----------|-----------|-------|
| Frontend | React 19 + TypeScript 5 + Vite 6 | SPA, no SSR |
| Globe | react-globe.gl (three.js wrapper) | Heavy bundle — lazy-load |
| Styling | Tailwind CSS 3 + CSS custom properties | Design tokens in `index.css` |
| Backend | FastAPI + Uvicorn (Python 3.11+) | Async throughout |
| Database | PostgreSQL 15+ with pgvector + pgcrypto | Extensions required |
| AI Model | Google Gemini 2.5-flash | Structured JSON output |
| Embeddings | OpenAI text-embedding-3-small (1536 dims) | Migration target: local sentence-transformers |
| Voice | ElevenLabs Scribe v1 | Optional, speech-to-text |
| Media | Placeholder SVGs only | Cloudinary and S3 removed — no longer needed |

## Project Structure

```
hackcanada/
├── plan/                                  # PRD and this context file
├── frontend/
│   ├── src/
│   │   ├── main.tsx                       # Entry: nested context providers
│   │   ├── App.tsx                        # Bootstrap: fetch points/arcs, render overlays
│   │   ├── index.css                      # Design tokens, fonts
│   │   ├── api/client.ts                  # Typed fetch wrapper for all backend endpoints
│   │   ├── components/
│   │   │   ├── Globe/GlobeView.tsx        # 3D globe — points, arcs, tooltips, clusters
│   │   │   ├── Filters/FilterBar.tsx      # Event-type filter chips
│   │   │   ├── Timeline/TimelineSlider.tsx  # Date scrubber + play/pause
│   │   │   ├── Modal/EventModal.tsx       # Right panel — event detail + AI analysis
│   │   │   ├── Modal/RealTimeAnalysisSection.tsx  # Gemini + Google Search grounding
│   │   │   └── Agent/                     # Left panel — AI query interface
│   │   │       ├── AgentPanel.tsx         # Query input, voice, submit
│   │   │       ├── AgentAnswerView.tsx    # Citation parsing, financial impact
│   │   │       ├── AgentNavigationOverlay.tsx  # Globe camera animation
│   │   │       ├── PersonaSelector.tsx    # Role + industry selection
│   │   │       └── FinancialImpactSection.tsx
│   │   ├── context/
│   │   │   ├── AppContext.tsx             # Events, arcs, filters, timeline, globe focus
│   │   │   ├── AgentContext.tsx           # Agent state, highlights, navigation plan
│   │   │   └── UserPersonaContext.tsx     # Role + industry (localStorage-persisted)
│   │   ├── types/
│   │   │   ├── events.ts                  # Event, ContentPoint, ContentArc, EventDetail
│   │   │   └── agent.ts                   # AgentResponse, NavigationPlan, FinancialImpact
│   │   └── utils/mediaConfig.ts           # DEPRECATED — Cloudinary/S3 removed, delete this file (ticket #10)
│   └── package.json
└── backend/
    ├── requirements.txt
    ├── run_scrape.py                      # LEGACY CLI — will be replaced by run_daily_pipeline.py
    ├── run_gdelt_scrape.py                # LEGACY CLI — will be replaced by run_daily_pipeline.py
    ├── migrations/
    │   └── 001_init_schema.sql            # Full PostgreSQL schema
    └── app/
        ├── main.py                        # FastAPI app, CORS, router registration
        ├── config.py                      # Env var loading, agent defaults
        ├── models/
        │   ├── enums.py                   # EventType, RelationshipType (StrEnum)
        │   ├── schemas.py                 # Core Pydantic models
        │   └── agent_schemas.py           # Agent-specific Pydantic models
        ├── routers/
        │   ├── content.py                 # GET /content/points, /arcs, POST /{id}/confidence-score, /{id}/realtime-analysis
        │   ├── agent.py                   # POST /agent/query
        │   ├── ingestion.py               # POST /ingestion/acled
        │   ├── embeddings.py              # POST /embeddings/backfill/content
        │   └── market_signals.py          # GET /market-signals (live fetch)
        ├── services/
        │   ├── agent_service.py           # Graph-RAG pipeline orchestration
        │   ├── agent_tools.py             # DB query tools (search, relate, detail, impact)
        │   ├── gemini_client.py           # Gemini API: synthesis, confidence, realtime analysis
        │   └── scraping_service.py        # Polymarket + Kalshi + GDELT orchestrator
        ├── repositories/
        │   └── content_repository.py      # Market signal row persistence
        ├── embeddings/
        │   ├── embedding_repository.py    # Fetch/update embedding vectors
        │   ├── embedding_backfill_service.py  # Backfill missing embeddings
        │   ├── openai_embedding_client.py # OpenAI API wrapper
        │   └── run_embedding_backfill.py  # CLI entry
        ├── ingestion/
        │   ├── ingestion_service.py       # ACLED pipeline: fetch -> normalize -> dedupe -> insert
        │   ├── content_repository.py      # ensure_sources, insert_content (DUPLICATE of repositories/)
        │   ├── db.py                      # asyncpg connection pool (only used by ingestion)
        │   ├── dedupe_service.py          # Duplicate detection
        │   └── acled/
        │       ├── acled_client.py        # ACLED API client
        │       └── acled_normalizer.py
        └── scrapers/
            ├── row_format.py              # Shared row normalization -> content_table shape (KEEP — used by new scrapers)
            ├── _reference/                # Hackathon scrapers kept as design inspiration only (NOT used in production)
            │   ├── gdelt.py               # Reference: dual-path fetch, CAMEO mapping, Goldstein normalization
            │   ├── kalshi.py              # Reference: async rate limiter, cursor pagination, asyncio.gather
            │   ├── polymarket.py          # Reference: simple REST API pattern, tag filtering
            │   └── acled/                 # Reference: client/normalizer separation, NormalizedRecord model
            ├── eonet.py                   # UNUSED — delete
            ├── eonet_db.py                # UNUSED — delete
            ├── social_scraper.py          # UNUSED — delete
            ├── reddit.py                  # UNUSED — delete
            ├── reddit_classifier.py       # UNUSED — delete
            ├── reddit_db.py               # UNUSED — delete
            ├── reddit_schema.sql          # UNUSED — delete
            ├── Reddit Scraper/            # UNUSED — delete
            ├── natural-disasters/         # UNUSED — delete
            └── ryan_scrapers/             # UNUSED — delete
```

## Database Schema

PostgreSQL with pgvector and pgcrypto extensions.

### Core Tables

**`content_table`** (primary data store — rename target: `articles`)
- `id` UUID PK (gen_random_uuid)
- `title`, `body`, `url` (UNIQUE)
- `latitude`, `longitude` (nullable floats)
- `image_url`, `s3_url` — DEPRECATED, to be dropped (Cloudinary/S3 no longer used)
- `embedding` vector(1536) — OpenAI text-embedding-3-small
- `sentiment_score` float, `market_signal` text
- `published_at` timestamptz, `event_type` text, `raw_metadata_json` JSONB
- `source_id` FK -> sources, `engagement_id` FK -> engagement
- `created_at` timestamptz
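
As a rough illustration, a `content_table` row can be modeled as below. Field names come from the schema above; the class name, Python types, and defaults are assumptions (the repo's real models are Pydantic, in `models/schemas.py`):

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional
from uuid import UUID, uuid4

@dataclass
class ContentRow:
    """Illustrative mirror of content_table, not the project's actual model."""
    title: str
    body: str
    url: str                                   # UNIQUE in the DB
    id: UUID = field(default_factory=uuid4)    # gen_random_uuid() server-side
    latitude: Optional[float] = None
    longitude: Optional[float] = None
    embedding: Optional[list[float]] = None    # vector(1536)
    sentiment_score: Optional[float] = None
    market_signal: Optional[str] = None
    published_at: Optional[datetime] = None
    event_type: Optional[str] = None
    raw_metadata_json: Optional[dict] = None
    source_id: Optional[UUID] = None           # FK -> sources
    engagement_id: Optional[UUID] = None       # FK -> engagement
```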

**`engagement`** — Reddit, Polymarket, Twitter metrics per content item

**`sources`** — name, type, base_url, trust_score

**`entities`** — extracted entities (person, org, location, etc.)

**`content_entities`** — join table (content_item_id, entity_id, relevance_score)

**`events`** — clustered event groups with cluster_embedding, canada_impact_summary, confidence_score

**`event_content`** — join between events and content_table

**`event_relationships`** — event_a_id, event_b_id, relationship_type, score, reason_codes

### Key Indexes
- HNSW cosine index on `content_table.embedding`
- UNIQUE on `content_table.url`

## API Endpoints

| Method | Path | Purpose | AI Cost |
|--------|------|---------|---------|
| GET | `/content/points` | All content with lat/lng (last 31 days) | None |
| GET | `/content/arcs?threshold=0.7` | Similarity arcs via pgvector cosine | None |
| GET | `/content/{id}` | Single content item detail | None |
| POST | `/content/{id}/confidence-score` | Gemini credibility scoring (0.31-1.0) | 1 Gemini call |
| POST | `/content/{id}/realtime-analysis` | Gemini + Google Search grounding | 1 Gemini call |
| POST | `/agent/query` | Graph-RAG agent pipeline | 1 Gemini call + 1 OpenAI embed |
| GET | `/market-signals` | Live Polymarket + Kalshi fetch | None |
| POST | `/ingestion/acled` | Trigger ACLED ingestion pipeline | None |
| POST | `/embeddings/backfill/content` | Backfill missing embeddings | N OpenAI calls |
| GET | `/health` | Health check | None |
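
For orientation, a minimal sketch of how a client might address two of these endpoints (paths and the `threshold` default come from the table; the base URL and function names are assumptions, the real wrapper lives in `frontend/src/api/client.ts`):

```python
from urllib.parse import urlencode

BASE_URL = "http://127.0.0.1:8000"  # assumption: local dev default

def arcs_url(threshold: float = 0.7) -> str:
    """Build the similarity-arcs endpoint URL from the table above."""
    return f"{BASE_URL}/content/arcs?{urlencode({'threshold': threshold})}"

def content_detail_url(content_id: str) -> str:
    """Build the single-item detail endpoint URL."""
    return f"{BASE_URL}/content/{content_id}"
```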

## Agent Pipeline (Graph-RAG)

1. **Classify query** — pattern match keywords -> query_type (event_explanation, impact_analysis, connection_discovery, entity_relevance)
2. **Seed retrieval** — keyword ILIKE search + pgvector cosine similarity
3. **Graph expansion** — 2-hop: for each seed, find 6 nearest neighbors via pgvector
4. **Context assembly** — full article bodies + financial impact heuristics
5. **Gemini synthesis** — structured JSON output with citations `[cite:UUID]`, navigation plan, financial impact
6. **Post-processing** — filter to globe-navigable events, strip invalid citations

Output schema includes: answer, confidence, caution, query_type, navigation_plan, relevant_event_ids, highlight_relationships, financial_impact, reasoning_steps, cited_event_map.
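
Step 6's citation stripping can be sketched roughly as follows. The `[cite:UUID]` marker format comes from step 5; the function name and exact cleanup rules are assumptions:

```python
import re

# Matches [cite:<uuid>] markers emitted by the Gemini synthesis step
CITE_RE = re.compile(r"\[cite:([0-9a-fA-F-]{36})\]")

def strip_invalid_citations(answer: str, valid_ids: set[str]) -> str:
    """Drop [cite:UUID] markers whose UUID is not in the retrieved-event set."""
    def keep_or_drop(m: re.Match) -> str:
        return m.group(0) if m.group(1).lower() in valid_ids else ""
    return CITE_RE.sub(keep_or_drop, answer)
```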

## Event Types (StrEnum)

```
geopolitics, trade_supply_chain, energy_commodities,
financial_markets, climate_disasters, policy_regulation
```

Each maps to a color in the frontend `EVENT_TYPE_COLORS` constant.
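
The values above translate directly into the enum in `models/enums.py`. A sketch (member names assumed; shown with the `str`/`Enum` mixin, which compares identically to the `StrEnum` the repo uses on Python 3.11+):

```python
from enum import Enum

class EventType(str, Enum):
    """Illustrative mirror of the backend's EventType StrEnum."""
    GEOPOLITICS = "geopolitics"
    TRADE_SUPPLY_CHAIN = "trade_supply_chain"
    ENERGY_COMMODITIES = "energy_commodities"
    FINANCIAL_MARKETS = "financial_markets"
    CLIMATE_DISASTERS = "climate_disasters"
    POLICY_REGULATION = "policy_regulation"
```

Because members are strings, they round-trip cleanly through JSON and DB text columns.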

## Data Sources

The hackathon prototype used the sources below. **None of the existing scraper implementations will be used directly** — new production scrapers will be written implementing a `BaseScraper` ABC. The old code is kept in `scrapers/_reference/` for design inspiration.

| Source | What It Provides | Reference File | Quality |
|--------|-----------------|----------------|---------|
| GDELT | Global events from news (BigQuery or DOC API) | `scrapers/_reference/gdelt.py` | Excellent — study for complex normalization |
| ACLED | Armed conflict events | `scrapers/_reference/acled/` | Good — study for client/normalizer separation |
| Polymarket | Prediction market events + probabilities | `scrapers/_reference/polymarket.py` | Decent — study for simple REST pattern |
| Kalshi | Prediction market events + volumes | `scrapers/_reference/kalshi.py` | Excellent — study for async rate limiting |

Which data sources to keep, replace, or add is a product decision for Phase 2. The scraper architecture (BaseScraper ABC, row_format contract, dedup-before-embed) is what matters.

## Known Issues & Tech Debt

1. **Duplicate `content_repository.py`** — exists in both `repositories/` and `ingestion/`. Must consolidate.
2. **No shared DB pool** — `ingestion/db.py` has its own pool; other services use inline `asyncpg.connect()`. Need a single shared pool with an explicit `max_size`, so concurrent requests can't exhaust PostgreSQL's connection limit (`max_connections` defaults to 100).

3. **Dead scraper code** — `eonet.py`, `reddit*.py`, `social_scraper.py`, `Reddit Scraper/`, `natural-disasters/`, `ryan_scrapers/` are all unused junk. The "real" scrapers (`gdelt.py`, `kalshi.py`, `polymarket.py`, `acled/`) are hackathon-quality reference code only — new production scrapers need to be written.
4. **No scheduled scraping** — all ingestion is manual CLI or API trigger.
5. **Expensive embeddings** — OpenAI API called per-row. Should switch to local model.
6. **No caching** — every confidence score and realtime analysis call hits Gemini. Need Redis.
7. **CORS wildcard** — `*` origin allowed in production. Must lock down.
8. **No tests** — zero test files in the repo.
9. **Print debugging** — `print()` used instead of structured logging.
10. **No migration tool** — raw SQL files, no Alembic.
11. **Dead Cloudinary/S3 code** — Cloudinary and S3 are no longer used. `utils/mediaConfig.ts`, `@cloudinary/react`, `@cloudinary/url-gen`, `cloudinary`, `boto3` deps, `image_url`/`s3_url` DB columns, and all related env vars should be removed (ticket #10).

## Conventions

### Backend
- **Async everywhere** — use `async def` for all new route handlers and service methods. (Current state is mixed: some existing routes, e.g. in `routers/content.py`, are still synchronous and use psycopg2; migrate them as you touch them.)
- **asyncpg** for DB access in new code (not SQLAlchemy ORM)
- **Pydantic v2** for request/response models
- **Raw SQL** for queries (no ORM) — parameterize all user inputs with `$1, $2` syntax
- **Environment variables** via `python-dotenv` and `os.getenv()`
- **Scraper output** normalized via `row_format.make_content_row()` before DB insert
- **New scrapers** must implement `BaseScraper` ABC — see `scrapers/_reference/` for patterns, especially `kalshi.py` (rate limiting) and `gdelt.py` (normalization)

### Frontend
- **React 19** with function components and hooks only
- **Context API** for state (no Redux) — three providers: App, Agent, UserPersona
- **Tailwind CSS** for styling — design tokens as CSS variables in `index.css`
- **No routing library** yet — single-page, overlay-based navigation
- **Client-side filtering** — all events loaded on mount, visibility controlled by pointRadius/pointColor

### Git
- Branch from `main`
- Conventional-ish commit messages (e.g., `feat:`, `fix:`, `chore:`)

## Environment Variables

### Backend (.env)
```
DATABASE_URL=postgresql://user:pass@host:5432/dbname # Required — plain DSN; no "+asyncpg" driver suffix (psycopg2 and asyncpg both reject the SQLAlchemy-style scheme)
GEMINI_API_KEY=... # Required for agent
GEMINI_MODEL=gemini-2.5-flash # Optional, default shown
OPENAI_API_KEY=... # Required for embeddings (until local model migration)
ACLED_API_KEY=... # Required for ACLED ingestion
ELEVENLABS_API_KEY=... # Optional
# NOTE: CLOUDINARY_* and AWS_*/S3_* vars are no longer needed — remove if present
```

### Frontend (.env)
```
VITE_API_URL=/api # or http://127.0.0.1:8000
VITE_ELEVENLABS_API_KEY=... # Optional
# NOTE: VITE_CLOUDINARY_CLOUD_NAME is no longer needed — remove if present
```

## Running Locally

```bash
# Backend
cd backend && python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
uvicorn app.main:app --reload --port 8000

# Frontend
cd frontend && npm install && npm run dev

# Manual scraping (LEGACY — will be replaced by run_daily_pipeline.py)
python run_scrape.py # Polymarket + Kalshi
python run_gdelt_scrape.py # GDELT (--days 14 --limit 500)
curl -X POST localhost:8000/ingestion/acled
curl -X POST localhost:8000/embeddings/backfill/content
```

## Working on Tickets

When picking up a ticket from the PRD (`plan/PRD.md`):

1. **Read the relevant source files first** — don't modify code you haven't read
2. **Check for duplicates** — the codebase has redundant implementations (see Known Issues)
3. **Keep async** — all new backend code should be async
4. **Parameterize SQL** — never string-interpolate user input into queries
5. **No new print()** — use `logging.getLogger(__name__)`
6. **Test what you build** — add tests alongside new code (once pytest is set up)
7. **Budget-conscious** — if a feature involves AI API calls, always consider caching and batching first
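
Item 4 in practice: with asyncpg, parameters are passed positionally alongside the SQL and never formatted into the string itself. A driver-free sketch of the shape (table and column names from the schema above; the helper name is an assumption, and actual execution would go through `conn.fetch(sql, *args)`):

```python
from datetime import datetime, timedelta, timezone

def build_points_query(event_type: str, days: int = 31) -> tuple[str, tuple]:
    """Return (sql, args) using $N placeholders; args are passed separately,
    so user input never touches the SQL text."""
    cutoff = datetime.now(timezone.utc) - timedelta(days=days)
    sql = (
        "SELECT id, title, latitude, longitude "
        "FROM content_table "
        "WHERE event_type = $1 AND published_at >= $2"
    )
    return sql, (event_type, cutoff)
```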