seanbeirnes/cpal

CPAL — Canvas Personal Assistant for Learning

Status: Deprecated/Archived

The Instructure Community site changed its structure, breaking many cited source links and the HTML selectors used by the extractor. CPAL is preserved for reference but is no longer maintained, and answers may reference stale or invalid URLs.

Overview

CPAL (Canvas Personal Assistant for Learning) was a Retrieval-Augmented Generation (RAG) chatbot designed to answer Canvas LMS questions with concise, cited responses. It indexed Canvas Community guides and solved forum threads, retrieved relevant passages, and synthesized answers with links to original sources.

Screenshot of CPAL demo

Not affiliated with Instructure.

Key Capabilities

  • RAG over Canvas Community docs and solved forum threads
  • Query rewriting to improve recall before retrieval
  • Answer synthesis with source citations
  • CAPTCHA-protected query endpoint; Q/A event logging
  • Full-stack: FastAPI backend, Vite/React frontend, single-container deployment

System Architecture

  • Backend (FastAPI): serves API and static frontend
    • App: backend/main.py, routes: backend/web/api.py
    • LLM (Gemini 2.5 Flash): backend/service/llm.py
    • Embeddings (MiniLM): backend/service/embedding.py
    • Vector DB (Pinecone): backend/service/vectordb.py
    • Event logging (SQLModel → Postgres): backend/service/events.py, backend/model/event.py
    • reCAPTCHA verification: backend/service/captcha.py
  • Frontend (Vite/React): frontend/
  • Containerization/Deploy: Dockerfile, fly.toml

Example flow:

  1. User question → FastAPI /api/query
  2. Query rewritten (LLM) to expand recall
  3. Embedding generated → similarity search (Pinecone)
  4. Top matches filtered; forum question chunks replaced with corresponding answer chunks
  5. Prompt assembled → LLM generates markdown answer with sources
  6. Q/A logged to Postgres; response returned
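Step 4 of the flow can be sketched as follows. This is a minimal illustration, not the code in backend/web/api.py: the `Match` shape, the `select_context` name, the `min_score` threshold of 0.5, and the `answers_by_thread` lookup are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Match:
    """Hypothetical shape of one vector-search hit (field names are assumptions)."""
    id: str
    score: float
    kind: str  # "guide", "forum_question", or "forum_answer"
    thread_id: Optional[str] = None


def select_context(
    matches: List[Match],
    answers_by_thread: Dict[str, Match],
    min_score: float = 0.5,  # assumed threshold; the real value is not documented
    top_k: int = 5,
) -> List[Match]:
    """Keep the best top_k hits above min_score, swapping forum questions for answers."""
    selected: List[Match] = []
    seen = set()
    ranked = sorted(matches, key=lambda m: m.score, reverse=True)[:top_k]
    for m in ranked:
        if m.score < min_score:
            continue  # score filtering
        if m.kind == "forum_question":
            answer = answers_by_thread.get(m.thread_id)
            if answer is None:
                continue  # no answer chunk known for this thread; drop the hit
            # Keep the question's similarity score but use the answer's content.
            m = Match(answer.id, m.score, answer.kind, answer.thread_id)
        if m.id not in seen:  # two question chunks may map to the same answer
            seen.add(m.id)
            selected.append(m)
    return selected
```

The swap reflects the idea that a user's question tends to be most similar to another question, while the text worth passing to the LLM is the accepted answer.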

Data Pipeline (one-time bootstrap)

The pipeline was run manually, once, to populate the vector index:

  1. Extract HTML → Markdown + metadata (Go)
    • Entrypoint: pipeline/cmd/extract/main.go
    • Logic: pipeline/internal/extract/extract.go
    • Config: pipeline/config.json
    • Output: tmp/raw/{id}.md and tmp/raw/{id}.json
  2. Clean and chunk Markdown → plain text chunks
    • pipeline/python/prepare_data.py → tmp/chunks/
  3. Generate embeddings for chunks (MiniLM)
    • pipeline/python/generate_embeddings.py → tmp/embeddings/
  4. Upsert vectors + metadata to Pinecone
    • pipeline/python/store_embeddings.py

Note: Extractor assumed the legacy Instructure Community DOM; it no longer matches the current site.
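As a rough illustration of step 2 (clean and chunk Markdown into plain text), the sketch below strips common Markdown markup and packs paragraphs into size-bounded chunks. It does not reproduce pipeline/python/prepare_data.py; the function names, the regex-based cleaning, and the 500-character limit are assumptions chosen for the example.

```python
import re
from typing import List


def markdown_to_text(md: str) -> str:
    """Crudely strip Markdown markup, keeping link labels and image alt text."""
    text = re.sub(r"!\[([^\]]*)\]\([^)]*\)", r"\1", md)         # images -> alt text
    text = re.sub(r"\[([^\]]*)\]\([^)]*\)", r"\1", text)        # links -> label
    text = re.sub(r"^#{1,6}\s*", "", text, flags=re.MULTILINE)  # heading markers
    text = re.sub(r"[*`]+", "", text)                           # emphasis/code markers
    return text.strip()


def chunk_text(text: str, max_chars: int = 500) -> List[str]:
    """Pack whole paragraphs into chunks of at most max_chars characters.

    A single paragraph longer than max_chars is kept as one oversized chunk.
    """
    chunks: List[str] = []
    current = ""
    for para in re.split(r"\n\s*\n", text):
        para = " ".join(para.split())  # collapse internal whitespace
        if not para:
            continue
        if current and len(current) + 1 + len(para) > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current} {para}".strip()
    if current:
        chunks.append(current)
    return chunks
```

Splitting on paragraph boundaries rather than at fixed offsets keeps each chunk readable on its own, which matters when chunks are later quoted as context in a prompt.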

Infrastructure

  • Hosting: Fly.io (single container serving API + built frontend)
  • Database: Supabase (Postgres) via DATABASE_URL for Q/A events
  • Vector Store: Pinecone via VECTOR_DB_API_KEY / VECTOR_DB_INDEX_NAME
  • LLM: Google Generative AI (Gemini) via LLM_API_KEY
  • CAPTCHA: Google reCAPTCHA via CAPTCHA_SITE_KEY / CAPTCHA_SITE_SECRET
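For context on the CAPTCHA entry, server-side verification with Google reCAPTCHA is a form-encoded POST of the secret and the client token to the siteverify endpoint, which returns JSON with a `success` flag. The sketch below only builds the request and interprets a response body; the helper names are hypothetical and it is not taken from backend/service/captcha.py.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

SITEVERIFY_URL = "https://www.google.com/recaptcha/api/siteverify"


def build_verify_request(secret: str, token: str) -> Request:
    """Build (but do not send) the verification POST for a client token."""
    body = urlencode({"secret": secret, "response": token}).encode("ascii")
    return Request(SITEVERIFY_URL, data=body, method="POST")


def is_verified(verify_response_body: str) -> bool:
    """Return True only if the siteverify JSON reports success."""
    return json.loads(verify_response_body).get("success") is True
```

In this setup, `secret` would come from CAPTCHA_SITE_SECRET and `token` from the `captcha_token` field of the query request; sending the request (for example with `urllib.request.urlopen`) is left out so the sketch has no network dependency.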

API Summary

  • GET /api/livez → health
  • GET /api/config → { "captcha": <site_key> }
  • POST /api/query
    • Body: { "query": string, "captcha_token": string }
    • Response: { "answer": markdown, "sources": [{ "url", "title", "score" }] }
    • Notes: top‑k=5 similarity; score filtering; forum Q→A chunk replacement; Q/A logged
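A client call to the query endpoint might look like the sketch below. It uses only the request and response fields listed above; the helper names, the local base URL, and the assumption that `score` is numeric are illustrative, and no request is actually sent.

```python
import json
from typing import List
from urllib.request import Request


def build_query_request(base_url: str, query: str, captcha_token: str) -> Request:
    """Build (but do not send) a POST /api/query request with a JSON body."""
    payload = json.dumps({"query": query, "captcha_token": captcha_token})
    return Request(
        url=f"{base_url.rstrip('/')}/api/query",
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def format_sources(response_body: str) -> List[str]:
    """Render the response's sources as 'title (score): url' lines."""
    data = json.loads(response_body)
    return [f"{s['title']} ({s['score']:.2f}): {s['url']}" for s in data["sources"]]
```

Passing the returned `Request` to `urllib.request.urlopen` would perform the call; the `answer` field of the response is Markdown and can be rendered as-is.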

Minimal Local Run (optional; not guaranteed due to deprecation)

  • Set required env vars: HOST, LLM_API_KEY, VECTOR_DB_API_KEY, VECTOR_DB_INDEX_NAME, DATABASE_URL, CAPTCHA_SITE_KEY, CAPTCHA_SITE_SECRET.
  • Standard FastAPI + Vite flow: backend on port 8080, with the frontend dev server as the origin (the example HOST below uses Vite's default, http://localhost:5173). See Dockerfile for the container build.

Example .env (placeholders only):

HOST=http://localhost:5173
LLM_API_KEY=...
VECTOR_DB_API_KEY=...
VECTOR_DB_INDEX_NAME=cpal-index
DATABASE_URL=postgresql+psycopg2://user:pass@host:5432/db
CAPTCHA_SITE_KEY=...
CAPTCHA_SITE_SECRET=...
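Before starting the backend, it can help to fail fast on missing configuration. The check below is a small sketch using the variable names listed above; the `missing_env_vars` helper is hypothetical and not part of the repository.

```python
import os
from typing import List, Mapping, Optional

REQUIRED_VARS = (
    "HOST",
    "LLM_API_KEY",
    "VECTOR_DB_API_KEY",
    "VECTOR_DB_INDEX_NAME",
    "DATABASE_URL",
    "CAPTCHA_SITE_KEY",
    "CAPTCHA_SITE_SECRET",
)


def missing_env_vars(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        raise SystemExit("Missing environment variables: " + ", ".join(missing))
    print("All required environment variables are set.")
```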

Limitations

  • Source URLs and selectors are tied to the legacy Community site and may be invalid.
  • Project is not actively maintained; content freshness and security are not guaranteed.

Project Layout

backend/           FastAPI app, services, models
frontend/          Vite/React UI
pipeline/          Go extractor + Python preparation/embedding/store steps
doc/cpal-demo.png  Screenshot
Dockerfile         Multi-stage build serving static + API
fly.toml           Fly.io deployment config

License

AGPL-3.0 — see LICENSE.

Acknowledgements

Canvas Community content; FastAPI; React/Vite; Sentence-Transformers;