Combines the Learning Agent (LinUCB affect adaptation) and the Conversation Agent (persona-RAG + LLM) into a single, fully stateless FastAPI service.
elara_service/
├── app.py ← Unified FastAPI entry point
│
├── learning_agent/
│ ├── __init__.py
│ ├── schemas.py
│ ├── nlp_layer.py ← VADER + Jaccard + keyword signals
│ ├── state_classifier.py ← Affect classification + escalation smoother
│ ├── bandit.py ← Discounted LinUCB bandit
│ ├── config_applier.py ← Action → config delta
│ └── storage.py ← Per-user atomic A/b matrix persistence
│
├── conversation_agent/
│ ├── __init__.py
│ ├── adapter.py ← Glue: session state, distress watchdog, immediate reward
│ ├── llm.py ← Ollama / Groq streaming backends
│ ├── rag.py ← Keyword persona retrieval + system prompt builder
│ ├── audio.py ← Barge-in audio I/O (edge device)
│ └── persona.json ← User profile
│
├── example_client.py ← Reference client
├── requirements.txt
└── README.md
python3 -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install -r requirements.txtOption A — Ollama (local, default)
ollama serve
ollama pull qwen2.5:1.5b # or any model — override via `model` fieldOption B — Groq (cloud, faster)
export GROQ_API_KEY=your_key_here # get free key at console.groq.comuvicorn app:app --reload --port 8000Service is now running at http://localhost:8000.
Liveness probe.
{ "status": "ok", "service": "ELARA Unified", "version": "2.1.0" }Send state: null on the first turn. Echo back the state from each response on the next request.
Request
{
"message": "I forgot if I took my tablet this morning.",
"state": null,
"backend": "ollama",
"model": null
}Response
{
"reply": "Don't worry — I checked for you. Yes, you did take your tablet this morning.",
"state": {
"session_id": "elara-a1b2c3d4",
"interaction_count": 1,
"consecutive_distress_turns": 0,
"config": {
"pace": "normal",
"clarity_level": 2,
"confirmation_frequency": "low",
"patience_mode": false
},
"bandit": {
"previous_affect": "calm",
"previous_action_id": 0,
"previous_context_id": 31,
"previous_config": { "pace": "normal", "clarity_level": 2, "confirmation_frequency": "low", "patience_mode": false },
"affect_window": ["calm"]
},
"history": [
{ "role": "user", "content": "I forgot if I took my tablet this morning.", "timestamp": "..." },
{ "role": "assistant", "content": "Don't worry — I checked...", "timestamp": "..." }
]
},
"diagnostics": {
"affect": "calm",
"confidence": 1.0,
"signals_used": [],
"config_changes": {},
"reward_applied": null,
"ucb_action_id": 0,
"ucb_scores": [-0.08, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
"escalation_rule": null,
"distress_turns": 0,
"caregiver_alert": false,
"immediate_reward_applied": false
}
}Same request schema as /chat. Instead of waiting for the full LLM reply, streams tokens as Server-Sent Events so the edge device can begin TTS on the first sentence while the model is still generating.
Event stream format
data: {"token": "Don't worry"}
data: {"token": " — I checked"}
data: {"token": " for you."}
data: {"done": true, "reply": "Don't worry — I checked for you.", "state": {...}, "diagnostics": {...}}
tokenevents arrive as the LLM generates output.- The final
doneevent carries the completeChatResponsepayload (identical structure to/chat) so the caller can update session state.
Example (curl)
curl -N -X POST http://localhost:8000/chat/stream \
-H "Content-Type: application/json" \
-d '{"message": "Good morning!", "state": null, "backend": "ollama"}'Edge device integration
Feed token events into audio.py's sentence_chunks() helper to start speaking on the first complete sentence before generation finishes:
# Pseudocode — wire SSE token stream → sentence_chunks → TTS
for sentence in sentence_chunks(token_stream_from_sse()):
interrupted, _ = audio_manager.speak(sentence)
if interrupted:
breakAccepts the original AnalyseRequest schema (v1.0 / v1.1). Unchanged behaviour.
The server holds zero per-session memory. Every piece of state lives in the state field of the response and must be echoed back on the next request.
| Field | What it tracks |
|---|---|
state.session_id |
Unique session identifier (also keys the per-user bandit matrices on disk) |
state.interaction_count |
Turn counter |
state.consecutive_distress_turns |
Consecutive non-calm turns (feeds the caregiver watchdog) |
state.config |
Live ELARA behaviour config (updated by Learning Agent each turn) |
state.bandit |
LinUCB tracking — affect window, previous action/config for reward attribution |
state.history |
Last 10 exchanges for LLM context window |
Client
│
│ POST /chat {message, state} POST /chat/stream (SSE)
▼ ▼
app.py (FastAPI)
│
├─── adapter.py (ConversationAdapter.handle_turn)
│ │
│ ├── Restore session state from request
│ │
│ ├── Immediate reward — if user says "thank you / that helped",
│ │ inject +1.0 reward for previous action now, skip pipeline update
│ │
│ ├── Build AnalyseRequest → _run_learning_pipeline()
│ │ │
│ │ ├── nlp_layer.py (VADER sentiment + Jaccard repetition + keyword scores)
│ │ ├── state_classifier.py (affect + escalation smoother)
│ │ ├── bandit.py (Discounted LinUCB update + select)
│ │ ├── config_applier.py (action_id → config delta)
│ │ └── storage.py (atomic per-user A/b matrix persistence)
│ │
│ ├── Update distress watchdog counter; log caregiver alert if ≥7 non-calm turns
│ │
│ ├── Apply config delta to session state
│ │
│ ├── rag.py (keyword persona retrieval → system prompt with pace/clarity/patience instructions)
│ │
│ └── llm.py (Ollama / Groq — buffered for /chat, streamed for /chat/stream)
│
└─── Return ChatResponse {reply, state, diagnostics}
After every turn the Learning Agent classifies the user's emotional state (calm, confused, frustrated, sad, disengaged) and the LinUCB bandit selects one of 7 config actions:
| Action | Effect |
|---|---|
| DO_NOTHING | No change; if calm, recover one config step toward defaults |
| DECREASE_CLARITY | Simpler language (clarity 3→2→1) |
| DECREASE_PACE | Slower replies and shorter sentences (fast→normal→slow) |
| INCREASE_CONFIRMATION | ELARA repeats back what it understood |
| ENABLE_PATIENCE | Every reply opens with a warm empathetic acknowledgement |
| DECREASE_CLARITY_AND_PACE | Both of the above together |
| CLARITY_AND_CONFIRMATION | Simpler language + confirmation |
The bandit learns per user over time — matrices are persisted to tables/<session_id>_bandit_A.npy / _b.npy.
pace controls both the token budget and the LLM's speaking style:
| Pace | Max tokens | System prompt instruction |
|---|---|---|
| slow | 400 | "Speak slowly and gently. Use short sentences with pauses between ideas. One thought at a time." |
| normal | 300 | (no extra instruction) |
| fast | 100 | "Be brief and to the point." |
If the user registers 7 or more consecutive non-calm turns, a caregiver_alert: true flag is set in the response diagnostics and a warning is logged. Wire up _check_distress_watchdog() in adapter.py to your notification system (SMS/push/email) to alert a caregiver.
If the user explicitly signals satisfaction ("thank you", "that helped", "much better", etc.), a +1.0 reward is injected into the bandit for the previous action immediately — without waiting for the next-turn affect transition. This teaches ELARA what actually worked, not just that things improved.
A 5-turn rolling affect window prevents single-turn overreaction:
- A single confused/frustrated message after a calm history is dampened.
frustratedrequires ≥2 consecutive non-calm turns (or all three strong signals) to stand.- A single short reply ("Yes.") after calm history is reclassified as
calm, notdisengaged.
# In a second terminal (service must be running)
python example_client.py# Ollama (default)
curl -X POST http://localhost:8000/chat \
-H "Content-Type: application/json" \
-d '{"message": "Good morning!", "state": null, "backend": "ollama"}'
# Groq (requires GROQ_API_KEY)
curl -X POST http://localhost:8000/chat \
-H "Content-Type: application/json" \
-d '{"message": "Good morning!", "state": null, "backend": "groq"}'- Audio / Voice mode:
audio.pyhandles always-on mic, barge-in detection, and TTS but is not wired into the HTTP endpoints. For voice mode, add a thin layer that calls STT before the request and feeds the/chat/streamtoken events intosentence_chunks()for low-latency TTS. - Multi-user: Each
session_idgets its own bandit matrix files intables/. File-based locking (threading.Lock+fcntl.flock) is safe for a single server process. For a care facility with many concurrent users, replacestorage.py's backend with Redis. - Privacy: STT (faster-whisper) runs fully offline. The LLM can be run locally via Ollama — no data leaves the device unless you opt into the Groq backend.