model-fallback-skill/ARCHITECTURE.md at main · capt-marbles/model-fallback-skill

# System Architecture

Design and architecture of the model-fallback-skill system.

## Table of Contents

- Overview
- System Components
- Data Flow
- Failure Detection
- Fallback Decision Tree
- File Structure

---

## Overview

The model-fallback-skill provides two complementary mechanisms for keeping nanobot reliable:

1. Proactive Health Monitoring - Continuously monitors model health and switches models before failures occur
2. Reactive Fallback - Responds to actual failures by switching to a backup model

Together they aim to keep nanobot available through model outages. A model switch still involves a short restart of the nanobot gateway, which the wrapper script handles automatically.

---

## System Components

```
┬─ nanobot-wrapper.sh
│  └─ nanobot gateway
│
├─ health-check.py (daemon)
│  ├─ Monitors model health every 5 minutes
│  ├─ Tests response time, error rate, timeout rate
│  └─ Triggers fallback if thresholds exceeded
│
├─ fallback-trigger.py (on-demand)
│  ├─ Manually trigger fallback
│  ├─ Check fallback status
│  └─ View fallback history
│
├─ config.json
│  ├─ Current model configuration
│  ├─ Fallback chain
│  └─ Provider API keys
│
└─ Logs/
   ├─ model-health.log
   ├─ model-fallback.log
   └─ fallback-history.json
```

### Component Descriptions

#### nanobot-wrapper.sh
- Wraps the nanobot gateway in a restart loop
- Watches for the restart trigger file
- Automatically restarts nanobot when triggered

#### health-check.py
- Runs as a background daemon
- Periodically tests model health
- Maintains health metrics
- Triggers fallback when thresholds are exceeded

#### fallback-trigger.py
- Manages fallback operations
- Updates the configuration with the new model
- Creates the restart trigger
- Logs fallback events

---

## Data Flow

### Health Check Flow

```
┬─ Start (every 5 minutes)
│
├─ Read config.json
│  ├─ Get current model
│  └─ Get fallback chain
│
├─ Test Model Health
│  ├─ Send test request
│  ├─ Measure response time
│  ├─ Check for errors
│  └─ Check for timeouts
│
├─ Calculate Metrics
│  ├─ Average response time
│  ├─ Error rate (last N requests)
│  ├─ Timeout rate (last N requests)
│  └─ Overall health score
│
├─ Evaluate Thresholds
│  ├─ Response time < MAX_RESPONSE_TIME?
│  ├─ Error rate < MAX_ERROR_RATE?
│  ├─ Timeout rate < MAX_TIMEOUT_RATE?
│  └─ Overall score acceptable?
│
├─ Decision
│  ├─ HEALTHY → Log and continue
│  └─ UNHEALTHY → Trigger fallback
│     ├─ Call fallback-trigger.py
│     ├─ Update config
│     ├─ Create restart trigger
│     └─ Log event
│
└─ Wait for next interval
```

### Fallback Trigger Flow

```
┬─ Trigger Request
│
├─ Validate Configuration
│  ├─ Config file exists?
│  ├─ Current model configured?
│  ├─ Fallback models available?
│  └─ Not at end of chain?
│
├─ Select Next Model
│  ├─ Get current model from config
│  ├─ Find current in fallback chain
│  └─ Select next in chain
│
├─ Update Configuration
│  ├─ Read config.json
│  ├─ Update default model
│  ├─ Write config.json
│  └─ Verify changes
│
├─ Trigger Restart
│  ├─ Create /tmp/nanobot-restart
│  ├─ Signal nanobot process
│  └─ Wrapper detects and restarts
│
├─ Log Event
│  ├─ Log to model-fallback.log
│  ├─ Record in fallback-history.json
│  └─ Include reason and timestamp
│
└─ Return Success
```

---

## Failure Detection

### Health Metrics

| Metric | Description | Threshold | Weight |
|--------|-------------|-----------|--------|
| Response Time | Time to receive response | 30 seconds | 40% |
| Error Rate | Percentage of failed requests | 10% | 35% |
| Timeout Rate | Percentage of timed out requests | 20% | 25% |

### Health Score Calculation

```
Health Score = (Response Score × 0.4) +
               (Error Score × 0.35) +
               (Timeout Score × 0.25)

Where:
- Response Score = 1 - (Response Time / Max Response Time)
- Error Score = 1 - (Error Rate / Max Error Rate)
- Timeout Score = 1 - (Timeout Rate / Max Timeout Rate)

Health Score < 0.7 = UNHEALTHY
```

### Sample Size

- The last 10 requests are sampled for rate calculations
- The rolling window updates with each check
- This keeps the error and timeout rates focused on recent performance

---

## Fallback Decision Tree

```
┬─ Fallback Triggered
│
├─ Check Fallback Chain
│  ├─ Any models in chain?
│  └─ Yes → Continue
│     No → Log error, exit
│
├─ Find Current Model Position
│  ├─ Current model in chain?
│  └─ Yes → Get next model
│     No → Use first model in chain
│
├─ Select Next Model
│  ├─ Next model available?
│  └─ Yes → Switch to next
│     No → Log "End of chain", exit
│
├─ Validate Next Model
│  ├─ API key configured?
│  ├─ Model valid?
│  └─ Yes → Proceed
│     No → Try next in chain
│
├─ Update Config
│  ├─ Set new default model
│  └─ Write to config.json
│
├─ Trigger Restart
│  ├─ Create restart trigger file
│  ├─ Signal nanobot
│  └─ Wrapper handles restart
│
├─ Log Event
│  ├─ Old model → New model
│  ├─ Timestamp
│  ├─ Reason
│  └─ Health metrics
│
└─ Return Success
```

---

## File Structure

```
model-fallback-skill/
├─ SKILL.md                  # Nanobot skill documentation
├─ README.md                 # Main documentation
├─ CONTRIBUTING.md           # Contribution guidelines
├─ CHANGELOG.md              # Version history
├─ TROUBLESHOOTING.md        # Troubleshooting guide
├─ API.md                    # API documentation
├─ ARCHITECTURE.md           # This file
├─ LICENSE                   # MIT License
├─ install.sh                # Installation script
├─ .git/                     # Git repository
├─ scripts/
│  ├─ health-check.py        # Health monitoring daemon
│  └─ fallback-trigger.py    # Fallback management
└─ examples/
   ├─ README.md              # Examples documentation
   ├─ config-minimax-openrouter.json
   ├─ config-openrouter-only.json
   └─ config-production.json
```

### External Files

```
~/.nanobot/
├─ config.json               # nanobot configuration
├─ nanobot-wrapper.sh        # nanobot wrapper
└─ logs/
   ├─ model-health.log       # Health check logs
   ├─ model-fallback.log     # Fallback event logs
   └─ fallback-history.json  # Fallback history

/tmp/
└─ nanobot-restart           # Restart trigger file
```
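
To make the Health Score Calculation and the Fallback Decision Tree concrete, here is a minimal Python sketch of both. It is not taken from `health-check.py` or `fallback-trigger.py`: the function names, the clamping of each component score to the range 0 to 1, and the model names in the usage note are assumptions made for illustration. The thresholds and weights come from the Health Metrics table.

```python
from typing import Optional, Sequence

# Thresholds and weights from the Health Metrics table.
MAX_RESPONSE_TIME = 30.0  # seconds
MAX_ERROR_RATE = 0.10     # 10%
MAX_TIMEOUT_RATE = 0.20   # 20%
UNHEALTHY_BELOW = 0.7     # Health Score < 0.7 = UNHEALTHY


def component_score(value: float, maximum: float) -> float:
    """Return 1 - value / maximum, clamped to [0, 1].

    The clamp is an assumption: the documented formula does not say
    what happens once a metric exceeds its maximum.
    """
    return max(0.0, min(1.0, 1.0 - value / maximum))


def health_score(avg_response_time: float,
                 error_rate: float,
                 timeout_rate: float) -> float:
    """Weighted sum of the three component scores (0.4 / 0.35 / 0.25)."""
    return (0.40 * component_score(avg_response_time, MAX_RESPONSE_TIME)
            + 0.35 * component_score(error_rate, MAX_ERROR_RATE)
            + 0.25 * component_score(timeout_rate, MAX_TIMEOUT_RATE))


def is_healthy(avg_response_time: float,
               error_rate: float,
               timeout_rate: float) -> bool:
    """Apply the documented cutoff to the weighted score."""
    score = health_score(avg_response_time, error_rate, timeout_rate)
    return score >= UNHEALTHY_BELOW


def next_model(current: str, chain: Sequence[str]) -> Optional[str]:
    """Pick the next model as in the decision tree.

    Returns None for an empty chain or when the current model is the
    last entry; returns the first entry when the current model is not
    in the chain at all. API-key and model validation are omitted.
    """
    if not chain:
        return None
    if current not in chain:
        return chain[0]
    position = chain.index(current) + 1
    return chain[position] if position < len(chain) else None
```

For example, `is_healthy(6.0, 0.02, 0.0)` is true (a score of about 0.85), while `is_healthy(24.0, 0.05, 0.10)` is false (about 0.38). With a hypothetical chain `["model-a", "model-b"]`, `next_model("model-a", chain)` gives `"model-b"` and `next_model("model-b", chain)` gives `None`, which corresponds to the "End of chain" branch.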
