This document outlines the key concerns and the management strategy for small language model (SLM) deployment across all projects.
| Concern | Priority | Projects Affected |
|---|---|---|
| Model Selection | High | All |
| Cost Management | High | All |
| Latency Requirements | High | Gateway, Rooivalk |
| Edge Deployment | High | Rooivalk |
| Security & Privacy | High | Gateway, Cognitive Mesh |
| Reliability | Medium | All |
| Observability | Medium | All |
| Versioning | Medium | All |
Maintain a tiered model portfolio:
| Tier | Models | Use Cases | Cost |
|---|---|---|---|
| Ultra-light | Phi-3 Mini, Gemma 2B | Classification, routing | $0.0001/request |
| Light | Phi-3, Llama 3 8B | Tool selection, log analysis | $0.001/request |
| Medium | Llama 3 70B | Complex routing, decomposition | $0.01/request |
| Heavy | GPT-4 class | Reasoning, synthesis | $0.05+/request |
- Central model registry with capability matrix
- A/B testing framework for model comparisons
- Performance benchmarks per use case category
Implement cost controls at each layer:
```
Cost Control Layers
┌─────────────────────────────────────┐
│ 1. Budget caps per project          │
├─────────────────────────────────────┤
│ 2. SLM-first routing (80%+ target)  │
├─────────────────────────────────────┤
│ 3. Confidence-based escalation      │
├─────────────────────────────────────┤
│ 4. Request caching                  │
├─────────────────────────────────────┤
│ 5. Telemetry & alerting             │
└─────────────────────────────────────┘
```
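How layers 1, 2, and 4 compose at request time can be sketched as follows. The budget figure and routing policy are placeholder assumptions, not project values:

```python
class CostController:
    """Illustrative chaining of cost-control layers 1, 2, and 4 (sketch)."""

    def __init__(self, daily_budget_usd: float):
        self.daily_budget_usd = daily_budget_usd
        self.spend_today = 0.0
        self.cache: dict[str, str] = {}          # layer 4: request cache

    def route(self, prompt: str, estimated_cost: float) -> str:
        if prompt in self.cache:                 # layer 4: cache hit, zero cost
            return "cache"
        if self.spend_today + estimated_cost > self.daily_budget_usd:
            return "rejected"                    # layer 1: budget cap reached
        self.spend_today += estimated_cost
        return "slm"                             # layer 2: SLM-first default


ctl = CostController(daily_budget_usd=0.002)
```

Layers 3 and 5 (confidence-based escalation and telemetry) sit around this core and are covered in the reliability and observability sections below.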
| Metric | Target |
|---|---|
| SLM routing % | >80% |
| Cost per 1K requests | <$5 |
| LLM escalation rate | <20% |
| Cache hit rate | >30% |
Alert on any of the following:

- Cost spike >20% day-over-day
- LLM escalation >25%
- Budget utilization >80%
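The three alert conditions above can be encoded as a simple rule set; the function name and argument shapes are illustrative:

```python
def check_alerts(cost_today: float, cost_yesterday: float,
                 escalation_rate: float, budget_used: float) -> list[str]:
    """Evaluate the alert thresholds above; returns triggered alerts."""
    alerts = []
    if cost_yesterday > 0 and (cost_today - cost_yesterday) / cost_yesterday > 0.20:
        alerts.append("cost spike >20% day-over-day")
    if escalation_rate > 0.25:
        alerts.append("LLM escalation >25%")
    if budget_used > 0.80:
        alerts.append("budget utilization >80%")
    return alerts
```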
| Project | Target P99 | Critical Path |
|---|---|---|
| AI Gateway | <100ms | Routing decision |
| PhoenixRooivalk | <50ms | Threat classification |
| CodeFlow | <2s | PR classification |
| Cognitive Mesh | <500ms | Agent selection |
| AgentKit Forge | <1s | Tool selection |
- Model quantization for edge (int4)
- Caching of frequent decisions
- Batch processing for non-critical tasks
- Connection pooling to inference endpoints
4. Edge Deployment (PhoenixRooivalk)
**Critical: SLM is NOT primary.** Never use the SLM for safety-critical decisions.

The SLM is used only for:

- Operator-facing summaries
- Report generation
- Post-mission narratives

Core detection relies on rules, signal models, and the fusion engine.
| Requirement | Solution |
|---|---|
| Hardware diversity | Support Jetson, CPU, mobile |
| Offline operation | Full local inference capability |
| Model updates | OTA with rollback |
| Security | No external connectivity |
```python
# Standard edge optimization pipeline
optimizations = [
    quantization(weights="int4"),
    pruning(structured=0.3),
    distillation(student=phi3_mini),
    compilation(target="cuda|cpu"),
]
```
| Layer | Controls |
|---|---|
| Input | Prompt injection detection, PII filtering |
| Processing | No data leaves boundary |
| Output | Content filtering, audit logging |
| Access | Role-based model access |
```python
async def security_pipeline(request: Request) -> SecurityResult:
    # 1. Prompt injection check
    injection = await slm_check_injection(request.prompt)
    if injection.detected:
        return blocked(injection.reason)

    # 2. PII detection
    pii = await slm_check_pii(request.prompt)
    if pii.found:
        return blocked("PII detected")

    # 3. Policy check
    policy = await slm_check_policy(request.prompt)
    if policy.violation:
        return blocked(policy.violation)

    return allowed()
```
| Concern | Mitigation |
|---|---|
| Model downtime | Fallback models per tier |
| Latency spikes | Timeout + escalation |
| Quality degradation | Continuous evaluation |
| Hallucinations | Confidence thresholds |
```
Request
  │
  ▼ Primary SLM
  │
  ├─ Success        → Return
  ├─ Timeout        → Fallback SLM
  ├─ Low confidence → LLM verification
  └─ Failure        → Error with telemetry
```
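The fallback hierarchy above can be walked in one handler. This is a synchronous sketch; the model callables, confidence threshold, and telemetry hook are all assumptions to be tuned per use case:

```python
CONFIDENCE_THRESHOLD = 0.8  # assumption: tune per use case


def handle(request: str, primary, fallback, llm_verify, log) -> str:
    """Walk the fallback hierarchy (sketch).

    `primary` and `fallback` return (answer, confidence) or raise TimeoutError.
    """
    try:
        answer, confidence = primary(request)
    except TimeoutError:
        log("primary timeout; using fallback SLM")
        answer, confidence = fallback(request)
    except Exception as exc:
        log(f"failure: {exc}")                   # error with telemetry
        raise
    if confidence < CONFIDENCE_THRESHOLD:
        log("low confidence; escalating to LLM verification")
        return llm_verify(request, answer)
    return answer
```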
| Metric Type | Collection |
|---|---|
| Request volume | Per model, per project |
| Latency | P50, P95, P99 per endpoint |
| Error rate | By error type, model |
| Cost | Per project, per user |
| Quality | Accuracy, escalation rate |
- **Cost Dashboard:** Spend by project, model, day
- **Performance Dashboard:** Latency by tier
- **Quality Dashboard:** Accuracy, false positives
| Component | Versioning | Update Frequency |
|---|---|---|
| Models | Semantic (1.0.0) | Monthly evaluation |
| Prompts | Git-based | Per task |
| Infrastructure | Terraform | Per deployment |
```
Discovery → Testing → Staging → Production → Deprecated → Retired
    │          │          │           │            │
    ▼          ▼          ▼           ▼            ▼
 Evaluate   A/B test  Shadow mode   Active     Fallback
```
Project-Specific Concerns
**AI Gateway**

- High-volume routing
- Security-first evaluation
- Real-time cost tracking

**Cognitive Mesh**

- Agent capability mapping
- Task decomposition accuracy
- Multi-agent coordination

**PhoenixRooivalk**

- CRITICAL: SLM NOT for safety decisions
- Edge hardware diversity
- Offline reliability
- Minimal latency

**CodeFlow**

- PR classification accuracy
- CI log analysis quality
- Auto-merge reliability

**AgentKit Forge**

- Tool selection accuracy
- Context compression ratio
- LLM call reduction
Use SLMs to decide, filter, classify, compress, and prepare.
Use LLMs to reason, reconcile, synthesize, and communicate.
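This division of labor can be expressed as a tiny router. The task labels are hypothetical; in practice the task-kind classifier would itself be an ultra-light SLM:

```python
# Hypothetical task labels mirroring the SLM/LLM split stated above.
SLM_TASKS = {"decide", "filter", "classify", "compress", "prepare"}
LLM_TASKS = {"reason", "reconcile", "synthesize", "communicate"}


def pick_tier(task: str) -> str:
    """Route by task kind: SLM for mechanical steps, LLM for open-ended ones."""
    if task in SLM_TASKS:
        return "slm"
    if task in LLM_TASKS:
        return "llm"
    return "llm"  # default to the stronger model when the task is unknown
```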
- Establish model registry with tiered selection
- Implement cost tracking per project
- Set up latency monitoring dashboards
- Create edge deployment pipeline
- Build security check pipeline
- Define fallback hierarchies
- Implement observability stack
- Document model lifecycle process
- Add explicit safety boundary for PhoenixRooivalk