
# Known Limitations & Model Requirements

Selfware is powerful but not magic. This document explains what works, what doesn't, and what model capabilities are needed for each feature.

## Model Size Requirements by Feature

Not every feature works with every model. Here's what to expect:

| Feature | Minimum Model | Recommended | Notes |
|---------|---------------|-------------|-------|
| Simple file edits | 1.5B | 4B | Basic search/replace |
| Bug fixes (easy) | 4B | 9B | Single-file, clear errors |
| Bug fixes (hard) | 9B | 27B+ | Multi-file, subtle logic |
| Test generation | 4B | 9B | Quality scales with size |
| Refactoring | 9B | 27B+ | Needs architectural understanding |
| Multi-step tasks | 9B | 27B+ | Planning + execution |
| Tool calling (native) | 4B | 9B | Requires backend support (sglang/vllm) |
| Tool calling (XML) | 1.5B | 4B | Works with any backend |
| Swarm coordination | 27B+ | 35B+ | Multiple agents need strong reasoning |
| Evolution engine | 27B+ | 35B+ | Hypothesis generation requires deep understanding |
| Code generation (new project) | 4B | 9B | Templates help weaker models |
| Security audit | 27B+ | 35B+ | Requires security domain knowledge |
| Visual analysis (VLM) | 7B VLM | 30B VLM | Requires vision-capable model |
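The XML tool-calling path works with any backend because it is plain text parsing of the model's output rather than a server-side feature. A minimal sketch of the idea in Python (the tag names and the `read_file` tool are illustrative, not Selfware's actual schema):

```python
import re

# Hypothetical XML tool-call emitted as ordinary text by the model.
# Since it's just text, no backend-side tool-calling support is needed.
SAMPLE = """I'll read the file first.
<tool_call>
  <name>read_file</name>
  <arg key="path">src/main.rs</arg>
</tool_call>"""

def parse_xml_tool_call(text):
    """Extract one tool call from model output via regex, or None."""
    m = re.search(r"<tool_call>(.*?)</tool_call>", text, re.DOTALL)
    if m is None:
        return None
    body = m.group(1)
    name = re.search(r"<name>(.*?)</name>", body).group(1)
    args = dict(re.findall(r'<arg key="(.*?)">(.*?)</arg>', body))
    return {"name": name, "args": args}

call = parse_xml_tool_call(SAMPLE)
print(call)  # {'name': 'read_file', 'args': {'path': 'src/main.rs'}}
```

The flip side, as noted in the tips below, is that small models produce malformed XML more often than a backend-enforced native tool-call grammar allows.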

## What Selfware Cannot Do

### Fundamental Limitations

- Cannot guarantee correct code — LLM output is probabilistic, not proven
- Cannot replace code review — generated code should always be reviewed by a human
- Cannot understand business context — it knows code, not your business requirements
- Cannot access the internet (unless you configure MCP servers) — fully local by design

### Safety Limitations

- Cannot prevent all prompt injection — adversarial inputs may influence the agent
- Cannot guarantee sandbox escape is impossible — defense-in-depth, not absolute containment
- Cannot detect all malicious patterns — shell command filtering is regex-based, not a full parser
- Cannot prevent TOCTOU races in all cases — mitigated but not eliminated
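The TOCTOU point can be made concrete. A minimal Python sketch (illustrative, not Selfware's actual file handling) of the racy check-then-use pattern and the standard atomic mitigation:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "agent_output.txt")

# Racy check-then-create: between exists() and open(), another process
# could create (or symlink) the path — the classic TOCTOU window.
if not os.path.exists(path):
    with open(path, "w") as f:
        f.write("unsafe\n")

# Mitigation: atomic create-or-fail. O_EXCL makes the kernel perform the
# existence check and the create in a single step, closing the window.
os.remove(path)
try:
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    with os.fdopen(fd, "w") as f:
        f.write("safe\n")
except FileExistsError:
    pass  # another process won the race — handle it explicitly

print(open(path).read().strip())  # prints: safe
```

Even with `O_EXCL`-style mitigations, races on paths an agent does not fully control cannot be ruled out in every case, which is why the limitation above is "mitigated but not eliminated".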

### Performance Limitations

- Slow with small models — 1.5B models may take minutes per response
- Context window limits — very large codebases may not fit in context even at 128K tokens
- Memory usage scales with context — 128K context on a 27B model needs significant RAM
- No GPU sharing — selfware doesn't manage GPU allocation between agents
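To see why long contexts are memory-hungry, here is a back-of-the-envelope KV-cache estimate. The layer and head counts below are illustrative assumptions for a 27B-class model, not any specific model's published architecture:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, context, bytes_per_value=2):
    """Rough KV-cache size: K and V (factor of 2) stored per layer per token."""
    return 2 * layers * kv_heads * head_dim * context * bytes_per_value

# Assumed 27B-class shape: 46 layers, 16 KV heads, head_dim 128, FP16 cache.
gib = kv_cache_bytes(layers=46, kv_heads=16, head_dim=128, context=131072) / 2**30
print(f"{gib:.1f} GiB")  # prints: 46.0 GiB — on top of the model weights
```

Grouped-query attention, FP8 KV caches, and paged attention all shrink this figure, but the linear scaling with context length remains.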

### Feature Limitations

- Evolution engine is experimental — feature-gated, requires careful configuration
- Computer control is platform-dependent — full support on macOS, limited on Linux/Windows
- PTY shell doesn't work on Windows — uses Unix pseudo-terminals
- LSP requires language servers installed — rust-analyzer, pyright, etc. must be on PATH
- Visual verification requires VLM — standard text models cannot analyze screenshots

## Weaker Model Tips

If you're running a 4B or smaller model:

1. **Use templates** — `selfware init` with `--template rust` scaffolds the project correctly so the model only fills in business logic
2. **Use Plan Mode** — `/plan` lets the model think before acting, reducing errors
3. **Keep tasks small** — "add a function that does X" works better than "build the entire auth system"
4. **Enable native function calling** — set `native_function_calling = true` in config, with an sglang/vllm backend. XML parsing is less reliable with small models
5. **Lower `max_iterations`** — prevent long loops: `max_iterations = 20`
6. **Use the interview mode** — structured questions help the model understand what you want
7. **Check the doctor** — `selfware doctor` verifies your environment is set up correctly
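Putting these tips together, a small-model `selfware.toml` might look like the following. The keys mirror the benchmark config shown later in this document; the endpoint, model name, and `max_tokens` value are placeholders to adjust for your setup:

```toml
endpoint = "http://localhost:8000/v1"
model = "your-4b-model"          # placeholder — use your served model's name
max_tokens = 8192

[agent]
native_function_calling = true   # tip 4: requires an sglang/vllm backend
max_iterations = 20              # tip 5: prevent long loops
```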

## Benchmark Reproduction

To reproduce the SAB 90/100 score:

```bash
# Requirements:
# - Qwen3-Coder-Next-FP8 on an H100 80GB (or equivalent)
# - sglang with native tool calling
# - 131072 context length

# Start the backend
python -m sglang.launch_server \
  --model-path Qwen/Qwen3-Coder-Next-FP8 \
  --context-length 131072 \
  --tool-call-parser qwen \
  --reasoning-parser qwen3 \
  --port 8000

# Configure selfware
cat > selfware.toml << 'EOF'
endpoint = "http://localhost:8000/v1"
model = "Qwen/Qwen3-Coder-Next-FP8"
max_tokens = 65536

[agent]
native_function_calling = true
max_iterations = 100
step_timeout_secs = 600
EOF

# Run the benchmark
export ENDPOINT="http://localhost:8000/v1"
export MODEL="Qwen/Qwen3-Coder-Next-FP8"
export MAX_PARALLEL=6
bash system_tests/projecte2e/run_full_sab.sh
```

Results vary by model, quantization, and context length. The 90/100 score was achieved with:

- Model: Qwen3-Coder-Next-FP8 (FP8 quantization)
- Context: 131072 tokens
- Backend: sglang with native tool calling
- Hardware: RTX 6000 Pro
- Runs: 27 rounds, 323 scenario runs