Open-source implementation of the autonomous AI scientist described in Lu et al. (2024). The original paper reported 79.4% accuracy on scientific statements and 7 validated discoveries, but omitted implementation details for 6 critical components. This repository provides those implementations using patterns from the K-Dense ecosystem.
Current state: All 6 gaps implemented. E2E testing is in progress with local LLMs via Ollama/LiteLLM. See Project Status below for an honest assessment of the remaining work.
The original Kosmos paper demonstrated results but left critical implementation details unspecified. Analysis in OPEN_QUESTIONS.md identified 6 gaps blocking reproduction:
| Gap | Problem | Severity |
|---|---|---|
| 0 | Paper processes 1,500 papers and 42,000 lines of code per run, exceeding any LLM context window | Foundational |
| 1 | State Manager described as "core advancement" but no schema, storage strategy, or update mechanisms provided | Critical |
| 2 | Strategic reasoning algorithm for generating research tasks completely unstated | Critical |
| 3 | System prompts, output formats, and domain expertise injection mechanisms not specified | Critical |
| 4 | Paper contradicts itself on R vs Python usage; code execution environment not described | High |
| 5 | Paper reports 57.9% interpretation accuracy but quality metrics and filtering criteria not specified | Moderate |
Each gap was addressed using patterns from the K-Dense ecosystem. Detailed analysis in OPENQUESTIONS_SOLUTION.md.
Problem: 1,500 papers + 42,000 lines of code cannot fit in any LLM context window.
Solution: Hierarchical 3-tier compression achieving 20:1 ratio.
- Tier 1: Task-level compression (42K lines -> 2-line summary + statistics)
- Tier 2: Cycle-level compression (10 task summaries -> 1 cycle overview)
- Tier 3: Final synthesis with lazy loading for full content retrieval
Pattern source: kosmos-claude-skills-mcp (progressive disclosure)
Implementation: kosmos/compression/
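To make the tiers concrete, here is a minimal sketch of Tier-1 compression. The `TaskSummary` shape and `compress_task` helper are illustrative assumptions, not the API in kosmos/compression/; the `llm` argument stands in for whatever completion client the workflow uses.

```python
# Hypothetical sketch of Tier-1 (task-level) compression; names are
# illustrative, not the actual API in kosmos/compression/.
from dataclasses import dataclass

@dataclass
class TaskSummary:
    headline: str       # 2-line natural-language summary produced by an LLM
    stats: dict         # key numbers extracted from the task output
    artifact_path: str  # pointer used by Tier 3 to lazily load full content

def compress_task(task_output: str, artifact_path: str, llm) -> TaskSummary:
    """Reduce a large task transcript (e.g. 42K lines) to a summary + stats."""
    headline = llm.complete(  # `llm.complete` is an assumed interface
        f"Summarize this analysis in two lines:\n{task_output[:8000]}"
    )
    stats = {"lines": len(task_output.splitlines())}
    return TaskSummary(headline, stats, artifact_path)
```

Tier 2 applies the same move one level up (10 such summaries into one cycle overview), which is how the overall 20:1 ratio is achieved without discarding the full content.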
Problem: Paper's "core advancement" has no schema specification.
Solution: Hybrid 4-layer architecture.
- Layer 1: JSON artifacts (human-readable, version-controllable)
- Layer 2: Knowledge graph (structural queries via Neo4j, optional)
- Layer 3: Vector store (semantic search, optional)
- Layer 4: Citation tracking (evidence chains)
Implementation: kosmos/world_model/artifacts.py
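As a rough illustration of the Layer-1 artifact format, the sketch below writes one world-model entry as a standalone JSON file; the field names and the `save_artifact` helper are assumptions for illustration, not the schema defined in kosmos/world_model/artifacts.py.

```python
# Illustrative Layer-1 JSON artifact writer; fields and helper name are
# assumptions, not the repo's actual schema.
import json
import time
import uuid
from pathlib import Path

def save_artifact(artifacts_dir: str, kind: str, content: dict,
                  citations: list[str] | None = None) -> Path:
    """Write one world-model entry as a human-readable, versionable JSON file."""
    record = {
        "id": str(uuid.uuid4()),
        "kind": kind,                  # e.g. "hypothesis", "result", "task"
        "created_at": time.time(),
        "content": content,
        "citations": citations or [],  # Layer 4: evidence chain pointers
    }
    path = Path(artifacts_dir) / kind / f"{record['id']}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(record, indent=2))
    return path
```

Because Layer 1 is plain JSON on disk, the graph and vector layers can be rebuilt from it, which is why they can stay optional.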
Problem: How does the system generate 10 strategic research tasks per cycle?
Solution: Plan Creator + Plan Reviewer orchestration pattern.
- Plan Creator: Generates tasks with an exploration/exploitation ratio that shifts from 70% exploration in early cycles to 30% in late cycles
- Plan Reviewer: 5-dimension scoring (specificity, relevance, novelty, coverage, feasibility)
- Novelty Detector: Prevents redundant analyses across 200 rollouts
- Delegation Manager: Routes tasks to appropriate agents
Pattern source: kosmos-karpathy (orchestration patterns)
Implementation: kosmos/orchestration/ (1,949 lines across 6 files)
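The scheduling and scoring logic can be sketched in a few lines. Two assumptions here: a linear decay of the exploration ratio across cycles, and equal weights across the reviewer's five dimensions; the actual schedule and weights in kosmos/orchestration/ may differ.

```python
# Sketch of the exploration/exploitation schedule and 5-dimension review
# score; linear decay and equal weights are assumptions.
def exploration_ratio(cycle: int, total_cycles: int,
                      start: float = 0.7, end: float = 0.3) -> float:
    """Interpolate from 70% exploration in early cycles to 30% in late ones."""
    t = cycle / max(total_cycles - 1, 1)
    return start + (end - start) * t

def review_score(scores: dict[str, float]) -> float:
    """Average the reviewer's five dimensions (each scored 0-1)."""
    dims = ["specificity", "relevance", "novelty", "coverage", "feasibility"]
    return sum(scores[d] for d in dims) / len(dims)
```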
Problem: How are domain-specific capabilities injected into agents?
Solution: A skill loader that auto-loads 566 domain-specific scientific prompts by matching them to the active research domain.
Pattern source: kosmos-claude-scientific-skills (566 skills)
Implementation: kosmos/agents/skill_loader.py
Skills submodule: kosmos-claude-scientific-skills/
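A hypothetical sketch of domain-matched loading is shown below; the on-disk layout (one directory per skill containing `skill.json` metadata with a `domains` list and a `prompt.md` body) is an assumption for illustration, not the actual layout of the skills repository.

```python
# Hypothetical domain-matched skill loading; the on-disk layout is assumed.
import json
from pathlib import Path

def load_skills_for_domain(skills_root: str, domain: str) -> list[str]:
    """Return the prompt text of every skill whose metadata matches `domain`."""
    prompts = []
    for meta_file in Path(skills_root).glob("*/skill.json"):
        meta = json.loads(meta_file.read_text())
        if domain in meta.get("domains", []):
            prompts.append((meta_file.parent / "prompt.md").read_text())
    return prompts
```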
Problem: Paper contradicts itself on R vs Python. No execution environment described.
Solution: Docker-based Jupyter sandbox with:
- Container pooling for performance (pre-warmed containers)
- Automatic package resolution and installation
- Resource limits (memory, CPU, timeout)
- Security constraints (network isolation, read-only rootfs, dropped capabilities)
This was the final gap implemented. The execution environment is feature-complete but requires a local Docker daemon; without one, execution falls back to mock implementations.
Implementation: kosmos/execution/
Key files:
- `docker_manager.py` - Container lifecycle management with pooling
- `jupyter_client.py` - Kernel gateway integration for code execution
- `package_resolver.py` - Automatic dependency detection and installation
- `production_executor.py` - Unified execution interface
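For orientation, here is a minimal sketch of a single sandboxed run using the docker-py SDK (container pooling and timeouts omitted); the image name and limit values are illustrative, not the defaults in docker_manager.py.

```python
# Minimal sandboxed execution via docker-py; image and limits are
# illustrative, not the repo's defaults.
import docker

def run_sandboxed(code: str, image: str = "jupyter/scipy-notebook") -> str:
    client = docker.from_env()
    output = client.containers.run(
        image,
        ["python", "-c", code],
        mem_limit="2g",            # resource limit: memory
        nano_cpus=1_000_000_000,   # resource limit: 1 CPU
        network_disabled=True,     # security: network isolation
        read_only=True,            # security: read-only root filesystem
        cap_drop=["ALL"],          # security: drop Linux capabilities
        remove=True,               # clean up the container afterwards
    )
    return output.decode()
```

Pooling changes the lifecycle (containers are pre-warmed and reused rather than created per call), but the resource and security flags shown are the same ones listed above.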
Problem: How are discoveries evaluated before inclusion in reports?
Solution: ScholarEval 8-dimension quality framework with weighted scoring.
Dimensions evaluated:
- Statistical validity
- Reproducibility
- Novelty
- Significance
- Methodological soundness
- Evidence quality
- Claim calibration
- Citation support
Pattern source: kosmos-claude-scientific-writer (validation patterns)
Implementation: kosmos/validation/
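A weighted combination over these dimensions might look like the sketch below; the weights and the 0.6 acceptance threshold are invented for illustration and are not the values used in kosmos/validation/.

```python
# Illustrative weighted scoring over the 8 ScholarEval dimensions; weights
# and threshold are assumptions, not the repo's configuration.
WEIGHTS = {
    "statistical_validity": 0.20,
    "reproducibility": 0.15,
    "novelty": 0.10,
    "significance": 0.15,
    "methodological_soundness": 0.15,
    "evidence_quality": 0.10,
    "claim_calibration": 0.10,
    "citation_support": 0.05,
}

def scholar_eval(scores: dict[str, float], threshold: float = 0.6):
    """Combine per-dimension scores (0-1) into one weighted quality score."""
    total = sum(WEIGHTS[d] * scores[d] for d in WEIGHTS)
    return total, total >= threshold  # (score, include in report?)
```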
This implementation draws from the K-Dense ecosystem:
| Repository | Contribution | Gap |
|---|---|---|
| kosmos-claude-skills-mcp | Context compression, progressive disclosure | 0 |
| kosmos-karpathy | Orchestration, plan creator/reviewer pattern | 2 |
| kosmos-claude-scientific-skills | 566 domain-specific scientific prompts | 3 |
| kosmos-claude-scientific-writer | ScholarEval validation framework | 5 |
Reference repositories in kosmos-reference/. Skills integrated as git subtree at project root.
| Category | Total | Pass | Fail | Skip | Notes |
|---|---|---|---|---|---|
| Unit tests | 273+ | 273+ | 0 | 0 | Core gap implementations |
| LiteLLM provider | 20 | 20 | 0 | 0 | Multi-provider support |
| E2E tests | 39 | 32 | 0 | 7 | Tested with Ollama |
The 7 skipped E2E tests are due to:
- Docker sandbox not configured (1 test)
- Neo4j not configured (1 test)
- Test setup complexity requiring refactoring (4 tests)
- A minor database model issue (1 test)
Verified working:
- Research workflow initialization and hypothesis generation
- Experiment design from hypotheses via LLM
- Result analysis and interpretation
- Multi-provider LLM support (Anthropic, OpenAI, LiteLLM/Ollama)
- Basic research cycle progression
Not yet validated:
- Docker sandbox execution: Code execution currently runs without containerization. The sandbox implementation exists but is not integrated into the tests.
- Knowledge graph: Neo4j integration requires an external database. The code exists but is untested in E2E flows.
- Full autonomous loop: Individual components work, but running 20 cycles with 10 tasks each (as described in the paper) has not been validated; the workflow tends to converge early or to require manual intervention.
- Cost tracking: No mechanism to track or limit API costs during long research runs.
- Literature search: The `arxiv` package has Python 3.11+ compatibility issues, so literature features are limited.
This implementation provides the architectural skeleton described in the Lu et al. paper. The 6 gaps identified in the paper have been filled with working code. However:
- We have not reproduced the paper's claimed 79.4% accuracy or 7 validated discoveries
- The system has been tested primarily with small local models (Qwen 4B via Ollama), not production-scale LLMs
- Multi-cycle autonomous research runs have not been validated end-to-end
- The codebase has accumulated technical debt from rapid development
The project is suitable for experimentation and further development, not production research use.
Planned next steps:
- Validate full research cycles with production LLMs
- Integrate Docker sandbox into E2E test suite
- Fix remaining test setup issues
- Add cost tracking and budget limits
- Document actual performance vs paper claims
Known limitations:
- Docker required: The Gap 4 execution environment requires Docker. Without it, code execution uses mock implementations.
- Dependency compatibility: The `arxiv` package fails to build on Python 3.11+ due to a `sgmllib3k` incompatibility. Literature search features are limited without this package.
- Python only: The paper references R packages (MendelianRandomization, susieR). This implementation is Python-only.
- LLM costs: Running 20 research cycles with 10 tasks each requires significant API usage. There is no cost optimization beyond caching.
- Single-user: No multi-tenancy or user isolation.
- Not a reproduction study: We have not reproduced the paper's 7 validated discoveries. This is an implementation of the architecture, not a validation of the results.
- Integration test maintenance: Some integration tests have API mismatches with the current implementation.
Requirements:
- Python 3.11+
- Anthropic API key or OpenAI API key
- Docker (for sandboxed code execution)
```bash
git clone https://github.com/jimmc414/Kosmos.git
cd Kosmos
pip install -e .
cp .env.example .env
# Edit .env and set ANTHROPIC_API_KEY or OPENAI_API_KEY
```

```bash
# Run smoke tests
python scripts/smoke_test.py

# Run unit tests for gap modules
pytest tests/unit/compression/ tests/unit/orchestration/ \
    tests/unit/validation/ tests/unit/workflow/ \
    tests/unit/agents/test_skill_loader.py \
    tests/unit/world_model/test_artifacts.py -v
```

```python
import asyncio
from kosmos.workflow.research_loop import ResearchWorkflow

async def run():
    workflow = ResearchWorkflow(
        research_objective="Your research question here",
        artifacts_dir="./artifacts"
    )
    result = await workflow.run(num_cycles=5, tasks_per_cycle=10)
    report = await workflow.generate_report()
    print(report)

asyncio.run(run())
```

See GETTING_STARTED.md for detailed examples.
All configuration via environment variables. See .env.example for full list.
```bash
# Anthropic (default)
LLM_PROVIDER=anthropic
ANTHROPIC_API_KEY=sk-ant-...

# OpenAI
LLM_PROVIDER=openai
OPENAI_API_KEY=sk-...
OPENAI_MODEL=gpt-5-nano

# LiteLLM (supports 100+ providers including local models)
LLM_PROVIDER=litellm
LITELLM_MODEL=ollama/llama3.1:8b
LITELLM_API_BASE=http://localhost:11434
LITELLM_TIMEOUT=300

# DeepSeek via LiteLLM
LLM_PROVIDER=litellm
LITELLM_MODEL=deepseek/deepseek-chat
LITELLM_API_KEY=sk-...
```

```bash
# Neo4j (optional, for knowledge graph features)
NEO4J_URI=bolt://localhost:7687
NEO4J_PASSWORD=your-password

# Redis (optional, for distributed caching)
REDIS_URL=redis://localhost:6379
```
```
kosmos/
├── compression/    # Gap 0: Context compression (20:1 ratio)
├── world_model/    # Gap 1: State manager (JSON artifacts + optional graph)
├── orchestration/  # Gap 2: Task generation (plan creator/reviewer)
├── agents/         # Gap 3: Agent integration (skill loader)
├── execution/      # Gap 4: Sandboxed execution (Docker + Jupyter)
├── validation/     # Gap 5: Discovery validation (ScholarEval)
├── workflow/       # Integration layer combining all components
├── core/           # LLM clients, configuration
├── literature/     # Literature search (arXiv, PubMed, Semantic Scholar)
├── knowledge/      # Vector store, embeddings
└── cli/            # Command-line interface
```
- OPEN_QUESTIONS.md - Original gap analysis
- OPENQUESTIONS_SOLUTION.md - How gaps were addressed
- IMPLEMENTATION_REPORT.md - Architecture decisions
- PRODUCTION_READINESS_REPORT.md - Current status
- TESTS_STATUS.md - Test coverage
- GETTING_STARTED.md - Usage examples
- Paper: Kosmos: An AI Scientist for Autonomous Discovery (Lu et al., 2024)
- K-Dense ecosystem: Pattern repositories for AI agent systems
- kosmos-figures: Analysis patterns
See CONTRIBUTING.md.
Areas where contributions would be useful:
- Docker sandbox testing and hardening
- Integration test updates
- R language support via rpy2
- Additional scientific domain skills
- Performance benchmarking
MIT License - see LICENSE.
- Version: 0.2.0-alpha
- Gap Implementation: 6/6 complete
- Test Coverage: 273 unit tests passing (core modules)
- Next Step: End-to-end integration testing
- Last Updated: 2025-11-25