Open-source implementation of the autonomous AI scientist described in Lu et al. (2024). The original paper reported 79.4% accuracy on scientific statements and 7 validated discoveries, but omitted implementation details for 6 critical components. This repository provides those implementations using patterns from the K-Dense ecosystem.
Current state: All 6 gaps implemented. E2E testing is in progress with local LLMs via Ollama/LiteLLM. See Project Status below for an honest assessment of the remaining work.
The original Kosmos paper demonstrated results but left critical implementation details unspecified. Analysis in OPEN_QUESTIONS.md identified 6 gaps blocking reproduction:
| Gap | Problem | Severity |
|---|---|---|
| 0 | Paper processes 1,500 papers and 42,000 lines of code per run, exceeding any LLM context window | Foundational |
| 1 | State Manager described as "core advancement" but no schema, storage strategy, or update mechanisms provided | Critical |
| 2 | Strategic reasoning algorithm for generating research tasks completely unstated | Critical |
| 3 | System prompts, output formats, and domain expertise injection mechanisms not specified | Critical |
| 4 | Paper contradicts itself on R vs Python usage; code execution environment not described | High |
| 5 | Paper reports 57.9% interpretation accuracy but quality metrics and filtering criteria not specified | Moderate |
Each gap was addressed using patterns from the K-Dense ecosystem. Detailed analysis in OPENQUESTIONS_SOLUTION.md.
Problem: 1,500 papers + 42,000 lines of code cannot fit in any LLM context window.
Solution: Hierarchical 3-tier compression achieving 20:1 ratio.
- Tier 1: Task-level compression (42K lines -> 2-line summary + statistics)
- Tier 2: Cycle-level compression (10 task summaries -> 1 cycle overview)
- Tier 3: Final synthesis with lazy loading for full content retrieval
Pattern source: kosmos-claude-skills-mcp (progressive disclosure)
Implementation: kosmos/compression/
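To make the tiers concrete, here is a minimal sketch of Tier-1 compression. The `TaskSummary` shape and `compress_task` helper are illustrative assumptions, not the API in kosmos/compression/; the `llm` argument stands in for whatever completion client the workflow uses.

```python
# Hypothetical sketch of Tier-1 (task-level) compression; names are
# illustrative, not the actual API in kosmos/compression/.
from dataclasses import dataclass

@dataclass
class TaskSummary:
    headline: str       # 2-line natural-language summary produced by an LLM
    stats: dict         # key numbers extracted from the task output
    artifact_path: str  # pointer used by Tier 3 to lazily load full content

def compress_task(task_output: str, artifact_path: str, llm) -> TaskSummary:
    """Reduce a large task transcript (e.g. 42K lines) to a summary + stats."""
    headline = llm.complete(  # `llm.complete` is an assumed interface
        f"Summarize this analysis in two lines:\n{task_output[:8000]}"
    )
    stats = {"lines": len(task_output.splitlines())}
    return TaskSummary(headline, stats, artifact_path)
```

Tier 2 applies the same move one level up (10 such summaries into one cycle overview), which is how the overall 20:1 ratio is achieved without discarding the full content.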
Problem: Paper's "core advancement" has no schema specification.
Solution: Hybrid 4-layer architecture.
- Layer 1: JSON artifacts (human-readable, version-controllable)
- Layer 2: Knowledge graph (structural queries via Neo4j, optional)
- Layer 3: Vector store (semantic search, optional)
- Layer 4: Citation tracking (evidence chains)
Implementation: kosmos/world_model/artifacts.py
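As a rough illustration of the Layer-1 artifact format, the sketch below writes one world-model entry as a standalone JSON file; the field names and the `save_artifact` helper are assumptions for illustration, not the schema defined in kosmos/world_model/artifacts.py.

```python
# Illustrative Layer-1 JSON artifact writer; fields and helper name are
# assumptions, not the repo's actual schema.
import json
import time
import uuid
from pathlib import Path

def save_artifact(artifacts_dir: str, kind: str, content: dict,
                  citations: list[str] | None = None) -> Path:
    """Write one world-model entry as a human-readable, versionable JSON file."""
    record = {
        "id": str(uuid.uuid4()),
        "kind": kind,                  # e.g. "hypothesis", "result", "task"
        "created_at": time.time(),
        "content": content,
        "citations": citations or [],  # Layer 4: evidence chain pointers
    }
    path = Path(artifacts_dir) / kind / f"{record['id']}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(record, indent=2))
    return path
```

Because Layer 1 is plain JSON on disk, the graph and vector layers can be rebuilt from it, which is why they can stay optional.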
Problem: How does the system generate 10 strategic research tasks per cycle?
Solution: Plan Creator + Plan Reviewer orchestration pattern.
- Plan Creator: Generates tasks with an exploration/exploitation ratio that shifts from 70% exploration in early cycles to 30% in late cycles
- Plan Reviewer: 5-dimension scoring (specificity, relevance, novelty, coverage, feasibility)
- Novelty Detector: Prevents redundant analyses across 200 rollouts
- Delegation Manager: Routes tasks to appropriate agents
Pattern source: kosmos-karpathy (orchestration patterns)
Implementation: kosmos/orchestration/ (1,949 lines across 6 files)
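The scheduling and scoring logic can be sketched in a few lines. Two assumptions here: a linear decay of the exploration ratio across cycles, and equal weights across the reviewer's five dimensions; the actual schedule and weights in kosmos/orchestration/ may differ.

```python
# Sketch of the exploration/exploitation schedule and 5-dimension review
# score; linear decay and equal weights are assumptions.
def exploration_ratio(cycle: int, total_cycles: int,
                      start: float = 0.7, end: float = 0.3) -> float:
    """Interpolate from 70% exploration in early cycles to 30% in late ones."""
    t = cycle / max(total_cycles - 1, 1)
    return start + (end - start) * t

def review_score(scores: dict[str, float]) -> float:
    """Average the reviewer's five dimensions (each scored 0-1)."""
    dims = ["specificity", "relevance", "novelty", "coverage", "feasibility"]
    return sum(scores[d] for d in dims) / len(dims)
```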
Problem: How are domain-specific capabilities injected into agents?
Solution: A skill loader that auto-loads 566 domain-specific scientific prompts by matching them to the active research domain.
Pattern source: kosmos-claude-scientific-skills (566 skills)
Implementation: kosmos/agents/skill_loader.py
Skills submodule: kosmos-claude-scientific-skills/
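A hypothetical sketch of domain-matched loading is shown below; the on-disk layout (one directory per skill containing `skill.json` metadata with a `domains` list and a `prompt.md` body) is an assumption for illustration, not the actual layout of the skills repository.

```python
# Hypothetical domain-matched skill loading; the on-disk layout is assumed.
import json
from pathlib import Path

def load_skills_for_domain(skills_root: str, domain: str) -> list[str]:
    """Return the prompt text of every skill whose metadata matches `domain`."""
    prompts = []
    for meta_file in Path(skills_root).glob("*/skill.json"):
        meta = json.loads(meta_file.read_text())
        if domain in meta.get("domains", []):
            prompts.append((meta_file.parent / "prompt.md").read_text())
    return prompts
```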
Problem: Paper contradicts itself on R vs Python. No execution environment described.
Solution: Docker-based Jupyter sandbox with:
- Container pooling for performance (pre-warmed containers)
- Automatic package resolution and installation
- Resource limits (memory, CPU, timeout)
- Security constraints (network isolation, read-only rootfs, dropped capabilities)
This was the final gap implemented. The execution environment is feature-complete but requires a local Docker daemon; without one, execution falls back to mock implementations.
Implementation: kosmos/execution/
Key files:
- `docker_manager.py` - Container lifecycle management with pooling
- `jupyter_client.py` - Kernel gateway integration for code execution
- `package_resolver.py` - Automatic dependency detection and installation
- `production_executor.py` - Unified execution interface
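For orientation, here is a minimal sketch of a single sandboxed run using the docker-py SDK (container pooling and timeouts omitted); the image name and limit values are illustrative, not the defaults in docker_manager.py.

```python
# Minimal sandboxed execution via docker-py; image and limits are
# illustrative, not the repo's defaults.
import docker

def run_sandboxed(code: str, image: str = "jupyter/scipy-notebook") -> str:
    client = docker.from_env()
    output = client.containers.run(
        image,
        ["python", "-c", code],
        mem_limit="2g",            # resource limit: memory
        nano_cpus=1_000_000_000,   # resource limit: 1 CPU
        network_disabled=True,     # security: network isolation
        read_only=True,            # security: read-only root filesystem
        cap_drop=["ALL"],          # security: drop Linux capabilities
        remove=True,               # clean up the container afterwards
    )
    return output.decode()
```

Pooling changes the lifecycle (containers are pre-warmed and reused rather than created per call), but the resource and security flags shown are the same ones listed above.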
Problem: How are discoveries evaluated before inclusion in reports?
Solution: ScholarEval 8-dimension quality framework with weighted scoring.
Dimensions evaluated:
- Statistical validity
- Reproducibility
- Novelty
- Significance
- Methodological soundness
- Evidence quality
- Claim calibration
- Citation support
Pattern source: kosmos-claude-scientific-writer (validation patterns)
Implementation: kosmos/validation/
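A weighted combination over these dimensions might look like the sketch below; the weights and the 0.6 acceptance threshold are invented for illustration and are not the values used in kosmos/validation/.

```python
# Illustrative weighted scoring over the 8 ScholarEval dimensions; weights
# and threshold are assumptions, not the repo's configuration.
WEIGHTS = {
    "statistical_validity": 0.20,
    "reproducibility": 0.15,
    "novelty": 0.10,
    "significance": 0.15,
    "methodological_soundness": 0.15,
    "evidence_quality": 0.10,
    "claim_calibration": 0.10,
    "citation_support": 0.05,
}

def scholar_eval(scores: dict[str, float], threshold: float = 0.6):
    """Combine per-dimension scores (0-1) into one weighted quality score."""
    total = sum(WEIGHTS[d] * scores[d] for d in WEIGHTS)
    return total, total >= threshold  # (score, include in report?)
```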
This implementation draws from the K-Dense ecosystem:
| Repository | Contribution | Gap |
|---|---|---|
| kosmos-claude-skills-mcp | Context compression, progressive disclosure | 0 |
| kosmos-karpathy | Orchestration, plan creator/reviewer pattern | 2 |
| kosmos-claude-scientific-skills | 566 domain-specific scientific prompts | 3 |
| kosmos-claude-scientific-writer | ScholarEval validation framework | 5 |
Reference repositories in kosmos-reference/. Skills integrated as git subtree at project root.
| Category | Total | Pass | Fail | Skip | Notes |
|---|---|---|---|---|---|
| Unit tests | 273+ | 273+ | 0 | 0 | Core gap implementations |
| LiteLLM provider | 20 | 20 | 0 | 0 | Multi-provider support |
| E2E tests | 39 | 32 | 0 | 7 | Tested with Ollama |
The 7 skipped E2E tests are due to:
- Docker sandbox not configured (1 test)
- Neo4j not configured (1 test)
- Test setup complexity requiring refactoring (4 tests)
- A minor database model issue (1 test)
Verified working:
- Research workflow initialization and hypothesis generation
- Experiment design from hypotheses via LLM
- Result analysis and interpretation
- Multi-provider LLM support (Anthropic, OpenAI, LiteLLM/Ollama)
- Basic research cycle progression
Not yet validated:
- Docker sandbox execution: Code execution currently runs without containerization. The sandbox implementation exists but is not integrated into the tests.
- Knowledge graph: Neo4j integration requires an external database. The code exists but is untested in E2E flows.
- Full autonomous loop: Individual components work, but running 20 cycles with 10 tasks each (as described in the paper) has not been validated; the workflow tends to converge early or to require manual intervention.
- Cost tracking: No mechanism to track or limit API costs during long research runs.
- Literature search: The `arxiv` package has Python 3.11+ compatibility issues, so literature features are limited.
This implementation provides the architectural skeleton described in the Lu et al. paper. The 6 gaps identified in the paper have been filled with working code. However:
- We have not reproduced the paper's claimed 79.4% accuracy or 7 validated discoveries
- The system has been tested primarily with small local models (Qwen 4B via Ollama), not production-scale LLMs
- Multi-cycle autonomous research runs have not been validated end-to-end
- The codebase has accumulated technical debt from rapid development
The project is suitable for experimentation and further development, not production research use.
Planned next steps:
- Validate full research cycles with production LLMs
- Integrate Docker sandbox into E2E test suite
- Fix remaining test setup issues
- Add cost tracking and budget limits
- Document actual performance vs paper claims
Known limitations:
- Docker required: The Gap 4 execution environment requires Docker. Without it, code execution uses mock implementations.
- Dependency compatibility: The `arxiv` package fails to build on Python 3.11+ due to a `sgmllib3k` incompatibility. Literature search features are limited without this package.
- Python only: The paper references R packages (MendelianRandomization, susieR). This implementation is Python-only.
- LLM costs: Running 20 research cycles with 10 tasks each requires significant API usage. There is no cost optimization beyond caching.
- Single-user: No multi-tenancy or user isolation.
- Not a reproduction study: We have not reproduced the paper's 7 validated discoveries. This is an implementation of the architecture, not a validation of the results.
- Integration test maintenance: Some integration tests have API mismatches with the current implementation.
Requirements:
- Python 3.11+
- Anthropic API key or OpenAI API key
- Docker (for sandboxed code execution)
```bash
git clone https://github.com/jimmc414/Kosmos.git
cd Kosmos
pip install -e .
cp .env.example .env
# Edit .env and set ANTHROPIC_API_KEY or OPENAI_API_KEY
```

```bash
# Run smoke tests
python scripts/smoke_test.py

# Run unit tests for gap modules
pytest tests/unit/compression/ tests/unit/orchestration/ \
    tests/unit/validation/ tests/unit/workflow/ \
    tests/unit/agents/test_skill_loader.py \
    tests/unit/world_model/test_artifacts.py -v
```

```python
import asyncio
from kosmos.workflow.research_loop import ResearchWorkflow

async def run():
    workflow = ResearchWorkflow(
        research_objective="Your research question here",
        artifacts_dir="./artifacts"
    )
    result = await workflow.run(num_cycles=5, tasks_per_cycle=10)
    report = await workflow.generate_report()
    print(report)

asyncio.run(run())
```

See GETTING_STARTED.md for detailed examples.
All configuration via environment variables. See .env.example for full list.
```bash
# Anthropic (default)
LLM_PROVIDER=anthropic
ANTHROPIC_API_KEY=sk-ant-...

# OpenAI
LLM_PROVIDER=openai
OPENAI_API_KEY=sk-...
OPENAI_MODEL=gpt-5-nano

# LiteLLM (supports 100+ providers including local models)
LLM_PROVIDER=litellm
LITELLM_MODEL=ollama/llama3.1:8b
LITELLM_API_BASE=http://localhost:11434
LITELLM_TIMEOUT=300

# DeepSeek via LiteLLM
LLM_PROVIDER=litellm
LITELLM_MODEL=deepseek/deepseek-chat
LITELLM_API_KEY=sk-...
```

```bash
# Neo4j (optional, for knowledge graph features)
NEO4J_URI=bolt://localhost:7687
NEO4J_PASSWORD=your-password

# Redis (optional, for distributed caching)
REDIS_URL=redis://localhost:6379
```
```
kosmos/
├── compression/    # Gap 0: Context compression (20:1 ratio)
├── world_model/    # Gap 1: State manager (JSON artifacts + optional graph)
├── orchestration/  # Gap 2: Task generation (plan creator/reviewer)
├── agents/         # Gap 3: Agent integration (skill loader)
├── execution/      # Gap 4: Sandboxed execution (Docker + Jupyter)
├── validation/     # Gap 5: Discovery validation (ScholarEval)
├── workflow/       # Integration layer combining all components
├── core/           # LLM clients, configuration
├── literature/     # Literature search (arXiv, PubMed, Semantic Scholar)
├── knowledge/      # Vector store, embeddings
└── cli/            # Command-line interface
```
- OPEN_QUESTIONS.md - Original gap analysis
- OPENQUESTIONS_SOLUTION.md - How gaps were addressed
- IMPLEMENTATION_REPORT.md - Architecture decisions
- PRODUCTION_READINESS_REPORT.md - Current status
- TESTS_STATUS.md - Test coverage
- GETTING_STARTED.md - Usage examples
- Paper: Kosmos: An AI Scientist for Autonomous Discovery (Lu et al., 2024)
- K-Dense ecosystem: Pattern repositories for AI agent systems
- kosmos-figures: Analysis patterns
See CONTRIBUTING.md.
Areas where contributions would be useful:
- Docker sandbox testing and hardening
- Integration test updates
- R language support via rpy2
- Additional scientific domain skills
- Performance benchmarking
MIT License - see LICENSE.
- Version: 0.2.0-alpha
- Gap Implementation: 6/6 complete
- Test Coverage: 273 unit tests passing (core modules)
- Next Step: End-to-end integration testing
- Last Updated: 2025-11-25