Kosmos

Open-source implementation of the autonomous AI scientist described in Lu et al. (2024). The original paper reported 79.4% accuracy on scientific statements and 7 validated discoveries, but omitted implementation details for 6 critical components. This repository provides those implementations using patterns from the K-Dense ecosystem.


Current state: All 6 gaps implemented. E2E testing in progress with local LLMs via Ollama/LiteLLM. See Project Status for an honest assessment of the remaining work.

Paper Gap Analysis

The original Kosmos paper demonstrated results but left critical implementation details unspecified. Analysis in OPEN_QUESTIONS.md identified 6 gaps blocking reproduction:

| Gap | Problem | Severity |
|-----|---------|----------|
| 0 | Paper processes 1,500 papers and 42,000 lines of code per run, exceeding any LLM context window | Foundational |
| 1 | State Manager described as "core advancement" but no schema, storage strategy, or update mechanisms provided | Critical |
| 2 | Strategic reasoning algorithm for generating research tasks completely unstated | Critical |
| 3 | System prompts, output formats, and domain expertise injection mechanisms not specified | Critical |
| 4 | Paper contradicts itself on R vs Python usage; code execution environment not described | High |
| 5 | Paper reports 57.9% interpretation accuracy but quality metrics and filtering criteria not specified | Moderate |

Gap Solutions

Each gap was addressed using patterns from the K-Dense ecosystem. Detailed analysis in OPENQUESTIONS_SOLUTION.md.

Gap 0: Context Compression (Complete)

Problem: 1,500 papers + 42,000 lines of code cannot fit in any LLM context window.

Solution: Hierarchical 3-tier compression achieving 20:1 ratio.

  • Tier 1: Task-level compression (42K lines -> 2-line summary + statistics)
  • Tier 2: Cycle-level compression (10 task summaries -> 1 cycle overview)
  • Tier 3: Final synthesis with lazy loading for full content retrieval

Pattern source: kosmos-claude-skills-mcp (progressive disclosure)

Implementation: kosmos/compression/
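The three tiers above can be sketched as follows. This is an illustrative outline, not the actual kosmos/compression/ API; the class and function names here (TaskSummary, compress_task, compress_cycle) are assumptions, and the real Tier 1 step would call an LLM rather than slice lines.

```python
from dataclasses import dataclass

@dataclass
class TaskSummary:
    """Tier 1 output: a long task transcript reduced to a 2-line summary plus stats."""
    headline: str  # one-line result statement
    method: str    # one-line description of how the result was obtained
    stats: dict    # e.g. {"lines_analyzed": 42000, "p_value": 0.01}

def compress_task(transcript: str, stats: dict) -> TaskSummary:
    # Stand-in for an LLM summarization call: keep first and last non-empty lines.
    lines = [ln for ln in transcript.splitlines() if ln.strip()]
    headline = lines[0] if lines else ""
    method = lines[-1] if len(lines) > 1 else ""
    return TaskSummary(headline=headline, method=method, stats=stats)

def compress_cycle(summaries: list[TaskSummary]) -> str:
    """Tier 2: fold ~10 task summaries into one cycle overview string."""
    return " | ".join(s.headline for s in summaries)
```

Tier 3 would then synthesize cycle overviews into a report, lazily re-loading full transcripts only when a claim needs verification.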

Gap 1: State Manager (Complete)

Problem: Paper's "core advancement" has no schema specification.

Solution: Hybrid 4-layer architecture.

  • Layer 1: JSON artifacts (human-readable, version-controllable)
  • Layer 2: Knowledge graph (structural queries via Neo4j, optional)
  • Layer 3: Vector store (semantic search, optional)
  • Layer 4: Citation tracking (evidence chains)

Implementation: kosmos/world_model/artifacts.py

Gap 2: Task Generation (Complete)

Problem: How does the system generate 10 strategic research tasks per cycle?

Solution: Plan Creator + Plan Reviewer orchestration pattern.

  • Plan Creator: Generates tasks with exploration/exploitation ratio (70% early cycles, 30% late cycles)
  • Plan Reviewer: 5-dimension scoring (specificity, relevance, novelty, coverage, feasibility)
  • Novelty Detector: Prevents redundant analyses across 200 rollouts
  • Delegation Manager: Routes tasks to appropriate agents

Pattern source: kosmos-karpathy (orchestration patterns)

Implementation: kosmos/orchestration/ (1,949 lines across 6 files)
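The reviewer scoring and the exploration/exploitation schedule can be sketched as below. The equal dimension weighting and the linear decay from 70% to 30% are assumptions consistent with the ratios stated above, not the exact formulas in kosmos/orchestration/.

```python
# The five dimensions the Plan Reviewer scores each candidate task on.
DIMENSIONS = ("specificity", "relevance", "novelty", "coverage", "feasibility")

def review_score(scores: dict[str, float]) -> float:
    """Mean of the five dimension scores, each assumed to be in [0, 1]."""
    return sum(scores[d] for d in DIMENSIONS) / len(DIMENSIONS)

def exploration_ratio(cycle: int, total_cycles: int) -> float:
    """Linear decay from 70% exploration in the first cycle to 30% in the last."""
    t = cycle / max(total_cycles - 1, 1)
    return 0.7 + (0.3 - 0.7) * t
```

The Plan Creator would sample roughly `exploration_ratio(cycle, total)` of each cycle's 10 tasks from unexplored directions, and the reviewer would reject tasks whose score falls below a threshold.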

Gap 3: Agent Integration (Complete)

Problem: How are domain-specific capabilities injected into agents?

Solution: Skill loader with 566 domain-specific scientific prompts auto-loaded by domain matching.

Pattern source: kosmos-claude-scientific-skills (566 skills)

Implementation: kosmos/agents/skill_loader.py

Skills submodule: kosmos-claude-scientific-skills/
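Domain matching in the skill loader can be sketched as a keyword lookup over the skill index. This is a simplified illustration; the actual kosmos/agents/skill_loader.py presumably does more robust matching over the 566 skill prompts.

```python
def match_skills(objective: str, skills: dict[str, list[str]]) -> list[str]:
    """Return the skill prompts whose domain keyword appears in the
    research objective (case-insensitive)."""
    objective = objective.lower()
    matched: list[str] = []
    for domain, prompts in skills.items():
        if domain.lower() in objective:
            matched.extend(prompts)
    return matched
```

The matched prompts would then be injected into the agent's system prompt before task execution.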

Gap 4: Execution Environment (Complete)

Problem: Paper contradicts itself on R vs Python. No execution environment described.

Solution: Docker-based Jupyter sandbox with:

  • Container pooling for performance (pre-warmed containers)
  • Automatic package resolution and installation
  • Resource limits (memory, CPU, timeout)
  • Security constraints (network isolation, read-only rootfs, dropped capabilities)

This was the final gap implemented. The execution environment runs when Docker is available, but it is not yet integrated into the E2E test suite (see Project Status).

Implementation: kosmos/execution/

Key files:

  • docker_manager.py - Container lifecycle management with pooling
  • jupyter_client.py - Kernel gateway integration for code execution
  • package_resolver.py - Automatic dependency detection and installation
  • production_executor.py - Unified execution interface
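The resource and security constraints listed above map naturally onto Docker SDK run options. The sketch below expresses them as docker-py `containers.run` keyword arguments; the image name and limit values are illustrative assumptions, and the real docker_manager.py may configure containers differently.

```python
def sandbox_run_kwargs(image: str = "jupyter/scipy-notebook") -> dict:
    """Illustrative container settings for a sandboxed execution environment."""
    return {
        "image": image,
        "detach": True,
        "mem_limit": "2g",           # resource limit: memory
        "nano_cpus": 2_000_000_000,  # resource limit: 2 CPUs
        "network_mode": "none",      # security: network isolation
        "read_only": True,           # security: read-only rootfs
        "cap_drop": ["ALL"],         # security: dropped Linux capabilities
    }

# With a Docker daemon available (not assumed here):
# import docker
# container = docker.from_env().containers.run(**sandbox_run_kwargs())
```

Container pooling would keep a few such containers pre-warmed so each task pays only kernel-connect latency, not image startup.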

Gap 5: Discovery Validation (Complete)

Problem: How are discoveries evaluated before inclusion in reports?

Solution: ScholarEval 8-dimension quality framework with weighted scoring.

Dimensions evaluated:

  1. Statistical validity
  2. Reproducibility
  3. Novelty
  4. Significance
  5. Methodological soundness
  6. Evidence quality
  7. Claim calibration
  8. Citation support

Pattern source: kosmos-claude-scientific-writer (validation patterns)

Implementation: kosmos/validation/
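Weighted scoring over the eight dimensions can be sketched as below. The weights and pass threshold here are assumptions for illustration, not the values used in kosmos/validation/.

```python
# Assumed weights over the 8 ScholarEval dimensions; they sum to 1.0.
WEIGHTS = {
    "statistical_validity": 0.20,
    "reproducibility": 0.15,
    "novelty": 0.10,
    "significance": 0.15,
    "methodological_soundness": 0.15,
    "evidence_quality": 0.10,
    "claim_calibration": 0.05,
    "citation_support": 0.10,
}

def scholar_eval(scores: dict[str, float], threshold: float = 0.6) -> tuple[float, bool]:
    """Weighted mean of dimension scores (each in [0, 1]) and whether the
    discovery clears the inclusion threshold."""
    total = sum(WEIGHTS[d] * scores[d] for d in WEIGHTS)
    return total, total >= threshold
```

Discoveries below the threshold would be dropped or sent back for additional evidence gathering before report inclusion.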

K-Dense Pattern Sources

This implementation draws from the K-Dense ecosystem:

| Repository | Contribution | Gap |
|------------|--------------|-----|
| kosmos-claude-skills-mcp | Context compression, progressive disclosure | 0 |
| kosmos-karpathy | Orchestration, plan creator/reviewer pattern | 2 |
| kosmos-claude-scientific-skills | 566 domain-specific scientific prompts | 3 |
| kosmos-claude-scientific-writer | ScholarEval validation framework | 5 |

Reference repositories in kosmos-reference/. Skills integrated as git subtree at project root.

Project Status

Test Results (as of 2025-11-27)

| Category | Total | Pass | Fail | Skip | Notes |
|----------|-------|------|------|------|-------|
| Unit tests | 273+ | 273+ | 0 | 0 | Core gap implementations |
| LiteLLM provider | 20 | 20 | 0 | 0 | Multi-provider support |
| E2E tests | 39 | 32 | 0 | 7 | Tested with Ollama |

The 7 skipped E2E tests are due to:

  • Docker sandbox not configured (1 test)
  • Neo4j not configured (1 test)
  • Test setup complexity requiring refactoring (4 tests)
  • Database model minor issue (1 test)

What Works

  • Research workflow initialization and hypothesis generation
  • Experiment design from hypotheses via LLM
  • Result analysis and interpretation
  • Multi-provider LLM support (Anthropic, OpenAI, LiteLLM/Ollama)
  • Basic research cycle progression

What Does Not Work Yet

  1. Docker sandbox execution: Code execution currently runs without containerization. The sandbox implementation exists but is not integrated into tests.

  2. Knowledge graph: Neo4j integration requires external database setup. The code exists but is untested in E2E flows.

  3. Full autonomous loop: While individual components work, running 20 cycles with 10 tasks each (as described in the paper) has not been validated. The workflow tends to converge early or requires manual intervention.

  4. Cost tracking: No mechanism to track or limit API costs during long research runs.

  5. Literature search: The arxiv package has Python 3.11+ compatibility issues. Literature features are limited.

Honest Assessment

This implementation provides the architectural skeleton described in the Lu et al. paper. The 6 gaps identified in the paper have been filled with working code. However:

  • We have not reproduced the paper's claimed 79.4% accuracy or 7 validated discoveries
  • The system has been tested primarily with small local models (Qwen 4B via Ollama), not production-scale LLMs
  • Multi-cycle autonomous research runs have not been validated end-to-end
  • The codebase has accumulated technical debt from rapid development

The project is suitable for experimentation and further development, not production research use.

Next Steps

  1. Validate full research cycles with production LLMs
  2. Integrate Docker sandbox into E2E test suite
  3. Fix remaining test setup issues
  4. Add cost tracking and budget limits
  5. Document actual performance vs paper claims

Limitations

  1. Docker required: Gap 4 execution environment requires Docker. Without it, code execution uses mock implementations.

  2. Dependency compatibility: The arxiv package fails to build on Python 3.11+ due to sgmllib3k incompatibility. Literature search features are limited without this package.

  3. Python only: The paper references R packages (MendelianRandomization, susieR). This implementation is Python-only.

  4. LLM costs: Running 20 research cycles with 10 tasks each requires significant API usage. No cost optimization beyond caching.

  5. Single-user: No multi-tenancy or user isolation.

  6. Not a reproduction study: We have not reproduced the paper's 7 validated discoveries. This is an implementation of the architecture, not a validation of the results.

  7. Integration test maintenance: Some integration tests have API mismatches with current implementation.

Getting Started

Requirements

  • Python 3.11+
  • Anthropic API key or OpenAI API key
  • Docker (for sandboxed code execution)

Installation

git clone https://github.com/jimmc414/Kosmos.git
cd Kosmos
pip install -e .
cp .env.example .env
# Edit .env and set ANTHROPIC_API_KEY or OPENAI_API_KEY

Verify Installation

# Run smoke tests
python scripts/smoke_test.py

# Run unit tests for gap modules
pytest tests/unit/compression/ tests/unit/orchestration/ \
       tests/unit/validation/ tests/unit/workflow/ \
       tests/unit/agents/test_skill_loader.py \
       tests/unit/world_model/test_artifacts.py -v

Run Research Workflow

import asyncio
from kosmos.workflow.research_loop import ResearchWorkflow

async def run():
    workflow = ResearchWorkflow(
        research_objective="Your research question here",
        artifacts_dir="./artifacts"
    )
    result = await workflow.run(num_cycles=5, tasks_per_cycle=10)
    report = await workflow.generate_report()
    print(report)

asyncio.run(run())

See GETTING_STARTED.md for detailed examples.

Configuration

All configuration via environment variables. See .env.example for full list.

LLM Provider

# Anthropic (default)
LLM_PROVIDER=anthropic
ANTHROPIC_API_KEY=sk-ant-...

# OpenAI
LLM_PROVIDER=openai
OPENAI_API_KEY=sk-...
OPENAI_MODEL=gpt-5-nano

# LiteLLM (supports 100+ providers including local models)
LLM_PROVIDER=litellm
LITELLM_MODEL=ollama/llama3.1:8b
LITELLM_API_BASE=http://localhost:11434
LITELLM_TIMEOUT=300

# DeepSeek via LiteLLM
LLM_PROVIDER=litellm
LITELLM_MODEL=deepseek/deepseek-chat
LITELLM_API_KEY=sk-...

Optional Services

# Neo4j (optional, for knowledge graph features)
NEO4J_URI=bolt://localhost:7687
NEO4J_PASSWORD=your-password

# Redis (optional, for distributed caching)
REDIS_URL=redis://localhost:6379

Architecture

kosmos/
├── compression/      # Gap 0: Context compression (20:1 ratio)
├── world_model/      # Gap 1: State manager (JSON artifacts + optional graph)
├── orchestration/    # Gap 2: Task generation (plan creator/reviewer)
├── agents/           # Gap 3: Agent integration (skill loader)
├── execution/        # Gap 4: Sandboxed execution (Docker + Jupyter)
├── validation/       # Gap 5: Discovery validation (ScholarEval)
├── workflow/         # Integration layer combining all components
├── core/             # LLM clients, configuration
├── literature/       # Literature search (arXiv, PubMed, Semantic Scholar)
├── knowledge/        # Vector store, embeddings
└── cli/              # Command-line interface

Documentation

Based On

Contributing

See CONTRIBUTING.md.

Areas where contributions would be useful:

  • Docker sandbox testing and hardening
  • Integration test updates
  • R language support via rpy2
  • Additional scientific domain skills
  • Performance benchmarking

License

MIT License - see LICENSE.


Version: 0.2.0-alpha
Gap Implementation: 6/6 complete
Test Coverage: 273 unit tests passing (core modules)
Next Step: End-to-end integration testing
Last Updated: 2025-11-25

About

Kosmos: An AI Scientist for Autonomous Discovery - An implementation and adaptation to be driven by Claude Code or API - Based on the Kosmos AI Paper - https://arxiv.org/abs/2511.02824
