Skip to content

fix: guard pufferlib test files against missing pufferlib in CI#75

Open
lee101 wants to merge 1059 commits intomainfrom
ci-fix/stock-prediction-fast-unit-tests-v3
Open

fix: guard pufferlib test files against missing pufferlib in CI#75
lee101 wants to merge 1059 commits intomainfrom
ci-fix/stock-prediction-fast-unit-tests-v3

Conversation

@lee101
Copy link
Owner

@lee101 lee101 commented Mar 23, 2026

Summary

  • tests/test_evaluate_ttt.py: Added try/except (ImportError, ModuleNotFoundError) guard around module-level imports from pufferlib_market.lora and pufferlib_market.train, with pytest.skip(..., allow_module_level=True) when unavailable
  • tests/test_pufferlib_market_train_arch.py: Same pattern applied to imports from pufferlib_market.train
  • tests/test_root_evaluate_ttt.py: Same pattern applied to import of TradingPolicy, ResidualTradingPolicy

Root Cause

The Fast CI (fast-unit-tests) job installs requirements-ci.txt which excludes pufferlib (commented out due to numpy version conflicts). The newly added test files imported pufferlib_market.train at module level. In the CI's code version, pufferlib_market/train.py imported pufferlib at module level (line 41), causing ModuleNotFoundError: No module named 'pufferlib' during pytest collection — before any marker-based deselection could happen — and aborting the entire test run with exit code 2.

Exact CI failure:

ERROR collecting tests/test_evaluate_ttt.py
    from pufferlib_market.lora import LoRALinear, LoRAPolicy, reset_adam_state
    pufferlib_market/train.py:41: in <module>
        import pufferlib
E   ModuleNotFoundError: No module named 'pufferlib'

ERROR collecting tests/test_pufferlib_market_train_arch.py
    from pufferlib_market.train import (
    pufferlib_market/train.py:41: in <module>
        import pufferlib
E   ModuleNotFoundError: No module named 'pufferlib'

Fix

Follows the existing pattern in tests/test_pufferlib_market_train_resume.py — wrapping pufferlib_market imports in try/except with pytest.skip(..., allow_module_level=True). This makes pytest skip the entire module gracefully when pufferlib is unavailable, rather than failing collection.

Test plan

  • Verified pytest --collect-only succeeds without errors for the three fixed files locally
  • The tests skip gracefully when pufferlib is not installed (as in CI)
  • When pufferlib_market is importable, the tests collect and run normally

🤖 Generated with Claude Code

lee101 and others added 30 commits March 20, 2026 10:22
- Remove --gemini flag from worksteal-daily launch (run pure rule-based)
- Add _chronos2_fallback_signal() to hybrid-spot for when LLM fails
- Add rate-limit detection with exponential backoff in providers.py
- Early-branch in get_hybrid_signal() to skip unnecessary work in fallback path

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add unified_orchestrator/service_config.json assigning exclusive symbol
  ownership across the three live services: unified-orchestrator gets
  SOLUSD/LTCUSD/AVAXUSD (crypto) and NVDA/PLTR/META/MSFT/NET (stocks);
  alpaca-hourly-trader exclusively owns BTCUSD/ETHUSD; trade-unified-hourly-meta
  owns the remaining stock set (GOOG, DBX, TRIP, MTCH, etc.).
- Add unified_orchestrator/symbol_lock.py with load_service_symbols(),
  find_symbol_overlaps(), assert_no_overlaps(), and warn_position_conflicts()
  helpers so any service can validate ownership at startup.
- Update orchestrator.py to load CRYPTO_SYMBOLS and STOCK_SYMBOLS from
  service_config.json at import time (hardcoded defaults as fallback); removes
  duplicate STOCK_SYMBOLS constant; adds startup conflict check that warns on
  any open position belonging to another service.
- Add tests/test_symbol_conflict.py: 21 tests covering config loading, symbol
  set disjointness, production config assertions, and conflict detection.

Status: alpaca-hourly-trader is inactive (dead); unified-orchestrator and
trade-unified-hourly-meta are both running. PLTR conflict already visible in
daily-rl-trader logs — all five overlapping stock symbols (NVDA/PLTR/META/MSFT/NET)
are now exclusively assigned to unified-orchestrator.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- SMA-24 RL hint filter (pre-LLM): was zeroing RL LONG hints for ANY
  price < SMA-24. Now: <5% discount reduces allocation 50%, >5% hard
  suppresses. This was blocking all trades during March 2026 crypto dip
  where prices were typically 0.1-1% below a lagging 24h SMA.

- SMA-24 post-LLM hard block: was force-HOLD for ANY price < SMA-24.
  Now tiered: <2% passes through unchanged, 2-5% reduces allocation 50%,
  >5% hard blocks. Production logs showed 0 trades with BTC 0.04% below
  SMA, ETH 0.51%, AVAX 0.83% — all now pass through.

- MIN_CONFIDENCE_CRYPTO: lowered from 0.4 to 0.25. 0.4 too restrictive
  in sideways markets where RL models express lower certainty.

- Add [CRYPTO_SIGNAL] structured log prefix throughout signal generation
  path: RL hint inject, LLM output, filter fires, execution.

- Add tests/test_sma_filter.py: 20 unit tests covering all thresholds,
  multi-symbol independence, and regression cases from production logs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ve dip, risk-off threshold

The SMA-20 filter required close >= SMA while dip proximity required close near
buy_target (20% below ref_high). These were mutually exclusive, producing 0 trades.

Changes:
- sma_check_method="pre_dip" (default): check if any of last 5 closes were >= SMA
- proximity_pct default 0.005 -> 0.03 (3% window vs 0.5%)
- adaptive_dip option: ATR-based dip threshold (2.5*ATR/ref_high, clamped [5%,dip_pct])
- risk_off_momentum_threshold=-0.05 (was hardcoded 0.0, fired on any negative momentum)
- Extract passes_sma_filter() and compute_buy_target() shared helpers

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…try logic

Key changes to unified_orchestrator/rl_gemini_bridge.py:
- Add signal source constants (gemini_rl, rl_only, fallback_hold) logged at INFO
  every cycle so Gemini reliability is visible in production logs
- Add _rl_only_plan() helper: derives limit prices from RL direction + price,
  discounts confidence by 20%, tags reasoning with [rl_only]
- Add _hold_plan() helper: returns neutral HOLD tagged with [fallback_hold]
- Add _tag_plan_source() helper: prepends [source] to plan.reasoning
- _load_checkpoint_payload(): check checkpoint existence first and raise
  FileNotFoundError with clear message (not an opaque torch error)
- _load_policy(): log ERROR (not just raise) on obs_size/num_actions mismatch
  with actionable diagnostic messages
- Replace bare except+silent-fallback in generate_plans with
  _call_llm_with_fallback() which:
  * On 429 rate-limit: sleeps _RATE_LIMIT_BACKOFF_S seconds, retries once
    with simplified prompt (no history, no portfolio context)
  * On any other error (attempt 0): retries with simplified prompt
  * After 2 failures: returns RL-only plan (never None)
  * On success: tags plan with [gemini_rl] source
- generate_plans() in dry_run mode now uses _rl_only_plan() consistently
- Import loguru logger (_logger) for structured logging throughout

tests/test_rl_gemini_bridge.py:
- Import new symbols: _rl_only_plan, _hold_plan, _tag_plan_source, constants
- 20 new tests covering: plan helpers, checkpoint not found, dimension mismatch
  with clear messages, Gemini failure → RL-only (not None), rate-limit backoff
  and retry, both-attempts-fail fallback, success tagging, dry_run, flat/missing
  signal handling

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…to annualization to 365

- src/metrics_utils.py: Change annualized_sortino to use C env formula exactly:
  downside_dev = sqrt(sum_neg_sq / total_n)  [sum of neg squares / ALL step count]
  Previously used std(neg, ddof=1) which differs from C in both divisor and mean-centering.
  Also change default periods_per_year from 252 to 365 (crypto trades 24/7, 365 days).
  Add module docstring documenting the canonical Sortino definition and why it matches C.

- marketsimulator/run_trade_loop.py: Fix periods_per_year=252 to 365 in _compute_step_metrics.

- tests/test_metrics_utils.py: Update assertion to use C env formula instead of ddof=1.

- tests/test_metrics.py: New integration test file with 13 tests covering:
  * Python/C Sortino consistency within 1% (daily and hourly)
  * Formula uses C-style partial RMS, not ddof=1
  * Annualisation uses 365 for daily crypto (not 252)
  * Annualisation uses 8760 for hourly crypto
  * Default periods_per_year is 365 for both sharpe and sortino
  * Edge cases: empty, single return, all-positive, NaN/inf inputs

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…e, thread safety

- orchestrator.py: keep existing buy orders within 0.1% of new target price to
  eliminate the cancel→no-order→re-place timing gap; sleep 1s after real
  cancellations before account refresh so cash balance reads settled buying power;
  remove redundant `import time as _cancel_sleep` aliases (top-level `time` used);
  simplify `{sym for sym in signals}` → `set(signals)`
- position_tracker.py: reset stored peak to current price when peak < current
  (asset rallied during a restart gap) so trailing stop doesn't misfired on
  restart; eliminate triple `peaks.get(sym)` lookup by capturing `stored_peak`
  once; fix docstring typo and clarify restart-safety explanation
- alpaca_watcher.py: log (not swallow) API errors in `_check_position_exists`
- conditional_orders.py: log (not swallow) malformed lines in `read_pending_fills`
- state.py: log (not swallow) Binance open-order fetch errors
- tests/test_orchestrator_order_mgmt.py: 24 unit tests covering all four bug areas

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ge modeling

Add production-realistic execution constraints to simulate_daily_policy() in
hourly_replay.py and wire them through fast_marketsim_eval.py:

- trailing_stop_pct (default 0.0, production uses 0.003): force-exit long
  positions when price drops 0.3% below peak since entry; tracks pos_peak_price
  per open position, applied after market-move at each bar
- max_hold_bars (default 0, production uses 6): force-exit after 6 bars held,
  matching MAX_HOLD_HOURS=6 in the production orchestrator
- slippage_bps (default 0.0, production uses 3): adverse fill slippage in bps
  applied on top of fee_rate (buy fills at price*(1+slip), sell at price*(1-slip))
  as a separate parameter from fee_rate for clarity
- min_notional_usd (default 0.0, production uses 12.0): skip opening positions
  below the $12 Alpaca minimum notional

fast_marketsim_eval.py constants updated:
  FEE_RATE = 0.0           (was 0.001=10bps; Alpaca crypto has 0 commission)
  SLIPPAGE_BPS = 3.0       (new; ~3bps market-impact per fill)
  TRAILING_STOP_PCT = 0.003 (new; matches production)
  MAX_HOLD_BARS = 6        (new; matches production)
  MIN_NOTIONAL_USD = 12.0  (new; matches Alpaca minimum)

Measured effect on slip_5bps checkpoint (90d crypto5 val):
  trailing_stop alone:  7.72% → 3.07% (-4.65pp) — fires in whipsaw markets
  max_hold_bars=6:      0pp effect (avg hold is already ~1.6 bars naturally)
  slippage_bps=3 vs old fee 10bps: +1.47pp (less punishing than 10bps fee)
  combined production-realistic: 8.04% (+0.32pp vs old unconstrained)

Added tests/test_sim_fidelity.py with 18 tests covering:
  - trailing stop fires after 0.3% drawdown from peak
  - trailing stop does not fire below threshold
  - peak price tracks correctly through rising prices
  - max hold exits after N bars, disabled when 0, fires every bar when 1
  - slippage reduces return, adds linearly to fee_rate
  - fee/slippage arithmetic: fee=0.001+slip=10bps == fee=0.002
  - min notional blocks tiny accounts, allows sufficient cash
  - production defaults give more trades than unconstrained
  - backward compat: old callers work without new params

All 18 new tests pass. No regressions in related test suites.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Fix worksteal entry logic: resolve SMA/dip contradiction
…ine, C sim parity, sweep, Gemini overlay

- Fix SMA-dip contradiction: pre_dip SMA check, wider proximity (3%), configurable risk-off threshold
- Add build_entry_candidates/resolve_entry_config/compute_market_breadth_skip standalone functions
- Fix _risk_off_triggered to use risk_off_trigger_momentum_period and risk_off_trigger_sma_period
- Live bot: entry_proximity_bps raised to 3000, diagnostic logging, 4h polling, health reports, JSONL events
- C simulator: ATR computation, market breadth, risk-off triggers, per-symbol FDUSD fees, tighter parity
- Neural pipeline: per-symbol transformer encoder, multi-step rollout, position state tracking, cosine LR
- Sweep: multi-window evaluation, production-realistic mode, new dimensions
- Gemini overlay: backtest mode, exponential backoff, forecast staleness checks
- Sim-production parity audit script
- 125 tests pass (strategy, leverage, entry filters, trade live, neural, sweep, gemini)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…kdown

Runs sim-vs-live parity audit: entry filter breakdown, multi-window backtest,
parameter sensitivity grid (dip x proximity x sma), random baseline comparison.

Key finding: proximity_pct=0.02 is too tight (blocks 71% of evals). Best
reliable config is dip=20% prox=3% sma=20 (Sort=19.24, +8.39%, -0.25% DD).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…, C sim optimization

- Vectorize multi-step training rollout + AMP + gradient checkpointing (3-5x speedup)
- Wire neural model inference into live bot (--neural-model flag, fallback to rules)
- Expand symbol universe: 60+ symbols with data validation and download script
- Neural training sweep infrastructure (run_neural_sweep.py)
- Production code review: edge case fixes (NaN guards, atomic writes, empty bars)
- C simulator OpenMP parallelization (14x batch speedup)
- Fix compute_metrics: add n_orders, n_trades, win_rate fields
- Fix build_dataloader: accept num_workers, pin_memory params
- 179 tests passing

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…handler

Downloads training data and optional resume checkpoint from R2 before
training starts. Registers an atexit handler that uploads top-5 checkpoints
(read from .topk_manifest.json) and log files back to R2 when training ends.
Uses src.r2_client.R2Client when available, with an inline boto3 fallback
for pods where src/ is not installed.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- train.py: add --wandb-run-name and --wandb-group argparse args; pass
  name and group through to wandb.init() alongside existing project/entity/mode
- autoresearch_rl.py: add --wandb-project arg; derive a per-run wandb_group
  (autoresearch_<timestamp>) so all trials from one autoresearch run are
  grouped together on the W&B dashboard; pass --wandb-project,
  --wandb-run-name (trial description), and --wandb-group into each
  training subprocess; add wandb_project/wandb_group keyword params to
  run_trial()

All wandb calls remain gated (wandb=None on ImportError, wandb_run=None
when project not set) so training works without wandb installed.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add MIT LICENSE
- Add CONTRIBUTING.md with dev setup, uv/venv instructions, SSH security note
- Fix .python-version from 3.12 to 3.13
- Add .github/workflows/ci-ubuntu.yml for GitHub-hosted runner CI
- Add Makefile with setup, setup-cutedsl, build-c, test targets
- Add src/runpod_client.py with RTX 5090 pricing, resolve_gpu_type(), get_supported_gpu_types()
- Add pufferlib_market/cute_chronos_features.py with robust _find_cutedsl() path resolution

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Unit 4 — Multi-seed dispatch:
- scripts/dispatch_rl_training.py: --num-seeds/--seeds flags, variance report,
  cost estimation with setup overhead, budget limit enforcement

Unit 6 — WandB metrics reader:
- scripts/wandb_metrics_reader.py: LLM-friendly markdown output, --project,
  --run-id, --last-n-runs, --group flags; graceful fallback without API key
- tests/test_wandb_metrics_reader.py: 47 tests with mocked wandb API

Unit 7 — S&P500 data pipeline:
- scripts/download_sp500_data.py: Wikipedia constituent fetch, yfinance
  download, batched with rate-limit delays, --dry-run/--limit support
- scripts/export_sp500_daily.py: MKTD v2 binary export, batched by 50 symbols
- src/alpaca_stock_expansion.py: add get_sp500_symbols() with cache
- pyproject.toml: sp500 optional deps (yfinance, pandas, lxml), cutedsl note
- tests/test_sp500_pipeline.py: 39 tests with mocked yfinance and Wikipedia

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- pufferlib_market/gpu_pool_rl.py: --budget-limit flag blocks provisioning
  when estimated cost exceeds limit; check_running_cost warns on over-budget
  busy pods; _print_cost_estimate helper shows cost breakdown on startup
- tests/test_gpu_pool_rl.py: 40 tests covering pool state, cost gating,
  pod lifecycle, and budget enforcement
- tests/test_dispatch_multiseed.py: 21 tests for multi-seed dispatch and
  variance reporting

Cost equation:
  est_cost = rate * ((1800 + num_seeds * time_budget_secs) / 3600)
Default --budget-limit 10.0 (allows ~8 sweeps on RTX 5090 before blocking)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
lee101 and others added 27 commits March 22, 2026 17:22
- Replace --stocks12 with --stocks in build_autoresearch_cmd
- Add start_from param to build_autoresearch_cmd/build_remote_autoresearch_plan
- launch_stocks_autoresearch_remote.py: --start-from default=96 (skip lr=3e-4 configs
  that collapse on stocks11_2012 data; saves ~96 wasted H100 trials)
- Confirmed dry-run produces correct command with --start-from 96

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
lr=3e-4/5e-4 collapse to hold-cash on stocks11_2012 data (confirmed).
anneal_lr=False also degrades performance significantly.

Prevents ~50% of H100 random mutations being wasted on known-bad configs.
stocks_mode=True now uses lr∈[1e-4, 2e-4] and anneal_lr=True only.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…anneal_ent)

Key untested axes for s1137 seed (stocks11_2012 best seed at robust=-21.4):
- s1137_lr2e4: lr=2e-4 between confirmed good (1e-4) and bad (3e-4)
- s1137_h2048: larger model with best seed (h2048 failed with s777/s42)
- s1137_obs_norm: observation normalization for daily data stability
- s1137_gamma995: longer planning horizon for multi-day trends
- s1137_anneal_ent: entropy decay for better exploitation
- s5678_lr2e4: lr=2e-4 test for 2nd-best seed

New configs at pool indices 161-166, before random mutations (167+).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
smooth_downside_penalty is the key differentiator in random_mut_2272 (stocks12
champion, 0% negative windows). Adding 5 variants for s1137/s5678:
- s1137_sdp01_t001: exact random_mut_2272 formula on best seed
- s1137_sdp005_t002: lighter version
- temperature sweep variants
- s5678_sdp01_t001: 2nd best seed

Key insight: -50 × negative_return_rate dominates robust_score. s1137 has 20%
negative rate = -10 points. smooth_downside_penalty should force more conservative
strategy → fewer negative holdout windows.

Pool: 172 named configs, random mutations start at index 172.
Phase 2 launcher auto-starts L+M block sweep when stocks10 finishes.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Uses --descriptions to run 4 targeted trials instead of sequential scan:
- lr1e4_anneal_s1137 (cross-features only)
- lr1e4_anneal_s5678 (cross-features only)
- s1137_sdp01_t001 (cross-features + smooth_downside_penalty)
- s5678_sdp01_t001 (cross-features + smooth_downside_penalty)

Avoids wasting 4/6 trials on known-bad seeds (s2718/s31415/s1234/s314).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Tests if transformer arch + cross-features gives better cross-symbol signal capture.
TransformerTradingPolicy correctly receives features_per_sym=20 from file header.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
All named configs 96-166 now tested locally before H100 launch.
H100 focuses on:
1. M-block (167-171): smooth_downside_penalty — KEY for reducing negative windows
2. Random mutations (172+): 495 unique seed explorations

500 × ~93s = ~13 hours on H100.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
lr=3e-4 collapse was seed-specific (only seed=1137 collapsed).
random_mut_4424 (lr=3e-4, h=256) achieved robust=-4.02 with 0%
negative windows on stocks12 data, beating all lr=1e-4 configs.
Need to explore lr=3e-4 with diverse seeds on stocks11_2012 data.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Evidence from 50+ stocks11_2012 trials: lr=2e-4 and lr=3e-4 ALL hit
the -64.87 degenerate minimum. Only lr=1e-4 escapes it. Added:
1. stocks_mode: force lr=1e-4 unconditionally (even if not mutated)
2. Remove h=2048 from mutation pool (s1137_h2048 = -64.87, worse)
3. seed_only=True mode: only mutates seed, all other hyperparams stay
   at proven defaults (h=1024, ent=0.05, no penalties, anneal=True)
4. --seed-only CLI flag for pure seed sweeps

This enables the correct H100 strategy: sweep ~200 seeds at s1137's
exact config and find all seeds that achieve robust > -21.38.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
_infer_arch was checking input_proj.weight first (returning 'resmlp')
but GRUTradingPolicy also has input_proj.weight. Fixed order:
GRU (gru. keys) → transformer (attn./symbol_proj. keys) → resmlp → mlp.

Also fixed _infer_hidden_size for transformer (mlp.0.weight) and
GRU (input_proj.weight) architectures.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add 15 sdp=0.2,sdt=0.01 configs to STOCK_EXPERIMENTS (indices 172-186):
  * 10 varied seeds (sdp02_s1464 through sdp02_s2024)
  * 5 known-bad seeds (7860,4533,4438,6828,5678) to test rescue effect
- Add launch_cross_feature_sweep.sh: waits for L-block (PID 543189) then
  launches 8-trial cross-feature sweep on stocks11_daily_{train,val}_2012_cross.bin

Findings from latest experiments:
- s5678+sdp=0.1,sdt=0.01: robust=-37.81 (vs baseline -55.49) -- sdp rescues s5678
- s1137+sdp=0.2,sdt=0.01: robust=-24.15 (vs baseline -21.38) -- minimal hurt
- Seed sweep escape rate (no sdp): 17% (1/6 >-50), 33% (2/6 >-65)
- H100 strategy: 500 seed-only OR sdp=0.2 trials; decision after SDP02 results

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
H100 strategy based on local experiments:
- sdp=0.2 HURTS good seeds: seed=1464 went -37.25 → -56.18 with sdp=0.2
- Pure seed-only sweep (s1137 exact config + 500 seeds) is optimal
- Default: --start-from 187 --seed-only --max-trials 500

Launcher changes:
- Default start-from: 172 → 187 (skip sdp02 block to avoid sdp contamination)
- --seed-only: default True (only seed changes in random mutations)
- --cross-features: use stocks11_daily_{train,val}_2012_cross.bin (20 feat/sym)
- Rsync includes holdout data (which equals val data for stocks run)

Remote pipeline: add seed_only param to build_autoresearch_cmd and
build_remote_autoresearch_plan, passes --seed-only to autoresearch_rl

Evidence: 17% seed escape rate (no sdp) → 500 H100 trials → ~85 good models
Best local result: seed=1464 at -37.25, s1137 at -21.38 (deployment target: <-10)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add 8 N-block configs (indices 187-194) to STOCK_EXPERIMENTS testing the
  stocks12-champion formula (h=256, lr=3e-4, slip=12bps, dp=0.01) on stocks11_2012.
  random_mut_4424 achieved robust=-4.02 with 0% negative windows on stocks12.
  Testing whether this transfers to stocks11_2012 (4840 days).
- Update H100 --start-from default: 187→195 (random_1 now at 195 after N-block)
- Update --seed-only help text to reference new index

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…m index 187

- Add 8 O-block configs (indices 187-194): PPO infrastructure variants for s1137
  (rollout_len=128/512, num_envs=256, mb=4096, gamma=0.98, ppo_epochs=2/8, cosine lr)
- N-block h256 configs shifted to indices 195-202 (after O-block)
- Random mutations now start at index 203
- H100 default: --max-trials 1000 (8% escape rate → 80 good models, ~9h)
- H100 default: --start-from 187 (O+N-block first, then 992 seed-only variations)

Key findings:
- h256 + lr=3e-4: robust=-121 (catastrophic overfit, 95% neg windows) — REJECTED
- cross-features (20 feat): robust=-61 (worse than standard 16-feat) — REJECTED
- s1137 seed escape rate: 8% (1/12 fully escape, 83% degenerate at -64.87)
- H100 strategy: 1000 trials of seed_only from s1137 base = ~80 escaped models

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- mutate_config: h=256 catastrophically bad on stocks11_2012 (-121 to -146 robust).
  Now only [512, 1024] in stocks_mode hidden_size choices.
- Add s1137_h512 (index 195): h=512, lr=1e-4, seed=1137 — only untested viable size.
- h=256 findings: lr=3e-4 → -121 (95% neg), lr=1e-4 → -146 (100% neg). Both rejected.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds pufferlib_market/early_stopper.py with:
- combined_score(): equal-weight sortino + return combined metric
- PolynomialEarlyStopper: fits degree-1/2 polynomial to mid-training
  val snapshots, projects to progress=1.0, prunes if < best_known*tol
- BestKnownTracker: persists per-track best scores to JSON with atomic
  writes (supports stocks_daily, binance_crypto, mixed, hourly_crypto)

31 pytest tests cover all edge cases including degenerate polyfit inputs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…rch_rl

Add early_stopper.py with PolynomialEarlyStopper, BestKnownTracker, and
combined_score. Integrate into autoresearch_rl.py: mid-training val evals
at 25/50/75% budget, polynomial curve-fit projection, prune trials unlikely
to reach the best known combined (sortino+return) score. Add --local
(time-budget=60) and --a40 (a40-mode + time-budget=180) convenience presets.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…rch_rl

- _quick_val_eval now returns (val_return, val_sortino, val_wr) tuple
- Three checkpoint-eval blocks unified into data-driven _check_schedule loop
- PolynomialEarlyStopper: collect at 25%, prune at 50% (tol=0.70) and 75% (tol=0.80)
- BestKnownTracker: persist per-track best combined scores to JSON
- Combined metric: 0.5*sortino + 0.5*return (more stable than return alone)
- --poly-prune/--no-poly-prune, --local (60s budget), --a40 (180s + a40_mode)
- poly_projected_final added to result_payload for leaderboard tracking

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- STOCK_EXPERIMENTS random pool: range(1,451) → range(1,1001) so --max-trials 1000
  actually runs 1000 trials (was silently stopping at ~467 due to pool exhaustion)
- build_autoresearch_cmd: add poly_prune param + pass --no-poly-prune for stocks mode
  (polynomial combined_score gap between degenerate/escaped seeds is only 4%, making
  50%/75% poly checks useless; restores 25% val_return fixed-threshold early rejection
  which cuts degenerate H100 trial time from ~60s to ~30s)
- build_remote_autoresearch_plan: auto-sets poly_prune=not stocks_mode

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…g contamination

O-block trials (187-203) are all confirmed degenerate locally, but if any score
above -inf they become best_config. With --seed-only, all 1000 subsequent seed
mutations inherit that config (e.g., ppo_epochs=2 from s1137_ppo_epochs2 at -49.64).
Starting at 204 ensures best_config stays as clean TrialConfig() + stocks_mode lr=1e-4.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Remove hardcoded GEMINI_API_KEY from supervisor.conf (untracked)
- Source API key from .env.binance-hybrid file instead (gitignored)
- Reduce --leverage from 5 to 0.5 to cut risk exposure 10x
- Fast-fail with clear ValueError if GEMINI_API_KEY is missing

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…roken status

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…on errors

tests/test_evaluate_ttt.py, tests/test_pufferlib_market_train_arch.py, and
tests/test_root_evaluate_ttt.py import from pufferlib_market.train at module
level. If pufferlib is not installed (as in CI), this caused a
ModuleNotFoundError during pytest collection, failing the entire test run
before any tests could execute.

Adds try/except (ImportError, ModuleNotFoundError) guards matching the pattern
already used in tests/test_pufferlib_market_train_resume.py, causing these
modules to be skipped gracefully when pufferlib/pufferlib_market is unavailable.

Fixes: Fast CI (GitHub Runners) / fast-unit-tests failure where pytest collection
errored on test_evaluate_ttt.py and test_pufferlib_market_train_arch.py with
ModuleNotFoundError: No module named 'pufferlib'

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@codex-infinite
Copy link

Codex Infinity
Hi! I'm Codex Infinity, your coding agent for this repo.

Start a task on this PR's branch by commenting:

Tasks and logs: https://codex-infinity.com

Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 74a1895fb2

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +28 to +29
except (ImportError, ModuleNotFoundError):
pytest.skip("pufferlib_market or pufferlib not installed", allow_module_level=True)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Keep the policy/LoRA unit suites running without PufferLib

pufferlib_market.train is explicitly written to import in CPU-only/unit-test environments without pufferlib (see the deferred-import comment in pufferlib_market/train.py), so this skip now disables the entire architecture/LoRA/TTT coverage in exactly the lean CI setup it was meant to help. In an environment that lacks pufferlib but still has the normal test deps, regressions in relu_sq, TransformerTradingPolicy, GRUTradingPolicy, DepthRecurrenceTradingPolicy, and the LoRA wrappers will all go untested because this module, plus the same pattern in tests/test_evaluate_ttt.py, gets skipped before any assertions run.

Useful? React with 👍 / 👎.

Comment on lines +47 to +50
try:
from pufferlib_market.train import TradingPolicy, ResidualTradingPolicy
except (ImportError, ModuleNotFoundError):
pytest.skip("pufferlib_market or pufferlib not installed", allow_module_level=True)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Move the root evaluate_ttt guard before exec_module

This guard is too late to protect the root-level TTT test from the missing-dependency case it is trying to handle. tests/test_root_evaluate_ttt.py executes evaluate_ttt.py at lines 31-38, and that module immediately imports pufferlib_market.train/evaluate_fast; any ModuleNotFoundError raised there will abort collection before this pytest.skip(...) runs. In other words, this file still hard-fails instead of skipping if the top-level evaluate_ttt.py import chain is what needs guarding.

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant