Sandboxing libstempo#81

Open
vhaasteren wants to merge 16 commits into vallis:master from vhaasteren:sandbox

vhaasteren commented Oct 10, 2025

Add Sandbox Mode for Crash-Protected libstempo Usage

Summary

This PR introduces a sandbox mode for libstempo that provides crash isolation and automatic retries. The sandbox runs each tempopulsar instance in a separate subprocess, so a tempo2 crash cannot take down the main Python kernel. The non-deterministic crashes that tempo2 tends to cause are especially painful in scripts that process many pulsars for (I)PTA purposes, and the sandbox provides a low-friction workaround.

Notes

I have not yet tested this PR on a large number of machines, but it is fully operational. The tests verify sandbox results against a native libstempo instance, and the GitHub runners all finish. The demo notebooks work. Still, I am looking for bug reports! Only add_gwb is incompatible, because that function calls tempo2 natively.

Suitability for libstempo

I originally wrote this sandbox as part of another one of my projects (more IPTA-focused), which I will release soon. If the sandbox is deemed 'out of scope' for libstempo I'd be most happy to make it available through other means. To me the libstempo repo seems most sensible.

Key Features

🛡️ Crash Isolation

  • Segfaults in tempo2 only kill the worker process, not your main kernel
  • Automatic worker recycling prevents memory leaks and resource accumulation
  • Process isolation ensures stability for long-running analyses
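
The isolation idea can be shown with the standard library alone. This is a minimal sketch, not the code in sandbox.py: the child deliberately sends itself SIGSEGV to stand in for a tempo2 segfault, and `run_isolated` is a hypothetical helper name.

```python
import subprocess
import sys

# Child code that deliberately crashes, standing in for a tempo2 segfault.
CRASHING_CHILD = "import os, signal; os.kill(os.getpid(), signal.SIGSEGV)"


def run_isolated(code: str) -> int:
    """Run `code` in a fresh interpreter and return its exit status."""
    completed = subprocess.run([sys.executable, "-c", code])
    return completed.returncode


status = run_isolated(CRASHING_CHILD)
# A non-zero status means the child died; on POSIX a negative value -N
# means it was terminated by signal N.
crashed = status != 0
print("child crashed:", crashed, "| parent is still running")
```

Only the child interpreter dies; the parent sees an exit status and can decide whether to start a fresh worker.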

🔄 Automatic Retry & Recovery

  • Built-in retry logic for transient failures
  • Configurable retry policies (constructor retries, call timeouts)
  • Automatic worker recycling based on age, call count, or memory usage
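
A rough sketch of how retry and recycling decisions of this kind could be expressed. `RecyclePolicy`, `should_recycle`, and `construct_with_retry` are hypothetical names for illustration; only the field names are borrowed from the `Policy` example in this PR, and treating `ctor_retry` as "retries after the first attempt" is an assumption.

```python
import time
from dataclasses import dataclass


@dataclass
class RecyclePolicy:
    """Hypothetical stand-in; field names mirror the Policy example in this PR."""
    ctor_retry: int = 3
    max_calls_per_worker: int = 1000
    max_age_s: float = 3600.0
    rss_soft_limit_mb: float = 2048.0


def should_recycle(policy, calls, started_at, rss_mb, now=None):
    """Return True when any of the three recycling thresholds is reached."""
    now = time.monotonic() if now is None else now
    return (
        calls >= policy.max_calls_per_worker
        or (now - started_at) >= policy.max_age_s
        or rss_mb >= policy.rss_soft_limit_mb
    )


def construct_with_retry(factory, policy):
    """Call factory() until it succeeds, at most 1 + ctor_retry times (assumed)."""
    last_error = None
    for _ in range(1 + policy.ctor_retry):
        try:
            return factory()
        except Exception as exc:  # a real implementation would catch narrower types
            last_error = exc
    raise RuntimeError("constructor failed after retries") from last_error
```

Checking the thresholds before each call keeps the decision cheap and means a worker is replaced between calls rather than in the middle of one.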

🌍 Environment Flexibility

  • Support for conda environments, virtual environments, and Rosetta (macOS)
  • Explicit Python path specification

🚀 Proactive TOA Handling

  • Automatically handles large TOA files to prevent "Too many TOAs" errors
  • Bulk loading capabilities for processing many pulsars

Usage

Basic Usage (Drop-in Replacement)

from libstempo.sandbox import tempopulsar

# Same API as regular tempopulsar
psr = tempopulsar(parfile="J1713.par", timfile="J1713.tim", dofit=False)
residuals = psr.residuals()
design_matrix = psr.designmatrix()

# Can just pass sandbox tempopulsar to native toasim as usual
import libstempo.toasim as lt  # make_ideal and add_efac live in libstempo.toasim
lt.make_ideal(psr)
lt.add_efac(psr, efac=1.0, seed=1234)

Advanced Configuration

from libstempo.sandbox import tempopulsar, Policy, configure_logging

# Configure logging and retry policies
configure_logging(level="DEBUG", log_file="tempo2.log")
policy = Policy(
    ctor_retry=5,           # Retry constructor 5 times on failure
    call_timeout_s=300.0,    # 5-minute timeout per RPC call
    max_calls_per_worker=1000,  # Recycle worker after 1000 calls
    max_age_s=3600,          # Recycle worker after 1 hour
    rss_soft_limit_mb=2048   # Recycle worker if memory exceeds 2GB
)

psr = tempopulsar(parfile="J1713.par", timfile="J1713.tim", policy=policy)

Bulk Processing

from libstempo.sandbox import load_many, Policy

pairs = [("J1713.par", "J1713.tim"), ("J1909.par", "J1909.tim")]  # ... more (par, tim) pairs
policy = Policy(ctor_retry=3, call_timeout_s=120.0)

ok_by_name, retried_by_name, failed_list = load_many(pairs, policy=policy, parallel=8)

Performance Characteristics

Based on comprehensive performance testing with J1909-3744_NANOGrav_dfg+12 data:

  • Initialization: ~9x overhead (amortized over long-running applications)
  • Computational operations: ~1.2x overhead for residuals(), ~1.0x for designmatrix()
  • Attribute access: Higher overhead due to RPC, but typically not a bottleneck

The overhead is primarily due to inter-process communication, which is the price of process isolation. For heavy computations, the overhead becomes negligible relative to the actual work.

Implementation Details

Architecture

  • JSON-RPC over stdio: Robust communication protocol between main process and workers
  • Worker process management: Automatic lifecycle management with recycling policies
  • Data serialization: Efficient NumPy array copying to prevent memory sharing issues
  • Error handling: Exception types (Tempo2Error, Tempo2Crashed, Tempo2Timeout)
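
For readers unfamiliar with the pattern, here is a minimal sketch of line-delimited JSON-RPC over stdio. It is not the protocol implemented in sandbox.py: the `add` method, `StdioClient`, and the message layout are assumptions chosen to keep the example self-contained.

```python
import json
import subprocess
import sys

# Worker: read one JSON request per line on stdin, write one JSON reply per line.
WORKER_SOURCE = r'''
import json, sys
METHODS = {"add": lambda a, b: a + b}
for line in sys.stdin:
    req = json.loads(line)
    try:
        result = METHODS[req["method"]](*req["params"])
        reply = {"jsonrpc": "2.0", "id": req["id"], "result": result}
    except Exception as exc:
        reply = {"jsonrpc": "2.0", "id": req["id"], "error": {"message": repr(exc)}}
    sys.stdout.write(json.dumps(reply) + "\n")
    sys.stdout.flush()
'''


class StdioClient:
    """Parent-side helper that sends requests and waits for the matching reply."""

    def __init__(self):
        self._proc = subprocess.Popen(
            [sys.executable, "-c", WORKER_SOURCE],
            stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True,
        )
        self._next_id = 0

    def call(self, method, *params):
        self._next_id += 1
        request = {"jsonrpc": "2.0", "id": self._next_id,
                   "method": method, "params": list(params)}
        self._proc.stdin.write(json.dumps(request) + "\n")
        self._proc.stdin.flush()
        line = self._proc.stdout.readline()
        if not line:  # EOF: the worker died before answering
            raise RuntimeError("worker exited without replying")
        reply = json.loads(line)
        if "error" in reply:
            raise RuntimeError(reply["error"]["message"])
        return reply["result"]

    def close(self):
        self._proc.stdin.close()  # EOF on stdin ends the worker loop
        self._proc.wait()
        self._proc.stdout.close()
```

An empty read from the worker's stdout is how the parent notices a crash, which is where an exception such as Tempo2Crashed would naturally be raised.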

New Files Added

  • libstempo/sandbox.py (1,232 lines): Main sandbox implementation
  • libstempo/tim_file_analyzer.py (548 lines): TOA file analysis utilities
  • tests/test_sandbox.py (94 lines): Comprehensive test suite
  • Updated README.md with sandbox documentation

Testing

  • ✅ All existing tests pass
  • ✅ Comprehensive sandbox-specific tests added
  • ✅ Performance analysis completed
  • ✅ Cross-platform compatibility (somewhat) verified
  • ❌ GWB simulation with add_gwb is not supported because it calls tempo2 natively
  • ❌ I need bug reports!

When to Use

Use Sandbox when:

  • Stability is critical (production environments)
  • Working with potentially unstable tempo2/libstempo versions
  • Need crash protection for long-running processes
  • Interactive environments (Jupyter notebooks)
  • Processing many pulsars in batch

Use Direct when:

  • Performance is critical
  • Development/testing environments
  • Stable, well-tested code

Backward Compatibility

This PR is fully backward compatible. The sandbox is opt-in and doesn't affect existing code. All existing libstempo functionality remains unchanged.

Documentation

  • Updated README.md with comprehensive sandbox documentation
  • Inline docstrings with usage examples

Some irony

The GitHub CI occasionally fails on the regular tests because of random segmentation faults.

- Break long lines in sandbox.py to fit 120-char limit
- Add noqa comments for imports in __init__.py
- Format tim_file_analyzer.py with black
- Protocol
  - Add hello proto_version=1.2 and capabilities: get_kind, dir, setitem, get_slice, path_access
  - Non-exceptional attribute discovery (get-kind) and optional dir RPC

- Array semantics
  - Introduce write-through ArrayProxy for numpy-backed attrs (stoas, toaerrs, freqs)
  - Reads expose plain numpy via __array__; __repr__/__str__/__getattr__ delegate to ndarray
  - Writes route via setitem RPC; add get_slice RPC to avoid fetching whole arrays for reads
  - Guard __len__ for 0-d; support fancy/masked indexing; optional safe dtype cast on set
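
  A stripped-down sketch of the write-through idea, using plain lists so it runs
  without NumPy. `FakeWorker` and `WriteThroughProxy` are hypothetical names; the
  real ArrayProxy additionally exposes `__array__` and delegates to ndarray.

```python
class FakeWorker:
    """Stands in for the subprocess; holds the authoritative copy of the data."""

    def __init__(self):
        self.attrs = {"toaerrs": [1.0, 2.0, 3.0, 4.0]}
        self.log = []  # records which RPCs were issued

    def get_slice(self, name, index):
        self.log.append(("get_slice", name))
        return self.attrs[name][index]

    def setitem(self, name, index, value):
        self.log.append(("setitem", name))
        self.attrs[name][index] = value


class WriteThroughProxy:
    """Forwards reads and writes to the worker instead of caching a local copy."""

    def __init__(self, worker, name):
        self._worker = worker
        self._name = name

    def __getitem__(self, index):
        # Only the requested element or slice crosses the process boundary.
        return self._worker.get_slice(self._name, index)

    def __setitem__(self, index, value):
        # The write lands on the worker's copy, so later fits see it.
        self._worker.setitem(self._name, index, value)

    def __len__(self):
        return len(self._worker.get_slice(self._name, slice(None)))
```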

- Dotted paths
  - Gate first-hop mapping access to mapping-like (__getitem__) objects only
  - Support psr['PAR'].val/err/fit/set via dotted-path resolution

- Process lifecycle & IO
  - Popen: pass env, close_fds, start_new_session, Windows CREATE_NEW_PROCESS_GROUP when available
  - Group kill with POSIX killpg; Windows terminate/kill fallbacks
  - Thread-safe RPC framing with a per-worker send lock
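
  A sketch of these lifecycle pieces with the standard library. The helper names
  (`spawn_worker`, `kill_worker`, `LockedSender`) are hypothetical and the details
  in sandbox.py may differ.

```python
import os
import signal
import subprocess
import sys
import threading


def spawn_worker(argv):
    """Start a worker in its own process group (POSIX session or Windows group)."""
    kwargs = {"stdin": subprocess.PIPE, "stdout": subprocess.PIPE, "close_fds": True}
    if os.name == "nt":
        kwargs["creationflags"] = subprocess.CREATE_NEW_PROCESS_GROUP
    else:
        kwargs["start_new_session"] = True
    return subprocess.Popen(argv, env=dict(os.environ), **kwargs)


def kill_worker(proc):
    """Kill the worker and, on POSIX, everything in its process group."""
    if proc.poll() is None:
        if os.name == "nt":
            proc.kill()
        else:
            try:
                os.killpg(os.getpgid(proc.pid), signal.SIGKILL)
            except ProcessLookupError:
                pass  # the group vanished between poll() and killpg()
        proc.wait()
    for stream in (proc.stdin, proc.stdout):
        if stream is not None:
            stream.close()


class LockedSender:
    """Serialises writes so frames from different threads never interleave."""

    def __init__(self, proc):
        self._proc = proc
        self._lock = threading.Lock()

    def send(self, payload: bytes):
        with self._lock:
            self._proc.stdin.write(payload)
            self._proc.stdin.flush()
```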

- Errors & logging
  - Stderr ring with optional tail included in exceptions; cap tail by bytes (16KiB) and lines
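
  One way to build such a ring, as a sketch: `StderrRing` is a hypothetical name,
  and the 16 KiB default is taken from the bullet above while the line defaults
  are arbitrary.

```python
from collections import deque


class StderrRing:
    """Keep only the most recent stderr lines, bounded by count."""

    def __init__(self, max_lines=200):
        self._lines = deque(maxlen=max_lines)  # old lines fall off the front

    def append(self, line: str):
        self._lines.append(line.rstrip("\n"))

    def tail(self, max_lines=20, max_bytes=16 * 1024):
        """Return the newest lines that fit within both caps."""
        picked, used = [], 0
        for line in reversed(self._lines):
            size = len(line.encode("utf-8")) + 1  # +1 for the newline joiner
            if len(picked) >= max_lines or used + size > max_bytes:
                break
            picked.append(line)
            used += size
        return "\n".join(reversed(picked))
```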

- Tests
  - Add unit tests comparing sandbox vs native for parameter mapping and TOA edits+fit
  - Full suite green: array writes now update worker; residuals match native after fit
@vhaasteren vhaasteren marked this pull request as ready for review October 12, 2025 19:33
@vhaasteren vhaasteren requested a review from mattpitkin October 13, 2025 12:32
@vhaasteren vhaasteren self-assigned this Oct 14, 2025