A production-grade statistical arbitrage (stat-arb) trading system that identifies market mispricings through quantitative factor analysis, portfolio optimization, and systematic execution. The system processes historical market data, generates alpha signals from multiple strategies, optimizes portfolio positions considering transaction costs and risk, and backtests trading strategies through multiple simulation engines.
This system implements a complete workflow for statistical arbitrage trading:
- Data Loading & Preprocessing: Loads and processes market data from multiple sources
- Alpha Generation: Calculates predictive signals from 20+ trading strategies
- Factor Analysis: Decomposes returns using PCA and Barra risk models
- Portfolio Optimization: Maximizes risk-adjusted returns with realistic constraints
- Backtesting: Simulates execution across multiple engines with transaction cost modeling
The system is designed for daily rebalancing across ~1,400 US equities with sophisticated risk management and execution cost modeling.
- Features
- Architecture
- Installation
- Quick Start
- Data Requirements
- Usage
- Strategies
- Simulation Engines
- Configuration
- Project Structure
- Salamander Module
- Performance Metrics
- Multi-Source Data Integration: Daily/intraday prices, Barra factors, analyst estimates, short locates
- 20+ Alpha Strategies: PCA decomposition, analyst signals, momentum, mean reversion, order flow
- Advanced Optimization: NLP solver with factor risk, transaction costs, and participation constraints
- Multiple Simulation Engines: Daily (BSIM), order-level (OSIM), intraday (QSIM), full system (SSIM)
- Risk Management: Factor exposure limits, position sizing, sector neutrality
- Realistic Execution Modeling: Market impact, slippage, borrow costs, VWAP vs. close fills
- HDF5 Caching: Fast data loading with compressed storage
- Vectorized Operations: Efficient pandas/numpy operations for large datasets
- Rolling Window Analysis: Adaptive factor models with 30-60 day windows
- Winsorization: Robust outlier handling at 5-sigma levels
- Corporate Action Handling: Automatic adjustment for splits and dividends
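As an illustration of the 5-sigma winsorization mentioned above, here is a minimal sketch in plain Python 3. The function name `winsorize` and the choice of the population standard deviation are assumptions made for this example; the actual routine in calc.py may differ (for instance, it may winsorize cross-sectionally per date).

```python
import math


def winsorize(values, n_sigma=5.0):
    """Clip each value to mean +/- n_sigma standard deviations.

    Hypothetical helper for illustration; not the calc.py implementation.
    """
    n = len(values)
    mean = sum(values) / n
    # Population standard deviation (divides by n, not n - 1).
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    lo, hi = mean - n_sigma * std, mean + n_sigma * std
    return [min(max(v, lo), hi) for v in values]


if __name__ == "__main__":
    sample = [0.0] * 99 + [1000.0]
    print(winsorize(sample)[-1])  # the outlier is pulled back to the upper bound
```

Note that a single outlier can only exceed the 5-sigma bound when the sample is reasonably large, because the outlier itself inflates the standard deviation.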
Raw Market Data (CSV/SQL)
↓
Load & Merge (loaddata.py)
↓
Calculate Returns & Features (calc.py)
↓
Filter Tradable Universe
↓
Generate Alpha Signals (strategy files)
↓
Fit Regression Coefficients (regress.py)
↓
PCA Decomposition (pca.py) [optional]
↓
Portfolio Optimization (opt.py)
↓
Simulation Engines (bsim/osim/qsim/ssim)
↓
Performance Analysis & Reporting
| Component | File | Description |
|---|---|---|
| Data Loading | loaddata.py | Load market data, fundamentals, analyst estimates |
| Calculations | calc.py | Forward returns, volume profiles, winsorization |
| Regression | regress.py | Fit alpha factors to forward returns (WLS) |
| PCA | pca.py | Principal component decomposition |
| Optimization | opt.py | Portfolio optimization with OpenOpt NLP |
| Big Sim | bsim.py | Daily rebalancing backtest |
| Order Sim | osim.py | Order-level execution backtest |
| Quote Sim | qsim.py | Intraday 30-min bar backtest |
| System Sim | ssim.py | Full lifecycle position tracking |
| Utilities | util.py | Helper functions for data merging |
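To make the "forward returns" entry concrete, the sketch below computes an h-day forward return from a date-ordered list of adjusted closes. The function name `forward_returns` and the use of simple (not log) returns are assumptions for illustration; calc.py may define them differently.

```python
def forward_returns(closes, horizon):
    """Return the simple return from day i to day i + horizon.

    closes: adjusted closing prices in date order.
    The last `horizon` entries are None because no future price exists.
    Hypothetical helper for illustration only.
    """
    out = []
    for i, price in enumerate(closes):
        j = i + horizon
        out.append(closes[j] / price - 1.0 if j < len(closes) else None)
    return out


if __name__ == "__main__":
    print(forward_returns([100.0, 110.0, 121.0], horizon=1))
```

These forward returns are the dependent variable when alpha factors are fitted in the regression step.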
- Python 2.7 (legacy codebase)
- NumPy 1.16.0
- Pandas 0.23.4
- OpenOpt 0.5628
- statsmodels
- scikit-learn
- matplotlib
- lmfit
- MySQL connector (optional, for database access)
# Clone the repository
git clone https://github.com/yourusername/statarb.git
cd statarb
# Install dependencies
pip install -r requirements.txt
# For Cython optimization module (optional)
python setup.py build_ext --inplace

Set the base directories in loaddata.py:
UNIV_BASE_DIR = "/path/to/universe/"
PRICE_BASE_DIR = "/path/to/prices/"
BARRA_BASE_DIR = "/path/to/barra/"
BAR_BASE_DIR = "/path/to/bars/"
EARNINGS_BASE_DIR = "/path/to/earnings/"
LOCATES_BASE_DIR = "/path/to/locates/"
ESTIMATES_BASE_DIR = "/path/to/estimates/"

# Run BSIM with a single alpha signal
python bsim.py --start=20130101 --end=20130630 \
--fcast=hl:1:1 \
--kappa=2e-8 \
--maxnot=200e6

# Combine high-low and beta-adjusted signals
python bsim.py --start=20130101 --end=20130630 \
--fcast=hl:1:0.6,bd:0.8:0.4 \
--kappa=2e-8

- Universe Files (UNIV_BASE_DIR/YYYY/YYYYMMDD.csv)
  - Columns: sid, ticker_root, status, country, currency
- Price Files (PRICE_BASE_DIR/YYYY/YYYYMMDD.csv)
  - Columns: sid, ticker, open, high, low, close, volume, mkt_cap
- Barra Files (BARRA_BASE_DIR/YYYY/YYYYMMDD.csv)
  - Risk factors: beta, momentum, size, volatility, etc. (13 factors)
  - Industry classifications (58 industries)
- Bar Files (BAR_BASE_DIR/YYYY/YYYYMMDD.h5)
  - Intraday 30-minute bars with VWAP and volume
  - Format: HDF5 with MultiIndex (timestamp, sid)
- Locates File (LOCATES_BASE_DIR/borrow.csv)
  - Short borrow availability and rates
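The per-day files above follow a BASE_DIR/YYYY/YYYYMMDD.ext layout. A small sketch of how such a path could be built is shown here; `daily_file_path` is a hypothetical helper written for this example and is not a function from loaddata.py.

```python
import os
from datetime import date


def daily_file_path(base_dir, day, ext="csv"):
    """Build base_dir/YYYY/YYYYMMDD.ext for a given calendar date.

    Hypothetical helper illustrating the directory layout described above.
    """
    stamp = day.strftime("%Y%m%d")
    return os.path.join(base_dir, stamp[:4], stamp + "." + ext)


if __name__ == "__main__":
    print(daily_file_path("/path/to/prices", date(2013, 1, 2)))
    print(daily_file_path("/path/to/bars", date(2013, 1, 2), ext="h5"))
```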
- Price Range: $2.00 - $500.00
- Min ADV: $1M (tradable) / $5M (expandable universe)
- Country: USA
- Currency: USD
- Market Cap: Top 1,400 stocks by default
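The filters above can be sketched as a single pass over per-stock records. The record keys (`close`, `advp`, `mkt_cap`) and the order of operations (filter first, then rank by market cap) are assumptions for this example; loaddata.py may apply them in a different order or on DataFrames.

```python
def filter_universe(stocks, low_price=2.0, high_price=500.0,
                    min_advp=1.0e6, uni_size=1400):
    """Apply country, currency, price and ADV filters, then keep the
    largest `uni_size` names by market cap.

    stocks: list of dicts with hypothetical keys
            country, currency, close, advp, mkt_cap.
    """
    eligible = [
        s for s in stocks
        if s["country"] == "USA"
        and s["currency"] == "USD"
        and low_price <= s["close"] <= high_price
        and s["advp"] >= min_advp
    ]
    eligible.sort(key=lambda s: s["mkt_cap"], reverse=True)
    return eligible[:uni_size]


if __name__ == "__main__":
    demo = [
        {"sid": 1, "country": "USA", "currency": "USD",
         "close": 50.0, "advp": 2.0e6, "mkt_cap": 5.0e9},
        {"sid": 2, "country": "USA", "currency": "USD",
         "close": 1.5, "advp": 2.0e6, "mkt_cap": 1.0e9},
    ]
    print([s["sid"] for s in filter_universe(demo)])
```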
The optimization module (opt.py) maximizes:
Utility = Alpha - κ(Specific Risk + Factor Risk) - Slippage - Execution Costs
Key Parameters:
- kappa: Risk aversion (2e-8 to 4.3e-5)
- max_sumnot: Max total notional ($50M default)
- max_posnot: Max position size (0.48% of capital)
- slip_nu: Market impact coefficient (0.14-0.18)
Constraints:
- Position limits: ±$40k-$1M per stock
- Capital limits: $4-50M aggregate notional
- Participation: Max 1.5% of ADV
- Factor exposure: Limited Barra factor bets
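The utility expression above can be sketched as a plain function. This assumes the usual quadratic form for risk (specific variance per stock plus factor exposures weighted by a factor covariance matrix). Slippage and execution costs are passed in as precomputed dollar amounts, because their functional form is not given here. All names in the sketch are hypothetical and do not come from opt.py.

```python
def utility(positions, alphas, spec_var, loadings, factor_cov, kappa,
            slippage=0.0, exec_costs=0.0):
    """Alpha - kappa * (specific risk + factor risk) - slippage - costs.

    positions:  dollar position per stock
    alphas:     expected return per stock
    spec_var:   specific (idiosyncratic) variance per stock
    loadings:   loadings[i][k] = exposure of stock i to factor k
    factor_cov: square factor covariance matrix (list of lists)
    """
    n_factors = len(factor_cov)
    alpha_term = sum(a * x for a, x in zip(alphas, positions))
    specific_risk = sum(v * x * x for v, x in zip(spec_var, positions))
    # Portfolio exposure to each factor: sum over stocks of loading * position.
    exposures = [
        sum(loadings[i][k] * positions[i] for i in range(len(positions)))
        for k in range(n_factors)
    ]
    factor_risk = sum(
        exposures[j] * factor_cov[j][k] * exposures[k]
        for j in range(n_factors) for k in range(n_factors)
    )
    return alpha_term - kappa * (specific_risk + factor_risk) - slippage - exec_costs


if __name__ == "__main__":
    # A dollar-neutral pair with equal loadings has zero factor risk.
    print(utility([100.0, -100.0], [0.01, -0.01], [0.0004, 0.0004],
                  [[1.0], [1.0]], [[0.02]], kappa=0.1))
```

An optimizer would search over `positions` to maximize this value subject to the constraints listed above.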
Most comprehensive daily backtest with optimized positions:
python bsim.py \
--start=20130101 \
--end=20130630 \
--fcast=hl:1:0.5,bd:0.8:0.3,pca:1.2:0.2 \
--horizon=3 \
--kappa=2e-8 \
--maxnot=200e6 \
--locates=True \
--vwap=False

Arguments:
- --start/--end: Date range (YYYYMMDD)
- --fcast: Alpha signals (format: name:multiplier:weight)
- --horizon: Forecast horizon in days
- --kappa: Risk aversion parameter
- --maxnot: Maximum notional
- --vwap: Use VWAP execution (default: close)
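The `name:multiplier:weight` format of `--fcast` can be parsed with a few lines of Python. The sketch below is an illustration only; `parse_fcast` is a hypothetical name and bsim.py may parse the argument differently.

```python
def parse_fcast(spec):
    """Split 'hl:1:0.6,bd:0.8:0.4' into (name, multiplier, weight) tuples.

    Raises ValueError if an item does not have exactly three
    colon-separated fields or if a number cannot be parsed.
    """
    forecasts = []
    for item in spec.split(","):
        name, mult, weight = item.split(":")
        forecasts.append((name, float(mult), float(weight)))
    return forecasts


if __name__ == "__main__":
    for name, mult, weight in parse_fcast("hl:1:0.5,bd:0.8:0.3,pca:1.2:0.2"):
        print(name, mult, weight)
```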
Order-level backtest with fill strategy analysis:
python osim.py \
--start=20130101 \
--end=20130630 \
--fill=vwap \
--slipbps=0.0001 \
--fcast=alpha_files

30-minute bar simulation for intraday strategies:
python qsim.py \
--start=20130101 \
--end=20130630 \
--fcast=qhl_intra \
--horizon=3 \
--mult=1000 \
--slipbps=0.0001

Full lifecycle with position and cash tracking:
python ssim.py \
--start=20130101 \
--end=20131231 \
--fcast=combined_alpha

The system includes 20+ alpha strategies in separate files:
| Category | Files | Description |
|---|---|---|
| PCA | pca.py | Market-neutral returns from PCA decomposition |
| Beta-Adjusted | bd.py, badj_*.py | Order flow signals adjusted for beta |
| High-Low | hl.py, qhl_*.py | Intraday high-low mean reversion |
| Analyst | analyst*.py, rating_diff.py | Analyst rating and estimate changes |
| Momentum | mom_year.py | Annual momentum signals |
| Volatility | vadj_*.py | Volume-adjusted position models |
| Overnight | c2o.py | Close-to-open gap trading |
| Earnings | eps.py, target.py | Earnings surprises and target misses |
- Develop Alpha: Create new strategy file with alpha calculation
- Fit Coefficients: Use regress.py to fit on in-sample data
- Generate Forecasts: Apply to out-of-sample period
- Optimize: Run through opt.py to get target positions
- Backtest: Simulate with appropriate engine (BSIM/OSIM/QSIM/SSIM)
- Analyze: Evaluate Sharpe, drawdown, factor exposures
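The "Fit Coefficients" step regresses forward returns on an alpha factor with weighted least squares. A minimal single-factor sketch using the closed-form WLS solution is shown below. The function name `wls_fit` and the weights are assumptions for this example; the weighting scheme actually used in regress.py is not specified here.

```python
def wls_fit(x, y, w):
    """Weighted least squares fit of y = intercept + slope * x.

    x: alpha factor values, y: forward returns, w: non-negative weights.
    Returns (intercept, slope). Hypothetical helper for illustration.
    """
    total_w = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / total_w
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / total_w
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    if sxx == 0.0:
        raise ValueError("factor has no weighted variance")
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


if __name__ == "__main__":
    intercept, slope = wls_fit([0.0, 1.0, 2.0, 3.0],
                               [1.0, 3.0, 5.0, 7.0],
                               [1.0, 2.0, 3.0, 4.0])
    print(intercept, slope)
```

The fitted slope, estimated on in-sample data, is then applied to out-of-sample factor values to produce the forecast.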
| Engine | Use Case | Granularity | Execution Model |
|---|---|---|---|
| BSIM | Daily strategies | Daily | Optimized positions |
| OSIM | Fill analysis | Order-level | VWAP/mid/close fills |
| QSIM | Intraday strategies | 30-min bars | Time-of-day analysis |
| SSIM | Full system | Daily + intraday | Complete lifecycle |
All engines provide:
- P&L: Daily and cumulative
- Sharpe Ratio: Risk-adjusted returns
- Drawdown: Maximum peak-to-trough decline
- Turnover: Average daily trading volume
- Factor Exposures: Barra factor bets over time
- Execution Quality: Realized vs. estimated costs
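Two of the metrics above, Sharpe ratio and maximum drawdown, can be computed from a daily P&L series as sketched here. The sketch assumes a zero risk-free rate, 252 trading days per year, and a drawdown measured on cumulative P&L starting from zero. The simulation engines may use different conventions.

```python
import math


def sharpe_ratio(daily_pnl, periods_per_year=252):
    """Annualized mean / sample standard deviation of daily P&L."""
    n = len(daily_pnl)
    mean = sum(daily_pnl) / n
    var = sum((p - mean) ** 2 for p in daily_pnl) / (n - 1)
    return mean / math.sqrt(var) * math.sqrt(periods_per_year)


def max_drawdown(daily_pnl):
    """Largest peak-to-trough decline of cumulative P&L, as a positive number."""
    cum = peak = worst = 0.0
    for pnl in daily_pnl:
        cum += pnl
        peak = max(peak, cum)
        worst = max(worst, peak - cum)
    return worst


if __name__ == "__main__":
    pnl = [1.0, -2.0, 3.0, -1.0, 2.0]
    print(sharpe_ratio(pnl), max_drawdown(pnl))
```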
Edit in loaddata.py:
# Tradable universe
t_low_price = 2.0
t_high_price = 500.0
t_min_advp = 1000000.0 # $1M min ADV
# Expandable universe
e_low_price = 2.25
e_high_price = 500.0
e_min_advp = 5000000.0 # $5M min ADV
# Universe size
uni_size = 1400 # Top N by market cap

Edit in opt.py:
max_sumnot = 50.0e6 # $50M max notional
max_posnot = 0.0048 # 0.48% max per position
kappa = 4.3e-5 # Risk aversion
# Slippage model
slip_alpha = 1.0 # Base cost
slip_beta = 0.6 # Participation power
slip_delta = 0.25 # Participation coefficient
slip_nu = 0.14 # Market impact
execFee = 0.00015 # 1.5 bps execution fee

Edit in calc.py:
BARRA_FACTORS = ['country', 'growth', 'size', 'sizenl',
'divyild', 'btop', 'earnyild', 'beta',
'resvol', 'betanl', 'momentum', 'leverage',
'liquidty']
PROP_FACTORS = ['srisk_pct_z', 'rating_mean_z']

statarb/
├── README.md # This file
├── requirements.txt # Python dependencies
├── setup.py # Cython build configuration
│
├── loaddata.py # Data loading and preprocessing
├── calc.py # Factor calculations
├── regress.py # Regression analysis
├── pca.py # PCA decomposition
├── opt.py # Portfolio optimization
├── util.py # Utility functions
│
├── bsim.py # Daily simulation engine
├── osim.py # Order simulation engine
├── qsim.py # Intraday simulation engine
├── ssim.py # System simulation engine
│
├── bd.py # Beta-adjusted order flow
├── hl.py # High-low strategy
├── pca.py # PCA alpha generation
├── analyst*.py # Analyst signal strategies
├── rating_diff.py # Rating change strategy
├── vadj_*.py # Volume-adjusted strategies
├── mom_year.py # Momentum strategy
├── eps.py # Earnings surprise strategy
├── target.py # Price target strategy
├── c2o.py # Close-to-open strategy
├── ... (additional strategies)
│
└── salamander/ # Standalone module
├── instructions.txt # Salamander usage guide
├── requirements.txt # Salamander dependencies
├── gen_dir.py # Directory structure generator
├── gen_hl.py # Alpha signal generator
├── gen_alpha.py # Alpha file creator
├── bsim.py # Standalone backtest engine
├── simulation.py # Portfolio simulation
└── ... (supporting files)
The salamander/ directory contains a standalone, simplified version of the system for easier deployment and development.
- Modular directory structure
- Simplified alpha generation pipeline
- Standalone backtest engine
- Documented workflow in instructions.txt
# 1. Create directory structure
python3 salamander/gen_dir.py --dir=/path/to/data
# 2. Generate alpha signals from raw data
python3 salamander/gen_hl.py \
--start=20100630 \
--end=20130630 \
--dir=/path/to/data
# 3. Create alpha signal files
python3 salamander/gen_alpha.py \
--start=20100630 \
--end=20130630 \
--dir=/path/to/data
# 4. Run backtest
python3 salamander/bsim.py \
--start=20130101 \
--end=20130630 \
--dir=/path/to/data \
--fcast=hl:1:1

data/
├── all/ # Alpha signal files
├── hl/ # High-low strategy files
├── locates/ # Short borrow data (borrow.csv)
├── opt/ # Optimization outputs
├── blotter/ # Trade records
├── raw/ # Raw market data
└── all_graphs/ # Visualization outputs
The system evaluates strategies using:
- Sharpe Ratio: Risk-adjusted returns (annualized)
- Information Ratio: Alpha vs. benchmark volatility
- Maximum Drawdown: Largest peak-to-trough decline
- Turnover: Average daily trading as % of capital
- Hit Rate: Percentage of profitable days
- Factor Exposures: Bets on Barra risk factors
- Participation Rate: Trading volume vs. ADV
- Factor Neutrality: Limits on Barra factor exposures
- Sector Limits: Industry concentration constraints
- Position Sizing: Market cap and liquidity-based limits
- Participation Constraints: Max 1.5% of ADV to minimize impact
- Correlation Monitoring: Rolling 30-day cross-security correlations
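Two of the checks above are simple to express directly: portfolio factor exposure (sum of position times loading) and the 1.5% ADV participation cap. The sketch below uses hypothetical function names and dict-based inputs; it illustrates the arithmetic and is not the logic in opt.py.

```python
def factor_exposures(positions, loadings):
    """Dollar exposure of the portfolio to each factor.

    positions: {sid: dollar position}
    loadings:  {sid: {factor name: loading}}
    """
    exposures = {}
    for sid, dollars in positions.items():
        for factor, loading in loadings[sid].items():
            exposures[factor] = exposures.get(factor, 0.0) + dollars * loading
    return exposures


def cap_trade_by_participation(trade_dollars, adv_dollars, max_participation=0.015):
    """Clip a signed trade so its size is at most max_participation * ADV."""
    limit = max_participation * adv_dollars
    return max(-limit, min(limit, trade_dollars))


if __name__ == "__main__":
    pos = {1: 100.0, 2: -100.0}
    load = {1: {"beta": 1.2}, 2: {"beta": 0.8}}
    print(factor_exposures(pos, load))
    print(cap_trade_by_participation(50000.0, 1.0e6))
```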
To create a new alpha signal:
- Create a new Python file (e.g., my_alpha.py)
- Load data using loaddata.py functions
- Calculate your alpha signal
- Use regress.py to fit coefficients on training data
- Generate out-of-sample forecasts
- Save to HDF5 or CSV for simulation engines
Example structure:
from loaddata import *
from calc import *
from regress import *
# Load data
daily_df = load_prices(start, end, lookback)
barra_df = load_barra(start, end, lookback)
# Calculate alpha
daily_df['my_alpha'] = calculate_my_signal(daily_df)
# Fit regression
fits_df = regress_alpha(daily_df, 'my_alpha', horizon=3)
# Generate forecast
forecast_df = apply_coefficients(daily_df, fits_df)
# Save results
dump_alpha(forecast_df, 'my_alpha')

Combine multiple alphas with optimized weights:
python bsim.py \
--start=20130101 \
--end=20130630 \
--fcast=pca:1.0:0.3,hl:1.2:0.25,bd:0.8:0.2,analyst:1.5:0.15,mom:1.0:0.1

Weights should sum to 1.0 for proper risk attribution.
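A quick way to check a set of weights before passing them to `--fcast` is to rescale them so that they sum to 1.0. The helper below is a hypothetical convenience for this purpose and is not part of the codebase.

```python
def normalize_weights(weights, tol=1e-9):
    """Return weights rescaled to sum to 1.0.

    Weights that already sum to 1.0 within `tol` are returned unchanged.
    """
    total = sum(weights)
    if total <= 0.0:
        raise ValueError("weights must have a positive sum")
    if abs(total - 1.0) <= tol:
        return list(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    print(normalize_weights([0.3, 0.25, 0.2, 0.15, 0.1]))
    print(normalize_weights([2.0, 2.0]))
```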
The system models realistic costs:
- Execution Fees: 1.5 bps fixed
- Slippage: Nonlinear function of participation rate
- Market Impact: Based on order size vs. ADV
- Borrow Costs: For short positions
- Opportunity Cost: From delayed fills
Analyze realized vs. estimated costs using the OSIM engine.
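To show what "nonlinear function of participation rate" can mean in practice, the sketch below combines the fixed 1.5 bps fee with a power-law slippage term. The power-law form and the `impact_coef` and `impact_power` values are assumptions chosen for illustration. They are not the `slip_*` parameters or the cost model in opt.py, and borrow and opportunity costs are left out.

```python
def estimated_cost(trade_dollars, adv_dollars, exec_fee=0.00015,
                   impact_coef=0.01, impact_power=0.6):
    """Dollar cost of a trade: fixed fee plus power-law slippage.

    participation = |trade| / ADV
    slippage      = impact_coef * |trade| * participation ** impact_power
    impact_coef and impact_power are illustrative, not calibrated values.
    """
    size = abs(trade_dollars)
    if size == 0.0:
        return 0.0
    participation = size / adv_dollars
    fee = exec_fee * size
    slippage = impact_coef * size * participation ** impact_power
    return fee + slippage


if __name__ == "__main__":
    for trade in (10000.0, 20000.0, 40000.0):
        print(trade, estimated_cost(trade, adv_dollars=1.0e6))
```

Because slippage grows faster than linearly in trade size, doubling a trade more than doubles its estimated cost, which is why the optimizer is penalized for concentrating volume in illiquid names.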
This is a research codebase. Key areas for improvement:
- Python 3 migration
- Additional alpha strategies
- Enhanced optimization algorithms
- Real-time data integration
- Machine learning alpha generation
- Improved execution modeling
Apache 2.0
For questions and support, please open an issue on GitHub.
Disclaimer: This system is for research and educational purposes. Use at your own risk. Past performance does not guarantee future results. Trading involves substantial risk of loss.