Speculorix Prototype

ML-driven stock selection using financial fundamentals and XGBoost. currently trained on annual Compustat-CRSP data (2010-2024).

current state

this is a working prototype that demonstrates the core methodology:

  • loads fundamental data (balance sheet, income statement, cash flow)
  • engineers ~90 features (profitability, leverage, liquidity, valuation, etc.)
  • trains XGBoost to predict monthly returns
  • backtests a top-30 equal-weighted portfolio
  • evaluates via Information Coefficient and quintile analysis
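the evaluation step (IC + quintile analysis) can be sketched roughly like this; `month`, `pred`, and `fwd_ret` are placeholder column names, not necessarily the prototype's actual schema:

```python
import pandas as pd
from scipy.stats import spearmanr

def monthly_ic(df: pd.DataFrame) -> pd.Series:
    """rank IC: Spearman correlation of predictions vs realized forward returns, per month."""
    return df.groupby("month").apply(lambda g: spearmanr(g["pred"], g["fwd_ret"])[0])

def quintile_spread(df: pd.DataFrame) -> float:
    """average top-minus-bottom quintile forward return across months."""
    q = df.groupby("month")["pred"].transform(
        lambda s: pd.qcut(s.rank(method="first"), 5, labels=False)
    )
    by_q = df.groupby(["month", q])["fwd_ret"].mean().unstack()
    return (by_q[4] - by_q[0]).mean()
```

the spread check is a useful sanity test alongside IC: a positive IC with a flat quintile spread usually means the signal lives in the middle of the distribution, where you can't trade it.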

test results (2022-2024):

  • IC: +0.04 (positive but weak)
  • monthly alpha: ~1.2%
  • only 15 test months (not statistically significant)

the model works, but the data is sparse. each company only appears once per fiscal year, which limits what we can do.

known limitations

  1. sparse data - annual fundamentals miss a lot. companies report quarterly at minimum, and prices move daily. we're basically predicting with stale information.

  2. small sample size - 15 test months isn't enough to prove anything. could easily be luck. need 50+ months minimum for confidence.

  3. no momentum signals - can't build price-based features (momentum, volatility, beta) with one observation per year. these are huge alpha sources in real quant funds.

  4. equal weighting - not optimal. should be doing mean-variance optimization with proper risk management, but need more data points for stable covariance estimates.

  5. no transaction costs - backtest assumes zero friction. in reality, trading 30 stocks monthly costs ~10-50 bps depending on size and market impact.

what we need next

to build something production-ready, we need monthly (or better yet, daily) data. here's the rough plan once we get proper data:

phase 1: feature expansion

right now we only use fundamental ratios. with monthly data we can add:

momentum & technical

  • rolling returns (1m, 3m, 6m, 12m)
  • volatility and Sharpe ratios
  • short-term reversal (last month's losers often bounce)
  • 52-week high/low proximity
  • volume trends
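with a monthly price panel, those features could look something like this (a sketch, assuming a wide dataframe of monthly closes, one column per ticker):

```python
import pandas as pd

def add_momentum_features(prices: pd.DataFrame) -> pd.DataFrame:
    """prices: monthly closes, one column per ticker. returns a (feature, ticker) panel."""
    rets = prices.pct_change()
    out = {}
    for k in (1, 3, 6, 12):
        out[f"ret_{k}m"] = prices.pct_change(k)
    # classic momentum skips the most recent month to sidestep short-term reversal
    out["mom_12_1"] = prices.shift(1).pct_change(11)
    out["vol_12m"] = rets.rolling(12).std()
    # proximity to 52-week high: 0 means at the high, negative means below it
    out["dist_52w_high"] = prices / prices.rolling(12).max() - 1.0
    return pd.concat(out, axis=1)
```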

growth & changes

  • quarter-over-quarter earnings growth
  • revenue acceleration (is growth speeding up or slowing?)
  • margin expansion/contraction
  • analyst estimate revisions (if we can get that data)

relative features

  • how does this stock rank vs peers in same industry?
  • is it cheap or expensive relative to sector?
  • sector momentum (buy winners in strong sectors)
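sector-relative ranking is nearly a one-liner with pandas groupby; the `sector` column and the `pe` feature here are hypothetical names:

```python
import pandas as pd

def sector_relative(df: pd.DataFrame, cols: list[str]) -> pd.DataFrame:
    """add within-sector percentile ranks and z-scores for the given feature columns."""
    out = df.copy()
    for c in cols:
        out[f"{c}_sector_pct"] = df.groupby("sector")[c].rank(pct=True)
        out[f"{c}_sector_z"] = df.groupby("sector")[c].transform(
            lambda s: (s - s.mean()) / s.std()
        )
    return out
```

the percentile version is robust to outliers; the z-score version preserves magnitude. worth keeping both and letting the model decide.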

macro regime

  • VIX, interest rates, yield curve
  • market breadth indicators
  • does stock perform well in high-vol environments? (defensive stocks)

the key insight: momentum is probably more predictive than fundamentals for short horizons (1-3 months), but fundamentals matter for long-term (6-12 months). we should combine both.

phase 2: model improvements

try multiple models

  • LightGBM (often faster and better than XGBoost)
  • CatBoost (handles categorical features well)
  • simple linear models as baselines (ridge, lasso)
  • neural networks (see note below)

then ensemble them. don't just pick the best model on validation - combine predictions from all models with learned weights. more robust.

neural networks / deep learning

with our current sparse data, neural nets are probably overkill. tree-based models (XGBoost, LightGBM) usually win on small tabular datasets.

BUT with monthly data (50k+ observations), deep learning becomes viable:

  • TabNet - attention-based architecture for tabular data (surprisingly good)
  • FT-Transformer - feature tokenizer + transformer (SOTA on some financial datasets)
  • LSTM / GRU - for time-series patterns (capture temporal dependencies XGBoost might miss)
  • simple MLP - 3-4 dense layers as a baseline

pros: can learn complex non-linear patterns, good for high-dimensional data
cons: needs more data, harder to interpret, slower to train, more hyperparameters

worth trying if you have 50k+ observations. probably won't beat a well-tuned XGBoost by much (5-10% IC improvement), but good for ensemble diversity.
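a minimal dense-net baseline with scikit-learn (TabNet / FT-Transformer would need PyTorch; this is just the "3-4 dense layers" comparison point, with the scaling neural nets require on raw financial ratios):

```python
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def mlp_baseline(random_state: int = 0):
    """3 dense layers with early stopping; scaling matters a lot for neural nets."""
    return make_pipeline(
        StandardScaler(),
        MLPRegressor(hidden_layer_sizes=(64, 32, 16), alpha=1e-3,
                     early_stopping=True, max_iter=500, random_state=random_state),
    )
```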

hyperparameter tuning

use Bayesian Optimization instead of grid search or random search. BO builds a probabilistic model of how hyperparameters affect performance and uses that to intelligently pick the next set to try. much more efficient than brute force.

tools: Optuna and Hyperopt are both BO frameworks (Optuna is newer and easier to use).

process:

  • define search space (max_depth from 3-8, learning_rate from 0.01-0.3, etc.)
  • maximize IC on validation set (NOT training loss)
  • run 100-200 trials, takes 2-4 hours
  • typical gains: 5-15% IC improvement over default params
  • be careful not to overfit - validate on a separate time period that wasn't used for tuning

grid search would take forever (8 depths × 10 learning rates × 5 subsamples = 400 combinations). BO finds good params in 100 trials.

note: some people claim BO is overkill and random search is just as good. maybe true for simple models, but with XGBoost's 10+ hyperparameters, BO definitely helps. plus it's automated so you can run it overnight.

avoid overfitting

  • with more data, we can use early stopping more aggressively
  • cross-validation across time periods
  • penalize complexity (strong L1/L2 regularization)
  • don't optimize hyperparameters on the same data you'll use for final testing

phase 3: portfolio construction

right now we just pick top 30 and equal-weight them. pretty naive.

mean-variance optimization

  • maximize expected return / portfolio volatility
  • needs a covariance matrix (requires 50+ months of returns)
  • constraints: no single stock > 10%, no sector > 30%
  • long-only for now (shorting is expensive and risky)
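a long-only max-Sharpe optimizer with scipy covers the bullets above (per-name cap only here; sector caps would be additional linear constraints):

```python
import numpy as np
from scipy.optimize import minimize

def mean_variance_weights(mu: np.ndarray, cov: np.ndarray, max_w: float = 0.10) -> np.ndarray:
    """maximize expected return / volatility, long-only, per-name cap."""
    n = len(mu)

    def neg_sharpe(w):
        return -(w @ mu) / np.sqrt(w @ cov @ w)

    cons = [{"type": "eq", "fun": lambda w: w.sum() - 1.0}]   # fully invested
    bounds = [(0.0, max_w)] * n                                # long-only, capped
    res = minimize(neg_sharpe, np.full(n, 1 / n), bounds=bounds,
                   constraints=cons, method="SLSQP")
    return res.x
```

with noisy covariance estimates the raw optimizer concentrates aggressively; shrinkage on `cov` (e.g. toward the diagonal) is usually worth it before trusting the output.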

transaction cost model

  • estimate commissions (5-10 bps)
  • market impact depends on trade size (slippage)
  • only rebalance when benefit exceeds cost
  • maybe rebalance less frequently? monthly might be too much
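the "only rebalance when benefit exceeds cost" rule can be a few lines; the flat bps cost and the 2x safety buffer here are assumptions, not calibrated numbers:

```python
def trading_cost_bps(old_w: dict, new_w: dict, cost_bps: float = 10.0) -> float:
    """cost of moving between weight dicts, at a flat per-trade cost in bps of traded value."""
    names = set(old_w) | set(new_w)
    traded = sum(abs(new_w.get(t, 0.0) - old_w.get(t, 0.0)) for t in names)
    return traded * cost_bps

def should_rebalance(expected_gain_bps: float, old_w: dict, new_w: dict,
                     cost_bps: float = 10.0, buffer: float = 2.0) -> bool:
    """only trade when the expected alpha pickup clears costs by a safety multiple."""
    return expected_gain_bps > buffer * trading_cost_bps(old_w, new_w, cost_bps)
```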

risk limits

  • max portfolio volatility (say 20% annualized)
  • max drawdown tolerance
  • stop trading if model IC drops below zero for 6+ months

phase 4: backtesting & validation

walk-forward testing

  • retrain model every quarter on expanding window
  • backtest should be as realistic as possible
  • include transaction costs, execution delays, realistic fills
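the quarterly expanding-window schedule can be expressed as a split generator (period labels are whatever the panel uses, e.g. months):

```python
def walk_forward_splits(dates, min_train: int = 36, test_len: int = 3):
    """expanding-window splits: retrain every `test_len` periods on all data seen so far.
    dates: sorted list of unique period labels (e.g. months)."""
    for start in range(min_train, len(dates), test_len):
        yield dates[:start], dates[start:start + test_len]
```

train and test never overlap, and the model at each step only sees data available at that point in time, which is the whole point of walk-forward testing.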

out-of-sample testing

  • hold out 2023-2024 completely until final validation
  • don't touch test set until we're confident in the model
  • if test IC < 0.02, something's wrong

statistical tests

  • bootstrap to get confidence intervals on alpha
  • adjust for multiple testing (trying many strategies)
  • Sharpe ratio should be > 1.0 to be interesting
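a percentile bootstrap for the alpha confidence interval (monthly alpha series in, 95% CI out) is a handful of lines:

```python
import numpy as np

def bootstrap_alpha_ci(monthly_alpha: np.ndarray, n_boot: int = 10_000, seed: int = 0):
    """percentile bootstrap 95% CI for mean monthly alpha."""
    rng = np.random.default_rng(seed)
    n = len(monthly_alpha)
    # resample months with replacement, take the mean of each resample
    means = rng.choice(monthly_alpha, size=(n_boot, n), replace=True).mean(axis=1)
    return np.percentile(means, [2.5, 97.5])
```

if the lower bound sits below zero, the alpha isn't distinguishable from noise at that sample size, which is exactly the problem with 15 test months.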

phase 5: production considerations

once we have something that works on historical data:

monitoring

  • track IC every month (should stay positive)
  • if model breaks, stop trading and retrain
  • watch for regime changes (2020 covid, 2022 rate hikes, etc.)

retraining schedule

  • probably retrain quarterly
  • use rolling 3-year window (or expanding window?)
  • test both and see what works better

live execution

  • need real-time data feed
  • execute trades at market open (less impact)
  • track slippage vs backtest assumptions

risk management

  • daily P&L tracking
  • automated alerts if drawdown > 10%
  • have a kill switch ready

realistic expectations

even with monthly data and all these improvements, this is still hard:

  • good case: IC of 0.05-0.08, Sharpe 1.5-2.0, annual alpha 5-10%
  • realistic case: IC of 0.03-0.05, Sharpe 1.0-1.5, alpha 3-7%
  • bad case: model stops working after regime change, IC drops to zero

most quant funds have ICs in the 0.02-0.05 range and still make money because they trade thousands of stocks with leverage. we're small, so we need to be selective and keep costs low.

also, past performance doesn't guarantee future results (obviously). markets adapt, alphas decay, strategies get crowded. need to keep iterating.

note on deep learning: don't expect neural nets to magically 10x your returns. most of the alpha comes from good features and clean data. DL might add 5-10% IC improvement at best. stick to XGBoost + ensemble first, then explore neural nets if you're curious. but don't fall into the "more complex = better" trap.

tech stack

current:

  • python 3.x
  • pandas, numpy
  • xgboost
  • scipy, sklearn
  • matplotlib

probably need later:

  • faster data processing (polars or dask for large datasets)
  • database (postgres or timescaledb for time-series)
  • Bayesian optimization (Optuna or Hyperopt for hyperparameter tuning)
  • deep learning (PyTorch or TensorFlow if we try neural nets)
  • MLOps tools (MLflow for experiment tracking)
  • cloud compute (AWS or GCP for big backtests)

optional / research:

  • graph neural networks (if we add company relationship data)
  • NLP models (for earnings call transcripts, news sentiment)
  • alternative data sources (satellite imagery, credit card data, etc.)

next steps

core work (must-do):

  1. acquire monthly fundamental + price data (Capital IQ, Bloomberg, or FactSet)
  2. validate that prototype works on monthly data
  3. add momentum features (biggest priority)
  4. try LightGBM and ensemble models
  5. implement mean-variance optimization
  6. run realistic backtest with transaction costs
  7. if Sharpe > 1.0, consider live paper trading
  8. if paper trading works for 6 months, maybe go live with small capital

optional exploration (if time permits):

  9. try neural nets (TabNet, LSTM) for ensemble diversity
  10. add alternative data (sentiment, news, macro indicators)
  11. multi-asset strategies (bonds, commodities, FX)
  12. sector rotation and factor timing strategies

resources

  • papers: Fama-French factors, momentum (Jegadeesh), fundamental analysis (Piotroski F-Score)
  • books: "Quantitative Trading" by Ernie Chan, "Advances in Financial Machine Learning" by Marcos Lopez de Prado
  • repos: check out QuantConnect, Zipline, Backtrader for inspiration

note: this is an experimental research project. don't bet the farm on this. start small, validate rigorously, and be ready to shut it down if it stops working.

last updated: january 2026
