
dagobert

"Whenever motivation is down, think about the teak wood! Onwards and upwards compadre!"

dagobert is an abandoned algorithmic crypto trading research project. It was built by Daniel Homola and Marcell Máté over roughly 1,000 hours across 2019–2021 with the audacious goal of building a fully automated trading system that would make us rich within a year or two.

It failed spectacularly at that goal. But we learned an enormous amount about financial machine learning, signal processing, time series modelling, and - most importantly - developed a healthy skepticism toward anyone claiming you can reliably make money in day trading without being at a top-tier hedge fund. Simple ML models, moving average crossovers, and RSI-based strategies are not edges. They're noise traps dressed up as insights.

The codebase is provided as-is, without warranties of any kind. We're open-sourcing it in the hope that someone finds the implementation work useful - the pipeline infrastructure, the financial ML building blocks, or the DL/RL training loops. It reflects serious engineering effort even if the underlying thesis didn't pan out.

A note on the project

We put a serious amount of work into this - the infrastructure is solid, the implementations are careful, and the test coverage is real. What we learned is that the hard part of systematic trading isn't building the pipeline. It's having an actual edge. Feature importance from a gradient boosted tree on historical crypto data is not an edge. A TCN that achieves 52% directional accuracy on a backtest is not an edge when you account for transaction costs, slippage, and the fact that crypto markets in 2020–2021 were largely driven by macro sentiment and retail FOMO rather than any signal a model trained on OHLCV bars could hope to capture.

If you're here because you want to build a trading bot that makes money: our honest advice is don't, unless you're prepared to spend years on it and have access to proprietary data and infrastructure that the rest of the market doesn't have.

If you're here because you want to learn financial ML, study time series modelling, or need battle-tested building blocks for a research pipeline - this codebase might genuinely save you hundreds of hours.


What's inside

The project loosely follows Marcos López de Prado's Advances in Financial Machine Learning and covers the full stack from raw exchange data to trained models.

Data Ingestion

  • Binance API client for fetching historical OHLCV data
  • SQLAlchemy-based local database management
  • S3 utilities for reading/writing pipeline artefacts (feather format)

Bar Construction (preprocessing/bars/)

Raw tick data is resampled into information-driven bars - a core idea from López de Prado - rather than fixed time bars:

  • Standard bars: time, tick, volume, dollar
  • Imbalance bars: tick, volume, and dollar imbalance variants that sample more frequently during periods of high activity
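The dollar-bar idea can be sketched in a few lines: accumulate ticks and close a bar whenever the cumulative traded dollar value crosses a threshold. This is an illustrative re-implementation, not the repo's code; the `price`/`volume` column names are assumptions.

```python
import pandas as pd

def dollar_bars(ticks: pd.DataFrame, threshold: float) -> pd.DataFrame:
    """Aggregate ticks into dollar bars: a new bar closes once the
    cumulative traded dollar value (price * volume) crosses `threshold`."""
    bars, bucket, dollars = [], [], 0.0
    for row in ticks.itertuples():
        bucket.append(row)
        dollars += row.price * row.volume
        if dollars >= threshold:
            prices = [t.price for t in bucket]
            bars.append({
                "open": prices[0],
                "high": max(prices),
                "low": min(prices),
                "close": prices[-1],
                "volume": sum(t.volume for t in bucket),
            })
            bucket, dollars = [], 0.0
    return pd.DataFrame(bars)
```

Because sampling is driven by traded value rather than the clock, bars arrive faster when the market is active and slower when it is quiet.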

Feature Engineering (preprocessing/feature_creation/)

  • Fractional differencing: makes price series stationary while preserving memory, using the minimum differencing order that passes ADF
  • Technical indicators via stockstats: RSI, MACD, Bollinger Bands, ATR, and many others
  • Time features: cyclical encoding of hour, day-of-week, month, etc.
  • Boruta feature selection: wrapper around the Boruta algorithm for identifying genuinely predictive features
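The fractional differencing weights follow the recursion w_0 = 1, w_k = -w_{k-1}(d - k + 1)/k. A minimal fixed-width-window sketch (not the repo's implementation, which additionally searches for the smallest d that passes the ADF test):

```python
import numpy as np

def frac_diff_weights(d: float, size: int) -> np.ndarray:
    # w_0 = 1; w_k = -w_{k-1} * (d - k + 1) / k
    w = [1.0]
    for k in range(1, size):
        w.append(-w[-1] * (d - k + 1) / k)
    return np.array(w)

def frac_diff(series: np.ndarray, d: float, window: int = 10) -> np.ndarray:
    """Fixed-width-window fractional differencing of a 1-D series.
    The first `window - 1` outputs are NaN (not enough history)."""
    w = frac_diff_weights(d, window)[::-1]  # oldest weight first
    out = np.full(len(series), np.nan)
    for i in range(window - 1, len(series)):
        out[i] = w @ series[i - window + 1 : i + 1]
    return out
```

A sanity check on the recursion: at d = 1 the weights collapse to [1, -1, 0, ...], so the output is the ordinary first difference; at 0 < d < 1 the weights decay slowly, which is what preserves memory.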

Labelling (preprocessing/labelling/)

  • Triple-barrier labelling: assigns +1/−1/0 labels based on which barrier (profit-target, stop-loss, or vertical/time barrier) is hit first
  • Configurable stop-loss, profit-target multipliers, and time horizons
  • Label binarisation strategies for handling the neutral class
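The triple-barrier logic reduces to a race between three barriers. A simplified sketch for a single long entry (the repo's version is vectorised and works on timestamps; the fractional `pt`/`sl` parametrisation here is an assumption):

```python
def triple_barrier_label(prices, entry, pt, sl, horizon):
    """Return +1 if the profit-target barrier is hit first, -1 if the
    stop-loss is hit first, and 0 if the vertical (time) barrier expires."""
    entry_price = prices[entry]
    upper = entry_price * (1 + pt)   # profit-target barrier
    lower = entry_price * (1 - sl)   # stop-loss barrier
    end = min(entry + horizon, len(prices) - 1)  # vertical barrier
    for i in range(entry + 1, end + 1):
        if prices[i] >= upper:
            return 1
        if prices[i] <= lower:
            return -1
    return 0
```

The neutral class (0) is what the binarisation strategies then deal with: drop it, fold it into one side, or model it as a third class.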

Sample Weights & Sampling (preprocessing/sampling/)

  • Uniqueness-based sample weights to correct for overlapping labels (another López de Prado concept)
  • Downsampling strategies for class imbalance
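The intuition behind uniqueness weights: when two triple-barrier labels span overlapping bars, they share information, so each should count for less. A sketch of average uniqueness over inclusive (start, end) label spans - an illustration of the concept, not the repo's implementation:

```python
import numpy as np

def average_uniqueness(spans, n_bars):
    """For each label spanning bars [s, e], weight = mean over its span of
    1 / (number of labels concurrently active at that bar)."""
    concurrency = np.zeros(n_bars)
    for s, e in spans:
        concurrency[s : e + 1] += 1
    return np.array([
        np.mean(1.0 / concurrency[s : e + 1]) for s, e in spans
    ])
```

Non-overlapping labels get weight 1.0; heavily overlapping labels get proportionally less, which keeps a bootstrap or a tree ensemble from over-counting the same market move.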

Mini-Series Extraction (preprocessing/mini_series/)

  • Rolling window extraction to produce fixed-length lookback sequences
  • Used as input to sequence models (TCN)
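The extraction itself is a plain sliding window that turns a 2-D feature matrix into a 3-D tensor of lookback sequences. A minimal sketch (function name assumed, not from the repo):

```python
import numpy as np

def extract_mini_series(features: np.ndarray, lookback: int) -> np.ndarray:
    """(n_samples, n_features) -> (n_windows, lookback, n_features),
    where window i covers rows [i, i + lookback)."""
    n_windows = features.shape[0] - lookback + 1
    return np.stack([features[i : i + lookback] for i in range(n_windows)])
```

Each window is then paired with the label at its final bar, so the model only ever sees information available at prediction time.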

Cross-Validation (modelling/cv/)

Time-series-aware CV that never leaks future data:

  • DateSplitter: split by date ranges
  • TimeSeriesCV: purged/embargoed k-fold CV
  • TrainValTestSplitter: proper three-way temporal split
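Purging and embargoing can be illustrated with a simplified index-based splitter: training samples too close to the test fold are dropped on both sides, so overlapping labels cannot leak across the boundary. The repo's TimeSeriesCV presumably operates on timestamps and label end-times; this sketch uses plain bar indices:

```python
def purged_kfold_splits(n_samples, n_splits, purge=0, embargo=0):
    """Yield (train_idx, test_idx) pairs. Train samples within `purge`
    bars before the test fold, or `embargo` bars after it, are dropped."""
    fold = n_samples // n_splits
    for k in range(n_splits):
        test_start = k * fold
        test_end = n_samples if k == n_splits - 1 else test_start + fold
        train = [i for i in range(n_samples)
                 if i < test_start - purge or i >= test_end + embargo]
        yield train, list(range(test_start, test_end))
```

Without the purge/embargo gaps, a label whose horizon straddles the fold boundary would appear (partially) in both train and test - one of the most common sources of inflated backtest accuracy.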

Deep Learning - TCN (modelling/dl/)

A Temporal Convolutional Network for multi-class directional prediction:

  • Dilated causal convolutions with residual connections
  • Multi-instrument CryptoDataset supporting simultaneous training on multiple pairs (BTC, ETH, LTC, XRP, BCH, ...)
  • Data augmentation (noise injection, time warping)
  • PyTorch Lightning training loop with early stopping, LR scheduling
  • Optuna hyperparameter search with parallelised trials and Comet ML experiment tracking
  • AdaBelief optimiser
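The key property of a TCN layer is causality: output at time t may only depend on inputs at t and earlier, with dilation stretching the receptive field exponentially across layers. The repo's model is built in PyTorch; this NumPy loop is only meant to make the causal indexing explicit:

```python
import numpy as np

def dilated_causal_conv(x, kernel, dilation):
    """1-D causal convolution: out[t] = sum_k kernel[k] * x[t - dilation*k],
    so no future value ever contributes to the output at time t."""
    out = np.zeros(len(x))
    for t in range(len(x)):
        for k, w in enumerate(kernel):
            idx = t - dilation * k
            if idx >= 0:  # taps before the series start are simply skipped
                out[t] += w * x[idx]
    return out
```

Stacking such layers with dilations 1, 2, 4, ... plus residual connections gives a long effective lookback without recurrence, which is why TCNs train faster and more stably than LSTMs on this kind of data.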

Reinforcement Learning - PPO (modelling/rl/)

A custom Proximal Policy Optimisation agent for portfolio allocation across multiple crypto assets:

  • Actor-Critic architecture with a Dirichlet distribution output head for continuous portfolio weights
  • Multi-head parallel environment for simultaneous simulation across instruments
  • Configurable reward shaping, transaction cost modelling
  • Parallel worker rollout collection
  • Based loosely on the PGPortfolio paper architecture with significant modifications
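Why a Dirichlet output head: samples from a Dirichlet distribution are non-negative and sum to one, so every action the policy draws is automatically a valid long-only portfolio allocation. A minimal sketch of the sampling step (softplus-to-concentration mapping is an assumption, not necessarily the repo's choice):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_portfolio(logits):
    """Map raw network outputs to Dirichlet concentrations alpha > 0,
    then sample a weight vector on the probability simplex."""
    alphas = np.log1p(np.exp(np.asarray(logits, dtype=float)))  # softplus
    return rng.dirichlet(alphas)
```

Larger concentrations make the policy more deterministic around its mean allocation; near-zero concentrations push samples toward the corners of the simplex (all-in on one asset), so the head's scale doubles as an exploration knob.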

Pipeline Orchestration

  • Snakemake workflows in pipelines/bars_features_labels/ for end-to-end reproducible runs
  • YAML-driven configuration: runners dynamically instantiate Python classes from config, allowing full pipeline variation without code changes (see runner_utils.py)
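The dynamic-instantiation pattern is simple to sketch: a config entry names a fully qualified class plus its keyword arguments, and the runner imports and constructs it. The config shape below is hypothetical - see runner_utils.py for the repo's actual convention:

```python
import importlib

def instantiate(config: dict):
    """Build an object from {"class": "module.ClassName", "params": {...}}.
    Illustrative only; the repo's config schema may differ."""
    module_name, class_name = config["class"].rsplit(".", 1)
    cls = getattr(importlib.import_module(module_name), class_name)
    return cls(**config.get("params", {}))
```

Because the class path lives in YAML, swapping a bar type, labeller, or model means editing config rather than code - the whole pipeline stays declarative.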

Test Coverage

Extensive pytest suite covering:

  • Bar construction (standard + imbalance)
  • All feature creation modules (frac diff, stockstats, time features)
  • Labelling and binarisation
  • Mini-series extraction and merging
  • Sample weights and downsampling
  • CV splitters
  • DL data pipeline and tensor construction
  • RL environment dynamics

Installation

pip install -e .

Or with the full conda environment (Python 3.7, includes MKL/GPU deps):

conda env create -f environment.yml
conda activate dagobert

Running the pipelines

After installation, CLI entry points are available:

dagobert-preprocessing   # bar creation, feature engineering, labelling
dagobert-tcn             # train TCN or launch Optuna study
dagobert-rl              # train PPO agent
dagobert-optuna          # Optuna study management
dagobert-s3              # S3 file utilities

Each runner takes a YAML config file. See config/ and config/custom/ for examples.

For the full Snakemake pipeline:

cd pipelines/bars_features_labels/
snakemake --cores N

Running tests

pytest

About

Building Scrooge McDuck's vault
