
dagobert

"Whenever motivation is down, think about the teak wood! Onwards and upwards compadre!"

dagobert is an abandoned algorithmic crypto trading research project. It was built by Daniel Homola and Marcell Máté over roughly 1,000 hours across 2019–2021 with the audacious goal of building a fully automated trading system that would make us rich within a year or two.

It failed spectacularly at that goal. But we learned an enormous amount about financial machine learning, signal processing, time series modelling, and - most importantly - developed a healthy skepticism toward anyone claiming you can reliably make money in day trading without being at a top-tier hedge fund. Simple ML models, moving average crossovers, and RSI-based strategies are not edges. They're noise traps dressed up as insights.

The codebase is provided as-is, without warranties of any kind. We're open-sourcing it in the hope that someone finds the implementation work useful - the pipeline infrastructure, the financial ML building blocks, or the DL/RL training loops. It reflects serious engineering effort even if the underlying thesis didn't pan out.

A note on the project

We put a serious amount of work into this - the infrastructure is solid, the implementations are careful, and the test coverage is real. What we learned is that the hard part of systematic trading isn't building the pipeline. It's having an actual edge. Feature importance from a gradient boosted tree on historical crypto data is not an edge. A TCN that achieves 52% directional accuracy on a backtest is not an edge when you account for transaction costs, slippage, and the fact that crypto markets in 2020–2021 were largely driven by macro sentiment and retail FOMO rather than any signal a model trained on OHLCV bars could hope to capture.

If you're here because you want to build a trading bot that makes money: our honest advice is don't, unless you're prepared to spend years on it and have access to proprietary data and infrastructure that the rest of the market doesn't have.

If you're here because you want to learn financial ML, study time series modelling, or need battle-tested building blocks for a research pipeline - this codebase might genuinely save you hundreds of hours.


What's inside

The project loosely follows Marcos López de Prado's Advances in Financial Machine Learning and covers the full stack from raw exchange data to trained models.

Data Ingestion

  • Binance API client for fetching historical OHLCV data
  • SQLAlchemy-based local database management
  • S3 utilities for reading/writing pipeline artefacts (feather format)

Bar Construction (preprocessing/bars/)

Raw tick data is resampled into information-driven bars - a core idea from López de Prado - rather than fixed time bars:

  • Standard bars: time, tick, volume, dollar
  • Imbalance bars: tick, volume, and dollar imbalance variants that sample more frequently during periods of high activity
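The dollar-bar idea can be sketched in a few lines: accumulate ticks and close a bar whenever the cumulative traded dollar value crosses a threshold. This is an illustrative re-implementation, not the repo's code; the `price`/`volume` column names are assumptions.

```python
import pandas as pd

def dollar_bars(ticks: pd.DataFrame, threshold: float) -> pd.DataFrame:
    """Aggregate ticks into dollar bars: a new bar closes once the
    cumulative traded dollar value (price * volume) crosses `threshold`."""
    bars, bucket, dollars = [], [], 0.0
    for row in ticks.itertuples():
        bucket.append(row)
        dollars += row.price * row.volume
        if dollars >= threshold:
            prices = [t.price for t in bucket]
            bars.append({
                "open": prices[0],
                "high": max(prices),
                "low": min(prices),
                "close": prices[-1],
                "volume": sum(t.volume for t in bucket),
            })
            bucket, dollars = [], 0.0
    return pd.DataFrame(bars)
```

Because sampling is driven by traded value rather than the clock, bars arrive faster when the market is active and slower when it is quiet.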

Feature Engineering (preprocessing/feature_creation/)

  • Fractional differencing: makes price series stationary while preserving memory, using the minimum differencing order that passes ADF
  • Technical indicators via stockstats: RSI, MACD, Bollinger Bands, ATR, and many others
  • Time features: cyclical encoding of hour, day-of-week, month, etc.
  • Boruta feature selection: wrapper around the Boruta algorithm for identifying genuinely predictive features
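The fractional differencing weights follow the recursion w_0 = 1, w_k = -w_{k-1}(d - k + 1)/k. A minimal fixed-width-window sketch (not the repo's implementation, which additionally searches for the smallest d that passes the ADF test):

```python
import numpy as np

def frac_diff_weights(d: float, size: int) -> np.ndarray:
    # w_0 = 1; w_k = -w_{k-1} * (d - k + 1) / k
    w = [1.0]
    for k in range(1, size):
        w.append(-w[-1] * (d - k + 1) / k)
    return np.array(w)

def frac_diff(series: np.ndarray, d: float, window: int = 10) -> np.ndarray:
    """Fixed-width-window fractional differencing of a 1-D series.
    The first `window - 1` outputs are NaN (not enough history)."""
    w = frac_diff_weights(d, window)[::-1]  # oldest weight first
    out = np.full(len(series), np.nan)
    for i in range(window - 1, len(series)):
        out[i] = w @ series[i - window + 1 : i + 1]
    return out
```

A sanity check on the recursion: at d = 1 the weights collapse to [1, -1, 0, ...], so the output is the ordinary first difference; at 0 < d < 1 the weights decay slowly, which is what preserves memory.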

Labelling (preprocessing/labelling/)

  • Triple-barrier labelling: assigns +1/−1/0 labels based on which barrier (profit-target, stop-loss, or vertical/time barrier) is hit first
  • Configurable stop-loss, profit-target multipliers, and time horizons
  • Label binarisation strategies for handling the neutral class
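The triple-barrier logic reduces to a race between three barriers. A simplified sketch for a single long entry (the repo's version is vectorised and works on timestamps; the fractional `pt`/`sl` parametrisation here is an assumption):

```python
def triple_barrier_label(prices, entry, pt, sl, horizon):
    """Return +1 if the profit-target barrier is hit first, -1 if the
    stop-loss is hit first, and 0 if the vertical (time) barrier expires."""
    entry_price = prices[entry]
    upper = entry_price * (1 + pt)   # profit-target barrier
    lower = entry_price * (1 - sl)   # stop-loss barrier
    end = min(entry + horizon, len(prices) - 1)  # vertical barrier
    for i in range(entry + 1, end + 1):
        if prices[i] >= upper:
            return 1
        if prices[i] <= lower:
            return -1
    return 0
```

The neutral class (0) is what the binarisation strategies then deal with: drop it, fold it into one side, or model it as a third class.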

Sample Weights & Sampling (preprocessing/sampling/)

  • Uniqueness-based sample weights to correct for overlapping labels (another López de Prado concept)
  • Downsampling strategies for class imbalance
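The intuition behind uniqueness weights: when two triple-barrier labels span overlapping bars, they share information, so each should count for less. A sketch of average uniqueness over inclusive (start, end) label spans - an illustration of the concept, not the repo's implementation:

```python
import numpy as np

def average_uniqueness(spans, n_bars):
    """For each label spanning bars [s, e], weight = mean over its span of
    1 / (number of labels concurrently active at that bar)."""
    concurrency = np.zeros(n_bars)
    for s, e in spans:
        concurrency[s : e + 1] += 1
    return np.array([
        np.mean(1.0 / concurrency[s : e + 1]) for s, e in spans
    ])
```

Non-overlapping labels get weight 1.0; heavily overlapping labels get proportionally less, which keeps a bootstrap or a tree ensemble from over-counting the same market move.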

Mini-Series Extraction (preprocessing/mini_series/)

  • Rolling window extraction to produce fixed-length lookback sequences
  • Used as input to sequence models (TCN)
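The extraction itself is a plain sliding window that turns a 2-D feature matrix into a 3-D tensor of lookback sequences. A minimal sketch (function name assumed, not from the repo):

```python
import numpy as np

def extract_mini_series(features: np.ndarray, lookback: int) -> np.ndarray:
    """(n_samples, n_features) -> (n_windows, lookback, n_features),
    where window i covers rows [i, i + lookback)."""
    n_windows = features.shape[0] - lookback + 1
    return np.stack([features[i : i + lookback] for i in range(n_windows)])
```

Each window is then paired with the label at its final bar, so the model only ever sees information available at prediction time.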

Cross-Validation (modelling/cv/)

Time-series-aware CV that never leaks future data:

  • DateSplitter: split by date ranges
  • TimeSeriesCV: purged/embargoed k-fold CV
  • TrainValTestSplitter: proper three-way temporal split
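Purging and embargoing can be illustrated with a simplified index-based splitter: training samples too close to the test fold are dropped on both sides, so overlapping labels cannot leak across the boundary. The repo's TimeSeriesCV presumably operates on timestamps and label end-times; this sketch uses plain bar indices:

```python
def purged_kfold_splits(n_samples, n_splits, purge=0, embargo=0):
    """Yield (train_idx, test_idx) pairs. Train samples within `purge`
    bars before the test fold, or `embargo` bars after it, are dropped."""
    fold = n_samples // n_splits
    for k in range(n_splits):
        test_start = k * fold
        test_end = n_samples if k == n_splits - 1 else test_start + fold
        train = [i for i in range(n_samples)
                 if i < test_start - purge or i >= test_end + embargo]
        yield train, list(range(test_start, test_end))
```

Without the purge/embargo gaps, a label whose horizon straddles the fold boundary would appear (partially) in both train and test - one of the most common sources of inflated backtest accuracy.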

Deep Learning - TCN (modelling/dl/)

A Temporal Convolutional Network for multi-class directional prediction:

  • Dilated causal convolutions with residual connections
  • Multi-instrument CryptoDataset supporting simultaneous training on multiple pairs (BTC, ETH, LTC, XRP, BCH, ...)
  • Data augmentation (noise injection, time warping)
  • PyTorch Lightning training loop with early stopping, LR scheduling
  • Optuna hyperparameter search with parallelised trials and Comet ML experiment tracking
  • AdaBelief optimiser
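The key property of a TCN layer is causality: output at time t may only depend on inputs at t and earlier, with dilation stretching the receptive field exponentially across layers. The repo's model is built in PyTorch; this NumPy loop is only meant to make the causal indexing explicit:

```python
import numpy as np

def dilated_causal_conv(x, kernel, dilation):
    """1-D causal convolution: out[t] = sum_k kernel[k] * x[t - dilation*k],
    so no future value ever contributes to the output at time t."""
    out = np.zeros(len(x))
    for t in range(len(x)):
        for k, w in enumerate(kernel):
            idx = t - dilation * k
            if idx >= 0:  # taps before the series start are simply skipped
                out[t] += w * x[idx]
    return out
```

Stacking such layers with dilations 1, 2, 4, ... plus residual connections gives a long effective lookback without recurrence, which is why TCNs train faster and more stably than LSTMs on this kind of data.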

Reinforcement Learning - PPO (modelling/rl/)

A custom Proximal Policy Optimisation agent for portfolio allocation across multiple crypto assets:

  • Actor-Critic architecture with a Dirichlet distribution output head for continuous portfolio weights
  • Multi-head parallel environment for simultaneous simulation across instruments
  • Configurable reward shaping, transaction cost modelling
  • Parallel worker rollout collection
  • Based loosely on the PGPortfolio paper architecture with significant modifications
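Why a Dirichlet output head: samples from a Dirichlet distribution are non-negative and sum to one, so every action the policy draws is automatically a valid long-only portfolio allocation. A minimal sketch of the sampling step (softplus-to-concentration mapping is an assumption, not necessarily the repo's choice):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_portfolio(logits):
    """Map raw network outputs to Dirichlet concentrations alpha > 0,
    then sample a weight vector on the probability simplex."""
    alphas = np.log1p(np.exp(np.asarray(logits, dtype=float)))  # softplus
    return rng.dirichlet(alphas)
```

Larger concentrations make the policy more deterministic around its mean allocation; near-zero concentrations push samples toward the corners of the simplex (all-in on one asset), so the head's scale doubles as an exploration knob.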

Pipeline Orchestration

  • Snakemake workflows in pipelines/bars_features_labels/ for end-to-end reproducible runs
  • YAML-driven configuration: runners dynamically instantiate Python classes from config, allowing full pipeline variation without code changes (see runner_utils.py)
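The dynamic-instantiation pattern is simple to sketch: a config entry names a fully qualified class plus its keyword arguments, and the runner imports and constructs it. The config shape below is hypothetical - see runner_utils.py for the repo's actual convention:

```python
import importlib

def instantiate(config: dict):
    """Build an object from {"class": "module.ClassName", "params": {...}}.
    Illustrative only; the repo's config schema may differ."""
    module_name, class_name = config["class"].rsplit(".", 1)
    cls = getattr(importlib.import_module(module_name), class_name)
    return cls(**config.get("params", {}))
```

Because the class path lives in YAML, swapping a bar type, labeller, or model means editing config rather than code - the whole pipeline stays declarative.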

Test Coverage

Extensive pytest suite covering:

  • Bar construction (standard + imbalance)
  • All feature creation modules (frac diff, stockstats, time features)
  • Labelling and binarisation
  • Mini-series extraction and merging
  • Sample weights and downsampling
  • CV splitters
  • DL data pipeline and tensor construction
  • RL environment dynamics

Installation

pip install -e .

Or with the full conda environment (Python 3.7, includes MKL/GPU deps):

conda env create -f environment.yml
conda activate dagobert

Running the pipelines

After installation, CLI entry points are available:

dagobert-preprocessing   # bar creation, feature engineering, labelling
dagobert-tcn             # train TCN or launch Optuna study
dagobert-rl              # train PPO agent
dagobert-optuna          # Optuna study management
dagobert-s3              # S3 file utilities

Each runner takes a YAML config file. See config/ and config/custom/ for examples.

For the full Snakemake pipeline:

cd pipelines/bars_features_labels/
snakemake --cores N

Running tests

pytest

About

Building Scrooge McDuck's vault
