A reinforcement learning framework for cryptocurrency trading, supporting BTC spot and perpetual swap markets. The project implements three major RL algorithms (DQN, DRL, PPO) with a modular architecture built on Gymnasium.
- Multiple RL Algorithms: DQN (discrete actions), DRL (continuous policy gradient), PPO (proximal policy optimization)
- Gymnasium-Compatible Environment: Standard RL interface with customizable observation and action spaces
- Perpetual Swap Support: Long/short positions with funding rate handling
- Flexible Reward Functions: 7 reward calculation methods including Sharpe ratio, Sortino ratio, and Differential Sharpe Ratio (DSR)
- Modular Architecture: Clean separation between environment, agent, and training components
- Configuration-Driven: YAML configs with CLI override support
```bash
# Using uv
uv venv
uv pip install -r requirements.txt

# Development dependencies
uv pip install -r requirements-dev.txt
```

```bash
# Using pip
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install -r requirements.txt
pip install -r requirements-dev.txt
```

```bash
# Train with DQN (discrete actions)
python -m cli.train_agent --trainer DQNTrainer --episodes 100

# Train with DRL (continuous actions)
python -m cli.train_agent --trainer DRLTrainer --episodes 50 --batch-size 32

# Train with PPO (modern policy gradient)
python -m cli.train_agent --trainer PPOTrainer --episodes 100 --num-steps 256
```

```bash
# Train from config
python -m cli.train_agent --config configs/training_config.yaml

# Override config with CLI args
python -m cli.train_agent --config configs/training_config.yaml \
    --trainer DRLTrainer --learning-rate 1e-3
```

```
src/
├── cli/                  # Command-line training entry point
├── configs/              # YAML configuration files
├── data/                 # Data loading and preprocessing
│   ├── loader.py         # CSV data loading
│   ├── preprocessing.py  # Feature engineering
│   └── downloaders/      # Binance API integration
├── envs/                 # Trading environment
│   ├── trading_env.py    # BTCMarketEnv (Gymnasium)
│   └── rewards.py        # Reward functions
├── ml/                   # Neural network models
│   ├── models.py         # TensorFlow MLP/LSTM
│   └── models_torch.py   # PyTorch ActorCritic
├── rl/                   # Reinforcement learning
│   ├── agent.py          # TensorFlow TraderAgent
│   ├── agent_torch.py    # PyTorch PPOAgent
│   ├── trainer.py        # Base trainer class
│   ├── algorithms/       # DQN, DRL, PPO implementations
│   ├── buffers/          # Experience/rollout buffers
│   └── exploration/      # Epsilon-greedy strategy
├── trading/              # Trade execution
└── utils/                # Logging, visualization, config loading
```
**DQN** - Discrete action space with 4 actions: Hold, Buy 50%, Buy 100%, Sell.
- Experience replay buffer
- Epsilon-greedy exploration with decay
- Batch training with MSE loss
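A minimal sketch of how the DQN pieces above fit together; names like `ReplayBuffer` and `select_action` are illustrative, not this repo's actual classes:

```python
import random
from collections import deque

import numpy as np


class ReplayBuffer:
    """Fixed-size store of (state, action, reward, next_state, done) tuples."""

    def __init__(self, capacity: int = 10_000):
        self.buffer = deque(maxlen=capacity)  # oldest entries evicted first

    def add(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size: int):
        # Uniform random minibatch for the MSE training step
        return random.sample(list(self.buffer), batch_size)


def select_action(q_values: np.ndarray, epsilon: float) -> int:
    """Epsilon-greedy: random action with probability epsilon, else argmax Q."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return int(np.argmax(q_values))


# Multiplicative decay per episode, matching the `epsilon` / `epsilon_decay`
# parameters from the YAML config; the floor value is an assumption.
epsilon, epsilon_decay, epsilon_min = 0.5, 0.75, 0.01
for episode in range(5):
    epsilon = max(epsilon * epsilon_decay, epsilon_min)
```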
**DRL** - Continuous action space for position allocation [-1, 1] or [0, 1].
- Policy gradient with Differential Sharpe Ratio
- Adaptive risk aversion coefficient
- Step-by-step buffer updates
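The Differential Sharpe Ratio can be computed online from exponential moving averages of the first and second moments of returns, following Moody & Saffell. The sketch below shows the idea only; the repo's `rewards.py` (including its risk-aversion coefficient) may differ:

```python
class DifferentialSharpe:
    """Online Differential Sharpe Ratio via EMAs of return moments."""

    def __init__(self, eta: float = 0.01):
        self.eta = eta  # EMA adaptation rate
        self.A = 0.0    # EMA of returns (first moment)
        self.B = 0.0    # EMA of squared returns (second moment)

    def step(self, r: float) -> float:
        dA = r - self.A
        dB = r * r - self.B
        denom = (self.B - self.A ** 2) ** 1.5
        # Guard against zero variance on the first steps
        dsr = 0.0 if denom <= 1e-12 else (self.B * dA - 0.5 * self.A * dB) / denom
        # Update moment estimates only after computing the reward
        self.A += self.eta * dA
        self.B += self.eta * dB
        return dsr
```

Each `step` returns the marginal change in the Sharpe ratio caused by the newest return, which gives the agent a dense per-step reward instead of an end-of-episode one.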
**PPO** - Modern Actor-Critic architecture with PyTorch.
- Clipped surrogate objective
- Generalized Advantage Estimation (GAE)
- Entropy bonus for exploration
- Gaussian policy for continuous actions
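The two core PPO ingredients listed above, GAE and the clipped surrogate objective, can be sketched in NumPy (illustrative helpers, not the repo's `PPOAgent`):

```python
import numpy as np


def gae(rewards, values, dones, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one rollout.

    `values` carries one extra bootstrap entry for the state after the
    last step; `dones[t]` masks the bootstrap across episode boundaries.
    """
    advantages = np.zeros_like(rewards)
    last = 0.0
    for t in reversed(range(len(rewards))):
        nonterminal = 1.0 - dones[t]
        delta = rewards[t] + gamma * values[t + 1] * nonterminal - values[t]
        last = delta + gamma * lam * nonterminal * last
        advantages[t] = last
    return advantages


def clipped_surrogate(logp_new, logp_old, adv, clip_coef=0.2):
    """PPO policy loss: negative of the clipped surrogate objective."""
    ratio = np.exp(logp_new - logp_old)
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - clip_coef, 1.0 + clip_coef) * adv
    return -np.minimum(unclipped, clipped).mean()
```

The clipping keeps each update close to the rollout policy, which is what makes PPO stable enough to reuse the same rollout for several gradient steps.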
The `BTCMarketEnv` is a Gymnasium-compatible trading environment.

State tensor shape: `(features, window_size)`, e.g. `(8, 20)`.
| Feature | Description | Normalization |
|---|---|---|
| Close change | Price movement | Sigmoid |
| MACD Histogram | Momentum indicator | Sigmoid |
| EMA 50 change | Trend indicator | Sigmoid |
| Wallet change | Portfolio performance | Sigmoid |
| Volume | Trading volume | MinMax |
| Open price | Opening price | MinMax |
| RSI 14 | Relative strength | MinMax |
| MACD | Moving average convergence | MinMax |
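The two normalization schemes in the table can be sketched as follows; the exact scaling used by the repo's preprocessing may differ:

```python
import numpy as np


def sigmoid_norm(x: np.ndarray, scale: float = 1.0) -> np.ndarray:
    """Squash unbounded deltas (price change, MACD histogram) into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-scale * x))


def minmax_norm(x: np.ndarray) -> np.ndarray:
    """Rescale a bounded series (volume, RSI) to [0, 1] over the window."""
    lo, hi = x.min(), x.max()
    return np.zeros_like(x) if hi == lo else (x - lo) / (hi - lo)
```

Sigmoid suits features centered on zero whose occasional extremes should saturate; min-max suits features whose range within the observation window is meaningful.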
- DQN: `Discrete(4)` - Hold, Buy 50%, Buy 100%, Sell
- DRL/PPO: `Box(-1, 1)` - Position allocation from full short to full long
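The two action interfaces could be mapped to trades roughly like this; the mapping table and helper below are hypothetical, not the repo's actual API:

```python
# Hypothetical discrete action table: index -> (intent, fraction of wallet)
DQN_ACTIONS = {
    0: ("hold", 0.0),
    1: ("buy", 0.5),
    2: ("buy", 1.0),
    3: ("sell", 1.0),
}


def continuous_to_position(a: float) -> float:
    """Clip a raw policy output into the Box(-1, 1) allocation:
    -1 = full short, 0 = flat, +1 = full long."""
    return max(-1.0, min(1.0, a))
```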
- `reward_differential_sharpe_ratio`: EMA-tracked Sharpe with risk aversion (default for DRL)
- `reward_sharpe_ratio`: Risk-adjusted returns
- `reward_sortino_ratio`: Downside-adjusted returns
- `reward_profit`: Simple profit-based reward
- `reward_sterling_ratio`: Adjusted for maximum drawdown
- `compute_reward_from_tutor`: Basic yield reward
- `reward_price_rate_log`: Logarithmic price changes
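For intuition, the Sharpe and Sortino variants differ only in the volatility term they divide by; a minimal sketch (the repo's per-step scaling and epsilon handling may differ):

```python
import numpy as np


def sharpe(returns: np.ndarray, eps: float = 1e-9) -> float:
    """Mean return divided by total volatility."""
    return float(returns.mean() / (returns.std() + eps))


def sortino(returns: np.ndarray, eps: float = 1e-9) -> float:
    """Mean return divided by downside volatility only,
    so upside swings are not penalized."""
    downside = np.minimum(returns, 0.0)
    return float(returns.mean() / (np.sqrt((downside ** 2).mean()) + eps))
```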
Configuration is managed via YAML files in `configs/`:

```yaml
environment:
  observation_space: [8, 20]
  start_money: 10000
  trading_fee: 0.001

agent:
  action_domain: [0.0, 1.0]
  epsilon: 0.5
  epsilon_decay: 0.75

training:
  trainer: "DQNTrainer"
  episodes: 50
  batch_size: 16
  learning_rate: 1.0e-7

reward:
  function: "reward_differential_sharpe_ratio"
```

| Parameter | Description | Default |
|---|---|---|
| `--trainer` | Algorithm: `DQNTrainer`, `DRLTrainer`, `PPOTrainer` | `DQNTrainer` |
| `--episodes` | Number of training episodes | 50 |
| `--batch-size` | Batch size for training | 16 |
| `--learning-rate` | Learning rate | 1e-7 |
| `--num-steps` | Steps per rollout (PPO) | 128 |
| `--clip-coef` | PPO clipping coefficient | 0.2 |
| `--gae-lambda` | GAE lambda parameter | 0.95 |
| `--gpu-memory` | GPU memory limit (MB) | None |
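One common way to implement "CLI overrides config" is to feed the parsed YAML values in as argparse defaults, so any flag actually given on the command line wins. A sketch with a plain dict standing in for the parsed file; the merge logic is an assumption, not the repo's actual loader:

```python
import argparse

# Stand-in for values parsed from configs/training_config.yaml
config = {"trainer": "DQNTrainer", "episodes": 50, "learning_rate": 1e-7}

parser = argparse.ArgumentParser()
parser.add_argument("--trainer", default=config["trainer"])
parser.add_argument("--episodes", type=int, default=config["episodes"])
parser.add_argument("--learning-rate", type=float, default=config["learning_rate"])

# Simulate: python -m cli.train_agent --config ... --trainer DRLTrainer --learning-rate 1e-3
args = parser.parse_args(["--trainer", "DRLTrainer", "--learning-rate", "1e-3"])
```

Flags not given on the command line (here `--episodes`) keep their config-file values.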
Place CSV market data files in the `datasets/` directory. Expected columns:

```
date, open, high, low, close, Volume, histogram, 50ema, rsi14, macd
```

For perpetual swaps, additional columns are required:

```
Close_BTC, Funding_Rate
```
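A small sketch of checking an input CSV's header against the expected columns; the `validate_columns` helper is illustrative, not part of the repo:

```python
import csv
import io

REQUIRED = ["date", "open", "high", "low", "close", "Volume",
            "histogram", "50ema", "rsi14", "macd"]
PERP_EXTRA = ["Close_BTC", "Funding_Rate"]


def validate_columns(csv_text: str, perpetual: bool = False) -> list[str]:
    """Return the expected columns missing from the CSV header row."""
    header = next(csv.reader(io.StringIO(csv_text)))
    required = REQUIRED + (PERP_EXTRA if perpetual else [])
    return [c for c in required if c not in header]
```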
Training data for evaluation: Download Link
Extract all training folders into a single directory and set the `base_folder` variable in the evaluation notebook.
```bash
# Run all tests
python -m pytest

# Run a specific test file
python -m pytest tests/unit/envs/test_trader_env.py -v

# Run a specific test
python -m pytest tests/unit/envs/test_trader_env.py::TestEnv::test_handle_long_position -v
```

```bash
# Format code
black src/ tests/

# Lint
ruff check src/ tests/

# Type checking
mypy src/
```

Training outputs are saved to `logs/`:

```
logs/
└── {algorithm}_trial_0/
    ├── model_checkpoint.h5   # TensorFlow model
    ├── model_checkpoint.pt   # PyTorch model (PPO)
    ├── params.json           # Training parameters
    ├── training_log.csv      # Episode metrics
    └── episode_{N}.csv       # Per-episode logs
```
Detailed documentation is available in the `docs/` directory:

- `TRAINING_ARCHITECTURE.md`: Comprehensive training system specification
- `BTCMarket_Env_spec.md`: Environment specification
- `PPO_TRAINING.md`: PPO algorithm details
- Python 3.10+
- PyTorch 2.0+ (for PPO)
- TensorFlow 2.10+ (for DQN/DRL)
- Gymnasium 0.26+
- NumPy, Pandas, scikit-learn
| Reward Function | Training ID |
|---|---|
| reward_sharpe_ratio | 20230429_200721 |
| reward_differential_sharpe_ratio | 20230424_070731 |
| compute_reward_from_tutor | 20230429_200559 |
| reward_profit (v0) | 20230429_110440 |
| reward_profit (v1) | 20230423_174023 |
| reward_profit (v2) | 20230420_083508 |
| reward_profit (v3) | 20230420_195053 |
| Reward Function | Training ID |
|---|---|
| reward_sharpe_ratio | 20230427_165557 |
| reward_differential_sharpe_ratio (v0) | 20230427_083418 |
| reward_differential_sharpe_ratio (v1) | 20230423_114422 |
| compute_reward_from_tutor (v0) | 20230427_083632 |
| compute_reward_from_tutor (v1) | 20230421_181519 |
| reward_profit (v0) | 20230425_145617 |
| reward_profit (v1) | 20230423_231505 |
| reward_profit (v2) | 20230422_122001 |
| reward_profit (v3) | 20230421_001115 |
MIT License