A reinforcement learning application to Liar's Dice.
This project explores the application of deep reinforcement learning to Liar’s Dice, an imperfect-information, turn-based bluffing game.
This document presents the game mechanics, key reinforcement learning concepts, and their application to this setting.
The demo notebook shows how to use the code to simulate games and train an agent against multiple fixed policies.
Key result: the trained DDQN (Double Deep Q-Network) agent achieves a ~79% first-place rate against two simple rule-based opponents after 4000 training games. (See Result analysis for plots and metrics.)
- Custom Liar’s Dice game environment in Python.
- Fixed baseline opponents.
- Visual tools for data analysis.
- Reinforcement learning modeling (state format, episode definition, reward definition).
- DDQN with action masking algorithm implementation in PyTorch.
- Reinforcement learning application code in an interactive notebook.
This project uses conda for environment management.
git clone https://github.com/Aff54/Reinforcement-learning-applied-to-Liar-s-dice.git
cd Reinforcement-learning-applied-to-Liar-s-dice
conda env create -f requirements.yml
conda activate liars_dice_rl
Reinforcement learning theory
2.1 Core idea
2.2 Q-learning
2.3 DQN algorithm
2.4 DDQN update
2.5 Masking
Training an agent with reinforcement learning
3.1 Environment setup
3.2 RL framework application
3.3 RL Training loop
Result analysis
4.1 Agent performance
4.2 3rd place ratio discussion
4.3 Conclusion
Liar’s Dice is a turn-based bluffing game with imperfect information.
Each player starts with 5 dice. At the beginning of each round, all players roll their dice privately, then take turns making a bet about the number of dice showing a certain value among all dice in play.
On their turn, a player must either outbid the previous bet or challenge it.
A bet is written as [q, v] and means: there are at least q dice showing face v across all dice in play.
Ones are wild: when v ≠ 1, dice showing 1 also count toward the current bet's face.
Given the previous bet [q, v], a legal outbid must satisfy one of the following conditions:
- Increase the face value while keeping the same quantity: [q, v′] with v′ > v,
- Increase the quantity with any face: [q′, v′] with q′ > q and v′ in {1, ..., 6}.

Additional constraints apply when switching between wilds and non-wilds:
- From wilds to non-wilds ([q, 1] → [q′, v′ ≠ 1]): the new quantity must satisfy q′ ≥ 2q + 1,
- From non-wilds to wilds ([q, v ≠ 1] → [q′, 1]): the quantity may decrease, but must satisfy q′ ≥ ceil(q / 2) (this is the only way to reduce the quantity in a bet).
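These outbid rules can be captured in a short legality check. The sketch below uses a hypothetical helper name and is not the repository's implementation:

```python
import math

def is_legal_outbid(prev_bet, new_bet):
    """Check whether new_bet legally outbids prev_bet (illustrative sketch)."""
    q, v = prev_bet
    q2, v2 = new_bet
    if v == 1 and v2 != 1:
        # wilds -> non-wilds: quantity must at least double plus one
        return q2 >= 2 * q + 1
    if v != 1 and v2 == 1:
        # non-wilds -> wilds: the only way to reduce the quantity
        return q2 >= math.ceil(q / 2)
    # same regime: raise the face at equal quantity, or raise the quantity
    return (q2 == q and v2 > v) or q2 > q
```

For instance, outbidding [2, 3] with [2, 4] (higher face) or [3, 2] (higher quantity) is legal, while [2, 2] is not.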
Instead of outbidding, a player may challenge the previous bet by calling either "liar" or "exact". In this case, all dice are revealed, the previous bet's validity is checked, and the round ends:
- "Liar": if the true count of dice showing v (and/or 1) is ≥ q, the challenger loses one die; otherwise the previous player loses one die.
- "Exact": if the true count equals q exactly, the challenger gains one die back; otherwise the challenger loses one die.
Many Liar’s Dice variants exist; only the rules described above are implemented in this project.
Reinforcement learning (RL) is a branch of machine learning. It is a framework for training agents to make decisions in an environment by interacting with it.
An RL problem is commonly modeled as a Markov Decision Process (MDP) defined by the tuple $(\mathcal{S}, \mathcal{A}, \mathbb{P}, r, \gamma)$, where:
- $\mathcal{S}$ is the state space and $\mathcal{A}$ the action space,
- $\mathbb{P}(s' \mid s, a)$ is the probability of transitioning from state $s$ to $s'$ after taking action $a$,
- $r(s,a,s')$ is the reward for taking action $a$ in state $s$ and transitioning to $s'$,
- $\gamma \in (0,1)$ is the discount factor.
At time $t$, the agent observes state $S_t$, selects an action $A_t$, receives a reward $R_{t+1}$ and transitions to a new state $S_{t+1}$. The discounted return is $G_t = \sum_{k \geq 0} \gamma^k R_{t+k+1}$.

The action-value function (Q-function) under a policy $\pi$ is defined as $Q^{\pi}(s,a) = \mathbb{E}_{\pi}\left[ G_t \mid S_t = s, A_t = a \right]$.

The goal of reinforcement learning is to find an optimal policy $\pi^*$ that maximizes the expected return.

The optimal Q-function satisfies the Bellman optimality equation:

$$Q^*(s,a) = \mathbb{E}_{s'}\left[ r(s,a,s') + \gamma \max_{a'} Q^*(s',a') \right]$$
Q-learning is a reinforcement learning algorithm designed to learn an optimal policy $\pi^*$ by iteratively approximating the optimal action-value function $Q^*$.

The main idea is to initialize the Q-function values $Q(s,a)$ arbitrarily for every state-action pair.
The agent then interacts with the environment. At each interaction step, it:
- observes the current state $s$,
- selects an action $a$, typically using an exploration strategy,
- receives a reward $r$ and transitions to a new state $s'$,
- updates the corresponding Q-value using the Bellman optimality target.
The Q-learning update rule is:

$$Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right]$$

where $\alpha \in (0,1]$ is the learning rate.

By repeatedly applying this update, the Q-function converges toward the optimal Q-function, and an optimal policy can be derived by acting greedily: $\pi^*(s) = \arg\max_a Q(s,a)$.
The exploration strategy used in this project is $\varepsilon$-greedy:
- at time $t$ and state $s$, the agent chooses a random action $A_t \sim \mathcal{U}(\mathcal{A})$ with probability $\varepsilon_t$,
- otherwise, it selects the greedy action $A_t = \arg\max_a Q(s,a)$ with probability $(1 - \varepsilon_t)$.
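The two branches above can be sketched in a few lines of Python (function name and signature are illustrative, not the project's API):

```python
import random

def epsilon_greedy(q_values, epsilon, legal_actions):
    """With probability epsilon pick a random legal action, else the greedy one."""
    if random.random() < epsilon:
        return random.choice(legal_actions)
    # greedy: legal action with the highest Q-value
    return max(legal_actions, key=lambda a: q_values[a])
```

With `epsilon = 0` the choice is fully greedy; with `epsilon = 1` it is fully random among legal actions.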
The exploration rate $\varepsilon_t$ is decayed over the course of training, so that the agent explores early on and increasingly exploits its learned Q-values later.
A major limitation of tabular Q-learning is that it becomes impractical when the state or action space is large.
Deep Q-Networks (DQN) address this limitation by approximating the Q-function with a policy neural network $Q_{\theta}$ parameterized by weights $\theta$.
Rather than solving Bellman’s optimality equation directly, DQN minimizes the temporal-difference (TD) error between the current Q-value estimate and a target value.
The TD error is defined as:

$$\delta = Q_{\theta}(s,a) - \left( r + \gamma \max_{a'} Q_{\theta^-}(s',a') \right)$$

where $\theta^-$ are the weights of a target network, a slowly updated copy of the policy network that stabilizes training.

The Q-network is trained by minimizing the loss:

$$\mathcal{L}(\theta) = \mathbb{E}\left[ \mathcal{L}_{\text{Huber}}(\delta) \right]$$

on a set of transitions $(s, a, r, s')$ sampled from a replay buffer.
The Huber loss is used to make Q-network learning more robust against outliers (source).
The training loop follows these steps:
- the agent observes a state $s$ and selects an action $a$ using an exploration strategy (e.g. $\varepsilon$-greedy),
- it transitions to state $s'$, receives reward $r$, and stores the transition $(s,a,r,s')$ in the replay buffer,
- minibatches of transitions are sampled from the buffer to update $\theta$ via gradient descent,
- the target network weights $\theta^-$ are periodically updated with $\theta^- \leftarrow \tau \theta + (1 - \tau) \theta^-$.
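The soft target-network update in the last step can be sketched with plain Python lists standing in for network weights (illustrative only; the project applies it to PyTorch parameters):

```python
def soft_update(policy_weights, target_weights, tau):
    """theta_minus <- tau * theta + (1 - tau) * theta_minus, element-wise."""
    return [tau * p + (1 - tau) * t for p, t in zip(policy_weights, target_weights)]
```

A small `tau` makes the target network track the policy network slowly, which stabilizes the TD targets.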
During DQN training, the Q-network tends to overestimate state-action values, which can lead to a suboptimal policy and unstable training (source).

As a solution, double DQN (DDQN) changes the TD error to:

$$\delta = Q_{\theta}(s,a) - \left( r + \gamma \, Q_{\theta^-}\!\left(s', \arg\max_{a'} Q_{\theta}(s',a')\right) \right)$$

This way, the action in the target value is selected using the policy network, but its Q-value is still computed by the target network.
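A minimal sketch of the DDQN target computation, with Q-values given as plain lists (names are illustrative):

```python
def ddqn_target(reward, gamma, next_q_policy, next_q_target, legal_actions):
    """Double-DQN target for a single transition."""
    # Select the next action a' with the POLICY network's Q-values...
    best_a = max(legal_actions, key=lambda a: next_q_policy[a])
    # ...but evaluate it with the TARGET network's Q-values
    return reward + gamma * next_q_target[best_a]
```

Decoupling selection from evaluation is what reduces the overestimation bias of vanilla DQN.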
In the case of state-dependent actions, some actions may be illegal in a given state. Since the Q-network outputs a value for every action in $\mathcal{A}$, illegal actions must be prevented from being selected.

The core idea is to assign a very large negative Q-value to illegal actions, preventing the agent from ever selecting them.

In the case of DQN with an $\varepsilon$-greedy exploration strategy, greedy action selection proceeds as follows:
- the agent observes a state $s$,
- the Q-network computes the Q-value of every action (both legal and illegal),
- Q-values corresponding to illegal actions are set to an arbitrarily large negative value ($-10^9$ in this project),
- the agent chooses the action with the highest Q-value.
During exploration steps, random actions are sampled uniformly among legal actions only.
Additionally, in DDQN, masks are also applied during optimization steps: when selecting the next action with the policy network, illegal actions are masked before applying the $\arg\max$.
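Masked greedy selection can be sketched as follows (a hypothetical helper; the project stores legal action indices alongside each transition):

```python
def masked_argmax(q_values, legal_indices, neg=-1e9):
    """Argmax over Q-values after masking illegal actions with a large negative value."""
    masked = [neg] * len(q_values)
    for i in legal_indices:
        masked[i] = q_values[i]
    # illegal actions can never win the argmax
    return max(range(len(masked)), key=lambda i: masked[i])
```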
This project aims to train an agent to play Liar’s Dice against 2 to 4 opponents.
To train the agent and evaluate its performance, the opponents are modeled using fixed policies: these opponents do not learn.
Three baseline player types are implemented:
- **Survivalist policy**: at each turn, this agent selects the legal action (calling "liar", "exact", or outbidding) with the highest estimated probability of being true, given its current hand. In case of ties, the most aggressive action (highest quantity and/or value) is chosen.
- **Aggressive policy**: at each turn, it selects the action with the lowest probability of being true among those whose probability exceeds a predefined threshold (50% by default). This policy favors high quantity/value bets while maintaining a minimum survival probability.
- **Random policy**: at each turn, this agent selects a legal action at random.
For survivalist and aggressive policies, action probabilities are computed using a binomial distribution conditioned on the player’s hand.
To keep training times under an hour, the results presented in this repository were obtained using the following setup:
- the RL agent is trained against two fixed policies (survivalist and aggressive),
- each agent starts with two dice.
With this configuration, the full training process (warm-up + training loop) took approximately 30 minutes on average.
Increasing either the number of opponents or dice significantly increased simulation duration. For example, training against two fixed policies with five dice each required more than 3 hours.
Game variables were handled as follows:
- Bets are encoded as `[q, v]`, where `q` is the minimum number of dice showing value `v` across all dice in play,
- A player's hand of n dice is encoded as `[d1, d2, ..., dn]`, where `di` is the i-th die value (and is `0` if the corresponding die was lost),
- Player order is randomized at the beginning of each game.
The algorithm used for training the agent against two fixed policies is DDQN based on PyTorch's DQN tutorial.
The states the agent transitions between contain, in sequence:
- the number of dice in the previous player's hand (one value),
- the number of dice in the next player's hand (one value),
- the number of total dice currently in game (one value),
- previous player's bet (two values),
- the agent's hand (values up to the number of dice per player at the beginning of the game).
Both the last bet and the agent's hand are encoded as histograms, so states only contain quantities. Encoding them as [q, v] and [d1, d2, ..., dn] instead would mix quantitative and qualitative variables. Switching to histograms contributed to a significant improvement in the agent's performance.
Thus, the last bet is encoded as a six-value tuple (0, ..., 0, q, 0, ..., 0) whose only non-zero entry, at index v-1, has value q. Similarly, the agent's hand is encoded as (q1, q2, q3, q4, q5, q6) where qi is the number of dice showing value i.
For instance, if the player before the agent outbid [2, 3] with 2 dice in hand, the player after the agent has 1 die in hand, and the agent's hand is [4, 1] (2 dice), the corresponding state is (2, 1, 5, 0, 0, 2, 0, 0, 0, 1, 0, 0, 1, 0, 0).
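The encoding described above can be sketched as follows (the function name is hypothetical):

```python
def encode_state(prev_dice, next_dice, total_dice, last_bet, hand):
    """Build the 15-value state: counts, bet histogram, hand histogram."""
    q, v = last_bet
    bet_hist = [0] * 6
    bet_hist[v - 1] = q                                 # histogram of the last bet
    hand_hist = [hand.count(f) for f in range(1, 7)]    # histogram of the agent's hand
    return (prev_dice, next_dice, total_dice, *bet_hist, *hand_hist)
```

Applied to the example above, `encode_state(2, 1, 5, (2, 3), [4, 1])` reproduces the state (2, 1, 5, 0, 0, 2, 0, 0, 0, 1, 0, 0, 1, 0, 0).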
In the case of Liar's Dice, episodes were defined as rounds. Terminal states therefore correspond to the end of a round; they are encoded as None and require no action.
This way, a state-action Q-value accounts for every state and move the agent would witness and take until the end of the round.
Transitions stored in the replay buffer are composed of:
- the state witnessed by the agent,
- the selected action,
- the state it transitioned into,
- the corresponding reward,
- a legal action mask.

The corresponding format is: (s, a, s', r, mask)
Legal action masks are arrays of legal action indices in the Q-network output. In the case of terminal states, legal action masks are empty arrays.
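As a sketch, the transition format can be represented with a namedtuple (field names are illustrative):

```python
from collections import namedtuple

Transition = namedtuple("Transition", ("state", "action", "next_state", "reward", "mask"))

# A terminal transition: next_state is None and the legal action mask is empty
# (the state tuple is abbreviated here for readability)
terminal = Transition(state=(2, 1, 5), action=7, next_state=None, reward=1, mask=[])
```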
Transitions are recorded as follows:
- the agent observes a state `s` and selects an action `a` using an exploration strategy (e.g. $\varepsilon$-greedy),
- the players after the agent play,
- the agent observes a new state `s'` with a corresponding reward `r` and stores the transition `(s, a, s', r, mask)` in the replay buffer. `s'` is then used as the first state of the next transition.
Terminal states happen in three different scenarios:
- round ended because the agent challenged previous player,
- round ended because the agent outbid and was challenged by the player after,
- round ended because the agent outbid and another player challenged someone else.
The Q-network used in this project is the same as the one in PyTorch's DQN tutorial. It is a simple neural network with 3 fully connected layers and ReLU as the activation function.
| Neural network parameters | value |
|---|---|
| Input size (state length) | 15 |
| Hidden layer size | 64 |
| Output size (total number of possible actions) | 38 |
Every output unit corresponds to one action, so every possible action is mapped to an index in a Python dict before simulating games.
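The exact enumeration lives in the repository, but a mapping consistent with the 38 output units (2 challenge actions plus 6×6 bets, since at most six dice are in play) can be sketched as:

```python
# 2 challenge actions + 36 possible bets [q, v] with q, v in 1..6
# (the concrete ordering is an assumption, not the repo's actual dict)
ACTIONS = ["liar", "exact"] + [(q, v) for q in range(1, 7) for v in range(1, 7)]
action_to_index = {a: i for i, a in enumerate(ACTIONS)}
```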
Given the simplicity of the setup (three players with two dice each), this architecture was deemed a sufficient starting point, and it proved to be (see Result analysis). No architectural modifications were tested other than varying the hidden-layer size.
The algorithm used for training the agent is DDQN with masking.
Since the Q-network is trained on transitions sampled from the replay buffer, a warm-up phase is used to fill the buffer before training begins.
During this phase, the agent follows the exploration strategy with the exploration rate fixed at its maximum value, i.e. it selects legal actions uniformly at random.
Deep learning parameters were selected through experimentation. The following values are the ones used to produce the results presented in Result analysis.
| Parameter | Value |
|---|---|
| Batch size | |
| Learning rate | |
| Loss function | Huber loss |
| Metric function | Average |
| Weight decay factor | |
Weight decay was added because it improved the agent’s performance during offline training. However, it did not lead to a noticeable improvement during online training.
Reinforcement learning parameters were selected through experimentation. The following values are the ones used to produce the results presented in Result analysis.
| Parameter | Value |
|---|---|
| Warm-up duration (in games) | |
| Training duration (in games) | |
| Testing duration (in games) | |
| Discount factor ($\gamma$) | 1 |
| Soft update factor ($\tau$) | |
| Maximum exploration rate ($\varepsilon_{\max}$) | 1 |
| Minimum exploration rate ($\varepsilon_{\min}$) | 0.2 |
| Exploration decay rate | |
The exploration rate is managed by the epsilon_rate_scheduler class. It follows an exponential decay defined by:

$$\varepsilon_t = \varepsilon_{\min} + (\varepsilon_{\max} - \varepsilon_{\min})\, e^{-\lambda t}$$

where $\lambda$ is the decay rate. The choice of an exponential decay was arbitrary; it could be changed to a linear decay.

Since episode (round) duration is finite, the discount factor $\gamma$ could be set to 1 without making returns diverge.

The decay rate was chosen so that the exploration rate approaches its minimum value before the end of the training phase.
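A sketch of such an exponential decay schedule (the default decay value here is illustrative, not the project's setting):

```python
import math

def epsilon_schedule(t, eps_max=1.0, eps_min=0.2, decay=1e-3):
    """Exponential decay from eps_max toward eps_min as the step count t grows."""
    return eps_min + (eps_max - eps_min) * math.exp(-decay * t)
```

The schedule starts at `eps_max` for `t = 0` and asymptotically approaches `eps_min`, so exploration never drops below the floor.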
Results presented in Result analysis used the following reward system:

| Situation | Reward |
|---|---|
| Agent called liar and was right | 1 |
| Agent called liar or was called liar on, and lost a die | -2 |
| Agent called exact and earned a die back | 2 |
| Agent outbid without being challenged | 0 |
| Agent was challenged and the challenger lost a die | 1 |
| Agent outbid, the next player called exact and earned a die back | -0.5 |
| Game ended without the agent challenging or being challenged | 0 |
Rewards were chosen to encourage the agent to make other players lose a die (positive rewards) and to strongly discourage losing one of its own (reward of -2).
The following figure shows the ranking distribution (first/second/third) of the three players - survivalist (player 1), aggressive (player 2) and the RL agent (player 3) - across 5,000 evaluation games played after training.
The trained agent achieved first place in 79.4% of games with the parameters presented in Section 3.
The ranking distribution shows that the agent is substantially more likely to finish third than second. While this may appear counterintuitive at first, it can be explained by analyzing its performance in a 1-vs-1 setup.
Figure 2 below shows the agent's ranking distribution against one policy at a time (chosen randomly) over 5000 games. In these charts, "agent max proba" played against the survivalist policy, while "agent min proba" played against the aggressive policy.

In these 1-vs-1 matches the agent won 97.8% of games. This suggests that when the agent survives to a head-to-head situation, it is very likely to finish first.

Thus, the agent's performance in a three-player setup can be interpreted as follows: the agent either gets eliminated first (20.1% of games) or survives to reach a 1-vs-1 situation and wins almost every time.
This interpretation is supported by the pie charts below, which show the ranking distribution in a five-player setup (5000 games). In this configuration, players 1 and 2 used the survivalist policy, players 3 and 4 used the aggressive policy, and player 5 was trained with DDQN. As in the three-player setup, the agent's second-place ratio is substantially lower than every other place ratio.
The agent's greater difficulty with multiple players may be explained by the game being more chaotic in that setting: since players can only outbid, bets depend on all previous players' bets and policies.
This repository provides a framework for simulating Liar’s Dice with fixed opponents and for training agents using DDQN with action masking. With the configuration presented in this report (three players, two dice each), the trained agent reached first place in 79.4% of games.
The framework can scale to more players and more dice, but at the cost of significantly increased training times.
Multiple improvement directions were identified while working on this project:
- training loop computation time could be improved by speeding up fixed policy–related functions in action_management.py and deterministic_agents.py. These functions were written to highlight their logic but were not optimized for computational performance,
- episode definition could be extended from rounds to full games. Taking the end of a game as the only terminal state would encourage the agent to account for action consequences across multiple rounds,
- episode history could be taken into account by adding past bets made by all players during the current round (or game) to the state representation,
- policy identity could be incorporated by including player policies in the state and adding an embedding layer to the Q-network,
- self-play could be used for training the agent.
Some of these improvements may be added in a future update.
Date: 22/02/2026
Adding the `lru_cache` decorator to the fixed policies' functions (`agent_max_probability` and `agent_min_probability` in `deterministic_agents.py`) significantly reduced training duration: from 30 minutes on average down to 5 minutes on average.
- PyTorch's reinforcement learning tutorial: https://docs.pytorch.org/tutorials/intermediate/reinforcement_q_learning.html
- Methods for improving DQN: https://towardsdatascience.com/techniques-to-improve-the-performance-of-a-dqn-agent-29da8a7a0a7e/
- DDQN algorithm explanation: https://apxml.com/courses/intermediate-reinforcement-learning/chapter-3-dqn-improvements-variants/double-dqn-ddqn
Author: Alexandre Forestier Foray


