
Releases: clp-research/clemcore

3.3.5

16 Dec 16:23

What's Changed

  • Added PettingZoo/Gymnasium Support: Games become Environments for RL by @phisad in #246
  • recorder: centralize logging in GM; log players in InteractionsFileSaver
  • transcribe: re-add speaker info above chat bubbles
  • CLI: return error code 1 when exceptions happen during run
  • fix deepcopy of GameInstanceIterator
  • backends: improve huggingface local
  • backends: More graceful response handling for cohere by @mohiuddinshahrukh in #244

Full Changelog: 3.3.4...3.3.5

3.3.4

24 Nov 09:38

What's Changed

  • Removes ImageSaver, streamlines image state representations in transcripts, restructures envs, improves env docstrings by @atompaule in #220
  • Fix resource loading for transcripts

Full Changelog: 3.3.3...3.3.4

3.3.3

14 Nov 13:59

Full Changelog: 3.3.2...3.3.3

3.3.2

18 Aug 14:52

Minor release to update dependencies.

Main Dependencies Updated

  • aleph-alpha-client: 7.0.1 → 11.2.0
  • openai: 1.75.0 → 1.99.9
  • anthropic: 0.47.1 → 0.64.0
  • cohere: 4.48 → 5.17
  • google-generativeai: 0.8.4 → 0.8.5
  • mistralai: 1.8.0 → 1.9.6

Optional Dependencies Updated

vllm

  • transformers: 4.51.1 → 4.55.2 (to support gpt-oss models)

huggingface

  • bitsandbytes: 0.45.3 → 0.45.5
  • peft: 0.15.2 → 0.17.0
  • transformers: 4.51.1 → 4.55.2 (to support gpt-oss models)
  • timm: now bounded to >=1.0.15, <=1.0.19
  • unpinned protobuf
  • unpinned einops
  • unpinned sentencepiece

Full Changelog: 3.3.1...3.3.2

3.3.1

15 Aug 12:38

This patch substantially improves the framework's usability and functionality.

CLI:

  • The sequential runner is the default again (by default, clem run behaves as before)
  • Control over the batch size is back in the user's hands: the batchwise runner is now invoked when the -b or --batch_size argument is passed (and all models support batching). The automatic batch size estimation has been removed because it was too brittle and compute-intensive. Users should know which batch size is sufficient for their use case (care must be taken when two different models are involved); see the example below.
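
For example (a minimal sketch using only the flags mentioned in these notes; the model name is a placeholder and any game-selection arguments are omitted):

```sh
# Default: sequential run, as before
clem run -m <model-name>

# Batchwise run with an explicit batch size (all models must support batching)
clem run -m <model-name> -b 16
```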

Core:

  • The game instance iterator has been fully decoupled from the benchmark class, making it more versatile to use
  • The runners now call an on_game_step callback, which makes it easy to invoke additional logic after a game step
  • The latest changes to the contributed GameEnvironment have been merged (thanks to @paulutsch!) and are located in a new envs package (note, though, that the grid_environment will be relocated later)
  • The files callbacks are given the ResultsFolder on init, allowing for more versatile use
  • Added a RunFileSaver callback to store run-related values such as run durations

Backend:

  • The huggingface backend now supports most of the existing Chain-Of-Thought models
  • A lot of new models have been added to the model registry (thanks to @Gnurro !)
  • The slurk backend now uses the first 8 chars of the login tokens as model names (to distinguish users in the results folder)

Chore:

  • Improved Model.__repr__ for debugging
  • Turn load_from_spec into static method of GameBenchmark
  • Fix initial prompt handling for GameMaster

What's Changed

  • The large PR to merge the most recent EnvGameMaster and GameEnvironment changes into upstream by @paulutsch in #206
  • Model additions July/August 2025 by @Gnurro in #207
  • HF Backend special token preservation by @Gnurro in #208
  • Feat/batch size arg by @phisad in #209
  • HF backend CoT handling by @Gnurro in #210
  • Decouple GameInstanceIterator from GameBenchmark by @phisad in #213
  • Introduce envs module with GameEnvironment and EnvGameMaster by @phisad in #214
  • Introduce on_game_step() for callbacks by @phisad in #215
  • Fix for missing CoT start tags by @Gnurro in #212
  • Add openrouter entries and update recent model registry entries with missing values by @Gnurro in #211

Full Changelog: 3.3.0...3.3.1

3.3.0

25 Jul 11:28

This release mainly targets batch inference (at least with HF models), together with several architectural and structural cleanups.

⚠️ This patch might initially break compatibility with playpen (this has not been tested yet).

Batching

We added BatchGenerativeModel, which defines a generate_batch_response method (see the sketch below).
When all models loaded during a run implement this class, the run will be batchwise.
Of course, this also works with models in self-play mode.
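
As a rough illustration (a minimal sketch, not the actual clemcore code; everything except the BatchGenerativeModel and generate_batch_response names is an assumption):

```python
from typing import Dict, List

# A chat-style context: a list of {"role": ..., "content": ...} messages.
Messages = List[Dict[str, str]]


class BatchGenerativeModel:
    """Base for models that can answer many contexts in one call."""

    def generate_batch_response(self, contexts: List[Messages]) -> List[str]:
        raise NotImplementedError


class EchoModel(BatchGenerativeModel):
    """Hypothetical toy model: echoes the last message of each context."""

    def generate_batch_response(self, contexts: List[Messages]) -> List[str]:
        return [context[-1]["content"] for context in contexts]
```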

For this we introduced the new architectural concept of runners, specifically, a sequential and a batchwise runner.
This means that a GameBenchmark no longer has a run method; instead, the CLI calls a dispatching runner that decides the mode of processing.

When selected, the sequential runner works as before, that is, the models play the game instances one after the other.
You can force a sequential run with --sequential option in the clem run command.
All runners now use the GameInstanceIterator from the instances module.

How does the batchwise runner work?

In contrast, the batchwise runner first checks whether a batch_size is set in the model spec.
You can set one via unification, e.g., -m {"model_name": "Meta-Llama-3.1-8B-Instruct", "batch_size": 16}, or by placing a model_registry.json in your working directory.
If no batch_size can be found, the runner tries to estimate the batch size by prompting the model with the initial contexts of all available game instances.
If multiple models specify different batch sizes, the smallest one is used to avoid OOM errors.
In any case, the batch size is reduced if the number of available game sessions is smaller than the batch size.

This means that, before running the game benchmark, all game masters are set up at once (instead of sequentially, one after the other) and wrapped into game sessions.
From this pool of available game sessions, the runner samples up to batch_size observations and advances the game masters' state.
Importantly, the batchwise runner calls each game master once before iterating to the next player's turn. This might lead to smaller batch sizes at the end of a polling round, when only a few game masters have not been called yet.

How does this work with Player?

The __call__ method of Player has been split into two parts, perceive_context and perceive_response, which advance the player's state once before and once after the batch processing.
This means that the batchwise runner avoids calling the players directly and instead uses the newly introduced Player.batch_response method.
This method groups the players by their backend models and applies the contexts accordingly, as in the sketch below.
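
A minimal sketch of the grouping idea (not the actual clemcore implementation; all names except Player.batch_response, perceive_context, perceive_response, and generate_batch_response are assumptions):

```python
from collections import defaultdict

def batch_response(players, contexts):
    """Answer one context per player, batching players that share a backend model."""
    # Advance each player's state once before the batched call.
    for player, context in zip(players, contexts):
        player.perceive_context(context)

    # Group the players (and their contexts) by their backend model.
    groups = defaultdict(list)
    for player, context in zip(players, contexts):
        groups[player.model].append((player, context))

    # One batched call per backend model, then feed the responses back.
    for model, entries in groups.items():
        batch = [context for _, context in entries]
        responses = model.generate_batch_response(batch)
        for (player, _), response in zip(entries, responses):
            player.perceive_response(response)
```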

⚠️ Note:

  • For now, only CustomResponseModel and HuggingfaceLocalModel implement BatchGenerativeModel
  • When batching with decoder-only models, make sure that padding_side=left is set in the model spec's model_config. The backend will try to derive this value but will likely fail, because the necessary parameters are often not set in the Huggingface configs, where the default is often padding side "right". See the example below.
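
For example, a single-entry model_registry.json might look like this (only model_name, batch_size, model_config, and padding_side appear in these notes; the concrete values are illustrative):

```json
{
  "model_name": "Meta-Llama-3.1-8B-Instruct",
  "batch_size": 16,
  "model_config": {
    "padding_side": "left"
  }
}
```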

⚠️ The auto-estimator is experimental only -- try to set the batch_size directly:

  • The automatic batch size finder likely overestimates the batch size, because it only uses the initial contexts, while the conversation might grow considerably during a game session! The estimated batch_size will be set into the model_spec of the Model so that it does not need to be re-estimated for other game benchmarks.

GameMaster

  • introduce EnvLike with observe() and step() methods
  • add is_done() method
  • move initial_prompt handling from Player to DialogueGameMaster
  • remove store_records from GameMaster and GameRecorder

Callbacks

There is now a new callback mechanism that replaces the old GameRecorder.
This decouples the GameMaster from all file handling and storing activities and makes it easy to add new recording behaviors; see the sketch after the list below.

  • introduce events module
  • add GameEventSource inherited by Player and GameMaster to emit events
  • store records on_game_end in InteractionsFileSaver
  • rename DefaultGameRecorder to GameInteractionsRecorder
  • add GameEventLogger inherited by GameInteractionsRecorder
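
As a rough illustration of the direction (a hypothetical sketch; the actual base class and hook names in clemcore may differ, and only on_game_end appears in these notes):

```python
import time


class RunDurationTimer:
    """Hypothetical callback that measures how long a game episode takes."""

    def on_game_start(self, game_master):  # assumed hook name
        self._started = time.monotonic()

    def on_game_end(self, game_master):  # hook name taken from these notes
        duration = time.monotonic() - self._started
        print(f"Episode finished after {duration:.1f}s")
```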

Results

  • there is now a ResultsFolder class that mimics the results directory structure
  • move to_model_results_folder to files module
  • renamed the episode_X dirs to instance_XXXXX
  • removed the leading N_ prefix from experiment dirs
  • instance folder names are now zero-padded (5 digits)

⚠️ Note: The results for a particular instance are now always stored in a results folder numbered with the game instance id, and no longer in an episode folder with an arbitrary number!
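
For illustration, the new layout might look like this (the results root and all names are placeholders):

```text
results/
└── <model-name>/
    └── <game-name>/
        └── <experiment-name>/      # no leading N_ prefix anymore
            ├── instance_00001/     # was: episode_0 (arbitrary numbering)
            └── instance_00013/     # numbered by game instance id
```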

Backend

  • add augment_response_object wrapper for Model.generate_response methods (also works with batches)
  • ensure_messages_format now checks for list of lists (to be compatible with batches)
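
To illustrate the two shapes that ensure_messages_format now distinguishes (a sketch; the role/content message format is the usual chat format):

```python
# A single chat context: a list of messages.
messages = [
    {"role": "user", "content": "Describe the target word."},
]

# A batch of chat contexts: a list of lists of messages.
batch = [
    [{"role": "user", "content": "Describe the target word."}],
    [{"role": "user", "content": "Guess the word."}],
]
```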

General / Miscellaneous

  • replace to_player_model_infos() with Model.to_infos()
  • add sub-selector to load_from_spec
  • remove create_game_instance_iterator

What's Changed

  • Architecture Patch: Batch Inference, Callbacks, Runners and Results Folder by @phisad in #205

Full Changelog: 3.2.1...3.3.0

3.2.1

18 Jul 09:09

Benchmark

  • player_models.json is no longer stored; this information is now in experiment.json
  • add runtime logging for (all) benchmark/model loading and runs
  • add game lookup via CLEMBENCH_HOME when game_registry.json is missing and no games are found from cwd
  • set log level to debug for spec loading details

Master

  • add reset() method to Player and Model
  • DGM calls player.reset() at episode end

Backend

  • allow ensure_messages_format() to wrap methods with additional args
  • hf: avoid attention mask warning
  • hf: avoid temperature warnings
  • slurk: fix link to task_room_layout.js
  • slurk: by default, the history is cleared and a toast is shown to the player when an episode ends
  • openai: ignore max_tokens for reasoning models because it is not supported
  • openai: raise an error when the temperature is not greater than zero for reasoning models
  • openai_compatible: name with lookup in key file

Full Changelog: 3.2.0...3.2.1

3.2.0

11 Jul 13:10

What's Changed

Playpen

  • implement deepcopy for GameSpec to allow playpen branching env
  • rename process_turn() back to step() in favour of playpen compatibility and RL-focus
  • remove get_current_player() in favor of current_player property in legacy DGM

Legacy Clembench Games

  • introduce legacy module with 2.x-style DialogueGameMaster and Scorer
  • add toggle to disable count logging in recorder (could overwrite game logs)
  • introduce errors module and expose all error types
  • add key arg to ResponseError; improve docs for easier compare with legacy

NOTE: This release marks the end of life of the maintenance/2.x branch!

Full Changelog: 3.1.2...3.2.0

3.1.2

09 Jul 15:02

  • fix: transcribe was looking for dialogue_pair, which is now results_folder
  • support single-entry model_registry.json files
  • improve the slurk backend: provide the link in the console; by default no display area; by default an assistant-like chat
  • add script to establish default room layout
  • change temperature and max_tokens to properties of Model (similar to name)
  • fix: introduce name property to Model to make calls like m.name possible
  • keep track of player_models in scores.json
  • keep track of player_models in interactions.json
  • add support for games with more than 2 players:
      ◦ shift model expansion to the game master
      ◦ game_spec is now passed to game masters
      ◦ results folder naming is now based on the -m option arguments (1 model: name-t; 2 models: joined with --; >3 models: 'group-Np-hash')
      ◦ store a player_models.json sidecar
  • allow passing a task_selector to run only subsets of game instances
  • rename instances_name to instances_filename
  • add an option to return a dataframe after clemeval
  • set default: pretty json on store_file
  • instance generator now handles seed; marked generate as final

Full Changelog: 3.1.1...3.1.2

2.5.2

04 Jul 09:21

  • add standard error classes to master.py
  • fix: transcribe was looking for dialogue_pair, which is now results_folder
  • support single-entry model_registry.json files
  • improve the slurk backend: provide the link in the console; by default no display area but an assistant-like chat

Full Changelog: 2.5.1...2.5.2