Releases: clp-research/clemcore
3.3.5
What's Changed
- Added PettingZoo/Gymnasium Support: Games become Environments for RL by @phisad in #246
- recorder: centralize logging in GM; log players in InteractionsFileSaver
- transcribe: re-add speaker info above chat bubbles
- CLI: return error code 1 when exceptions happen during run
- fix deepcopy of GameInstanceIterator
- backends: improve huggingface local
- backends: More graceful response handling for cohere by @mohiuddinshahrukh in #244
New Contributors
- @mohiuddinshahrukh made their first contribution in #244
Full Changelog: 3.3.4...3.3.5
3.3.4
What's Changed
- Removes ImageSaver, streamlines image state representations in transcripts, restructures envs, improves env docstrings by @atompaule in #220
- Fix resource loading for transcripts
Full Changelog: 3.3.3...3.3.4
3.3.3
What's Changed
- Slurk: GPT-style chatarea by @zarathuuustra in #218
- Slurk: Disable the textarea after sending out the message by @zarathuuustra in #221
- Backend: Openrouter backend by @Gnurro in #219
- Backend: HF-local backend CoT bypass by @Gnurro in #228
- Transcripts: Create side-by-side flow only for 2-player games by @ansovald in #222
- Chore: fixed compilation error in the dependencies by @kranti-up in #240
- Chore: Documentation updates october 2025 by @Gnurro in #241
New Contributors
- @zarathuuustra made their first contribution in #218
Full Changelog: 3.3.2...3.3.3
3.3.2
Minor release to update dependencies.
Main Dependencies Updated
- aleph-alpha-client: 7.0.1 → 11.2.0
- openai: 1.75.0 → 1.99.9
- anthropic: 0.47.1 → 0.64.0
- cohere: 4.48 → 5.17
- google-generativeai: 0.8.4 → 0.8.5
- mistralai: 1.8.0 → 1.9.6
Optional Dependencies Updated
vllm
- transformers: 4.51.1 → 4.55.2 (to support gpt-oss models)
huggingface
- bitsandbytes: 0.45.3 → 0.45.5
- peft: 0.15.2 → 0.17.0
- transformers: 4.51.1 → 4.55.2 (to support gpt-oss models)
- timm: updated with upper bound >=1.0.15, <=1.0.19
- protobuf: unpinned
- einops: unpinned
- sentencepiece: unpinned
What's Changed
Full Changelog: 3.3.1...3.3.2
3.3.1
This patch release substantially improves the framework in terms of usability and functionality.
CLI:
- The sequential runner is now the default again (clem run behaves by default as before)
- Control over the batch size is back with the user: the batchwise runner is now called when the -b or --batch_size argument is passed (and all models support batching). The automatic batch size estimation has been removed because it is too brittle and compute intensive. The user should know which batch size is sufficient for their use case (care must be taken when two different models are involved).
Core:
- The game instance iterator has been fully decoupled from the benchmark class, making it more versatile to use
- The runners now call an on_game_step callback, allowing additional logic to be easily invoked after a game step (see the sketch after this list)
- The latest changes of the contributed GameEnvironment have been merged (thanks to @paulutsch!) and are located in a new envs package (note, though, that the grid_environment will be relocated later)
- The file callbacks are given the ResultsFolder on init, allowing for more versatile use
- Add RunFileSaver callback to store run-related values such as run durations
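To illustrate the new hook, here is a minimal sketch of a callback that counts game steps. This is not the actual clemcore API: the class below does not show the real base class, and the on_game_step signature is an assumption; only the existence of the on_game_step() hook comes from the notes above.

```python
# Minimal sketch of a step-counting callback (illustrative only).
# Assumption: the runner invokes on_game_step(...) on registered callbacks after each
# game step; the real base class and exact signature are not shown in these notes.
class StepCounter:
    def __init__(self):
        self.steps = 0

    def on_game_step(self, *args, **kwargs):
        # Called by the runner after a game step; arguments are intentionally generic here.
        self.steps += 1
```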
Backend:
- The huggingface backend now supports most of the existing Chain-Of-Thought models
- A lot of new models have been added to the model registry (thanks to @Gnurro !)
- The slurk backend now uses the first 8 chars of the login tokens as model names (to distinguish users in the results folder)
Chore:
- Improved Model.__repr__ for debugging
- Turn load_from_spec into a static method of GameBenchmark
- Fix initial prompt handling for GameMaster
What's Changed
- The large PR to merge the most recent EnvGameMaster and GameEnvironment changes into upstream by @paulutsch in #206
- Model additions July/August 2025 by @Gnurro in #207
- HF Backend special token preservation by @Gnurro in #208
- Feat/batch size arg by @phisad in #209
- HF backend CoT handling by @Gnurro in #210
- Decouple GameInstanceIterator from GameBenchmark by @phisad in #213
- Introduce envs module with GameEnvironment and EnvGameMaster by @phisad in #214
- Introduce on_game_step() for callbacks by @phisad in #215
- Fix for missing CoT start tags by @Gnurro in #212
- Add openrouter entries and update recent model registry entries with missing values by @Gnurro in #211
Full Changelog: 3.3.0...3.3.1
3.3.0
This release mainly targets batch inference (at least with HF models), together with several architectural and structural cleanups. Note that playpen has not been tested with these changes yet.
Batching
We added BatchGenerativeModel, which defines a generate_batch_response method (see the sketch below).
When all models loaded during a run implement this class, the run will be batchwise.
Of course, this also works with models in self-play mode.
For this we introduced the new architectural concept of runners, specifically, a sequential and a batchwise runner.
This means that a GameBenchmark no longer has a run method; instead, the CLI calls a dispatch runner to decide the mode of processing.
When selected, the sequential runner will work as before, that is, the models play each game instance one after the other.
You can force a sequential run with the --sequential option in the clem run command.
All runners now use the GameInstanceIterator from the instances module.
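To make the batching contract more concrete, here is a hedged sketch of what a batch-capable backend could look like. Only the class name BatchGenerativeModel and the method name generate_batch_response come from this release; the argument and return types are assumptions, chosen to mirror the list-of-lists message format that ensure_messages_format now accepts (see the Backend section below).

```python
# Hedged sketch of a batch-capable backend (not the actual clemcore implementation).
# Assumptions: generate_batch_response takes one chat-style message list per game
# session and returns one response per input, in order.
from typing import Dict, List


class EchoBatchModel:  # a real backend would subclass clemcore's BatchGenerativeModel
    def generate_batch_response(self, batch_of_messages: List[List[Dict]]) -> List[str]:
        # Process all sessions' contexts in one go, e.g. via a padded forward pass.
        return [self._generate_one(messages) for messages in batch_of_messages]

    def _generate_one(self, messages: List[Dict]) -> str:
        # Toy stand-in for actual generation: echo the last user message.
        return f"echo: {messages[-1]['content']}" if messages else ""
```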
How does the batchwise runner work?
In contrast, the batchwise runner will first check whether a batch_size is set in the model spec.
You can do so via unification, e.g., -m {"model_name": "Meta-Llama-3.1-8B-Instruct", "batch_size": 16} or by placing a model_registry.json in your working directory.
If no batch_size can be found, then the runner tries to estimate the batch size by prompting the model with the initial contexts of all available game instances.
If multiple models specify different batch sizes, the smallest one will be used to avoid OOM.
In any case, the batch size will be reduced if the number of available game sessions is smaller than the batch size.
This means that, before running the game benchmark, all game masters are set up at once (instead of sequentially one after the other) and wrapped into game sessions.
From this pool of available game sessions, the runner will sample up to batch_size observations and advance the game masters' state.
Importantly, the batchwise runner calls each game master once before iterating to the next player's turn. This might lead to smaller batch sizes at the end of a polling round, when only a few game masters have not been called yet.
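The following self-contained toy loop illustrates the round-based polling described above. All names in it are made up for illustration; only the behavior (set up all sessions upfront, sample up to batch_size, call each game master once per round, accept smaller batches at the end of a round) mirrors the description.

```python
# Toy illustration of batchwise polling (illustrative only, not the actual runner).
import random


class ToySession:
    """Stand-in for a game master wrapped into a game session."""

    def __init__(self, sid: int, turns: int):
        self.sid, self.turns_left = sid, turns

    def observe(self) -> str:
        return f"session {self.sid} context"  # current context for the model

    def advance(self, response: str) -> None:
        self.turns_left -= 1  # the game master consumes the response

    def is_done(self) -> bool:
        return self.turns_left <= 0


def toy_batch_generate(contexts):  # stand-in for generate_batch_response
    return [f"response to: {c}" for c in contexts]


def run_batchwise(sessions, batch_size: int) -> None:
    pending = list(sessions)
    while pending:
        not_yet_called = list(pending)  # each game master is called once per round
        while not_yet_called:
            batch = not_yet_called[:batch_size]  # may be smaller at the end of a round
            responses = toy_batch_generate([s.observe() for s in batch])
            for session, response in zip(batch, responses):
                session.advance(response)
            not_yet_called = not_yet_called[batch_size:]
        pending = [s for s in pending if not s.is_done()]


run_batchwise([ToySession(i, turns=random.randint(1, 3)) for i in range(10)], batch_size=4)
```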
How does this work with Player?
The __call__ method of Player has been split into two parts, perceive_context and perceive_response, which advance the player's state once before and once after the batch processing.
This means that the batchwise runner avoids calling the players directly, but instead makes use of the newly introduced Player.batch_response method.
This method groups the players by their backend models and applies the contexts accordingly.
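A minimal sketch of what this split means in practice, assuming a toy generate function; the method bodies here are illustrative, and only the names __call__, perceive_context and perceive_response come from the notes above.

```python
# Illustrative sketch of the Player split (not the actual clemcore Player class).
def toy_generate(prompt: str) -> str:  # stand-in for a single-item backend model call
    return f"echo: {prompt}"


class PlayerSketch:
    def perceive_context(self, context: str) -> str:
        # Advance the player's state *before* (batch) processing, e.g. remember the context.
        self.last_context = context
        return context

    def perceive_response(self, response: str) -> str:
        # Advance the player's state *after* (batch) processing, e.g. remember the response.
        self.last_response = response
        return response

    def __call__(self, context: str) -> str:
        # Sequential path: both halves wrap a single model call back to back.
        prompt = self.perceive_context(context)
        return self.perceive_response(toy_generate(prompt))
```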
- For now, only CustomResponseModel and HuggingfaceLocalModel implement BatchGenerativeModel
- When batching with decoder-only models, make sure that padding_side=left is set in the model spec's model_config. The backend will try to derive this value but likely fails, because the necessary parameters are often not set in the huggingface configs, where the default is often padding side "right".
- The automatic batch size finder likely overestimates the batch size, because it only uses the initial context, but the conversation might grow considerably during a game session. The estimated batch_size will be set into the model_spec of the Model so that it does not need to be re-estimated for other game benchmarks.
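For example, the relevant fields can be placed directly in the model spec. Hedged: the keys model_config, padding_side and batch_size are taken from the notes above; the surrounding layout is illustrative and the model name is just reused from the earlier example.

```python
# Illustrative model spec for batching with a decoder-only model.
model_spec = {
    "model_name": "Meta-Llama-3.1-8B-Instruct",  # example name reused from above
    "batch_size": 16,                            # skip the automatic estimation
    "model_config": {"padding_side": "left"},    # padding side needed for decoder-only batching
}
```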
GameMaster
- introduce EnvLike with observe() and step() methods (see the sketch after this list)
- add is_done() method
- move initial_prompt handling from Player to DialogueGameMaster
- remove store_records from GameMaster and GameRecorder
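A minimal sketch of the EnvLike-style interface mentioned in the first item above; only the method names observe(), step() and is_done() come from the notes, everything else (arguments, return values, the toy game) is an assumption.

```python
# Toy environment following the EnvLike idea (illustrative only).
class CountdownEnv:
    def __init__(self, turns: int = 3):
        self.turns_left = turns

    def observe(self) -> str:
        return f"{self.turns_left} turns left"  # context presented to the current player

    def step(self, response: str) -> None:
        self.turns_left -= 1  # apply the player's response to the game state

    def is_done(self) -> bool:
        return self.turns_left <= 0
```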
Callbacks
There is now a new callback mechanism that replaces the old GameRecorder.
This decouples the GameMaster from all file handling and storing activities and allows new recording behaviors to be added easily.
- introduce events module
- add GameEventSource, inherited by Player and GameMaster, to emit events
- store records on_game_end in InteractionsFileSaver (see the sketch after this list)
- rename DefaultGameRecorder to GameInteractionsRecorder
- add GameEventLogger, inherited by GameInteractionsRecorder
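A hedged sketch of a custom recording callback in the spirit of InteractionsFileSaver: only the on_game_end hook name is taken from the notes above; the base class, the record collection method, and the output format are assumptions.

```python
# Illustrative recording callback (not the actual InteractionsFileSaver).
import json
from pathlib import Path


class JsonDumpSaver:
    def __init__(self, results_dir: str):
        self.results_dir = Path(results_dir)
        self.records = []

    def collect(self, record: dict) -> None:
        # Hypothetical: accumulate events emitted during the game.
        self.records.append(record)

    def on_game_end(self) -> None:
        # Store everything once the game is over, as InteractionsFileSaver does with its records.
        self.results_dir.mkdir(parents=True, exist_ok=True)
        (self.results_dir / "records.json").write_text(json.dumps(self.records, indent=2))
```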
Results
- there is now a ResultsFolder class that mimics the results directory structure (illustrated below)
- move to_model_results_folder to files module
- renamed episode_X dir to instance_XXXXX dir
- removed the leading N_ prefix from experiment dir
- add zero padding to instance folder name (5 digits)
Backend
- add augment_response_object wrapper for Model.generate_response methods (also works with batches)
- ensure_messages_format now checks for list of lists (to be compatible with batches)
General / Miscellaneous
- replace to_player_model_infos() with Model.to_infos()
- add sub-selector to load_from_spec
- remove create_game_instance_iterator
What's Changed
Full Changelog: 3.2.1...3.3.0
3.2.1
Benchmark
- not storing player_models.json anymore, but this information is in experiment.json
- add runtime logging for (all) benchmark/model loading and runs
- add game lookup via CLEMBENCH_HOME when game_registry.json is missing and no games are found from cwd
- set log level to debug for spec loading details
Master
- add reset() method to Player and Model
- DGM calls player.reset() at episode end
Backend
- allow ensure_messages_format() to wrap methods with additional args
- hf: avoid attention mask warning
- hf: avoid temperature warnings
- slurk: fix link to task_room_layout.js
- slurk: by default, the history is cleared and a toast is shown to the player on episode end
- openai: ignore max_tokens for reasoning models because it is not supported
- openai: raise error when temperature is not greater than zero for reasoning models
- openai_compatible: name with lookup in key file
Full Changelog: 3.2.0...3.2.1
3.2.0
What's Changed
Playpen
- implement deepcopy for GameSpec to allow playpen branching env
- rename process_turn() back to step() in favour of playpen compatibility and RL focus
- remove get_current_player() in favor of the current_player property in legacy DGM
Legacy Clembench Games
- introduce legacy module with 2.x-style DialogueGameMaster and Scorer
- add toggle to disable count logging in recorder (could overwrite game logs)
- introduce errors module and expose all error types
- add key arg to ResponseError; improve docs for easier comparison with legacy
NOTE: This release marks the end of life of the maintenance/2.x branch!
Full Changelog: 3.1.2...3.2.0
3.1.2
- fix: transcribe looking for dialogue_pair, but it is results_folder now
- support single-entry model_registry.json files
- improve slurk backend: provide link in console, by default no display area, by default assistant-like chat
- add script to establish default room layout
- change temperature and max_tokens to properties of Model (similar to name)
- fix: introduce name property to Model to make calls like m.name possible
- keep track of player_models in scores.json
- keep track of player_models in interactions.json
- add support for games with more than 2 players:
  - shift model expansion to game master; game_spec is now passed to game masters; results folder naming is now based on -m option arguments (1 model: name-t; 2 models: join with --; 3+ models: 'group-Np-hash')
  - store player_models.json sidecar
- allow passing a task_selector to run only subsets of game instances
- rename instances_name to instances_filename
- add option to return a dataframe after clemeval
- set default: pretty json on store_file
- instance generator now handles seed; marked generate as final
Full Changelog: 3.1.1...3.1.2
2.5.2
- add standard error classes to master.py
- fix: transcribe looking for dialogue_pair, but it is results_folder now
- support single entry model_registry.json files
- improve slurk backend: provide link in console, by default no display area but assistant-like chat
Full Changelog: 2.5.1...2.5.2