
Releases: clp-research/clemcore

3.3.5

16 Dec 16:23

What's Changed

  • Added PettingZoo/Gymnasium Support: Games become Environments for RL by @phisad in #246
  • recorder: centralize logging in GM; log players in InteractionsFileSaver
  • transcribe: re-add speaker info above chat bubbles
  • CLI: return error code 1 when exceptions happen during run
  • fix deepcopy of GameInstanceIterator
  • backends: improve huggingface local
  • backends: More graceful response handling for cohere by @mohiuddinshahrukh in #244

Full Changelog: 3.3.4...3.3.5

3.3.4

24 Nov 09:38

What's Changed

  • Removes ImageSaver, streamlines image state representations in transcripts, restructures envs, improves env docstrings by @atompaule in #220
  • Fix resource loading for transcripts

Full Changelog: 3.3.3...3.3.4

3.3.3

14 Nov 13:59

Full Changelog: 3.3.2...3.3.3

3.3.2

18 Aug 14:52

Minor release to update dependencies.

Main Dependencies Updated

  • aleph-alpha-client: 7.0.1 → 11.2.0
  • openai: 1.75.0 → 1.99.9
  • anthropic: 0.47.1 → 0.64.0
  • cohere: 4.48 → 5.17
  • google-generativeai: 0.8.4 → 0.8.5
  • mistralai: 1.8.0 → 1.9.6

Optional Dependencies Updated

vllm

  • transformers: 4.51.1 → 4.55.2 (to support gpt-oss models)

huggingface

  • bitsandbytes: 0.45.3 → 0.45.5
  • peft: 0.15.2 → 0.17.0
  • transformers: 4.51.1 → 4.55.2 (to support gpt-oss models)
  • timm: now bounded to >=1.0.15, <=1.0.19
  • unpinned protobuf
  • unpinned einops
  • unpinned sentencepiece

Full Changelog: 3.3.1...3.3.2

3.3.1

15 Aug 12:38

This patch substantially improves the framework's usability and functionality.

CLI:

  • The sequential runner is the default again (by default, clem run behaves as before)
  • Control over the batch size is back in the user's hands: the batchwise runner is now invoked when the -b or --batch_size argument is passed (and all models support batching). The automatic batch size estimation has been removed because it was too brittle and compute-intensive. Users should know which batch size is sufficient for their use case (care must be taken when two different models are involved); see the example below.
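
For example (a minimal sketch using only the flags mentioned in these notes; the model name is a placeholder and any game-selection arguments are omitted):

```sh
# Default: sequential run, as before
clem run -m <model-name>

# Batchwise run with an explicit batch size (all models must support batching)
clem run -m <model-name> -b 16
```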

Core:

  • The game instance iterator has been fully decoupled from the benchmark class, making it more versatile to use
  • The runners now call an on_game_step callback, which makes it easy to invoke additional logic after a game step
  • The latest changes to the contributed GameEnvironment have been merged (thanks to @paulutsch!) and are located in a new envs package (note, though, that the grid_environment will be relocated later)
  • The files callbacks are given the ResultsFolder on init, allowing for more versatile use
  • Added a RunFileSaver callback to store run-related values such as run durations

Backend:

  • The huggingface backend now supports most of the existing Chain-Of-Thought models
  • A lot of new models have been added to the model registry (thanks to @Gnurro !)
  • The slurk backend now uses the first 8 chars of the login tokens as model names (to distinguish users in the results folder)

Chore:

  • Improved Model.__repr__ for debugging
  • Turn load_from_spec into static method of GameBenchmark
  • Fix initial prompt handling for GameMaster

What's Changed

  • The large PR to merge the most recent EnvGameMaster and GameEnvironment changes into upstream by @paulutsch in #206
  • Model additions July/August 2025 by @Gnurro in #207
  • HF Backend special token preservation by @Gnurro in #208
  • Feat/batch size arg by @phisad in #209
  • HF backend CoT handling by @Gnurro in #210
  • Decouple GameInstanceIterator from GameBenchmark by @phisad in #213
  • Introduce envs module with GameEnvironment and EnvGameMaster by @phisad in #214
  • Introduce on_game_step() for callbacks by @phisad in #215
  • Fix for missing CoT start tags by @Gnurro in #212
  • Add openrouter entries and update recent model registry entries with missing values by @Gnurro in #211

Full Changelog: 3.3.0...3.3.1

3.3.0

25 Jul 11:28

This release mainly targets batch inference (at least with HF models), together with several architectural and structural cleanups.

⚠️ This patch might initially break compatibility with playpen (this has not been tested yet).

Batching

We added BatchGenerativeModel, which defines a generate_batch_response method (see the sketch below).
When all models loaded during a run implement this class, the run will be batchwise.
Of course, this also works with models in self-play mode.
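
As a rough illustration (a minimal sketch, not the actual clemcore code; everything except the BatchGenerativeModel and generate_batch_response names is an assumption):

```python
from typing import Dict, List

# A chat-style context: a list of {"role": ..., "content": ...} messages.
Messages = List[Dict[str, str]]


class BatchGenerativeModel:
    """Base for models that can answer many contexts in one call."""

    def generate_batch_response(self, contexts: List[Messages]) -> List[str]:
        raise NotImplementedError


class EchoModel(BatchGenerativeModel):
    """Hypothetical toy model: echoes the last message of each context."""

    def generate_batch_response(self, contexts: List[Messages]) -> List[str]:
        return [context[-1]["content"] for context in contexts]
```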

For this we introduced the new architectural concept of runners, specifically, a sequential and a batchwise runner.
This means that a GameBenchmark no longer has a run method; instead, the CLI calls a dispatching runner that decides the mode of processing.

When selected, the sequential runner works as before, that is, the models play the game instances one after the other.
You can force a sequential run with --sequential option in the clem run command.
All runners now use the GameInstanceIterator from the instances module.

How does the batchwise runner work?

In contrast, the batchwise runner first checks whether a batch_size is set in the model spec.
You can set one via unification, e.g., -m {"model_name": "Meta-Llama-3.1-8B-Instruct", "batch_size": 16}, or by placing a model_registry.json in your working directory.
If no batch_size can be found, the runner tries to estimate the batch size by prompting the model with the initial contexts of all available game instances.
If multiple models specify different batch sizes, the smallest one is used to avoid OOM errors.
In any case, the batch size is reduced if the number of available game sessions is smaller than the batch size.

This means that, before running the game benchmark, all game masters are set up at once (instead of sequentially, one after the other) and wrapped into game sessions.
From this pool of available game sessions, the runner samples up to batch_size observations and advances the game masters' state.
Importantly, the batchwise runner calls each game master once before iterating to the next player's turn. This might lead to smaller batch sizes at the end of a polling round, when only a few game masters have not been called yet.

How does this work with Player?

The __call__ method of Player has been split into two parts, perceive_context and perceive_response, which advance the player's state once before and once after the batch processing.
This means that the batchwise runner avoids calling the players directly and instead uses the newly introduced Player.batch_response method.
This method groups the players by their backend models and applies the contexts accordingly, as in the sketch below.
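
A minimal sketch of the grouping idea (not the actual clemcore implementation; all names except Player.batch_response, perceive_context, perceive_response, and generate_batch_response are assumptions):

```python
from collections import defaultdict

def batch_response(players, contexts):
    """Answer one context per player, batching players that share a backend model."""
    # Advance each player's state once before the batched call.
    for player, context in zip(players, contexts):
        player.perceive_context(context)

    # Group the players (and their contexts) by their backend model.
    groups = defaultdict(list)
    for player, context in zip(players, contexts):
        groups[player.model].append((player, context))

    # One batched call per backend model, then feed the responses back.
    for model, entries in groups.items():
        batch = [context for _, context in entries]
        responses = model.generate_batch_response(batch)
        for (player, _), response in zip(entries, responses):
            player.perceive_response(response)
```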

⚠️ Note:

  • For now, only CustomResponseModel and HuggingfaceLocalModel implement BatchGenerativeModel
  • When batching with decoder-only models, make sure that padding_side=left is set in the model spec's model_config. The backend will try to derive this value but will likely fail, because the necessary parameters are often not set in the Huggingface configs, where the default is often padding side "right". See the example below.
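
For example, a single-entry model_registry.json might look like this (only model_name, batch_size, model_config, and padding_side appear in these notes; the concrete values are illustrative):

```json
{
  "model_name": "Meta-Llama-3.1-8B-Instruct",
  "batch_size": 16,
  "model_config": {
    "padding_side": "left"
  }
}
```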

⚠️ The auto-estimator is experimental only -- try to set the batch_size directly:

  • The automatic batch size finder likely overestimates the batch size, because it only uses the initial contexts, while the conversation might grow considerably during a game session! The estimated batch_size will be set into the model_spec of the Model so that it does not need to be re-estimated for other game benchmarks.

GameMaster

  • introduce EnvLike with observe() and step() methods
  • add is_done() method
  • move initial_prompt handling from Player to DialogueGameMaster
  • remove store_records from GameMaster and GameRecorder

Callbacks

There is now a new callback mechanism that replaces the old GameRecorder.
This decouples the GameMaster from all file handling and storing activities and makes it easy to add new recording behaviors; see the sketch after the list below.

  • introduce events module
  • add GameEventSource inherited by Player and GameMaster to emit events
  • store records on_game_end in InteractionsFileSaver
  • rename DefaultGameRecorder to GameInteractionsRecorder
  • add GameEventLogger inherited by GameInteractionsRecorder
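
As a rough illustration of the direction (a hypothetical sketch; the actual base class and hook names in clemcore may differ, and only on_game_end appears in these notes):

```python
import time


class RunDurationTimer:
    """Hypothetical callback that measures how long a game episode takes."""

    def on_game_start(self, game_master):  # assumed hook name
        self._started = time.monotonic()

    def on_game_end(self, game_master):  # hook name taken from these notes
        duration = time.monotonic() - self._started
        print(f"Episode finished after {duration:.1f}s")
```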

Results

  • there is now a ResultsFolder class that mimics the results directory structure
  • move to_model_results_folder to files module
  • renamed the episode_X dirs to instance_XXXXX
  • removed the leading N_ prefix from experiment dirs
  • instance folder names are now zero-padded (5 digits)

⚠️ Note: The results for a particular instance are now always stored in a results folder numbered with the game instance id, and no longer in an episode folder with an arbitrary number!
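
For illustration, the new layout might look like this (the results root and all names are placeholders):

```text
results/
└── <model-name>/
    └── <game-name>/
        └── <experiment-name>/      # no leading N_ prefix anymore
            ├── instance_00001/     # was: episode_0 (arbitrary numbering)
            └── instance_00013/     # numbered by game instance id
```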

Backend

  • add augment_response_object wrapper for Model.generate_response methods (also works with batches)
  • ensure_messages_format now checks for list of lists (to be compatible with batches)
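
To illustrate the two shapes that ensure_messages_format now distinguishes (a sketch; the role/content message format is the usual chat format):

```python
# A single chat context: a list of messages.
messages = [
    {"role": "user", "content": "Describe the target word."},
]

# A batch of chat contexts: a list of lists of messages.
batch = [
    [{"role": "user", "content": "Describe the target word."}],
    [{"role": "user", "content": "Guess the word."}],
]
```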

General / Miscellaneous

  • replace to_player_model_infos() with Model.to_infos()
  • add sub-selector to load_from_spec
  • remove create_game_instance_iterator

What's Changed

  • Architecture Patch: Batch Inference, Callbacks, Runners and Results Folder by @phisad in #205

Full Changelog: 3.2.1...3.3.0

3.2.1

18 Jul 09:09

Benchmark

  • player_models.json is no longer stored; this information is now in experiment.json
  • add runtime logging for (all) benchmark/model loading and runs
  • add game lookup via CLEMBENCH_HOME when game_registry.json is missing and no games are found from cwd
  • set log level to debug for spec loading details

Master

  • add reset() method to Player and Model
  • DGM calls player.reset() at episode end

Backend

  • allow ensure_messages_format() to wrap methods with additional args
  • hf: avoid attention mask warning
  • hf: avoid temperature warnings
  • slurk: fix link to task_room_layout.js
  • slurk: by default, the history is cleared and a toast is shown to the player when an episode ends
  • openai: ignore max_tokens for reasoning models because it is not supported
  • openai: raise an error when the temperature is not greater than zero for reasoning models
  • openai_compatible: name with lookup in key file

Full Changelog: 3.2.0...3.2.1

3.2.0

11 Jul 13:10

What's Changed

Playpen

  • implement deepcopy for GameSpec to allow playpen branching env
  • rename process_turn() back to step() in favour of playpen compatibility and RL-focus
  • remove get_current_player() in favor of current_player property in legacy DGM

Legacy Clembench Games

  • introduce legacy module with 2.x-style DialogueGameMaster and Scorer
  • add toggle to disable count logging in recorder (could overwrite game logs)
  • introduce errors module and expose all error types
  • add key arg to ResponseError; improve docs for easier compare with legacy

NOTE: This release marks the end of life of the maintenance/2.x branch!

Full Changelog: 3.1.2...3.2.0

3.1.2

09 Jul 15:02

  • fix: transcribe was looking for dialogue_pair, which is now results_folder
  • support single-entry model_registry.json files
  • improve the slurk backend: provide the link in the console; by default no display area; by default an assistant-like chat
  • add script to establish default room layout
  • change temperature and max_tokens to properties of Model (similar to name)
  • fix: introduce name property to Model to make calls like m.name possible
  • keep track of player_models in scores.json
  • keep track of player_models in interactions.json
  • add support for games with more than 2 players:
      ◦ shift model expansion to the game master
      ◦ game_spec is now passed to game masters
      ◦ results folder naming is now based on the -m option arguments (1 model: name-t; 2 models: joined with --; >3 models: 'group-Np-hash')
      ◦ store a player_models.json sidecar
  • allow passing a task_selector to run only subsets of game instances
  • rename instances_name to instances_filename
  • add an option to return a dataframe after clemeval
  • set default: pretty json on store_file
  • instance generator now handles seed; marked generate as final

Full Changelog: 3.1.1...3.1.2

2.5.2

04 Jul 09:21

  • add standard error classes to master.py
  • fix: transcribe was looking for dialogue_pair, which is now results_folder
  • support single-entry model_registry.json files
  • improve the slurk backend: provide the link in the console; by default no display area but an assistant-like chat

Full Changelog: 2.5.1...2.5.2