Ondrej Bohdal1, Mete Ozay1, Jijoong Moon2, Kyeng-Hun Lee2, Hyeonmok Ko2, Umberto Michieli1
1 Samsung R&D Institute UK, United Kingdom 2 Samsung Research, South Korea
EMNLP 2025
Adapter parameters provide a mechanism to modify the behavior of machine learning models and have gained significant popularity in the context of large language models (LLMs) and generative AI. These parameters can be merged to support multiple tasks via a process known as task merging. However, prior work on merging in LLMs, particularly in natural language processing, has been limited to scenarios where each test example addresses only a single task. In this paper, we focus on on-device settings and study the problem of text-based compositional multi-tasking, where each test example involves the simultaneous execution of multiple tasks. For instance, generating a translated summary of a long text requires solving both translation and summarization tasks concurrently. To facilitate research in this setting, we propose a benchmark comprising four practically relevant compositional tasks. We also present an efficient method (Learnable Calibration) tailored for on-device applications, where computational resources are limited, emphasizing the need for solutions that are both resource-efficient and high-performing. Our contributions lay the groundwork for advancing the capabilities of LLMs in real-world multi-tasking scenarios, expanding their applicability to complex, resource-constrained use cases.
We use the following libraries: torch, transformers, datasets, evaluate, accelerate, peft, rouge-score and vllm. They can be installed with:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install transformers
pip install datasets
pip install evaluate
pip install accelerate
pip install peft
pip install rouge-score
pip install vllm
We have tested the code with the library versions specified in the requirements.txt file.
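Alternatively, the pinned versions can be installed directly from that file (assuming it sits in the repository root):
pip install -r requirements.txt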
We also need to create the directories data, output, predictions, stored_models and logs in the repository root:
mkdir data output predictions stored_models logs
We can recreate the data for our compositional multi-tasking benchmark by running the prepare_compositional_task_data.sh script.
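For example (assuming the script is executed from the repository root):
bash prepare_compositional_task_data.sh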
The first step is to train a LoRA adapter for each individual task. Example:
CUDA_VISIBLE_DEVICES=0 python run_experiment.py --experiment_name dialogsum_lora_slm_v1 --model_name_or_path stabilityai/stablelm-2-1_6b-chat --dataset dialogsum --stored_model_name dialogsum_lora_slm_v1 --do_train --do_valid --do_predict
This needs to be repeated for each dataset and model. Note that the tone adjustment dataset is smaller, so --gradient_accumulation_steps 4 is used instead of the default value.
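For illustration, a tone adjustment training run could look as follows; the dataset and model identifiers (tone_adjustment, tone_adjustment_lora_slm_v1) are hypothetical placeholders and should be replaced with the names registered in the repository:
CUDA_VISIBLE_DEVICES=0 python run_experiment.py --experiment_name tone_adjustment_lora_slm_v1 --model_name_or_path stabilityai/stablelm-2-1_6b-chat --dataset tone_adjustment --stored_model_name tone_adjustment_lora_slm_v1 --gradient_accumulation_steps 4 --do_train --do_valid --do_predict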
The various methods can be evaluated on the compositional multi-tasking benchmark using commands such as the following:
Zero-shot:
CUDA_VISIBLE_DEVICES=0 python run_experiment.py --experiment_name sum_tr_es_zero_shot_slm_v1 --model_name_or_path stabilityai/stablelm-2-1_6b-chat --tgt_lang es --secondary_task_dataset dialogsum_es --dataset dialogsum --stored_model_name dialogsum_lora_slm_v1 --zero_shot --secondary_task translation --do_valid --do_predict
Main-task LoRA:
CUDA_VISIBLE_DEVICES=0 python run_experiment.py --experiment_name sum_tr_es_sum_lora_only_slm_v1 --model_name_or_path stabilityai/stablelm-2-1_6b-chat --tgt_lang es --secondary_task_dataset dialogsum_es --dataset dialogsum --stored_model_name dialogsum_lora_slm_v1 --secondary_task translation --do_valid --do_predict
Linear merge:
CUDA_VISIBLE_DEVICES=0 python run_experiment.py --experiment_name sum_tr_es_linear_merge_slm_v1 --model_name_or_path stabilityai/stablelm-2-1_6b-chat --tgt_lang es --secondary_task_dataset dialogsum_es --dataset dialogsum --lora_merge_strategy peft_linear --lora_merge_modules dialogsum_lora_slm_v1 tedtalks_es_lora_slm_v1 --stored_model_name dialogsum_lora_slm_v1 --secondary_task translation --do_valid --do_predict
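For intuition, the linear merge corresponds to a weighted combination of the two single-task LoRA adapters. Below is a minimal sketch of this idea using the standard peft API; the adapter paths, equal weights and shared rank are assumptions, and the repository itself performs the merge through --lora_merge_strategy peft_linear rather than this snippet:

```python
# Minimal sketch of linearly merging two single-task LoRA adapters with peft.
# Assumptions: both adapters are stored under stored_models/ and share the same LoRA rank.
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("stabilityai/stablelm-2-1_6b-chat")
model = PeftModel.from_pretrained(
    base, "stored_models/dialogsum_lora_slm_v1", adapter_name="summarization"
)
model.load_adapter("stored_models/tedtalks_es_lora_slm_v1", adapter_name="translation")

# Combine the two adapters with equal weights into a new, single adapter.
model.add_weighted_adapter(
    adapters=["summarization", "translation"],
    weights=[0.5, 0.5],
    adapter_name="merged",
    combination_type="linear",
)
model.set_adapter("merged")
```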
Multi-step LoRA usage:
CUDA_VISIBLE_DEVICES=0 python run_experiment.py --experiment_name sum_tr_es_multi_step_slm_v1 --model_name_or_path stabilityai/stablelm-2-1_6b-chat --tgt_lang es --secondary_task_dataset dialogsum_es --dataset dialogsum --stored_model_name dialogsum_lora_slm_v1 --secondary_stored_model_name tedtalks_es_lora_slm_v1 --secondary_task translation --do_valid --do_predict --multi_step_prompts --prompt_style empty
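Conceptually, the multi-step baseline applies the two single-task LoRAs one after another: first summarize the dialogue, then translate the resulting summary. A minimal sketch of this idea is shown below; the prompts, adapter paths and generation settings are illustrative assumptions, and the repository implements its own prompting via --multi_step_prompts:

```python
# Illustrative sketch of multi-step LoRA usage: summarize first, then translate the summary.
# The prompts, adapter paths and generation settings below are assumptions for illustration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

model_name = "stabilityai/stablelm-2-1_6b-chat"
tokenizer = AutoTokenizer.from_pretrained(model_name)
base = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")
model = PeftModel.from_pretrained(base, "stored_models/dialogsum_lora_slm_v1", adapter_name="summarization")
model.load_adapter("stored_models/tedtalks_es_lora_slm_v1", adapter_name="translation")

def generate(prompt):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=256)
    # Decode only the newly generated tokens, dropping the prompt.
    return tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

dialogue = "..."  # the input dialogue to process

# Step 1: summarize with the summarization LoRA active.
model.set_adapter("summarization")
summary = generate(f"Summarize the following dialogue:\n{dialogue}\nSummary:")

# Step 2: translate the summary with the translation LoRA active.
model.set_adapter("translation")
translated_summary = generate(f"Translate the following text to Spanish:\n{summary}\nTranslation:")
```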
Joint-expert LoRA:
CUDA_VISIBLE_DEVICES=0 python run_experiment.py --experiment_name sum_tr_es_joint_expert_lora_slm_v1 --model_name_or_path stabilityai/stablelm-2-1_6b-chat --tgt_lang es --secondary_task_dataset dialogsum_es --dataset dialogsum --stored_model_name sum_tr_es_joint_expert_lora_slm_v1 --secondary_task translation --multi_task_train_targets --do_train --do_valid --do_predict
Learnable Calibration++:
CUDA_VISIBLE_DEVICES=0 python run_experiment.py --experiment_name sum_tr_es_learnable_calibration_pp_merge_slm_v1 --model_name_or_path stabilityai/stablelm-2-1_6b-chat --tgt_lang es --for_each_component --lora_merge_strategy learnable_calibration_pp --merge_datasets dialogsum_es --secondary_task_dataset dialogsum_es --lora_merge_modules dialogsum_lora_slm_v1 tedtalks_es_lora_slm_v1 --dataset dialogsum --stored_model_name dialogsum_lora_slm_v1 --secondary_task translation --do_valid --do_predict --num_examples_merge 10000
Evaluation with an LLM judge is done separately, for example:
CUDA_VISIBLE_DEVICES=0 python llm_judge_eval.py --experiment_name sum_tr_es_learnable_calibration_pp_merge_slm_v1 --batch_size 2 --stage test
The results are stored in the output directory with the given experiment_name.
Our code extends:
- https://github.com/jzhang38/TinyLlama/blob/main/sft/finetune.py
- https://github.com/huggingface/peft/blob/main/src/peft/tuners/lora/layer.py
- https://github.com/sail-sg/lorahub
If you find this useful for your research, please consider citing:
@inproceedings{bohdal2025efficient,
  title={Efficient Compositional Multi-tasking for On-device Large Language Models},
  author={Bohdal, Ondrej and Ozay, Mete and Moon, Jijoong and Lee, Kyeng-Hun and Ko, Hyeonmok and Michieli, Umberto},
  booktitle={EMNLP},
  year={2025}
}