Efficient Compositional Multi-tasking for On-device Large Language Models

Ondrej Bohdal¹, Mete Ozay¹, Jijoong Moon², Kyeng-Hun Lee², Hyeonmok Ko², Umberto Michieli¹

¹ Samsung R&D Institute UK, United Kingdom   ² Samsung Research, South Korea

EMNLP 2025

website arXiv BibTeX

Abstract

Adapter parameters provide a mechanism to modify the behavior of machine learning models and have gained significant popularity in the context of large language models (LLMs) and generative AI. These parameters can be merged to support multiple tasks via a process known as task merging. However, prior work on merging in LLMs, particularly in natural language processing, has been limited to scenarios where each test example addresses only a single task. In this paper, we focus on on-device settings and study the problem of text-based compositional multi-tasking, where each test example involves the simultaneous execution of multiple tasks. For instance, generating a translated summary of a long text requires solving both translation and summarization tasks concurrently. To facilitate research in this setting, we propose a benchmark comprising four practically relevant compositional tasks. We also present an efficient method (Learnable Calibration) tailored for on-device applications, where computational resources are limited, emphasizing the need for solutions that are both resource-efficient and high-performing. Our contributions lay the groundwork for advancing the capabilities of LLMs in real-world multi-tasking scenarios, expanding their applicability to complex, resource-constrained use cases.

Setup

We use the following libraries: torch, transformers, datasets, evaluate, accelerate, peft, rouge-score, and vllm. Install them with:

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install transformers
pip install datasets
pip install evaluate
pip install accelerate
pip install peft
pip install rouge-score
pip install vllm

We have tested the code with the library versions specified in the requirements.txt file.
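
Alternatively, assuming requirements.txt pins those tested versions, the environment can be installed in one step:

pip install -r requirements.txt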

We also need to create directories data, output, predictions, stored_models, logs in the repo directory:

mkdir data output predictions stored_models logs

Data

We can recreate the data for our compositional multi-tasking benchmark by running the prepare_compositional_task_data.sh script.
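
For example, assuming the script sits in the repository root:

bash prepare_compositional_task_data.sh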

Experiments

The first step is to train a LoRA for each individual task. For example:

CUDA_VISIBLE_DEVICES=0 python run_experiment.py --experiment_name dialogsum_lora_slm_v1 --model_name_or_path stabilityai/stablelm-2-1_6b-chat --dataset dialogsum --stored_model_name dialogsum_lora_slm_v1 --do_train --do_valid --do_predict

This needs to be repeated for each dataset and model. Note that tone adjustment is a smaller dataset, so --gradient_accumulation_steps 4 is used instead of the default value (see the sketch below).
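
A minimal sketch of such a tone adjustment training run is shown below; the dataset and stored model identifiers (tone_adjustment, tone_adjustment_lora_slm_v1) are illustrative placeholders, not necessarily the exact names produced by the data preparation script:

CUDA_VISIBLE_DEVICES=0 python run_experiment.py --experiment_name tone_adjustment_lora_slm_v1 --model_name_or_path stabilityai/stablelm-2-1_6b-chat --dataset tone_adjustment --stored_model_name tone_adjustment_lora_slm_v1 --gradient_accumulation_steps 4 --do_train --do_valid --do_predict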

The various methods can then be evaluated on the compositional multi-tasking benchmark; the examples below combine dialogue summarization with translation into Spanish:

Zero-shot:

CUDA_VISIBLE_DEVICES=0 python run_experiment.py --experiment_name sum_tr_es_zero_shot_slm_v1 --model_name_or_path stabilityai/stablelm-2-1_6b-chat --tgt_lang es --secondary_task_dataset dialogsum_es --dataset dialogsum --stored_model_name dialogsum_lora_slm_v1 --zero_shot --secondary_task translation --do_valid --do_predict

Main-task LoRA:

CUDA_VISIBLE_DEVICES=0 python run_experiment.py --experiment_name sum_tr_es_sum_lora_only_slm_v1 --model_name_or_path stabilityai/stablelm-2-1_6b-chat --tgt_lang es --secondary_task_dataset dialogsum_es --dataset dialogsum --stored_model_name dialogsum_lora_slm_v1 --secondary_task translation --do_valid --do_predict

Linear merge:

CUDA_VISIBLE_DEVICES=0 python run_experiment.py --experiment_name sum_tr_es_linear_merge_slm_v1 --model_name_or_path stabilityai/stablelm-2-1_6b-chat --tgt_lang es --secondary_task_dataset dialogsum_es --dataset dialogsum --lora_merge_strategy peft_linear --lora_merge_modules dialogsum_lora_slm_v1 tedtalks_es_lora_slm_v1 --stored_model_name dialogsum_lora_slm_v1 --secondary_task translation --do_valid --do_predict

Multi-step LoRA usage:

CUDA_VISIBLE_DEVICES=0 python run_experiment.py --experiment_name sum_tr_es_multi_step_slm_v1 --model_name_or_path stabilityai/stablelm-2-1_6b-chat --tgt_lang es --secondary_task_dataset dialogsum_es --dataset dialogsum --stored_model_name dialogsum_lora_slm_v1 --secondary_stored_model_name tedtalks_es_lora_slm_v1 --secondary_task translation --do_valid --do_predict --multi_step_prompts --prompt_style empty

Joint-expert LoRA:

CUDA_VISIBLE_DEVICES=0 python run_experiment.py --experiment_name sum_tr_es_joint_expert_lora_slm_v1 --model_name_or_path stabilityai/stablelm-2-1_6b-chat --tgt_lang es --secondary_task_dataset dialogsum_es --dataset dialogsum --stored_model_name sum_tr_es_joint_expert_lora_slm_v1 --secondary_task translation --multi_task_train_targets --do_train --do_valid --do_predict

Learnable Calibration++:

CUDA_VISIBLE_DEVICES=0 python run_experiment.py --experiment_name sum_tr_es_learnable_calibration_pp_merge_slm_v1 --model_name_or_path stabilityai/stablelm-2-1_6b-chat --tgt_lang es --for_each_component --lora_merge_strategy learnable_calibration_pp --merge_datasets dialogsum_es --secondary_task_dataset dialogsum_es --lora_merge_modules dialogsum_lora_slm_v1 tedtalks_es_lora_slm_v1 --dataset dialogsum --stored_model_name dialogsum_lora_slm_v1 --secondary_task translation --do_valid --do_predict --num_examples_merge 10000

Evaluation with an LLM judge is done separately, for example:

CUDA_VISIBLE_DEVICES=0 python llm_judge_eval.py --experiment_name sum_tr_es_learnable_calibration_pp_merge_slm_v1 --batch_size 2 --stage test

The results are stored in the output directory with the given experiment_name.
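
For example, to locate the outputs of the run above (regardless of whether they are written as files or as a subdirectory named after the experiment):

ls output/ | grep sum_tr_es_learnable_calibration_pp_merge_slm_v1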

Acknowledgements

Our code extends:

Citation

If you find this useful for your research, please consider citing:

@inproceedings{bohdal2025efficient,
  title={Efficient Compositional Multi-tasking for On-device Large Language Models},
  author={Bohdal, Ondrej and Ozay, Mete and Moon, Jijoong and Lee, Kyeng-Hun and Ko, Hyeonmok and Michieli, Umberto},
  booktitle={EMNLP},
  year={2025}
}
