Commit ea19289 (parent 4c0feb6): add minions — 9 files changed, +106 −0
---
layout: post
title: "Minions: Cost-efficient Collaboration Between On-device and Cloud Language Models"
date: 2025-03-16
description: "Minions: Cost-efficient Collaboration Between On-device and Cloud Language Models"
tags: ml paper llm federated distributed
comments: true
---

<style>
li {
  font-size: 1.1em; /* Adjust as needed */
}
</style>

# [Minions: Cost-efficient Collaboration Between On-device and Cloud Language Models](https://arxiv.org/abs/2502.15964)
> [TL;DR]
> MinionS is a collaboration protocol between **local small LMs** and **remote frontier LMs** that significantly reduces cloud inference costs while maintaining near-frontier accuracy, by decomposing complex tasks into simpler subtasks executed locally in parallel.

## Highlights
- Proposes a collaboration protocol between **local small LMs** and **remote frontier LMs**
- Proposes two protocols: Minion (naïve chat-based) and MinionS (task decomposition-based)
- MinionS reduces cloud inference costs by **5.7×** on average and recovers **97.9%** of frontier LM performance
- Conducts detailed analyses on model choice, parallel workload scaling, and sequential communication strategies

## Summary
- **Observation 1**: Large models can perform data-intensive reasoning, but accessing these models is expensive.
- **Observation 2**: Small local models run on-device at no cost, but they struggle with multi-step instructions and long-context reasoning.
<div class="row mt-3">
<div class="col-sm-6 mt-3 mt-md-0 offset-3">
{% include figure.html path="assets/img/posts/minions/slm.png" title="Small language models" class="img-fluid rounded z-depth-1" %}
</div>
</div>
<br>
- **The problem statement**: Reduce the cost of cloud-based inference while maintaining performance by enabling effective collaboration between small, on-device LMs and large, remote LMs.
- **The solution**: MinionS leverages the remote LM to decompose complex queries into simpler subtasks, which local LMs execute in parallel over smaller document chunks, improving accuracy while reducing remote inference costs.
{% include figure.html path="assets/img/posts/minions/overview.png" title="Minions Overview" class="img-fluid rounded z-depth-1" %}
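The decompose-execute-aggregate loop can be sketched in a few lines. This is a minimal illustration, not the paper's actual implementation: the function names, prompt strings, and chunking scheme are all assumptions, and `remote_lm`/`local_lm` stand in for real model calls.

```python
def chunk_document(document: str, chunk_size: int) -> list[str]:
    # Split the long context into fixed-size pieces a small LocalLM can handle.
    return [document[i:i + chunk_size] for i in range(0, len(document), chunk_size)]

def minions_round(query, document, remote_lm, local_lm, chunk_size=1000):
    # 1) The RemoteLM decomposes the query into simple, single-step subtasks.
    subtasks = remote_lm(f"Decompose into subtasks: {query}")
    # 2) Every (subtask, chunk) pair is executed by the LocalLM -- in parallel
    #    in practice; a plain double loop keeps the sketch simple.
    chunks = chunk_document(document, chunk_size)
    local_outputs = [local_lm(task, chunk) for task in subtasks for chunk in chunks]
    # 3) The RemoteLM aggregates the short local extractions into a final
    #    answer, never reading the full document itself.
    return remote_lm(f"Aggregate {local_outputs} to answer: {query}")
```

The key cost lever is step 3: the cloud model only ever sees the subtask list and the filtered local outputs, never the long document.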
- **Finding 1**: Practical effectiveness: MinionS achieves near-equivalent accuracy to large remote-only models at just a fraction (around 18%) of the cost.
- **Finding 2**: Effective collaboration is achievable starting from a 3B-parameter local model, with larger local models (8B) further improving accuracy and cost efficiency.
<div class="row mt-3">
<div class="col-sm-10 mt-3 mt-md-0 offset-1">
{% include figure.html path="assets/img/posts/minions/slm_size.png" title="Small language models size" class="img-fluid rounded z-depth-1" %}
</div>
</div>
<br>

## Experiments

#### Comparison with Baselines
- Minion (the naïve protocol) achieves a 30.4× cost reduction, but recovers only 87% of the remote-only model's accuracy.
- MinionS substantially improves on Minion, achieving 97.9% of the accuracy at only 18% of the cost of remote-only inference.
<div class="row mt-3">
<div class="col-sm-4 mt-3 mt-md-0">
{% include figure.html path="assets/img/posts/minions/baselines.png" title="Minions baselines" class="img-fluid rounded z-depth-1" %}
</div>
<div class="col-sm-8 mt-3 mt-md-0">
{% include figure.html path="assets/img/posts/minions/table.png" class="img-fluid rounded z-depth-1" %}
</div>
</div>
<br>

#### Analysis of Parallel Workloads
- Three parameters, configured by the RemoteLM, control the degree of task decomposition:
  - (1) **Number of tasks per round**: How many simpler subtasks the remote model creates for the local models, enabling parallel local execution (e.g., "Extract the ARR for Q1 of 2014").
  - (2) **Number of samples per task**: How many repeated attempts (samples) the local language model (LocalLM) makes for each individual subtask (i.e., the number of generations produced by the LocalLM, ≥ 1).
  - (3) **Chunk size**: Chunk by page, chunk by paragraph, etc.; smaller chunks send more information to the cloud.

- Increasing the number of tasks, samples per task, and chunking granularity improves accuracy but increases cost. Task decomposition and chunk size offer a more cost-effective trade-off.
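The three knobs above compose multiplicatively: a small helper makes the scaling explicit. This is an illustrative calculation, not code from the paper; the function name is invented.

```python
def local_calls_per_round(n_tasks: int, n_samples: int, n_chunks: int) -> int:
    # Every subtask is sampled n_samples times on every chunk, so the local
    # workload (and the number of outputs forwarded to the cloud for
    # aggregation) scales with the product of all three parameters.
    return n_tasks * n_samples * n_chunks
```

For example, 3 tasks × 2 samples × 5 chunks yields 30 local generations in a single round, which is why raising all three knobs at once drives cost up quickly.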
<div class="row mt-3">
<div class="col-sm-10 mt-3 mt-md-0 offset-1">
{% include figure.html path="assets/img/posts/minions/scaling.png" class="img-fluid rounded z-depth-1" %}
</div>
</div>
<br>

#### Sequential Communication
- Increasing the number of sequential communication rounds can improve accuracy, but also increases cost.
- Strategies such as using a scratchpad for intermediate steps slightly improve the cost-accuracy trade-off.
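A multi-round exchange with a scratchpad can be sketched as follows. This is a hedged illustration of the idea only: the `FINAL:` convention, the message formats, and the function names are assumptions, not the paper's protocol definition.

```python
def minion_chat(query, remote_lm, local_lm, max_rounds=3):
    # Sketch of sequential communication: the RemoteLM accumulates
    # intermediate findings in a scratchpad and, each round, either asks the
    # LocalLM a follow-up question or emits an answer prefixed with "FINAL:".
    scratchpad = []
    for _ in range(max_rounds):
        message = remote_lm(query, scratchpad)
        if message.startswith("FINAL:"):
            return message.removeprefix("FINAL:")
        scratchpad.append(local_lm(message))  # local answer becomes a note
    # Out of rounds: the remote model must answer with what it has so far.
    return remote_lm(query, scratchpad).removeprefix("FINAL:")
```

Each extra round adds one remote call and one local call, which is the cost-accuracy trade-off the experiments measure.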
<div class="row mt-3">
<div class="col-sm-6 mt-3 mt-md-0 offset-3">
{% include figure.html path="assets/img/posts/minions/sequential.png" class="img-fluid rounded z-depth-1" %}
</div>
</div>
<br>

#### Retrieval-Augmented Generation (RAG) Comparison
- RAG excels at structured extraction tasks but struggles with tasks requiring synthesis of dispersed information, where MinionS provides superior token efficiency and narrative coherence.
- From Figure 8 (left), none of the RAG configurations are able to match the quality of Minion at the same low cost.
<div class="row mt-3">
<div class="col-sm-10 mt-3 mt-md-0 offset-1">
{% include figure.html path="assets/img/posts/minions/rag.png" class="img-fluid rounded z-depth-1" %}
</div>
</div>
<br>

## Conclusions

- MinionS efficiently distributes workload between local and remote LMs, significantly reducing cloud inference costs while preserving accuracy.
- This collaborative approach becomes increasingly effective as local LM capabilities advance, showing strong potential for future cost-efficient systems.