
fix bug #37 and #62 #71

Open
LeeWant wants to merge 1 commit into Wenyueh:main from LeeWant:engine_fix

Conversation

@LeeWant LeeWant commented Mar 30, 2026

Surface Cause

Issues #37 and #62 both report this bug. The surface cause is that the step method in llm_engine.py returns only two values on its early-exit path, return [], is_prefill, while the caller unpacks three: outputs, num_processed_tokens, is_prefill = self.step(). Because of this mismatch, the outer while loop never exits cleanly after all sequences have been processed. However, this is not the root cause.

    def step(self) -> tuple[list[int], bool]:
        scheduled_sequences, is_prefill = self.scheduler.schedule()
        if not scheduled_sequences:
            return [], is_prefill  # bug here: two values, but the caller unpacks three
        # run the model
        outputs = self.model_runner.call("run", scheduled_sequences, is_prefill)
        # Move outputs to CPU and convert them to a list
        if outputs is not None:
            outputs = outputs.cpu().tolist()
        # postprocess the outputs
        self.scheduler.postprocess(scheduled_sequences, outputs)

        outputs = [(seq.seq_id, seq.completion_token_ids) for seq in scheduled_sequences if seq.is_finished]
        num_processed_tokens = sum(len(seq) for seq in scheduled_sequences) if is_prefill else len(scheduled_sequences)

        return outputs, num_processed_tokens, is_prefill
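A minimal sketch of the fix for the early-exit path, assuming zero tokens count as processed when nothing is scheduled (the standalone function name here is illustrative, not the project's actual code):

```python
def step_early_return(scheduled_sequences, is_prefill):
    """Sketch: the empty-schedule path must return the same three values
    the caller unpacks as (outputs, num_processed_tokens, is_prefill)."""
    if not scheduled_sequences:
        # Nothing was scheduled, so zero tokens were processed this step.
        return [], 0, is_prefill
    # ... otherwise run the model and return
    # (outputs, num_processed_tokens, is_prefill) as before.
```

With the matching arity, the caller's three-way unpacking no longer raises a ValueError when the scheduler returns an empty batch.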

In-depth Analysis

Digging deeper with 90 sequences (3 prompts × 30), the number of items in the waiting queue stopped decreasing once the maximum allocatable space was reached (the block allocation limit on an RTX 3070 Ti with 8 GB is 65). The culprit is a logic error around the ref_count counter: a block taken from the free list keeps a stale count, so it is never correctly released. This can be fixed by adding one line in block_manager.py:

            else:
                # cache miss
                block = self._allocate_block(self.free_block_ids[0])
                block.update(h=h, token_ids=token_ids)
                block.ref_count = 1   # add this!
                if h != -1:
                    self.hash_to_block_id[h] = block.block_id
            seq.block_table.append(block.block_id)
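The ref_count lifecycle that this one-line fix restores can be illustrated with a minimal sketch (the Block and MiniBlockManager names here are hypothetical, not the project's actual classes):

```python
class Block:
    def __init__(self, block_id: int):
        self.block_id = block_id
        self.ref_count = 0


class MiniBlockManager:
    """Minimal sketch of reference-counted block allocation and release."""

    def __init__(self, num_blocks: int):
        self.blocks = [Block(i) for i in range(num_blocks)]
        self.free_block_ids = list(range(num_blocks))

    def allocate(self) -> Block:
        block = self.blocks[self.free_block_ids.pop(0)]
        # Without resetting ref_count to 1 here, a stale count left by a
        # previous owner means the decrement below never reaches zero,
        # so the block is never returned to the free list.
        block.ref_count = 1
        return block

    def release(self, block: Block) -> None:
        block.ref_count -= 1
        if block.ref_count == 0:
            self.free_block_ids.append(block.block_id)


mgr = MiniBlockManager(2)
b = mgr.allocate()
mgr.release(b)
print(len(mgr.free_block_ids))  # → 2, the block came back to the free list
```

This mirrors the observed symptom: with stale counts, released blocks never rejoin the free list, so the waiting queue can never be admitted once all 65 blocks are handed out.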

Another Scenario

After setting block_size=8, the program no longer throws an error, because the prompts themselves are small. With the original default of block_size=256, a single sequence cannot even fill one block, and because of the ref_count initialization issue in the lower layer, only 65 running sequences are ever executed (the number of running sequences matches the number of available blocks). Once the running queue drains, the waiting queue never starts, which is exactly the block-release failure caused by the stale ref_count.
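The arithmetic behind the 65-sequence ceiling can be sketched as follows (the 30-token prompt length is an illustrative assumption, not taken from the repro):

```python
def blocks_needed(num_tokens: int, block_size: int) -> int:
    # Ceiling division: even a short prompt pins at least one whole block.
    return -(-num_tokens // block_size)


# With the default block_size=256, a short prompt occupies exactly one
# block, so a 65-block limit caps the running queue at 65 sequences.
print(blocks_needed(30, 256))  # → 1
# With block_size=8, the same prompt spreads over several smaller blocks.
print(blocks_needed(30, 8))    # → 4
```

This is why the hang appears only with the large default block size: one block per sequence, 65 blocks total, and no block ever freed means the 66th sequence waits forever.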


Development

Successfully merging this pull request may close these issues:

- 【BUG】When prefix caching is not triggered, the sequence's blocks are not deallocated
- [Bug] An error is reported due to too many prompts.
