
fix bug #37 and #62 #71

Open
LeeWant wants to merge 1 commit into Wenyueh:main from LeeWant:engine_fix

Conversation

@LeeWant LeeWant commented Mar 30, 2026

Surface Cause

Issues #37 and #62 both report this bug. The surface cause is that the step method in llm_engine.py returns only two values on its early-exit path, return [], is_prefill, while the caller unpacks three: outputs, num_processed_tokens, is_prefill = self.step(). Because of this mismatch, the outer while loop never exits cleanly after all sequences have been processed. However, this is not the root cause.

    def step(self) -> tuple[list[int], bool]:
        scheduled_sequences, is_prefill = self.scheduler.schedule()
        if not scheduled_sequences:
            return [], is_prefill  # bug here: two values, but the caller unpacks three
        # run the model
        outputs = self.model_runner.call("run", scheduled_sequences, is_prefill)
        # Move outputs to CPU and convert them to a list
        if outputs is not None:
            outputs = outputs.cpu().tolist()
        # postprocess the outputs
        self.scheduler.postprocess(scheduled_sequences, outputs)

        outputs = [(seq.seq_id, seq.completion_token_ids) for seq in scheduled_sequences if seq.is_finished]
        num_processed_tokens = sum(len(seq) for seq in scheduled_sequences) if is_prefill else len(scheduled_sequences)

        return outputs, num_processed_tokens, is_prefill
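A minimal sketch of the fix for the early-exit path, assuming zero tokens count as processed when nothing is scheduled (the standalone function name here is illustrative, not the project's actual code):

```python
def step_early_return(scheduled_sequences, is_prefill):
    """Sketch: the empty-schedule path must return the same three values
    the caller unpacks as (outputs, num_processed_tokens, is_prefill)."""
    if not scheduled_sequences:
        # Nothing was scheduled, so zero tokens were processed this step.
        return [], 0, is_prefill
    # ... otherwise run the model and return
    # (outputs, num_processed_tokens, is_prefill) as before.
```

With the matching arity, the caller's three-way unpacking no longer raises a ValueError when the scheduler returns an empty batch.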

In-depth Analysis

Digging deeper with 90 sequences (3 prompts × 30), the number of items in the waiting queue stopped decreasing once the maximum allocatable space was reached (the block allocation limit on an RTX 3070 Ti with 8 GB is 65). The culprit is a logic error around the ref_count counter: a block taken from the free list keeps a stale count, so it is never correctly released. This can be fixed by adding one line in block_manager.py:

            else:
                # cache miss
                block = self._allocate_block(self.free_block_ids[0])
                block.update(h=h, token_ids=token_ids)
                block.ref_count = 1   # add this!
                if h != -1:
                    self.hash_to_block_id[h] = block.block_id
            seq.block_table.append(block.block_id)
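The ref_count lifecycle that this one-line fix restores can be illustrated with a minimal sketch (the Block and MiniBlockManager names here are hypothetical, not the project's actual classes):

```python
class Block:
    def __init__(self, block_id: int):
        self.block_id = block_id
        self.ref_count = 0


class MiniBlockManager:
    """Minimal sketch of reference-counted block allocation and release."""

    def __init__(self, num_blocks: int):
        self.blocks = [Block(i) for i in range(num_blocks)]
        self.free_block_ids = list(range(num_blocks))

    def allocate(self) -> Block:
        block = self.blocks[self.free_block_ids.pop(0)]
        # Without resetting ref_count to 1 here, a stale count left by a
        # previous owner means the decrement below never reaches zero,
        # so the block is never returned to the free list.
        block.ref_count = 1
        return block

    def release(self, block: Block) -> None:
        block.ref_count -= 1
        if block.ref_count == 0:
            self.free_block_ids.append(block.block_id)


mgr = MiniBlockManager(2)
b = mgr.allocate()
mgr.release(b)
print(len(mgr.free_block_ids))  # → 2, the block came back to the free list
```

This mirrors the observed symptom: with stale counts, released blocks never rejoin the free list, so the waiting queue can never be admitted once all 65 blocks are handed out.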

Another Scenario

After setting block_size=8, the program no longer throws an error, because the prompts themselves are small. With the original default of block_size=256, a single sequence cannot even fill one block, and because of the ref_count initialization issue in the lower layer, only 65 running sequences are ever executed (the number of running sequences matches the number of available blocks). Once the running queue drains, the waiting queue never starts, which is exactly the block-release failure caused by the stale ref_count.
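The arithmetic behind the 65-sequence ceiling can be sketched as follows (the 30-token prompt length is an illustrative assumption, not taken from the repro):

```python
def blocks_needed(num_tokens: int, block_size: int) -> int:
    # Ceiling division: even a short prompt pins at least one whole block.
    return -(-num_tokens // block_size)


# With the default block_size=256, a short prompt occupies exactly one
# block, so a 65-block limit caps the running queue at 65 sequences.
print(blocks_needed(30, 256))  # → 1
# With block_size=8, the same prompt spreads over several smaller blocks.
print(blocks_needed(30, 8))    # → 4
```

This is why the hang appears only with the large default block size: one block per sequence, 65 blocks total, and no block ever freed means the 66th sequence waits forever.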


Development

Successfully merging this pull request may close these issues:

- 【BUG】When prefix caching is not triggered, the sequence's blocks are not deallocated
- [Bug] An error is reported due to too many prompts.
