Llama2_7b Example Will Crash When the Model Outputs Too Many Words #378

@shenzhiy21

Description

How to Reproduce

Just make the model keep generating new words non-stop, until the generated sequence length exceeds the default seq_len.

For example, change the prompt to

prompt = 'a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a'

and it will crash after generating 1022 tokens:

    local_cache = val_cache.select(0, l).narrow(0, pos, 3)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: start (1022) + length (3) exceeds dimension size (1024).

How to Fix

The bug is due to the construction of local_cache:

local_cache = val_cache.select(0, l).narrow(0, pos, 3)

When pos = seq_len - 2, this narrow requests rows pos through pos + 2 of the layer's cache, but pos + 3 exceeds the seq_len rows that val_cache was allocated with, which raises the RuntimeError above.

For a quick (though perhaps not "beautiful") fix, change line 74 to

val_cache = torch.zeros([n_layers, seq_len + 3, dim], dtype=data_type, device=device).clone()

to reserve extra rows for local_cache.
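The failure and the padding fix can be reproduced in isolation. Below is a minimal sketch (small tensor sizes chosen here for illustration; the real cache uses seq_len = 1024) showing that narrow(0, pos, 3) fails on an exactly seq_len-sized cache at pos = seq_len - 2, while the padded allocation stays in bounds:

```python
import torch

# Small sizes for illustration; the real code uses seq_len = 1024.
n_layers, seq_len, dim = 2, 8, 4
pos = seq_len - 2  # the position at which the crash occurs

# Original allocation: exactly seq_len rows per layer.
val_cache = torch.zeros([n_layers, seq_len, dim])
try:
    # Asks for rows pos..pos+2, but pos + 3 > seq_len.
    val_cache.select(0, 0).narrow(0, pos, 3)
except RuntimeError as e:
    print(e)  # start (...) + length (3) exceeds dimension size (...)

# Padded allocation: seq_len + 3 rows, so narrow(0, pos, 3)
# stays in bounds for every valid pos.
val_cache_padded = torch.zeros([n_layers, seq_len + 3, dim])
local_cache = val_cache_padded.select(0, 0).narrow(0, pos, 3)
print(local_cache.shape)  # torch.Size([3, 4])
```

Note that narrow returns a view into val_cache_padded, so writes to local_cache still land in the shared cache; the extra three rows are only scratch space that legitimate positions never read back.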
