Walkthrough: Serving Mistral-7B with vLLM in a Python virtual environment on NVIDIA A10
- Project Description
- Setup Process
2.1 - Create Python Virtual Environment
2.2 - Install PyTorch
2.3 - Verify Setup
2.4 - Start the vLLM OpenAI-compatible server
2.5 - Querying the LLM
- Errors Encountered
3.1 - Error #1: Wrong dtype with GPTQ
3.2 - Error #2: SafeTensor error "Header too large"
- License
This project documents how to set up and serve the Mistral-7B-Instruct model using the vLLM inference engine inside a Python virtual environment. The environment was tested on an NVIDIA A10 GPU cloud instance. Mistral-7B was chosen for this project because of its strong performance relative to the LLaMA 2 family, a competing set of LLMs developed by Meta. Below is a comparison of Mistral-7B's performance vs. LLaMA:
source: https://mistral.ai/news/announcing-mistral-7b
Create and activate the virtual environment, then upgrade pip:
python3 -m venv vllm-env
source vllm-env/bin/activate
python -m pip install --upgrade pip
- Python copies its interpreter and standard libraries into a new isolated folder (vllm-env/)
- It updates the shell’s $PATH so that when you type python or pip, it points to the copies inside vllm-env/, not your system ones.
- Upgrades pip inside the venv to ensure the latest installer before adding packages.
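To confirm the activation worked, here is a minimal check from inside the venv (a sketch using only the standard library):

import sys

# Inside an active venv, sys.prefix points at vllm-env/ while
# sys.base_prefix still points at the system Python installation.
print("Interpreter:", sys.executable)
print("Inside a virtual environment:", sys.prefix != sys.base_prefix)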
Next, install vLLM, which pulls in PyTorch:
pip install "vllm[torch]"
- pip resolves the latest release of vllm and installs it together with its PyTorch dependency
- Everything gets installed into the virtual environment vllm-env
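To double-check what pip actually installed, a small sketch using the standard library's importlib.metadata:

from importlib.metadata import PackageNotFoundError, version

# Report the installed versions of the two key packages.
for pkg in ("vllm", "torch"):
    try:
        print(f"{pkg}: {version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg}: not installed")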
Switch to the Python shell.
python
- This opens the interactive Python REPL (Read-Eval-Print Loop). From here, you can type Python code line by line.
Type these commands.
import torch
print(torch.__version__)
print(torch.cuda.is_available())
print(torch.cuda.get_device_name(0))
System output:
2.7.1+cu126
True
NVIDIA A10
- This is the expected output; if yours doesn't match, something went wrong during setup.
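For a closer look at the GPU itself, torch.cuda.get_device_properties reports memory and compute capability (an optional sketch; the values in the comments assume an A10):

import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print("Name:", props.name)                                # NVIDIA A10
    print(f"Memory: {props.total_memory / 1024**3:.1f} GiB")  # ~24 GiB on an A10
    print(f"Compute capability: {props.major}.{props.minor}")
else:
    print("CUDA not available - check the driver installation")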
Next, start the vLLM OpenAI-compatible server:
python3 -m vllm.entrypoints.openai.api_server \
  --model ./Mistral-7B-Instruct-v0.2-GPTQ \
  --quantization gptq \
  --dtype float16 \
  --port 8000
- This launches vLLM’s OpenAI-compatible server on port 8000, serving the locally stored Mistral-7B-Instruct v0.2 GPTQ model in fp16 precision.
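Model loading takes a while, so it helps to wait for the server to report ready before querying it. Below is a small readiness probe (a sketch, assuming the requests package is installed; vLLM's OpenAI-compatible server exposes a /health endpoint that returns 200 once the model has loaded):

import time
import requests

for _ in range(60):
    try:
        # /health answers with HTTP 200 once the model has finished loading.
        if requests.get("http://localhost:8000/health", timeout=2).status_code == 200:
            print("Server is ready")
            break
    except requests.exceptions.ConnectionError:
        pass  # server process still starting up
    time.sleep(5)
else:
    print("Server did not become ready in time")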
Now that Mistral-7B is up and running, test it with curl:
curl -s http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "./Mistral-7B-Instruct-v0.2-GPTQ",
"messages":[{"role":"user","content":"Say hello in 5 words."}]
}'
Mistral-7B will send back a text response. This was the response received:
{
"id":"chatcmpl-",
"object":"chat.completion",
"created":1756330925,
"model":"./Mistral-7B-Instruct-v0.2-GPTQ",
"choices":[
{
"index":0,
"message":{
"role":"assistant",
"content":" Hello, I'm here to help! Here are five words: \"Hello, welcome friend.\"",
"refusal":null,
"annotations":null,
"audio":null,
"function_call":null,
"tool_calls":[],
"reasoning_content":null
},
"logprobs":null,
"finish_reason":"stop",
"stop_reason":null
}
],
"service_tier":null,
"system_fingerprint":null,
"usage":{
"prompt_tokens":16,
"total_tokens":37,"
}
}
The server returned a JSON object containing:
- A unique request ID (chatcmpl-...).
- Metadata such as the object type, creation timestamp, and model ID.
- The generated output from the model inside the choices array.
- Token usage information showing how many tokens were processed.
This shows that:
- The model was successfully served by vLLM
- The API correctly responded to a chat request
- The Mistral-7B model produced a valid text output
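The same request can also be made from Python with the official openai client (a sketch, assuming pip install openai; the API key can be any string since this local server does not check it):

from openai import OpenAI

# Point the client at the local vLLM server instead of api.openai.com.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="./Mistral-7B-Instruct-v0.2-GPTQ",
    messages=[{"role": "user", "content": "Say hello in 5 words."}],
)
print(resp.choices[0].message.content)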
The first error appeared when the server was initially started with this command:
python3 -m vllm.entrypoints.openai.api_server \
  --model ./Mistral-7B-Instruct-v0.2-GPTQ \
  --dtype auto \
  --quantization gptq \
  --port 8000
This resulted in the following error:
Value error, torch.bfloat16 is not supported for quantization method gptq. Supported dtypes: [torch.float16] [type=value_error, input_value=ArgsKwargs(0, {'model_co...additional_config': {}}), input_type=ArgsKwargs]
"dtype auto" made vLLM choose bfloat16, since the GPU supports it. But GPTQ quantized models are not compatible with bfloat16.
- Deeper explanation: GPTQ is a quantization method that pre-compresses weights (e.g., to 4-bit or 8-bit). These quantized weights aren't compatible with bfloat16 (bf16); instead, GPTQ models are meant to be loaded in float16 (fp16) or int4/int8, depending on how they were built.
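The same fix can be expressed through vLLM's offline Python API (a sketch equivalent to the corrected server command above):

from vllm import LLM, SamplingParams

llm = LLM(
    model="./Mistral-7B-Instruct-v0.2-GPTQ",
    quantization="gptq",
    dtype="float16",  # explicit fp16; "auto" would pick bf16 on an A10 and fail
)
outputs = llm.generate(["Say hello in 5 words."], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)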
The second error appeared when vLLM tried to load the model weights:
safetensors_rust.SafetensorError: Error while deserializing header: header too large
- Explanation:
- This error occurred because safetensors tried to read the file's header (the metadata describing the tensors in the file), but the header was corrupted or incomplete.
- "Header too large" usually means the model weights didn't download correctly, often because of Git LFS (Large File Storage, explained below).
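A quick way to tell a real weight file from a broken download is to look at its first bytes (a sketch; the .safetensors filename inside the repo is hypothetical and may differ): a genuine .safetensors file begins with an 8-byte little-endian header length, while a Git LFS pointer stub is a tiny text file starting with "version https://git-lfs...". Interpreting that ASCII text as a length is exactly what produces the absurd "header too large" value.

import os
import struct

def diagnose(path: str) -> None:
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        first8 = f.read(8)
    if first8.startswith(b"version"):
        # Git LFS pointer stub: a ~130-byte text file, not real weights.
        print(f"{path}: LFS pointer stub ({size} bytes)")
    else:
        # Real safetensors file: first 8 bytes = length of the JSON header.
        header_len = struct.unpack("<Q", first8)[0]
        print(f"{path}: {size / 1024**3:.2f} GiB, header length {header_len} bytes")

diagnose("Mistral-7B-Instruct-v0.2-GPTQ/model.safetensors")  # hypothetical filename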
The issue can be resolved by removing the Mistral-7B directory and re-downloading it. The -rf flags tell rm to delete the directory and all of its contents recursively, and to do so forcefully without prompting. This ensures that no leftover or potentially corrupted files remain.
rm -rf Mistral-7B-Instruct-v0.2-GPTQ
git clone https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GPTQ
Before moving on to the next step and entering the Mistral-7B directory, ensure that Git LFS is installed in the virtual environment:
sudo apt install git-lfs
git lfs install
After Git LFS is installed, move into the Mistral-7B directory and run "git lfs pull":
cd Mistral-7B-Instruct-v0.2-GPTQ
git lfs pull
- Hugging Face and GitHub use Git LFS for big files (like multi-GB model weights). If you clone a repo with big files without Git LFS, you only get pointer stubs (tiny text files, not the full model).
- When vLLM tries to load those pointer stubs as real model weights, safetensors chokes on them and raises the error above.
- After this, Git actually fetches the multi-GB .safetensors files instead of stubs, and the server should start without error (see the verification sketch below).
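Once the pull completes, the headers can be verified directly (a sketch, assuming the safetensors package, which vLLM already depends on):

from pathlib import Path
from safetensors import safe_open

# safe_open parses only the header, so this is fast even for multi-GB shards.
for shard in Path("Mistral-7B-Instruct-v0.2-GPTQ").glob("*.safetensors"):
    with safe_open(str(shard), framework="pt") as f:
        print(f"{shard.name}: {len(f.keys())} tensors, header OK")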
This project is licensed under the MIT License — see the LICENSE file for details.