Qwen 3.5 Reverse Proxy is a lightweight HTTP reverse proxy that automatically adjusts sampling parameters (temperature, top_p, etc.) and thinking mode based on one of four predefined profiles. It sits between your application and the backend LLM server serving Qwen 3.5 (e.g., vLLM). It also provides a /tokenize endpoint for tokenizing messages using the backend's tokenizer.
This proxy's primary purpose is to:
- Accept requests for four virtual model names (configured via `-thinking-general`, `-thinking-coding`, `-instruct-general`, and `-instruct-reasoning`), rejecting all other model names with HTTP 400
- Set appropriate sampling parameters automatically based on one of four profiles (official Qwen-recommended values from Hugging Face):
  - Thinking mode for general tasks: `temperature=1.0`, `top_p=0.95`, `top_k=20`, `min_p=0.0`, `presence_penalty=1.5`, `repetition_penalty=1.0`
  - Thinking mode for precise coding tasks: `temperature=0.6`, `top_p=0.95`, `top_k=20`, `min_p=0.0`, `presence_penalty=0.0`, `repetition_penalty=1.0`
  - Instruct mode for general tasks: `temperature=0.7`, `top_p=0.8`, `top_k=20`, `min_p=0.0`, `presence_penalty=1.5`, `repetition_penalty=1.0`
  - Instruct mode for reasoning tasks: `temperature=1.0`, `top_p=0.95`, `top_k=20`, `min_p=0.0`, `presence_penalty=1.5`, `repetition_penalty=1.0`
- Configure thinking mode by setting `chat_template_kwargs.enable_thinking`:
  - `enable_thinking=true` for thinking modes (general and coding)
  - `enable_thinking=false` for instruct modes (general and reasoning)
- Rewrite the model name to the actual backend model name (e.g., `Qwen/Qwen3.5-397B-A17B-FP8`) before forwarding to vLLM
- Fix vLLM response bugs where non-thinking, non-streaming responses incorrectly place content in `reasoning_content` or `reasoning` fields instead of `content`
- Enrich the `/v1/models` endpoint by fetching backend models and exposing the 4 virtual models with the same metadata (permissions, `max_model_len`, etc.)
- Provide OpenAI Responses API compatibility (`/v1/responses`) by converting requests to Chat Completions format and responses back to Responses format. This conversion is necessary because only vLLM's Chat Completions endpoint supports `chat_template_kwargs`, which is required to control Qwen's thinking mode (`enable_thinking=true/false`)
- Provide a `/tokenize` endpoint for tokenizing messages and counting tokens before making actual generation requests
Requirements: Go 1.24.2 or later
```shell
go build -o qwen35-rp ../qwen35-rp

./qwen35-rp \
  -target "http://127.0.0.1:8000" \
  -served-model "Qwen/Qwen3.5-397B-A17B-FP8" \
  -thinking-general "qwen-thinking-general" \
  -thinking-coding "qwen-thinking-coding" \
  -instruct-general "qwen-instruct-general" \
  -instruct-reasoning "qwen-instruct-reasoning"
```

Or using environment variables:
```shell
export QWEN35RP_TARGET="http://127.0.0.1:8000"
export QWEN35RP_SERVED_MODEL_NAME="Qwen/Qwen3.5-397B-A17B-FP8"
export QWEN35RP_THINKING_GENERAL_MODEL="qwen-thinking-general"
export QWEN35RP_THINKING_CODING_MODEL="qwen-thinking-coding"
export QWEN35RP_INSTRUCT_GENERAL_MODEL="qwen-instruct-general"
export QWEN35RP_INSTRUCT_REASONING_MODEL="qwen-instruct-reasoning"
./qwen35-rp
```

Configure the proxy using command-line flags or environment variables:
| Flag | Environment Variable | Default | Description |
|---|---|---|---|
| `-listen` | `QWEN35RP_LISTEN` | `0.0.0.0` | IP address to listen on |
| `-port` | `QWEN35RP_PORT` | `9000` | Port to listen on |
| `-target` | `QWEN35RP_TARGET` | `http://127.0.0.1:8000` | Backend target URL |
| `-loglevel` | `QWEN35RP_LOGLEVEL` | `INFO` | Log level (`COMPLETE`, `DEBUG`, `INFO`, `WARN`, `ERROR`) |
| `-served-model` | `QWEN35RP_SERVED_MODEL_NAME` | (required) | Backend model name to use in outgoing requests |
| `-thinking-general` | `QWEN35RP_THINKING_GENERAL_MODEL` | `qwen3.5-thinking-general` | Name of the thinking-general model (incoming request identifier) |
| `-thinking-coding` | `QWEN35RP_THINKING_CODING_MODEL` | `qwen3.5-thinking-coding` | Name of the thinking-coding model (incoming request identifier) |
| `-instruct-general` | `QWEN35RP_INSTRUCT_GENERAL_MODEL` | `qwen3.5-instruct-general` | Name of the instruct-general model (incoming request identifier) |
| `-instruct-reasoning` | `QWEN35RP_INSTRUCT_REASONING_MODEL` | `qwen3.5-instruct-reasoning` | Name of the instruct-reasoning model (incoming request identifier) |
| `-enforce-sampling-params` | `QWEN35RP_ENFORCE_SAMPLING_PARAMS` | `false` | Enforce sampling parameters, overriding client-provided values |
By default, the proxy only sets sampling parameters that are not already present in the request. When `-enforce-sampling-params` is enabled, the proxy always overrides client-provided sampling parameters with the predefined values for the detected mode.
- `GET /v1/models`: Enriched (fetches backend models, validates served model, exposes 4 virtual models)
- `POST /v1/responses`: Converted (Responses API → Chat Completions, with full response conversion)
- `POST /v1/chat/completions`: Transformed (sampling params + thinking mode applied)
- `POST /v1/completions`: Model name validated and swapped (no sampling params or thinking mode; raw prompt completions bypass the chat template)
- `POST /tokenize`: Tokenization (prompt passthrough or messages with content/tools normalization)
- All other paths: Passed through unchanged to the backend
The proxy provides full compatibility with OpenAI's Responses API, converting requests and responses to/from the Chat Completions API format.
Why convert instead of forwarding to vLLM's `/v1/responses` endpoint?
Only vLLM's Chat Completions endpoint supports `chat_template_kwargs`, which is required to control Qwen's thinking mode (`enable_thinking=true/false`). By converting to Chat Completions, the proxy can properly configure thinking mode based on the selected profile.
| Feature | Streaming | Non-Streaming |
|---|---|---|
| Text generation | ✅ | ✅ |
| Reasoning/thinking content | ✅ | ✅ |
| Function/tool calls | ✅ | ✅ |
| Usage tracking (billing) | ✅ | ✅ |
| System instructions | ✅ | ✅ |
| Multimodal input (images) | ✅ | ✅ |
| Max output tokens / truncation | ✅ | ✅ |
The proxy emits standard Responses API streaming events:
- `response.created`, `response.in_progress`
- `response.output_item.added`, `response.output_item.done`
- `response.content_part.added`, `response.content_part.done`
- `response.output_text.delta`, `response.output_text.done`
- `response.reasoning_text.delta`, `response.reasoning_text.done` (thinking mode)
- `response.function_call_arguments.delta`, `response.function_call_arguments.done` (tool calls)
- `response.completed`
For full functionality, the vLLM backend should be started with the following flags:
```
--reasoning-parser=qwen3                                    # Required for thinking/reasoning mode
--enable-auto-tool-choice --tool-call-parser=qwen3_coder    # Required for tool/function calls
```

The proxy provides a `/tokenize` endpoint that forwards tokenization requests to vLLM's `/tokenize`. Two modes:
{"prompt": "..."}— raw text tokenization, forwarded as-is. No chat template is applied.{"messages": [...], "tools": [...]}— vLLM applies the model's chat template (apply_chat_template) then tokenizes the result. Individual messages and tools can use either Chat Completions or Responses API formats (e.g.input_textcontent parts, flat tool definitions); the proxy normalizes everything to Chat Completions format before forwarding, since that's whatapply_chat_templateexpects. Also supportsadd_generation_prompt,return_token_strs, andchat_template_kwargs.
`GET /health`: Returns `{"status":"healthy"}` for Docker health checks
The proxy supports the following log levels:
| Level | Description |
|---|---|
| `COMPLETE` | Most verbose; includes full HTTP request/response dumps |
| `DEBUG` | Debug information, including parameter application details |
| `INFO` | General operational information |
| `WARN` | Warning messages |
| `ERROR` | Error messages only |
When set to `COMPLETE`, the proxy logs full HTTP request and response bodies, which is useful for debugging but very verbose. Because those dumps include everything clients send and receive, the `COMPLETE` level exposes that data in plaintext. Enable it only in secure, non-production environments, or ensure logs are properly secured and retained only temporarily.
The proxy includes native systemd support for production deployments:
- Type: `notify` - The proxy signals readiness to systemd automatically
- Status Updates: Sends periodic status updates to systemd showing processed request counts
- Graceful Shutdown: Properly signals systemd when stopping
- Journald Logging: Structured logging output is compatible with journald
Example systemd unit file:
```ini
[Unit]
Description=Qwen 3.5 Reverse Proxy
After=network.target

[Service]
Type=notify
User=qwen35-rp
Group=qwen35-rp
ExecStart=/usr/local/bin/qwen35-rp -served-model "Qwen/Qwen3.5-397B-A17B-FP8" -thinking-general "qwen-thinking-general" -thinking-coding "qwen-thinking-coding" -instruct-general "qwen-instruct-general" -instruct-reasoning "qwen-instruct-reasoning"
Restart=on-failure
Environment=QWEN35RP_LOGLEVEL=INFO

[Install]
WantedBy=multi-user.target
```

Run the service as a dedicated, unprivileged system user (`qwen35-rp` in the unit above). Never run as root. Create the user with:

```shell
sudo useradd --system --no-create-home --shell /usr/sbin/nologin qwen35-rp
sudo chown qwen35-rp:qwen35-rp /usr/local/bin/qwen35-rp
```

The server supports graceful shutdown with a 3-minute timeout to allow in-flight requests to complete. Send SIGINT or SIGTERM to initiate shutdown. When running under systemd, the proxy automatically signals the service manager when ready and during shutdown.
MIT License - see LICENSE file for details.