qwen35-rp

Qwen 3.5 Reverse Proxy is a lightweight HTTP reverse proxy that automatically adjusts sampling parameters (temperature, top_p, etc.) and thinking mode based on one of four predefined profiles. It sits between your application and the backend LLM server serving Qwen 3.5 (e.g., vLLM). It also provides a /tokenize endpoint for tokenizing messages using the backend's tokenizer.

Core Functionality

This proxy's primary purpose is to:

  1. Accept requests for four virtual model names (configured via -thinking-general, -thinking-coding, -instruct-general, and -instruct-reasoning), rejecting all other model names with HTTP 400
  2. Set appropriate sampling parameters automatically based on one of four profiles (official Qwen-recommended values from Hugging Face):
    • Thinking mode for general tasks: temperature=1.0, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=1.5, repetition_penalty=1.0
    • Thinking mode for precise coding tasks: temperature=0.6, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=0.0, repetition_penalty=1.0
    • Instruct mode for general tasks: temperature=0.7, top_p=0.8, top_k=20, min_p=0.0, presence_penalty=1.5, repetition_penalty=1.0
    • Instruct mode for reasoning tasks: temperature=1.0, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=1.5, repetition_penalty=1.0
  3. Configure thinking mode by setting chat_template_kwargs.enable_thinking:
    • enable_thinking=true for thinking modes (general and coding)
    • enable_thinking=false for instruct modes (general and reasoning)
  4. Rewrite the model name to the actual backend model name (e.g., Qwen/Qwen3.5-397B-A17B-FP8) before forwarding to vLLM
  5. Fix vLLM response bugs where non-thinking, non-streaming responses incorrectly place content in reasoning_content or reasoning fields instead of content
  6. Enrich /v1/models endpoint by fetching backend models and exposing 4 virtual models with the same metadata (permissions, max_model_len, etc.)
  7. Provide OpenAI Responses API compatibility (/v1/responses) by converting requests to Chat Completions format and responses back to Responses format. This conversion is necessary because only vLLM's Chat Completions endpoint supports chat_template_kwargs, which is required to control Qwen's thinking mode (enable_thinking=true/false)
  8. Provide a /tokenize endpoint for tokenizing messages and counting tokens before making actual generation requests
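As an illustrative sketch, the per-request transformation described in steps 1–4 amounts to: look up the incoming (virtual) model name, fill in the profile's sampling parameters, set chat_template_kwargs.enable_thinking, and rewrite model to the backend name. The sketch below is Python (the proxy itself is Go), uses the default virtual model names from the Configuration section, and shows the default (non-enforcing) merge behavior.

```python
# Illustrative sketch only -- not the proxy's actual Go implementation.
PROFILES = {
    "qwen3.5-thinking-general":   {"temperature": 1.0, "top_p": 0.95, "top_k": 20, "min_p": 0.0,
                                   "presence_penalty": 1.5, "repetition_penalty": 1.0, "thinking": True},
    "qwen3.5-thinking-coding":    {"temperature": 0.6, "top_p": 0.95, "top_k": 20, "min_p": 0.0,
                                   "presence_penalty": 0.0, "repetition_penalty": 1.0, "thinking": True},
    "qwen3.5-instruct-general":   {"temperature": 0.7, "top_p": 0.8, "top_k": 20, "min_p": 0.0,
                                   "presence_penalty": 1.5, "repetition_penalty": 1.0, "thinking": False},
    "qwen3.5-instruct-reasoning": {"temperature": 1.0, "top_p": 0.95, "top_k": 20, "min_p": 0.0,
                                   "presence_penalty": 1.5, "repetition_penalty": 1.0, "thinking": False},
}

BACKEND_MODEL = "Qwen/Qwen3.5-397B-A17B-FP8"  # value of -served-model

def transform(request: dict) -> dict:
    profile = PROFILES.get(request.get("model"))
    if profile is None:
        raise ValueError("unknown model")  # proxy answers HTTP 400
    out = dict(request)
    for key, value in profile.items():
        if key != "thinking" and key not in out:  # default mode: fill only missing params
            out[key] = value
    out.setdefault("chat_template_kwargs", {})["enable_thinking"] = profile["thinking"]
    out["model"] = BACKEND_MODEL  # rewrite to the actual backend model name
    return out
```

For example, a request for `qwen3.5-thinking-coding` leaves the messages untouched but arrives at vLLM with `temperature=0.6`, `enable_thinking=true`, and the backend model name.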

Installation

Requirements: Go 1.24.2 or later

go build -o qwen35-rp .

Usage

./qwen35-rp \
  -target "http://127.0.0.1:8000" \
  -served-model "Qwen/Qwen3.5-397B-A17B-FP8" \
  -thinking-general "qwen-thinking-general" \
  -thinking-coding "qwen-thinking-coding" \
  -instruct-general "qwen-instruct-general" \
  -instruct-reasoning "qwen-instruct-reasoning"

Or using environment variables:

export QWEN35RP_TARGET="http://127.0.0.1:8000"
export QWEN35RP_SERVED_MODEL_NAME="Qwen/Qwen3.5-397B-A17B-FP8"
export QWEN35RP_THINKING_GENERAL_MODEL="qwen-thinking-general"
export QWEN35RP_THINKING_CODING_MODEL="qwen-thinking-coding"
export QWEN35RP_INSTRUCT_GENERAL_MODEL="qwen-instruct-general"
export QWEN35RP_INSTRUCT_REASONING_MODEL="qwen-instruct-reasoning"
./qwen35-rp

Configuration

Configure the proxy using command-line flags or environment variables:

| Flag | Environment Variable | Default | Description |
|---|---|---|---|
| `-listen` | `QWEN35RP_LISTEN` | `0.0.0.0` | IP address to listen on |
| `-port` | `QWEN35RP_PORT` | `9000` | Port to listen on |
| `-target` | `QWEN35RP_TARGET` | `http://127.0.0.1:8000` | Backend target URL |
| `-loglevel` | `QWEN35RP_LOGLEVEL` | `INFO` | Log level (COMPLETE, DEBUG, INFO, WARN, ERROR) |
| `-served-model` | `QWEN35RP_SERVED_MODEL_NAME` | (required) | Backend model name to use in outgoing requests |
| `-thinking-general` | `QWEN35RP_THINKING_GENERAL_MODEL` | `qwen3.5-thinking-general` | Name of the thinking-general model (incoming request identifier) |
| `-thinking-coding` | `QWEN35RP_THINKING_CODING_MODEL` | `qwen3.5-thinking-coding` | Name of the thinking-coding model (incoming request identifier) |
| `-instruct-general` | `QWEN35RP_INSTRUCT_GENERAL_MODEL` | `qwen3.5-instruct-general` | Name of the instruct-general model (incoming request identifier) |
| `-instruct-reasoning` | `QWEN35RP_INSTRUCT_REASONING_MODEL` | `qwen3.5-instruct-reasoning` | Name of the instruct-reasoning model (incoming request identifier) |
| `-enforce-sampling-params` | `QWEN35RP_ENFORCE_SAMPLING_PARAMS` | `false` | Enforce sampling parameters, overriding client-provided values |

Enforce Sampling Parameters

By default, the proxy only sets sampling parameters if they are not already present in the request. When -enforce-sampling-params is enabled, the proxy will always override client-provided sampling parameters with the predefined values for the detected mode.
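The two merge behaviors can be sketched as follows (illustrative Python, not the actual Go code): in default mode a parameter is applied only when the client did not send it; with enforcement on, the profile value always wins.

```python
# Sketch of the two merge behaviors for sampling parameters.
def apply_params(request: dict, profile: dict, enforce: bool) -> dict:
    out = dict(request)
    for key, value in profile.items():
        if enforce or key not in out:  # default: keep client values; enforce: override them
            out[key] = value
    return out
```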

Request Routing

  • GET /v1/models: Enriched (fetches backend models, validates served model, exposes 4 virtual models)
  • POST /v1/responses: Converted (Responses API → Chat Completions, with full response conversion)
  • POST /v1/chat/completions: Transformed (sampling params + thinking mode applied)
  • POST /v1/completions: Model name validated and swapped (no sampling params or thinking mode — raw prompt completions bypass the chat template)
  • POST /tokenize: Tokenization (prompt passthrough or messages with content/tools normalization)
  • All other paths: Passed through unchanged to the backend

Responses API Support

The proxy provides full compatibility with OpenAI's Responses API, converting requests and responses to/from the Chat Completions API format.

Why convert instead of forwarding to vLLM's /v1/responses endpoint?

Only vLLM's Chat Completions endpoint supports chat_template_kwargs, which is required to control Qwen's thinking mode (enable_thinking=true/false). By converting to Chat Completions, we can properly configure thinking mode based on the selected profile.
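A minimal sketch of what such a conversion involves, in Python rather than the proxy's Go: Responses API `instructions` map to a system message, `input` (string shorthand or message list) maps to chat messages, and `max_output_tokens` maps to `max_tokens`. This is a simplified illustration under those assumptions, not the proxy's full conversion logic (which also handles tools, multimodal content, and streaming).

```python
# Hypothetical, minimal Responses -> Chat Completions request conversion.
def responses_to_chat(req: dict, enable_thinking: bool) -> dict:
    messages = []
    if req.get("instructions"):                # maps to a system message
        messages.append({"role": "system", "content": req["instructions"]})
    inp = req.get("input")
    if isinstance(inp, str):                   # string shorthand form
        messages.append({"role": "user", "content": inp})
    else:
        for item in inp or []:                 # list of message items
            messages.append({"role": item["role"], "content": item["content"]})
    out = {"model": req["model"], "messages": messages,
           "chat_template_kwargs": {"enable_thinking": enable_thinking}}
    if "max_output_tokens" in req:             # Responses name -> Chat Completions name
        out["max_tokens"] = req["max_output_tokens"]
    return out
```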

Supported Features

The following features are supported in both streaming and non-streaming modes:

  • Text generation
  • Reasoning/thinking content
  • Function/tool calls
  • Usage tracking (billing)
  • System instructions
  • Multimodal input (images)
  • Max output tokens / truncation

Streaming Events

The proxy emits standard Responses API streaming events:

  • response.created, response.in_progress
  • response.output_item.added, response.output_item.done
  • response.content_part.added, response.content_part.done
  • response.output_text.delta, response.output_text.done
  • response.reasoning_text.delta, response.reasoning_text.done (thinking mode)
  • response.function_call_arguments.delta, response.function_call_arguments.done (tool calls)
  • response.completed
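For a simple streamed text answer (no thinking content, no tool calls), one plausible ordering of these events is sketched below; this is inferred from the list above and the usual Responses API event nesting, not taken from the proxy's code.

```python
# Plausible event order for a plain streamed text response (illustrative only).
def text_stream_events(num_deltas: int) -> list:
    events = ["response.created", "response.in_progress",
              "response.output_item.added", "response.content_part.added"]
    events += ["response.output_text.delta"] * num_deltas  # one per text chunk
    events += ["response.output_text.done", "response.content_part.done",
               "response.output_item.done", "response.completed"]
    return events
```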

vLLM Backend Requirements

For full functionality, the vLLM backend should be started with the following flags:

--reasoning-parser=qwen3                                  # Required for thinking/reasoning mode
--enable-auto-tool-choice --tool-call-parser=qwen3_coder  # Required for tool/function calls

Tokenize API

The proxy provides a /tokenize endpoint that forwards tokenization requests to vLLM's /tokenize. Two modes:

  • {"prompt": "..."} — raw text tokenization, forwarded as-is. No chat template is applied.
  • {"messages": [...], "tools": [...]} — vLLM applies the model's chat template (apply_chat_template) then tokenizes the result. Individual messages and tools can use either Chat Completions or Responses API formats (e.g. input_text content parts, flat tool definitions); the proxy normalizes everything to Chat Completions format before forwarding, since that's what apply_chat_template expects. Also supports add_generation_prompt, return_token_strs, and chat_template_kwargs.
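The content normalization in the messages mode can be sketched roughly as follows (illustrative Python, with the part-type names taken from the Responses and Chat Completions APIs; the proxy's actual Go normalization also covers tools and other shapes): structured content parts are flattened to the plain string form that apply_chat_template expects.

```python
# Hypothetical sketch: flatten Responses-style content parts to a plain string.
def normalize_content(content):
    if isinstance(content, str):
        return content  # already in Chat Completions string form
    parts = []
    for part in content:  # list of content parts
        if part.get("type") in ("input_text", "text"):
            parts.append(part.get("text", ""))
    return "".join(parts)
```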

Health Check

  • GET /health: Returns {"status":"healthy"} for Docker health checks

Log Levels

The proxy supports the following log levels:

| Level | Description |
|---|---|
| `COMPLETE` | Most verbose; includes full HTTP request/response dumps |
| `DEBUG` | Debug information, including parameter application details |
| `INFO` | General operational information |
| `WARN` | Warning messages |
| `ERROR` | Error messages only |

When set to COMPLETE, the proxy will log full HTTP request and response bodies, which is useful for debugging but very verbose.

⚠️ Privacy Warning: LLM requests often contain sensitive or personal data (conversation history, personal information, confidential content). The COMPLETE log level will expose all of this data in plaintext. Only enable it in secure, non-production environments, or ensure logs are properly secured and retained only briefly.

systemd Integration

The proxy includes native systemd support for production deployments:

  • Type: notify - The proxy signals readiness to systemd automatically
  • Status Updates: Sends periodic status updates to systemd showing processed request counts
  • Graceful Shutdown: Properly signals systemd when stopping
  • Journald Logging: Structured logging output is compatible with journald

Example systemd unit file:

[Unit]
Description=Qwen 3.5 Reverse Proxy
After=network.target

[Service]
Type=notify
User=qwen35-rp
Group=qwen35-rp
ExecStart=/usr/local/bin/qwen35-rp -served-model "Qwen/Qwen3.5-397B-A17B-FP8" -thinking-general "qwen-thinking-general" -thinking-coding "qwen-thinking-coding" -instruct-general "qwen-instruct-general" -instruct-reasoning "qwen-instruct-reasoning"
Restart=on-failure
Environment=QWEN35RP_LOGLEVEL=INFO

[Install]
WantedBy=multi-user.target

⚠️ Security Best Practice: Always run the proxy under a dedicated, unprivileged user account (e.g., qwen35-rp). Never run as root. Create the user with:

sudo useradd --system --no-create-home --shell /usr/sbin/nologin qwen35-rp
sudo chown qwen35-rp:qwen35-rp /usr/local/bin/qwen35-rp

Graceful Shutdown

The server supports graceful shutdown with a 3-minute timeout to allow in-flight requests to complete. Send SIGINT or SIGTERM to initiate shutdown. When running under systemd, the proxy will automatically signal the service manager when ready and during shutdown.

License

MIT License - see LICENSE file for details.
