qwen35-rp

Qwen 3.5 Reverse Proxy is a lightweight HTTP reverse proxy that automatically adjusts sampling parameters (temperature, top_p, etc.) and thinking mode based on one of four predefined profiles. It sits between your application and the backend LLM server serving Qwen 3.5 (e.g., vLLM). It also provides a /tokenize endpoint for tokenizing messages using the backend's tokenizer.

Core Functionality

This proxy's primary purpose is to:

  1. Accept requests for four virtual model names (configured via -thinking-general, -thinking-coding, -instruct-general, and -instruct-reasoning), rejecting all other model names with HTTP 400
  2. Set appropriate sampling parameters automatically based on one of four profiles (official Qwen-recommended values from Hugging Face):
    • Thinking mode for general tasks: temperature=1.0, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=1.5, repetition_penalty=1.0
    • Thinking mode for precise coding tasks: temperature=0.6, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=0.0, repetition_penalty=1.0
    • Instruct mode for general tasks: temperature=0.7, top_p=0.8, top_k=20, min_p=0.0, presence_penalty=1.5, repetition_penalty=1.0
    • Instruct mode for reasoning tasks: temperature=1.0, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=1.5, repetition_penalty=1.0
  3. Configure thinking mode by setting chat_template_kwargs.enable_thinking:
    • enable_thinking=true for thinking modes (general and coding)
    • enable_thinking=false for instruct modes (general and reasoning)
  4. Rewrite the model name to the actual backend model name (e.g., Qwen/Qwen3.5-397B-A17B-FP8) before forwarding to vLLM
  5. Fix vLLM response bugs where non-thinking, non-streaming responses incorrectly place content in reasoning_content or reasoning fields instead of content
  6. Enrich /v1/models endpoint by fetching backend models and exposing 4 virtual models with the same metadata (permissions, max_model_len, etc.)
  7. Provide OpenAI Responses API compatibility (/v1/responses) by converting requests to Chat Completions format and responses back to Responses format. This conversion is necessary because only vLLM's Chat Completions endpoint supports chat_template_kwargs, which is required to control Qwen's thinking mode (enable_thinking=true/false)
  8. Provide a /tokenize endpoint for tokenizing messages and counting tokens before making actual generation requests
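As an illustrative sketch, the per-request transformation described in steps 1–4 amounts to: look up the incoming (virtual) model name, fill in the profile's sampling parameters, set chat_template_kwargs.enable_thinking, and rewrite model to the backend name. The sketch below is Python (the proxy itself is Go), uses the default virtual model names from the Configuration section, and shows the default (non-enforcing) merge behavior.

```python
# Illustrative sketch only -- not the proxy's actual Go implementation.
PROFILES = {
    "qwen3.5-thinking-general":   {"temperature": 1.0, "top_p": 0.95, "top_k": 20, "min_p": 0.0,
                                   "presence_penalty": 1.5, "repetition_penalty": 1.0, "thinking": True},
    "qwen3.5-thinking-coding":    {"temperature": 0.6, "top_p": 0.95, "top_k": 20, "min_p": 0.0,
                                   "presence_penalty": 0.0, "repetition_penalty": 1.0, "thinking": True},
    "qwen3.5-instruct-general":   {"temperature": 0.7, "top_p": 0.8, "top_k": 20, "min_p": 0.0,
                                   "presence_penalty": 1.5, "repetition_penalty": 1.0, "thinking": False},
    "qwen3.5-instruct-reasoning": {"temperature": 1.0, "top_p": 0.95, "top_k": 20, "min_p": 0.0,
                                   "presence_penalty": 1.5, "repetition_penalty": 1.0, "thinking": False},
}

BACKEND_MODEL = "Qwen/Qwen3.5-397B-A17B-FP8"  # value of -served-model

def transform(request: dict) -> dict:
    profile = PROFILES.get(request.get("model"))
    if profile is None:
        raise ValueError("unknown model")  # proxy answers HTTP 400
    out = dict(request)
    for key, value in profile.items():
        if key != "thinking" and key not in out:  # default mode: fill only missing params
            out[key] = value
    out.setdefault("chat_template_kwargs", {})["enable_thinking"] = profile["thinking"]
    out["model"] = BACKEND_MODEL  # rewrite to the actual backend model name
    return out
```

For example, a request for `qwen3.5-thinking-coding` leaves the messages untouched but arrives at vLLM with `temperature=0.6`, `enable_thinking=true`, and the backend model name.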

Installation

Requirements: Go 1.24.2 or later

go build -o qwen35-rp .

Usage

./qwen35-rp \
  -target "http://127.0.0.1:8000" \
  -served-model "Qwen/Qwen3.5-397B-A17B-FP8" \
  -thinking-general "qwen-thinking-general" \
  -thinking-coding "qwen-thinking-coding" \
  -instruct-general "qwen-instruct-general" \
  -instruct-reasoning "qwen-instruct-reasoning"

Or using environment variables:

export QWEN35RP_TARGET="http://127.0.0.1:8000"
export QWEN35RP_SERVED_MODEL_NAME="Qwen/Qwen3.5-397B-A17B-FP8"
export QWEN35RP_THINKING_GENERAL_MODEL="qwen-thinking-general"
export QWEN35RP_THINKING_CODING_MODEL="qwen-thinking-coding"
export QWEN35RP_INSTRUCT_GENERAL_MODEL="qwen-instruct-general"
export QWEN35RP_INSTRUCT_REASONING_MODEL="qwen-instruct-reasoning"
./qwen35-rp

Configuration

Configure the proxy using command-line flags or environment variables:

| Flag | Environment Variable | Default | Description |
|---|---|---|---|
| `-listen` | `QWEN35RP_LISTEN` | `0.0.0.0` | IP address to listen on |
| `-port` | `QWEN35RP_PORT` | `9000` | Port to listen on |
| `-target` | `QWEN35RP_TARGET` | `http://127.0.0.1:8000` | Backend target URL |
| `-loglevel` | `QWEN35RP_LOGLEVEL` | `INFO` | Log level (COMPLETE, DEBUG, INFO, WARN, ERROR) |
| `-served-model` | `QWEN35RP_SERVED_MODEL_NAME` | (required) | Backend model name to use in outgoing requests |
| `-thinking-general` | `QWEN35RP_THINKING_GENERAL_MODEL` | `qwen3.5-thinking-general` | Name of the thinking-general model (incoming request identifier) |
| `-thinking-coding` | `QWEN35RP_THINKING_CODING_MODEL` | `qwen3.5-thinking-coding` | Name of the thinking-coding model (incoming request identifier) |
| `-instruct-general` | `QWEN35RP_INSTRUCT_GENERAL_MODEL` | `qwen3.5-instruct-general` | Name of the instruct-general model (incoming request identifier) |
| `-instruct-reasoning` | `QWEN35RP_INSTRUCT_REASONING_MODEL` | `qwen3.5-instruct-reasoning` | Name of the instruct-reasoning model (incoming request identifier) |
| `-enforce-sampling-params` | `QWEN35RP_ENFORCE_SAMPLING_PARAMS` | `false` | Enforce sampling parameters, overriding client-provided values |

Enforce Sampling Parameters

By default, the proxy only sets sampling parameters if they are not already present in the request. When -enforce-sampling-params is enabled, the proxy will always override client-provided sampling parameters with the predefined values for the detected mode.
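The two merge behaviors can be sketched as follows (illustrative Python, not the actual Go code): in default mode a parameter is applied only when the client did not send it; with enforcement on, the profile value always wins.

```python
# Sketch of the two merge behaviors for sampling parameters.
def apply_params(request: dict, profile: dict, enforce: bool) -> dict:
    out = dict(request)
    for key, value in profile.items():
        if enforce or key not in out:  # default: keep client values; enforce: override them
            out[key] = value
    return out
```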

Request Routing

  • GET /v1/models: Enriched (fetches backend models, validates served model, exposes 4 virtual models)
  • POST /v1/responses: Converted (Responses API → Chat Completions, with full response conversion)
  • POST /v1/chat/completions: Transformed (sampling params + thinking mode applied)
  • POST /v1/completions: Model name validated and swapped (no sampling params or thinking mode — raw prompt completions bypass the chat template)
  • POST /tokenize: Tokenization (prompt passthrough or messages with content/tools normalization)
  • All other paths: Passed through unchanged to the backend

Responses API Support

The proxy provides full compatibility with OpenAI's Responses API, converting requests and responses to/from the Chat Completions API format.

Why convert instead of forwarding to vLLM's /v1/responses endpoint?

Only vLLM's Chat Completions endpoint supports chat_template_kwargs, which is required to control Qwen's thinking mode (enable_thinking=true/false). By converting to Chat Completions, we can properly configure thinking mode based on the selected profile.
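A minimal sketch of what such a conversion involves, in Python rather than the proxy's Go: Responses API `instructions` map to a system message, `input` (string shorthand or message list) maps to chat messages, and `max_output_tokens` maps to `max_tokens`. This is a simplified illustration under those assumptions, not the proxy's full conversion logic (which also handles tools, multimodal content, and streaming).

```python
# Hypothetical, minimal Responses -> Chat Completions request conversion.
def responses_to_chat(req: dict, enable_thinking: bool) -> dict:
    messages = []
    if req.get("instructions"):                # maps to a system message
        messages.append({"role": "system", "content": req["instructions"]})
    inp = req.get("input")
    if isinstance(inp, str):                   # string shorthand form
        messages.append({"role": "user", "content": inp})
    else:
        for item in inp or []:                 # list of message items
            messages.append({"role": item["role"], "content": item["content"]})
    out = {"model": req["model"], "messages": messages,
           "chat_template_kwargs": {"enable_thinking": enable_thinking}}
    if "max_output_tokens" in req:             # Responses name -> Chat Completions name
        out["max_tokens"] = req["max_output_tokens"]
    return out
```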

Supported Features

The following features are supported in both streaming and non-streaming modes:

  • Text generation
  • Reasoning/thinking content
  • Function/tool calls
  • Usage tracking (billing)
  • System instructions
  • Multimodal input (images)
  • Max output tokens / truncation

Streaming Events

The proxy emits standard Responses API streaming events:

  • response.created, response.in_progress
  • response.output_item.added, response.output_item.done
  • response.content_part.added, response.content_part.done
  • response.output_text.delta, response.output_text.done
  • response.reasoning_text.delta, response.reasoning_text.done (thinking mode)
  • response.function_call_arguments.delta, response.function_call_arguments.done (tool calls)
  • response.completed
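For a simple streamed text answer (no thinking content, no tool calls), one plausible ordering of these events is sketched below; this is inferred from the list above and the usual Responses API event nesting, not taken from the proxy's code.

```python
# Plausible event order for a plain streamed text response (illustrative only).
def text_stream_events(num_deltas: int) -> list:
    events = ["response.created", "response.in_progress",
              "response.output_item.added", "response.content_part.added"]
    events += ["response.output_text.delta"] * num_deltas  # one per text chunk
    events += ["response.output_text.done", "response.content_part.done",
               "response.output_item.done", "response.completed"]
    return events
```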

vLLM Backend Requirements

For full functionality, the vLLM backend should be started with the following flags:

--reasoning-parser=qwen3                                  # Required for thinking/reasoning mode
--enable-auto-tool-choice --tool-call-parser=qwen3_coder  # Required for tool/function calls

Tokenize API

The proxy provides a /tokenize endpoint that forwards tokenization requests to vLLM's /tokenize. Two modes:

  • {"prompt": "..."} — raw text tokenization, forwarded as-is. No chat template is applied.
  • {"messages": [...], "tools": [...]} — vLLM applies the model's chat template (apply_chat_template) then tokenizes the result. Individual messages and tools can use either Chat Completions or Responses API formats (e.g. input_text content parts, flat tool definitions); the proxy normalizes everything to Chat Completions format before forwarding, since that's what apply_chat_template expects. Also supports add_generation_prompt, return_token_strs, and chat_template_kwargs.
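The content normalization in the messages mode can be sketched roughly as follows (illustrative Python, with the part-type names taken from the Responses and Chat Completions APIs; the proxy's actual Go normalization also covers tools and other shapes): structured content parts are flattened to the plain string form that apply_chat_template expects.

```python
# Hypothetical sketch: flatten Responses-style content parts to a plain string.
def normalize_content(content):
    if isinstance(content, str):
        return content  # already in Chat Completions string form
    parts = []
    for part in content:  # list of content parts
        if part.get("type") in ("input_text", "text"):
            parts.append(part.get("text", ""))
    return "".join(parts)
```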

Health Check

  • GET /health: Returns {"status":"healthy"} for Docker health checks

Log Levels

The proxy supports the following log levels:

| Level | Description |
|---|---|
| `COMPLETE` | Most verbose; includes full HTTP request/response dumps |
| `DEBUG` | Debug information, including parameter application details |
| `INFO` | General operational information |
| `WARN` | Warning messages |
| `ERROR` | Error messages only |

When set to COMPLETE, the proxy will log full HTTP request and response bodies, which is useful for debugging but very verbose.

⚠️ Privacy Warning: LLM requests often contain sensitive or personal data (conversation history, personal information, confidential content). The COMPLETE log level will expose all of this data in plaintext. Only enable it in secure, non-production environments, or ensure logs are properly secured and retained only briefly.

systemd Integration

The proxy includes native systemd support for production deployments:

  • Type: notify - The proxy signals readiness to systemd automatically
  • Status Updates: Sends periodic status updates to systemd showing processed request counts
  • Graceful Shutdown: Properly signals systemd when stopping
  • Journald Logging: Structured logging output is compatible with journald

Example systemd unit file:

[Unit]
Description=Qwen 3.5 Reverse Proxy
After=network.target

[Service]
Type=notify
User=qwen35-rp
Group=qwen35-rp
ExecStart=/usr/local/bin/qwen35-rp -served-model "Qwen/Qwen3.5-397B-A17B-FP8" -thinking-general "qwen-thinking-general" -thinking-coding "qwen-thinking-coding" -instruct-general "qwen-instruct-general" -instruct-reasoning "qwen-instruct-reasoning"
Restart=on-failure
Environment=QWEN35RP_LOGLEVEL=INFO

[Install]
WantedBy=multi-user.target

⚠️ Security Best Practice: Always run the proxy under a dedicated, unprivileged user account (e.g., qwen35-rp). Never run as root. Create the user with:

sudo useradd --system --no-create-home --shell /usr/sbin/nologin qwen35-rp
sudo chown qwen35-rp:qwen35-rp /usr/local/bin/qwen35-rp

Graceful Shutdown

The server supports graceful shutdown with a 3-minute timeout to allow in-flight requests to complete. Send SIGINT or SIGTERM to initiate shutdown. When running under systemd, the proxy will automatically signal the service manager when ready and during shutdown.

License

MIT License - see LICENSE file for details.
