From e582df68ca1c87adda668488026fb1193c58153b Mon Sep 17 00:00:00 2001
From: Hani Cierlak
Date: Wed, 7 Jan 2026 16:33:47 -0800
Subject: [PATCH 1/3] Add deployment guide for Qwen3 models on SageMaker

Signed-off-by: Hani Cierlak
---
 Qwen/Qwen3_SageMaker.md | 202 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 202 insertions(+)
 create mode 100644 Qwen/Qwen3_SageMaker.md

diff --git a/Qwen/Qwen3_SageMaker.md b/Qwen/Qwen3_SageMaker.md
new file mode 100644
index 00000000..dd269c5e
--- /dev/null
+++ b/Qwen/Qwen3_SageMaker.md
@@ -0,0 +1,202 @@
# Qwen3-30B-A3B SageMaker Deployment

This guide provides comprehensive instructions for deploying the Qwen3-30B-A3B model on AWS SageMaker using vLLM and Docker. This deployment setup also works for:
- Qwen3-30B-A3B-Instruct-2507
- Qwen3-VL-30B-A3B-Instruct

## Hardware Requirements

### GPU Requirements

- **GPU Architecture**: 8-GPU instance with compute capability >= 8.0 (e.g., NVIDIA A100, H100)
- **Memory**: Sufficient VRAM to hold the unquantized (BF16) model weights
  - 30.5B parameters -> roughly 61 GB for the weights alone (2 bytes per parameter)
  - Rule of thumb: total VRAM ~ (parameters in B x 2 GB) x 1.2, i.e., weights plus ~20% overhead; for this model, 30.5 x 2 x 1.2 ~ 73 GB before the KV cache
  - KV cache -> 10-30% additional memory depending on sequence length and batch size
  - Computation buffers, runtime overhead, and concurrent users -> additional memory
  - [VRAM calculator](https://apxml.com/tools/vram-calculator)

### AWS Instance Types

#### Endpoint Instance
- **Instance Type**: `ml.g6.48xlarge`
- **GPU Count**: 8 GPUs
- **Purpose**: Model inference endpoint
- **Justification**:
  - Provides adequate GPU memory for the full 30B parameter model
  - Qwen3 30B models use a GQA architecture with 4 KV heads. On a 4-GPU instance this creates a 1:1 mapping of KV heads to GPUs, which can lead to OOM or memory fragmentation issues. We use an 8-GPU instance with tensor parallelism = 8 to avoid this.

#### Notebook Instance (for deployment)
- **Instance Type**: `ml.t3.medium`
- **Environment**: SageMaker Notebook Instance (not Studio)

## Deployment Strategy

### Prerequisites

#### AWS IAM Permissions
Ensure your IAM role has the following permissions:
- `AmazonEC2ContainerRegistryFullAccess`
- `AmazonS3FullAccess`
- `AmazonSageMakerFullAccess`

#### Dependencies

Install required Python packages:
```bash
pip install -U sagemaker boto3 awscli
```

#### Docker Setup
- Update vLLM to the latest version in your Dockerfile
- Use the official vLLM SageMaker entrypoint script
  - Reference: [vLLM SageMaker Entrypoint](https://docs.vllm.ai/en/stable/examples/online_serving/sagemaker-entrypoint/)

Base Dockerfile:
```dockerfile
FROM vllm/vllm-openai:v0.11.2

COPY ./sagemaker-entrypoint.sh /app/
RUN chmod +x /app/sagemaker-entrypoint.sh

ENTRYPOINT ["/app/sagemaker-entrypoint.sh"]
```

### Deployment Steps

#### 1. ECR Repository Setup
Create an Amazon ECR repository to store your Docker image:
```bash
aws ecr create-repository --repository-name <repository-name> --region <region>
```

#### 2. Build Docker Image
Build the Docker image with the latest vLLM version:
```bash
docker build --build-arg VERSION=latest -t <image-name>:latest .
```

#### 3. Push Image to ECR
Authenticate and push the image to ECR:
```bash
aws ecr get-login-password --region <region> | docker login --username AWS --password-stdin <account-id>.dkr.ecr.<region>.amazonaws.com
docker tag <image-name>:latest <account-id>.dkr.ecr.<region>.amazonaws.com/<repository-name>:latest
docker push <account-id>.dkr.ecr.<region>.amazonaws.com/<repository-name>:latest
```
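Before configuring and deploying the model, set up the session objects and names used in the snippets below. This is a minimal sketch: it assumes you are running in a SageMaker notebook with an attached execution role, and the image URI, model name, and endpoint name are illustrative placeholders.

```python
import sagemaker

# SageMaker session and the IAM execution role the endpoint will use
sagemaker_session = sagemaker.Session()
iam_role = sagemaker.get_execution_role()

# URI of the image pushed in step 3 (placeholder values)
CONTAINER = "<account-id>.dkr.ecr.<region>.amazonaws.com/<repository-name>:latest"

# Names for the SageMaker model and endpoint (illustrative)
model_name = "qwen3-30b-a3b-vllm"
endpoint_name = "qwen3-30b-a3b-endpoint"
```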
#### 4. Configure vLLM Parameters

The deployment uses environment variables with the `SM_VLLM_` prefix to configure vLLM:

```python
VLLM_ENV = {
    'SM_VLLM_MODEL': 'Qwen/Qwen3-30B-A3B',
    'SM_VLLM_TENSOR_PARALLEL_SIZE': '8',
    'SM_VLLM_MAX_MODEL_LEN': '32768',
    'SM_VLLM_MAX_NUM_SEQS': '128',
    'SM_VLLM_GPU_MEMORY_UTILIZATION': '0.9',
}
```

**Configuration Details**:
- **Tensor Parallel Size**: `8` (matches the 8 GPUs on ml.g6.48xlarge)
- **Max Model Length**: `32768` tokens (adjust lower if experiencing memory issues)
- **Max Num Sequences**: `128` (maximum number of sequences to process concurrently)
- **GPU Memory Utilization**: `0.9` (90% - adjust lower if needed)

For full vLLM configuration options, see:
- [vLLM Environment Variables](https://docs.vllm.ai/en/stable/configuration/env_vars/)
- [vLLM Engine Arguments](https://docs.vllm.ai/en/v0.4.1/models/engine_args.html)

#### 5. Deploy to SageMaker

Create and deploy the SageMaker model:
```python
model = sagemaker.Model(
    name=model_name,
    image_uri=CONTAINER,
    sagemaker_session=sagemaker_session,
    role=iam_role,
    env=VLLM_ENV,
)

predictor = model.deploy(
    instance_type='ml.g6.48xlarge',
    initial_instance_count=1,
    endpoint_name=endpoint_name,
    container_startup_health_check_timeout=900  # Adjust as needed
)
```

**Note**: The `container_startup_health_check_timeout` is set to 900 seconds (15 minutes) to allow sufficient time for the large model to load. Adjust this value based on your needs.

## Invocation

### Example Inference Request

Use the SageMaker Runtime client to invoke the endpoint:

```python
import json
import boto3

# Define the payload
payload = {
    "model": "Qwen/Qwen3-30B-A3B",
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Hi, how are you doing?"
                }
            ]
        }
    ],
    "temperature": 0.7,
    "max_tokens": 100,
    "stream": False
}

# Invoke the endpoint
sagemaker_runtime = boto3.client('sagemaker-runtime', region_name='<region>')
response = sagemaker_runtime.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType='application/json',
    Body=json.dumps(payload)
)

# Parse response
response_body = json.loads(response['Body'].read().decode())
print(response_body)
```
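Because the container wraps vLLM's OpenAI-compatible server, the endpoint returns an OpenAI-style chat completion object. A short sketch of extracting the generated text, assuming the standard chat-completions response shape:

```python
# Pull the generated message out of the OpenAI-style response
# (assumes the standard "choices" shape returned by vLLM)
answer = response_body["choices"][0]["message"]["content"]
print(answer)

# Token accounting, if present, is reported under the "usage" key
print(response_body.get("usage"))
```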
## Resource Cleanup

To avoid ongoing charges, delete the deployed resources:

```python
sagemaker_client = boto3.client('sagemaker', region_name='<region>')

# Delete model
sagemaker_client.delete_model(ModelName=model_name)

# Delete endpoint
sagemaker_client.delete_endpoint(EndpointName=endpoint_name)

# Delete endpoint configuration
sagemaker_client.delete_endpoint_config(EndpointConfigName=endpoint_name)
```

## Troubleshooting

### Memory Issues
If you encounter out-of-memory errors:
1. Reduce `SM_VLLM_MAX_MODEL_LEN` (e.g., from 4096 to 2048)
2. Lower `SM_VLLM_GPU_MEMORY_UTILIZATION` (e.g., from 0.8 to 0.7)

### Container Startup Timeout
The `container_startup_health_check_timeout` is set to 900 seconds. If deployment fails due to timeout:
- Increase this value in the `model.deploy()` call
- Check CloudWatch logs for detailed error messages

From 50913d80967b601b81168131aed6bca3e721de21 Mon Sep 17 00:00:00 2001
From: Hani Cierlak
Date: Wed, 7 Jan 2026 16:35:09 -0800
Subject: [PATCH 2/3] Add Qwen3 FP8 quantized SageMaker deployment guide

Signed-off-by: Hani Cierlak
---
 Qwen/Qwen3FP8_SageMaker.md | 219 +++++++++++++++++++++++++++++++++++++
 1 file changed, 219 insertions(+)
 create mode 100644 Qwen/Qwen3FP8_SageMaker.md

diff --git a/Qwen/Qwen3FP8_SageMaker.md b/Qwen/Qwen3FP8_SageMaker.md
new file mode 100644
index 00000000..4362069c
--- /dev/null
+++ b/Qwen/Qwen3FP8_SageMaker.md
@@ -0,0 +1,219 @@
# Qwen3-30B-A3B-Instruct-2507-FP8 SageMaker Deployment

This guide provides comprehensive instructions for deploying the Qwen3-30B-A3B-Instruct-2507-FP8 model on AWS SageMaker using vLLM and Docker.

## Hardware Requirements

### GPU Requirements for FP8 Models

The FP8 version of Qwen3 requires specific GPU capabilities:

- **Recommended GPUs**: NVIDIA GPUs with compute capability >= 8.9
  - Ada Lovelace architecture
  - Hopper architecture
  - Later GPU generations
  - These GPUs run FP8 models as **w8a8** (8-bit weights and activations)

- **Alternative GPUs**: Ampere cards (with vLLM v0.9.0+)
  - Supported via FP8 Marlin block-wise quantization
  - Run as **w8a16** (8-bit weights, 16-bit activations)

> **Note**: The FP8 models use block-wise quantization. For detailed GPU selection and limitations, see the [vLLM FP8 documentation](https://docs.vllm.ai/en/latest/features/quantization/fp8/).

### AWS Instance Types

#### Endpoint Instance
- **Instance Type**: `ml.g6.48xlarge`
- **GPU Count**: 8 GPUs
- **Purpose**: Model inference endpoint
- **Justification**:
  - Uses NVIDIA L4 GPUs (Ada Lovelace), meeting the FP8 architecture requirement
  - Qwen3 30B models use a GQA architecture with 4 KV heads. On a 4-GPU instance this creates a 1:1 mapping of KV heads to GPUs, which can lead to OOM or memory fragmentation issues. We use an 8-GPU instance to avoid this.

#### Notebook Instance (for deployment)
- **Instance Type**: `ml.t3.medium`
- **Environment**: SageMaker Notebook Instance (not Studio)

## Deployment Strategy

### Prerequisites

#### AWS IAM Permissions
Ensure your IAM role has the following permissions:
- `AmazonEC2ContainerRegistryFullAccess`
- `AmazonS3FullAccess`
- `AmazonSageMakerFullAccess`

#### Dependencies

Install required Python packages:
```bash
pip install -U sagemaker boto3 awscli
```

#### Docker Setup
- Update vLLM to the latest version in your Dockerfile
- Use the official vLLM SageMaker entrypoint script
  - Reference: [vLLM SageMaker Entrypoint](https://docs.vllm.ai/en/stable/examples/online_serving/sagemaker-entrypoint/)

Base Dockerfile:
```dockerfile
FROM vllm/vllm-openai:v0.11.2

COPY ./sagemaker-entrypoint.sh /app/
RUN chmod +x /app/sagemaker-entrypoint.sh

ENTRYPOINT ["/app/sagemaker-entrypoint.sh"]
```

### Deployment Steps

#### 1. ECR Repository Setup
Create an Amazon ECR repository to store your Docker image:
```bash
aws ecr create-repository --repository-name <repository-name> --region <region>
```

#### 2. Build Docker Image
Build the Docker image with the latest vLLM version:
```bash
docker build --build-arg VERSION=latest -t <image-name>:latest .
```
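Optionally, before pushing, you can smoke-test the container wiring on a machine with a GPU by pointing it at a small stand-in model. This is a hedged sketch: the stand-in model name is illustrative, and it assumes the entrypoint exposes SageMaker's standard port 8080 and `/ping` health route, as described in the vLLM SageMaker entrypoint reference above.

```bash
# Hypothetical local smoke test with a small stand-in model
docker run --rm --gpus all -p 8080:8080 \
  -e SM_VLLM_MODEL="Qwen/Qwen3-0.6B" \
  <image-name>:latest

# In another shell: /ping should return HTTP 200 once the model has loaded
curl -i http://localhost:8080/ping
```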
#### 3. Push Image to ECR
Authenticate and push the image to ECR:
```bash
aws ecr get-login-password --region <region> | docker login --username AWS --password-stdin <account-id>.dkr.ecr.<region>.amazonaws.com
docker tag <image-name>:latest <account-id>.dkr.ecr.<region>.amazonaws.com/<repository-name>:latest
docker push <account-id>.dkr.ecr.<region>.amazonaws.com/<repository-name>:latest
```

#### 4. Configure vLLM Parameters

The deployment uses environment variables with the `SM_VLLM_` prefix to configure vLLM:

```python
VLLM_ENV = {
    'SM_VLLM_MODEL': 'Qwen/Qwen3-30B-A3B-Instruct-2507-FP8',
    'SM_VLLM_TENSOR_PARALLEL_SIZE': '8',
    'SM_VLLM_MAX_MODEL_LEN': '4096',
    'SM_VLLM_GPU_MEMORY_UTILIZATION': '0.8',
    'SM_VLLM_ENABLE_EXPERT_PARALLEL': 'true'
}
```

**Configuration Details**:
- **Tensor Parallel Size**: `8` (matches the 8 GPUs on ml.g6.48xlarge)
- **Max Model Length**: `4096` tokens (adjust lower if experiencing memory issues)
- **GPU Memory Utilization**: `0.8` (80% - adjust lower if needed)
- **Expert Parallel**: `true` (required for compatibility with FP8 block-wise quantization)

For full vLLM configuration options, see:
- [vLLM Environment Variables](https://docs.vllm.ai/en/stable/configuration/env_vars/)
- [vLLM Engine Arguments](https://docs.vllm.ai/en/v0.4.1/models/engine_args.html)

#### 5. Deploy to SageMaker

Create and deploy the SageMaker model:
```python
model = sagemaker.Model(
    name=model_name,
    image_uri=CONTAINER,
    sagemaker_session=sagemaker_session,
    role=iam_role,
    env=VLLM_ENV,
)

predictor = model.deploy(
    instance_type='ml.g6.48xlarge',
    initial_instance_count=1,
    endpoint_name=endpoint_name,
    container_startup_health_check_timeout=450
)
```
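By default, `model.deploy()` blocks until the endpoint is in service. If you deploy with `wait=False`, or want to watch progress from another session, you can poll the endpoint status with the SageMaker `describe_endpoint` API. A small sketch (region placeholder as above):

```python
import time
import boto3

sagemaker_client = boto3.client('sagemaker', region_name='<region>')

# Poll until the endpoint leaves the 'Creating' state
while True:
    status = sagemaker_client.describe_endpoint(EndpointName=endpoint_name)['EndpointStatus']
    print(f"Endpoint status: {status}")
    if status != 'Creating':
        break
    time.sleep(30)
# 'InService' means the endpoint is ready; 'Failed' means check CloudWatch logs
```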
+``` + +Solutions: +- See if reduced tensor parallel size works: `'SM_VLLM_TENSOR_PARALLEL_SIZE': '4'` +- Ensure expert parallel is enabled: `'SM_VLLM_ENABLE_EXPERT_PARALLEL': 'true'` + +### Container Startup Timeout +The `container_startup_health_check_timeout` is set to 450 seconds. If deployment fails due to timeout: +- Increase this value in the `model.deploy()` call +- Check CloudWatch logs for detailed error messages From 792ab79fe289b1d6a4ad300314adec42f054aead Mon Sep 17 00:00:00 2001 From: Hani Cierlak Date: Wed, 7 Jan 2026 16:39:34 -0800 Subject: [PATCH 3/3] Update Qwen/Qwen3_SageMaker.md Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Hani Cierlak --- Qwen/Qwen3_SageMaker.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/Qwen/Qwen3_SageMaker.md b/Qwen/Qwen3_SageMaker.md index dd269c5e..3e5149de 100644 --- a/Qwen/Qwen3_SageMaker.md +++ b/Qwen/Qwen3_SageMaker.md @@ -193,8 +193,8 @@ sagemaker_client.delete_endpoint_config(EndpointConfigName=endpoint_name) ### Memory Issues If you encounter out-of-memory errors: -1. Reduce `SM_VLLM_MAX_MODEL_LEN` (e.g., from 4096 to 2048) -2. Lower `SM_VLLM_GPU_MEMORY_UTILIZATION` (e.g., from 0.8 to 0.7) +1. Reduce `SM_VLLM_MAX_MODEL_LEN` (e.g., from 32768 to 16384) +2. Lower `SM_VLLM_GPU_MEMORY_UTILIZATION` (e.g., from 0.9 to 0.8) ### Container Startup Timeout The `container_startup_health_check_timeout` is set to 900 seconds. If deployment fails due to timeout: