# Qwen3.5-35B-A3B: Production Deployment on GB10 Grace Blackwell

Qwen3.5-35B-A3B is Qwen's agentic coding model, with native tool calling and a 192K-token context window. This guide covers production deployment on the NVIDIA GB10 Grace Blackwell Superchip.

## Why Qwen3.5-35B-A3B?

The A3B variant is specifically optimized for:

- Agentic Coding: Native support for function calling and tool use
- Extended Context: 192K token context window for complex codebases
- Efficient Inference: MoE (Mixture of Experts) architecture with 35B total parameters but only 3B active per token
- Production Ready: Optimized for deployment with vLLM

## Hardware Requirements

This guide is optimized for the NVIDIA GB10 Grace Blackwell Superchip:

| Requirement | Specification |
| ----------- | ------------- |
| GPU Memory | 128 GB LPDDR5X (minimum) |
| Architecture | Blackwell with 5th Gen Tensor Cores |
| AI Performance | 1,000 TOPS FP4 available |
| Storage | ~70 GB for model weights |

## Docker Compose Configuration

```yaml
services:
  vllm-qwen35:
    image: vllm-node-tf5-latest:latest
    container_name: vllm-qwen35-a3b
    restart: unless-stopped
    runtime: nvidia
    ports:
      - "8000:8000"
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
      - ~/.cache/vllm:/root/.cache/vllm
      - ~/.cache/flashinfer:/root/.cache/flashinfer
      - ~/.triton:/root/.triton
    ipc: host
    shm_size: 64g
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
      - VLLM_ATTENTION_BACKEND=FLASH_ATTN
      - VLLM_USE_DEEP_GEMM=0
      - VLLM_TORCH_COMPILE_LEVEL=0
      - 'VLLM_CHAT_TEMPLATE_KWARGS={"enable_thinking": true}'
    command:
      - bash
      - -c
      - |
        vllm serve Qwen/Qwen3.5-35B-A3B \
          --port 8000 \
          --host 0.0.0.0 \
          --max-model-len 196608 \
          --max-num-batched-tokens 8192 \
          --gpu-memory-utilization 0.85 \
          --trust-remote-code \
          --enable-auto-tool-choice \
          --tool-call-parser qwen3_coder \
          --load-format fastsafetensors \
          -tp 1
    logging:
      driver: "json-file"
      options:
        max-size: "100m"
        max-file: "10"
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 5
      start_period: 600s
```

## Configuration Parameters Explained

| Parameter | Value | Purpose |
| ----------- | ------- | --------- |
| `--max-model-len` | 196608 | 192K context window for large codebases |
| `--gpu-memory-utilization` | 0.85 | 85% GPU memory for model + KV cache |
| `--enable-auto-tool-choice` | flag | Enable automatic tool selection |
| `--tool-call-parser` | qwen3_coder | Parser for Qwen's tool calling format |
| `--load-format` | fastsafetensors | Fast model loading |
| `-tp 1` | 1 | Tensor parallelism (single GPU) |

## Thinking Mode Configuration

Qwen3.5 models support extended thinking mode for complex reasoning:

```yaml
environment:
  - 'VLLM_CHAT_TEMPLATE_KWARGS={"enable_thinking": true}'
```

For simpler tasks where reasoning output is unnecessary:

```yaml
environment:
  - 'VLLM_CHAT_TEMPLATE_KWARGS={"enable_thinking": false}'
```
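The setting above is server-wide. vLLM's OpenAI-compatible server also accepts a `chat_template_kwargs` field in the request body, so thinking can be toggled per request instead. A minimal sketch, assuming the server from this guide is reachable at `localhost:8000`; the helper name `build_chat_payload` is hypothetical and not part of any library:

```python
import json


def build_chat_payload(prompt: str, enable_thinking: bool) -> dict:
    """Build a /v1/chat/completions body that sets thinking mode for one request."""
    return {
        "model": "Qwen/Qwen3.5-35B-A3B",
        "messages": [{"role": "user", "content": prompt}],
        # chat_template_kwargs is passed through to the model's chat template
        "chat_template_kwargs": {"enable_thinking": enable_thinking},
    }


if __name__ == "__main__":
    payload = build_chat_payload("Rename this variable consistently", enable_thinking=False)
    print(json.dumps(payload, indent=2))
    # To send it (requires a running server and the requests package):
    # requests.post("http://localhost:8000/v1/chat/completions", json=payload, timeout=300)
```

With the OpenAI SDK, the same field can be supplied through `extra_body={"chat_template_kwargs": {"enable_thinking": False}}`.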

## Tool Calling Support

Qwen3.5-35B-A3B excels at agentic coding with native tool calling:

```python
tools = [
    {
        "type": "function",
        "function": {
            "name": "read_file",
            "description": "Read the contents of a file",
            "parameters": {
                "type": "object",
                "properties": {
                    "file_path": {
                        "type": "string",
                        "description": "Path to the file to read"
                    }
                },
                "required": ["file_path"]
            }
        }
    }
]
```

### Example API Request with Tools

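```python
import requests

# `tools` is the list defined in the previous block
response = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "Qwen/Qwen3.5-35B-A3B",
        "messages": [
            {"role": "user", "content": "Read the main.py file and explain what it does"}
        ],
        "tools": tools,
        "tool_choice": "auto"
    },
    timeout=300,  # long prompts can take a while to process
)
response.raise_for_status()

# The assistant message carries any tool calls the model decided to make
print(response.json()["choices"][0]["message"])
```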
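The request above only asks the model which tool to call; the client still has to run the tool and send the result back. A minimal sketch of that step, assuming the response follows the OpenAI-style `tool_calls` format. The names `TOOL_REGISTRY` and `run_tool_calls` are hypothetical helpers for illustration:

```python
import json
from pathlib import Path


def read_file(file_path: str) -> str:
    """Local implementation of the read_file tool declared in the schema."""
    return Path(file_path).read_text(encoding="utf-8")


# Hypothetical registry mapping tool names to Python callables
TOOL_REGISTRY = {"read_file": read_file}


def run_tool_calls(message: dict, registry: dict) -> list:
    """Execute each tool call in an assistant message and return 'tool' role replies."""
    replies = []
    for call in message.get("tool_calls") or []:
        name = call["function"]["name"]
        # The OpenAI-style format carries arguments as a JSON-encoded string
        args = json.loads(call["function"]["arguments"] or "{}")
        handler = registry.get(name)
        if handler is None:
            result = f"Unknown tool: {name}"
        else:
            try:
                result = handler(**args)
            except Exception as exc:  # report failures back to the model
                result = f"Tool error: {exc}"
        replies.append(
            {"role": "tool", "tool_call_id": call["id"], "content": str(result)}
        )
    return replies
```

Given `message = response.json()["choices"][0]["message"]`, append `message` and the output of `run_tool_calls(message, TOOL_REGISTRY)` to `messages`, then send a second request so the model can produce its final answer.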

## Performance Optimization

### Memory Tuning for GB10

```bash
# Conservative (leaves room for other processes)
--gpu-memory-utilization 0.70

# Balanced (recommended)
--gpu-memory-utilization 0.85

# Aggressive (maximize KV cache)
--gpu-memory-utilization 0.95
```
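To get a feel for what each setting leaves for the KV cache, the arithmetic can be sketched as below. This is a rough estimate only: it assumes the ~70 GB weight figure from the hardware table also approximates the in-memory footprint, and it ignores activation memory and whatever the OS and other processes hold. The function name `kv_cache_headroom_gb` is hypothetical:

```python
TOTAL_MEMORY_GB = 128.0  # GB10 memory, from the hardware table
WEIGHTS_GB = 70.0        # approximate weight size (assumed equal to the on-disk figure)


def kv_cache_headroom_gb(utilization: float,
                         total_gb: float = TOTAL_MEMORY_GB,
                         weights_gb: float = WEIGHTS_GB) -> float:
    """Rough memory left for KV cache after loading weights, in GB."""
    budget = utilization * total_gb  # what --gpu-memory-utilization allows vLLM to use
    return round(budget - weights_gb, 1)


if __name__ == "__main__":
    for util in (0.70, 0.85, 0.95):
        print(f"{util:.2f} -> ~{kv_cache_headroom_gb(util)} GB for KV cache")
```

Under these assumptions the three settings leave roughly 19.6 GB, 38.8 GB, and 51.6 GB respectively.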

### Batch Size Optimization

```bash
# For many small requests
--max-num-seqs 64
--max-num-batched-tokens 4096

# For fewer large requests (code analysis)
--max-num-seqs 16
--max-num-batched-tokens 16384
```
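One way to reason about `--max-num-batched-tokens` is to count how many scheduler steps a single long prompt needs before generation can start. The sketch below assumes chunked prefill is enabled and that the prompt has the batch to itself, so treat the result as a lower bound rather than a measurement; `prefill_steps` is a hypothetical helper:

```python
import math


def prefill_steps(prompt_tokens: int, max_num_batched_tokens: int) -> int:
    """Minimum scheduler steps to prefill one prompt when prefill is chunked."""
    if prompt_tokens < 0 or max_num_batched_tokens <= 0:
        raise ValueError("prompt_tokens must be >= 0 and max_num_batched_tokens > 0")
    return math.ceil(prompt_tokens / max_num_batched_tokens)


if __name__ == "__main__":
    full_context = 196608  # the --max-model-len used in this guide
    for budget in (4096, 8192, 16384):
        print(f"{budget}: {prefill_steps(full_context, budget)} steps")
```

For a prompt that fills the whole 196608-token window, this gives 48 steps at 4096, 24 at 8192, and 12 at 16384.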

## Quick Start Commands

```bash
# Start server
docker compose up -d

# Check logs
docker logs vllm-qwen35-a3b --tail 50 -f

# Test connection
curl http://localhost:8000/v1/models

# Stop server
docker compose down
```
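Loading the weights can take several minutes, so a script that runs right after `docker compose up -d` may want to wait for the `/health` endpoint first. A minimal sketch using only the standard library; the function names and the 600-second default (mirroring the compose `start_period`) are illustrative assumptions:

```python
import time
import urllib.error
import urllib.request


def http_ok(url: str, timeout: float = 5.0) -> bool:
    """Return True if a GET to url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(url: str, deadline_s: float = 600.0, interval_s: float = 10.0,
                     probe=http_ok, sleep=time.sleep, clock=time.monotonic) -> bool:
    """Poll url until probe succeeds or deadline_s elapses."""
    start = clock()
    while True:
        if probe(url):
            return True
        if clock() - start >= deadline_s:
            return False
        sleep(interval_s)


if __name__ == "__main__":
    ready = wait_until_ready("http://localhost:8000/health")
    print("server ready" if ready else "server did not become healthy in time")
```

The `probe`, `sleep`, and `clock` parameters exist only so the loop can be exercised without a running server.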

## Using with OpenAI SDK

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"
)

response = client.chat.completions.create(
    model="Qwen/Qwen3.5-35B-A3B",
    messages=[
        {"role": "user", "content": "Refactor this code to use async/await"}
    ],
    max_tokens=2048
)

print(response.choices[0].message.content)
```

## Troubleshooting

### Model Loading Timeout

```yaml
healthcheck:
  start_period: 900s  # Increase to 15 minutes
```

### Memory Errors

```bash
--max-model-len 131072  # Reduce to 128K
--gpu-memory-utilization 0.75  # Reduce memory allocation
```

## Comparison: Qwen3.5 vs Qwen3

| Feature | Qwen3.5-35B-A3B | Qwen3-VL-30B |
| --------- | ----------------- | -------------- |
| Context Window | 192K | 128K |
| Active Parameters | 3B | 30B |
| Tool Calling | Native | Via parser |
| Thinking Mode | Built-in | Via template |
| Best For | Agentic coding | General purpose |

## References

- [Qwen3.5 Model Card](https://huggingface.co/Qwen/Qwen3.5-35B-A3B)
- [vLLM Documentation](https://docs.vllm.ai/)
- [NVIDIA DGX Spark Platform](https://build.nvidia.com/spark)

---

Deploying Qwen3.5-35B-A3B on the GB10 Grace Blackwell Superchip combines a 192K context window with only 3B active parameters per token, which balances performance and efficiency for agentic coding workflows.

## Next Steps

- [vLLM Self-Hosted Inference Guide](/talks_and_thoughts/vllm-self-hosted-inference-guide/) — Complete vLLM setup with troubleshooting and optimization
- [n8n Automation on GB10](/talks_and_thoughts/n8n-automation-gb10-ai-workflows/) — Combine local inference with workflow automation
- [Prompting Techniques for Agentic AI](/talks_and_thoughts/prompting-techniques-for-agentic-ai/) — Engineer prompts for autonomous AI systems
