GLM-4.5-Air (GPTQModel W4A16 Quantization)

Quantization Details & Hardware Requirements

This is a W4A16 (4-bit weights, 16-bit activations) quantized version of the GLM-4.5-Air model.

Methodology

The quantization was performed using **GPTQModel** with an experimental modification that feeds the entire calibration dataset to each expert to improve quantization quality.

Calibration Dataset: The calibration set consists of 2,320 samples: c4/en (1536), arc (300), gsm8k (300), humaneval (164), and alpaca (20).
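The calibration mix above can be written down as a simple manifest. The labels below mirror the names in this card; the actual loading and tokenization step (done by GPTQModel) is not shown:

```python
# Calibration mix used for this quantization (samples per source).
# Labels match the card; they are shorthand, not exact Hub dataset IDs.
CALIBRATION_MIX = {
    "c4/en": 1536,
    "arc": 300,
    "gsm8k": 300,
    "humaneval": 164,
    "alpaca": 20,
}

def total_samples(mix):
    """Total number of calibration samples across all sources."""
    return sum(mix.values())

print(total_samples(CALIBRATION_MIX))  # 2320
```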

Hardware & Performance: This model is verified to run with Tensor Parallelism (TP=4) on 4x NVIDIA RTX 3090 GPUs, supporting a context window of 108,000 tokens without --enable-expert-parallel and 115,000 tokens with it. It can also run on 8x NVIDIA RTX 3090 GPUs (without --enable-expert-parallel) at full context with --max-num-seqs 2.
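As a back-of-envelope check on why the model fits in 4x24 GB, W4A16 stores roughly half a byte per weight parameter. The sketch below is an estimate only, with an assumed ~5% overhead for group scales and zero-points, not a measured figure:

```python
def w4_weight_gib(n_params_billions, overhead=1.05):
    """Rough weight footprint for W4A16: 4 bits (0.5 bytes) per parameter,
    times an assumed ~5% overhead for quantization group scales/zeros."""
    return n_params_billions * 1e9 * 0.5 * overhead / 2**30

# ~121B params -> roughly 59 GiB of weights, leaving the remainder of the
# 4x24 GB pool for KV cache and activations.
print(round(w4_weight_gib(121), 1))
```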

How to Run (vLLM)

You can serve this model with vLLM. Below is a sample command tuned for a 4x RTX 3090 setup:

export VLLM_ATTENTION_BACKEND="FLASHINFER"
export TORCH_CUDA_ARCH_LIST="8.6"
export CUDA_VISIBLE_DEVICES=0,1,2,3
export VLLM_MARLIN_USE_ATOMIC_ADD=1
export SAFETENSORS_FAST_GPU=1

vllm serve avtc/GLM-4.5-Air-GPTQMODEL-W4A16 \
    -tp 4 \
    --port 8000 \
    --host 0.0.0.0 \
    --uvicorn-log-level info \
    --trust-remote-code \
    --gpu-memory-utilization 0.925 \
    --max-num-seqs 1 \
    --dtype=float16 \
    --seed 1234 \
    --max-model-len 115000 \
    --tool-call-parser glm45 \
    --reasoning-parser glm45 \
    --enable-auto-tool-choice \
    --enable-sleep-mode \
    --enable-expert-parallel \
    --compilation-config '{"level": 3, "cudagraph_capture_sizes": [1]}'

Recommended Sampling Parameters:

{
    "top_p": 0.95,
    "temperature": 0.75,
    "repetition_penalty": 1.05,
    "top_k": 40,
    "min_p": 0.05
}
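With the server above running, the recommended parameters can be sent through vLLM's OpenAI-compatible API; top_k, min_p, and repetition_penalty are vLLM extensions accepted on the standard request body. The endpoint URL and prompt below are placeholders for your own setup:

```python
import json

# Recommended sampling parameters from the section above.
RECOMMENDED_SAMPLING = {
    "top_p": 0.95,
    "temperature": 0.75,
    "repetition_penalty": 1.05,
    "top_k": 40,
    "min_p": 0.05,
}

def build_payload(prompt, model="avtc/GLM-4.5-Air-GPTQMODEL-W4A16",
                  max_tokens=512):
    """Assemble an OpenAI-compatible /v1/chat/completions body. top_k,
    min_p and repetition_penalty are vLLM extensions to the schema."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        **RECOMMENDED_SAMPLING,
    }

# POST this JSON to http://localhost:8000/v1/chat/completions (port from
# the serve command above), e.g. with curl or the openai Python client.
print(json.dumps(build_payload("Hello"), indent=2))
```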

Aider polyglot eval with 6 tries and whole format

Values show the percent of exercises passed after each try; T is the sampling temperature. Aider polyglot uses 2 tries by default.

Python

| Model | T | Pass 1 | Pass 2 | Pass 3 | Pass 4 | Pass 5 | Pass 6 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| avtc/GLM-4.5-Air-GPTQMODEL-W4A16 | 0.75 | 8.8 | 55.9 | 64.7 | 67.6 | 67.6 | 67.6 |
| avtc/GLM-4.5-Air-GPTQMODEL-W8A16 | 0.70 | 8.8 | 35.3 | 44.1 | 58.8 | 61.8 | 61.8 |
| z.ai CodingPlan GLM-4.5-Air | 0.80 | 5.9 | 38.2 | 55.9 | 67.6 | 67.6 | 70.6 |
| avtc/GLM-4.6-REAP-268B-A32B-GPTQMODEL-W4A16 | 0.80 | 23.5 | 61.8 | 73.5 | 94.1 | 94.1 | 94.1 |
| avtc/GLM-4.6-REAP-268B-A32B-GPTQMODEL-W4A16-V2 | 0.00 | 20.6 | 73.5 | 91.2 | 100.0 | 100.0 | 100.0 |
| z.ai CodingPlan GLM-4.6 | 0.80 | 32.4 | 73.5 | 94.1 | 97.1 | 97.1 | 97.1 |
| z.ai CodingPlan GLM-4.6 | 1.00 | 23.5 | 76.5 | 85.3 | 85.3 | 91.2 | 94.1 |

All languages

| Model | T | Pass 1 | Pass 2 | Pass 3 | Pass 4 | Pass 5 | Pass 6 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| avtc/GLM-4.5-Air-GPTQMODEL-W4A16 | 0.75 | 12.9 | 34.2 | 49.3 | 56.9 | 58.7 | 60.4 |
| avtc/GLM-4.6-REAP-268B-A32B-GPTQMODEL-W4A16-V2 | 0.00 | 19.1 | 58.7 | 80.4 | 85.8 | 87.1 | 88.0 |
| z.ai CodingPlan GLM-4.6 | 0.80 | 22.2 | 59.1 | 76.4 | 81.3 | 81.3 | 84.9 |
| z.ai CodingPlan GLM-4.6 | 1.00 | 23.1 | 61.8 | 74.2 | 80.0 | 85.8 | 88.0 |
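The per-try numbers are cumulative: percent passed after try k is the fraction of exercises solved within the first k attempts. A minimal sketch of that calculation (34 Python exercises matches the table's granularity; the per-exercise try data below is hypothetical):

```python
def cumulative_pass_rates(first_pass_try, num_exercises, max_tries=6):
    """first_pass_try lists, for each solved exercise, the 1-based try on
    which it first passed. Returns percent passed after each try."""
    rates = []
    for k in range(1, max_tries + 1):
        solved = sum(1 for t in first_pass_try if t <= k)
        rates.append(round(100.0 * solved / num_exercises, 1))
    return rates

# Hypothetical: 6 of 34 exercises solved, on tries 1,1,1,2,2,3.
print(cumulative_pass_rates([1, 1, 1, 2, 2, 3], 34))
# -> [8.8, 14.7, 17.6, 17.6, 17.6, 17.6]
```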

Example Output

Prompt:

Make an html animation of fishes in an aquarium. The aquarium is pretty, the fishes vary in colors and sizes and swim realistically. You can left click to place a piece of fish food in aquarium. Each fish chases a food piece closest to it, trying to eat it. Once there are no more food pieces, fishes resume swimming as usual.

Result: The model generated working artifacts:

  • using OpenWebUI with T=0.75: JSFiddle
  • using Kilo Code in Code mode with T=0.75: JSFiddle

Acknowledgements

Special thanks to the GPTQModel team for their tools and support in enabling this quantization.


Original Model Introduction

👋 Join the Discord community.
📖 Check out the GLM-4.5 technical blog, technical report, and Zhipu AI technical documentation.

The GLM-4.5 series models are foundation models designed for intelligent agents. GLM-4.5 has 355 billion total parameters with 32 billion active parameters, while GLM-4.5-Air adopts a more compact design with 106 billion total parameters and 12 billion active parameters. GLM-4.5 models unify reasoning, coding, and intelligent agent capabilities to meet the complex demands of intelligent agent applications.

Both GLM-4.5 and GLM-4.5-Air are hybrid reasoning models that provide two modes: thinking mode for complex reasoning and tool usage, and non-thinking mode for immediate responses.
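When serving with vLLM, the thinking mode can typically be toggled per request through the chat template. The `chat_template_kwargs` field and the `enable_thinking` flag below are assumptions based on the GLM-4.5 chat template shipped with recent vLLM versions; verify the flag name against your deployment:

```python
import json

def chat_body(prompt, thinking=True):
    """OpenAI-compatible request body toggling GLM-4.5 thinking mode.
    `enable_thinking` is an assumed chat-template kwarg; check the chat
    template bundled with your vLLM / model revision."""
    return {
        "model": "avtc/GLM-4.5-Air-GPTQMODEL-W4A16",
        "messages": [{"role": "user", "content": prompt}],
        "chat_template_kwargs": {"enable_thinking": thinking},
    }

# Non-thinking mode for a quick immediate response.
print(json.dumps(chat_body("What is 2+2?", thinking=False), indent=2))
```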

We have open-sourced the base models, hybrid reasoning models, and FP8 versions of the hybrid reasoning models for both GLM-4.5 and GLM-4.5-Air. They are released under the MIT open-source license and can be used commercially and for secondary development.

As demonstrated in our comprehensive evaluation across 12 industry-standard benchmarks, GLM-4.5 achieves exceptional performance with a score of 63.2, placing 3rd among all proprietary and open-source models. Notably, GLM-4.5-Air delivers competitive results at 59.8 while maintaining superior efficiency.

For more eval results, showcases, and technical details, please visit our technical blog or technical report.

The model code, tool parser, and reasoning parser can be found in the implementations of transformers, vLLM, and SGLang.

Quick Start

Please refer to the GLM-4.5 github page for more details on the original architecture and usage.
