
Half-Quadratic Quantization (HQQ): Calibration-Free Quantization of Large ML Models

Deploying large language models is memory-intensive. Calibration-based quantization methods such as GPTQ and AWQ offer better quality than data-free approaches, but they are prone to calibration-data bias and become prohibitively slow on the largest models.

How it works

Common implementation structure

How this type of workflow is generally built, generalized across documented cases and not tied to any one vendor's stack. Specific products that implement these stages appear in “Tools used” below.

Stage 1 · Sparsity-promoting objective

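HQQ minimizes a sparsity-promoting ℓp-norm loss (p < 1) between the original and dequantized weights. This loss models weight outliers through a hyper-Laplacian distribution, which captures the heavy-tailed errors that outliers produce better than a squared-error loss does.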
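To make the objective concrete, here is a minimal NumPy sketch of an ℓp loss between a weight matrix and its dequantized counterpart. The helper names (`quantize`, `dequantize`, `lp_loss`), the per-row min/max initialization, the 4-bit setting, and p = 0.7 are illustrative assumptions, not details taken from the source.

```python
import numpy as np

def quantize(W, scale, zero, nbits=4):
    """Affine round-to-nearest quantization to integers in [0, 2**nbits - 1]."""
    return np.clip(np.round(W * scale + zero), 0, 2**nbits - 1)

def dequantize(W_q, scale, zero):
    """Map quantized integers back to floating-point weights."""
    return (W_q - zero) / scale

def lp_loss(W, W_dq, p=0.7):
    """Sum of |error|**p over all weights; p < 1 gives a sparsity-promoting loss."""
    return np.sum(np.abs(W - W_dq) ** p)

# Toy weight matrix with one injected outlier (hypothetical data).
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 64))
W[0, 0] = 12.0

# Per-row min/max scale and zero-point (an assumed starting point).
nbits = 4
w_min = W.min(axis=1, keepdims=True)
w_max = W.max(axis=1, keepdims=True)
scale = (2**nbits - 1) / (w_max - w_min)
zero = -w_min * scale

W_dq = dequantize(quantize(W, scale, zero, nbits), scale, zero)
loss_p = lp_loss(W, W_dq, p=0.7)   # sparsity-promoting loss
loss_2 = lp_loss(W, W_dq, p=2.0)   # squared-error loss, for comparison
```

Row 0 contains the outlier, so its min/max range is much wider and every other weight in that row is quantized more coarsely. That is the kind of error the choice of loss is meant to account for.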

Tools used

bitsandbytes, GPTQ, AWQ, Llama-2, OpenCLIP, AutoGPTQ, AutoAWQ

Outcome

HQQ achieves calibration-free quantization quality competitive with GPTQ and AWQ. It quantizes Llama-2-70B in under 5 minutes, over 50x faster than GPTQ, and 2-bit HQQ Llama-2-70B outperforms full-precision Llama-2-13B at comparable memory usage.
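The "comparable memory usage" comparison can be sanity-checked with rough arithmetic. The sketch below estimates weight storage only. The nominal parameter counts (70e9 and 13e9), the group size of 16, and the 16 bits of scale/zero-point metadata per group are all assumptions chosen for illustration, not figures from the source.

```python
def weight_memory_gb(n_params, bits_per_weight, group_size=None, meta_bits_per_group=0):
    """Rough weight-storage estimate in GB (1 GB = 1e9 bytes).

    Ignores activations, KV cache, and any layers left unquantized.
    """
    bits = n_params * bits_per_weight
    if group_size:
        # Each group of weights carries its own scale/zero-point metadata.
        bits += (n_params / group_size) * meta_bits_per_group
    return bits / 8 / 1e9

# Full-precision (fp16) 13B model: 16 bits per weight, no quantization metadata.
fp16_13b = weight_memory_gb(13e9, 16)

# 2-bit 70B model with assumed group size 16 and 16 metadata bits per group.
q2_70b = weight_memory_gb(70e9, 2, group_size=16, meta_bits_per_group=16)
```

Under these assumptions the two estimates land close together (about 26 GB each), which shows how a 2-bit 70B model can sit in the same memory class as an fp16 13B model once per-group metadata is counted.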

What failed first

Gradient-based calibration-free optimization with autograd requires many iterations and fails when using the sparsity-promoting norms (p<1) needed to handle weight outliers effectively.
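As the method's name suggests, the alternative to autograd is a half-quadratic solver, which replaces gradient steps on the non-smooth ℓp loss with alternating closed-form updates. The sketch below follows the general shape of that approach as we understand it: a generalized soft-thresholding step on the quantization error, then a zero-point update by averaging. The function names, the fixed scale, and the hyperparameter values (p = 0.7, beta = 10, kappa = 1.01, 20 iterations) are assumptions for illustration.

```python
import numpy as np

def shrink_lp(x, beta, p, eps=1e-12):
    """Generalized soft-thresholding: closed-form shrinkage for an lp penalty."""
    mag = np.abs(x)
    # eps avoids raising zero to a negative power when p < 1.
    thresh = np.maximum(mag, eps) ** (p - 1) / beta
    return np.sign(x) * np.maximum(mag - thresh, 0.0)

def optimize_zero_point(W, scale, zero, nbits=4, p=0.7, beta=10.0, kappa=1.01, iters=20):
    """Alternate between an error-shrinkage step and a closed-form zero-point update."""
    qmax = 2**nbits - 1
    for _ in range(iters):
        W_q = np.clip(np.round(W * scale + zero), 0, qmax)   # quantize
        W_dq = (W_q - zero) / scale                          # dequantize
        W_e = shrink_lp(W - W_dq, beta, p)                   # sparse error estimate
        # Zero-point that best explains W - W_e given the current integers.
        zero = np.mean(W_q - (W - W_e) * scale, axis=1, keepdims=True)
        beta *= kappa                                        # tighten the coupling
    return zero

# Toy example with one outlier (hypothetical data).
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 64))
W[0, 0] = 12.0

nbits = 4
w_min = W.min(axis=1, keepdims=True)
w_max = W.max(axis=1, keepdims=True)
scale = (2**nbits - 1) / (w_max - w_min)
zero0 = -w_min * scale

zero_opt = optimize_zero_point(W, scale, zero0, nbits=nbits)
```

Each iteration uses only elementwise operations and a mean, with no gradients. That is why a solver of this kind can work with p < 1, where autograd-based optimization breaks down.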

Results

Time saved: less than 5 minutes

Volume: +3.1%

Source

https://dropbox.tech/machine-learning/halfquadratic-quantization-of-large-machine-learning-models


Grounding & classification

Source type: technical build writeup

21 fields verified against source quotes, 1 dropped as unverifiable.

metric backed, source backed, tools described, software, accuracy improvement, cycle time reduction, technical build writeup