Half-Quadratic Quantization (HQQ): Calibration-Free Quantization of Large ML Models
Deploying large language models is memory-intensive, and while calibration-based quantization methods like GPTQ and AWQ offer better quality than data-free approaches, they suffer from calibration data bias and prohibitively slow processing times on the largest models.
How it works
Common implementation structure
Stage 1 · Sparsity-promoting objective
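HQQ minimizes a sparsity-promoting lp-norm loss (p < 1) between the original and dequantized weights. This corresponds to modeling the quantization error with a heavy-tailed hyper-Laplacian distribution, which accounts for weight outliers better than a squared-error loss.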
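The objective in this stage can be illustrated with a minimal NumPy sketch. It assumes plain min-max affine round-to-nearest quantization and p = 0.7; the helper names `dequantized`, `lp_loss`, and `penalty_ratio` are hypothetical and are not taken from any HQQ library.

```python
import numpy as np

def dequantized(w, n_bits=4):
    """Min-max affine round-to-nearest, then map the integers back to floats."""
    q_max = 2 ** n_bits - 1
    scale = (w.max() - w.min()) / q_max      # step size covering the full range
    zero = -w.min() / scale                  # zero-point so that w.min() maps to 0
    w_q = np.clip(np.round(w / scale + zero), 0, q_max)
    return scale * (w_q - zero)

def lp_loss(w, w_dq, p=0.7):
    """Sum of |error|^p over all weights (p < 1 is the sparsity-promoting case)."""
    return float(np.sum(np.abs(w - w_dq) ** p))

def penalty_ratio(big_err, small_err, p):
    """How much more one large error costs than one small error under |e|^p."""
    return (big_err / small_err) ** p

# Toy weight vector (assumed data): mostly small values plus a few outliers.
rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=256)
w[:4] = [0.5, -0.4, 0.45, -0.5]

w_dq = dequantized(w)
loss_p = lp_loss(w, w_dq, p=0.7)    # sparsity-promoting objective
loss_l2 = lp_loss(w, w_dq, p=2.0)   # squared-error objective, for comparison

# An error 100x larger costs 10,000x more under p = 2 but only about 25x under p = 0.7.
ratio_l2 = penalty_ratio(1.0, 0.01, 2.0)
ratio_lp = penalty_ratio(1.0, 0.01, 0.7)
```

The ratio comparison shows why the choice of norm matters: with p < 1, a handful of large errors does not dominate the objective the way it does under squared error.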
Tools used
bitsandbytes, GPTQ, AWQ, Llama-2, OpenCLIP, AutoGPTQ, AutoAWQ
Outcome
HQQ achieves calibration-free quantization quality competitive with GPTQ and AWQ. It quantizes Llama-2-70B in under 5 minutes, over 50x faster than GPTQ, and 2-bit HQQ Llama-2-70B outperforms full-precision Llama-2-13B at comparable memory usage.
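The "comparable memory usage" claim can be sanity-checked with back-of-the-envelope arithmetic. The sketch below uses nominal parameter counts (70e9 and 13e9) and a hypothetical metadata layout (groups of 16 weights with 16 bits of scale and zero-point metadata per group). These values are illustrative assumptions, not figures from the source, and the estimate ignores activations, the KV cache, and any layers left unquantized.

```python
def weight_memory_gb(n_params, bits_per_weight, group_size=None, meta_bits_per_group=0):
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    bits = n_params * bits_per_weight
    if group_size:
        # Per-group metadata (scale and zero-point) adds to the effective bit width.
        bits += (n_params / group_size) * meta_bits_per_group
    return bits / 8 / 1e9

fp16_13b = weight_memory_gb(13e9, 16)                     # 26.0 GB
raw_2bit_70b = weight_memory_gb(70e9, 2)                  # 17.5 GB
grouped_2bit_70b = weight_memory_gb(                      # 26.25 GB
    70e9, 2, group_size=16, meta_bits_per_group=16
)
```

Under these assumptions the metadata adds one extra bit per weight, which puts the 2-bit 70B model in the same range as the 16-bit 13B model.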
What failed first
Gradient-based calibration-free optimization with autograd requires many iterations and fails with the sparsity-promoting norms (p < 1) needed to handle weight outliers effectively.
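The half-quadratic approach named in the title avoids gradients by alternating two closed-form updates. The following is a minimal NumPy sketch of that idea, not the reference implementation: it assumes per-row min-max scales that stay fixed, optimizes only the zero-point, and uses a generalized soft-thresholding step for the lp penalty. The function names and the default values (p = 0.7, beta = 10, kappa = 1.01, 20 iterations) are assumptions for illustration.

```python
import numpy as np

def shrink_lp(x, beta, p):
    """Generalized soft-thresholding used as the update for the |x|^p penalty."""
    mag = np.abs(x)
    # The small epsilon avoids raising 0 to a negative power when p < 1.
    return np.sign(x) * np.maximum(mag - (1.0 / beta) * (mag + 1e-8) ** (p - 1), 0.0)

def hqq_style_zero_point(w, n_bits=4, p=0.7, beta=10.0, kappa=1.01, iters=20):
    """Alternate between an auxiliary error term and the zero-point; scale is fixed."""
    q_max = 2 ** n_bits - 1
    w_min = w.min(axis=1, keepdims=True)
    w_max = w.max(axis=1, keepdims=True)
    scale = (w_max - w_min) / q_max          # assumes no row is constant
    zero = -w_min / scale
    best_zero, best_err = zero, np.inf
    for _ in range(iters):
        # Quantize and dequantize with the current zero-point.
        w_q = np.clip(np.round(w / scale + zero), 0, q_max)
        w_dq = scale * (w_q - zero)
        err = float(np.mean(np.abs(w - w_dq) ** p))
        if err < best_err:                   # keep the best zero-point seen so far
            best_zero, best_err = zero, err
        # Step 1: closed-form update of the auxiliary error variable.
        w_e = shrink_lp(w - w_dq, beta, p)
        # Step 2: least-squares update of the zero-point with w_q held fixed.
        zero = np.mean(w_q - (w - w_e) / scale, axis=1, keepdims=True)
        beta *= kappa                        # gradually tighten the coupling
    return scale, best_zero, best_err

# Toy weight matrix (assumed data) with one outlier column.
rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=(8, 64))
w[:, 0] = 0.5
scale, zero, err = hqq_style_zero_point(w)
```

Because both steps are closed-form, each iteration costs only a few elementwise operations and a row-wise mean, and the non-differentiable p < 1 penalty never has to be backpropagated through.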