Math Foundations of LLM Quantization


Explores quantization theory for LLMs: definitions, error bounds, bias-variance trade-offs, QSGD unbiased gradients, SGD convergence, noise models, optimization, info theory, and links to GPTQ/AWQ.

December 13, 2025 · 12 slides

Slide 1 - Mathematical Foundations of Quantization for Large Language Models

This title slide features the main title "Mathematical Foundations of Quantization for Large Language Models." Its subtitle describes exploring the math of quantization for LLMs, from theory to efficiency gains.

Mathematical Foundations of Quantization for Large Language Models

Exploring quantization math for LLMs: from theory to efficiency gains.


Slide 2 - Motivation and Scope

Large language models with trillions of parameters face major memory and compute constraints. The slide introduces quantization as mathematical compression via reduced precision, with the goal of deriving its math and efficiency.

Motivation and Scope

  • Large language models (LLMs) → trillions of parameters.
  • Challenge: memory and compute constraints.
  • Quantization = mathematical compression via reduced precision.
  • Goal: derive the math of quantization and efficiency.

Source: Mathematical Foundations of Quantization for Large Language Models

Speaker Notes
Today we explore quantization from a mathematical perspective — how precision reduction can be rigorously modeled, analyzed, and bounded in theory.

Slide 3 - Formal Definition of Quantization

The slide defines the quantization function as \( Q(x) = s \cdot (\text{round}(x / s) + z) \), with quantization error \( \epsilon = x - Q(x) \). For a uniform error distribution, the mean squared error is \( E[\epsilon^2] = s^2 / 12 \).

Formal Definition of Quantization

  • Quantization function: $Q(x) = s \cdot (\text{round}(x / s) + z)$
  • Quantization error: $\epsilon = x - Q(x)$
  • MSE (uniform dist.): $E[\epsilon^2] = s^2 / 12$

Source: Mathematical Foundations of Quantization for Large Language Models

Speaker Notes
Quantization maps continuous variables to discrete bins. The rounding introduces quantization noise, which we can treat statistically as uniform noise with variance proportional to step size squared.
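
To make the definition concrete, here is a minimal numerical sketch (assuming NumPy; the step size s = 0.1 and the uniform test data are illustrative choices, not from the slide). It applies Q(x) = s·(round(x/s) + z) with zero-point z = 0 and checks that the empirical mean-squared error matches s²/12.

import numpy as np

def quantize(x, s, z=0):
    # Affine quantizer from the slide: Q(x) = s * (round(x / s) + z)
    return s * (np.round(x / s) + z)

rng = np.random.default_rng(0)
s = 0.1
x = rng.uniform(-1.0, 1.0, size=1_000_000)   # inputs spread across many bins
eps = x - quantize(x, s)                     # quantization error with zero-point z = 0

print("empirical MSE:", np.mean(eps**2))     # close to 0.000833...
print("s**2 / 12    :", s**2 / 12)           # 0.000833...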

Slide 4 - Bias–Variance Decomposition

The slide on Bias–Variance Decomposition states that the expected quantized value is \(E[Q(x)] = x + b\), where the bias is \(b = E[Q(x) - x]\). It gives the variance as \(\text{Var}[Q(x)] = \text{Var}(x) + \text{Var}(\epsilon)\) (treating the quantization noise \(\epsilon\) as independent of \(x\)), and highlights a trade-off where a small step size \(s\) yields low bias but high cost.

Bias–Variance Decomposition

  • Expected value: \(E[Q(x)] = x + b\)
  • Bias: \(b = E[Q(x) - x]\)
  • Variance: \(\text{Var}[Q(x)] = \text{Var}(x) + \text{Var}(\epsilon)\)
  • Trade-off: small \(s\) → low bias, high cost
Speaker Notes
This decomposition is essential to understand quantization as a statistical estimator — there’s always a bias–variance trade-off governed by the step size.
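
For completeness, here is the decomposition written out; the step marked with \(\approx\) relies on the standard modeling assumption, implicit in the slide, that the quantization noise \(\epsilon\) is uncorrelated with the input \(x\):

$$E[Q(x)] = E[x + \epsilon] = x + b, \qquad b = E[\epsilon]$$
$$\text{Var}[Q(x)] = \text{Var}(x) + \text{Var}(\epsilon) + 2\,\text{Cov}(x, \epsilon) \approx \text{Var}(x) + \text{Var}(\epsilon), \qquad \text{Var}(\epsilon) = \frac{s^2}{12}$$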

Slide 5 - Quantization as Optimization

Quantization is framed as an optimization problem minimizing mean squared error (MSE) under discrete constraints, with the Lloyd-Max algorithm finding optimal levels.

This approach unifies scalar and vector quantizers, serves as the basis for advanced techniques, and drives reconstruction error minimization in LLM compression.

Quantization as Optimization

  • Minimize MSE: min_Q E[(x - Q(x))²] subject to discrete constraints.
  • Lloyd-Max finds optimal levels minimizing MSE.
  • Basis for advanced vector quantization techniques.
  • Optimization frames scalar and vector quantizers uniformly.
  • Reconstruction error minimization drives LLM compression.

Source: Mathematical Foundations of Quantization for Large Language Models

Speaker Notes
Mathematically, quantization seeks to minimize reconstruction error over discrete codebooks — the same principle that underpins modern LLM compression techniques.
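
A compact sketch of the Lloyd–Max iteration mentioned above (assuming NumPy; the Gaussian test data and 8 levels are illustrative choices). It alternates nearest-level assignment with centroid updates, which is one-dimensional k-means on the samples, and converges to a local minimum of the MSE.

import numpy as np

def lloyd_max(x, num_levels=8, iters=50):
    # Initialize levels on quantiles so every level starts non-empty.
    levels = np.quantile(x, np.linspace(0.05, 0.95, num_levels))
    for _ in range(iters):
        # Assignment step: map each sample to its nearest level.
        idx = np.argmin(np.abs(x[:, None] - levels[None, :]), axis=1)
        # Update step: each level becomes the mean (centroid) of its cell.
        for k in range(num_levels):
            if np.any(idx == k):
                levels[k] = x[idx == k].mean()
    return np.sort(levels)

rng = np.random.default_rng(0)
x = rng.normal(size=20_000)                     # illustrative Gaussian data
levels = lloyd_max(x, num_levels=8)
recon = levels[np.argmin(np.abs(x[:, None] - levels[None, :]), axis=1)]
print("MSE of 8-level Lloyd-Max quantizer:", np.mean((x - recon) ** 2))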

Slide 6 - QSGD – Gradient Quantization

The QSGD slide defines the stochastically quantized gradient Q(g)i = ||g||₂ · sign(gi) · ξi, where the stochastic variable ξi satisfies E[ξi] = |gi|/||g||₂, so that E[Q(g)] = g and the quantized gradient is an unbiased estimator. It also gives the variance bound E[||Q(g) - g||₂²] ≤ (d/s²)||g||₂² for a d-dimensional gradient quantized with s levels.

QSGD – Gradient Quantization

  • E[Q(g)] = g: Unbiased Estimator
  • Expected value preserved

  • E[ξi] = |gi|/||g||₂: Normalized Expectation
  • For stochastic variable ξi

  • ≤ (d/s²)||g||₂²: MSE Error Bound
  • Variance upper bound

Source: Alistarh et al.

Speaker Notes
QSGD proves that stochastic quantization can preserve convergence properties — the expected gradient remains unbiased with bounded variance.
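
A small sketch of a stochastic quantizer of this form (assuming NumPy; the choice of s = 4 levels and the random test gradient are illustrative). Each |gi|/||g||₂ is rounded up or down on an s-level grid with probabilities chosen so that E[ξi] = |gi|/||g||₂, which makes Q(g) unbiased; averaging many independent quantizations recovers g.

import numpy as np

def qsgd_quantize(g, s, rng):
    # Q(g)_i = ||g||_2 * sign(g_i) * xi_i, with xi_i a random point on a grid of s levels.
    norm = np.linalg.norm(g)
    if norm == 0.0:
        return np.zeros_like(g)
    r = s * np.abs(g) / norm                           # position of |g_i| on the [0, s] grid
    low = np.floor(r)
    xi = (low + (rng.random(g.shape) < r - low)) / s   # round up with probability r - low
    return norm * np.sign(g) * xi                      # E[xi_i] = |g_i|/||g||_2, so E[Q(g)] = g

rng = np.random.default_rng(0)
g = rng.normal(size=1000)
avg = np.mean([qsgd_quantize(g, s=4, rng=rng) for _ in range(2000)], axis=0)
print("one draw : max |Q(g) - g| =", np.max(np.abs(qsgd_quantize(g, s=4, rng=rng) - g)))
print("2000 avg : max |avg - g|  =", np.max(np.abs(avg - g)))   # shrinks toward 0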

Slide 7 - Error Propagation in SGD

The slide on Error Propagation in SGD shows the quantized update \(w_{t+1} = w_t - \eta Q(g_t)\) and the recursion for the expected squared error to the optimum \(w^*\), which picks up an additional term \(\eta^2 E[\|Q(g_t)-g_t\|^2]\) from quantization.

Convergence requires this expected quantization error to be bounded.

Error Propagation in SGD

  • Quantized update: $w_{t+1} = w_t - \eta Q(g_t)$
  • Error recursion: $E[\|w_{t+1}-w^*\|^2] \leq (1-2\eta\mu + \eta^2 L^2)\,E[\|w_t-w^*\|^2] + \eta^2 E[\|Q(g_t)-g_t\|^2]$
  • Convergence if $E[\|Q(g_t)-g_t\|^2]$ bounded
Speaker Notes
By bounding quantization noise, one can show convergence rates comparable to full-precision SGD under smoothness and convexity assumptions.
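
A toy run of this recursion, under assumptions not in the slide (NumPy; the quadratic objective f(w) = ½||w||², whose gradient is w itself; a round-to-nearest gradient quantizer with step delta; arbitrary values for η and delta). The squared distance to the optimum contracts geometrically and then stalls at a floor determined by the quantization error term.

import numpy as np

def quantize(g, delta):
    return delta * np.round(g / delta)          # round-to-nearest gradient quantizer

rng = np.random.default_rng(0)
w = rng.normal(size=100)                        # optimum is w* = 0 for f(w) = 0.5*||w||^2
eta, delta = 0.1, 0.05

for t in range(1, 201):
    g = w                                       # exact gradient of the quadratic
    w = w - eta * quantize(g, delta)
    if t in (10, 25, 50, 100, 200):
        print(f"t = {t:3d}   ||w - w*||^2 = {np.sum(w**2):.6f}")
# Geometric contraction at first, then a plateau: once |w_i| < delta/2 the quantized
# gradient is zero, so the remaining error reflects the quantization term in the recursion.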

Slide 8 - Quantization Noise Modeling

The Quantization Noise Modeling slide notes that SQNR drops by roughly 6 dB for each bit removed (linear in dB with bit width), giving about 50 dB for an 8-bit full-scale signal, and models the additive error with noise variance Δ²/12, where Δ is the quantization step size.

Quantization Noise Modeling

  • 6 dB: Per Bit Drop
  • SQNR linear reduction

  • 50 dB: 8-Bit SQNR
  • Full-scale baseline

  • Δ²/12: Noise Variance
  • Additive error MSE

Speaker Notes
SQNR provides a measurable quantity for quantization quality — a fundamental metric connecting bit precision and effective signal fidelity.
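
These figures can be checked numerically. A minimal sketch (assuming NumPy and a full-scale sine test signal, the conventional baseline for the 50 dB figure but still an assumption here): it quantizes the signal at several bit widths and measures SQNR = 10·log₁₀(E[x²]/E[ε²]).

import numpy as np

x = np.sin(np.linspace(0, 200 * np.pi, 1_000_000))    # full-scale test signal in [-1, 1]
for bits in (4, 6, 8, 10):
    delta = 2.0 / 2**bits                             # step size over the [-1, 1] range
    eps = x - np.clip(delta * np.round(x / delta), -1.0, 1.0)
    sqnr = 10 * np.log10(np.mean(x**2) / np.mean(eps**2))
    print(f"{bits:2d} bits: SQNR = {sqnr:5.1f} dB")   # roughly 6 dB per bit, ~50 dB at 8 bits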

Slide 9 - From Theory to Algorithms

The slide "From Theory to Algorithms" introduces GPTQ, SmoothQuant, and AWQ, which approximate minimum quantization error using least-squares reconstruction and scaling transforms. It also notes that fine-tuning compensates for quantization errors.

From Theory to Algorithms

  • GPTQ, SmoothQuant, AWQ approximate minimum quantization error
  • Core: least-squares reconstruction and scaling transforms
  • Fine-tuning compensates for quantization errors


Speaker Notes
Though derived from different motivations, modern algorithms all approximate the same underlying objective: minimizing quantization error with efficient computation.
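
To make the "least-squares reconstruction and scaling transforms" point concrete, here is a toy sketch. It is not GPTQ, SmoothQuant, or AWQ; the 4-bit round-to-nearest quantizer, matrix shapes, outlier pattern, and candidate scales are all illustrative assumptions. It evaluates the layer reconstruction error ||XW - (X S⁻¹) Q(S W)||²_F for a few per-channel scaling factors, mirroring how a scaling transform can be tuned to reduce quantization error.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(512, 64))
X[:, :4] *= 20.0                                  # a few outlier activation channels
W = rng.normal(size=(64, 32))
salient = np.abs(X).max(axis=0) > 10.0            # weight rows fed by large activations

def quantize(w, bits=4):
    delta = np.abs(w).max() / 2 ** (bits - 1)     # per-tensor round-to-nearest
    return delta * np.round(w / delta)

def recon_error(alpha):
    scale = np.where(salient, alpha, 1.0)         # scale salient weight rows by alpha
    Wq = quantize(scale[:, None] * W)             # quantize the scaled weights ...
    return np.linalg.norm(X @ W - (X / scale) @ Wq) ** 2   # ... and undo the scale in X

for alpha in (1.0, 1.5, 2.0, 3.0, 4.0):           # alpha = 1.0 is plain round-to-nearest
    print(f"alpha = {alpha:3.1f}   reconstruction error = {recon_error(alpha):12.2f}")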

Slide 10 - Information-Theoretic View

This slide presents an information-theoretic view of quantization as data compression. It cites the Shannon rate-distortion bound R(D) = ½ log₂(σx²/D) (exact for a Gaussian source under squared-error distortion), giving the bits per sample required to reach distortion D and linking precision to information preservation.

Information-Theoretic View

  • Quantization as data compression
  • Shannon: R(D) = ½ log₂(σx²/D)
  • Bits required for distortion D
  • Links precision to information preservation

Source: Rate-Distortion Theory

Speaker Notes
This connects quantization to rate–distortion theory — how much information you can preserve for a given level of compression.
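
A quick numerical reading of the bound (the formula is exact for a Gaussian source under squared-error distortion, which is the assumption here; NumPy only does the arithmetic): inverting R(D) gives D(R) = σx²·2^(−2R), so each additional bit cuts the minimum achievable distortion by a factor of four, about 6 dB.

import numpy as np

sigma2 = 1.0                                   # source variance (illustrative)
for rate in (1, 2, 4, 8):
    d = sigma2 * 2.0 ** (-2 * rate)            # minimum achievable MSE at this bit rate
    print(f"R = {rate} bit(s)/sample  ->  D = {d:.2e}  ({10 * np.log10(sigma2 / d):.1f} dB)")

This is the same roughly 6 dB-per-bit slope seen on the noise-modeling slide, now as an information-theoretic limit rather than a property of one particular quantizer.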

Slide 11 - Unified Mathematical View

The slide presents a unified mathematical view of quantization as affine transformations plus noise and optimization. This framework applies to gradients, weights, and activations, balancing efficiency, accuracy, and convergence across all strategies.

Unified Mathematical View

  • Quantization: affine + noise + optimization.
  • Applies to gradients, weights, activations.
  • Balances efficiency, accuracy, convergence.
  • Unified lens for all strategies.
Speaker Notes
Ultimately, all quantization strategies can be analyzed through this unified mathematical lens — balancing efficiency, accuracy, and convergence.

Slide 12 - Conclusion

The conclusion slide states that quantization is mathematically grounded, that understanding bias, variance, and error bounds is key, and that the theory bridges to modern LLM deployment. It thanks the audience and encourages experimenting with quantization in projects.

Conclusion

  • Quantization: mathematically grounded.
  • Key: bias/variance/error bounds.
  • Bridges theory to LLM deployment.

Thank you for your attention!

Experiment with quantization in your projects.

Source: Mathematical Foundations of Quantization for Large Language Models

Speaker Notes
Quantization theory elegantly connects statistical estimation, optimization, and information theory — turning mathematical insight into scalable AI systems.