Quantization vs. Pruning vs. Distillation (Efficient NLP)

March 25, 2025

Quantization - reduce parameter precision

-usually, NNs store each parameter as a 32-bit float (FP32)

Zero-point quantization

Example of FP32 -> INT8

1.Map 0s to 0s

2.Map the FP32 value with the highest absolute value to the end of the INT8 range (127 if it is positive, -128 if it is negative)

3.Must be a linear transformation, so scale other values accordingly

4.Add 128 to each element so that all elements are non-negative
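
A minimal NumPy sketch of the four steps above, for illustration only. The function names are made up for this example, and it assumes the largest absolute value is mapped to +/-127 (a symmetric range) rather than distinguishing -128 from 127.

```python
import numpy as np

def quantize_fp32_to_uint8(x: np.ndarray):
    """FP32 -> 8-bit, following the four steps in the notes (illustrative helper)."""
    x = x.astype(np.float32)
    max_abs = float(np.max(np.abs(x)))
    # Steps 1-3: a pure scaling (no offset) is linear, keeps 0 at 0,
    # and sends the largest absolute value to +/-127.
    scale = 127.0 / max_abs if max_abs > 0 else 1.0
    q_signed = np.clip(np.round(x * scale), -128, 127)
    # Step 4: shift by 128 so every element is non-negative (0..255).
    q = (q_signed + 128).astype(np.uint8)
    return q, scale

def dequantize_uint8_to_fp32(q: np.ndarray, scale: float) -> np.ndarray:
    # Undo the shift, then undo the scaling.
    return (q.astype(np.float32) - 128.0) / scale

x = np.array([-1.5, 0.0, 0.5, 2.0], dtype=np.float32)
q, scale = quantize_fp32_to_uint8(x)
x_hat = dequantize_uint8_to_fp32(q, scale)
print(q)       # FP32 zero ends up at 128 after the shift
print(x_hat)   # close to x, but not exact: rounding loses information
```

After the shift in step 4, the FP32 value 0 is stored as 128, which is the "zero point" of the quantized tensor.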

Weight Quantization

-store weights in INT8, dequantize into FP32 when running inference

-saves space, does not save time during inference
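
A rough sketch of why weight-only quantization saves space but not time, assuming a single linear layer with made-up random weights and one symmetric scale for the whole tensor. `linear_forward` is a hypothetical helper, not a library function.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(256, 256)).astype(np.float32)   # hypothetical FP32 layer weights
x = rng.normal(size=(256,)).astype(np.float32)       # hypothetical input vector

# Offline: quantize once, keep only the INT8 tensor plus a single scale.
scale = 127.0 / float(np.max(np.abs(W)))
W_int8 = np.clip(np.round(W * scale), -127, 127).astype(np.int8)

def linear_forward(x, W_int8, scale):
    # At inference the weights are dequantized back to FP32 first,
    # so the matrix multiply is still a full FP32 operation.
    W_fp32 = W_int8.astype(np.float32) / scale
    return W_fp32 @ x

y_ref = W @ x
y_q = linear_forward(x, W_int8, scale)
print(W.nbytes, W_int8.nbytes)   # stored weights are 4x smaller
print(float(np.linalg.norm(y_q - y_ref) / np.linalg.norm(y_ref)))   # small relative error
```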

Activation Quantization

-convert all inputs/outputs to INT8, do computations in INT8

-need to calibrate scale factors for data at each layer

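-static vs. dynamic quantization: static fixes the scale factors ahead of time from calibration data, dynamic computes them at runtime from the actual activations

-clipping - quantization can only represent FP values within the calibrated range; values outside it are "clipped" to the nearest representable value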
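
A small sketch of calibration and clipping, under the assumption that "static" means the scale is fixed ahead of time from sample data and "dynamic" means it is recomputed from each incoming tensor. The helper names and the numbers are made up for illustration.

```python
import numpy as np

def calibrate_scale(calibration_batches):
    """Static case: fix the scale from sample activations ahead of time."""
    max_abs = max(float(np.max(np.abs(b))) for b in calibration_batches)
    return 127.0 / max_abs

def quantize(x, scale):
    # Values outside the range covered by `scale` saturate at the INT8 limits.
    return np.clip(np.round(x * scale), -127, 127).astype(np.int8)

# Calibration data only ever reaches an absolute value of 2.0.
calib = [np.array([-1.0, 0.5]), np.array([0.25, 2.0])]
static_scale = calibrate_scale(calib)

x = np.array([0.5, 2.0, 8.0], dtype=np.float32)   # 8.0 lies outside the calibrated range
q_static = quantize(x, static_scale)              # 8.0 is clipped to the same code as 2.0

# Dynamic case: the scale comes from the tensor itself, so nothing is clipped,
# but small values get fewer distinct levels.
dynamic_scale = 127.0 / float(np.max(np.abs(x)))
q_dynamic = quantize(x, dynamic_scale)

print(q_static, q_dynamic)
```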

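LLM.int8(): Mixed-precision decomposition

-LLMs with ~6.5B+ parameters do not work well with traditional quantization

-outlier features - a few feature dimensions take very large values; the limited number of representable values (256 in INT8) can't cover both the outliers and the normal-sized values accurately

-solution -> mixed-precision decomposition: compute the outlier feature dimensions in 16-bit floating point and everything else in INT8, then add the two results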
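
A toy NumPy sketch of the decomposition idea, not the actual LLM.int8() implementation. It assumes outlier dimensions are detected with a fixed magnitude threshold (6.0 here, an assumption), simulates the higher-precision path by rounding to FP16, and uses made-up helper names and data.

```python
import numpy as np

def absmax_quantize(t, axis):
    # One scale per row or column; `initial` keeps this safe for empty slices.
    max_abs = np.max(np.abs(t), axis=axis, keepdims=True, initial=0.0)
    scale = 127.0 / np.maximum(max_abs, 1e-8)
    return np.round(t * scale).astype(np.int8), scale

def mixed_matmul(X, W, threshold=6.0):
    # Feature dimensions (columns of X) that contain any large-magnitude value.
    outlier = np.any(np.abs(X) >= threshold, axis=0)
    # Outlier dimensions: multiply with FP16-rounded inputs (higher-precision path).
    X_hi = X[:, outlier].astype(np.float16).astype(np.float32)
    W_hi = W[outlier, :].astype(np.float16).astype(np.float32)
    hi = X_hi @ W_hi
    # Everything else: INT8 inputs, integer accumulation, then rescale.
    Xq, sx = absmax_quantize(X[:, ~outlier], axis=1)   # one scale per row of X
    Wq, sw = absmax_quantize(W[~outlier, :], axis=0)   # one scale per column of W
    lo = (Xq.astype(np.int32) @ Wq.astype(np.int32)) / (sx * sw)
    return hi + lo

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 16)).astype(np.float32)
X[:, 3] = [40.0, -35.0, 50.0, 45.0]                # hypothetical outlier feature dimension
W = rng.normal(size=(16, 8)).astype(np.float32)

exact = X @ W
mixed = mixed_matmul(X, W)
plain = mixed_matmul(X, W, threshold=np.inf)       # no outlier path: everything in INT8
err_mixed = float(np.max(np.abs(mixed - exact)))
err_plain = float(np.max(np.abs(plain - exact)))
print(err_mixed, err_plain)
```

In the all-INT8 run, the outlier column stretches the scale of every row, so the normal-sized values are rounded much more coarsely.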

Pruning

-remove some of the connections in the NN -> sparse network

-cheaper to store, faster to compute

Magnitude Pruning

1.Pick pruning factor X (0 < X < 1) -> fraction of connections to prune

2.In each layer, set the lowest fraction X of weights (by absolute value) to zero

-logic: the weights with the lowest absolute values are the least important for inference

3.(optional) retrain the model with pruned connections to "regain" some of the knowledge lost during pruning

-pruning by itself doesn't make the model faster or smaller, because 0 values still occupy storage and still take part in dense computation

-need to use a sparse execution engine that can take advantage of pruned NN structure
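
A minimal sketch of magnitude pruning for one layer, assuming the pruning factor is given as a fraction and ties at the threshold are pruned as well. `magnitude_prune` is an illustrative helper, and the optional retraining step is not shown.

```python
import numpy as np

def magnitude_prune(W: np.ndarray, fraction: float) -> np.ndarray:
    """Zero out the `fraction` of weights with the smallest absolute value."""
    k = int(fraction * W.size)
    if k == 0:
        return W.copy()
    # k-th smallest absolute value in this layer.
    threshold = np.partition(np.abs(W).ravel(), k - 1)[k - 1]
    mask = np.abs(W) > threshold
    return np.where(mask, W, np.zeros_like(W))

W = np.array([[0.1, -2.0, 0.03],
              [1.5, -0.4, 0.7]], dtype=np.float32)
pruned = magnitude_prune(W, 0.5)
print(pruned)
# The dense array is exactly as large as before: zeros are stored like any other value.
print(W.nbytes, pruned.nbytes)
```

The unchanged byte count is the reason a sparse storage format or sparse execution engine is needed before pruning pays off.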

N:M Sparsity (Structured Pruning)

-for every group of M weights (in a specified axis/pattern), only N can be nonzero

-NVIDIA GPU Tensor Cores (Ampere and later) support the 2:4 pattern natively
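
A sketch of how an N:M mask could be built, assuming groups of M consecutive weights along the last axis and keeping the N largest by absolute value in each group. `nm_prune` is a made-up helper; this only produces the pattern and does not use any hardware sparse kernels.

```python
import numpy as np

def nm_prune(W: np.ndarray, n: int = 2, m: int = 4) -> np.ndarray:
    """Keep the n largest-magnitude weights in every group of m along the last axis."""
    rows, cols = W.shape
    if cols % m != 0:
        raise ValueError("number of columns must be a multiple of m")
    groups = W.reshape(rows, cols // m, m)
    # Indices of the (m - n) smallest magnitudes inside each group.
    drop = np.argsort(np.abs(groups), axis=-1)[..., : m - n]
    mask = np.ones(groups.shape, dtype=bool)
    np.put_along_axis(mask, drop, False, axis=-1)
    return np.where(mask, groups, np.zeros_like(groups)).reshape(rows, cols)

W = np.array([[0.9, -0.1, 0.2, -0.8, 0.05, 0.6, -0.7, 0.3]])
sparse = nm_prune(W, n=2, m=4)
print(sparse)   # each group of 4 keeps its 2 largest-magnitude weights
```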

Knowledge/Model Distillation

-use training data to create a teacher network and a student network

-train the student network to predict the outputs of the teacher network
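
One common way to express "student predicts the teacher's outputs" is a KL-divergence loss between the two output distributions. The sketch below assumes that formulation and a softmax temperature (both are assumptions, not something the notes specify), and uses made-up logits and helper names.

```python
import numpy as np

def log_softmax(logits, temperature=1.0):
    z = logits / temperature
    z = z - np.max(z, axis=-1, keepdims=True)   # subtract the max for numerical stability
    return z - np.log(np.sum(np.exp(z), axis=-1, keepdims=True))

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Mean KL(teacher || student) over the batch, on temperature-softened outputs."""
    log_p = log_softmax(teacher_logits, temperature)   # teacher distribution (the target)
    log_q = log_softmax(student_logits, temperature)   # student distribution
    return float(np.mean(np.sum(np.exp(log_p) * (log_p - log_q), axis=-1)))

teacher_logits = np.array([[4.0, 1.0, 0.0],
                           [0.5, 3.0, 0.2]])
loss_same = distillation_loss(teacher_logits.copy(), teacher_logits)             # student matches teacher
loss_bad = distillation_loss(np.zeros_like(teacher_logits), teacher_logits)      # uninformed student
print(loss_same, loss_bad)
```

Training the student would mean minimizing this loss over the training data, which requires running the teacher to get its logits for every batch.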

Advantages:

-the student model's architecture can be different from the teacher's

-biggest potential gain in speed

Disadvantages:

-need to set up training data + run teacher model while training student

-relatively expensive