Quantization vs. Pruning vs. Distillation (Efficient NLP)

March 25, 2025

Quantization - reduce parameter precision

-neural networks usually store each parameter as a 32-bit float (FP32)

Zero-point quantization

Example of FP32 -> INT8

1.Map 0s to 0s
2.Map the largest absolute FP32 value to the end of the INT8 range (-128 or 127)
3.The mapping must be linear, so scale all other values by the same factor
4.Add 128 to each element so that all elements are non-negative (FP32 0 is then stored as 128, the zero point)
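
The four steps above can be sketched in plain Python. This is only an illustration: the function names are hypothetical, and the sketch assumes a symmetric range (the largest absolute value maps to +/-127) rather than any particular library's scheme.

```python
def quantize_int8(values):
    """Quantize floats to INT8 codes stored as unsigned values in 0..255."""
    max_abs = max(abs(v) for v in values)
    # Step 2: one INT8 step corresponds to `scale` in FP32 units.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Steps 1 and 3: linear map, so 0.0 stays at 0 and the rest scale with it.
    signed = [max(-128, min(127, round(v / scale))) for v in values]
    # Step 4: shift by 128 so every code is non-negative.
    unsigned = [q + 128 for q in signed]
    return unsigned, scale


def dequantize_int8(codes, scale):
    """Undo the shift and the scaling to get approximate FP32 values back."""
    return [(c - 128) * scale for c in codes]


weights = [0.0, 0.5, -1.0, 0.25]
codes, scale = quantize_int8(weights)
restored = dequantize_int8(codes, scale)
```

The restored values differ from the originals by at most about half a quantization step, which is the precision given up for the smaller storage format.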

Weight Quantization

-store weights in INT8, dequantize into FP32 when running inference
-saves storage space but does not speed up inference, because the computation still runs in FP32
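
A rough sketch of weight-only quantization, assuming a single scale factor per weight vector. The helper names are hypothetical, and the standard-library `array` module stands in for real tensor storage.

```python
from array import array


def quantize_weights(weights):
    """Store weights as signed 8-bit integers plus one FP scale factor."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = array("b", (max(-127, min(127, round(w / scale))) for w in weights))
    return codes, scale


def linear_forward(codes, scale, inputs):
    """Dequantize each weight to float on the fly, then compute as usual."""
    return sum((c * scale) * x for c, x in zip(codes, inputs))


fp32_weights = array("f", [0.12, -0.5, 0.33, 0.9])
int8_weights, scale = quantize_weights(fp32_weights)

fp32_bytes = fp32_weights.itemsize * len(fp32_weights)  # 4 bytes per weight
int8_bytes = int8_weights.itemsize * len(int8_weights)  # 1 byte per weight
output = linear_forward(int8_weights, scale, [1.0, 2.0, 3.0, 4.0])
```

Storage drops by 4x, but `linear_forward` still does floating-point multiplies, which is why this variant does not make inference faster.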

Activation Quantization

-convert all inputs/outputs to INT8, do computations in INT8
-need to calibrate scale factors for data at each layer
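-static quantization fixes the scale factors ahead of time from calibration data; dynamic quantization computes them at runtime from the actual inputs
-clipping - quantization can only represent FP values within the calibrated range; values outside it are "clipped" to the nearest representable value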
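
A small sketch of calibration and clipping, assuming absmax scaling. The calibration batches and function names are made up for illustration.

```python
def absmax_scale(values):
    """Scale factor that maps the largest absolute value to 127."""
    max_abs = max(abs(v) for v in values)
    return max_abs / 127.0 if max_abs > 0 else 1.0


def quantize(values, scale):
    """Round to the nearest INT8 step; out-of-range values are clipped."""
    return [max(-127, min(127, round(v / scale))) for v in values]


# Static: the scale is calibrated once, on sample data, before deployment.
calibration_batches = [[0.1, -0.8, 0.4], [0.9, -0.2, 1.0]]
static_scale = absmax_scale([v for batch in calibration_batches for v in batch])

# At runtime one activation (2.5) falls outside the calibrated range [-1, 1].
runtime_activations = [0.5, -0.3, 2.5]
static_codes = quantize(runtime_activations, static_scale)

# Dynamic: the scale is recomputed from the activations actually seen.
dynamic_scale = absmax_scale(runtime_activations)
dynamic_codes = quantize(runtime_activations, dynamic_scale)
```

With the static scale, 2.5 is clipped and dequantizes back to about 1.0. With the dynamic scale it survives, at the cost of computing a scale on every call and using coarser steps for the small values.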

LLM.int8(): Mixed-precision decomposition

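-LLMs with ~6.5B+ parameters do not work well with traditional quantization
-outlier features - a few values with very large magnitudes; INT8 has only 256 representable values, so a range wide enough to cover the outliers leaves too little precision for everything else
-solution -> mixed-precision decomposition (handle the outlier dimensions in higher precision, everything else in INT8)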
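
A simplified sketch of the decomposition idea on a single dot product. The threshold value of 6.0, the function names, and the per-vector scaling are assumptions for illustration; the real method works on whole matrices.

```python
OUTLIER_THRESHOLD = 6.0  # assumed cutoff for "outlier" activations


def quantize(values):
    """Absmax-quantize a list of floats; returns integer codes and the scale."""
    max_abs = max((abs(v) for v in values), default=0.0)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    return [round(v / scale) for v in values], scale


def int8_dot(activations, weights):
    """Quantize both vectors, multiply in integers, rescale once at the end."""
    qa, sa = quantize(activations)
    qw, sw = quantize(weights)
    return sum(a * w for a, w in zip(qa, qw)) * sa * sw


def mixed_dot(activations, weights):
    """Outlier dimensions stay in floating point; the rest go through INT8."""
    outliers = {i for i, a in enumerate(activations) if abs(a) > OUTLIER_THRESHOLD}
    regular = [i for i in range(len(activations)) if i not in outliers]
    high = sum(activations[i] * weights[i] for i in outliers)
    low = int8_dot([activations[i] for i in regular],
                   [weights[i] for i in regular])
    return high + low


activations = [0.3, -0.7, 45.0, 0.2]  # 45.0 is the outlier feature
weights = [0.5, 0.1, -0.02, 0.8]

exact = sum(a * w for a, w in zip(activations, weights))
naive = int8_dot(activations, weights)   # the outlier stretches the scale
mixed = mixed_dot(activations, weights)  # the outlier is handled separately
```

In the naive version the single large activation forces a coarse step size, so the small activations collapse to codes of 1 or 2. Pulling the outlier out lets the remaining values use the full INT8 range.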

Pruning

-remove some of the connections in the NN -> sparse network
-cheaper to store, faster to compute

Magnitude Pruning

1.Pick pruning factor (0 < X < 1) -> fraction of connections to prune
2.In each layer, set the lowest fraction X of weights (by absolute value) to zero, on the logic that the weights with the lowest absolute values are the least important for inference
3.(optional) retrain the model with pruned connections to "regain" some of the knowledge lost during pruning
-pruning by itself doesn't make the model faster or smaller, because zero values still occupy storage and still take part in the computation
-need to use a sparse execution engine that can take advantage of pruned NN structure
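
A minimal sketch of magnitude pruning on one layer's weights, with the function name and the example weights chosen for illustration. The optional retraining step is omitted.

```python
def magnitude_prune(weights, fraction):
    """Zero out the `fraction` of weights with the smallest absolute value."""
    count = int(len(weights) * fraction)
    by_magnitude = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned_positions = set(by_magnitude[:count])
    return [0.0 if i in pruned_positions else w for i, w in enumerate(weights)]


layer = [0.8, -0.05, 0.3, 0.01, -0.6, 0.12, -0.02, 0.45]
pruned = magnitude_prune(layer, 0.5)

# The dense list is exactly as long as before, so nothing is saved yet.
dense_length = len(pruned)

# A sparse format stores only the surviving weights and their positions.
sparse = {i: w for i, w in enumerate(pruned) if w != 0.0}
```

The dense result has the same length as the original, which is the point made above: the savings only appear once a sparse representation such as `sparse` (and an engine that can execute it) is used.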

N:M Sparsity (Structured Pruning)

-for every group of M weights (in a specified axis/pattern), only N can be nonzero
-NVIDIA GPU Tensor Cores support it natively
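
A sketch of the N:M constraint, assuming the common 2:4 pattern applied along a single row. The function name and example row are hypothetical.

```python
def prune_n_m(weights, n=2, m=4):
    """In every consecutive group of m weights, keep the n largest by magnitude."""
    result = []
    for start in range(0, len(weights), m):
        group = weights[start:start + m]
        ranked = sorted(range(len(group)), key=lambda i: abs(group[i]), reverse=True)
        keep = set(ranked[:n])
        result.extend(w if i in keep else 0.0 for i, w in enumerate(group))
    return result


row = [0.9, 0.8, 0.7, 0.6, 0.1, 0.05, -0.2, 0.02]
sparse_row = prune_n_m(row, n=2, m=4)
```

Unlike layer-wide magnitude pruning, the constraint is local: 0.7 and 0.6 are dropped even though they are larger than anything kept in the second group. That regular pattern is what makes hardware support practical.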

Knowledge/Model Distillation

-use training data to create a teacher network and a student network
-train the student network to predict the outputs of the teacher network
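
One common way to make the student match the teacher is to minimize the KL divergence between their temperature-softened output distributions. The notes do not specify a loss, so the temperature, the function names, and the example logits here are all assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; higher temperature gives softer outputs."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) over the softened class distributions."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return sum(t * math.log(t / s) for t, s in zip(p_teacher, p_student) if t > 0)


teacher_logits = [4.0, 1.0, 0.5]
close_student = [3.5, 1.2, 0.4]  # roughly agrees with the teacher
far_student = [0.5, 1.0, 4.0]    # prefers a different class

loss_close = distillation_loss(close_student, teacher_logits)
loss_far = distillation_loss(far_student, teacher_logits)
```

The loss is zero when the student reproduces the teacher exactly and grows as the two distributions diverge, so minimizing it pushes the student toward the teacher's behavior.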

Advantages:

-can modify the architecture of the student model to be different than the teacher's
-biggest potential gain in speed

Disadvantages:

-need to set up training data + run teacher model while training student
-relatively expensive

Engineering Optimizations