
GPT family models can be pruned to at least 50% sparsity in one-shot, without any retraining, at minimal loss of accuracy.
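This result is from SparseGPT (Frantar & Alistarh, 2023), which prunes each layer via an approximate second-order weight reconstruction. As a much simpler illustration of what "one-shot, no retraining" pruning means, here is a magnitude-pruning sketch (the baseline SparseGPT is compared against), not SparseGPT itself:

```python
import torch

def magnitude_prune(weight: torch.Tensor, sparsity: float = 0.5) -> torch.Tensor:
    """One-shot unstructured pruning: zero out the smallest-magnitude weights."""
    k = int(weight.numel() * sparsity)                # number of weights to drop
    threshold = weight.abs().flatten().kthvalue(k).values
    mask = weight.abs() > threshold                   # keep weights above the cutoff
    return weight * mask

w = torch.randn(4096, 4096)
w_sparse = magnitude_prune(w, sparsity=0.5)
print(f"sparsity: {(w_sparse == 0).float().mean():.2%}")   # ~50%
```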

Deep models are fundamentally composed of matrices.
Therefore, they can be compressed by applying matrix compression (low-rank approximation) techniques, such as truncated Singular Value Decomposition (SVD):

$$W \approx U_k \Sigma_k V_k^\top,$$

where $U_k \in \mathbb{R}^{m \times k}$, $\Sigma_k \in \mathbb{R}^{k \times k}$, $V_k \in \mathbb{R}^{n \times k}$, and the retained rank $k \ll \min(m, n)$.
Speed up layers in a CNN by a factor of 2–13×, with negligible loss of performance.
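A minimal sketch of how truncated SVD compresses a linear layer: the (out × in) weight matrix is replaced by two thin factors, cutting both parameters and multiply-adds when the rank is small. The layer shapes and rank here are illustrative assumptions.

```python
import torch

def svd_compress(weight: torch.Tensor, rank: int):
    """Factor an (out, in) weight matrix into two low-rank matrices,
    so y = W x becomes y = A (B x): m*n parameters -> rank*(m + n)."""
    U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
    A = U[:, :rank] * S[:rank]   # (out, rank), singular values folded in
    B = Vh[:rank, :]             # (rank, in)
    return A, B

W = torch.randn(1024, 1024)
A, B = svd_compress(W, rank=64)
rel_err = torch.linalg.matrix_norm(W - A @ B) / torch.linalg.matrix_norm(W)
print(f"params: {W.numel()} -> {A.numel() + B.numel()}, rel. error {rel_err:.3f}")
```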

Experiments:


Fisher information matrix:

$$F = \mathbb{E}\!\left[\nabla_\theta \log p(y \mid x; \theta)\, \nabla_\theta \log p(y \mid x; \theta)^\top\right]$$

It measures how sensitive the model's likelihood is to each parameter, so it can serve as a per-parameter importance weight when deciding what to compress.
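A hedged sketch of how the (diagonal) empirical Fisher is typically estimated in practice: accumulate squared gradients of the log-likelihood over a small calibration set. `model`, `calib_loader`, and `loss_fn` are placeholder names, not from the original slides.

```python
import torch

def empirical_fisher_diag(model, calib_loader, loss_fn):
    """Diagonal of the empirical Fisher: average squared gradient of the
    negative log-likelihood over a calibration set (placeholder data)."""
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    for x, y in calib_loader:
        model.zero_grad()
        loss_fn(model(x), y).backward()        # NLL, e.g. cross-entropy
        for n, p in model.named_parameters():
            if p.grad is not None:
                fisher[n] += p.grad.detach() ** 2
    return {n: f / len(calib_loader) for n, f in fisher.items()}
```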

Experiments:

Knowledge distillation (KD) was first proposed by Hinton et al. at the NIPS 2014 Deep Learning Workshop. The prototype of KD can be traced back to the model-compression work at KDD 2006.
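In Hinton-style KD, the student is trained to match the teacher's temperature-softened output distribution alongside the usual hard labels. A minimal sketch (the logits below are random placeholders; T and alpha are typical but arbitrary choices):

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """Hinton-style distillation: KL to the teacher's softened distribution,
    blended with standard hard-label cross-entropy."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * T * T                        # T^2 keeps gradient magnitudes comparable
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

s, t = torch.randn(8, 10), torch.randn(8, 10)   # student / teacher logits
y = torch.randint(0, 10, (8,))
print(kd_loss(s, t, y))
```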


Information Leakage!

Adapted from Yibo's slides.






The DeepSeek team uses DeepSeek-R1 as the teacher model to generate 800K training samples, and fine-tunes several small dense models (based on Qwen and Llama) on them. The results are promising.
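This style of distillation reduces to supervised fine-tuning on teacher-generated (prompt, response) pairs; no teacher logits are needed at training time. A hedged sketch of one training step; `student`, `tokenizer`, `teacher_samples`, and `optimizer` are placeholders, and this is not the DeepSeek training code:

```python
import torch.nn.functional as F

def distill_step(student, tokenizer, teacher_samples, optimizer):
    """One SFT step on teacher-generated text (sequence-level distillation)."""
    texts = [prompt + response for prompt, response in teacher_samples]
    batch = tokenizer(texts, return_tensors="pt", padding=True)
    logits = student(**batch).logits
    # standard next-token cross-entropy against the teacher's outputs
    loss = F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        batch["input_ids"][:, 1:].reshape(-1),
        ignore_index=tokenizer.pad_token_id,
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```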

The key idea is to map the floating-point weights and/or activations of the model to low-precision representations, such as integers (Quantization for DNN: A Survey).
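For example, uniform affine quantization maps a float tensor to int8 via a scale and a zero-point; a minimal per-tensor, asymmetric sketch:

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Uniform affine (asymmetric) quantization of a float tensor to int8."""
    qmin, qmax = -128, 127
    scale = (x.max() - x.min()) / (qmax - qmin)
    zero_point = int(round(qmin - x.min() / scale))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.int8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return (q.astype(np.float32) - zero_point) * scale

x = np.random.randn(1000).astype(np.float32)
q, s, z = quantize_int8(x)
print("max abs error:", np.abs(x - dequantize(q, s, z)).max())
```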




gpt-oss Quantization
gpt-oss-20b can run on systems with as little as 16 GB of memory! The magic comes from MXFP4, a new block floating-point format: each block of 32 values shares a single power-of-two scale, and every element is stored in just 4 bits (FP4, E2M1).
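A toy sketch of the block floating-point idea behind MXFP4 (it approximates, rather than reproduces, the OCP Microscaling spec's exact rounding rules):

```python
import numpy as np

# magnitudes representable in FP4 (E2M1)
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def mxfp4_quantize(x: np.ndarray, block: int = 32):
    """Toy MXFP4: each block of 32 values shares a power-of-two scale,
    and each element is rounded to the nearest FP4 (E2M1) value."""
    shape = x.shape
    x = x.reshape(-1, block)
    # shared scale: power of two mapping the block max near FP4's max (6.0)
    amax = np.abs(x).max(axis=1, keepdims=True)
    scale = 2.0 ** np.floor(np.log2(amax / 6.0 + 1e-30))
    scaled = x / scale
    # round each magnitude to the nearest FP4 grid point
    idx = np.abs(np.abs(scaled)[..., None] - FP4_GRID).argmin(axis=-1)
    return (np.sign(scaled) * FP4_GRID[idx] * scale).reshape(shape)

x = np.random.randn(64).astype(np.float32)
print("max abs error:", np.abs(x - mxfp4_quantize(x)).max())
```

At roughly 4.25 bits per weight (a 4-bit code per element plus a shared 8-bit scale per 32-element block), 20B parameters take on the order of 10–11 GB, which is why a 16 GB system can suffice.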

THANKS