
GPT family models can be pruned to at least 50% sparsity in one-shot, without any retraining, at minimal loss of accuracy.
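This result is from SparseGPT (Frantar & Alistarh, 2023), which prunes each layer via an approximate second-order weight reconstruction. As a much simpler illustration of what "one-shot, no retraining" pruning means, here is a magnitude-pruning sketch (the baseline SparseGPT is compared against), not SparseGPT itself:

```python
import torch

def magnitude_prune(weight: torch.Tensor, sparsity: float = 0.5) -> torch.Tensor:
    """One-shot unstructured pruning: zero out the smallest-magnitude weights."""
    k = int(weight.numel() * sparsity)                # number of weights to drop
    threshold = weight.abs().flatten().kthvalue(k).values
    mask = weight.abs() > threshold                   # keep weights above the cutoff
    return weight * mask

w = torch.randn(4096, 4096)
w_sparse = magnitude_prune(w, sparsity=0.5)
print(f"sparsity: {(w_sparse == 0).float().mean():.2%}")   # ~50%
```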

Deep models are fundamentally composed of matrices.
Therefore, they can be compressed by applying matrix compression (low-rank approximation) techniques, such as truncated Singular Value Decomposition (SVD):

$$W \approx U_k \Sigma_k V_k^\top,$$

where $U_k \in \mathbb{R}^{m \times k}$, $\Sigma_k \in \mathbb{R}^{k \times k}$, $V_k \in \mathbb{R}^{n \times k}$, and the retained rank $k \ll \min(m, n)$.
Speed up layers in a CNN by a factor of 2–13×, with negligible loss of performance.
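A minimal sketch of how truncated SVD compresses a linear layer: the (out × in) weight matrix is replaced by two thin factors, cutting both parameters and multiply-adds when the rank is small. The layer shapes and rank here are illustrative assumptions.

```python
import torch

def svd_compress(weight: torch.Tensor, rank: int):
    """Factor an (out, in) weight matrix into two low-rank matrices,
    so y = W x becomes y = A (B x): m*n parameters -> rank*(m + n)."""
    U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
    A = U[:, :rank] * S[:rank]   # (out, rank), singular values folded in
    B = Vh[:rank, :]             # (rank, in)
    return A, B

W = torch.randn(1024, 1024)
A, B = svd_compress(W, rank=64)
rel_err = torch.linalg.matrix_norm(W - A @ B) / torch.linalg.matrix_norm(W)
print(f"params: {W.numel()} -> {A.numel() + B.numel()}, rel. error {rel_err:.3f}")
```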

Experiments:


Fisher information matrix:

$$F = \mathbb{E}\!\left[\nabla_\theta \log p(y \mid x; \theta)\, \nabla_\theta \log p(y \mid x; \theta)^\top\right]$$

It measures how sensitive the model's likelihood is to each parameter, so it can serve as a per-parameter importance weight when deciding what to compress.
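A hedged sketch of how the (diagonal) empirical Fisher is typically estimated in practice: accumulate squared gradients of the log-likelihood over a small calibration set. `model`, `calib_loader`, and `loss_fn` are placeholder names, not from the original slides.

```python
import torch

def empirical_fisher_diag(model, calib_loader, loss_fn):
    """Diagonal of the empirical Fisher: average squared gradient of the
    negative log-likelihood over a calibration set (placeholder data)."""
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    for x, y in calib_loader:
        model.zero_grad()
        loss_fn(model(x), y).backward()        # NLL, e.g. cross-entropy
        for n, p in model.named_parameters():
            if p.grad is not None:
                fisher[n] += p.grad.detach() ** 2
    return {n: f / len(calib_loader) for n, f in fisher.items()}
```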

Experiments:

Knowledge distillation (KD) was first proposed by Hinton et al. at the NIPS 2014 Deep Learning Workshop. The prototype of KD can be traced back to the model-compression work at KDD 2006.
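In Hinton-style KD, the student is trained to match the teacher's temperature-softened output distribution alongside the usual hard labels. A minimal sketch (the logits below are random placeholders; T and alpha are typical but arbitrary choices):

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """Hinton-style distillation: KL to the teacher's softened distribution,
    blended with standard hard-label cross-entropy."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * T * T                        # T^2 keeps gradient magnitudes comparable
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

s, t = torch.randn(8, 10), torch.randn(8, 10)   # student / teacher logits
y = torch.randint(0, 10, (8,))
print(kd_loss(s, t, y))
```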


Information Leakage!

Adapted from Yibo's slides.






The DeepSeek team uses DeepSeek-R1 as the teacher model to generate 800K training samples, and fine-tunes several small dense models (based on Qwen and Llama) on them. The results are promising.
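This style of distillation reduces to supervised fine-tuning on teacher-generated (prompt, response) pairs; no teacher logits are needed at training time. A hedged sketch of one training step; `student`, `tokenizer`, `teacher_samples`, and `optimizer` are placeholders, and this is not the DeepSeek training code:

```python
import torch.nn.functional as F

def distill_step(student, tokenizer, teacher_samples, optimizer):
    """One SFT step on teacher-generated text (sequence-level distillation)."""
    texts = [prompt + response for prompt, response in teacher_samples]
    batch = tokenizer(texts, return_tensors="pt", padding=True)
    logits = student(**batch).logits
    # standard next-token cross-entropy against the teacher's outputs
    loss = F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        batch["input_ids"][:, 1:].reshape(-1),
        ignore_index=tokenizer.pad_token_id,
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```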

The key idea is to map the floating-point weights and/or activations of the model to low-precision representations, such as integers (Quantization for DNN: A Survey).
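For example, uniform affine quantization maps a float tensor to int8 via a scale and a zero-point; a minimal per-tensor, asymmetric sketch:

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Uniform affine (asymmetric) quantization of a float tensor to int8."""
    qmin, qmax = -128, 127
    scale = (x.max() - x.min()) / (qmax - qmin)
    zero_point = int(round(qmin - x.min() / scale))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.int8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return (q.astype(np.float32) - zero_point) * scale

x = np.random.randn(1000).astype(np.float32)
q, s, z = quantize_int8(x)
print("max abs error:", np.abs(x - dequantize(q, s, z)).max())
```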




gpt-oss Quantization
gpt-oss-20b can run on systems with as little as 16 GB of memory! The magic comes from MXFP4, a new block floating-point format: each block of 32 values shares a single power-of-two scale, and every element is stored in just 4 bits (FP4, E2M1).
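A toy sketch of the block floating-point idea behind MXFP4 (it approximates, rather than reproduces, the OCP Microscaling spec's exact rounding rules):

```python
import numpy as np

# magnitudes representable in FP4 (E2M1)
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def mxfp4_quantize(x: np.ndarray, block: int = 32):
    """Toy MXFP4: each block of 32 values shares a power-of-two scale,
    and each element is rounded to the nearest FP4 (E2M1) value."""
    shape = x.shape
    x = x.reshape(-1, block)
    # shared scale: power of two mapping the block max near FP4's max (6.0)
    amax = np.abs(x).max(axis=1, keepdims=True)
    scale = 2.0 ** np.floor(np.log2(amax / 6.0 + 1e-30))
    scaled = x / scale
    # round each magnitude to the nearest FP4 grid point
    idx = np.abs(np.abs(scaled)[..., None] - FP4_GRID).argmin(axis=-1)
    return (np.sign(scaled) * FP4_GRID[idx] * scale).reshape(shape)

x = np.random.randn(64).astype(np.float32)
print("max abs error:", np.abs(x - mxfp4_quantize(x)).max())
```

At roughly 4.25 bits per weight (a 4-bit code per element plus a shared 8-bit scale per 32-element block), 20B parameters take on the order of 10–11 GB, which is why a 16 GB system can suffice.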

THANKS