Modern foundation models are often compressed via a combination of structured pruning and re-training to meet the strict compute, memory, and connectivity constraints of edge deployments. While state-of-the-art (SoTA) pruning schemes target the entire Transformer, we adopt a simple, layer-wise L2-norm pruning on only the multi-layer perceptron (MLP) blocks as a fixed baseline. Our focus is not on achieving maximal compression, but on isolating the impact of the re-training loss function: (i) L2-norm Pruning with Cross-Entropy Fine-Tuning (L2PFT), which relies on labeled data, versus (ii) L2-norm Pruning with KL-Divergence Self-Distillation (L2PSD), which utilizes only teacher logits without requiring labeled data. We evaluate both pipelines on the OLMo2-7B-SFT model for CommonsenseQA, a setting suited to the intermittent or denied connectivity scenarios typical of edge networks. Under identical pruning schedules, L2PSD achieves comparable or superior test accuracy to L2PFT, indicating that the choice of loss function has a significant impact on compressed model recovery in resource-constrained environments.
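To make the two compared objectives concrete, the sketch below illustrates (in PyTorch) layer-wise L2-norm pruning of an MLP linear layer together with a cross-entropy fine-tuning loss (as in L2PFT) and a KL-divergence distillation loss on teacher logits (as in L2PSD). This is not the paper's implementation; the function names, the `keep_ratio` parameter, and the temperature value are illustrative assumptions.

```python
# Minimal sketch of the two re-training objectives compared in the abstract.
# Assumes a PyTorch model whose MLP blocks expose nn.Linear layers; names
# such as `keep_ratio` and `temperature` are illustrative, not from the paper.
import torch
import torch.nn.functional as F


def l2_prune_mlp(linear: torch.nn.Linear, keep_ratio: float = 0.5) -> None:
    """Zero out the output units of an MLP linear layer with the smallest
    L2 norms -- a simple layer-wise structured pruning criterion."""
    with torch.no_grad():
        norms = linear.weight.norm(p=2, dim=1)   # one norm per output unit (row)
        k = int(keep_ratio * norms.numel())
        keep = torch.topk(norms, k).indices
        mask = torch.zeros_like(norms, dtype=torch.bool)
        mask[keep] = True
        linear.weight[~mask] = 0.0
        if linear.bias is not None:
            linear.bias[~mask] = 0.0


def l2pft_loss(student_logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Cross-entropy fine-tuning loss: requires labeled data (L2PFT-style)."""
    return F.cross_entropy(student_logits, labels)


def l2psd_loss(student_logits: torch.Tensor,
               teacher_logits: torch.Tensor,
               temperature: float = 2.0) -> torch.Tensor:
    """KL-divergence self-distillation loss: needs only the (unpruned)
    teacher's logits, no labels (L2PSD-style)."""
    s = F.log_softmax(student_logits / temperature, dim=-1)
    t = F.softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(s, t, reduction="batchmean") * temperature ** 2
```

In a recovery loop, the pruned student would minimize either `l2pft_loss` against ground-truth labels or `l2psd_loss` against logits produced by the original, unpruned model; the latter removes the dependence on labeled data, which is the property the abstract highlights for disconnected edge settings.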
Details
Title
Constrained Edge AI Deployment: Fine-Tuning vs. Distillation for LLM Compression
Publication Details
MILCOM 2025 - 2025 IEEE Military Communications Conference (MILCOM), pp. 1500-1505
Resource Type
Conference proceeding
Conference
IEEE Military Communications Conference (MILCOM) (Los Angeles, California, USA, 10/06/2025–10/10/2025)