Constrained Edge AI Deployment: Fine-Tuning vs. Distillation for LLM Compression
Conference proceeding (peer reviewed)

Jacob Sander, David Moe, Achraf Cohen, Brian Jalaian, Brent Venable and Venkat R. Dasari
MILCOM 2025 - 2025 IEEE Military Communications Conference (MILCOM), pp.1500-1505
IEEE Military Communications Conference
IEEE Military Communications Conference (MILCOM) (Los Angeles, California, USA, 10/06/2025–10/10/2025)
10/2025
Web of Science ID: WOS:001708627800257

Abstract

Modern foundation models are often compressed through a combination of structured pruning and re-training to meet the strict compute, memory, and connectivity constraints of edge deployments. While state-of-the-art (SoTA) pruning schemes target the entire Transformer, we adopt a simple, layer-wise L2-norm pruning of only the multi-layer perceptron (MLP) blocks as a fixed baseline. Our focus is not on achieving maximal compression but on isolating the impact of the re-training loss function: (i) L2-norm Pruning with Cross-Entropy Fine-Tuning (L2PFT), which relies on labeled data, versus (ii) L2-norm Pruning with KL-Divergence Self-Distillation (L2PSD), which uses only teacher logits and requires no labeled data. We evaluate both pipelines on the OLMo2-7B-SFT model using CommonsenseQA, in a setting suited to the intermittent or denied connectivity typical of edge networks. Under identical pruning schedules, L2PSD achieves test accuracy comparable or superior to that of L2PFT, indicating that the choice of loss function has a significant impact on how well a compressed model recovers in resource-constrained environments.
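The MLP-only pruning step and the two re-training objectives named in the abstract can be illustrated in a few lines. What follows is a minimal NumPy sketch under stated assumptions, not the authors' implementation. It assumes a plain two-matrix MLP (a gated MLP would slice its extra projection the same way), scores each intermediate neuron by the combined L2 norm of its incoming row and outgoing column, and applies the common temperature-squared scaling to the KL term. The function names `prune_mlp_l2`, `cross_entropy_loss`, and `kl_self_distillation_loss` are hypothetical.

```python
import numpy as np


def log_softmax(logits, axis=-1):
    # Numerically stable log-softmax: subtract the row max before exponentiating.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def prune_mlp_l2(w_in, w_out, keep_ratio):
    """Layer-wise structured pruning of one MLP block's intermediate neurons.

    Assumed (hypothetical) shapes:
      w_in:  (d_ff, d_model)  row i holds the incoming weights of neuron i
      w_out: (d_model, d_ff)  column i holds the outgoing weights of neuron i
    Each neuron is scored by the L2 norm of its incoming and outgoing weights,
    and the top `keep_ratio` fraction is kept within this layer only.
    """
    d_ff = w_in.shape[0]
    n_keep = max(1, int(round(keep_ratio * d_ff)))
    scores = np.sqrt((w_in ** 2).sum(axis=1) + (w_out ** 2).sum(axis=0))
    # Indices of the highest-scoring neurons, re-sorted to keep original order.
    keep = np.sort(np.argsort(scores)[::-1][:n_keep])
    return w_in[keep, :], w_out[:, keep], keep


def cross_entropy_loss(student_logits, labels):
    # L2PFT-style objective: requires ground-truth label indices.
    logp = log_softmax(student_logits)
    return -logp[np.arange(len(labels)), labels].mean()


def kl_self_distillation_loss(student_logits, teacher_logits, temperature=1.0):
    # L2PSD-style objective: KL(teacher || student) on temperature-softened
    # distributions. Only the unpruned teacher's logits are needed, no labels.
    log_p_t = log_softmax(teacher_logits / temperature)
    log_p_s = log_softmax(student_logits / temperature)
    kl = (np.exp(log_p_t) * (log_p_t - log_p_s)).sum(axis=-1)
    return (temperature ** 2) * kl.mean()


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w_in = rng.normal(size=(32, 8))
    w_out = rng.normal(size=(8, 32))
    w_in_p, w_out_p, kept = prune_mlp_l2(w_in, w_out, keep_ratio=0.5)
    print(w_in_p.shape, w_out_p.shape)  # (16, 8) (8, 16)
```

The only interface difference between the two objectives is the second argument: `cross_entropy_loss` needs `labels`, while `kl_self_distillation_loss` needs only logits from the unpruned teacher, which is what lets the self-distillation pipeline run without labeled data.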
