LLM Serving Optimization Techniques: A Comprehensive Analysis
DOI: https://doi.org/10.32996/jcsts.2025.7.5.23

Keywords: Memory optimization, hardware acceleration, quantization, dynamic batching, model parallelism

Abstract
This article presents a comprehensive analysis of optimization techniques for serving Large Language Models (LLMs), addressing the critical challenges posed by their exponential growth in size and computational requirements. It examines four key areas of optimization: hardware acceleration, serving architecture design, model compression, and dynamic scaling strategies. The article synthesizes findings from multiple studies that demonstrate significant improvements in memory efficiency, throughput, latency, and cost-effectiveness through innovative approaches, including parameter-centric memory management, near-storage processing, adaptive batching, model parallelism, quantization, pruning, and intelligent caching. It also explores promising future directions in hardware-software co-design and advanced compiler optimizations that could further democratize access to these powerful models. The collective impact of these techniques enables more efficient deployment of LLMs across diverse computing environments, from high-performance data centers to resource-constrained edge devices.