LLM Serving Optimization Techniques: A Comprehensive Analysis
DOI: https://doi.org/10.32996/jcsts.2025.7.5.23

Keywords: Memory optimization, hardware acceleration, quantization, dynamic batching, model parallelism

Abstract
This article presents a comprehensive analysis of optimization techniques for serving Large Language Models (LLMs), addressing the critical challenges posed by their exponential growth in size and computational requirements. It examines four key areas of optimization: hardware acceleration, serving architecture design, model compression, and dynamic scaling strategies. The article synthesizes findings from multiple studies that demonstrate significant improvements in memory efficiency, throughput, latency, and cost-effectiveness through innovative approaches, including parameter-centric memory management, near-storage processing, adaptive batching, model parallelism, quantization, pruning, and intelligent caching. It also explores promising future directions in hardware-software co-design and advanced compiler optimizations that could further democratize access to these powerful models. The collective impact of these techniques enables more efficient deployment of LLMs across diverse computing environments, from high-performance data centers to resource-constrained edge devices.