Efficient LLM Deployment at the Edge Through Quantization
The widespread adoption of large language models (LLMs) has driven a wave of new applications, with inference expected to account for roughly 90% of the costs of LLM applications, compared to only 10% for training. This cost disparity, together with the environmental impact of inference and data privacy concerns, has underscored the need for optimization at the edge. Quantization has emerged as a key technique, offering significant gains in compute and memory efficiency. In this presentation, we will examine modern quantization techniques that enable LLM deployment at the edge. We will cover popular methods including AWQ, SmoothQuant, and Block Quantization, examining their trade-offs and optimizations. Using open-source models such as Llama, OPT, and Mistral, together with llama.cpp, a widely used C++ inference implementation, as a case study, we will analyze the impact of quantization on model performance and share best practices for achieving end-to-end efficiency in LLM deployments.
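As background for the block-quantization methods the talk will cover, the sketch below (not taken from the speaker's materials) illustrates the basic idea behind blockwise int8 weight quantization, similar in spirit to the grouped schemes used in llama.cpp: each small block of weights shares one floating-point scale, so a few bytes of metadata per block buy roughly a 4x reduction in weight memory versus fp32. Function names and the block size of 32 are illustrative assumptions.

```python
import numpy as np

def quantize_blockwise(weights, block_size=32):
    """Symmetric int8 block quantization (illustrative sketch).

    Each block of `block_size` values shares a single fp32 scale,
    so an outlier only degrades precision within its own block."""
    flat = weights.reshape(-1, block_size)
    # Per-block scale: map the largest magnitude in each block to 127.
    scales = np.abs(flat).max(axis=1, keepdims=True) / 127.0
    scales = np.where(scales == 0, 1.0, scales)  # avoid divide-by-zero for all-zero blocks
    q = np.clip(np.round(flat / scales), -127, 127).astype(np.int8)
    return q, scales.astype(np.float32)

def dequantize_blockwise(q, scales, shape):
    """Recover an approximate fp32 tensor from int8 values and per-block scales."""
    return (q.astype(np.float32) * scales).reshape(shape)

# Example: quantize a synthetic 4096x4096 weight matrix and check the error.
w = np.random.randn(4096, 4096).astype(np.float32)
q, s = quantize_blockwise(w)
w_hat = dequantize_blockwise(q, s, w.shape)
print("mean abs error:", np.abs(w - w_hat).mean())
print("memory: fp32 %.1f MB -> int8 + scales %.1f MB"
      % (w.nbytes / 1e6, (q.nbytes + s.nbytes) / 1e6))
```

Methods such as AWQ and SmoothQuant build on this foundation by choosing scales more carefully (e.g., protecting activation-salient weight channels or migrating quantization difficulty from activations to weights), which is where the trade-offs discussed in the talk come in.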
Speaker: Dwith Chenna, Advanced Micro Devices
Attend in person or online (see weblink)
Tuesday, 07/16/24
Cost: Free