Efficient LLM Deployment at the Edge Through Quantization
The widespread adoption of large language models (LLMs) has driven a wave of new applications, with inference expected to account for roughly 90% of the costs of LLM applications, compared to only 10% for training. This cost disparity, together with the environmental impact of inference and data privacy concerns, has underscored the need for optimization at the edge. Quantization has emerged as a key technique, offering significant gains in compute and memory efficiency. In this presentation, we will examine modern quantization techniques that enable LLM deployment at the edge. We will cover popular methods including AWQ, SmoothQuant, and Block Quantization, examining their trade-offs and optimizations. Using open-source models such as Llama, OPT, and Mistral, together with llama.cpp, a widely used C++ inference implementation, as a case study, we will analyze the impact of quantization on model performance and share best practices for achieving end-to-end efficiency in LLM deployments.
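As background for the block-quantization methods the talk will cover, the sketch below (not taken from the speaker's materials) illustrates the basic idea behind blockwise int8 weight quantization, similar in spirit to the grouped schemes used in llama.cpp: each small block of weights shares one floating-point scale, so a few bytes of metadata per block buy roughly a 4x reduction in weight memory versus fp32. Function names and the block size of 32 are illustrative assumptions.

```python
import numpy as np

def quantize_blockwise(weights, block_size=32):
    """Symmetric int8 block quantization (illustrative sketch).

    Each block of `block_size` values shares a single fp32 scale,
    so an outlier only degrades precision within its own block."""
    flat = weights.reshape(-1, block_size)
    # Per-block scale: map the largest magnitude in each block to 127.
    scales = np.abs(flat).max(axis=1, keepdims=True) / 127.0
    scales = np.where(scales == 0, 1.0, scales)  # avoid divide-by-zero for all-zero blocks
    q = np.clip(np.round(flat / scales), -127, 127).astype(np.int8)
    return q, scales.astype(np.float32)

def dequantize_blockwise(q, scales, shape):
    """Recover an approximate fp32 tensor from int8 values and per-block scales."""
    return (q.astype(np.float32) * scales).reshape(shape)

# Example: quantize a synthetic 4096x4096 weight matrix and check the error.
w = np.random.randn(4096, 4096).astype(np.float32)
q, s = quantize_blockwise(w)
w_hat = dequantize_blockwise(q, s, w.shape)
print("mean abs error:", np.abs(w - w_hat).mean())
print("memory: fp32 %.1f MB -> int8 + scales %.1f MB"
      % (w.nbytes / 1e6, (q.nbytes + s.nbytes) / 1e6))
```

Methods such as AWQ and SmoothQuant build on this foundation by choosing scales more carefully (e.g., protecting activation-salient weight channels or migrating quantization difficulty from activations to weights), which is where the trade-offs discussed in the talk come in.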
Speaker: Dwith Chenna, Advanced Micro Devices
Attend in person or online (see weblink)
Tuesday, 07/16/24
Cost: Free