Deploying Accelerated Llama 3.2 from the Edge to the Cloud

This blog post was originally published at NVIDIA’s website. It is reprinted here with the permission of NVIDIA.

Expanding the open-source Meta Llama collection of models, the Llama 3.2 collection includes vision language models (VLMs), small language models (SLMs), and an updated Llama Guard model with support for vision. When paired with the NVIDIA accelerated computing platform, Llama 3.2 offers developers, researchers, and enterprises valuable new capabilities and optimizations to realize their generative AI use cases.

Trained on NVIDIA H100 Tensor Core GPUs, the SLMs in 1B and 3B sizes are ideal for deploying Llama-based AI assistants across edge devices. The VLMs in 11B and 90B sizes support text and image inputs and output text. With multimodal support, the VLMs help developers build powerful applications requiring visual grounding, reasoning, and understanding. For example, they can build AI agents for image captioning, image-text retrieval, visual Q&A, and document Q&A, among others. The Llama Guard models now also support image input guardrails in addition to text input.

The Llama 3.2 models are auto-regressive language models built on an optimized transformer architecture. The instruction-tuned versions use supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF) to align with human preferences for helpfulness and safety. All models support a long context length of 128K tokens and are optimized for inference with support for grouped query attention (GQA).
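To make the GQA idea concrete, here is a minimal sketch in plain PyTorch; the head counts and dimensions are illustrative placeholders, not the actual Llama 3.2 configuration. A small number of key/value heads is shared across a larger number of query heads, which shrinks the KV cache that must be kept around for long contexts.

import torch
import torch.nn.functional as F

# Illustrative sizes only; not the real Llama 3.2 configuration.
batch, seq_len, head_dim = 1, 16, 64
n_q_heads, n_kv_heads = 8, 2            # four query heads share each KV head
group = n_q_heads // n_kv_heads

q = torch.randn(batch, n_q_heads, seq_len, head_dim)
k = torch.randn(batch, n_kv_heads, seq_len, head_dim)   # cached: only 2 heads
v = torch.randn(batch, n_kv_heads, seq_len, head_dim)

# Expand KV heads at compute time so every query head has a partner,
# without storing the expanded copies in the KV cache.
k = k.repeat_interleave(group, dim=1)
v = v.repeat_interleave(group, dim=1)

out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([1, 8, 16, 64])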

NVIDIA is optimizing the Llama 3.2 collection of models to deliver high throughput and low latency across millions of GPUs worldwide—from data centers to local workstations with NVIDIA RTX, and at the edge with NVIDIA Jetson. This post describes the hardware and software optimizations, customizations, and ease-of-deployment capabilities.

Accelerating Llama 3.2 performance with NVIDIA TensorRT

NVIDIA is accelerating the Llama 3.2 model collection to reduce cost and latency while delivering unparalleled throughput and providing an optimal end-user experience. NVIDIA TensorRT includes TensorRT and TensorRT-LLM libraries for high-performance deep learning inference.

The Llama 3.2 1B and Llama 3.2 3B models are being accelerated for long-context support in TensorRT-LLM using the scaled rotary position embedding (RoPE) technique and several other optimizations, including KV caching and in-flight batching.
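As a rough sketch of what running one of these models through TensorRT-LLM can look like, the snippet below uses the library's high-level LLM API; the model name and sampling settings are assumptions, and the exact options available depend on the installed TensorRT-LLM version.

from tensorrt_llm import LLM, SamplingParams

# Builds or loads an optimized engine for the model (placeholder model ID).
llm = LLM(model="meta-llama/Llama-3.2-1B-Instruct")
params = SamplingParams(max_tokens=128, temperature=0.7)

# In-flight batching and KV caching are handled by the runtime.
for output in llm.generate(["Summarize KV caching in one sentence."], params):
    print(output.outputs[0].text)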

The Llama 3.2 11B and Llama 3.2 90B models are multimodal and pair a vision encoder with a text decoder. The vision encoder is being accelerated by exporting the model into an ONNX graph and building a TensorRT engine from it. ONNX export creates a standard model definition with built-in operators and standard data types, focused on inference. TensorRT uses the ONNX graph to optimize the model for target GPUs by building the TensorRT engine. These engines offer a variety of hardware-level optimizations to maximize NVIDIA GPU utilization through layer and tensor fusion in conjunction with kernel auto-tuning.
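The export-then-build flow can be sketched as follows; the tiny stand-in module, input shape, and file names are placeholders for illustration, not the actual Llama 3.2 vision encoder.

import subprocess
import torch
import torch.nn as nn

# Tiny stand-in for a vision encoder; the real one is far larger.
vision_encoder = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=14, stride=14),
    nn.Flatten(),
    nn.Linear(16 * 40 * 40, 256),
).eval()
dummy_image = torch.randn(1, 3, 560, 560)

# 1) Export to an ONNX graph with standard operators and data types.
torch.onnx.export(vision_encoder, dummy_image, "vision_encoder.onnx",
                  input_names=["pixel_values"], output_names=["features"],
                  opset_version=17)

# 2) Build a TensorRT engine from the ONNX graph; layer/tensor fusion and
#    kernel auto-tuning happen at this step. trtexec ships with TensorRT.
subprocess.run(["trtexec", "--onnx=vision_encoder.onnx",
                "--saveEngine=vision_encoder.plan", "--fp16"], check=True)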

The visual information from the vision encoder is fused into the Llama text decoder with a cross-attention mechanism that is supported in TensorRT-LLM. This enables the VLMs to efficiently generate text by taking into account visual reasoning and understanding in context with text input.
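A toy version of this fusion looks roughly like the following, with illustrative shapes only: the decoder hidden states act as queries, and the vision encoder outputs supply the keys and values.

import torch
import torch.nn.functional as F

d_model = 64
text_states = torch.randn(1, 32, d_model)    # decoder hidden states (queries)
image_feats = torch.randn(1, 100, d_model)   # vision encoder outputs (keys/values)

w_q, w_k, w_v = (torch.nn.Linear(d_model, d_model) for _ in range(3))

# Cross-attention: text tokens attend over visual features.
fused = F.scaled_dot_product_attention(w_q(text_states),
                                       w_k(image_feats),
                                       w_v(image_feats))
print(fused.shape)  # torch.Size([1, 32, 64]); text states enriched with visual context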

Easily deploy generative AI solutions using NVIDIA NIM

The TensorRT optimizations are available through production-ready deployments using NVIDIA NIM microservices. NIM microservices accelerate the deployment of generative AI models across NVIDIA-accelerated infrastructure anywhere, including cloud, data center, and workstations.

Llama 3.2 90B Vision Instruct, Llama 3.2 11B Vision Instruct, Llama 3.2 3B Instruct, and Llama 3.2 1B Instruct are supported through NIM microservices for production deployments. NIM provides simplified management and orchestration of generative AI workloads, standard application programming interfaces (APIs), and enterprise support with production-ready containers. With strong and growing ecosystem support, including more than 175 partners integrating their solutions with NVIDIA NIM microservices, developers, researchers, and enterprises around the world can maximize their return on investment for generative AI applications.
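Because NIM exposes OpenAI-compatible APIs, a deployed Llama 3.2 microservice can be called with standard clients. In the sketch below, the base URL, port, and model name are assumptions; use the values reported by your own NIM deployment.

from openai import OpenAI

# Placeholder endpoint and model ID for a locally hosted NIM container.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")

response = client.chat.completions.create(
    model="meta/llama-3.2-3b-instruct",
    messages=[{"role": "user", "content": "List three edge AI use cases."}],
    max_tokens=200,
)
print(response.choices[0].message.content)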

Customize and evaluate Llama 3.2 models with NVIDIA AI Foundry and NVIDIA NeMo

NVIDIA AI Foundry provides an end-to-end platform for Llama 3.2 model customizations with access to advanced AI tools, computing resources, and AI expertise. Fine-tuned on proprietary data, the custom models enable enterprises to achieve better performance and accuracy in domain-specific tasks, gaining a competitive edge.

With NVIDIA NeMo, developers can curate their training data, leverage advanced tuning techniques including LoRA, SFT, DPO, and RLHF to customize the Llama 3.2 models, evaluate for accuracy, and add guardrails to ensure appropriate responses from the models. AI Foundry provides dedicated capacity on NVIDIA DGX Cloud, and is supported by NVIDIA AI experts. The output is a custom Llama 3.2 model packaged as an NVIDIA NIM inference microservice, which can be deployed anywhere.
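NeMo provides its own recipes for these techniques; purely to illustrate what a LoRA customization does, here is a minimal sketch using the Hugging Face peft library instead (model ID and hyperparameters are placeholders). Only the small low-rank adapter matrices are trained, while the base Llama 3.2 weights stay frozen.

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")
config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                    target_modules=["q_proj", "v_proj"])
model = get_peft_model(base, config)
model.print_trainable_parameters()  # only the adapter weights are trainable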

Scale local inference with NVIDIA RTX and NVIDIA Jetson

Today, Llama 3.2 models are optimized for the 100M+ NVIDIA RTX PCs and workstations worldwide. For Windows deployments, NVIDIA has optimized this suite of models to work efficiently using the ONNX Runtime GenAI library with a DirectML backend. Get started with the Llama 3.2 3B model on NVIDIA RTX.

The new VLM and SLM models unlock new capabilities on NVIDIA RTX systems. To demonstrate, we created an example of a multimodal retrieval-augmented generation (RAG) pipeline that combines text and visual data processing (for images, plots, and charts, for example) for enhanced information retrieval and generation.
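The retrieval step of such a pipeline can be sketched as follows, assuming images have already been summarized into text by the VLM (that step is not shown) and that a small sentence-embedding model is used for indexing; the embedding model and corpus below are illustrative.

import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative embedding model

corpus = [
    "Q3 revenue grew 12% year over year.",                  # text chunk
    "Chart description: bar chart of revenue by region.",   # VLM caption of an image
    "The appendix lists all regional sales offices.",
]
doc_vecs = embedder.encode(corpus, normalize_embeddings=True)

query_vec = embedder.encode(["Which region had the highest revenue?"],
                            normalize_embeddings=True)
scores = doc_vecs @ query_vec.T                  # cosine similarity
context = corpus[int(np.argmax(scores))]
print(context)  # retrieved chunk to prepend to the LLM prompt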

Learn how to run this pipeline on NVIDIA RTX Linux systems using the Llama 3.2 SLM and VLM. Note that you’ll need a Linux workstation with an NVIDIA RTX professional GPU with 30+ GB of memory.

SLMs are tailored for local deployment on edge devices using techniques like distillation, pruning, and quantization to reduce memory, latency, and computational requirements while retaining accuracy for application-focused domains. To download and deploy the Llama 3.2 1B and 3B SLMs onboard your Jetson with optimized GPU inference and INT4/FP8 quantization, see the SLM Tutorial on NVIDIA Jetson AI Lab.
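As a toy illustration of what INT4 weight quantization does, the snippet below maps a random weight matrix into a 4-bit integer range with a per-row scale and measures the round-trip error; real Jetson deployments rely on optimized kernels rather than this NumPy demo.

import numpy as np

weights = np.random.randn(8, 64).astype(np.float32)

# Per-row scale maps the largest magnitude to the 4-bit limit (7).
scale = np.abs(weights).max(axis=1, keepdims=True) / 7.0
q = np.clip(np.round(weights / scale), -8, 7).astype(np.int8)  # values fit in 4 bits

dequant = q.astype(np.float32) * scale
print("max abs error:", float(np.abs(weights - dequant).max()))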

Multimodal models are increasingly useful in edge applications for their unique vision capabilities in video analytics and robotics. The Llama 3.2 11B VLM is supported on the embedded Jetson AGX Orin 64GB module.

Advancing community AI models

An active open-source contributor, NVIDIA is committed to optimizing community software that helps users address their toughest challenges. Open-source AI models also promote transparency and enable users to broadly share work on AI safety and resilience.

Hugging Face inference-as-a-service capabilities enable developers to rapidly deploy leading large language models (LLMs), such as the Llama 3 collection, with optimization from NVIDIA NIM microservices running on NVIDIA DGX Cloud.
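A generic sketch of calling a hosted Llama model through the Hugging Face InferenceClient is shown below; the model ID and token are placeholders, and the details of routing requests to the NIM-backed DGX Cloud service are not shown, so check the Hugging Face documentation for the exact setup.

from huggingface_hub import InferenceClient

client = InferenceClient(model="meta-llama/Llama-3.2-3B-Instruct",
                         token="hf_...")  # your Hugging Face access token

reply = client.chat_completion(
    messages=[{"role": "user", "content": "What is a vision language model?"}],
    max_tokens=128,
)
print(reply.choices[0].message.content)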

Get free access to NIM for research, development, and testing through the NVIDIA Developer Program.

Explore the NVIDIA AI inference platform further, including how NVIDIA NIM, NVIDIA TensorRT-LLM, NVIDIA TensorRT, and NVIDIA Triton use state-of-the-art techniques such as LoRA to accelerate the latest LLMs.

Anjali Shah
Senior Deep Learning Scientist, Developer Advocate Engineering Group, NVIDIA

Jay Gould
Senior Product Marketing Manager, NVIDIA

Dustin Franklin
Developer Evangelist, Jetson Team, NVIDIA

Jay Rodge
Developer Advocate, LLMs, NVIDIA

Amanda Saunders
Director, Generative AI Product Marketing, NVIDIA
