This blog post was originally published at NVIDIA’s website. It is reprinted here with the permission of NVIDIA.
The NVIDIA RTX AI for Windows PCs platform offers a thriving ecosystem of thousands of open-source models for application developers to leverage and integrate into Windows applications. One notably popular tool is llama.cpp, with over 65K GitHub stars at the time of writing. Originally released in 2023, this open-source repository is a lightweight, efficient framework for large language model (LLM) inference that runs across a range of hardware platforms, including RTX PCs.
This post explains how llama.cpp on RTX PCs offers a compelling solution for building cross-platform or Windows-native applications that require LLM functionality.
Overview of llama.cpp
While LLMs have shown promise in unlocking exciting new use cases, their large memory footprint and compute-intensive nature often make it challenging for developers to deploy them into production applications. To address this problem, llama.cpp provides a vast array of functionality to optimize model performance and deploy models efficiently on a wide range of hardware.
At its core, llama.cpp leverages the ggml tensor library for machine learning. This lightweight software stack enables cross-platform use of llama.cpp without external dependencies. Extremely memory efficient, it’s an ideal choice for local on-device inference. The model data is packaged and deployed in a customized file format called GGUF, specifically designed and implemented by llama.cpp contributors.
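As a small, hypothetical illustration of the format, the sketch below reads a GGUF file's metadata and tensor listing with the community gguf Python package, which is maintained alongside llama.cpp; the model path is a placeholder, and the exact reader API may vary between package versions.

```python
# Minimal sketch: inspect a GGUF file's metadata and tensor listing using the
# community `gguf` Python package (pip install gguf). The path is a placeholder.
from gguf import GGUFReader

reader = GGUFReader("models/llama-3-8b-instruct.Q4_K_M.gguf")

# Key/value metadata: architecture, context length, tokenizer settings, and so on
for name in reader.fields:
    print(name)

# Tensor entries: name, shape, and quantization type for every weight in the file
for tensor in reader.tensors:
    print(tensor.name, tensor.shape, tensor.tensor_type)
```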
Developers building projects on top of llama.cpp can choose from thousands of prepackaged models, covering a wide range of high-quality quantizations. A growing open-source community is actively developing the llama.cpp and ggml projects.
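One straightforward way to try such a prepackaged quantization locally is through the community llama-cpp-python bindings, as sketched below; the model path, quantization level, and parameter values are illustrative assumptions rather than recommendations.

```python
# Minimal sketch: run a prepackaged GGUF quantization locally with the community
# llama-cpp-python bindings (pip install llama-cpp-python).
# The model path and parameters below are illustrative, not prescriptive.
from llama_cpp import Llama

llm = Llama(
    model_path="models/llama-3-8b-instruct.Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=-1,  # offload all layers to the GPU when using a CUDA build
    n_ctx=4096,       # context window
)

out = llm.create_completion(
    "Explain what the GGUF file format is in one sentence.",
    max_tokens=100,
)
print(out["choices"][0]["text"])
```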
Accelerated performance of llama.cpp on NVIDIA RTX
NVIDIA continues to collaborate with the llama.cpp community to improve and optimize its performance on RTX GPUs, as well as the developer experience. Some key contributions include:
- Implementing CUDA Graphs in llama.cpp to reduce overhead and gaps between kernel executions during token generation.
- Reducing CPU overheads when preparing ggml graphs.
For more information on the latest contributions, see Optimizing llama.cpp AI Inference with CUDA Graphs.
Figure 1 shows NVIDIA internal measurements showcasing throughput performance on NVIDIA GeForce RTX GPUs using a Llama 3 8B model on llama.cpp. On the NVIDIA RTX 4090 GPU, users can expect ~150 tokens per second, with an input sequence length of 100 tokens and an output sequence length of 100 tokens.
To build the llama.cpp library using NVIDIA GPU optimizations with the CUDA backend, visit llama.cpp/docs on GitHub.
Figure 1. NVIDIA internal throughput performance measurements on NVIDIA GeForce RTX GPUs, featuring a Llama 3 8B model with an input sequence length of 100 tokens, generating 100 tokens
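For a rough, do-it-yourself check of throughput on your own GPU, one option is to time token generation directly. The sketch below mirrors the 100-token-in, 100-token-out scenario using the community llama-cpp-python bindings; it is illustrative only, uses a placeholder model path, and does not reproduce NVIDIA's internal benchmarking methodology.

```python
# Illustrative throughput check (not NVIDIA's benchmark methodology):
# time the generation of 100 tokens and report tokens per second.
# Note: the measured time also includes prompt processing, so this is a
# conservative end-to-end figure.
import time
from llama_cpp import Llama

llm = Llama(
    model_path="models/llama-3-8b-instruct.Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=-1,  # offload all layers when llama.cpp is built with CUDA
)

prompt = "word " * 100  # crude stand-in for a ~100-token input sequence

start = time.perf_counter()
out = llm.create_completion(prompt, max_tokens=100)
elapsed = time.perf_counter() - start

n_generated = out["usage"]["completion_tokens"]
print(f"{n_generated} tokens in {elapsed:.2f} s -> {n_generated / elapsed:.1f} tokens/s")
```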
Ecosystem of developers building with llama.cpp
A vast ecosystem of developer frameworks and abstractions is built on top of llama.cpp, helping developers further accelerate their application development journey. Popular developer tools such as Ollama, Homebrew, and LM Studio all extend and leverage the capabilities of llama.cpp under the hood to offer abstracted developer experiences. Key functionalities of some of these tools include configuration and dependency management, bundling of model weights, abstracted UIs, and a locally run API endpoint to an LLM.
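As one example of the locally run API endpoint pattern, the sketch below sends a chat request to a llama.cpp-based server exposing an OpenAI-compatible API, such as llama.cpp's bundled server on its default port; the base URL, port, and model name are assumptions and differ from tool to tool.

```python
# Illustrative sketch: query a locally running, OpenAI-compatible endpoint
# exposed by a llama.cpp-based tool (for example, llama.cpp's bundled server).
# The base URL, port, and model name below are assumptions; they vary by tool.
import json
import urllib.request

payload = {
    "model": "llama-3-8b-instruct",  # placeholder model name
    "messages": [
        {"role": "user", "content": "Summarize what llama.cpp does in one sentence."}
    ],
    "max_tokens": 100,
}

req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",  # assumed local server address
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

with urllib.request.urlopen(req) as resp:
    reply = json.loads(resp.read())

print(reply["choices"][0]["message"]["content"])
```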
Additionally, there is a broad ecosystem of models that are already pre-optimized and available for developers to leverage using llama.cpp on RTX systems. Notable models include the latest GGUF quantized versions of Llama 3.2 available on Hugging Face.
llama.cpp is also offered as an inference deployment mechanism as part of the NVIDIA RTX AI Toolkit.
Applications accelerated with llama.cpp on the RTX platform
More than 50 tools and apps are now accelerated with llama.cpp, including:
- Backyard.ai: With Backyard.ai, users can unleash their creativity with AI by interacting with their favorite characters virtually, in a private environment, with full ownership and control. This platform leverages llama.cpp to accelerate LLMs on RTX systems.
- Brave: Brave has built Leo, a smart AI assistant, directly into the Brave browser. With privacy-preserving Leo, users can now ask questions, summarize pages and PDFs, write code, and create new text. With Leo, users can leverage Ollama, which utilizes llama.cpp for acceleration on RTX systems, to interact with local LLMs on their devices.
- Opera: Opera has now integrated local AI models into the developer version of Opera One to augment users' browsing needs. Opera has integrated these capabilities using Ollama, leveraging the llama.cpp backend running entirely locally on NVIDIA RTX systems. In Opera’s browser AI, Aria, users can also ask the engine about web pages for summaries and translations, get more information with additional searches, generate text and images, and read the responses out loud with support for over 50 languages.
- Sourcegraph: Sourcegraph Cody is an AI coding assistant that supports the latest LLMs and uses the best developer context to provide accurate code suggestions. Cody can also work with models running on the local machine and in air-gapped environments. It leverages Ollama, which uses llama.cpp, for local inference support accelerated on NVIDIA RTX GPUs.
Get started
Using llama.cpp on RTX AI PCs offers developers a compelling solution to accelerate AI workloads on GPUs. With llama.cpp, developers can leverage a lightweight C++ implementation for LLM inference with a small installation package. Learn more and get started with llama.cpp in the NVIDIA RTX AI Toolkit.
NVIDIA is committed to contributing to and accelerating open-source software on the RTX AI platform.
Annamalai Chockalingam
Product Marketing Manager, NeMo Megatron and NeMo NLP products, NVIDIA