llama.cpp

Pure C/C++ LLM inference engine enabling local model execution on CPU and GPU with minimal dependencies and maximum portability.

—

114K+ GitHub Stars

5/5 Rating

2023 Founded

Overview

llama.cpp is a pure C/C++ implementation of LLM inference with minimal dependencies, designed for efficient local execution on consumer hardware. It supports CPU and GPU acceleration (Metal, CUDA, Vulkan) and pioneered the GGUF quantization format enabling models to run in 4-8GB RAM. The project powers countless local AI applications and has become the de facto standard for on-device LLM inference.

The Verdict

Who Should Use llama.cpp?

Best For

Running LLMs locally on personal computers without cloud costs
Privacy-focused applications requiring offline inference
Embedded and edge devices with limited resources
Developers building local-first AI applications

Not Ideal For

Non-technical users seeking plug-and-play solutions
Production applications needing managed infrastructure

What's Great

C/C++ LLM inference engine with broad platform support
Runs entirely locally with no cloud dependencies
Minimal RAM usage via advanced quantization (4-bit, 8-bit)
Cross-platform support (Windows, macOS, Linux, mobile)
Active community with frequent updates and model support

GitHub

Watch Out For

Requires command-line knowledge and manual setup
Performance varies significantly based on hardware
No managed hosting or enterprise support

GitHub Issues

Pricing

Open Source

Completely free under MIT license

View all features & details

Key Features

Pure C/C++ with zero dependencies
GGUF quantization format (2-8 bit)
CPU, Metal, CUDA, Vulkan acceleration
Support for Llama, Mistral, Qwen, and 100+ models
Built-in HTTP server for API access
Flash Attention and other optimizations

Platforms

Windows, macOS, Linux
iOS, Android (via bindings)
Raspberry Pi and embedded
Docker containers

How It Compares

Feature	llama.cpp	Ollama	LocalAI
Ease of Use	CLI-based	Very easy	Moderate
Performance	Excellent	Good	Good
Deployment	Self-hosted	Self-hosted	Self-hosted
Best For	Maximum control	Simplicity	OpenAI compat

User Reviews

Loading reviews...