TensorRT

Open-source · Free · 13k GitHub stars

NVIDIA's SDK for high-performance deep learning inference on NVIDIA GPUs, delivering up to 36x speedup over CPU-only platforms via mixed-precision optimization and model compilation.

API available · Python · Real-time · Multimodal · Self-hosted

36x Faster Than CPU

13K+ GitHub Stars

2017 Released

Overview

TensorRT is NVIDIA's SDK for high-performance deep learning inference, providing an ecosystem of tools including inference compilers, runtimes, and model optimizations that deliver low latency and high throughput for production applications. It accepts trained models from PyTorch, TensorFlow, and ONNX, then compiles and optimizes them specifically for NVIDIA GPU architectures using mixed-precision computation (FP4, FP8, INT4, INT8, FP16, BF16, FP32). TensorRT-LLM extends this to large language models with specialized optimizations for transformers, while TensorRT Model Optimizer adds pruning, distillation, and quantization for further compression.
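As a rough illustration of the compile step described above, the sketch below assembles a command line for trtexec, the command-line tool that ships with TensorRT. The helper name build_trtexec_command and the file names are hypothetical. The flags shown (--onnx, --saveEngine, --fp16, --int8) are ones trtexec has accepted in past releases, so check them against the version you install.

```python
from typing import List

# Precision flags trtexec has accepted in past releases.
# Assumption: verify these against the TensorRT version you actually install.
PRECISION_FLAGS = {"fp32": None, "fp16": "--fp16", "int8": "--int8"}


def build_trtexec_command(onnx_path: str, engine_path: str,
                          precision: str = "fp16") -> List[str]:
    """Return an argv list that asks trtexec to compile an ONNX model.

    Hypothetical helper for illustration; it only builds the command
    and does not run it.
    """
    if precision not in PRECISION_FLAGS:
        raise ValueError(f"unsupported precision: {precision}")
    cmd = ["trtexec", f"--onnx={onnx_path}", f"--saveEngine={engine_path}"]
    flag = PRECISION_FLAGS[precision]
    if flag is not None:
        cmd.append(flag)  # FP32 needs no extra flag
    return cmd


if __name__ == "__main__":
    # Prints the command; pass the list to subprocess.run to execute it.
    print(" ".join(build_trtexec_command("model.onnx", "model.engine")))
```

Returning a list rather than a single string keeps the command safe to hand to subprocess.run without shell quoting.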

The Verdict

Who Should Use TensorRT?

Best For

Teams deploying deep learning models on NVIDIA GPU infrastructure
Production inference requiring lowest possible latency
LLM serving at scale using TensorRT-LLM
Edge and embedded AI on Jetson and DriveOS platforms
Computer vision, video analytics, and speech AI pipelines

Not Ideal For

Teams running on AMD, Intel, or cloud TPU infrastructure
Rapid prototyping — compilation adds significant setup overhead
Highly dynamic model architectures that can't be optimized statically

What's Great

Up to 36x faster inference vs. CPU-only platforms
NVIDIA-reported 8x performance increase for GPT-J 6B and 4x for Llama 2 70B with TensorRT-LLM
Broad quantization support: FP4, FP8, INT4, INT8, FP16, BF16
TensorRT-LLM for transformer-specific optimizations (free, open-source)
Deploys across edge (Jetson), desktop, and data center
Native PyTorch and HuggingFace integration
5.3x better total cost of ownership for LLM workloads (NVIDIA-reported)
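To make the quantization item above concrete, here is a minimal sketch of symmetric INT8 quantization in plain Python. It shows the generic technique (one scale per tensor, derived from the largest absolute value) and is not a description of TensorRT's own calibration; the function names are hypothetical.

```python
from typing import List, Tuple


def quantize_int8(values: List[float]) -> Tuple[List[int], float]:
    """Map floats to integers in [-127, 127] using one shared scale."""
    amax = max(abs(v) for v in values)
    # Guard against an all-zero tensor, where any scale works.
    scale = amax / 127.0 if amax > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate floats from the quantized integers."""
    return [q * scale for q in quantized]


if __name__ == "__main__":
    q, scale = quantize_int8([0.3, -1.2, 0.75])
    print(q, scale, dequantize(q, scale))
```

Each recovered value differs from the original by at most half a scale step, which is the accuracy cost paid for the smaller, faster integer representation.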

Watch Out For

NVIDIA GPUs only — no AMD, Intel, or TPU support
Engine compilation is hardware-specific; engines don't transfer between GPU generations
Significant complexity for custom or dynamic architectures
TensorRT Cloud (hyper-optimized engine generation) limited to select partners
CUDA version dependencies can complicate deployment environments
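Because engines are tied to the hardware they were built on, one common precaution is to encode the build target in the engine's file name so a mismatched engine is never loaded by accident. The sketch below is one hypothetical naming scheme, not a TensorRT convention. It also includes the TensorRT version in the key, on the assumption that engines are not portable across versions by default. The compute capability would come from your own tooling (for example, torch.cuda.get_device_capability() in PyTorch).

```python
from typing import Tuple


def engine_filename(model_name: str, compute_capability: Tuple[int, int],
                    trt_version: str, precision: str) -> str:
    """Build a file name that identifies the GPU target an engine was built for."""
    major, minor = compute_capability
    return f"{model_name}_sm{major}{minor}_trt{trt_version}_{precision}.engine"


def is_engine_compatible(engine_file: str, model_name: str,
                         compute_capability: Tuple[int, int],
                         trt_version: str, precision: str) -> bool:
    """True only when the cached file was built for exactly this target."""
    expected = engine_filename(model_name, compute_capability,
                               trt_version, precision)
    return engine_file == expected


if __name__ == "__main__":
    # Example values are illustrative only.
    print(engine_filename("resnet50", (8, 6), "10.0.1", "fp16"))
```

When the check fails, the safe fallback is to rebuild the engine on the current machine rather than attempt to load the cached file.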

Pricing

Open Source

Apache 2.0 license for the open-source components on GitHub, TensorRT-LLM, and Model Optimizer; the core SDK is free to download

TensorRT Cloud

Select Partners

Hyper-optimized engine generation service; limited access program

Key Features

Mixed-precision inference (FP4, FP8, INT4, INT8, FP16, BF16, FP32)
TensorRT-LLM for LLM-specific transformer optimizations
TensorRT Model Optimizer (pruning, distillation, quantization)
Dynamic shapes and strongly-typed networks
Custom layer extensions via IPluginV3
Multi-device inference with collective operations (GA)
ONNX, PyTorch, and TensorFlow model import
Python bindings (3.10–3.14) and C++ API
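Dynamic shapes in TensorRT are declared through optimization profiles, each giving a minimum, optimum, and maximum shape for an input (in the Python API, IOptimizationProfile.set_shape takes these three shapes). The sketch below models that idea in plain Python so the range check is easy to see. The ShapeProfile class is hypothetical and does not call TensorRT.

```python
from dataclasses import dataclass
from typing import Tuple

Shape = Tuple[int, ...]


@dataclass(frozen=True)
class ShapeProfile:
    """Hypothetical stand-in for one input's min/opt/max shape range."""
    min_shape: Shape
    opt_shape: Shape
    max_shape: Shape

    def __post_init__(self) -> None:
        ranks = {len(self.min_shape), len(self.opt_shape), len(self.max_shape)}
        if len(ranks) != 1:
            raise ValueError("min, opt, and max shapes must have the same rank")
        for lo, opt, hi in zip(self.min_shape, self.opt_shape, self.max_shape):
            if not (0 <= lo <= opt <= hi):
                raise ValueError("each dimension must satisfy min <= opt <= max")

    def accepts(self, shape: Shape) -> bool:
        """True when every dimension of the runtime shape is within range."""
        if len(shape) != len(self.min_shape):
            return False
        return all(lo <= d <= hi
                   for lo, d, hi in zip(self.min_shape, shape, self.max_shape))


if __name__ == "__main__":
    profile = ShapeProfile((1, 3, 224, 224), (8, 3, 224, 224), (32, 3, 224, 224))
    print(profile.accepts((16, 3, 224, 224)), profile.accepts((64, 3, 224, 224)))
```

A runtime shape outside the declared range needs a different profile or a rebuilt engine, which is why the range is worth validating before inference.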

Supported Platforms

Linux x86-64 (Ubuntu 22.04, 24.04, 26.04)
Linux aarch64 (SBSA)
Windows x64
NVIDIA Jetson / JetPack
DriveOS (automotive)
QNX (embedded)

Use Cases

Large language model inference
Computer vision (CNNs, object detection)
Video analytics and speech AI
Automotive embedded AI (DRIVE platform)
Robotics and edge AI
Diffusion models and multimodal inference

Current Version

TensorRT 11.1.0 (June 2026)
Requires CUDA 13.3 or 12.9
CMake 3.31+ for building from source
Python 3.10–3.14 supported
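The requirements above can be turned into a quick pre-install check. This is a minimal sketch that encodes only the versions listed on this page; the function names are hypothetical, and the CUDA version string is assumed to come from your own tooling.

```python
import sys
from typing import Optional, Tuple

# Values taken from the version list above; update them if the listing changes.
PYTHON_RANGE = ((3, 10), (3, 14))
SUPPORTED_CUDA = {(13, 3), (12, 9)}


def python_supported(version: Optional[Tuple[int, int]] = None) -> bool:
    """Check a (major, minor) Python version, defaulting to the running interpreter."""
    v = tuple(version[:2]) if version is not None else tuple(sys.version_info[:2])
    return PYTHON_RANGE[0] <= v <= PYTHON_RANGE[1]


def cuda_supported(cuda_version: str) -> bool:
    """Check a CUDA version string such as '12.9' or '13.3.1' (patch is ignored)."""
    parts = cuda_version.split(".")
    if len(parts) < 2 or not (parts[0].isdigit() and parts[1].isdigit()):
        return False
    return (int(parts[0]), int(parts[1])) in SUPPORTED_CUDA


if __name__ == "__main__":
    print("Python OK:", python_supported())
    print("CUDA 12.9 OK:", cuda_supported("12.9"))
```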

How It Compares

Feature	TensorRT	vLLM	llama.cpp
Hardware	NVIDIA only	NVIDIA, AMD, Intel, TPU	CPU + GPU (multi-vendor)
Primary Use	GPU inference optimization	LLM serving throughput	Local/edge CPU inference
LLM Support	TensorRT-LLM	Native, production-ready	Native, lightweight
Quantization	FP4–FP32 (widest range)	AWQ, GPTQ, FP8	GGUF (CPU-optimized)
Latency	Best on NVIDIA	Excellent throughput	Best on CPU
Open Source	Apache 2.0	Apache 2.0	MIT
Best For	NVIDIA GPU production	High-volume LLM serving	Portable edge inference
