TensorRT
NVIDIA's SDK for high-performance deep learning inference on NVIDIA GPUs, delivering up to 36x speedup over CPU-only platforms via mixed-precision optimization and model compilation.
Overview
TensorRT is NVIDIA's SDK for high-performance deep learning inference, providing an ecosystem of tools including inference compilers, runtimes, and model optimizations that deliver low latency and high throughput for production applications. It accepts trained models from PyTorch, TensorFlow, and ONNX, then compiles and optimizes them specifically for NVIDIA GPU architectures using mixed-precision computation (FP4, FP8, INT4, INT8, FP16, BF16, FP32). TensorRT-LLM extends this to large language models with specialized optimizations for transformers, while TensorRT Model Optimizer adds pruning, distillation, and quantization for further compression.
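The compile step described above can be sketched with `trtexec`, the command-line tool that ships with TensorRT. The helper below only assembles the argument list and does not run anything. The file names are placeholders, `trtexec_command` is a hypothetical helper name, and the `--onnx`, `--saveEngine`, and `--fp16` flags are taken from TensorRT 8.x to 10.x releases, so they should be checked against the release actually in use.

```python
import shlex


def trtexec_command(onnx_path, engine_path, fp16=True):
    """Assemble a trtexec invocation that compiles an ONNX model into an engine.

    Flag names follow trtexec as shipped with TensorRT 8.x-10.x (an assumption
    for other releases); the paths are placeholders supplied by the caller.
    """
    cmd = ["trtexec", f"--onnx={onnx_path}", f"--saveEngine={engine_path}"]
    if fp16:
        # Allow the builder to choose FP16 kernels where they are faster.
        cmd.append("--fp16")
    return cmd


if __name__ == "__main__":
    # Print the command instead of running it, so this works without a GPU.
    print(shlex.join(trtexec_command("model.onnx", "model.engine")))
```

On a machine with TensorRT installed, the returned list could be passed to `subprocess.run(cmd, check=True)` to perform the build.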
The Verdict
Who Should Use TensorRT?
Best For
- Teams deploying deep learning models on NVIDIA GPU infrastructure
- Production inference requiring lowest possible latency
- LLM serving at scale using TensorRT-LLM
- Edge and embedded AI on Jetson and DriveOS platforms
- Computer vision, video analytics, and speech AI pipelines
Not Ideal For
- Teams running on AMD, Intel, or cloud TPU infrastructure
- Rapid prototyping — compilation adds significant setup overhead
- Highly dynamic model architectures that can't be optimized statically
What's Great
- Up to 36x faster inference vs. CPU-only platforms
- NVIDIA-reported gains with TensorRT-LLM: 8x inference performance for GPT-J 6B and 4x for Llama 2 70B
- Broad quantization support: FP4, FP8, INT4, INT8, FP16, BF16
- TensorRT-LLM for transformer-specific optimizations (free, open-source)
- Deploys across edge (Jetson), desktop, and data center
- Native PyTorch and HuggingFace integration
- 5.3x better total cost of ownership for LLM workloads (NVIDIA-reported)
Watch Out For
- NVIDIA GPUs only — no AMD, Intel, or TPU support
- Compiled engines are tied to the GPU architecture (and, by default, the TensorRT version) they were built with, so they must be rebuilt for each GPU generation
- Significant complexity for custom or dynamic architectures
- TensorRT Cloud (hyper-optimized engine generation) limited to select partners
- CUDA version dependencies can complicate deployment environments
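One common way to live with hardware-specific engines is to cache each built engine under a name that encodes the GPU architecture and TensorRT version, and rebuild whenever either changes. The following is a minimal sketch of that idea using only the standard library. The function names, the file-name scheme, and the example version string are all hypothetical, and how the compute capability is obtained is left to the caller.

```python
from pathlib import Path


def engine_cache_path(cache_dir, model_name, compute_capability, trt_version, precision):
    """Return a cache path that is unique per GPU architecture and TensorRT version.

    compute_capability is a (major, minor) tuple such as (8, 9); the naming
    scheme is an illustrative convention, not something TensorRT defines.
    """
    major, minor = compute_capability
    tag = f"sm{major}{minor}-trt{trt_version}-{precision}"
    return Path(cache_dir) / f"{model_name}.{tag}.engine"


def needs_rebuild(path):
    """An engine must be built if no file exists for this exact configuration."""
    return not Path(path).exists()


if __name__ == "__main__":
    path = engine_cache_path("engine_cache", "resnet50", (8, 9), "10.0.1", "fp16")
    print(path, "rebuild needed:", needs_rebuild(path))
```

Because the GPU architecture is part of the file name, an engine built on one GPU generation is never picked up by mistake on another.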
Pricing
Key Features
- Mixed-precision inference (FP4, FP8, INT4, INT8, FP16, BF16, FP32)
- TensorRT-LLM for LLM-specific transformer optimizations
- TensorRT Model Optimizer (pruning, distillation, quantization)
- Dynamic shapes and strongly-typed networks
- Custom layer extensions via IPluginV3
- Multi-device inference with collective operations (GA)
- ONNX, PyTorch, and TensorFlow model import
- Python bindings (3.10–3.14) and C++ API
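Dynamic shapes are configured through optimization profiles, each giving a minimum, optimum, and maximum shape per input. The following is a plain-Python sketch of that idea, not the TensorRT API: `ShapeProfile` and its methods are hypothetical names, the input name and shapes are illustrative, and the `name:AxBxC` strings follow the format `trtexec` accepts for `--minShapes`, `--optShapes`, and `--maxShapes` in TensorRT 8.x to 10.x.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShapeProfile:
    """Hypothetical stand-in for one input's min/opt/max shape range."""
    name: str
    min_shape: tuple
    opt_shape: tuple
    max_shape: tuple

    def accepts(self, shape):
        """True if every dimension of `shape` lies within [min, max]."""
        if len(shape) != len(self.min_shape):
            return False
        return all(lo <= d <= hi
                   for lo, d, hi in zip(self.min_shape, shape, self.max_shape))

    def trtexec_flags(self):
        """Render the profile as trtexec shape flags."""
        def fmt(shape):
            return f"{self.name}:" + "x".join(str(d) for d in shape)
        return [f"--minShapes={fmt(self.min_shape)}",
                f"--optShapes={fmt(self.opt_shape)}",
                f"--maxShapes={fmt(self.max_shape)}"]


profile = ShapeProfile("input", (1, 3, 224, 224), (8, 3, 224, 224), (32, 3, 224, 224))
print(profile.accepts((16, 3, 224, 224)))  # batch 16 is inside the 1..32 range
print(profile.accepts((64, 3, 224, 224)))  # batch 64 exceeds the maximum
print(profile.trtexec_flags())
```

A wider min-to-max range makes the engine more flexible at run time, while the optimum shape is the one the builder tunes kernels for.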
Supported Platforms
- Linux x86-64 (Ubuntu 22.04, 24.04, 26.04)
- Linux aarch64 (SBSA)
- Windows x64
- NVIDIA Jetson / JetPack
- DriveOS (automotive)
- QNX (embedded)
Use Cases
- Large language model inference
- Computer vision (CNNs, object detection)
- Video analytics and speech AI
- Automotive embedded AI (DRIVE platform)
- Robotics and edge AI
- Diffusion models and multimodal inference
Current Version
- TensorRT 11.1.0 (June 2026)
- Requires CUDA 13.3 or 12.9
- CMake 3.31+ for building from source
- Python 3.10–3.14 supported
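The version requirements listed above can be turned into a small preflight check before installation. This sketch assumes the CUDA requirement means an exact major.minor match and that the caller supplies the CUDA version as a string. `check_environment` and both constants are hypothetical names, with values copied from the list above.

```python
import sys

# Values copied from the version list above; adjust for another release.
SUPPORTED_PYTHON = ((3, 10), (3, 14))  # inclusive (major, minor) range
SUPPORTED_CUDA = {"13.3", "12.9"}      # assumed to require an exact match


def check_environment(python_version, cuda_version):
    """Return a list of human-readable problems; an empty list means OK.

    python_version is a (major, minor) tuple; cuda_version is a "major.minor"
    string obtained however suits the deployment.
    """
    problems = []
    lo, hi = SUPPORTED_PYTHON
    py = tuple(python_version[:2])
    if not (lo <= py <= hi):
        problems.append(f"Python {py[0]}.{py[1]} is outside 3.10-3.14")
    if cuda_version not in SUPPORTED_CUDA:
        problems.append(f"CUDA {cuda_version} is not one of {sorted(SUPPORTED_CUDA)}")
    return problems


if __name__ == "__main__":
    # The CUDA version here is a placeholder, not detected from the system.
    print(check_environment(sys.version_info[:2], "12.9"))
```

Running such a check in CI or a container build step surfaces a CUDA or Python mismatch before it turns into an import-time failure.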
How It Compares
| Feature | TensorRT | vLLM | llama.cpp |
|---|---|---|---|
| Hardware | NVIDIA only | NVIDIA, AMD, Intel, TPU | CPU + GPU (multi-vendor) |
| Primary Use | GPU inference optimization | LLM serving throughput | Local/edge CPU inference |
| LLM Support | TensorRT-LLM | Native, production-ready | Native, lightweight |
| Quantization | FP4–FP32 (widest range) | AWQ, GPTQ, FP8 | GGUF quantized formats (e.g., Q4_K, Q8_0) |
| Latency | Best on NVIDIA | Excellent throughput | Best on CPU |
| Open Source | Partly (TensorRT-LLM and OSS components are Apache 2.0; core library is proprietary) | Apache 2.0 | MIT |
| Best For | NVIDIA GPU production | High-volume LLM serving | Portable edge inference |