TensorRT

Open-source, free, 13k GitHub stars

NVIDIA's SDK for high-performance deep learning inference on NVIDIA GPUs, delivering up to 36x speedup over CPU-only platforms via mixed-precision optimization and model compilation.

36x faster than CPU
13K+ GitHub stars
Released 2017

Overview

TensorRT is an ecosystem of inference compilers, runtimes, and model optimization tools built for low latency and high throughput in production. It accepts trained models from PyTorch, TensorFlow, and ONNX, then compiles and optimizes them for a specific NVIDIA GPU architecture using mixed-precision computation (FP4, FP8, INT4, INT8, FP16, BF16, FP32). TensorRT-LLM extends this to large language models with transformer-specific optimizations, and TensorRT Model Optimizer adds pruning, distillation, and quantization for further compression.
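
To make the idea of reduced-precision inference concrete, here is a minimal sketch of symmetric per-tensor INT8 quantization in plain Python. It is an illustration only: the function names are hypothetical, and it does not reproduce how TensorRT or TensorRT Model Optimizer actually calibrate or quantize a model.

```python
def quantize_int8(values):
    """Map floats onto the integer range [-127, 127] with one shared scale."""
    max_abs = max(abs(v) for v in values)
    # Guard against an all-zero tensor, where any scale would do.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the integers and the scale."""
    return [q * scale for q in quantized]


# Illustrative weights (assumed values, not taken from any real model).
weights = [0.02, -1.27, 0.64, 0.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# The worst-case rounding error is bounded by half of one quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

The trade-off this shows is the general one behind lower-precision formats: each value takes fewer bits to store and compute on, at the cost of a bounded rounding error.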

The Verdict

Who Should Use TensorRT?

Best For

  • Teams deploying deep learning models on NVIDIA GPU infrastructure
  • Production inference requiring lowest possible latency
  • LLM serving at scale using TensorRT-LLM
  • Edge and embedded AI on Jetson and DriveOS platforms
  • Computer vision, video analytics, and speech AI pipelines

Not Ideal For

  • Teams running on AMD, Intel, or cloud TPU infrastructure
  • Rapid prototyping — compilation adds significant setup overhead
  • Highly dynamic model architectures that can't be optimized statically

What's Great

  • Up to 36x faster inference vs. CPU-only platforms
  • NVIDIA-reported 8x performance increase for GPT-J 6B and 4x for Llama 2 70B with TensorRT-LLM
  • Broad quantization support: FP4, FP8, INT4, INT8, FP16, BF16
  • TensorRT-LLM for transformer-specific optimizations (free, open-source)
  • Deploys across edge (Jetson), desktop, and data center
  • Native PyTorch and HuggingFace integration
  • 5.3x better total cost of ownership for LLM workloads, per NVIDIA's published figures

Watch Out For

  • NVIDIA GPUs only — no AMD, Intel, or TPU support
  • Engine compilation is hardware-specific; engines don't transfer between GPU generations
  • Significant complexity for custom or dynamic architectures
  • TensorRT Cloud (hyper-optimized engine generation) limited to select partners
  • CUDA version dependencies can complicate deployment environments
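
Because a compiled engine is tied to the hardware and software it was built on, one common way to avoid loading a stale engine is to encode the build environment in the cached file name. The sketch below assumes that approach; `BuildEnvironment` and `engine_cache_name` are hypothetical helpers, not part of TensorRT, and the fields chosen (GPU name, compute capability, TensorRT version, CUDA version, precision) are an assumption about what should invalidate a cache entry.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class BuildEnvironment:
    """Hypothetical description of the machine an engine was built on."""
    gpu_name: str
    compute_capability: str
    tensorrt_version: str
    cuda_version: str


def engine_cache_name(model_bytes: bytes, env: BuildEnvironment, precision: str) -> str:
    """Derive a file name that changes when the model or the environment changes."""
    model_digest = hashlib.sha256(model_bytes).hexdigest()[:16]
    env_tag = "|".join([
        env.gpu_name,
        env.compute_capability,
        env.tensorrt_version,
        env.cuda_version,
        precision,
    ])
    env_digest = hashlib.sha256(env_tag.encode("utf-8")).hexdigest()[:16]
    return f"{model_digest}-{env_digest}.engine"


# Illustrative values only; real ones would come from the driver and the installed packages.
model = b"placeholder model bytes"
env_a = BuildEnvironment("GPU-A", "8.9", "11.1.0", "13.3")
env_b = BuildEnvironment("GPU-B", "9.0", "11.1.0", "13.3")
name_a = engine_cache_name(model, env_a, "fp16")
name_b = engine_cache_name(model, env_b, "fp16")
```

With this scheme, a deployment that moves to a different GPU generation simply misses the cache and rebuilds, rather than attempting to load an incompatible engine.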

Pricing

Free and open-source.

Key Features

  • Mixed-precision inference (FP4, FP8, INT4, INT8, FP16, BF16, FP32)
  • TensorRT-LLM for LLM-specific transformer optimizations
  • TensorRT Model Optimizer (pruning, distillation, quantization)
  • Dynamic shapes and strongly-typed networks
  • Custom layer extensions via IPluginV3
  • Multi-device inference with collective operations (generally available)
  • ONNX, PyTorch, and TensorFlow model import
  • Python bindings (3.10–3.14) and C++ API
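
Dynamic shapes in TensorRT are typically bounded by a minimum, optimal, and maximum shape per input. The sketch below models that idea in plain Python to show what "in range" means for an input; `ShapeProfile` is a hypothetical class for illustration and is not the TensorRT API.

```python
from dataclasses import dataclass
from typing import Tuple

Shape = Tuple[int, ...]


@dataclass(frozen=True)
class ShapeProfile:
    """Hypothetical min/opt/max shape bounds for one dynamic input."""
    min_shape: Shape
    opt_shape: Shape
    max_shape: Shape

    def __post_init__(self):
        ranks = {len(self.min_shape), len(self.opt_shape), len(self.max_shape)}
        if len(ranks) != 1:
            raise ValueError("min, opt, and max shapes must have the same rank")
        for lo, opt, hi in zip(self.min_shape, self.opt_shape, self.max_shape):
            if not lo <= opt <= hi:
                raise ValueError("each dimension must satisfy min <= opt <= max")

    def accepts(self, shape: Shape) -> bool:
        """Return True when every dimension falls inside the declared bounds."""
        if len(shape) != len(self.min_shape):
            return False
        return all(
            lo <= dim <= hi
            for lo, dim, hi in zip(self.min_shape, shape, self.max_shape)
        )


# Batch size may vary from 1 to 32; the image dimensions are fixed (assumed values).
profile = ShapeProfile(
    min_shape=(1, 3, 224, 224),
    opt_shape=(8, 3, 224, 224),
    max_shape=(32, 3, 224, 224),
)
fits = profile.accepts((16, 3, 224, 224))
too_large = profile.accepts((64, 3, 224, 224))
```

Declaring bounds up front is what lets a static compiler still optimize for variable input sizes, and it is also why architectures whose shapes cannot be bounded this way are harder to deploy.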

Supported Platforms

  • Linux x86-64 (Ubuntu 22.04, 24.04, 26.04)
  • Linux aarch64 (SBSA)
  • Windows x64
  • NVIDIA Jetson / JetPack
  • DriveOS (automotive)
  • QNX (embedded)

Use Cases

  • Large language model inference
  • Computer vision (CNNs, object detection)
  • Video analytics and speech AI
  • Automotive embedded AI (DRIVE platform)
  • Robotics and edge AI
  • Diffusion models and multimodal inference

Current Version

  • TensorRT 11.1.0 (June 2026)
  • Requires CUDA 13.3 or 12.9
  • CMake 3.31+ for building from source
  • Python 3.10–3.14 supported
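
The requirements above can be checked before installation with a small preflight script. This is a sketch under stated assumptions: the helper names are hypothetical, the CUDA check treats "13.3 or 12.9" as exact major.minor matches, and the CUDA version string is expected to be supplied by the caller rather than detected.

```python
import sys

# Version bounds copied from the list above.
SUPPORTED_PYTHON = ((3, 10), (3, 14))
SUPPORTED_CUDA = {"13.3", "12.9"}


def python_supported(version=None):
    """Check a (major, minor, ...) tuple, defaulting to the running interpreter."""
    major_minor = tuple((version or sys.version_info)[:2])
    return SUPPORTED_PYTHON[0] <= major_minor <= SUPPORTED_PYTHON[1]


def cuda_supported(version: str) -> bool:
    """Compare only the major.minor part, so "12.9.1" matches "12.9"."""
    major_minor = ".".join(version.split(".")[:2])
    return major_minor in SUPPORTED_CUDA


if __name__ == "__main__":
    print("Python OK:", python_supported())
    # "12.9.1" is an example input, not a detected value.
    print("CUDA OK:", cuda_supported("12.9.1"))
```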

How It Compares

Feature      | TensorRT                   | vLLM                     | llama.cpp
-------------|----------------------------|--------------------------|--------------------------
Hardware     | NVIDIA only                | NVIDIA, AMD, Intel, TPU  | CPU + GPU (multi-vendor)
Primary Use  | GPU inference optimization | LLM serving throughput   | Local/edge CPU inference
LLM Support  | TensorRT-LLM               | Native, production-ready | Native, lightweight
Quantization | FP4–FP32 (widest range)    | AWQ, GPTQ, FP8           | GGUF (CPU-optimized)
Latency      | Best on NVIDIA             | Excellent throughput     | Best on CPU
License      | Apache 2.0                 | Apache 2.0               | MIT
Best For     | NVIDIA GPU production      | High-volume LLM serving  | Portable edge inference
