TensorRT over SSH
Compile, profile, and invoke TensorRT models on NVIDIA hardware.
This guide walks you through compiling, profiling, and invoking a TensorRT
model on your own NVIDIA hardware over SSH using NVIDIA’s trtexec tool.
TensorRT compilation is more heavyweight than ONNX Runtime, but still practical for iterative development on NVIDIA GPUs. Building a MobileNetV2 engine takes around 70 seconds, while a ResNet-50 completes in about 40 seconds. Profiling is fast at around 15 seconds per model. The total turnaround stays under two minutes for most models — compared to around 10 minutes for the same workflow on a cloud provider. On the other hand, cloud providers let you test on a wide variety of edge devices without managing any hardware.
You will learn how to:
- Set up trtexec on the target device
- Connect to the device over SSH
- Compile an ONNX model to a TensorRT engine
- Profile the compiled engine
- Invoke the engine with real input data
Prerequisites
Make sure you have completed the setup guide and your hardware prerequisites, including passwordless SSH access to the target device.
Locating trtexec on the target device
The trtexec provider requires NVIDIA’s trtexec tool, which is included
with TensorRT. You can find it on your device by running:
```bash
ssh user@host find / -name trtexec -type f 2>/dev/null
```

Common paths include:
- /usr/src/tensorrt/bin/trtexec
- /opt/tensorrt/bin/trtexec

If trtexec is not on the device’s $PATH, you will need to provide
the full path when connecting to the device (see Connecting to your device below).
Creating a project
```python
from embedl_hub.core import HubContext
from embedl_hub.core import LocalPath

ctx = HubContext(
    project_name="TensorRT SSH",
    artifact_base_dir=LocalPath("my-artifacts"),
)
```

The HubContext is your entry point. It manages the project, artifact
directory, devices, and tracking. We’ll register a device in the next
section.
The artifact_base_dir is where compiled models, profiling results, and
other outputs are stored on disk. If omitted, HubContext creates a
temporary directory when used as a context manager (with ctx:), and
cleans it up automatically when the context exits. This is convenient
for scripts where you only need the in-memory results and don’t need to
persist artifacts to disk.
For alternative ways to configure project context, see the configuration guide.
Connecting to your device
Next, configure a connection to your target device over SSH.
```python
from embedl_hub.core import HubContext
from embedl_hub.core.device import DeviceManager
from embedl_hub.core.device import SSHConfig
from embedl_hub.core import LocalPath

device = DeviceManager.get_tensorrt_device(
    SSHConfig(host="192.168.1.10", username="nvidia"),
    name="jetson",
)

ctx = HubContext(
    project_name="TensorRT SSH",
    artifact_base_dir=LocalPath("my-artifacts"),
    devices=[device],
)
```

If trtexec is not on the device’s $PATH, pass the full path via TrtexecConfig:
```python
from embedl_hub.core.device import TrtexecConfig

device = DeviceManager.get_tensorrt_device(
    SSHConfig(host="192.168.1.10", username="nvidia"),
    name="jetson",
    provider_config=TrtexecConfig(
        trtexec_path="/usr/src/tensorrt/bin/trtexec",
    ),
)
```

The name parameter is a label you choose for this device; you reference
it by that label when creating components later (e.g. device="jetson").
Preparing a model
The compile step expects an ONNX file. You can save
your existing PyTorch model in ONNX format using torch.onnx.export:
```python
import torch
from torchvision.models import mobilenet_v2

model = mobilenet_v2(weights="IMAGENET1K_V2")
example_input = torch.rand(1, 3, 224, 224)

torch.onnx.export(
    model,
    example_input,
    "mobilenet_v2.onnx",
    input_names=["input"],
    output_names=["output"],
    opset_version=18,
    external_data=False,
    dynamo=False,
)
```

Compiling a model
Compile the ONNX model to a TensorRT engine on the target device. The model
is transferred over SSH, compiled using trtexec, and the engine file is
fetched back:
```python
from embedl_hub.core import HubContext
from embedl_hub.core import LocalPath
from embedl_hub.core.device import DeviceManager
from embedl_hub.core.device import SSHConfig
from embedl_hub.core.compile import TensorRTCompiler

device = DeviceManager.get_tensorrt_device(
    SSHConfig(host="192.168.1.10", username="nvidia"),
    name="jetson",
    # provider_config=trtexec_config,  # if trtexec is not on PATH
)

ctx = HubContext(
    project_name="TensorRT SSH",
    artifact_base_dir=LocalPath("my-artifacts"),
    devices=[device],
)

compiler = TensorRTCompiler(device="jetson")

with ctx:
    compiled = compiler.run(ctx, LocalPath("mobilenet_v2.onnx"))
    print(compiled.path.file_path)
```

TensorRT optimizes the model as part of compilation, applying FP16 precision by default to reduce memory usage and inference latency on the target GPU. For further gains, INT8 quantization is supported when a calibration cache is provided.
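FP16 halves per-weight storage, but it also rounds values to roughly three decimal digits of precision, which is why an FP16 engine’s outputs can differ slightly from the FP32 original. A quick, generic NumPy illustration (not specific to trtexec or embedl-hub):

```python
import numpy as np

w32 = np.float32(0.1234567)   # a typical FP32 weight value
w16 = w32.astype(np.float16)  # what FP16 storage keeps instead

print(w32.itemsize, "->", w16.itemsize)  # bytes per weight: 4 -> 2
print(abs(float(w16) - float(w32)))      # small rounding error (~1e-5 here)
```

For most vision models this rounding is harmless, which is why FP16 is a reasonable default; INT8 trades more precision for more speed and needs calibration to pick good per-tensor ranges.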
Providing a calibration cache for INT8
To enable INT8 quantization, you need a TensorRT calibration cache file
(.cache) containing pre-computed per-tensor dynamic ranges. This file
must be generated externally using the TensorRT Python API (e.g. trt.IInt8EntropyCalibrator2) from your calibration dataset before
calling the compiler.
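As a sanity check before passing a cache to the compiler, note that calibration caches from recent TensorRT versions are plain text: a header line such as `TRT-8601-EntropyCalibration2` followed by `tensor_name: hexvalue` lines, where the hex value is the big-endian IEEE-754 encoding of the float32 scale. A small parser sketch under that assumption (the exact layout can vary between TensorRT versions, so treat this as illustrative):

```python
import struct

def parse_calib_cache(text: str) -> dict[str, float]:
    """Map tensor names to calibration scales.

    Assumes the common text layout: one header line, then
    '<tensor>: <hex>' lines holding a big-endian float32 scale.
    """
    scales = {}
    for line in text.strip().splitlines()[1:]:  # skip the header line
        name, _, hexval = line.rpartition(": ")
        scales[name] = struct.unpack(">f", bytes.fromhex(hexval))[0]
    return scales

# Example with a made-up two-entry cache:
sample = "TRT-8601-EntropyCalibration2\ninput: 3f800000\noutput: 3c010a14\n"
print(parse_calib_cache(sample)["input"])  # 1.0
```

An empty or truncated cache is a common cause of silent accuracy loss, so it is worth confirming the expected tensor names appear before uploading.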
```python
compiler = TensorRTCompiler(
    device="jetson",
    calib_path=LocalPath("path/to/calibration.cache"),
)
```

The calib_path parameter accepts a local path to the .cache file.
It is automatically uploaded to the target device and passed to trtexec with the --calib flag. You also need to enable INT8 mode
via trtexec_cli_args:
```python
from embedl_hub.core.device import TrtexecConfig

device = DeviceManager.get_tensorrt_device(
    SSHConfig(host="192.168.1.10", username="nvidia"),
    name="jetson",
    provider_config=TrtexecConfig(
        trtexec_cli_args=("--int8",),
    ),
)
```

Note: Some models have operations that are notoriously difficult to quantize, which can lead to a large drop in accuracy. One example is the softmax function in the attention layers of large language models (LLMs).
Specifying a TensorRT version
If the target device has multiple TensorRT versions installed, you can specify which one to use:
```python
compiler = TensorRTCompiler(
    device="jetson",
    tensorrt_version="10.0",
)
```

Profiling a model
Profile the compiled engine on the target device:
```python
from embedl_hub.core import HubContext
from embedl_hub.core import LocalPath
from embedl_hub.core.device import DeviceManager
from embedl_hub.core.device import SSHConfig
from embedl_hub.core.compile import TensorRTCompiler
from embedl_hub.core.profile import TensorRTProfiler

device = DeviceManager.get_tensorrt_device(
    SSHConfig(host="192.168.1.10", username="nvidia"),
    name="jetson",
    # provider_config=trtexec_config,  # if trtexec is not on PATH
)

ctx = HubContext(
    project_name="TensorRT SSH",
    artifact_base_dir=LocalPath("my-artifacts"),
    devices=[device],
)

compiler = TensorRTCompiler(device="jetson")
profiler = TensorRTProfiler(device="jetson")

with ctx:
    compiled = compiler.run(ctx, LocalPath("mobilenet_v2.onnx"))
    result = profiler.run(ctx, compiled)
    print("Latency:", result.latency.value)
    print("FPS:", result.fps.value)
```

Your runs are automatically synced to your project on hub.embedl.com.
Profiling gives you the model’s latency on the target hardware, which layers are slowest, the number of layers executed on each compute unit type, and more. You can use this information to iterate on the model’s design and answer questions like:
- Can we optimize the slowest layer?
- Why aren’t certain layers running on the expected compute unit?
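As a simple example of acting on per-layer timings: once you have pulled them into a plain dict (the exact fields on the profiling result depend on the SDK version, and the numbers below are made up), ranking the slowest layers is a one-liner:

```python
# Hypothetical per-layer timings in milliseconds from a profiling run.
layer_ms = {
    "Conv_0": 0.41,
    "Conv_1 + Relu_2": 0.18,
    "GlobalAveragePool_60": 0.05,
    "Gemm_62": 0.33,
}

# Sort by time, slowest first, and keep the top three candidates.
slowest = sorted(layer_ms.items(), key=lambda kv: kv[1], reverse=True)[:3]
for name, ms in slowest:
    print(f"{name}: {ms:.2f} ms")
```

The slowest layers are usually the best place to start when redesigning the model or adjusting precision settings.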
Invoking a model
Invoke the compiled engine with real input data to get inference outputs:
```python
import numpy as np

from embedl_hub.core import HubContext
from embedl_hub.core import LocalPath
from embedl_hub.core.device import DeviceManager
from embedl_hub.core.device import SSHConfig
from embedl_hub.core.compile import TensorRTCompiler
from embedl_hub.core.invoke import TensorRTInvoker

device = DeviceManager.get_tensorrt_device(
    SSHConfig(host="192.168.1.10", username="nvidia"),
    name="jetson",
    # provider_config=trtexec_config,  # if trtexec is not on PATH
)

ctx = HubContext(
    project_name="TensorRT SSH",
    artifact_base_dir=LocalPath("my-artifacts"),
    devices=[device],
)

compiler = TensorRTCompiler(device="jetson")
invoker = TensorRTInvoker(device="jetson")

input_data = dict(input=np.random.rand(1, 3, 224, 224).astype(np.float32))

with ctx:
    compiled = compiler.run(ctx, LocalPath("mobilenet_v2.onnx"))
    invocation = invoker.run(ctx, compiled, input_data)
    print(invocation.output)
```

The input_data dictionary maps input tensor names to NumPy arrays.
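In practice, input_data comes from preprocessed images rather than random arrays, and the raw output needs decoding. A NumPy-only sketch for an ImageNet classifier like MobileNetV2, assuming you already have a 224x224 RGB uint8 array (decoding and resizing done elsewhere) and that the model returns a (1, 1000) logit array:

```python
import numpy as np

def preprocess(image_hwc_uint8: np.ndarray) -> np.ndarray:
    """Scale to [0, 1], normalize with ImageNet mean/std, HWC -> NCHW."""
    mean = np.array([0.485, 0.456, 0.406], dtype=np.float32)
    std = np.array([0.229, 0.224, 0.225], dtype=np.float32)
    x = image_hwc_uint8.astype(np.float32) / 255.0
    x = (x - mean) / std
    return x.transpose(2, 0, 1)[np.newaxis]  # shape (1, 3, 224, 224)

def top1(logits: np.ndarray) -> tuple[int, float]:
    """Softmax over the class axis, return (class index, probability)."""
    z = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    probs = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    idx = int(probs.argmax())
    return idx, float(probs[0, idx])

image = np.random.randint(0, 256, (224, 224, 3), dtype=np.uint8)
input_data = dict(input=preprocess(image))

# Stand-in for the logits you would read from the invocation result:
logits = np.random.rand(1, 1000).astype(np.float32)
print(top1(logits))
```

The class index maps into the standard ImageNet-1k label list; the normalization constants above match the torchvision weights used in the export step.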
Next steps
- Learn how to view, name, and tag your runs, and how to interpret profiling results in the exploring results guide.
- See the providers guide for the full reference of supported provider and toolchain combinations.