
TensorRT over SSH

Compile, profile, and invoke TensorRT models on NVIDIA hardware.

This guide walks you through compiling, profiling, and invoking a TensorRT model on your own NVIDIA hardware over SSH using NVIDIA’s trtexec tool.

TensorRT compilation is more heavyweight than ONNX Runtime, but still practical for iterative development on NVIDIA GPUs. Building a MobileNetV2 engine takes around 70 seconds, while a ResNet-50 completes in about 40 seconds. Profiling is fast at around 15 seconds per model. The total turnaround stays under two minutes for most models — compared to around 10 minutes for the same workflow on a cloud provider. On the other hand, cloud providers let you test on a wide variety of edge devices without managing any hardware.

You will learn how to:

  • Set up trtexec on the target device
  • Connect to the device over SSH
  • Compile an ONNX model to a TensorRT engine
  • Profile the compiled engine
  • Invoke the engine with real input data

Prerequisites

Make sure you have completed the setup guide and your hardware prerequisites, including passwordless SSH access to the target device.

Locating trtexec on the target device

The trtexec provider requires NVIDIA’s trtexec tool, which is included with TensorRT. You can find it on your device by running:

ssh user@host "find / -name trtexec -type f 2>/dev/null"

Common paths include:

/usr/src/tensorrt/bin/trtexec
/opt/tensorrt/bin/trtexec

If trtexec is not on the device’s $PATH, you will need to provide the full path when connecting to the device (see Connecting to your device below).

Creating a project

from embedl_hub.core import HubContext
from embedl_hub.core import LocalPath
ctx = HubContext(
    project_name="TensorRT SSH",
    artifact_base_dir=LocalPath("my-artifacts"),
)

The HubContext is your entry point. It manages the project, artifact directory, devices, and tracking. We’ll register a device in the next section.

The artifact_base_dir is where compiled models, profiling results, and other outputs are stored on disk. If omitted, HubContext creates a temporary directory when used as a context manager (with ctx:), and cleans it up automatically when the context exits. This is convenient for scripts where you only need the in-memory results and don’t need to persist artifacts to disk.
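For example, the temporary-directory behavior can be used like this. This is a sketch that reuses only the API shown in this guide; the `...` stands in for the compile, profile, or invoke steps from later sections (this fragment needs a configured device to actually run):

```python
from embedl_hub.core import HubContext

# No artifact_base_dir: a temporary artifact directory is created when
# the context is entered and cleaned up automatically when it exits.
ctx = HubContext(project_name="TensorRT SSH")
with ctx:
    ...  # run compile/profile steps here; keep only the in-memory results
```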

For alternative ways to configure project context, see the configuration guide.

Connecting to your device

Next, configure a connection to your target device over SSH.

from embedl_hub.core import HubContext
from embedl_hub.core.device import DeviceManager
from embedl_hub.core.device import SSHConfig
from embedl_hub.core import LocalPath
device = DeviceManager.get_tensorrt_device(
    SSHConfig(host="192.168.1.10", username="nvidia"),
    name="jetson",
)
ctx = HubContext(
    project_name="TensorRT SSH",
    artifact_base_dir=LocalPath("my-artifacts"),
    devices=[device],
)

If trtexec is not on the device’s $PATH, pass the full path via TrtexecConfig:

from embedl_hub.core.device import TrtexecConfig
device = DeviceManager.get_tensorrt_device(
    SSHConfig(host="192.168.1.10", username="nvidia"),
    name="jetson",
    provider_config=TrtexecConfig(
        trtexec_path="/usr/src/tensorrt/bin/trtexec",
    ),
)

The name parameter is a label you choose for this device; you reference it by that label when creating components later (e.g. device="jetson").

Preparing a model

The compile step expects an ONNX file. You can save your existing PyTorch model in ONNX format using torch.onnx.export:

import torch
from torchvision.models import mobilenet_v2
# Load a pretrained model and export it with a representative input shape.
model = mobilenet_v2(weights="IMAGENET1K_V2")
example_input = torch.rand(1, 3, 224, 224)  # NCHW, matching the "input" binding
torch.onnx.export(
    model,
    example_input,
    "mobilenet_v2.onnx",
    input_names=["input"],
    output_names=["output"],
    opset_version=18,
    external_data=False,  # embed the weights in the .onnx file itself
    dynamo=False,         # use the classic TorchScript-based exporter
)

Compiling a model

Compile the ONNX model to a TensorRT engine on the target device. The model is transferred over SSH, compiled using trtexec, and the engine file is fetched back:

from embedl_hub.core import HubContext
from embedl_hub.core import LocalPath
from embedl_hub.core.device import DeviceManager
from embedl_hub.core.device import SSHConfig
from embedl_hub.core.compile import TensorRTCompiler
device = DeviceManager.get_tensorrt_device(
    SSHConfig(host="192.168.1.10", username="nvidia"),
    name="jetson",
    # provider_config=trtexec_config,  # if trtexec is not on PATH
)
ctx = HubContext(
    project_name="TensorRT SSH",
    artifact_base_dir=LocalPath("my-artifacts"),
    devices=[device],
)
compiler = TensorRTCompiler(device="jetson")
with ctx:
    compiled = compiler.run(ctx, LocalPath("mobilenet_v2.onnx"))
    print(compiled.path.file_path)

TensorRT optimizes the model as part of compilation, applying FP16 precision by default to reduce memory usage and inference latency on the target GPU. For further gains, INT8 quantization is supported when a calibration cache is provided.

Providing a calibration cache for INT8

To enable INT8 quantization, you need a TensorRT calibration cache file (.cache) containing pre-computed per-tensor dynamic ranges. This file must be generated externally using the TensorRT Python API (e.g. trt.IInt8EntropyCalibrator2) from your calibration dataset before calling the compiler.

compiler = TensorRTCompiler(
    device="jetson",
    calib_path=LocalPath("path/to/calibration.cache"),
)

The calib_path parameter accepts a local path to the .cache file. It is automatically uploaded to the target device and passed to trtexec with the --calib flag. You also need to enable INT8 mode via trtexec_cli_args:

from embedl_hub.core.device import TrtexecConfig
device = DeviceManager.get_tensorrt_device(
    SSHConfig(host="192.168.1.10", username="nvidia"),
    name="jetson",
    provider_config=TrtexecConfig(
        trtexec_cli_args=("--int8",),
    ),
)
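Putting the two pieces together, an INT8 compile might look like the following sketch. The host, device label, and calibration-cache path are placeholders; only API calls already shown in this guide are used, and this fragment needs a reachable device to run:

```python
from embedl_hub.core import HubContext
from embedl_hub.core import LocalPath
from embedl_hub.core.device import DeviceManager
from embedl_hub.core.device import SSHConfig
from embedl_hub.core.device import TrtexecConfig
from embedl_hub.core.compile import TensorRTCompiler
# Enable INT8 mode in trtexec via the provider config.
device = DeviceManager.get_tensorrt_device(
    SSHConfig(host="192.168.1.10", username="nvidia"),
    name="jetson",
    provider_config=TrtexecConfig(trtexec_cli_args=("--int8",)),
)
ctx = HubContext(
    project_name="TensorRT SSH",
    artifact_base_dir=LocalPath("my-artifacts"),
    devices=[device],
)
# The calibration cache is uploaded and passed to trtexec with --calib.
compiler = TensorRTCompiler(
    device="jetson",
    calib_path=LocalPath("path/to/calibration.cache"),
)
with ctx:
    compiled = compiler.run(ctx, LocalPath("mobilenet_v2.onnx"))
```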

Note: Some models have operations that are notoriously difficult to quantize, which can lead to a large drop in accuracy. One example is the softmax function in the attention layers of large language models (LLMs).

Specifying a TensorRT version

If the target device has multiple TensorRT versions installed, you can specify which one to use:

compiler = TensorRTCompiler(
    device="jetson",
    tensorrt_version="10.0",
)

Profiling a model

Profile the compiled engine on the target device:

from embedl_hub.core import HubContext
from embedl_hub.core import LocalPath
from embedl_hub.core.device import DeviceManager
from embedl_hub.core.device import SSHConfig
from embedl_hub.core.compile import TensorRTCompiler
from embedl_hub.core.profile import TensorRTProfiler
device = DeviceManager.get_tensorrt_device(
    SSHConfig(host="192.168.1.10", username="nvidia"),
    name="jetson",
    # provider_config=trtexec_config,  # if trtexec is not on PATH
)
ctx = HubContext(
    project_name="TensorRT SSH",
    artifact_base_dir=LocalPath("my-artifacts"),
    devices=[device],
)
compiler = TensorRTCompiler(device="jetson")
profiler = TensorRTProfiler(device="jetson")
with ctx:
    compiled = compiler.run(ctx, LocalPath("mobilenet_v2.onnx"))
    result = profiler.run(ctx, compiled)
    print("Latency:", result.latency.value)
    print("FPS:", result.fps.value)

Your runs are automatically synced to your project on hub.embedl.com.

Profiling reports the model’s latency on the target hardware, which layers are slowest, how many layers ran on each compute unit type, and more. You can use this information to iterate on the model’s design and answer questions like:

  • Can we optimize the slowest layer?
  • Why aren’t certain layers running on the expected compute unit?
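As a quick sanity check on the numbers printed above: for single-stream inference, throughput is roughly the inverse of latency. The values below are illustrative, not measured:

```python
# Illustrative only: relate single-stream latency (ms) to throughput (FPS).
latency_ms = 2.5              # hypothetical per-inference latency
fps = 1000.0 / latency_ms     # one inference in flight at a time
print(f"{fps:.1f} FPS")       # prints "400.0 FPS"
```

If the reported FPS is far above this estimate, the profiler is likely overlapping multiple inferences (batching or streams).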

Invoking a model

Invoke the compiled engine with real input data to get inference outputs:

import numpy as np
from embedl_hub.core import HubContext
from embedl_hub.core import LocalPath
from embedl_hub.core.device import DeviceManager
from embedl_hub.core.device import SSHConfig
from embedl_hub.core.compile import TensorRTCompiler
from embedl_hub.core.invoke import TensorRTInvoker
device = DeviceManager.get_tensorrt_device(
    SSHConfig(host="192.168.1.10", username="nvidia"),
    name="jetson",
    # provider_config=trtexec_config,  # if trtexec is not on PATH
)
ctx = HubContext(
    project_name="TensorRT SSH",
    artifact_base_dir=LocalPath("my-artifacts"),
    devices=[device],
)
compiler = TensorRTCompiler(device="jetson")
invoker = TensorRTInvoker(device="jetson")
input_data = dict(input=np.random.rand(1, 3, 224, 224).astype(np.float32))
with ctx:
    compiled = compiler.run(ctx, LocalPath("mobilenet_v2.onnx"))
    invocation = invoker.run(ctx, compiled, input_data)
    print(invocation.output)

The input_data dictionary maps input tensor names to NumPy arrays.
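In practice you will feed real images rather than random data. Below is a minimal NumPy-only sketch of typical ImageNet preprocessing for the `input` binding above; the helper name and normalization constants are our own illustration, not part of the embedl_hub API, and should match whatever preprocessing your model was trained with:

```python
import numpy as np

# Standard ImageNet normalization constants (assumed; match your training setup).
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def to_input_dict(image_hwc: np.ndarray) -> dict:
    """Convert an HxWx3 uint8 image to the {name: NCHW float32} input dict."""
    x = image_hwc.astype(np.float32) / 255.0        # scale to [0, 1]
    x = (x - IMAGENET_MEAN) / IMAGENET_STD          # per-channel normalization
    x = np.transpose(x, (2, 0, 1))[np.newaxis]      # HWC -> NCHW with batch dim
    return {"input": x.astype(np.float32)}

inputs = to_input_dict(np.zeros((224, 224, 3), dtype=np.uint8))
print(inputs["input"].shape)  # (1, 3, 224, 224)
```

The resulting dictionary can be passed to `invoker.run` in place of the random `input_data` above.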

Next steps