ONNX Runtime over SSH
Compile, profile, and invoke ONNX Runtime models on your own hardware.
This guide walks you through compiling, profiling, and invoking an ONNX Runtime model on your own hardware over SSH using the embedl-onnxruntime backend.
This is one of the fastest backends available, making it ideal for experimentation and rapid iteration. Compiling and quantizing a MobileNetV2 takes around 7 seconds, and profiling it adds another 12 seconds. Even a larger model like ResNet-50 completes a full compile-and-profile cycle in under 30 seconds — compared to around 10 minutes for the same model on a cloud provider. On the other hand, cloud providers give you access to a wide range of edge devices without having to set up any hardware yourself.
You will learn how to:
- Install and configure embedl-onnxruntime on the target device
- Compile an ONNX model with quantization on the target device
- Profile the compiled model
- Invoke the model with real input data
Prerequisites
Make sure you have completed the setup guide and your hardware prerequisites, including passwordless SSH access to the target device.
Installing embedl-onnxruntime on the target device
The embedl-onnxruntime provider requires the embedl-onnxruntime package
to be installed on the target device. We recommend installing it in a
virtual environment:
```shell
# On the target device:
$ python3 -m venv ~/embedl-ort-env
$ source ~/embedl-ort-env/bin/activate
$ pip install embedl-onnxruntime
```

If you installed into a virtual environment, note the full path to the embedl-onnxruntime binary; you will need it when compiling later:
```shell
$ realpath ~/embedl-ort-env/bin/embedl-onnxruntime
/home/pi/embedl-ort-env/bin/embedl-onnxruntime
```

If the binary is already on the device’s $PATH, you can skip this step.
Creating a project
```shell
embedl-hub init \
  --project "ONNX Runtime SSH" \
  --artifact-dir ~/my-artifacts
```

This sets the default project and artifact directory for subsequent commands. The artifact directory is where compiled models, profiling results, and other outputs are stored on disk. Later commands, such as profiling a model from a previous compile step, look here for previously produced artifacts. If omitted, a platform-specific default location is used.
You can view your current settings at any time:
```shell
embedl-hub show
```

Connecting to your device
Next, configure a connection to your target device over SSH.
In the CLI, device connection details are passed directly to each command:
```shell
embedl-hub compile onnxruntime embedl-onnxruntime \
  --host 192.168.1.42 \
  --user pi \
  --exec-path /home/pi/embedl-ort-env/bin/embedl-onnxruntime \
  ...
```

If embedl-onnxruntime is on the device’s $PATH, you can omit the --exec-path flag.
Preparing a model
The compile step expects an ONNX file. You can save
your existing PyTorch model in ONNX format using torch.onnx.export:
```python
import torch
from torchvision.models import mobilenet_v2

model = mobilenet_v2(weights="IMAGENET1K_V2")
example_input = torch.rand(1, 3, 224, 224)
torch.onnx.export(
    model,
    example_input,
    "mobilenet_v2.onnx",
    input_names=["input"],
    output_names=["output"],
    opset_version=18,
    external_data=False,
    dynamo=False,
)
```

Compiling a model
Compile the ONNX model with quantization on the target device. The model is transferred to the device over SSH, compiled there, and the result is fetched back.
```shell
embedl-hub compile onnxruntime embedl-onnxruntime \
  --model /path/to/mobilenet_v2.onnx \
  --host 192.168.1.42 \
  --user pi
```

The embedl-onnxruntime provider quantizes the model as part of
compilation, applying INT8 post-training quantization to lower the
precision of weights and activations. This reduces memory usage and
inference latency on the target device.
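To build intuition for what INT8 post-training quantization does, here is a minimal NumPy sketch of symmetric per-tensor quantization. This is an illustration of the general technique, not the provider's actual implementation, and the function names are hypothetical:

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor INT8 quantization: map floats to [-127, 127]."""
    scale = np.abs(w).max() / 127.0  # one scale factor for the whole tensor
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximation of the original floats."""
    return q.astype(np.float32) * scale

w = np.random.randn(64, 32).astype(np.float32)  # stand-in for a weight tensor
q, scale = quantize_int8(w)
err = np.abs(dequantize(q, scale) - w).max()  # rounding error is at most scale / 2
```

Each INT8 value occupies a quarter of the memory of a float32, which is where the memory and latency savings come from; the cost is the rounding error bounded by half the scale step.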
Providing calibration data
Although quantization reduces the model’s precision, you can mitigate the accuracy loss by providing calibration data — a small set of representative input samples. You don’t need a large dataset; usually, a few hundred samples are more than enough. If no calibration data is provided, random data is used.
Calibration data is not yet supported via the CLI for embedl-onnxruntime.
Use the Python API instead.
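As a sketch of what a calibration set looks like, the snippet below assembles a few hundred preprocessed samples with the model's input shape and dtype. The random arrays and the calibration.npz filename are placeholders for illustration; in practice you would use real inputs drawn from your dataset, since calibrating on random data is no better than providing none:

```python
import numpy as np

# Gather ~200 representative, already-preprocessed samples with the same
# shape and dtype the model expects (here NCHW float32 for MobileNetV2).
num_samples = 200
calibration_batch = np.stack(
    [np.random.rand(3, 224, 224).astype(np.float32) for _ in range(num_samples)]
)
# Replace the random arrays above with real preprocessed inputs.
np.savez("calibration.npz", input=calibration_batch)
```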
Note: Some models have operations that are notoriously difficult to quantize, which can lead to a large drop in accuracy. One example is the softmax function in the attention layers of large language models (LLMs).
Profiling a model
Profile the compiled model on the target device:
```shell
embedl-hub profile onnxruntime embedl-onnxruntime \
  --from-run latest \
  --host 192.168.1.42 \
  --user pi
```

Use embedl-hub log to view your runs.
Profiling gives you the model’s latency on the target hardware, which layers are slowest, the number of layers executed on each compute unit type, and more. You can use this information to iterate on the model’s design and answer questions like:
- Can we optimize the slowest layer?
- Why aren’t certain layers running on the expected compute unit?
Invoking a model
Invoke the compiled model with real input data to get inference outputs:
```shell
embedl-hub invoke onnxruntime embedl-onnxruntime \
  --from-run latest \
  --host 192.168.1.42 \
  --user pi \
  --input /path/to/input.npz
```

The --input flag accepts a .npz file: a NumPy archive where each key is an input tensor name and each value is the corresponding array.
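For example, an input archive for the MobileNetV2 exported earlier (input name "input", shape 1×3×224×224) could be built like this; the random tensor is a placeholder for a real preprocessed image:

```python
import numpy as np

# Build a single input tensor matching the exported model's
# input name ("input") and shape (1, 3, 224, 224).
x = np.random.rand(1, 3, 224, 224).astype(np.float32)
# In practice, load a real image here: resize to 224x224, scale to [0, 1],
# normalize, and transpose from HWC to NCHW before saving.
np.savez("input.npz", input=x)
```

The keyword argument name (`input`) must match the model's input tensor name, which is why the export example above sets `input_names=["input"]`.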
Next steps
- Learn how to view, name, and tag your runs, and how to interpret profiling results in the exploring results guide.
- See the providers guide for the full reference of supported provider and toolchain combinations.