
ONNX Runtime over SSH

Compile, profile, and invoke ONNX Runtime models on your own hardware.

This guide walks you through compiling, profiling, and invoking an ONNX Runtime model on your own hardware over SSH using the embedl-onnxruntime backend.

This is one of the fastest backends available, making it ideal for experimentation and rapid iteration. Compiling and quantizing a MobileNetV2 takes around 7 seconds, and profiling it adds another 12 seconds. Even a larger model like ResNet-50 completes a full compile-and-profile cycle in under 30 seconds — compared to around 10 minutes for the same model on a cloud provider. On the other hand, cloud providers give you access to a wide range of edge devices without having to set up any hardware yourself.

You will learn how to:

  • Install and configure embedl-onnxruntime on the target device
  • Compile an ONNX model with quantization on the target device
  • Profile the compiled model
  • Invoke the model with real input data

Prerequisites

Make sure you have completed the setup guide and the prerequisites for your hardware, including passwordless SSH access to the target device.

Installing embedl-onnxruntime on the target device

The embedl-onnxruntime provider requires the embedl-onnxruntime package to be installed on the target device. We recommend installing it in a virtual environment:

# On the target device:
$ python3 -m venv ~/embedl-ort-env
$ source ~/embedl-ort-env/bin/activate
$ pip install embedl-onnxruntime

If you installed into a virtual environment, note the full path to the embedl-onnxruntime binary — you will need it when compiling later:

realpath ~/embedl-ort-env/bin/embedl-onnxruntime
/home/pi/embedl-ort-env/bin/embedl-onnxruntime

If the binary is already on the device’s $PATH, you can skip this step.

Creating a project

embedl-hub init \
    --project "ONNX Runtime SSH" \
    --artifact-dir ~/my-artifacts

This sets the default project and artifact directory for subsequent commands. The artifact directory is where compiled models, profiling results, and other outputs are stored on disk. Later commands — such as profiling a model from a previous compile step — look here for previously produced artifacts. If omitted, a platform-specific default location is used.

You can view your current settings at any time:

embedl-hub show

Connecting to your device

Next, configure a connection to your target device over SSH.

In the CLI, device connection details are passed directly to each command:

embedl-hub compile onnxruntime embedl-onnxruntime \
    --host 192.168.1.42 \
    --user pi \
    --exec-path /home/pi/embedl-ort-env/bin/embedl-onnxruntime \
    ...

If embedl-onnxruntime is on the device’s $PATH, you can omit the --exec-path flag.

Preparing a model

The compile step expects an ONNX file. You can export an existing PyTorch model to ONNX using torch.onnx.export:

import torch
from torchvision.models import mobilenet_v2

# Export a pretrained MobileNetV2 with a fixed 1x3x224x224 input shape.
model = mobilenet_v2(weights="IMAGENET1K_V2")
example_input = torch.rand(1, 3, 224, 224)
torch.onnx.export(
    model,
    example_input,
    "mobilenet_v2.onnx",
    input_names=["input"],
    output_names=["output"],
    opset_version=18,
    external_data=False,  # store the weights inside the .onnx file
    dynamo=False,         # use the legacy TorchScript-based exporter
)

Compiling a model

Compile the ONNX model with quantization on the target device. The model is transferred to the device over SSH, compiled there, and the result is fetched back.

embedl-hub compile onnxruntime embedl-onnxruntime \
    --model /path/to/mobilenet_v2.onnx \
    --host 192.168.1.42 \
    --user pi

The embedl-onnxruntime provider quantizes the model as part of compilation, applying INT8 post-training quantization to lower the precision of weights and activations. This reduces memory usage and inference latency on the target device.

Providing calibration data

Although quantization reduces the model’s precision, you can mitigate the accuracy loss by providing calibration data — a small set of representative input samples. You don’t need a large dataset; usually, a few hundred samples are more than enough. If no calibration data is provided, random data is used.
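To see why calibration data matters, here is a minimal NumPy sketch of min/max range calibration for asymmetric uint8 quantization. This illustrates the general technique only; it is not the provider's actual implementation, and the calibration batches here are hypothetical random stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical calibration set: a few representative input batches.
calib = [rng.random((1, 3, 224, 224), dtype=np.float32) for _ in range(8)]

# Observe the value range over the calibration data.
lo = min(float(x.min()) for x in calib)
hi = max(float(x.max()) for x in calib)

# Derive asymmetric uint8 quantization parameters from that range.
scale = (hi - lo) / 255.0
zero_point = int(round(-lo / scale))

def quantize(x):
    return np.clip(np.round(x / scale) + zero_point, 0, 255).astype(np.uint8)

def dequantize(q):
    return (q.astype(np.float32) - zero_point) * scale

# Values inside the calibrated range survive the round trip with error
# of roughly scale/2 at most.
err = np.abs(dequantize(quantize(calib[0])) - calib[0]).max()
print(err <= scale)  # True
```

If the calibration range is estimated from random data instead of representative samples, the derived scale and zero point can clip or waste much of the real activation range, which is where the accuracy loss comes from.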

Calibration data is not yet supported via the CLI for embedl-onnxruntime. Use the Python API instead.

Note: Some models have operations that are notoriously difficult to quantize, which can lead to a large drop in accuracy. One example is the softmax function in the attention layers of large language models (LLMs).
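A toy NumPy illustration of the problem (not specific to embedl-onnxruntime): after softmax, one probability typically dominates while the rest are tiny, so an 8-bit grid scaled to the largest value flattens the small ones to zero:

```python
import numpy as np

logits = np.array([8.0, 1.0, 0.5, 0.2], dtype=np.float32)

# Softmax: one value dominates, the rest are ~1e-3 or smaller.
probs = np.exp(logits - logits.max())
probs /= probs.sum()

# Quantize to uint8 with the scale set by the largest probability.
scale = probs.max() / 255.0
q = np.round(probs / scale).astype(np.uint8)

print(probs.round(5))  # [0.99813 0.00091 0.00055 0.00041]
print(q)               # [255   0   0   0] -- the small probabilities collapse to 0
```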

Profiling a model

Profile the compiled model on the target device:

embedl-hub profile onnxruntime embedl-onnxruntime \
    --from-run latest \
    --host 192.168.1.42 \
    --user pi

Use embedl-hub log to view your runs.

Profiling gives you the model’s latency on the target hardware, which layers are slowest, the number of layers executed on each compute unit type, and more. You can use this information to iterate on the model’s design and answer questions like:

  • Can we optimize the slowest layer?
  • Why aren’t certain layers running on the expected compute unit?

Invoking a model

Invoke the compiled model with real input data to get inference outputs:

embedl-hub invoke onnxruntime embedl-onnxruntime \
    --from-run latest \
    --host 192.168.1.42 \
    --user pi \
    --input /path/to/input.npz

The --input flag accepts a .npz file — a NumPy archive where each key is an input tensor name and each value is the corresponding array.
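For example, you can create a compatible .npz file with NumPy. This sketch assumes the model was exported with a single input tensor named "input", as in the export snippet above; substitute your model's actual input names and shapes:

```python
import numpy as np

# One array per input tensor; each keyword argument's name must match
# the corresponding input tensor name in the model.
x = np.random.rand(1, 3, 224, 224).astype(np.float32)
np.savez("input.npz", input=x)

# Verify what was written:
with np.load("input.npz") as data:
    print(data.files)           # ['input']
    print(data["input"].shape)  # (1, 3, 224, 224)
```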

Next steps