ONNX Runtime over SSH
Compile, profile, and invoke ONNX Runtime models on your own hardware.
This guide walks you through compiling, profiling, and invoking an ONNX Runtime model on your own hardware over SSH using the embedl-onnxruntime backend.
This is one of the fastest backends available, making it ideal for experimentation and rapid iteration. Compiling and quantizing a MobileNetV2 takes around 7 seconds, and profiling it adds another 12 seconds. Even a larger model like ResNet-50 completes a full compile-and-profile cycle in under 30 seconds — compared to around 10 minutes for the same model on a cloud provider. On the other hand, cloud providers give you access to a wide range of edge devices without having to set up any hardware yourself.
You will learn how to:
- Install and configure embedl-onnxruntime on the target device
- Compile an ONNX model with quantization on the target device
- Profile the compiled model
- Invoke the model with real input data
Prerequisites
Make sure you have completed the setup guide and your hardware prerequisites, including passwordless SSH access to the target device.
Installing embedl-onnxruntime on the target device
The embedl-onnxruntime provider requires the embedl-onnxruntime package
to be installed on the target device. We recommend installing it in a
virtual environment:
```shell
# On the target device:
$ python3 -m venv ~/embedl-ort-env
$ source ~/embedl-ort-env/bin/activate
$ pip install embedl-onnxruntime
```

If you installed into a virtual environment, note the full path to the embedl-onnxruntime binary; you will need it when compiling later:
```shell
$ realpath ~/embedl-ort-env/bin/embedl-onnxruntime
/home/pi/embedl-ort-env/bin/embedl-onnxruntime
```

If the binary is already on the device’s $PATH, you can skip this step.
Creating a project
```shell
embedl-hub init \
  --project "ONNX Runtime SSH" \
  --artifact-dir ~/my-artifacts
```

This sets the default project and artifact directory for subsequent commands. The artifact directory is where compiled models, profiling results, and other outputs are stored on disk. Later commands, such as profiling a model from a previous compile step, look here for previously produced artifacts. If omitted, a platform-specific default location is used.
You can view your current settings at any time:
```shell
embedl-hub show
```

Connecting to your device
Next, configure a connection to your target device over SSH.
In the CLI, device connection details are passed directly to each command:
```shell
embedl-hub compile onnxruntime embedl-onnxruntime \
  --host 192.168.1.42 \
  --user pi \
  --exec-path /home/pi/embedl-ort-env/bin/embedl-onnxruntime \
  ...
```

If embedl-onnxruntime is on the device’s $PATH, you can omit the --exec-path flag.
Preparing a model
The compile step expects an ONNX file. You can save
your existing PyTorch model in ONNX format using torch.onnx.export:
```python
import torch
from torchvision.models import mobilenet_v2

model = mobilenet_v2(weights="IMAGENET1K_V2")
example_input = torch.rand(1, 3, 224, 224)
torch.onnx.export(
    model,
    example_input,
    "mobilenet_v2.onnx",
    input_names=["input"],
    output_names=["output"],
    opset_version=18,
    external_data=False,
    dynamo=False,
)
```

Compiling a model
Compile the ONNX model with quantization on the target device. The model is transferred to the device over SSH, compiled there, and the result is fetched back.
```shell
embedl-hub compile onnxruntime embedl-onnxruntime \
  --model /path/to/mobilenet_v2.onnx \
  --host 192.168.1.42 \
  --user pi
```

The embedl-onnxruntime provider quantizes the model as part of
compilation, applying INT8 post-training quantization to lower the
precision of weights and activations. This reduces memory usage and
inference latency on the target device.
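To build intuition for what INT8 post-training quantization does, here is a minimal NumPy sketch of symmetric per-tensor quantization. This is an illustration of the general technique, not the provider's actual implementation, and the function names are hypothetical:

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor INT8 quantization: map floats to [-127, 127]."""
    scale = np.abs(w).max() / 127.0  # one scale factor for the whole tensor
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximation of the original floats."""
    return q.astype(np.float32) * scale

w = np.random.randn(64, 32).astype(np.float32)  # stand-in for a weight tensor
q, scale = quantize_int8(w)
err = np.abs(dequantize(q, scale) - w).max()  # rounding error is at most scale / 2
```

Each INT8 value occupies a quarter of the memory of a float32, which is where the memory and latency savings come from; the cost is the rounding error bounded by half the scale step.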
Providing calibration data
Although quantization reduces the model’s precision, you can mitigate the accuracy loss by providing calibration data — a small set of representative input samples. You don’t need a large dataset; usually, a few hundred samples are more than enough. If no calibration data is provided, random data is used.
Calibration data is not yet supported via the CLI for embedl-onnxruntime.
Use the Python API instead.
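As a sketch of what a calibration set looks like, the snippet below assembles a few hundred preprocessed samples with the model's input shape and dtype. The random arrays and the calibration.npz filename are placeholders for illustration; in practice you would use real inputs drawn from your dataset, since calibrating on random data is no better than providing none:

```python
import numpy as np

# Gather ~200 representative, already-preprocessed samples with the same
# shape and dtype the model expects (here NCHW float32 for MobileNetV2).
num_samples = 200
calibration_batch = np.stack(
    [np.random.rand(3, 224, 224).astype(np.float32) for _ in range(num_samples)]
)
# Replace the random arrays above with real preprocessed inputs.
np.savez("calibration.npz", input=calibration_batch)
```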
Note: Some models have operations that are notoriously difficult to quantize, which can lead to a large drop in accuracy. One example is the softmax function in the attention layers of large language models (LLMs).
Profiling a model
Profile the compiled model on the target device:
```shell
embedl-hub profile onnxruntime embedl-onnxruntime \
  --from-run latest \
  --host 192.168.1.42 \
  --user pi
```

Use embedl-hub log to view your runs.
Profiling gives you the model’s latency on the target hardware, which layers are slowest, the number of layers executed on each compute unit type, and more. You can use this information to iterate on the model’s design and answer questions like:
- Can we optimize the slowest layer?
- Why aren’t certain layers running on the expected compute unit?
Invoking a model
Invoke the compiled model with real input data to get inference outputs:
```shell
embedl-hub invoke onnxruntime embedl-onnxruntime \
  --from-run latest \
  --host 192.168.1.42 \
  --user pi \
  --input /path/to/input.npz
```

The --input flag accepts a .npz file: a NumPy archive where each key is an input tensor name and each value is the corresponding array.
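For example, an input archive for the MobileNetV2 exported earlier (input name "input", shape 1×3×224×224) could be built like this; the random tensor is a placeholder for a real preprocessed image:

```python
import numpy as np

# Build a single input tensor matching the exported model's
# input name ("input") and shape (1, 3, 224, 224).
x = np.random.rand(1, 3, 224, 224).astype(np.float32)
# In practice, load a real image here: resize to 224x224, scale to [0, 1],
# normalize, and transpose from HWC to NCHW before saving.
np.savez("input.npz", input=x)
```

The keyword argument name (`input`) must match the model's input tensor name, which is why the export example above sets `input_names=["input"]`.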
Next steps
- Learn how to view, name, and tag your runs, and how to interpret profiling results in the exploring results guide.
- See the providers guide for the full reference of supported provider and toolchain combinations.