Quickstart
Run your first model from start to finish with embedl-hub
This guide shows you how to go from having an idea for an application to benchmarking a model on remote hardware. To showcase this, we will optimize and profile a model that will run on a Samsung Galaxy S24 mobile phone.
You will learn how to quantize, compile, and benchmark a model using the Embedl Hub Python library.
Prerequisites
If you haven’t already done so, follow the instructions in the setup guide to:
- Create an Embedl Hub account
- Install the Embedl Hub Python library
- Configure an API Key
- Set up a remote hardware cloud, such as Qualcomm AI Hub
Create a project and experiment
Create a project and experiment for the application:
embedl-hub init \
--project "Quickstart" \
--experiment "Samsung Galaxy S24 image classifier"
The project’s metadata is stored locally in a file named .embedl_hub_ctx.
.embedl_hub_ctx is personal to you and should not be committed to version control. Add it to your .gitignore.
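If you prefer to script this step, the sketch below appends the entry to a .gitignore only when it is missing. The helper name ensure_ignored is hypothetical and not part of the Embedl Hub library. The demo runs in a temporary directory, so it leaves your repository untouched.

```python
import tempfile
from pathlib import Path


def ensure_ignored(entry: str, gitignore: Path) -> bool:
    """Append `entry` to `gitignore` if absent. Return True if the file changed.

    Hypothetical helper for illustration; not part of Embedl Hub.
    """
    text = gitignore.read_text() if gitignore.exists() else ""
    if entry in (line.strip() for line in text.splitlines()):
        return False
    # Start on a fresh line if the file does not already end with a newline.
    prefix = "" if text == "" or text.endswith("\n") else "\n"
    gitignore.write_text(text + prefix + entry + "\n")
    return True


# Demo in a throwaway directory; point `gitignore` at your repository's file instead.
with tempfile.TemporaryDirectory() as tmp:
    gitignore = Path(tmp) / ".gitignore"
    gitignore.write_text("__pycache__/")  # existing content, no trailing newline
    added_first = ensure_ignored(".embedl_hub_ctx", gitignore)
    added_again = ensure_ignored(".embedl_hub_ctx", gitignore)
    contents = gitignore.read_text()

print(contents)
```

Calling the helper a second time is a no-op, so it is safe to run from a setup script.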
To view the contents of this file, run:
embedl-hub show
You can use the embedl-hub show command at any time to determine which project is currently active.
Compile the model from PyTorch to ONNX
Now that we’ve created a project and experiment, let’s verify that the model runs as expected on the target hardware. This process requires a series of steps:
- Compile: PyTorch -> ONNX
- (Quantize: ONNX -> ONNX)
- Compile: ONNX -> TFLite
The compile step expects a TorchScript file.
You can convert your existing PyTorch model using tracing or scripting.
For this guide, we will convert the Torchvision MobileNet V2 model to TorchScript using scripting:
import torch
from torchvision.models import mobilenet_v2
# Define the model in inference mode and create an example input
model = mobilenet_v2(weights="IMAGENET1K_V2").eval()
example_input = torch.rand(1, 3, 224, 224)
# Script the model; example_inputs takes a list of argument tuples
script_model = torch.jit.script(model, example_inputs=[(example_input,)])
# Save the converted model to disk
torch.jit.save(script_model, "/path/to/mobilenet_v2.pt")
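Tracing is the other conversion route mentioned above. The following is a minimal sketch, assuming a small stand-in network: TinyClassifier is hypothetical and only keeps the example fast. torch.jit.trace records the operations executed for one example input, so data-dependent control flow in forward would not be captured.

```python
import torch
from torch import nn


class TinyClassifier(nn.Module):
    """Hypothetical stand-in for a real model; not part of Embedl Hub."""

    def __init__(self) -> None:
        super().__init__()
        self.conv = nn.Conv2d(3, 8, kernel_size=3, padding=1)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(8, 10)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = torch.relu(self.conv(x))
        x = self.pool(x).flatten(1)
        return self.fc(x)


model = TinyClassifier().eval()
example_input = torch.rand(1, 3, 224, 224)

# Trace the model: run it once and record the executed operations.
with torch.no_grad():
    traced_model = torch.jit.trace(model, example_input)

# Sanity check: traced and eager outputs should agree on the same input.
with torch.no_grad():
    difference = model(example_input) - traced_model(example_input)
    max_abs_diff = difference.abs().max().item()
print(f"max abs difference: {max_abs_diff:.2e}")
```

The traced module can be saved with torch.jit.save in the same way as the scripted one.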
Compile the saved model to ONNX format for use in later steps. Be sure to specify the model’s target input size, device and runtime:
embedl-hub compile \
--model /path/to/mobilenet_v2.pt \
--size 1,3,224,224 \
--device "Samsung Galaxy S24" \
--runtime onnx
Since we haven’t set an output name, embedl-hub compile will save the model as mobilenet_v2.onnx.
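If you drive the CLI from a Python script, one possible approach is to assemble the argument list programmatically. The sketch below uses only the flags shown in the command above. The helper build_compile_command is a hypothetical name, and the snippet prints the command rather than executing it.

```python
import shlex


def build_compile_command(model_path, size, device, runtime):
    """Assemble an `embedl-hub compile` invocation as an argument list.

    Hypothetical helper; it only uses flags shown in this guide.
    """
    return [
        "embedl-hub", "compile",
        "--model", str(model_path),
        # The CLI expects the input shape as comma-separated integers.
        "--size", ",".join(str(dim) for dim in size),
        "--device", device,
        "--runtime", runtime,
    ]


command = build_compile_command(
    "/path/to/mobilenet_v2.pt", (1, 3, 224, 224), "Samsung Galaxy S24", "onnx"
)
# Print a shell-safe version; the device name is quoted because it has spaces.
print(shlex.join(command))
# To run it for real: subprocess.run(command, check=True)
```

Passing a list instead of a single string avoids quoting problems with device names that contain spaces.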
(Optional) Quantize the model
Quantizing a model can drastically reduce its inference latency on hardware, so we recommend completing this step.
Quantization lowers the number of bits used to represent the weights and activations in a neural network, which reduces both the memory and compute needed to run the model.
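To make the arithmetic concrete, here is a minimal sketch of 8-bit affine quantization in plain Python. It is a generic textbook scheme, not necessarily the one that embedl-hub quantize applies, and the function names are illustrative.

```python
def quantize_params(x_min: float, x_max: float, num_bits: int = 8) -> tuple[float, int]:
    """Return (scale, zero_point) mapping [x_min, x_max] onto unsigned integers."""
    qmax = 2**num_bits - 1
    # Widen the range to include zero so that 0.0 is exactly representable.
    x_min, x_max = min(x_min, 0.0), max(x_max, 0.0)
    scale = (x_max - x_min) / qmax or 1.0  # fall back to 1.0 for an all-zero range
    zero_point = min(max(round(-x_min / scale), 0), qmax)
    return scale, zero_point


def quantize(values, scale, zero_point, num_bits=8):
    """Map floats to integers in [0, 2**num_bits - 1], clipping out-of-range values."""
    qmax = 2**num_bits - 1
    return [min(max(round(v / scale) + zero_point, 0), qmax) for v in values]


def dequantize(q_values, scale, zero_point):
    """Map integers back to approximate float values."""
    return [(q - zero_point) * scale for q in q_values]


weights = [-1.0, -0.25, 0.0, 0.5, 1.0]
scale, zero_point = quantize_params(min(weights), max(weights))
quantized = quantize(weights, scale, zero_point)
restored = dequantize(quantized, scale, zero_point)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("quantized:", quantized)
print(f"scale={scale:.5f}, max round-trip error={max_error:.5f}")
```

Each float32 value (4 bytes) becomes one 8-bit integer (1 byte), at the cost of a rounding error of roughly half a quantization step.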
Although lowering the model’s precision can also reduce its accuracy, you can mitigate this by calibrating the quantized model on example data. You don’t need a large dataset to retain good accuracy after quantization; a few hundred samples are usually more than enough.
embedl-hub quantize \
--model /path/to/mobilenet_v2.onnx \
--data /path/to/dataset \
--num-samples 100
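Calibration typically means running sample inputs through the model and recording the range of values each tensor takes, so that the quantization grid covers that range. The sketch below illustrates the idea with a simple min/max observer on synthetic data. MinMaxObserver is a hypothetical class, and the calibration method that embedl-hub actually uses may differ.

```python
import random


class MinMaxObserver:
    """Tracks the smallest and largest activation seen during calibration."""

    def __init__(self) -> None:
        self.min_value = float("inf")
        self.max_value = float("-inf")

    def observe(self, activations: list[float]) -> None:
        self.min_value = min(self.min_value, min(activations))
        self.max_value = max(self.max_value, max(activations))

    def scale(self, num_bits: int = 8) -> float:
        """Step size of a uniform grid covering the observed range."""
        return (self.max_value - self.min_value) / (2**num_bits - 1)


# Synthetic stand-in for activations collected from 100 calibration samples.
rng = random.Random(0)
observer = MinMaxObserver()
for sample_index in range(100):
    activations = [rng.gauss(0.0, 1.0) for _ in range(256)]
    observer.observe(activations)
    if sample_index + 1 in (1, 10, 100):
        print(
            f"after {sample_index + 1:3d} samples: "
            f"range=[{observer.min_value:.2f}, {observer.max_value:.2f}]"
        )
```

In this synthetic setup the observed range widens quickly over the first samples and only slowly afterwards, which is one intuition for why a modest calibration set can suffice.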
Note: Some models have operations that are notoriously difficult to quantize, which can lead to a huge drop in accuracy. One example is the softmax function in the attention layers of large language models (LLMs).
Compile the model from ONNX to hardware runtime
Now, we can prepare the model for inference on a Samsung Galaxy S24 hosted in the cloud. We choose a runtime, such as LiteRT (formerly known as TFLite). Then, we convert the model from the generic ONNX representation to an appropriate hardware-friendly format:
embedl-hub compile \
--model /path/to/mobilenet_v2_quantized.onnx \
--size 1,3,224,224 \
--device "Samsung Galaxy S24" \
--runtime tflite
By default, embedl-hub compile will save the compiled model as mobilenet_v2_quantized.tflite.
Benchmark the model on remote hardware
Let’s evaluate how well the model performs using remote hardware:
embedl-hub benchmark \
--model /path/to/mobilenet_v2_quantized.tflite \
--device "Samsung Galaxy S24"
Benchmarking the model gives useful information such as the model’s latency on the hardware platform, which layers are slowest, the number of layers executed on each compute unit type, and more! We can use this information for advanced debugging and for iterating on the model’s design. We can answer questions like:
- Can we optimize the slowest layer?
- Why aren’t certain layers executed on the correct compute unit?
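As an illustration of how such questions could be explored, the sketch below analyzes a per-layer profile in plain Python. The record format, layer names, and latency values are made up for this example; the actual benchmark output of embedl-hub may be structured differently.

```python
from collections import Counter

# Hypothetical per-layer records with made-up numbers, for illustration only.
layer_profile = [
    {"name": "conv_stem", "compute_unit": "NPU", "latency_ms": 0.21},
    {"name": "block_3_depthwise", "compute_unit": "NPU", "latency_ms": 0.48},
    {"name": "block_7_project", "compute_unit": "GPU", "latency_ms": 0.35},
    {"name": "global_pool", "compute_unit": "NPU", "latency_ms": 0.05},
    {"name": "classifier", "compute_unit": "CPU", "latency_ms": 0.62},
]


def slowest_layers(profile, top_n=3):
    """Return the `top_n` records with the highest latency, slowest first."""
    return sorted(profile, key=lambda layer: layer["latency_ms"], reverse=True)[:top_n]


def layers_per_compute_unit(profile):
    """Count how many layers ran on each compute unit type."""
    return Counter(layer["compute_unit"] for layer in profile)


def off_target_layers(profile, target_unit):
    """Name the layers that did not run on the intended compute unit."""
    return [layer["name"] for layer in profile if layer["compute_unit"] != target_unit]


for layer in slowest_layers(layer_profile, top_n=2):
    print(f"{layer['name']}: {layer['latency_ms']} ms on {layer['compute_unit']}")
print(dict(layers_per_compute_unit(layer_profile)))
print("not on NPU:", off_target_layers(layer_profile, "NPU"))
```

Layers that appear both among the slowest and off the intended compute unit are natural first candidates for a closer look.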