Quickstart
Run your first model from start to finish with embedl-hub
This guide shows you how to go from having an idea for an application to benchmarking a fine-tuned model on remote hardware. To showcase this, we will fine-tune a model for a cat detection application that will run on a Samsung Galaxy S24 mobile phone.
You will learn how to:
- Find a model using Embedl Hub’s on-device benchmarks
- Fine-tune, quantize, and benchmark the model using the Embedl Hub CLI
Prerequisites
We assume you’ve read the overview documentation and that you already:
- Created an Embedl Hub account
- Installed the Embedl Hub CLI
- Configured an API Key
- Set up a remote hardware provider, such as Qualcomm AI Hub
Create a project
Create a project and experiment for the cat detector application:
embedl-hub init \
--project "Cat Detector" \
--experiment "My own cat detector model tracking"
The project’s metadata is stored locally in a file named .embedl_hub_ctx.
To view the contents of this file, run:
embedl-hub show
You can use the embedl-hub show command at any time to determine which project is currently active.
Choose a model
Embedl Hub provides interactive tools and thousands of on-device benchmarks to help you find the best model for your application. Visit the Explore page, select your hardware platform of interest, and compare how different models perform based on accuracy and on-device latency.
Look for a model that fulfills your latency constraints on the target hardware and has high accuracy. High accuracy on a diverse dataset, such as ImageNet, suggests better performance on your own dataset during fine-tuning.
When you find an ideal model, click on the model’s datapoint in the Explore graph. This will take you to the model’s details page, where you can learn more about the model and find its ID.
For this guide, we will use the model torchvision-mobilenet-v2-int8.
Prepare a dataset
Prepare the dataset that you will use to fine-tune and, optionally, quantize the model.
The dataset should have two classes (cat and not_cat), and it should have the following structure:
/path/to/dataset/
├── train/
│   ├── cat/
│   │   ├── example_0.jpg
│   │   ├── example_1.jpg
│   │   └── ...
│   └── not_cat/
│       ├── example_0.jpg
│       ├── example_1.jpg
│       └── ...
└── val/
    ├── cat/
    │   ├── example_0.jpg
    │   └── ...
    └── not_cat/
        ├── example_0.jpg
        └── ...
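If your images are not yet organized this way, a small script can create the split for you. The sketch below is illustrative and not part of the embedl-hub CLI; it assumes a source folder with one subfolder per class (the paths are placeholders) and copies roughly 80% of each class into train/ and the rest into val/:

import random
import shutil
from pathlib import Path

SRC = Path("/path/to/raw_images")   # assumed layout: raw_images/cat/, raw_images/not_cat/
DST = Path("/path/to/dataset")
VAL_FRACTION = 0.2                  # hold out ~20% of each class for validation

random.seed(0)
for class_dir in [d for d in SRC.iterdir() if d.is_dir()]:
    images = sorted(class_dir.glob("*.jpg"))
    random.shuffle(images)
    n_val = int(len(images) * VAL_FRACTION)
    for split, split_images in (("val", images[:n_val]), ("train", images[n_val:])):
        out_dir = DST / split / class_dir.name
        out_dir.mkdir(parents=True, exist_ok=True)
        for img in split_images:
            shutil.copy2(img, out_dir / img.name)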
Fine-tune the model
With the model selected and dataset prepared, we’re ready to fine-tune the model!
We use the ID of the model we chose in Choose a model for --id, and we use the location of the dataset we prepared in Prepare a dataset for --data.
embedl-hub tune \
--id torchvision-mobilenet-v2-int8 \
--num-classes 2 \
--data /path/to/dataset \
--epochs 10 \
--batch-size 64 \
--learning-rate 0.0001
If you have access to a GPU and want help choosing hyperparameters, simply omit them, and the CLI will help you find a good starting point.
By default, embedl-hub tune will save the fine-tuned model as mobilenet_v2_tuned.pt.
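If you want to sanity-check the fine-tuned checkpoint locally before exporting, a rough sketch like the one below works, assuming the .pt file is a full serialized PyTorch module (if it is a state_dict instead, load it into torchvision's mobilenet_v2 with num_classes=2 first):

import torch

# Assumption: the checkpoint is a pickled nn.Module, so weights_only is disabled.
model = torch.load("/path/to/mobilenet_v2_tuned.pt", map_location="cpu", weights_only=False)
model.eval()

dummy = torch.randn(1, 3, 224, 224)   # one fake RGB image at the training resolution
with torch.no_grad():
    logits = model(dummy)
print(logits.shape)                   # expect torch.Size([1, 2]) for cat / not_cat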
Export the model
Now that we’ve fine-tuned the model on the cat detection dataset, let’s verify that it runs as expected on the target hardware. This process requires a series of steps:
Export: PyTorch -> ONNX
(Quantize: ONNX -> ONNX)
Compile: ONNX -> TFLite
Export the tuned PyTorch model to ONNX format for use in later steps. Be sure to specify the model’s target image size and device:
embedl-hub export \
--model /path/to/mobilenet_v2_tuned.pt \
--size 224,224 \
--device "Samsung Galaxy S24"
Since we haven’t set an output name, embedl-hub export will save the model as mobilenet_v2_tuned.onnx.
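As an optional local check before moving on, you can run the exported file with onnxruntime (a separate dependency, not part of the embedl-hub CLI). A minimal sketch, assuming the usual NCHW input layout from the PyTorch export:

import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("/path/to/mobilenet_v2_tuned.onnx")
input_name = session.get_inputs()[0].name   # read the input name from the model

dummy = np.random.rand(1, 3, 224, 224).astype(np.float32)   # NCHW at the exported size
outputs = session.run(None, {input_name: dummy})
print(outputs[0].shape)                                      # expect (1, 2) for cat / not_cat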
(Optional) Quantize the model
Quantizing a model can drastically reduce its inference latency on hardware, so we recommend completing this step.
Quantization lowers the number of bits used to represent the weights and activations in a neural network, which reduces both the memory and compute needed to run the model.
Although lowering the model’s precision can also reduce its accuracy, you can mitigate this by calibrating the model on example data. You don’t need a large dataset to achieve good quantized accuracy; a few hundred samples are usually more than enough.
embedl-hub quantize \
--model /path/to/mobilenet_v2_tuned.onnx \
--data /path/to/dataset \
--num-samples 100
Note: Some models have operations that are notoriously difficult to quantize, which can lead to a huge drop in accuracy. One example is the softmax function in the attention layers of large language models (LLMs).
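For intuition about what quantization does numerically, here is a rough illustration (not embedl-hub's internal algorithm) of mapping a float tensor to 8-bit values using a scale and zero point estimated from calibration data:

import numpy as np

activations = np.random.randn(1000).astype(np.float32)   # stand-in for calibration data

lo, hi = float(activations.min()), float(activations.max())
scale = (hi - lo) / 255.0                                 # 255 = 2**8 - 1 quantization steps
zero_point = round(-lo / scale)

q = np.clip(np.round(activations / scale) + zero_point, 0, 255).astype(np.uint8)
dequantized = (q.astype(np.float32) - zero_point) * scale

print("max abs error:", np.abs(activations - dequantized).max())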
Compile the model
Now, we can prepare the model for inference on a Samsung Galaxy S24 hosted in the cloud. We choose a runtime, such as LiteRT (formerly known as TFLite). Then, we convert the model from the generic ONNX representation to an appropriate hardware-friendly format:
embedl-hub compile \
--model /path/to/mobilenet_v2_tuned_quantized.onnx \
--device "Samsung Galaxy S24" \
--runtime tflite
By default, embedl-hub compile will save the compiled model as mobilenet_v2_tuned_quantized.tflite.
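If you would like to inspect the compiled model locally before sending it to remote hardware, you can load it with the LiteRT interpreter (this requires the tensorflow package and is not part of the embedl-hub workflow). A minimal sketch that reads the expected input shape and dtype from the model itself:

import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="/path/to/mobilenet_v2_tuned_quantized.tflite")
interpreter.allocate_tensors()

inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

# Quantized models often expect integer inputs, so build the dummy input accordingly.
if np.issubdtype(inp["dtype"], np.integer):
    info = np.iinfo(inp["dtype"])
    dummy = np.random.randint(info.min, info.max, size=inp["shape"], dtype=inp["dtype"])
else:
    dummy = np.random.rand(*inp["shape"]).astype(inp["dtype"])

interpreter.set_tensor(inp["index"], dummy)
interpreter.invoke()
print(interpreter.get_tensor(out["index"]).shape)   # expect (1, 2) for cat / not_cat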
Benchmark the model on remote hardware
Let’s evaluate how well the model performs using remote hardware:
embedl-hub benchmark \
--model /path/to/mobilenet_v2_tuned_quantized.tflite \
--device "Samsung Galaxy S24"
Benchmarking the model gives useful information such as the model’s latency on the hardware platform, which layers are slowest, the number of layers executed on each compute unit type, and more! We can use this information for advanced debugging and for iterating on the model’s design. We can answer questions like:
- Can we optimize the slowest layer?
- Why aren’t certain layers executed on the expected compute unit?