From PyTorch to shipping local AI on Android

By Elina Norling

On-device AI offers many advantages for Android apps: low-latency interactions, offline functionality, and data privacy, to name a few. But running AI on local devices is far harder than running it in a Jupyter notebook.

In this guide, we’ll break down why it is so hard and walk through how to optimize and run models on Android devices. We’ll also demonstrate how you can test models on different devices without needing physical access to a wide range of hardware.

Why run AI locally and why it’s hard on Android

Many modern Android apps rely on real-time intelligence to deliver a smooth and responsive user experience. Pose detection in fitness apps, AR filters in social apps, on-device audio processing, and live classification are all examples. These tasks benefit from running models locally rather than in the cloud, since the data can be processed directly where it is generated. Local inference gives you speed, privacy, and the ability to work offline – but it also comes with a challenge: getting these models to perform consistently across the enormous range of Android devices.

This became clear in a conversation I had recently with my friend Noah, an Android developer working on a lightweight pose-detection feature for a fitness app. He trained a MobileNet-based model in PyTorch, converted it to TFLite, and verified it on the three phones he had available: a Pixel 7, a Galaxy S21, and a mid-range Motorola. Everything looked smooth. But after release, he started receiving reviews from users on other devices reporting sluggish performance, unstable frame rates, and in some cases crashes before inference even began.

Noah’s experience isn’t unusual. In fact, it’s one of the most common issues Android developers run into when working with on-device AI: apps that work perfectly on a few phones but feel slow or broken on others, frustrated users leaving negative reviews, and developers ending up removing the on-device feature – losing many of the benefits of running AI on-device in the first place.

Digging deeper into the problem

To understand why situations like Noah’s happen, we need to look more closely at why the same model can show completely different latency, stability, and overall performance from one device to the next – the core of what makes on-device AI development so challenging.

1. Performance varies across devices and chipsets

Android hardware is highly diverse. Two devices released the same year can behave completely differently when running the exact same model. One may use an NPU and reach 60 FPS, another may fall back to GPU, and a third may run everything on the CPU and struggle to reach usable performance or even crash.

Rule of thumb:

  • CPUs run almost anything but rarely meet real-time needs.
  • GPUs are faster but depend heavily on runtime support (TFLite GPU delegate, NNAPI, Vulkan).
  • NPUs are fastest, but only for models correctly adapted and compiled for that chipset.

And it’s not just about faster or slower processors. Android devices vary widely in how their accelerators and drivers support different operations and precisions, and runtime delegates often make different decisions about which compute units to use. As a result, two phones can execute the same model through completely different paths – resulting in noticeably different stability and performance on each device.
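
To make this concrete, here is a minimal Kotlin sketch of how an app might pick an execution path with the standard TensorFlow Lite Android APIs. It assumes the usual tensorflow-lite and tensorflow-lite-gpu dependencies and a model already loaded into a ByteBuffer; the fallback order shown is just one possible choice.

import org.tensorflow.lite.Interpreter
import org.tensorflow.lite.gpu.CompatibilityList
import org.tensorflow.lite.gpu.GpuDelegate
import org.tensorflow.lite.nnapi.NnApiDelegate
import java.nio.ByteBuffer

// Pick the "best" execution path this particular device supports.
fun buildInterpreter(modelBuffer: ByteBuffer): Interpreter {
    val options = Interpreter.Options()
    val gpuCompat = CompatibilityList()
    when {
        // Use the GPU delegate only if this device is known to support it.
        gpuCompat.isDelegateSupportedOnThisDevice ->
            options.addDelegate(GpuDelegate(gpuCompat.bestOptionsForThisDevice))
        // Otherwise try NNAPI, which may route to an NPU or DSP on some chipsets.
        android.os.Build.VERSION.SDK_INT >= 27 ->
            options.addDelegate(NnApiDelegate())
        // Last resort: multi-threaded CPU execution.
        else -> options.setNumThreads(4)
    }
    return Interpreter(modelBuffer, options)
}

The same APK running this code can end up on three very different paths depending on the phone it lands on, which is exactly why testing on a single device tells you so little.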

This makes broad testing essential. Yet most developers don’t have access to many devices, which is why issues often remain hidden until after launch.

2. Development complexity and setup effort discourage local AI

Even once a model works on your own device, getting it ready for actual deployment requires navigating a surprisingly complex toolchain. Exporting a model from PyTorch to ONNX and then to TFLite is only the beginning. Many hardware vendors expose their own delegates, runtimes, and SDKs, and each of them behaves slightly differently.

Developers I’ve spoken to say that even small on-device features, such as a simple classifier or filter, can take a huge amount of effort to get running well. Setting up TFLite GPU delegates, NNAPI, or vendor-specific runtimes on Qualcomm or Google Tensor devices requires time and experimentation. And when something doesn’t work, error messages are often vague, making it difficult to pinpoint whether the issue is an operator the hardware doesn’t support, a precision (like FP32) the accelerator can’t handle, or simply missing hardware acceleration.
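
As a rough illustration of what that trial-and-error looks like (plain TensorFlow Lite here, not Embedl Hub code), the Kotlin sketch below tries the NNAPI delegate and falls back to the CPU while logging the failure, so an unsupported op or precision at least leaves a trace:

import android.util.Log
import org.tensorflow.lite.Interpreter
import org.tensorflow.lite.nnapi.NnApiDelegate
import java.nio.ByteBuffer

private const val TAG = "OnDeviceModel"

// Try hardware acceleration first, but keep the error message when it fails.
fun createInterpreter(modelBuffer: ByteBuffer): Interpreter =
    try {
        Interpreter(modelBuffer, Interpreter.Options().addDelegate(NnApiDelegate()))
    } catch (e: Exception) {
        // Typical causes: an op the accelerator doesn't implement, or an FP32-only layer.
        Log.w(TAG, "NNAPI delegate failed, falling back to CPU: ${e.message}")
        Interpreter(modelBuffer, Interpreter.Options().setNumThreads(4))
    }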

3. Battery, speed, and hardware limitations are obstacles

Finally, even if you get the model running, real-world constraints remain. Phones have limited thermal budgets; running heavy models can overheat the device and throttle performance. Battery drain is a persistent concern – users will quickly uninstall an app that consumes too much power. Smaller or very old phones also have limited RAM and weaker accelerators, meaning some models simply will not run well no matter how they are optimized.

Several developers we have talked to point out that “not every device has enough AI processing power to handle heavy workloads with real-time requirements,” and this makes on-device performance inherently unpredictable. Bigger models are often too slow, too “hot”, or too power-hungry. Smaller models may lack accuracy. Getting the right balance requires careful optimization.
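
One pragmatic mitigation is to adapt to the device’s thermal state at runtime. The Kotlin sketch below uses PowerManager (available from Android 10) to switch to a lighter model variant when the phone is already running hot; the asset names are hypothetical.

import android.content.Context
import android.os.Build
import android.os.PowerManager

// Pick a model variant based on the current thermal headroom (API 29+).
fun chooseModelAsset(context: Context): String {
    if (Build.VERSION.SDK_INT >= Build.VERSION_CODES.Q) {
        val pm = context.getSystemService(Context.POWER_SERVICE) as PowerManager
        if (pm.currentThermalStatus >= PowerManager.THERMAL_STATUS_MODERATE) {
            return "pose_small_int8.tflite"  // hypothetical lighter variant
        }
    }
    return "pose_full.tflite"  // hypothetical full-size variant
}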

Solving Android devs’ on-device problems

The challenges described above are exactly what we built Embedl Hub to solve. Because these issues are ones we’ve encountered ourselves, we set out to create a tool that helps you identify which models perform well across devices, understand how they behave on different chipsets, optimize them for specific hardware targets, and verify the models on real Android devices in the cloud.

At a high level, the platform lets you:

  • Compile your model for the correct runtime and accelerators on the target device, ensuring it can use the available hardware.
  • Optimize your model to reduce latency, memory usage, and energy consumption, and to enable NPU acceleration on many modern chipsets.
  • Benchmark your model on real edge hardware in the cloud to measure and compare device-specific latency, memory use, and execution paths.

Embedl Hub logs your metrics, parameters, and benchmarks, and presents them in a web UI where you can inspect layer-level behavior, compare devices side by side, and reproduce every run. Our goal with this UI is to make it easy to confidently choose the best model–device combination before releasing your app.

To showcase the platform we’ve built, we’ll demonstrate how it can be used to optimize and profile a model running on a Samsung Galaxy S24 mobile phone.

Compile the model

Let’s say you want to run a MobileNetV2 model trained in PyTorch. First, export the model to ONNX and then compile it for the target runtime. In this case, we want to run it using LiteRT (TFLite).

To compile it with the embedl-hub CLI, you run the command:

embedl-hub compile \
    --model /path/to/mobilenet_v2.onnx

This step gives you an early indication of whether the model is actually compatible with the device’s chipset and execution paths, so you can catch the kinds of issues that usually only appear after launch, before users start leaving reviews.
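
Once the compiled .tflite artifact comes back, loading it in the app is straightforward. Here is a minimal Kotlin sketch, assuming the file is bundled under assets/ and that the model uses the standard MobileNetV2 ImageNet shapes:

import android.content.Context
import org.tensorflow.lite.Interpreter
import java.nio.ByteBuffer
import java.nio.ByteOrder

fun runOnce(context: Context) {
    // Load the compiled model from the APK's assets folder into a direct buffer.
    val bytes = context.assets.open("mobilenet_v2.tflite").use { it.readBytes() }
    val modelBuffer = ByteBuffer.allocateDirect(bytes.size)
        .order(ByteOrder.nativeOrder())
        .apply { put(bytes); rewind() }

    val interpreter = Interpreter(modelBuffer, Interpreter.Options().setNumThreads(4))

    // Standard MobileNetV2 ImageNet shapes: 1x224x224x3 input, 1x1000 output.
    val input = Array(1) { Array(224) { Array(224) { FloatArray(3) } } }
    val output = Array(1) { FloatArray(1000) }
    interpreter.run(input, output)
    interpreter.close()
}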

Optimize the model

Quantization is an optional but highly recommended step that can drastically reduce inference latency and memory usage. On mobile and embedded hardware, most optimization gains come from quantization: by lowering the numerical precision of weights and activations (for example to INT8), the model becomes faster and more power-efficient. It is especially useful when deploying models to resource-constrained hardware such as mobile phones or embedded boards, and it is often required for NPU acceleration on modern Android devices.

While this can reduce the model’s accuracy, you can minimize the loss by calibrating with a small sample dataset, typically just a few hundred examples.

embedl-hub quantize \
    --model /path/to/mobilenet_v2.tflite \
    --data /path/to/dataset \
    --num-samples 100

This feature directly addresses issues developers frequently encounter, such as inference failing due to unsupported ops. It also reduces memory use and battery consumption, and allows the model to run more efficiently on hardware that benefits from quantized execution, including NPUs and CPUs.
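
One practical detail worth knowing: a fully INT8-quantized model expects quantized inputs. The Kotlin sketch below, which assumes a single image input already normalized into a FloatArray, reads the scale and zero point from the input tensor and converts the floats accordingly.

import org.tensorflow.lite.Interpreter
import java.nio.ByteBuffer
import java.nio.ByteOrder
import kotlin.math.roundToInt

// Convert normalized float pixels into the INT8 range the quantized model expects.
fun quantizeInput(interpreter: Interpreter, pixels: FloatArray): ByteBuffer {
    val params = interpreter.getInputTensor(0).quantizationParams()
    val buffer = ByteBuffer.allocateDirect(pixels.size).order(ByteOrder.nativeOrder())
    for (p in pixels) {
        val q = ((p / params.scale) + params.zeroPoint).roundToInt().coerceIn(-128, 127)
        buffer.put(q.toByte())
    }
    buffer.rewind()
    return buffer
}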

Benchmark the model on remote hardware

Now that the model is compiled (and quantized), you can run it on real hardware directly through one of Embedl Hub’s integrated device clouds.

embedl-hub benchmark \
    --model /path/to/mobilenet_v2.quantized.tflite \
    --device "Samsung Galaxy S24"

This is where many of the earlier problems finally become visible: the benchmark results reveal how the model behaves on real devices. With these results, you can quickly see which devices run the model well, which don’t, and decide how to adapt or further develop the model before releasing your app.
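
The device cloud is the reliable way to measure, but for a quick sanity check on whatever phone you have at hand, a small timing helper like this Kotlin sketch (a few warm-up runs followed by the median over repeated inferences) gives a rough local number to compare against:

import android.os.SystemClock
import org.tensorflow.lite.Interpreter

// Rough local check: median latency in ms over a few timed runs after warm-up.
fun medianLatencyMs(interpreter: Interpreter, input: Any, output: Any, runs: Int = 20): Long {
    repeat(3) { interpreter.run(input, output) }  // warm-up runs
    val timings = LongArray(runs) {
        val start = SystemClock.elapsedRealtimeNanos()
        interpreter.run(input, output)
        (SystemClock.elapsedRealtimeNanos() - start) / 1_000_000
    }
    timings.sort()
    return timings[runs / 2]
}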

In this example, we run the model on Samsung Galaxy S24. There are a large number of devices to choose from on Embedl Hub – Galaxy phones, Pixel phones, Snapdragon development boards – allowing you to test across the very diversity that makes Android deployment difficult. See supported devices.

Analyze & compare performance in the Web UI

Benchmarking the model gives useful information such as the model’s latency on the hardware platform, which layers are slowest, the number of layers executed on each compute unit type, and more! We can use this information for debugging and for iterating on the model’s design. We can answer questions like:

  • How does my model behave across different chipsets?
  • Can we optimize the slowest layer?
  • Why aren’t certain layers executed on the correct compute unit?

This interface lets you verify model performance across many devices without repeating setup work, with all your on-device efforts gathered in one place.

The visualizations in the dashboard make it easy to understand why a model behaves differently across chipsets, helping you systematically improve and optimize its performance for the hardware you target. And by comparing multiple devices through our device cloud, you can confidently test and choose the best model–device combination before releasing your app.

Share your feedback

Embedl Hub is still in beta, and we’d love to hear your feedback and which features or devices you’d like to see next, so we can continue solving the problems Android devs face when building on-device AI.

Try it out at hub.embedl.com and let us know what you think!