On-device Inference with LiteRT
The LiteRT CompiledModel API is the modern standard for on-device ML
inference, offering streamlined hardware acceleration that significantly
outperforms the Interpreter API. It simplifies the deployment of .tflite
models across a wide range of edge platforms by providing a unified
developer experience and advanced features designed for maximum hardware
efficiency.
Why Choose the CompiledModel API?
While the Interpreter API remains available for backward compatibility, the
CompiledModel API is where new performance and accelerator features are
prioritized. It is the recommended choice for these reasons:
- Best-in-class GPU acceleration: Leverages ML Drift, the state-of-the-art GPU acceleration library, to deliver reliable GPU inference across mobile, web, desktop, and IoT devices. See GPU acceleration with LiteRT.
- Unified NPU access: Provides a single, consistent developer experience for accessing NPUs from providers such as Google Tensor, Qualcomm, and MediaTek, abstracting away vendor-specific compilers and runtime complexities. See NPU acceleration with LiteRT.
- Automated hardware selection: Automatically selects the optimal backend among CPU, GPU, and NPU based on available hardware and internal priority logic, eliminating the need for manual delegate configuration. You can also request a specific accelerator explicitly, as shown in the sketch after this list.
- Asynchronous execution: Uses OS-level mechanisms (such as sync fences) so hardware accelerators can start work as soon as previous tasks complete, without involving the CPU. This can reduce latency by up to 2x and delivers a smoother, more interactive AI experience.
- Efficient I/O buffer management: Leverages the TensorBuffer API to manage high-performance data flow between accelerators. This includes zero-copy buffer interop across AHardwareBuffer, OpenCL, and OpenGL, eliminating costly data copies between preprocessing, inference, and post-processing stages.
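To illustrate explicit accelerator selection, here is a minimal Kotlin sketch that compiles a model pinned to the GPU backend. It assumes the Kotlin CompiledModel API (package `com.google.ai.edge.litert`) with names as used in the LiteRT samples (`CompiledModel.create`, `CompiledModel.Options`, `Accelerator`); exact signatures may vary by release, so treat this as an orientation sketch rather than a definitive reference.

```kotlin
import com.google.ai.edge.litert.Accelerator
import com.google.ai.edge.litert.CompiledModel

// Sketch: request the GPU backend explicitly when compiling the model.
// Omitting the options lets the runtime apply its own backend selection.
// `context` is assumed to be an Android Context available in scope.
val gpuModel = CompiledModel.create(
    context.assets,                       // model bundled as an Android asset
    "mymodel.tflite",
    CompiledModel.Options(Accelerator.GPU),
)
```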
Get Started with CompiledModel API
For classical ML models, see the following demo apps (a minimal end-to-end Kotlin sketch follows the lists below):
- Image segmentation Kotlin App: CPU/GPU/NPU inference.
- Image segmentation C++ App: CPU/GPU/NPU inference with async execution.
For GenAI models, see the following demo apps:
- EmbeddingGemma semantic similarity C++ App: CPU/GPU/NPU inference.
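Before diving into the demo apps, the following Kotlin sketch shows the general CompiledModel flow: create the model, allocate input/output TensorBuffers, run inference, and read the results. Method names such as `createInputBuffers`, `writeFloat`, and `readFloat` follow the LiteRT Kotlin samples and may differ in your release; `inputSize` is a hypothetical placeholder for the first input tensor's element count.

```kotlin
import com.google.ai.edge.litert.CompiledModel

// Load the model from app assets and compile it for the default backend.
val model = CompiledModel.create(context.assets, "mymodel.tflite")

// Pre-allocate input and output TensorBuffers once and reuse them per inference.
val inputBuffers = model.createInputBuffers()
val outputBuffers = model.createOutputBuffers()

// Write input data (here a dummy float array; `inputSize` is a placeholder).
inputBuffers[0].writeFloat(FloatArray(inputSize) { 0f })

// Run inference. Buffers are passed explicitly, so they can be reused or
// shared zero-copy with preprocessing and post-processing stages.
model.run(inputBuffers, outputBuffers)

// Read the result of the first output tensor.
val result: FloatArray = outputBuffers[0].readFloat()

// Release native resources when done.
inputBuffers.forEach { it.close() }
outputBuffers.forEach { it.close() }
model.close()
```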
Supported platforms
The LiteRT CompiledModel API supports high-performance inference across Android,
iOS, Web, IoT, and desktop devices. See the platform-specific guides.