How do you run YOLO on Android?

Run YOLO on Android by exporting the model to TensorFlow Lite (TFLite), quantizing the weights for size and speed, feeding it live frames from CameraX, and running inference on the CPU or through the GPU or NNAPI delegate, depending on the device. The TFLite interpreter handles model loading and inference.

What is TensorFlow Lite?

TensorFlow Lite is a runtime for running machine-learning models on mobile, embedded, and edge devices. It supports quantization, hardware acceleration via GPU and NNAPI delegates, and produces small, fast model files suitable for on-device inference on Android, iOS, and microcontrollers.

How fast is YOLO on a phone?

Quantized YOLO models like YOLOv5n or YOLOv8n run at 15-30 FPS on mid-range Android phones using TFLite with GPU delegation. Performance varies significantly with model size, input resolution, and device hardware — benchmark on your target device before committing.

How do I optimize an ML model for Android?

Optimize for Android by quantizing to int8, pruning unused channels, using the smallest model variant that meets accuracy requirements, choosing the right TFLite delegate (GPU, NNAPI, or Hexagon), and running inference on a background thread to keep the UI smooth.

Can Android phones run real-time object detection?

Yes. Modern Android phones run real-time object detection at 15-30 FPS using quantized YOLO models via TFLite, with GPU or NNAPI acceleration. For assistive AI products like SmartON, this is fast enough to describe a user's environment in near real time.

Object Detection on Android: YOLO + TFLite

Why On-Device Object Detection

For SmartON's currency detection to be useful for visually impaired users, it needed to work without internet — in a market, in a rural area, anywhere. Server-based inference was ruled out from the start. On-device inference using TensorFlow Lite was the only viable path.

This tutorial covers the full pipeline: training a YOLO model, converting it to TFLite, integrating it with Android's camera API, and optimizing for real-world performance. The code patterns come directly from SmartON's currency detection implementation.

Step 1: Train Your YOLO Model

We use YOLOv8 (Ultralytics) for training. It exports cleanly to ONNX and from there to TFLite, has strong documentation, and the nano/small variants are small enough for mobile inference.

from ultralytics import YOLO

# Load pretrained weights to fine-tune (pass 'yolov8n.yaml' instead to train from scratch)
model = YOLO('yolov8n.pt')  # nano variant for mobile

results = model.train(
  data='currency.yaml',      # your dataset config
  epochs=100,
  imgsz=640,
  batch=16,
  device='cuda',
  workers=4,
  mosaic=1.0,                # mosaic, flips, and color (HSV) jitter are on by default
  mixup=0.1                  # mixup is off by default; 0.1 is an illustrative value
)

Your currency.yaml should specify train/val/test paths and class names. For SmartON, we had 7 classes (₹10, ₹20, ₹50, ₹100, ₹200, ₹500, ₹2000) plus an orientation class attribute.
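As a rough sketch of what that config might contain, the snippet below builds the text of a currency.yaml in the Ultralytics dataset layout. The directory names (datasets/currency, images/train, and so on) and the class label strings are assumptions for illustration, not SmartON's actual values.

```python
# Hypothetical class labels, one per denomination listed above.
CLASS_NAMES = [
    "rupee_10", "rupee_20", "rupee_50", "rupee_100",
    "rupee_200", "rupee_500", "rupee_2000",
]


def build_dataset_yaml(root: str, class_names: list[str]) -> str:
    """Return dataset config text in the Ultralytics YAML layout."""
    lines = [
        f"path: {root}",        # dataset root (assumed location)
        "train: images/train",  # split paths are relative to `path`
        "val: images/val",
        "test: images/test",
        "names:",
    ]
    # Class indices must match the indices used in the label files.
    lines += [f"  {index}: {name}" for index, name in enumerate(class_names)]
    return "\n".join(lines) + "\n"


print(build_dataset_yaml("datasets/currency", CLASS_NAMES))
```

Save the printed text as currency.yaml next to your training script, then adjust the paths to wherever your images and labels actually live.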

Step 2: Export to TFLite with INT8 Quantization

Quantizing from FP32 to INT8 cuts model size by roughly 4× and typically speeds up mobile inference by 2–3×, with under 2% accuracy loss for most detection tasks:

# Ultralytics handles the intermediate conversions internally; one call is enough
model.export(format='tflite', int8=True, data='currency.yaml')
# With int8=True, the exporter runs images from the `data` dataset through the model to calibrate quantization
# Make sure it contains 100-200 calibration images representative of your production data

The output is a .tflite file. For SmartON's currency model, the FP32 model was 12.4MB; the INT8 quantized version is 3.1MB — comfortable for embedding in an Android APK.

Step 3: Android Integration

Add the TFLite dependencies to your build.gradle:

dependencies {
  implementation 'org.tensorflow:tensorflow-lite:2.14.0'
  implementation 'org.tensorflow:tensorflow-lite-gpu:2.14.0'  // GPU delegate
  implementation 'org.tensorflow:tensorflow-lite-support:0.4.4'
}

Load and run the model:

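import android.content.Context
import android.graphics.Bitmap
import org.tensorflow.lite.Interpreter
import org.tensorflow.lite.gpu.GpuDelegate
import org.tensorflow.lite.support.common.FileUtil

// Detection, preprocessBitmap, and postprocess are app-specific helpers (not shown)
class CurrencyDetector(private val context: Context) {
  private lateinit var interpreter: Interpreter

  fun init() {
    // Offload supported ops to the GPU
    val gpuDelegate = GpuDelegate()
    val options = Interpreter.Options().addDelegate(gpuDelegate)

    // Memory-map the model from the app's assets folder
    val modelBuffer = FileUtil.loadMappedFile(context, "currency_int8.tflite")
    interpreter = Interpreter(modelBuffer, options)
  }

  fun detect(bitmap: Bitmap): List<Detection> {
    val input = preprocessBitmap(bitmap)  // resize to 640x640, normalize
    // YOLOv8 at 640x640 with 7 classes emits [1, 4 + 7, 8400];
    // confirm shape and dtype with interpreter.getOutputTensor(0) for your export
    val output = Array(1) { Array(11) { FloatArray(8400) } }

    interpreter.run(input, output)
    return postprocess(output, bitmap.width, bitmap.height)
  }
}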

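The postprocessing step applies non-maximum suppression (NMS) and maps detection coordinates from model space back to image space. If you would rather not implement NMS yourself, the ObjectDetector class in the TFLite Task Library (tensorflow-lite-task-vision, a separate dependency from the Support Library) bundles postprocessing. It expects a model with TFLite metadata and an SSD-style output signature, though, so a raw YOLO export will usually need extra conversion work before it loads.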
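For a hand-rolled version, the sketch below shows greedy NMS and the coordinate mapping in plain Python so the logic is easy to port to Kotlin. It assumes boxes are already decoded to (x1, y1, x2, y2) in model-input pixels, that the frame was resized without letterbox padding, and that NMS is class-agnostic. The 0.45 IoU threshold and the sample boxes are illustrative choices, not SmartON's settings.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.45):
    """Return indices of the boxes kept by greedy NMS, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in kept):
            kept.append(i)
    return kept


def to_image_space(box, model_size, image_width, image_height):
    """Scale a box from model-input pixels to original image pixels."""
    sx, sy = image_width / model_size, image_height / model_size
    x1, y1, x2, y2 = box
    return (x1 * sx, y1 * sy, x2 * sx, y2 * sy)


boxes = [(100, 100, 300, 200), (110, 105, 305, 210), (400, 300, 500, 380)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
print(kept)  # [0, 2]: the second box overlaps the first and is suppressed
print([to_image_space(boxes[i], 640, 1280, 960) for i in kept])
```

If your preprocessing letterboxes the frame to preserve aspect ratio, subtract the padding before scaling. Per-class NMS is the other common variant: run the same loop separately for each class.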

Step 4: Camera Integration with CameraX

CameraX (Jetpack) is the right choice for camera integration in 2025. It handles device-specific quirks and lifecycle management, and it provides a clean API for analysis use cases:

val imageAnalysis = ImageAnalysis.Builder()
  // CameraX picks the closest supported size; newer releases prefer setResolutionSelector
  .setTargetResolution(Size(640, 640))
  .setBackpressureStrategy(ImageAnalysis.STRATEGY_KEEP_ONLY_LATEST)
  .build()

imageAnalysis.setAnalyzer(cameraExecutor) { imageProxy ->
  try {
    val bitmap = imageProxy.toBitmap()  // built into CameraX 1.3+; older versions need a helper
    val detections = currencyDetector.detect(bitmap)
    runOnUiThread { updateUI(detections) }
  } finally {
    imageProxy.close()  // IMPORTANT: always close the proxy, even if detection throws
  }
}

STRATEGY_KEEP_ONLY_LATEST is critical for real-time inference. If your model can't process frames as fast as the camera produces them, this strategy drops stale frames rather than building a backlog — keeping inference current rather than delayed.
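The frame-dropping behavior is easy to see in a small simulation. The sketch below is a simplified model of a keep-only-latest buffer, not CameraX code: frames arrive at a fixed interval, the analyzer handles one frame at a time, and whenever it becomes free it takes the newest frame that has arrived. The 33 ms interval (roughly 30 FPS) is an assumed camera rate.

```python
def frames_analyzed(frame_interval_ms: int, inference_ms: int, num_frames: int) -> list[int]:
    """Indices of the frames analyzed when only the newest waiting frame is kept."""
    analyzed = []
    last = -1    # index of the most recently analyzed frame
    free_at = 0  # time (ms) at which the analyzer becomes idle
    while True:
        # Newest frame that has already arrived when the analyzer frees up.
        newest = min(num_frames - 1, free_at // frame_interval_ms)
        if newest > last:
            start = free_at    # a frame is waiting: take it immediately
        else:
            newest = last + 1  # nothing new yet: wait for the next frame
            if newest >= num_frames:
                return analyzed
            start = newest * frame_interval_ms
        analyzed.append(newest)
        last = newest
        free_at = start + inference_ms


# 68 ms inference against a ~30 FPS camera: about every other frame is dropped.
print(frames_analyzed(33, 68, 10))  # [0, 2, 4, 6, 8, 9]
# 28 ms inference keeps up, so nothing is dropped.
print(frames_analyzed(33, 28, 4))   # [0, 1, 2, 3]
```

With a blocking strategy, the slow case would instead stall frame delivery while the analyzer catches up, so each result would describe an increasingly old scene.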

Performance Optimization

Achieved performance on SmartON (Snapdragon 870, NNAPI):

YOLOv8n INT8, input 320×320: ~28ms per frame (35fps effective)
YOLOv8n INT8, input 640×640: ~68ms per frame (14fps effective)
YOLOv8s INT8, input 640×640: ~145ms per frame (6fps effective)

For real-time use cases, use the nano variant with 320×320 input. For accuracy-critical use cases (detailed OCR, precise distance estimation), step up to 640×640 at the cost of frame rate.
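The effective frame rates above follow directly from the per-frame latency. A small sketch of that arithmetic, assuming inference is the only per-frame cost (preprocessing and UI updates are ignored) and truncating to whole frames as the figures above do:

```python
# Per-frame latencies in milliseconds, taken from the measurements listed above.
MEASURED_MS = {
    ("YOLOv8n", 320): 28,
    ("YOLOv8n", 640): 68,
    ("YOLOv8s", 640): 145,
}


def effective_fps(latency_ms: float) -> float:
    """Frames per second achievable if each frame takes latency_ms."""
    return 1000.0 / latency_ms


def configs_meeting(target_fps: float) -> list[tuple[str, int]]:
    """Model and input-size pairs whose measured latency supports target_fps."""
    return [cfg for cfg, ms in MEASURED_MS.items() if effective_fps(ms) >= target_fps]


for (model, size), ms in MEASURED_MS.items():
    print(f"{model} INT8 @ {size}x{size}: {ms} ms -> {int(effective_fps(ms))} fps")

print(configs_meeting(30))  # [('YOLOv8n', 320)]
```

The same check works with latencies measured on your own target device, which is the number that matters when choosing a configuration.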

Enable the NNAPI delegate on supported devices (Snapdragon 855+). It offloads inference to the dedicated neural processing unit and is typically 2–4× faster than the GPU delegate for quantized models.

SmartON uses this pipeline for currency detection. Read the full SmartON build story, or reach out if you're building computer vision for Android.
