MLPerf Tiny 1.0 Confirms: Plumerai’s Inference Engine is Again the World’s Fastest

This blog post was originally published at Plumerai’s website. It is reprinted here with the permission of Plumerai.

Earlier this year, in April, we presented our MLPerf Tiny 0.7 benchmark scores, showing that our inference engine runs your AI models faster than any other tool. Today, MLPerf released the Tiny 1.0 scores, and Plumerai has done it again: we still have the world’s fastest inference engine for Arm Cortex-M architectures. Faster inferencing means you can run larger and more accurate AI models, go into sleep mode earlier to save power, and deploy on smaller, lower-cost MCUs. Our inference engine executes the AI model as-is, with no additional quantization, binarization, pruning, or model compression. There is no accuracy loss; it simply runs faster and within a smaller memory footprint.

Here’s the overview of the MLPerf Tiny results for an Arm Cortex-M4 MCU:

| Vendor | Visual Wake Words | Image Classification | Keyword Spotting | Anomaly Detection |
|---|---|---|---|---|
| Plumerai | 208.6 ms | 173.2 ms | 71.7 ms | 5.6 ms |
| STMicroelectronics | 230.5 ms | 226.9 ms | 75.1 ms | 7.6 ms |
| OctoML (microTVM, CMSIS-NN) | 301.2 ms | 389.5 ms | 99.8 ms | 8.6 ms |
| OctoML (microTVM, native codegen) | 336.5 ms | 389.2 ms | 144.0 ms | 11.7 ms |

Official MLPerf Tiny 1.0 inference results for an Arm Cortex-M4 at 120 MHz (STM32L4R5ZIT6U).

Compared to TensorFlow Lite for Microcontrollers (TFLM) with Arm’s CMSIS-NN optimized kernels, we run 1.74x faster on average:

| Speed | Visual Wake Words | Image Classification | Keyword Spotting | Anomaly Detection | Average |
|---|---|---|---|---|---|
| TFLM latency | 335.97 ms | 376.08 ms | 100.72 ms | 8.45 ms | |
| Plumerai latency | 194.36 ms | 170.42 ms | 66.32 ms | 5.59 ms | |
| Speedup | 1.73x | 2.21x | 1.52x | 1.51x | 1.74x |

MLPerf Tiny inference latency for the Arm Cortex-M4 at 120 MHz inside the STM32L4R9 MCU.
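
For reference, the average speedup figure is consistent with a plain arithmetic mean of the per-benchmark latency ratios. The short Python sketch below (not from the original post) reproduces the 1.74x number from the published latencies; the 2.04x memory and 2.18x code-size averages further down follow the same arithmetic.

```python
# Sketch: reproduce the per-benchmark speedups and the 1.74x average from
# the latencies in the table above. Assumes the published average is a
# plain arithmetic mean of the per-benchmark ratios.
benchmarks  = ["Visual Wake Words", "Image Classification",
               "Keyword Spotting", "Anomaly Detection"]
tflm_ms     = [335.97, 376.08, 100.72, 8.45]   # TFLM + CMSIS-NN latency
plumerai_ms = [194.36, 170.42,  66.32, 5.59]   # Plumerai inference engine latency

speedups = [t / p for t, p in zip(tflm_ms, plumerai_ms)]
for name, s in zip(benchmarks, speedups):
    print(f"{name}: {s:.2f}x")                 # 1.73x, 2.21x, 1.52x, 1.51x

print(f"Average speedup: {sum(speedups) / len(speedups):.2f}x")  # 1.74x
```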

But latency is not the only thing that matters. Since MCUs often have very limited on-board memory, it’s important that the inference engine uses minimal memory while executing the neural network. MLPerf Tiny does not report memory usage, but here are the memory savings we achieve on the benchmarks compared to TensorFlow Lite for Microcontrollers. On average we use less than half the memory:

| Memory | Visual Wake Words | Image Classification | Keyword Spotting | Anomaly Detection | Average |
|---|---|---|---|---|---|
| TFLM memory (KiB) | 98.80 | 54.10 | 23.90 | 2.60 | |
| Plumerai memory (KiB) | 36.50 | 37.80 | 17.00 | 1.00 | |
| Reduction | 2.71x | 1.43x | 1.41x | 2.60x | 2.04x |

Memory usage on the MLPerf Tiny benchmarks is lower by an average factor of 2.04x.

MLPerf Tiny doesn’t report code size either, so again we compare against TensorFlow Lite for Microcontrollers. The table below shows that we reduce code size by an average factor of 2.18x. Using our inference engine means you can use MCUs with significantly smaller flash.

| Code size | Visual Wake Words | Image Classification | Keyword Spotting | Anomaly Detection | Average |
|---|---|---|---|---|---|
| TFLM code (KiB) | 209.60 | 116.40 | 126.20 | 67.20 | |
| Plumerai code (KiB) | 89.20 | 48.30 | 46.10 | 54.20 | |
| Reduction | 2.35x | 2.41x | 2.74x | 1.24x | 2.18x |

Code size on the MLPerf Tiny benchmarks is lower by an average factor of 2.18x.

Want to see how fast your models can run? You can submit them for free on our Plumerai Benchmark service. We email you the results in minutes.

Are you deploying AI on microcontrollers? Let’s talk.
