This blog post was originally published at Plumerai’s website. It is reprinted here with the permission of Plumerai.
Earlier this year, in April, we presented our MLPerf Tiny 0.7 benchmark scores, showing that our inference engine runs your AI models faster than any other tool. Today, MLPerf released the Tiny 1.0 scores and Plumerai has done it again: we still have the world’s fastest inference engine for Arm Cortex-M architectures. Faster inference means you can run larger and more accurate AI models, go into sleep mode earlier to save power, and deploy on smaller, lower-cost MCUs. Our inference engine executes the AI model as-is: no additional quantization, no binarization, no pruning, and no model compression. There is no accuracy loss. Your model simply runs faster and in a smaller memory footprint.
Here’s the overview of the MLPerf Tiny results for an Arm Cortex-M4 MCU:
Vendor | Visual Wake Words | Image Classification | Keyword Spotting | Anomaly Detection |
---|---|---|---|---|
Plumerai | 208.6 ms | 173.2 ms | 71.7 ms | 5.6 ms |
STMicroelectronics | 230.5 ms | 226.9 ms | 75.1 ms | 7.6 ms |
OctoML (microTVM, CMSIS-NN) | 301.2 ms | 389.5 ms | 99.8 ms | 8.6 ms |
OctoML (microTVM, native codegen) | 336.5 ms | 389.2 ms | 144.0 ms | 11.7 ms |
Official MLPerf Tiny 1.0 inference results for an Arm Cortex-M4 at 120MHz (STM32L4R5ZIT6U).
Compared to TensorFlow Lite for Microcontrollers (TFLM) with Arm’s CMSIS-NN optimized kernels, we run 1.74x faster on average:
Speed | Visual Wake Words | Image Classification | Keyword Spotting | Anomaly Detection | Speedup |
---|---|---|---|---|---|
TFLM latency | 335.97 ms | 376.08 ms | 100.72 ms | 8.45 ms | |
Plumerai latency | 194.36 ms | 170.42 ms | 66.32 ms | 5.59 ms | |
Speedup | 1.73x | 2.21x | 1.52x | 1.51x | 1.74x |
MLPerf Tiny inference latency for the Arm Cortex-M4 at 120MHz inside the STM32L4R9 MCU.
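For reference, the overall 1.74x figure appears to be the arithmetic mean of the four per-benchmark ratios: (335.97 / 194.36 + 376.08 / 170.42 + 100.72 / 66.32 + 8.45 / 5.59) / 4 ≈ (1.73 + 2.21 + 1.52 + 1.51) / 4 ≈ 1.74. The 2.04x memory and 2.18x code-size averages in the tables below appear to follow the same calculation.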
Latency is not the only thing that matters. Since MCUs often have very limited memory on board, it’s important that the inference engine uses as little memory as possible while executing the neural network. MLPerf Tiny does not report memory usage, but here are the savings we achieve on the same benchmarks compared to TensorFlow Lite for Microcontrollers; on average we use less than half the memory (a sketch of how this working memory can be measured with TFLM follows the table):
Memory | Visual Wake Words | Image Classification | Keyword Spotting | Anomaly Detection | Reduction |
---|---|---|---|---|---|
TFLM memory (KiB) | 98.80 | 54.10 | 23.90 | 2.60 | |
Plumerai memory (KiB) | 36.50 | 37.80 | 17.00 | 1.00 | |
Reduction | 2.71x | 1.43x | 1.41x | 2.60x | 2.04x |
Memory usage on the MLPerf Tiny benchmarks is lower by an average factor of 2.04x.
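To make the memory numbers concrete: in TFLM, most of this working memory is the tensor arena from which the interpreter allocates activations and scratch buffers. The sketch below shows one common way to read that usage back on-device. It is an illustrative TFLM example, not Plumerai’s API; the model array and operator list are placeholders, and the interpreter constructor signature varies slightly between TFLM releases.

```cpp
#include <cstddef>
#include <cstdint>

#include "tensorflow/lite/micro/micro_interpreter.h"
#include "tensorflow/lite/micro/micro_mutable_op_resolver.h"
#include "tensorflow/lite/schema/schema_generated.h"

// Placeholder: the .tflite flatbuffer compiled into flash as a C array.
extern const unsigned char g_model_data[];

// Generous upper bound; arena_used_bytes() reports how much of it the
// interpreter actually claimed after tensor allocation.
constexpr size_t kTensorArenaSize = 100 * 1024;
static uint8_t tensor_arena[kTensorArenaSize];

size_t MeasureArenaUsage() {
  const tflite::Model* model = tflite::GetModel(g_model_data);

  // Register only the operators this (hypothetical) model needs.
  static tflite::MicroMutableOpResolver<4> resolver;
  resolver.AddConv2D();
  resolver.AddDepthwiseConv2D();
  resolver.AddFullyConnected();
  resolver.AddSoftmax();

  static tflite::MicroInterpreter interpreter(model, resolver, tensor_arena,
                                              kTensorArenaSize);
  if (interpreter.AllocateTensors() != kTfLiteOk) {
    return 0;  // Arena too small or unsupported operator.
  }

  // RAM taken from the arena for activations, scratch buffers and metadata.
  return interpreter.arena_used_bytes();
}
```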
MLPerf Tiny doesn’t report code size either, so again we compare against TensorFlow Lite for Microcontrollers. The table below shows that we reduce code size by a factor of 2.18x on average, which means our inference engine also fits on MCUs with a significantly smaller flash size.
Code size | Visual Wake Words | Image Classification | Keyword Spotting | Anomaly Detection | Reduction |
---|---|---|---|---|---|
TFLM code (KiB) | 209.60 | 116.40 | 126.20 | 67.20 | |
Plumerai code (KiB) | 89.20 | 48.30 | 46.10 | 54.20 | |
Reduction | 2.35x | 2.41x | 2.74x | 1.24x | 2.18x |
Code size on the MLPerf Tiny benchmarks is lower by an average factor of 2.18x.
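The same principle applies whatever engine you use: only code that is linked into the image occupies flash. As an illustration of a common code-size lever in TFLM (not a description of how our engine achieves the numbers above), registering just the operators a model uses, rather than an all-ops resolver, keeps unused kernels out of the binary:

```cpp
#include "tensorflow/lite/micro/micro_mutable_op_resolver.h"

// Illustrative TFLM snippet; the operator list is a placeholder. Sizing the
// resolver for exactly the kernels the model needs means only those kernel
// implementations are referenced and linked, so every other operator stays
// out of flash. An all-ops resolver would pull in every kernel TFLM ships.
static tflite::MicroMutableOpResolver<4> op_resolver;

void RegisterModelOps() {
  op_resolver.AddConv2D();
  op_resolver.AddDepthwiseConv2D();
  op_resolver.AddAveragePool2D();
  op_resolver.AddSoftmax();
}
```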
Want to see how fast your models can run? You can submit them for free to our Plumerai Benchmark service and we’ll email you the results within minutes.
Are you deploying AI on microcontrollers? Let’s talk.