Wednesday, January 10, 2024

Etnaviv NPU update 14: Object detection with decent performance

Almost two months ago, when I got MobileNetV1 running with useful performance on my driver for the Vivante NPU, I took that milestone as a partial validation of my approach.

Partial because MobileNetV1 is by now quite an old model, and several newer iterations with better accuracy and better performance have appeared since. Would I be able to, without any documentation, add enough support to run newer models with useful performance?

Since then, I have been spending some time looking at the state of the art in object detection models, getting a sense of the gap between the features supported by my driver and the operations that the newer models use.

SSDLite MobileDet is already 3 years old but can still be considered state of the art on most hardware, combining good accuracy with low latency.

Its graph structure is more complex than MobileNet's, and it uses tensor addition operations, which my driver didn't support at the time. There are a few other unsupported operations as well, but those sit at the end of the graph and can be run on the CPU without much penalty.
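When bringing up a new operation, it helps to isolate it in a tiny model so the NPU's output can be compared against the CPU reference. Here is a minimal sketch of that idea in Python, assuming TensorFlow is available to build the model (the tensor shapes are just a representative guess at what one of MobileDet's residual connections looks like):

```python
import tensorflow as tf

class AddOnly(tf.Module):
    # A single tensor addition between two feature maps, the kind of
    # operation SSDLite MobileDet uses for its residual connections.
    @tf.function(input_signature=[
        tf.TensorSpec([1, 56, 56, 24], tf.float32),
        tf.TensorSpec([1, 56, 56, 24], tf.float32),
    ])
    def __call__(self, a, b):
        return tf.add(a, b)

model = AddOnly()
converter = tf.lite.TFLiteConverter.from_concrete_functions(
    [model.__call__.get_concrete_function()], model)
with open("add.tflite", "wb") as f:
    f.write(converter.convert())
```

Running add.tflite once through the delegate and once on the CPU makes any mismatch in the new operation immediately visible.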

So after implementing additions, along with a few medium-sized refactorings, I got the model running correctly.

Performance wasn't that bad at that point: at 129 ms per inference, it was twice as fast as the CPU and "only" 5 times slower than the proprietary driver.

I knew I was using extremely conservative values for the size of the output tiles, so I wrote some scripts to run hundreds of different convolution configurations and tabulate the parameters the proprietary driver used to program the hardware.
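The scripts boil down to a brute-force parameter sweep. A simplified sketch of the approach, where run_blob_and_dump() is a hypothetical stand-in for whatever mechanism captures the tile parameters from the proprietary driver's command stream:

```python
import csv
import itertools

def run_blob_and_dump(size, in_ch, out_ch, kernel, stride):
    """Hypothetical helper: run one convolution through the
    proprietary driver and parse the output tile size it programmed
    out of a dump of the command stream."""
    raise NotImplementedError

sizes    = [7, 14, 28, 56, 112, 224]   # input width/height
channels = [8, 16, 32, 64, 128, 256]   # input/output channel counts
kernels  = [1, 3, 5]
strides  = [1, 2]

with open("tiles.csv", "w", newline="") as f:
    out = csv.writer(f)
    out.writerow(["size", "in_ch", "out_ch", "kernel", "stride",
                  "tile_w", "tile_h"])
    for size, in_ch, out_ch, k, s in itertools.product(
            sizes, channels, channels, kernels, strides):
        tile_w, tile_h = run_blob_and_dump(size, in_ch, out_ch, k, s)
        out.writerow([size, in_ch, out_ch, k, s, tile_w, tile_h])
```

The resulting CSV is what ended up in the spreadsheet mentioned below.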

After a lot of time spent staring at a spreadsheet, I came up with a reasonable guess at the conditions that limit the size of the tiles. By using the biggest tile size that is still safe, I got much better performance: 56.149 ms, or almost 18 inferences per second.
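I won't pretend the snippet below captures the real rules, which are hardware-specific and still partly guesswork; it is only an illustration of the general shape of the heuristic, with a made-up on-chip buffer budget: pick the largest output tile whose working set still fits.

```python
def largest_safe_tile(out_size, in_ch, out_ch, kernel, stride,
                      sram_bytes=64 * 1024):
    """Illustrative only: the 64 KiB budget and the working-set
    formula are assumptions, not the actual hardware limits."""
    for tile in range(out_size, 0, -1):
        in_tile = (tile - 1) * stride + kernel        # input pixels needed
        working_set = (in_tile * in_tile * in_ch      # input tile (uint8)
                       + kernel * kernel * in_ch * out_ch  # weights
                       + tile * tile * out_ch)        # output tile
        if working_set <= sram_bytes:
            return tile
    return 1
```

The conservative values I had been using amount to always returning a small constant here; the speedup came from returning the largest tile that the guessed conditions allow.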

If we look at a practical use case such as the one supported by Frigate NVR, a typical frame rate for the video inputs is 5 FPS. With our current performance level, we could run 3-4 inferences on each frame if there are several objects being tracked at the same time, or serve 3-4 cameras simultaneously if not.
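The arithmetic behind those numbers, for the record:

```python
frame_budget_ms = 1000 / 5   # 5 FPS leaves 200 ms per frame
inference_ms = 56.149

print(f"{frame_budget_ms / inference_ms:.1f} inferences per frame")  # 3.6
print(f"{1000 / inference_ms:.1f} inferences per second")            # 17.8
```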

Given the price level of the single-board computers that contain the VIPNano, this is quite a good bang for your buck. And it is all open source and heading to mainline!

Next steps

I have started cleaning up the latest changes so they can be reviewed upstream, and I need to make sure the in-flight kernel patches are merged now that the 6.8 merge window has opened.
