Friday, February 23, 2024

Etnaviv NPU update 17: Faster!

In the last update I explained how compression of zero weights gave our driver such a big performance improvement.

Since then, I have explored further what could take us closer to the performance of the proprietary driver and saw the opportunity to gather some of the proverbial low-hanging fruit.

TL;DR

Our driver's performance on SSD MobileDet went from 32.7 ms to 24.8 ms, against the proprietary driver's 19.5 ms.

On MobileNetV1, our driver went from 9.9 ms to 6.6 ms, against the proprietary driver's 5.5 ms. Pretty close!

Enable more convolutions

Our driver was rejecting convolutions whose number of output channels is not divisible by the number of convolution cores in the NPU, because at the start of development the code that lays the weights out in memory didn't support that. That caused TensorFlow Lite to run those convolutions on the CPU, and some of them were big enough to take a few milliseconds, several times more than on the NPU.

When implementing support for bigger kernels I had to improve the tiling of the convolutions, and that included adding support for these other convolutions. So by just removing the rejection of these, we got a nice speed-up on SSD MobileDet: from 32.7 ms to 27 ms!
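
To give an idea of the kind of layout decision involved, here is a minimal sketch of distributing output channels across convolution cores when the count doesn't divide evenly. This is my own illustration, not the actual Mesa code; the function name and the "first cores take one extra channel" policy are assumptions made for the example.

    /* Hypothetical sketch: distribute a convolution's output channels across
     * the NPU's convolution cores, letting the first cores take one extra
     * channel when the count is not evenly divisible. */
    #include <stdio.h>

    static void
    split_output_channels(unsigned output_channels, unsigned num_cores,
                          unsigned *channels_per_core)
    {
       unsigned base = output_channels / num_cores;
       unsigned remainder = output_channels % num_cores;

       for (unsigned core = 0; core < num_cores; core++)
          channels_per_core[core] = base + (core < remainder ? 1 : 0);
    }

    int
    main(void)
    {
       unsigned per_core[4];

       /* For example, 18 output channels on 4 cores -> 5, 5, 4, 4. */
       split_output_channels(18, 4, per_core);
       for (unsigned i = 0; i < 4; i++)
          printf("core %u: %u channels\n", i, per_core[i]);

       return 0;
    }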

Removing the rejection didn't help on MobileNetV1, though, because all of its convolutions already have output channel counts that divide evenly among the cores.

Caching of the input tensor

So far we had only been caching the kernels in the on-chip SRAM. I spent some time looking at how the proprietary driver sets the various caching fields and found a way to cache a portion of the input tensor in the remaining internal SRAM.

That got us the rest of the performance improvement mentioned above, but some combinations of parameters misbehave when input tensor caching is enabled, so I need to get to the bottom of that before I submit it for review.
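
As a rough illustration of the budgeting idea (this is not the driver's actual code; the structure and function names are made up for the example): the kernels claim their share of the SRAM first, as before, and whatever is left over is offered to a slice of the input tensor.

    #include <stddef.h>

    struct sram_budget {
       size_t kernel_bytes;       /* SRAM reserved for the weights */
       size_t input_cache_bytes;  /* leftover SRAM used for the input tensor */
    };

    /* Hypothetical sketch: cache the kernels first, then as much of the
     * input tensor as fits in the remaining on-chip SRAM. */
    static struct sram_budget
    plan_sram(size_t sram_size, size_t kernel_size, size_t input_tensor_size)
    {
       struct sram_budget budget = {0};

       budget.kernel_bytes = kernel_size < sram_size ? kernel_size : sram_size;

       size_t remaining = sram_size - budget.kernel_bytes;
       budget.input_cache_bytes =
          input_tensor_size < remaining ? input_tensor_size : remaining;

       return budget;
    }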

Next steps

At this point I am pretty confident that we can get quite close to the performance of the proprietary driver without much additional work, as a few major performance features remain to be implemented and I still need to take a pass at tuning some of the previous performance work.

But after getting the input tensor caching finished and before I move to any other improvements, I think I will invest some time in adding some profiling facilities so I can better direct the efforts and get the best returns.

Thursday, February 8, 2024

Etnaviv NPU update 16: A nice performance jump

Since the open-source driver for VeriSilicon's Vivante NPU was merged into Mesa two weeks ago, I have been taking some rest and thinking about what will come next.

Automated testing

I have a merge request for Mesa almost ready that will enable continuous integration testing on real hardware, but it depends on solving what seem to be power supply problems with the boards in the HW testing lab. Collabora is graciously looking into it. Thanks!

Performance

I have been talking with quite a few people about the whole effort of bringing open-source to NPU hardware and something that came up more than once is the question of reaching or surpassing the performance level of the proprietary drivers.

It is a fair concern, because the systolic arrays will be underutilized if they are starved of data. And given how fast they are at performing the arithmetic operations, and how slow memory buses and chips on embedded devices are (relative to high-end GPUs, at least), this starvation and the consequent underutilization are very likely to happen.

IP vendors go to great lengths to prevent that from happening, inventing ways of getting the data to the processing elements faster, reducing the memory bandwidth used, and balancing the use of the different cores/arrays. There is plenty of published research in this area, which helps when figuring out how to make the most of a particular piece of hardware.

Weight compression

Something I started working on last week is compression of zero values in the weight buffers. Sparsity is very common in the neural models this hardware is targeted at, and common convolutions such as strided and depthwise ones can easily have zero ratios of 90% or more.

By compressing consecutive zeroes in a buffer we can greatly reduce pressure on the memory bus, keeping the processing units better fed (though I'm sure we are still far from getting good utilization).

By opportunistically using the 5 available bits to encode runs of consecutive zeroes, I was able to improve the performance of the MobileNetV1 model from 15.7 ms to 9.9 ms, and that of the SSDLite MobileDet model from 56.1 ms to 32.7 ms.
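
The actual weight-stream layout is the hardware's own, but the general technique is plain run-length encoding of zeroes. Here is a toy sketch of that idea with a 5-bit run length; it is my own illustration, not the Vivante format, and the marker byte and buffer conventions are assumptions made for the example.

    #include <stddef.h>
    #include <stdint.h>

    #define ZERO_MARKER    0x00
    #define MAX_RUN_LENGTH 31   /* largest run that fits in 5 bits */

    /* Toy zero run-length encoder: non-zero bytes are copied as-is, while
     * each run of up to 31 consecutive zeroes is replaced by a marker byte
     * followed by the run length. Note that a lone zero expands to two
     * bytes, so the output buffer must be sized for up to 2 * size bytes. */
    static size_t
    compress_zero_runs(const uint8_t *weights, size_t size, uint8_t *out)
    {
       size_t written = 0;

       for (size_t i = 0; i < size; ) {
          if (weights[i] == 0) {
             size_t run = 0;
             while (i + run < size && weights[i + run] == 0 &&
                    run < MAX_RUN_LENGTH)
                run++;
             out[written++] = ZERO_MARKER;
             out[written++] = (uint8_t)run;
             i += run;
          } else {
             out[written++] = weights[i++];
          }
       }

       return written;
    }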

[Graph: inference times of MobileNetV1 and SSDLite MobileDet with the etnaviv driver, before and after weight compression, compared with the proprietary driver]

As shown in the graph above, we still have quite some room for improvement before we reach the performance of the proprietary driver, but we are getting close pretty fast. I also believe that we can tailor the driver to users' needs and surpass the performance of the proprietary driver for specific models, as this is open source and everybody can chip in, see how things are made, and improve them.

IRC channel

I mentioned this in passing some time ago, but now that we have a driver at this level of usefulness, I think it is a good moment to remind everyone that we have an IRC channel on the OFTC network to discuss anything about doing accelerated machine learning on the edge with upstream open-source software: #ml-mainline. You can click here to join via a web interface, though I recommend setting up an account at matrix.org.

What next

Should I continue working on performance? Enable more models for new use cases? Enable this driver on more SoCs (i.MX8MP and S905D3 look interesting)? Start writing a driver for a completely different IP, such as Rockchip's or Amlogic's?

I still haven't decided, so if you have an opinion please drop a comment on this blog, or on any of the social networks linked from it.

I'm currently available for contracting, so I should be able to get on your project full-time on short notice.