Friday, June 28, 2024

Etnaviv NPU update 19: Ideas On Board sponsors support for the NXP i.MX 8M Plus SoC

Last week I started work on adding support to the Etnaviv driver for the NPU inside the NXP i.MX 8M Plus SoC (VeriSilicon's VIPNano-SI+).

This work is sponsored by the open source consultancy Ideas On Boards, and will include the same level of support as for the Amlogic A311D SoC, which means full acceleration for the SSDLite MobileDet object detection model.


Right now all kinds of basic convolutions are supported, and work is well on its way for strided convolutions.

For basic convolutions, most of the work was switching to a totally different way of encoding weights. At the low-level, the weights are encoded with Huffman, and zero run length encoding on top. This low level encoding has been already reverse engineered and implemented by Philipp Zabel of Pengutronix, as mentioned in my previous update on the variant of this NPU shipped inside the Amlogic S905D3.

How weights are laid on top of the encoding is also different, so I had to reverse engineer that and implement it in the Mesa driver. That plus some changes on how tiling is computed got basic convolutions working, then I moved to strided convolutions. Pointwise convolutions got supported at the same time as basic convolutions, as they are not any different on this particular hardware.

Strided convolutions are still not natively supported by the hardware, so I reused the code that lowers them to basic convolutions. But the existing jobs that use the tensor manipulation cores to transform the input tensor for strides contained many assumptions that don't hold valid in this hardware.

So I have been reverse engineering these differences and now I have all kinds of strided convolutions supported up to 32 output channels. I feel that these will be done after addressing a couple of details about how the tensor reshuffle jobs are distributed among the available TP cores.

Afterwards I will look at depthwise convolutions, which may be supported natively by the hardware, while on the A311D these were lowered to basic convolutions.

Then on to tensor addition operations, and that should be all that is needed to get SSDLite MobileDet running, hopefully close to the performance of the closed source driver.

I'm very grateful to Ideas On Board for sponsoring this work, for their trust on me to get it done, and for their vision of a fully featured mainline platform that all companies can base their products on without being held captive by any single vendor.

I'm testing all this on a Verdin iMX8M Plus board that was kindly offered by Daniel Lang at Toradex, thanks!


Thursday, June 13, 2024

Rockchip NPU update 4: Kernel driver for the RK3588 NPU submitted to mainline

In the past few weeks I have been working on among other things a kernel driver for the NPU in the Rockchip RK3588 SoC, new from the ground up.

It is now fully working and after a good amount of polishing I sent it yesterday to the kernel mailing lists, for review. Those interested can see the code and follow the review process at this link.

The kernel driver is able to fully use the three cores in the NPU, giving us the possibility of running 4 simultaneous object detection inferences such as the one below on a stream, at almost 30 frames per second.

 

The userspace  driver is in a less polished state, but fully featured at this state. I will be working on this in the next few days so it can be properly submitted for review.

This is the first accelerator-only driver for an edge NPU submitted to the mainline kernel, and hopefully it can serve as a template for the next ones to come, as the differences among NPUs of different vendors are relatively superficial.

Tuesday, May 7, 2024

Etnaviv NPU update 18: Getting the driver to work on the Amlogic S905D3 SoC

With new releases of the Linux kernel and Mesa drivers poised to be packaged by Linux distributions, the TensorFlow Lite driver for the NPU in the Amlogic A311D SoC will be available to users with minimal effort.

With that work bearing its fruits, I have been looking at how this driver could be of use with other hardware.

Philipp Zabel of Pengutronix has been looking at adding support for the NPU in the NXP i.MX 8M Plus SoC, and he has made great progress on reverse engineering the in-memory format of the weights tensor, which is different from that used in the A311D.

I started by probing what would entail supporting the NPU in the S905D3 SoC from Amlogic, and I found it not that different from what is currently supported, besides it also using a new format for the weights tensor.

Weights, the other kind of.
Weights, the other kind of them.
Looked a bit further, and found that this format is very similar to what Philip had been reverse engineering and implementing support for.

After a couple of weeks staring at memory dumps and writing a python tool to decode them, I realized that the run-length and Huffman encodings were the same, with only a few differences such as where and how the bias values were stored.

With a few changes to Philip's work-in-progress branch I got my first tests passing on the Libre Computer Solitude SBC board.

Next I will look at supporting more weights tensor dimensions and fixing bugs in how the weights and other values are encoded.

The command stream programming seems to be very similar to that of the A311D, so I don't expect much work to be needed there.

Once everything is working at the same level as with the A311D, I will move to determine the optimal values for the zero run-length and Huffman symbol maps, for maximum compression and thus performance (as NPUs are so fast at arithmetic that they tend to be memory starved).

Big thanks to Pengutronix for supporting Philip's work, and to Libre Computer for having supported the development of the driver so far.

Friday, April 19, 2024

Rockchip NPU update 3: Real-time object detection on RK3588

Progress

Yesterday I managed to implement in my open-source driver all the remaining operations so the SSDLite MobileDet model can run on Rockchip's NPU in the RK3588 SoC.

Performance is pretty good at 30 frames per second when using just one of the 3 cores that the NPU contains.


 I uploaded the generated video to YouTube at:

You can get the source code at my branch here.

 

Next steps

Now that we got to this level of usefulness, I'm going to switch to writing a kernel driver suited for inclusion into the Linux kernel, to the drivers/accel subsystem.

There is still lots of work to do, but progress is going pretty fast, though as I write more drivers for different NPUs I will have to split my time among them. At least, until we get more contributors! :)

Thursday, March 28, 2024

Rockchip NPU update 2: MobileNetV1 is done

Progress

For  the last couple of weeks I have kept chipping at a new userspace driver for the NPU in the Rockchip RK3588 SoC.

I am very happy to report that the work has gone really smooth and I reached my first milestone: running the MobileNetV1 model with all convolutions accelerated by the NPU.

And it not only runs flawlessly, but at the same performance level as the blob.

It has been great having access to the register list as disclosed by Rockchip in their TRM, and to the NVDLA and ONNC documentation and source code. This has allowed for the work to proceed at a pace several times faster than with my previous driver for the VeriSilicon NPU, for which a lot of painstaking reverse engineering had to be done.

by Julien Langlois CC BY-SA 3.0

 tomeu@arm-64:~/mesa$ TEFLON_DEBUG=verbose python3.10 classification.py -i hens.jpg -m mobilenet_v1_1.0_224_quant.tflite -l labels_mobilenet_quant_v1_224.txt -e libteflon.so
Loading external delegate from libteflon.so with args: {}
Teflon delegate: loaded rknpu driver

teflon: compiling graph: 89 tensors 27 operations
...
teflon: compiled graph, took 413 ms
teflon: invoked graph, took 11 ms
teflon: invoked graph, took 11 ms
teflon: invoked graph, took 11 ms
teflon: invoked graph, took 10 ms
teflon: invoked graph, took 10 ms
0.984314: hen
0.019608: cock
0.000000: toilet tissue
0.000000: sea cucumber
0.000000: wood rabbit
time: 10.776ms

Notice how nothing in the invocation refers to the specific driver that TensorFlow Lite is using, that is completely abstracted by Mesa. Once all these bits are upstream and packaged by distros, one will be able to just download a model in INT8 quantization format and get accelerated inferences going fast irrespective of the hardware.

Thanks to TL Lim of PINE64 for sending me a QuartzPro64 board for me to hack on.

Next steps

I want to go back and get my last work on performance for the VeriSilicon driver upstreamed, so it is packaged in distros sooner rather than later.

After that, I'm a bit torned between working further on the userspace driver and implementing more operations and control flow, or start writing a kernel driver for mainline.

Saturday, March 16, 2024

Rockchip NPU update 1: A walk in the park?

During the past weeks I have paused work on the driver for the Vivante NPU and have started work on a new driver, for Rockchip's own NPU IP, as used in SoCs such as RK3588(S) and RK3568.

The version of the NPU in the RK3588 claims a performance of 6 TOPS across its 3 cores, though from what I have read, people are having trouble making use of more than one core in parallel, with the closed source driver.

A nice walk in the park

Rockchip, as most other vendors of NPU IP, provides a GPLed kernel driver and pushes out their userspace driver in binary form. The kernel driver is pleasantly simple and relatively up-to-date in regards of its use of internal kernel APIs. The userspace stack though is notoriously buggy and difficult to use, with basic features still unimplemented and performance being quite below what the hardware should be able to achieve.

To be clear, this is on top of the usual problems related to closed-source drivers. I get the impression that Rockchip's NPU team is really understaffed.

Other people had already looked at reverse-engineering the HW so they could address the limitations and bugs in the closed source driver, and use it in situations not supported by Rockchip. I used information acquired by Pierre-Hugues Husson and Jasbir Matharu to get started, a big thanks to them!

After the initial environment was setup (had to forward-port their kernel driver to v6.8), I wrote a simple library that can be loaded in the process with LD_PRELOAD and that, by overriding the ioctl and other syscalls, I was able to dump the buffers that the proprietary userspace driver sends to the hardware.

I started looking at a buffer that from the debug logs of the proprietary driver contained register writes, and when looking at the register descriptions in the TRM, I saw that it had to be closely based on NVIDIA's NVDLA open-source NPU IP.

With Rockchip's (terse) description of the registers, NVDLA's documentation and source code for both the hardware and the userspace driver, I have been able to make progress several times faster than I was able to when working on VeriSilicon's driver (for which I had zero documentation).

Right now I am at the stage at which I am able to correctly execute TensorFLow Lite's Conv2D and DepthwiseConv2D operations with different combinations of input dimensions, weight dimensions, strides and padding. Next is to support multiple output channels.

I'm currently using Rockchip's kernel, but as soon as I'm able to run object detection models with decent hardware utilization, I plan to start writing a new kernel driver for mainlining.

Rockchip's kernel driver has gems such as passing addresses in the kernel address space across the UAPI...

Tests run fast and reliably, even with high concurrency:

tomeu@arm-64:~/mesa$ TEFLON_TEST_DELEGATE=~/mesa/build/src/gallium/targets/teflon/libteflon.so TEFLON_TEST_DATA=src/gallium/targets/teflon/tests LD_LIBRARY_PATH=/home/tomeu/tflite-vx-delegate/build/_deps/tensorflow-build/ ~/.cargo/bin/gtest-runner run --gtest /home/tomeu/mesa/build/src/gallium/targets/teflon/test_teflon --output /tmp -j8 --tests-per-group 1 --baseline ~/mesa/src/gallium/drivers/rocket/ci/rocket-rk3588-fails.txt --flakes ~/mesa/src/gallium/drivers/rocket/ci/rocket-rk3588-flakes.txt  --skips ~/mesa/src/gallium/drivers/rocket/ci/rocket-rk3588-skips.txt
Running gtest on 8 threads in 1-test groups
Pass: 0, Duration: 0
Pass: 139, Skip: 14, Duration: 2, Remaining: 2
Pass: 277, Skip: 22, Duration: 4, Remaining: 0
Pass: 316, Skip: 24, Duration: 4, Remaining: 0

You can find the source code in this branch.

Friday, February 23, 2024

Etnaviv NPU update 17: Faster!

In the last update I explained how compression of zero weights gave our driver such a big performance improvement.

Since then, I have explored further what could take us closer to the performance of the proprietary driver and saw the opportunity to gather some of the proverbial low-hanging fruit.

TL;DR

Our driver's performance on SSD MobileDet went from 32.7 ms to 24.8 ms, against the proprietary driver's 19.5 ms.

On MobileNetV1, our driver went from 9.9 ms to 6.6 ms, against the proprietary driver's 5.5 ms. Pretty close!

Enable more convolutions

Our driver was rejecting convolutions with a number of output channels that is not divisible by the number of convolution cores in the NPU because at the start of the development the code that lays the weights out in memory didn't support that. That caused TensorFlow Lite to run the convolutions in CPU, and some of them were big enough to take a few milliseconds, several times more than on the NPU.

When implementing support for bigger kernels I had to add improvements to the tiling of the convolutions and that included adding support for these other convolutions. So by just removing the rejection of these, we got a nice speed up on SSD MobileDet: from 32.7ms to 27ms!

That didn't help on MobileNetV1 because that one has all its convolutions with neat numbers of output channels.

Caching of the input tensor

So far we were only caching the kernels on the on-chip SRAM. I spent some time looking at how the proprietary driver sets the various caching fields and found a way of getting us to cache a portion of the input tensor on the remaining internal SRAM.

That got us the rest of the performance improvement mentioned above, but I am having trouble with some combination of parameters when the input tensor caching is enabled, so I need to get to the bottom of it before I submit it for review.

Next steps

At this point I am pretty confident that we can get quite close to the performance of the proprietary driver without much additional work, as a few major performance features remain to be implemented, and I know that I still need to give a pass at tuning some of the previous performance work.

But after getting the input tensor caching finished and before I move to any other improvements, I think I will invest some time in adding some profiling facilities so I can better direct the efforts and get the best returns.