Saturday, November 16, 2024

Etnaviv NPU update 21: Support for the NPU in the NXP i.MX 8M Plus SoC is upstream!

Several months have passed since the last update. This has been in part due to the summer holidays and a gig doing some non-upstream work, but I have also had the opportunity to continue my work on the driver for the VeriSilicon NPU in the NXP i.MX 8M Plus SoC, thanks to my friends at Ideas on Board.

CC BY-NC 4.0 Henrik Boye
I'm very happy with what has been accomplished so far, with the first concrete result being that support for NXP's SoC has been merged into Mesa. Thanks to Philipp Zabel and Christian Gmeiner for helping with their ideas and code reviews.

With this, as of yesterday, one can accelerate models such as SSDLite MobileDet on that SoC with only open source software, with the support being provided directly from projects that are already ubiquitous in today's products, such as the Linux kernel and Mesa3D. We can expect this functionality to reach distributions such as Debian in due time, for seamless installation and integration in products.

With this milestone reached, I will be working on expanding support for more models, with a first goal of enabling YOLO-like models, starting with YOLOX. I will be working as well on performance, as currently we are not fully using the capabilities of this hardware.

Wednesday, July 31, 2024

Etnaviv NPU update 20: Fast object detection on the NXP i.MX 8M Plus SoC

I'm happy to announce that my first project regarding support for the NPU in NXP's i.MX 8M Plus SoC has reached the feature complete stage.

CC BY-NC 4.0 Henrik Boye

For the last several weeks I have been working full-time on adding support for the NPU to the existing Etnaviv driver. Most of the existing code that supports the NPU in the Amlogic A311D was reused, but NXP used a much more recent version of the NPU IP so some advancements required new code, and this in turn required reverse engineering.

This work has been kindly sponsored by the Open Source consultancy Ideas On Board, for which I am very grateful. I hope this will be useful to those companies that need full mainline support in their products, even if it is just the start.

This company is unique in working on both NPU and camera drivers in Linux mainline, so they have the best experience for products that require long term support and vision processing.

Since the last update I have fixed the last bugs in the compression of the weights tensor and implemented support for a new hardware-assisted way of executing depthwise convolutions. Some improvements to how the tensor addition operation is lowered to convolutions were needed as well.
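As a rough illustration of what lowering an addition to a convolution can look like, here is a minimal sketch in plain Python. It assumes the simplest possible formulation: the two inputs are concatenated along the channel axis and a 1x1 convolution with ones in the right places sums the matching channels. The lowering actually used by the driver, which also has to handle quantization scales, is not described in this post and may well differ. All function names are made up for the example.

```python
def conv1x1(pixels, weights):
    """Apply a 1x1 convolution.

    pixels:  list of pixels, each a list of input-channel values
    weights: weights[out_channel][in_channel]
    """
    return [
        [sum(w * x for w, x in zip(row, pixel)) for row in weights]
        for pixel in pixels
    ]


def add_via_conv(a, b):
    """Compute a + b elementwise using only concatenation and a 1x1 convolution."""
    channels = len(a[0])
    # Concatenate both inputs along the channel axis: [a0..an, b0..bn].
    stacked = [pa + pb for pa, pb in zip(a, b)]
    # Output channel c reads channel c of `a` and channel c of `b`.
    weights = [
        [1 if i in (c, c + channels) else 0 for i in range(2 * channels)]
        for c in range(channels)
    ]
    return conv1x1(stacked, weights)


a = [[1, 2], [3, 4]]        # two pixels, two channels each
b = [[10, 20], [30, 40]]
result = add_via_conv(a, b)
```

The point of such a lowering is that the addition can then run on the same convolution hardware as the rest of the model.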

Performance is pretty good already, allowing for detecting objects in video streams at 30 frames per second, so at a similar performance level as the NPU in the Amlogic A311D. Some performance features are left to be implemented, so I think there is still substantial room for improvement.

Currently the code is very much in a proof-of-concept state. The next step is cleaning it all up and submitting it for review to Mesa3D. In the meantime, you can find the draft code at https://gitlab.freedesktop.org/tomeu/mesa/-/tree/etnaviv-imx8mp.

A big thanks to Philipp Zabel who reverse engineered the bitstream format of the weight encoding and added some patches to the kernel that were required for the NPU to work reliably.

Friday, June 28, 2024

Etnaviv NPU update 19: Ideas On Board sponsors support for the NXP i.MX 8M Plus SoC

Last week I started work on adding support to the Etnaviv driver for the NPU inside the NXP i.MX 8M Plus SoC (VeriSilicon's VIPNano-SI+).

This work is sponsored by the open source consultancy Ideas On Board, and will include the same level of support as for the Amlogic A311D SoC, which means full acceleration for the SSDLite MobileDet object detection model.


Right now all kinds of basic convolutions are supported, and work is well on its way for strided convolutions.

For basic convolutions, most of the work was switching to a totally different way of encoding weights. At the low level, the weights are encoded with Huffman coding, with zero run-length encoding on top. This low-level encoding had already been reverse engineered and implemented by Philipp Zabel of Pengutronix, as mentioned in my previous update on the variant of this NPU shipped inside the Amlogic S905D3.
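To give an idea of what zero run-length encoding does to a weights stream, here is a minimal, generic sketch in Python. It is not the bitstream layout used by the hardware, which is not described here; the (zero_run, value) pair representation and the `max_run` limit are assumptions made for the example. In a scheme like the one described above, a Huffman stage would then be combined with this to shrink the data further.

```python
def zrl_encode(values, max_run=7):
    """Encode a list of ints as (zero_run, value) pairs.

    zero_run is the number of zeros that precede value, capped at max_run.
    """
    pairs = []
    run = 0
    for v in values:
        if v == 0 and run < max_run:
            run += 1
        else:
            pairs.append((run, v))
            run = 0
    if run:
        # Trailing zeros: store the last zero explicitly as the value.
        pairs.append((run - 1, 0))
    return pairs


def zrl_decode(pairs):
    """Invert zrl_encode."""
    out = []
    for run, v in pairs:
        out.extend([0] * run)
        out.append(v)
    return out


weights = [0, 0, 5, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0]
encoded = zrl_encode(weights)
decoded = zrl_decode(encoded)
```

Thirteen weights turn into four pairs here, which is why sparse weight tensors compress so well with this kind of scheme.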

How weights are laid on top of the encoding is also different, so I had to reverse engineer that and implement it in the Mesa driver. That plus some changes on how tiling is computed got basic convolutions working, then I moved to strided convolutions. Pointwise convolutions got supported at the same time as basic convolutions, as they are not any different on this particular hardware.

Strided convolutions are still not natively supported by the hardware, so I reused the code that lowers them to basic convolutions. But the existing jobs that use the tensor manipulation cores to transform the input tensor for strides contained many assumptions that do not hold on this hardware.
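The principle behind lowering a strided convolution to a basic one can be shown in one dimension. The sketch below assumes a space-to-depth style reshuffle, where every group of `stride` neighbouring samples becomes one position with `stride` channels, and the kernel is rearranged the same way. The real jobs work on 2-D tensors and run on the TP cores; this is only an illustration of the idea, with made-up function names.

```python
def conv1d(x, w, stride=1):
    """Valid 1-D convolution (cross-correlation) of a single-channel signal."""
    k = len(w)
    return [
        sum(w[j] * x[i + j] for j in range(k))
        for i in range(0, len(x) - k + 1, stride)
    ]


def conv1d_multichannel(x, w):
    """Valid stride-1 convolution with x[position][channel], w[tap][channel]."""
    taps = len(w)
    return [
        sum(w[m][c] * x[i + m][c] for m in range(taps) for c in range(len(w[m])))
        for i in range(len(x) - taps + 1)
    ]


def lower_strided(x, w, stride):
    """Rewrite a strided convolution as a reshuffle plus a stride-1 convolution."""
    # Pad the kernel with zeros so its length is a multiple of the stride.
    taps = -(-len(w) // stride)
    w_padded = w + [0] * (taps * stride - len(w))
    # Pad the input so it splits into whole groups of `stride` samples.
    x_padded = x + [0] * (-len(x) % stride)
    # Reshuffle input and kernel: groups of `stride` neighbours become channels.
    x_shuffled = [x_padded[i:i + stride] for i in range(0, len(x_padded), stride)]
    w_shuffled = [w_padded[i:i + stride] for i in range(0, len(w_padded), stride)]
    result = conv1d_multichannel(x_shuffled, w_shuffled)
    # Drop outputs that only exist because of the input padding.
    n_out = (len(x) - len(w)) // stride + 1
    return result[:n_out]


x = list(range(1, 10))
w = [1, 2, 3]
direct = conv1d(x, w, stride=2)
lowered = lower_strided(x, w, stride=2)
```

Both paths produce the same output, but the second one only ever asks the convolution engine for stride 1; the price is the extra reshuffle pass over the input.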

So I have been reverse engineering these differences and now I have all kinds of strided convolutions supported up to 32 output channels. I feel that these will be done after addressing a couple of details about how the tensor reshuffle jobs are distributed among the available TP cores.

Afterwards I will look at depthwise convolutions, which may be supported natively by the hardware, while on the A311D these were lowered to basic convolutions.

Then on to tensor addition operations, and that should be all that is needed to get SSDLite MobileDet running, hopefully close to the performance of the closed source driver.

I'm very grateful to Ideas On Board for sponsoring this work, for their trust in me to get it done, and for their vision of a fully featured mainline platform that all companies can base their products on without being held captive by any single vendor.

I'm testing all this on a Verdin iMX8M Plus board that was kindly offered by Daniel Lang at Toradex, thanks!


Thursday, June 13, 2024

Rockchip NPU update 4: Kernel driver for the RK3588 NPU submitted to mainline

In the past few weeks I have been working on, among other things, a kernel driver for the NPU in the Rockchip RK3588 SoC, written from the ground up.

It is now fully working and after a good amount of polishing I sent it yesterday to the kernel mailing lists, for review. Those interested can see the code and follow the review process at this link.

The kernel driver is able to fully use the three cores in the NPU, giving us the possibility of running 4 simultaneous object detection inferences such as the one below on a stream, at almost 30 frames per second.

 

The userspace driver is in a less polished state, but fully featured at this point. I will be working on it in the next few days so it can be properly submitted for review.

This is the first accelerator-only driver for an edge NPU submitted to the mainline kernel, and hopefully it can serve as a template for the next ones to come, as the differences among NPUs of different vendors are relatively superficial.

Tuesday, May 7, 2024

Etnaviv NPU update 18: Getting the driver to work on the Amlogic S905D3 SoC

With new releases of the Linux kernel and Mesa drivers poised to be packaged by Linux distributions, the TensorFlow Lite driver for the NPU in the Amlogic A311D SoC will be available to users with minimal effort.

With that work bearing fruit, I have been looking at how this driver could be of use with other hardware.

Philipp Zabel of Pengutronix has been looking at adding support for the NPU in the NXP i.MX 8M Plus SoC, and he has made great progress on reverse engineering the in-memory format of the weights tensor, which is different from that used in the A311D.

I started by probing what supporting the NPU in the S905D3 SoC from Amlogic would entail, and I found it not that different from what is currently supported, except that it also uses a new format for the weights tensor.

Weights, the other kind of them.
I looked a bit further and found that this format is very similar to what Philipp had been reverse engineering and implementing support for.

After a couple of weeks staring at memory dumps and writing a Python tool to decode them, I realized that the run-length and Huffman encodings were the same, with only a few differences such as where and how the bias values were stored.

With a few changes to Philipp's work-in-progress branch I got my first tests passing on the Libre Computer Solitude SBC.

Next I will look at supporting more weights tensor dimensions and fixing bugs in how the weights and other values are encoded.

The command stream programming seems to be very similar to that of the A311D, so I don't expect much work to be needed there.

Once everything is working at the same level as with the A311D, I will move on to determining the optimal values for the zero run-length and Huffman symbol maps, for maximum compression and thus performance (as NPUs are so fast at arithmetic that they tend to be memory starved).
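Picking a good zero run-length setting is essentially a small search over candidate values, scored by the size of the encoded result. The sketch below shows that idea with a toy cost model in Python: each symbol stores a zero-run count in `run_bits` bits followed by one 8-bit value. It ignores the Huffman stage and does not reflect the real hardware format; the function names and the cost model are assumptions made for the example.

```python
def encoded_bits(values, run_bits, value_bits=8):
    """Size in bits of `values` under a toy zero run-length scheme.

    Each emitted symbol stores a zero-run count in `run_bits` bits,
    followed by one literal value in `value_bits` bits.
    """
    max_run = (1 << run_bits) - 1
    symbols = 0
    run = 0
    for v in values:
        if v == 0 and run < max_run:
            run += 1
        else:
            symbols += 1
            run = 0
    if run:
        symbols += 1  # flush trailing zeros
    return symbols * (run_bits + value_bits)


def best_run_bits(values, candidates=range(0, 9)):
    """Return the run-length field width that gives the smallest encoding."""
    return min(candidates, key=lambda bits: encoded_bits(values, bits))


sparse = [0] * 30 + [5] + [0] * 30 + [7]   # mostly zeros
dense = [1, 2, 3, 4] * 8                   # no zeros at all

sparse_choice = best_run_bits(sparse)
dense_choice = best_run_bits(dense)
```

With this model the sparse stream is smallest with a 5-bit run field, while the dense stream is best off with no run field at all. That is why the best setting depends on the actual weights of each layer.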

Big thanks to Pengutronix for supporting Philipp's work, and to Libre Computer for having supported the development of the driver so far.

Friday, April 19, 2024

Rockchip NPU update 3: Real-time object detection on RK3588

Progress

Yesterday I managed to implement in my open-source driver all the remaining operations so the SSDLite MobileDet model can run on Rockchip's NPU in the RK3588 SoC.

Performance is pretty good at 30 frames per second when using just one of the 3 cores that the NPU contains.


 I uploaded the generated video to YouTube at:

You can get the source code at my branch here.

 

Next steps

Now that we got to this level of usefulness, I'm going to switch to writing a kernel driver suited for inclusion into the Linux kernel, to the drivers/accel subsystem.

There is still lots of work to do, but progress is going pretty fast, though as I write more drivers for different NPUs I will have to split my time among them. At least, until we get more contributors! :)

Thursday, March 28, 2024

Rockchip NPU update 2: MobileNetV1 is done

Progress

For the last couple of weeks I have kept chipping away at a new userspace driver for the NPU in the Rockchip RK3588 SoC.

I am very happy to report that the work has gone really smoothly and I reached my first milestone: running the MobileNetV1 model with all convolutions accelerated by the NPU.

And it not only runs flawlessly, but at the same performance level as the blob.

It has been great having access to the register list as disclosed by Rockchip in their TRM, and to the NVDLA and ONNC documentation and source code. This has allowed for the work to proceed at a pace several times faster than with my previous driver for the VeriSilicon NPU, for which a lot of painstaking reverse engineering had to be done.

by Julien Langlois CC BY-SA 3.0

 tomeu@arm-64:~/mesa$ TEFLON_DEBUG=verbose python3.10 classification.py -i hens.jpg -m mobilenet_v1_1.0_224_quant.tflite -l labels_mobilenet_quant_v1_224.txt -e libteflon.so
Loading external delegate from libteflon.so with args: {}
Teflon delegate: loaded rknpu driver

teflon: compiling graph: 89 tensors 27 operations
...
teflon: compiled graph, took 413 ms
teflon: invoked graph, took 11 ms
teflon: invoked graph, took 11 ms
teflon: invoked graph, took 11 ms
teflon: invoked graph, took 10 ms
teflon: invoked graph, took 10 ms
0.984314: hen
0.019608: cock
0.000000: toilet tissue
0.000000: sea cucumber
0.000000: wood rabbit
time: 10.776ms

Notice how nothing in the invocation refers to the specific driver that TensorFlow Lite is using; that is completely abstracted by Mesa. Once all these bits are upstream and packaged by distros, one will be able to just download a model in INT8 quantization format and get accelerated inferences going fast irrespective of the hardware.
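The scores in the output above are what one would expect from a quantized model whose uint8 outputs are divided by 255 before printing. The sketch below shows that post-processing step in Python. It assumes classification.py behaves like TensorFlow Lite's label_image.py example in this respect, and the raw values are hypothetical ones chosen to reproduce the first two printed lines.

```python
def top_k(raw_scores, labels, k=5):
    """Return the k best (score, label) pairs from a uint8 output tensor."""
    ranked = sorted(zip(raw_scores, labels), key=lambda pair: pair[0], reverse=True)
    # Map the 0..255 range of the quantized output to 0.0..1.0.
    return [(raw / 255.0, label) for raw, label in ranked[:k]]


# Hypothetical raw uint8 outputs for four of the classes.
raw = [0, 251, 5, 0]
labels = ["toilet tissue", "hen", "cock", "sea cucumber"]

lines = ["{:08.6f}: {}".format(score, label) for score, label in top_k(raw, labels, k=2)]
```

Because this step only looks at the output tensor, it is the same no matter which driver ran the inference.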

Thanks to TL Lim of PINE64 for sending me a QuartzPro64 board for me to hack on.

Next steps

I want to go back and get my last work on performance for the VeriSilicon driver upstreamed, so it is packaged in distros sooner rather than later.

After that, I'm a bit torn between working further on the userspace driver, implementing more operations and control flow, or starting to write a kernel driver for mainline.