Thursday, June 13, 2024

Rockchip NPU update 4: Kernel driver for the RK3588 NPU submitted to mainline

In the past few weeks I have been working on among other things a kernel driver for the NPU in the Rockchip RK3588 SoC, new from the ground up.

It is now fully working and after a good amount of polishing I sent it yesterday to the kernel mailing lists, for review. Those interested can see the code and follow the review process at this link.

The kernel driver is able to fully use the three cores in the NPU, giving us the possibility of running 4 simultaneous object detection inferences such as the one below on a stream, at almost 30 frames per second.

 

The userspace  driver is in a less polished state, but fully featured at this state. I will be working on this in the next few days so it can be properly submitted for review.

This is the first accelerator-only driver for an edge NPU submitted to the mainline kernel, and hopefully it can serve as a template for the next ones to come, as the differences among NPUs of different vendors are relatively superficial.

Tuesday, May 7, 2024

Etnaviv NPU update 18: Getting the driver to work on the Amlogic S905D3 SoC

With new releases of the Linux kernel and Mesa drivers poised to be packaged by Linux distributions, the TensorFlow Lite driver for the NPU in the Amlogic A311D SoC will be available to users with minimal effort.

With that work bearing its fruits, I have been looking at how this driver could be of use with other hardware.

Philipp Zabel of Pengutronix has been looking at adding support for the NPU in the NXP i.MX 8M Plus SoC, and he has made great progress on reverse engineering the in-memory format of the weights tensor, which is different from that used in the A311D.

I started by probing what would entail supporting the NPU in the S905D3 SoC from Amlogic, and I found it not that different from what is currently supported, besides it also using a new format for the weights tensor.

Weights, the other kind of.
Weights, the other kind of them.
Looked a bit further, and found that this format is very similar to what Philip had been reverse engineering and implementing support for.

After a couple of weeks staring at memory dumps and writing a python tool to decode them, I realized that the run-length and Huffman encodings were the same, with only a few differences such as where and how the bias values were stored.

With a few changes to Philip's work-in-progress branch I got my first tests passing on the Libre Computer Solitude SBC board.

Next I will look at supporting more weights tensor dimensions and fixing bugs in how the weights and other values are encoded.

The command stream programming seems to be very similar to that of the A311D, so I don't expect much work to be needed there.

Once everything is working at the same level as with the A311D, I will move to determine the optimal values for the zero run-length and Huffman symbol maps, for maximum compression and thus performance (as NPUs are so fast at arithmetic that they tend to be memory starved).

Big thanks to Pengutronix for supporting Philip's work, and to Libre Computer for having supported the development of the driver so far.

Friday, April 19, 2024

Rockchip NPU update 3: Real-time object detection on RK3588

Progress

Yesterday I managed to implement in my open-source driver all the remaining operations so the SSDLite MobileDet model can run on Rockchip's NPU in the RK3588 SoC.

Performance is pretty good at 30 frames per second when using just one of the 3 cores that the NPU contains.


 I uploaded the generated video to YouTube at:

You can get the source code at my branch here.

 

Next steps

Now that we got to this level of usefulness, I'm going to switch to writing a kernel driver suited for inclusion into the Linux kernel, to the drivers/accel subsystem.

There is still lots of work to do, but progress is going pretty fast, though as I write more drivers for different NPUs I will have to split my time among them. At least, until we get more contributors! :)

Thursday, March 28, 2024

Rockchip NPU update 2: MobileNetV1 is done

Progress

For  the last couple of weeks I have kept chipping at a new userspace driver for the NPU in the Rockchip RK3588 SoC.

I am very happy to report that the work has gone really smooth and I reached my first milestone: running the MobileNetV1 model with all convolutions accelerated by the NPU.

And it not only runs flawlessly, but at the same performance level as the blob.

It has been great having access to the register list as disclosed by Rockchip in their TRM, and to the NVDLA and ONNC documentation and source code. This has allowed for the work to proceed at a pace several times faster than with my previous driver for the VeriSilicon NPU, for which a lot of painstaking reverse engineering had to be done.

by Julien Langlois CC BY-SA 3.0

 tomeu@arm-64:~/mesa$ TEFLON_DEBUG=verbose python3.10 classification.py -i hens.jpg -m mobilenet_v1_1.0_224_quant.tflite -l labels_mobilenet_quant_v1_224.txt -e libteflon.so
Loading external delegate from libteflon.so with args: {}
Teflon delegate: loaded rknpu driver

teflon: compiling graph: 89 tensors 27 operations
...
teflon: compiled graph, took 413 ms
teflon: invoked graph, took 11 ms
teflon: invoked graph, took 11 ms
teflon: invoked graph, took 11 ms
teflon: invoked graph, took 10 ms
teflon: invoked graph, took 10 ms
0.984314: hen
0.019608: cock
0.000000: toilet tissue
0.000000: sea cucumber
0.000000: wood rabbit
time: 10.776ms

Notice how nothing in the invocation refers to the specific driver that TensorFlow Lite is using, that is completely abstracted by Mesa. Once all these bits are upstream and packaged by distros, one will be able to just download a model in INT8 quantization format and get accelerated inferences going fast irrespective of the hardware.

Thanks to TL Lim of PINE64 for sending me a QuartzPro64 board for me to hack on.

Next steps

I want to go back and get my last work on performance for the VeriSilicon driver upstreamed, so it is packaged in distros sooner rather than later.

After that, I'm a bit torned between working further on the userspace driver and implementing more operations and control flow, or start writing a kernel driver for mainline.

Saturday, March 16, 2024

Rockchip NPU update 1: A walk in the park?

During the past weeks I have paused work on the driver for the Vivante NPU and have started work on a new driver, for Rockchip's own NPU IP, as used in SoCs such as RK3588(S) and RK3568.

The version of the NPU in the RK3588 claims a performance of 6 TOPS across its 3 cores, though from what I have read, people are having trouble making use of more than one core in parallel, with the closed source driver.

A nice walk in the park

Rockchip, as most other vendors of NPU IP, provides a GPLed kernel driver and pushes out their userspace driver in binary form. The kernel driver is pleasantly simple and relatively up-to-date in regards of its use of internal kernel APIs. The userspace stack though is notoriously buggy and difficult to use, with basic features still unimplemented and performance being quite below what the hardware should be able to achieve.

To be clear, this is on top of the usual problems related to closed-source drivers. I get the impression that Rockchip's NPU team is really understaffed.

Other people had already looked at reverse-engineering the HW so they could address the limitations and bugs in the closed source driver, and use it in situations not supported by Rockchip. I used information acquired by Pierre-Hugues Husson and Jasbir Matharu to get started, a big thanks to them!

After the initial environment was setup (had to forward-port their kernel driver to v6.8), I wrote a simple library that can be loaded in the process with LD_PRELOAD and that, by overriding the ioctl and other syscalls, I was able to dump the buffers that the proprietary userspace driver sends to the hardware.

I started looking at a buffer that from the debug logs of the proprietary driver contained register writes, and when looking at the register descriptions in the TRM, I saw that it had to be closely based on NVIDIA's NVDLA open-source NPU IP.

With Rockchip's (terse) description of the registers, NVDLA's documentation and source code for both the hardware and the userspace driver, I have been able to make progress several times faster than I was able to when working on VeriSilicon's driver (for which I had zero documentation).

Right now I am at the stage at which I am able to correctly execute TensorFLow Lite's Conv2D and DepthwiseConv2D operations with different combinations of input dimensions, weight dimensions, strides and padding. Next is to support multiple output channels.

I'm currently using Rockchip's kernel, but as soon as I'm able to run object detection models with decent hardware utilization, I plan to start writing a new kernel driver for mainlining.

Rockchip's kernel driver has gems such as passing addresses in the kernel address space across the UAPI...

Tests run fast and reliably, even with high concurrency:

tomeu@arm-64:~/mesa$ TEFLON_TEST_DELEGATE=~/mesa/build/src/gallium/targets/teflon/libteflon.so TEFLON_TEST_DATA=src/gallium/targets/teflon/tests LD_LIBRARY_PATH=/home/tomeu/tflite-vx-delegate/build/_deps/tensorflow-build/ ~/.cargo/bin/gtest-runner run --gtest /home/tomeu/mesa/build/src/gallium/targets/teflon/test_teflon --output /tmp -j8 --tests-per-group 1 --baseline ~/mesa/src/gallium/drivers/rocket/ci/rocket-rk3588-fails.txt --flakes ~/mesa/src/gallium/drivers/rocket/ci/rocket-rk3588-flakes.txt  --skips ~/mesa/src/gallium/drivers/rocket/ci/rocket-rk3588-skips.txt
Running gtest on 8 threads in 1-test groups
Pass: 0, Duration: 0
Pass: 139, Skip: 14, Duration: 2, Remaining: 2
Pass: 277, Skip: 22, Duration: 4, Remaining: 0
Pass: 316, Skip: 24, Duration: 4, Remaining: 0

You can find the source code in this branch.

Friday, February 23, 2024

Etnaviv NPU update 17: Faster!

In the last update I explained how compression of zero weights gave our driver such a big performance improvement.

Since then, I have explored further what could take us closer to the performance of the proprietary driver and saw the opportunity to gather some of the proverbial low-hanging fruit.

TL;DR

Our driver's performance on SSD MobileDet went from 32.7 ms to 24.8 ms, against the proprietary driver's 19.5 ms.

On MobileNetV1, our driver went from 9.9 ms to 6.6 ms, against the proprietary driver's 5.5 ms. Pretty close!

Enable more convolutions

Our driver was rejecting convolutions with a number of output channels that is not divisible by the number of convolution cores in the NPU because at the start of the development the code that lays the weights out in memory didn't support that. That caused TensorFlow Lite to run the convolutions in CPU, and some of them were big enough to take a few milliseconds, several times more than on the NPU.

When implementing support for bigger kernels I had to add improvements to the tiling of the convolutions and that included adding support for these other convolutions. So by just removing the rejection of these, we got a nice speed up on SSD MobileDet: from 32.7ms to 27ms!

That didn't help on MobileNetV1 because that one has all its convolutions with neat numbers of output channels.

Caching of the input tensor

So far we were only caching the kernels on the on-chip SRAM. I spent some time looking at how the proprietary driver sets the various caching fields and found a way of getting us to cache a portion of the input tensor on the remaining internal SRAM.

That got us the rest of the performance improvement mentioned above, but I am having trouble with some combination of parameters when the input tensor caching is enabled, so I need to get to the bottom of it before I submit it for review.

Next steps

At this point I am pretty confident that we can get quite close to the performance of the proprietary driver without much additional work, as a few major performance features remain to be implemented, and I know that I still need to give a pass at tuning some of the previous performance work.

But after getting the input tensor caching finished and before I move to any other improvements, I think I will invest some time in adding some profiling facilities so I can better direct the efforts and get the best returns.

Thursday, February 8, 2024

Etnaviv NPU update 16: A nice performance jump

After the open-source driver for VeriSilicon's Vivante NPU was merged into Mesa two weeks ago, I have been taking some rest and thinking about what will come next.

Automated testing

I have a merge request to Mesa almost ready that will enable continuous integration testing on real hardware, but it depends on solving what seem to be problems with the power supplies of the boards in the HW testing lab. Collabora is graciously looking at it. Thanks!

Performance

I have been talking with quite a few people about the whole effort of bringing open-source to NPU hardware and something that came up more than once is the question of reaching or surpassing the performance level of the proprietary drivers.

It is a fair concern, because the systolic arrays will be underutilized if they starve of data. And given how fast they are in performing the arithmetic operations, and how slow memory buses and chips on embedded are (related to high-end GPUs, at least), this starving and the consequent underutilization are very likely to happen.

IP vendors go to great lengths to prevent that from happening, inventing ways of getting the data faster to the processing elements, reducing the memory bandwidth used, and balancing the use of the different cores/arrays. There is plenty of published research on this area, which helps when figuring out how to make the most of a particular piece of hardware.

Weight compression

Something I started working on last week is compression of zero values in the weight buffers. Sparsity is very common in the neural models that this hardware is targeted to run, and common convolutions such as strided and depthwise can easily have zero ratios of 90% and more.

By compressing consecutive zeroes in a buffer we can greatly reduce pressure on the memory bus, keeping the processing units better fed (though I'm sure we are still far from getting good utilization).

By opportunistically using the 5 available bits to compress consecutive runs of zeroes, I was able to improve the performance of the MobileNetV1 model from 15.7 ms to 9.9 ms, and that of the SSDLite MobileDet model from 56.1 ms to 32.7 ms.



As shown in the graph above, we still have quite some room for improvement before we reach the performance of the proprietary driver, but we are getting close pretty fast. I also believe that we can tailor the driver to user's needs to surpass the performance of the proprietary driver for specific models, as this is open-source and everybody can chip in, see how things are made and improve them.

IRC channel

I mentioned this in passing some time ago, but now that we have a driver at this level of usefulness, I think it is a good moment to remind that we have an IRC channel in the OFTC network to discuss anything about doing accelerated machine learning on the edge with upstream open-source software: #ml-mainline. You can click here to join via a web interface, though I recommend setting up an account at matrix.org.

What next

Should I continue working on performance? Enable more models for new use cases? Enable this driver on more SoCs (i.MX8MP and S905D3 look interesting)? Start writing a driver for a completely different IP, such as Rockchip's or Amlogic's?

I still haven't decided, so if you have an opinion please drop a comment in this blog, or at any of the social networks linked from this blog.

I'm currently available for contracting, so I should be able to get on your project full-time on short notice.

Wednesday, January 24, 2024

Etnaviv NPU update 15: We are upstream!

Today the initial merge request for Teflon was merged into Mesa, along with the first hardware driver, for VeriSilicon's Vivante NPU.

For those who don't know, Teflon is a TensorFlow Lite delegate that aims to support several AI accelerators (also called NPUs, TPUs, APUs, NNAs, etc). Teflon is and will always be open-source, and is released under the MIT license.


This will have the following advantages for the project:

  1. The userspace driver will be automatically packaged by distros such as Debian, Ubuntu, Fedora and Yocto, when they update to the next stable version: 24.1.0, which should be out around May 2024. See the release calendar.
  2. Contribution to the project will happen within the development process of Mesa. This is a well-established process in which employees from companies such as Google, Valve, Imagination, Intel, Microsoft and AMD work together on their GPU drivers.
  3. The project has great technical infrastructure, maintained by awesome sysadmins:
  4. More importantly, the Mesa codebase has also infrastructure that will be very useful to NPU drivers:
    • The NIR intermediate representation with loads of lowering passes. This will be immediately useful for lowering operations in models to programmable cores, but in the future I want to explore representing whole models with this, for easier manipulation and lowerings.
    • The Gallium internal API that decouples HW-specific frontends from HW-specific drivers. This will be critical as we add support for more NPUs, and also when we expose to other frameworks such as Android NNAPI.
  5. And lastly, Mesa is part of a great yearly conference that allows contributors to discuss their work with others in a high-bandwidth environment: XDC.

The story so far

In 2022, while still at Collabora, I started adding OpenCL support to the Etnaviv driver in Mesa. Etnaviv is a userspace and kernel driver for VeriSilicon's Vivante NPUs.

The goal was to accelerate machine learning workloads, but once I left Collabora to focus on the project and had implemented enough of the OpenCL specification to run a popular object classification model, I realized that there was no way I was going to ever get close to the performance of the proprietary driver by using the programmable part fo the NPU.

I dug a bit deeper in how the proprietary driver was doing its thing and realized that almost all operations weren't running as shaders, but on "fixed-function" hardware units (systolic arrays, as I realized later).

Fortunately, all these accelerators that support matrix multiplications as individual instructions are very similar in their fundamentals, and the state of the art has been well documented in scientific publications since Google released their first TPU.

With all this wealth of information and with the help of VeriSilicon's own debugging output and open-source kernel driver, I had a very good start at reverse engineering the hardware. The rest was done by observing how the proprietary userspace driver interacted with the kernel, with the help of existing tools from the Etnaviv projects and others that I wrote, and by staring for long hours to all the produced data in spreadsheets.

During the summer and with Libre Computer's sponsorship, I chipped away at documenting the interface to the convolution units and implementing support for them in my Mesa branch.

By autumn I was able to run that same object classification model (MobileNet V1) 3 times faster than the CPU was able to. A month later I learned to use the other systolic array in the NPU, for tensor manipulation operations, and got it running 6 times faster than the CPU and only twice as slow as the proprietary driver.

Afterwards I got to work on object detection models, and by the start of 2024 I managed to run SSDLite MobileDet at 56 milliseconds per inference, which is around 3 times slower than what the proprietary achieves, but still pretty darn useful in many situations!

The rest of the time until now has been spent polishing the driver, improving its test suite and reacting to code reviews from the Mesa community.

Next steps

Now that the codebase is part of upstream Mesa, my work will progress in smaller batches, and I expect myself to be spending time reviewing other people's contributions and steering the project. People want to get this running on other variants of the VeriSilicon NPU IP and I am certainly not going to be able to do it all!

I also know of people wanting to put this together with other components in demos and solutions, so I will be supporting them so we can showcase the usefulness of all this.

There are some other use cases that this hardware is well-suited for, such as more advanced image classification, pose estimation, audio classification, depth estimation, and image segmentation. I will be looking at what the most useful models require in terms of operations and implementing them.

There is quite some low hanging fruit for improving performance, so I expect myself to be implementing support for zero-compression, more advanced tiling, better use of the SRAM in the device, and a few others.

And at some point I should start looking at other NPU IP to add support to. The ones I'm currently leading the most towards are RockChip's own IP, Mediatek's, Cadence's and Amlogic's.

Thanks

One doesn't just start writing an NPU driver by itself, and even more without any documentation, so I need to thank the following people who have helped me greatly in this effort:

Collabora for allowing me to start playing with this while I still worked with them.

Libre Computer and specifically Da Xue for supporting me financially for most of 2023. They are a very small company, so I really appreciate that they believed in the project and put aside some money so I could focus on it.

Igalia for letting Christian Gmeiner spend time reviewing all my code and answering my questions about Etnaviv.

Embedded Recipes for giving me the opportunity to present my work last autumn in Paris.

Lucas Stach from Pengutronix for answering my questions and listening to my problems when I suspected of something in the Etnaviv kernel driver.

Neil Armstrong from Linaro for supporting me in the hardware enablement of the NPU driver on the Amlogic SoCs.

And a collective thanks to the DRI/Mesa community for being so awesome!

Wednesday, January 10, 2024

Etnaviv NPU update 14: Object detection with decent performance

When almost two months ago I got MobileNetV1 running with useful performance on my driver for the Vivante NPU, I took that milestone as a partial validation of my approach.

Partial because MobileNetV1 is a quite old model by now and since then several iterations have passed with better accuracy and better performance. Would I be able to, without any documentation, add enough support to run newer models with useful performance?

Since then, I have been spending some time looking at the state of the art for object detection models. Getting a sense of the gap between the features supported by my driver and the operations that the newer models use.

SSDLite MobileDet is already 3 years old but can still be considered state-of-the-art on most hardware, with good accuracy while having a low latency.

The graph structure was more complex than that of MobileNet, and it used tensor addition operations which I didn't support at the moment. There are other operations that I didn't support, but those were at the end and could be performed in the CPU without much penalty.

So after implementing additions along with a few medium-sized refactorings, I got the model running correctly:

Performance wasn't that bad at that moment, at 129ms it was twice as fast as the CPU and "only" 5 times slower than the proprietary driver.

I knew that I was using extremely conservative values for the size of the output tiles, so I wrote some scripts to run hundreds of different convolution configurations and tabulate the parameters that the proprietary driver used to program the hardware.

After a lot of time spent staring at a spreadsheet I came up with a reasonable guess at what are the conditions that limit the size of the tiles. By using the biggest tile size that is still safe, I got much better performance: 56.149ms, so almost 18 inferences can be performed per second.

If we look at a practical use case such that supported by Frigate NVR, a typical frame rate for the video inputs is 5 FPS. With our current performance level, we could run 3-4 inferences on each frame if there may be several objects being tracked at the same time, or 3-4 cameras simultaneously if not.

Given the price level of the single board computers that contain the VIPNano, this is quite a good bang for your bucks. And all open source and heading to mainline!

Next steps

I have started cleaning up the latest changes so they can be reviewed upstream. And need to make sure that the in-flight patches to the kernel are merged now that the window for 6.8 has opened.