Saturday, March 16, 2024

Rockchip NPU update 1: A walk in the park?

Over the past few weeks I have paused work on the driver for the Vivante NPU and have started work on a new driver for Rockchip's own NPU IP, as used in SoCs such as RK3588(S) and RK3568.

The version of the NPU in the RK3588 claims a performance of 6 TOPS across its 3 cores, though from what I have read, people have trouble making use of more than one core in parallel with the closed-source driver.

A nice walk in the park

Rockchip, like most other vendors of NPU IP, provides a GPLed kernel driver and ships its userspace driver in binary form. The kernel driver is pleasantly simple and relatively up to date in its use of internal kernel APIs. The userspace stack, though, is notoriously buggy and difficult to use, with basic features still unimplemented and performance well below what the hardware should be able to achieve.

To be clear, this is on top of the usual problems related to closed-source drivers. I get the impression that Rockchip's NPU team is really understaffed.

Other people had already looked at reverse-engineering the HW so they could address the limitations and bugs in the closed source driver, and use it in situations not supported by Rockchip. I used information acquired by Pierre-Hugues Husson and Jasbir Matharu to get started, a big thanks to them!

After setting up the initial environment (I had to forward-port their kernel driver to v6.8), I wrote a simple library that can be loaded into the process with LD_PRELOAD and that, by overriding ioctl and other syscalls, dumps the buffers that the proprietary userspace driver sends to the hardware.
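To illustrate the approach, here is a minimal sketch of such an LD_PRELOAD interposer in C. The logging is simplified and the Rockchip-specific decoding is left out; the real library also parses the driver's ioctls and writes the referenced buffers to disk.

#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdarg.h>
#include <stdio.h>

static int (*real_ioctl)(int fd, unsigned long request, ...);

int ioctl(int fd, unsigned long request, ...)
{
    va_list args;
    void *arg;

    if (!real_ioctl)
        real_ioctl = (int (*)(int, unsigned long, ...))dlsym(RTLD_NEXT, "ioctl");

    va_start(args, request);
    arg = va_arg(args, void *);
    va_end(args);

    /* The real library inspects the request number here and, for job
     * submissions, dumps the referenced buffers before handing them over. */
    fprintf(stderr, "ioctl(fd=%d, request=0x%lx, arg=%p)\n", fd, request, arg);

    return real_ioctl(fd, request, arg);
}

Built with something like gcc -shared -fPIC -o libdump.so dump.c -ldl, it can then be preloaded into the proprietary stack with LD_PRELOAD=./libdump.so.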

I started looking at a buffer that, according to the debug logs of the proprietary driver, contained register writes, and when comparing it with the register descriptions in the TRM, I saw that this IP had to be closely based on NVIDIA's NVDLA open-source NPU IP.

With Rockchip's (terse) description of the registers, NVDLA's documentation and source code for both the hardware and the userspace driver, I have been able to make progress several times faster than I was able to when working on VeriSilicon's driver (for which I had zero documentation).

Right now I am at the stage at which I am able to correctly execute TensorFlow Lite's Conv2D and DepthwiseConv2D operations with different combinations of input dimensions, weight dimensions, strides and padding. Next is to support multiple output channels.

I'm currently using Rockchip's kernel, but as soon as I'm able to run object detection models with decent hardware utilization, I plan to start writing a new kernel driver for mainlining.

Rockchip's kernel driver has gems such as passing addresses in the kernel address space across the UAPI...

Tests run fast and reliably, even with high concurrency:

tomeu@arm-64:~/mesa$ TEFLON_TEST_DELEGATE=~/mesa/build/src/gallium/targets/teflon/libteflon.so TEFLON_TEST_DATA=src/gallium/targets/teflon/tests LD_LIBRARY_PATH=/home/tomeu/tflite-vx-delegate/build/_deps/tensorflow-build/ ~/.cargo/bin/gtest-runner run --gtest /home/tomeu/mesa/build/src/gallium/targets/teflon/test_teflon --output /tmp -j8 --tests-per-group 1 --baseline ~/mesa/src/gallium/drivers/rocket/ci/rocket-rk3588-fails.txt --flakes ~/mesa/src/gallium/drivers/rocket/ci/rocket-rk3588-flakes.txt  --skips ~/mesa/src/gallium/drivers/rocket/ci/rocket-rk3588-skips.txt
Running gtest on 8 threads in 1-test groups
Pass: 0, Duration: 0
Pass: 139, Skip: 14, Duration: 2, Remaining: 2
Pass: 277, Skip: 22, Duration: 4, Remaining: 0
Pass: 316, Skip: 24, Duration: 4, Remaining: 0

You can find the source code in this branch.

Friday, February 23, 2024

Etnaviv NPU update 17: Faster!

In the last update I explained how compression of zero weights gave our driver such a big performance improvement.

Since then, I have explored further what could take us closer to the performance of the proprietary driver and saw the opportunity to gather some of the proverbial low-hanging fruit.

TL;DR

Our driver's performance on SSD MobileDet went from 32.7 ms to 24.8 ms, against the proprietary driver's 19.5 ms.

On MobileNetV1, our driver went from 9.9 ms to 6.6 ms, against the proprietary driver's 5.5 ms. Pretty close!

Enable more convolutions

Our driver was rejecting convolutions whose number of output channels is not divisible by the number of convolution cores in the NPU, because at the start of development the code that lays the weights out in memory didn't support that. That caused TensorFlow Lite to run those convolutions on the CPU, and some of them were big enough to take a few milliseconds, several times more than on the NPU.

When implementing support for bigger kernels I had to improve the tiling of the convolutions, and that included adding support for these other convolutions. So by just removing the rejection, we got a nice speed-up on SSD MobileDet: from 32.7 ms to 27 ms!

That didn't help on MobileNetV1, because all of that model's convolutions already have output channel counts that divide evenly across the cores.
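As a rough illustration of the kind of split involved, here is a sketch of distributing output channels across the convolution cores when the count isn't a multiple of the core count. The names, the core count and the exact layout are made up for this sketch; the driver's actual weight layout code is more involved.

#define NR_CORES 4   /* assumption for this sketch; varies per NPU */

static unsigned
channels_for_core(unsigned output_channels, unsigned core)
{
    unsigned base = output_channels / NR_CORES;
    unsigned remainder = output_channels % NR_CORES;

    /* The first "remainder" cores process one extra channel each, so e.g.
     * 10 channels on 4 cores become 3, 3, 2, 2 instead of the convolution
     * being rejected and falling back to the CPU. */
    return base + (core < remainder ? 1 : 0);
}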

Caching of the input tensor

So far we were only caching the kernels in the on-chip SRAM. I spent some time looking at how the proprietary driver sets the various caching fields and found a way to cache a portion of the input tensor in the remaining internal SRAM.

That got us the rest of the performance improvement mentioned above, but some combinations of parameters misbehave when input tensor caching is enabled, so I need to get to the bottom of that before I submit it for review.
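The idea can be sketched roughly like this: whatever SRAM is left after placing the kernels is used to cache the leading part of the input tensor. All names and the row-granularity assumption here are mine, not the hardware's.

struct sram_plan {
    unsigned kernel_bytes;       /* compressed weights placed in SRAM */
    unsigned input_rows_cached;  /* leading rows of the input tensor cached */
};

static struct sram_plan
plan_sram(unsigned sram_size, unsigned kernel_bytes,
          unsigned input_row_bytes, unsigned input_rows)
{
    struct sram_plan plan = { .kernel_bytes = kernel_bytes };
    unsigned left = sram_size > kernel_bytes ? sram_size - kernel_bytes : 0;
    unsigned rows = left / input_row_bytes;

    /* Cache as many input rows as fit in the leftover SRAM. */
    plan.input_rows_cached = rows < input_rows ? rows : input_rows;
    return plan;
}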

Next steps

At this point I am pretty confident that we can get quite close to the performance of the proprietary driver without much additional work, as a few major performance features remain to be implemented, and I know I still need to do a tuning pass over some of the previous performance work.

But after getting the input tensor caching finished and before I move to any other improvements, I think I will invest some time in adding some profiling facilities so I can better direct the efforts and get the best returns.

Thursday, February 8, 2024

Etnaviv NPU update 16: A nice performance jump

Since the open-source driver for VeriSilicon's Vivante NPU was merged into Mesa two weeks ago, I have been taking some rest and thinking about what will come next.

Automated testing

I have a merge request to Mesa almost ready that will enable continuous integration testing on real hardware, but it depends on solving what seem to be problems with the power supplies of the boards in the HW testing lab. Collabora is graciously looking at it. Thanks!

Performance

I have been talking with quite a few people about the whole effort of bringing open-source to NPU hardware and something that came up more than once is the question of reaching or surpassing the performance level of the proprietary drivers.

It is a fair concern, because the systolic arrays will be underutilized if they are starved of data. And given how fast they are at performing arithmetic operations, and how slow memory buses and chips on embedded devices are (relative to high-end GPUs, at least), this starvation and the consequent underutilization are very likely to happen.

IP vendors go to great lengths to prevent that from happening, inventing ways of getting the data faster to the processing elements, reducing the memory bandwidth used, and balancing the use of the different cores/arrays. There is plenty of published research on this area, which helps when figuring out how to make the most of a particular piece of hardware.

Weight compression

Something I started working on last week is compression of zero values in the weight buffers. Sparsity is very common in the neural models that this hardware is meant to run, and common convolutions such as strided and depthwise ones can easily have zero ratios of 90% and more.

By compressing consecutive zeroes in a buffer we can greatly reduce pressure on the memory bus, keeping the processing units better fed (though I'm sure we are still far from getting good utilization).

By opportunistically using the 5 available bits to compress consecutive runs of zeroes, I was able to improve the performance of the MobileNetV1 model from 15.7 ms to 9.9 ms, and that of the SSDLite MobileDet model from 56.1 ms to 32.7 ms.
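To give an idea of the mechanism, below is a sketch of run-length compression of zeroes with a 5-bit run length. This is not the exact bitstream the Vivante hardware expects (and it sidesteps quantization by treating 0 as the zero value); it just shows the idea.

#include <stdint.h>
#include <stddef.h>

#define MAX_ZERO_RUN 31  /* largest run expressible in 5 bits */

struct wb_entry {
    uint8_t zero_run;    /* number of zeroes preceding the value, 0..31 */
    uint8_t value;       /* the next weight after the run */
};

/* out must have room for len entries in the worst case (no zeroes at all). */
static size_t
compress_weights(const uint8_t *in, size_t len, struct wb_entry *out)
{
    size_t n = 0, run = 0;

    for (size_t i = 0; i < len; i++) {
        if (in[i] == 0 && run < MAX_ZERO_RUN) {
            run++;
            continue;
        }
        out[n].zero_run = run;
        out[n].value = in[i];
        n++;
        run = 0;
    }
    if (run) {   /* trailing zeroes: emit them with an explicit zero value */
        out[n].zero_run = run - 1;
        out[n].value = 0;
        n++;
    }
    return n;
}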



As shown in the graph above, we still have quite some room for improvement before we reach the performance of the proprietary driver, but we are getting close pretty fast. I also believe that we can tailor the driver to users' needs and surpass the performance of the proprietary driver for specific models, as this is open source and everybody can chip in, see how things are made and improve them.

IRC channel

I mentioned this in passing some time ago, but now that we have a driver at this level of usefulness, I think it is a good moment to remind people that we have an IRC channel on the OFTC network to discuss anything about doing accelerated machine learning on the edge with upstream open-source software: #ml-mainline. You can click here to join via a web interface, though I recommend setting up an account at matrix.org.

What next

Should I continue working on performance? Enable more models for new use cases? Enable this driver on more SoCs (i.MX8MP and S905D3 look interesting)? Start writing a driver for a completely different IP, such as Rockchip's or Amlogic's?

I still haven't decided, so if you have an opinion please drop a comment in this blog, or at any of the social networks linked from this blog.

I'm currently available for contracting, so I should be able to get on your project full-time on short notice.

Wednesday, January 24, 2024

Etnaviv NPU update 15: We are upstream!

Today the initial merge request for Teflon was merged into Mesa, along with the first hardware driver, for VeriSilicon's Vivante NPU.

For those who don't know, Teflon is a TensorFlow Lite delegate that aims to support several AI accelerators (also called NPUs, TPUs, APUs, NNAs, etc). Teflon is and will always be open-source, and is released under the MIT license.


This will have the following advantages for the project:

  1. The userspace driver will be automatically packaged by distros such as Debian, Ubuntu, Fedora and Yocto, when they update to the next stable version: 24.1.0, which should be out around May 2024. See the release calendar.
  2. Contribution to the project will happen within the development process of Mesa. This is a well-established process in which employees from companies such as Google, Valve, Imagination, Intel, Microsoft and AMD work together on their GPU drivers.
  3. The project has great technical infrastructure, maintained by awesome sysadmins.
  4. More importantly, the Mesa codebase also has infrastructure that will be very useful to NPU drivers:
    • The NIR intermediate representation with loads of lowering passes. This will be immediately useful for lowering operations in models to programmable cores, but in the future I want to explore representing whole models with this, for easier manipulation and lowerings.
    • The Gallium internal API that decouples HW-specific frontends from HW-specific drivers. This will be critical as we add support for more NPUs, and also when we expose to other frameworks such as Android NNAPI.
  5. And lastly, Mesa is part of a great yearly conference that allows contributors to discuss their work with others in a high-bandwidth environment: XDC.

The story so far

In 2022, while still at Collabora, I started adding OpenCL support to the Etnaviv driver in Mesa. Etnaviv is a userspace and kernel driver for VeriSilicon's Vivante GPUs and NPUs.

The goal was to accelerate machine learning workloads, but once I left Collabora to focus on the project and had implemented enough of the OpenCL specification to run a popular object classification model, I realized that there was no way I was ever going to get close to the performance of the proprietary driver by using the programmable part of the NPU.

I dug a bit deeper in how the proprietary driver was doing its thing and realized that almost all operations weren't running as shaders, but on "fixed-function" hardware units (systolic arrays, as I realized later).

Fortunately, all these accelerators that support matrix multiplications as individual instructions are very similar in their fundamentals, and the state of the art has been well documented in scientific publications since Google released their first TPU.

With all this wealth of information and with the help of VeriSilicon's own debugging output and open-source kernel driver, I had a very good start at reverse engineering the hardware. The rest was done by observing how the proprietary userspace driver interacted with the kernel, with the help of existing tools from the Etnaviv project and others that I wrote, and by staring for long hours at all the produced data in spreadsheets.

During the summer and with Libre Computer's sponsorship, I chipped away at documenting the interface to the convolution units and implementing support for them in my Mesa branch.

By autumn I was able to run that same object classification model (MobileNet V1) 3 times faster than the CPU was able to. A month later I learned to use the other systolic array in the NPU, for tensor manipulation operations, and got it running 6 times faster than the CPU and only twice as slow as the proprietary driver.

Afterwards I got to work on object detection models, and by the start of 2024 I managed to run SSDLite MobileDet at 56 milliseconds per inference, which is around 3 times slower than what the proprietary driver achieves, but still pretty darn useful in many situations!

The rest of the time until now has been spent polishing the driver, improving its test suite and reacting to code reviews from the Mesa community.

Next steps

Now that the codebase is part of upstream Mesa, my work will progress in smaller batches, and I expect myself to be spending time reviewing other people's contributions and steering the project. People want to get this running on other variants of the VeriSilicon NPU IP and I am certainly not going to be able to do it all!

I also know of people wanting to put this together with other components in demos and solutions, so I will be supporting them so we can showcase the usefulness of all this.

There are some other use cases that this hardware is well-suited for, such as more advanced image classification, pose estimation, audio classification, depth estimation, and image segmentation. I will be looking at what the most useful models require in terms of operations and implementing them.

There is quite a lot of low-hanging fruit for improving performance, so I expect myself to be implementing support for zero-compression, more advanced tiling, better use of the SRAM in the device, and a few others.

And at some point I should start looking at other NPU IP to add support for. The ones I'm currently leaning the most towards are Rockchip's own IP, Mediatek's, Cadence's and Amlogic's.

Thanks

One doesn't just write an NPU driver on one's own, much less without any documentation, so I need to thank the following people who have helped me greatly in this effort:

Collabora for allowing me to start playing with this while I still worked with them.

Libre Computer and specifically Da Xue for supporting me financially for most of 2023. They are a very small company, so I really appreciate that they believed in the project and put aside some money so I could focus on it.

Igalia for letting Christian Gmeiner spend time reviewing all my code and answering my questions about Etnaviv.

Embedded Recipes for giving me the opportunity to present my work last autumn in Paris.

Lucas Stach from Pengutronix for answering my questions and listening to my problems when I suspected something in the Etnaviv kernel driver.

Neil Armstrong from Linaro for supporting me in the hardware enablement of the NPU driver on the Amlogic SoCs.

And a collective thanks to the DRI/Mesa community for being so awesome!

Wednesday, January 10, 2024

Etnaviv NPU update 14: Object detection with decent performance

When almost two months ago I got MobileNetV1 running with useful performance on my driver for the Vivante NPU, I took that milestone as a partial validation of my approach.

Partial because MobileNetV1 is quite an old model by now, and since then several iterations have come out with better accuracy and better performance. Would I be able to, without any documentation, add enough support to run newer models with useful performance?

Since then, I have been spending some time looking at the state of the art in object detection models, getting a sense of the gap between the features supported by my driver and the operations that the newer models use.

SSDLite MobileDet is already 3 years old but can still be considered state of the art on most hardware, with good accuracy and low latency.

The graph structure was more complex than that of MobileNet, and it used tensor addition operations, which I didn't support at the time. There were other operations that I didn't support either, but those were at the end of the graph and could be performed on the CPU without much penalty.

So after implementing additions along with a few medium-sized refactorings, I got the model running correctly:

Performance wasn't that bad at that point: at 129 ms it was twice as fast as the CPU and "only" 5 times slower than the proprietary driver.

I knew that I was using extremely conservative values for the size of the output tiles, so I wrote some scripts to run hundreds of different convolution configurations and tabulate the parameters that the proprietary driver used to program the hardware.

After a lot of time spent staring at a spreadsheet, I came up with a reasonable guess at the conditions that limit the size of the tiles. By using the biggest tile size that is still safe, I got much better performance: 56.149 ms, so almost 18 inferences can be performed per second.
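The heuristic ends up looking roughly like the sketch below: try output tile sizes from biggest to smallest and take the first one whose working set fits within the observed limits. The budget and the cost formula here are placeholders, not the actual conditions I derived from the spreadsheet.

static unsigned
pick_output_tile_size(unsigned out_width, unsigned out_channels,
                      unsigned bytes_per_pixel, unsigned budget)
{
    for (unsigned tile = out_width; tile > 0; tile--) {
        unsigned working_set = tile * out_channels * bytes_per_pixel;

        if (working_set <= budget)
            return tile;   /* biggest tile that is still safe */
    }
    return 1;
}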

If we look at a practical use case such as that supported by Frigate NVR, a typical frame rate for the video inputs is 5 FPS. At our current performance level, we could run 3-4 inferences on each frame when several objects are being tracked at the same time, or handle 3-4 cameras simultaneously if not.

Given the price level of the single board computers that contain the VIPNano, this is quite a lot of bang for your buck. And all open source and heading to mainline!

Next steps

I have started cleaning up the latest changes so they can be reviewed upstream. I also need to make sure that the in-flight patches to the kernel are merged now that the merge window for 6.8 has opened.

Thursday, December 21, 2023

Etnaviv NPU update 13: Don't cross the tensors

"Don't cross the streams. It would be bad."

IR refactorings

A big part of what I have been up to in the past two weeks has been a serious refactoring of the data structures that hold the model data in the different phases until the HW configuration is generated.

What we had was enough for models with trivial control flow such as MobileNetV1, but more recent models for object classification and detection make use of more operations, and those are linked to each other non-sequentially.

The image below shows six of the more than a hundred operations in the SSDLite MobileDet model:

A small subsection of SSDLite MobileDet

The adds will be "lowered" or converted to a special case of convolution in which the two input tensors are concatenated together as two channels of a single tensor, and the last convolution in the fragment will need to have its input tensor processed to remove the stride as the HW doesn't support those natively. The processing of this tensor will be performed in an additional job that will run in the TP (tensor processing) cores in the NPU.

As you can probably imagine, the modifications to the operation graph will be far from trivial without the right data structures, so I looked at ways of refactoring the code that translates the model as given by TensorFlow Lite to the HW operations.

For now I have settled on having a separate data structure for the tensors, with the operations referring to their input and output tensors by index into that list. In the future, I think we should move to intermediate representations more akin to what is used in compilers, to support more complex lowerings of operations and reorganizations of the operations inside the model.
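A minimal sketch of that arrangement, with illustrative names rather than the driver's actual ones:

#define MAX_TENSORS 512
#define MAX_OP_IO   4

struct ml_tensor {
    unsigned dims[4];             /* NHWC dimensions */
    /* quantization parameters, buffer handle, etc. */
};

struct ml_operation {
    enum { OP_CONV2D, OP_DEPTHWISE_CONV2D, OP_ADD } type;
    unsigned inputs[MAX_OP_IO];   /* indices into the tensor list */
    unsigned n_inputs;
    unsigned output;              /* index into the tensor list */
};

struct ml_subgraph {
    struct ml_tensor tensors[MAX_TENSORS];
    unsigned n_tensors;
    struct ml_operation *operations;
    unsigned n_operations;
};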

I will be thinking about such an IR later next year, once I get object detection with SSDLite MobileDet running at a useful performance level. Ideally I would like to reuse NIR so drivers can do all the lowerings and optimizations they need without having to reinvent so much of an IR, but if it turns out that operations on tensors aren't a good fit for NIR, then I will be thinking of doing something similar just for them.

For NPUs with programmable cores it could be very interesting to have a pipeline of transformations that can go from very high level operations to GPGPU instructions, probably starting from a standard such as MLIR.

Tensor addition

I also put some time into pulling together all the information I have gathered about how the proprietary driver interacts with the HW when submitting tensor addition jobs, and spent a substantial amount of time looking at the different parameter combinations in a spreadsheet, with liberal use of CORREL() to get a hint of which parameters of the high-level operations are used as inputs in the formulas that produce the HW configuration.

Lowering the strides

Similarly to the above, there was a lot of staring at a spreadsheet for the parameters of the TP jobs that transform the input tensor of a convolution with a stride different from one.

Status and next steps

Below is a rendering of the whole operation graph for the SSDLite MobileDet model, so people can get an idea of the dimensions and complexity of a modern model for edge object detection.

The model is currently running without anything exploding too badly, and all the convolutions are running correctly when run independently. But when run together, I see some bad results starting to flow around the middle of the graph, so that is what I will be debugging next.

The whole of SSDLite MobileDet

 

Wednesday, December 6, 2023

Etnaviv NPU update 12: Towards SSDLite MobileDet

During these last two weeks I have been working towards adding support for more operations and kinds of convolutions so we can run more interesting models. As a first target, I'm aiming at MobileDet, which, though a bit old by now (it was introduced in 2020), is still the state of the art in object detection on mobile, used for example in Frigate NVR.

I haven't mentioned it in a few updates, but all this work keeps being sponsored by Libre Computer, who are aiming to be the first manufacturer of single board computers to provide accelerated machine learning with open source components. Check out Alta and Solitude for the first such boards in the market.

Upstreaming

Igalia's Christian Gmeiner has been giving me great feedback at the merge request, and as part of that I submitted a patch to the kernel to retrieve some parameters that are needed when programming the hardware and that are best not left hardcoded. 

This means that upstreaming to Mesa loses some urgency, as we are going to have to wait anyway until the merge window for 6.8 opens, after 6.7 final is out.

Convolutions with 5x5 weights

Until now I had implemented support only for weights with dimensions 1x1 (aka pointwise convolutions) and 3x3 (the most common by far). Some of the convolutions in MobileDet use 5x5 weight tensors though, so I had to implement support for them. It was a matter of adding some extra complexity to the code that compresses the weight tensors in the format that the hardware expects.

I implemented this for all kinds of supported convolutions: depthwise, strided, with padding, etc.

Tensor addition

I observed that the vendor blob implements addition operations with convolution jobs, so I looked deeper and saw that it was implementing the addition of two input tensors by placing them as the two channels of a single tensor, then passing them through a 1x1 convolution with a specially crafted weight tensor and bias vector.

This is working with hardcoded values for some specific input image dimensions, but I still need to gather more data so I can come up with a generic expression.
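Ignoring quantization for the moment (handling the zero points and scales is presumably what makes the real weight tensor and bias vector "specially crafted"), the trick can be sketched like this for two inputs with C channels each:

/* Lower ADD(a, b) to a 1x1 convolution: a and b are concatenated into one
 * tensor with 2*C channels, and output channel c just sums input channels
 * c and C + c. Floating point sketch; the bias stays all zeroes here. */
static void
build_add_weights(float *weights /* [C][2*C], 1x1 kernel */, unsigned C)
{
    for (unsigned out = 0; out < C; out++) {
        for (unsigned in = 0; in < 2 * C; in++)
            weights[out * 2 * C + in] = 0.0f;
        weights[out * 2 * C + out] = 1.0f;        /* channel c of tensor a */
        weights[out * 2 * C + C + out] = 1.0f;    /* channel c of tensor b */
    }
}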

Softmax pooling

One more operation commonly used in models for mobile that was still missing is pooling, in its different kinds: average, max, etc.

The blob implements these operations on the programmable core, with CL-like kernels.

So I dusted off the work that I did in the first half of 2023 and added code to Teflon for passing these operations to the Gallium drivers. Then I added a new kind of operation to the ML backend in Etnaviv to make use of the programmable core.

Things work fine, even if for now I am storing the kernel machine code in a blob inside the C code. The next step will be to implement the kernel in NIR and generate the machine code using the existing compiler in Etnaviv.
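For now, "a blob inside the C code" means something along these lines (the name and contents are placeholders):

#include <stdint.h>

/* Pre-assembled program for the programmable core, uploaded as-is for now.
 * The plan is to generate this from a NIR implementation instead. */
static const uint32_t average_pool_kernel_code[] = {
    0x00000000, /* ... assembled instructions elided ... */
};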

With this piece of work, we are now able to use all the hardware units in the NPU, and even if the programmable core in this configuration is really underpowered, it will allow us to keep the model in memory close to the NPU, instead of having to ping-pong between the NPU and CPU domains.

A new test suite

With new operations and kinds of convolutions being added, I was starting to have trouble testing all the possible combinations in a practical way, as the test suite that I had was taking more than 20 minutes for a full run.

To get around that, I reimplemented the tests in C++ with GoogleTest, which is supported by Emma Anholt's deqp-runner and will allow me to run the tests in parallel, making full use of the CPU cores in the board.

That made a big difference, but with so many testing combinations being added (+3000 as of now), it was still not fast enough for me. So I remembered an approach that we were considering to speed up execution of Vulkan and OpenGL conformance tests: caching the golden images that are used to compare and check that the output from the hardware is correct.

With that, the bottleneck is the network, as I store the cache in NFS, and I can run the full test suite in less than 3 minutes.
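The caching itself is conceptually simple, something along the lines of the sketch below: key each test by a hash of its parameters and reuse the reference output if it is already in the cache directory (NFS-mounted in my setup). Names are illustrative, not the actual test code.

#include <stdio.h>
#include <stdbool.h>

static bool
load_cached_golden(const char *cache_dir, unsigned long params_hash,
                   void *buf, size_t size)
{
    char path[256];
    snprintf(path, sizeof(path), "%s/%016lx.bin", cache_dir, params_hash);

    FILE *f = fopen(path, "rb");
    if (!f)
        return false;   /* miss: compute the reference output on the CPU */

    bool ok = fread(buf, 1, size, f) == size;
    fclose(f);
    return ok;
}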

The only problem was that I started finding some tests that failed randomly, especially when the cache of test results had already been brought into the filesystem cache on the board. After a lot of scratching my head, I came to realize that the Etnaviv kernel driver was trying to submit up to 4 jobs at the same time to the hardware, if userspace was fast enough to enqueue that many jobs before the previous ones had finished.

There is a kernel module parameter to set the number of jobs that are submitted to the hardware at any given point, and setting that to 1 took me back to rock-solid test results, which is an absolute need for keeping the driver author's sanity.

Next steps

I have quickly added support for a lot of new operations and parameter combinations, and the code is not as clean as I would like, so some refactoring is needed.

So in the next days I will be investing some time in cleaning things up, and afterwards will move to more operations in MobileDet.