Thursday, June 13, 2024

Rockchip NPU update 4: Kernel driver for the RK3588 NPU submitted to mainline

In the past few weeks I have been working on, among other things, a kernel driver for the NPU in the Rockchip RK3588 SoC, written new from the ground up.

It is now fully working, and after a good amount of polishing I sent it yesterday to the kernel mailing lists for review. Those interested can see the code and follow the review process at this link.

The kernel driver is able to fully use the three cores in the NPU, giving us the possibility of running 4 simultaneous object detection inferences, such as the one below, on a video stream at almost 30 frames per second.

[Video: object detection demo running on the RK3588 NPU]

The userspace driver is in a less polished state, but it is already fully featured. I will be working on it in the next few days so it can be properly submitted for review.

This is the first accelerator-only driver for an edge NPU submitted to the mainline kernel, and hopefully it can serve as a template for the ones to come, as the differences among NPUs from different vendors are relatively superficial.

21 comments:

Liam said...

Did you happen to modify the HDMI_RX input for better support for realtime raw video for computer vision?

Tomeu Vizoso said...

Hi Liam, I have worked strictly on the NPU of the RK3588 so far. For my demo I just used a USB webcam.

Anonymous said...

Hi, thanks for sharing your updates and progress. Sorry for the newbie questions, I am starting to learn about NPU and related stuff.

May I ask which board are you using for the RK3588 development?
Also, how do you know the NPU itself has 3 cores? I searched online and in the datasheet and did not find that reference.
Finally, why is it that you can run 4 streams simultaneously on 3 cores? Is it thanks to 3 NPU cores + 1 GPU?

I am planning to buy the Orange Pi 5 Pro to try out your ideas, so I really appreciate your updates and sharing your hard work. Thanks!

Tomeu Vizoso said...

> May I ask which board are you using for the RK3588 development?

I am using a QuartzPro64 that Pine64 sent me.

> Also, how do you know the NPU itself has 3 cores? I searched online and in the datasheet and did not find that reference.

I think I first saw it in the source code of their kernel driver.

> Finally, why is it that you can run 4 streams simultaneously on 3 cores? Is it thanks to 3 NPU cores + 1 GPU?

The kernel contains a job queue and dispatches jobs from it to the 3 cores. Because part of running the model happens outside of the NPU, running one more thread than the number of cores gives us more total throughput without a degradation in inferences per second.
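
To make that concrete, here is a rough Python sketch of the oversubscription idea (not the actual demo code); the model file, input shape, and the delegate library name are placeholders I'm assuming:

    import queue
    import threading

    import numpy as np
    from tflite_runtime.interpreter import Interpreter, load_delegate

    NPU_CORES = 3
    WORKERS = NPU_CORES + 1          # one more thread than NPU cores

    frames = queue.Queue(maxsize=WORKERS * 2)
    results = queue.Queue()

    def worker():
        # Each thread gets its own interpreter; the kernel's job queue spreads
        # the submitted inference jobs over the three NPU cores.
        interp = Interpreter(
            model_path="detect.tflite",                               # placeholder
            experimental_delegates=[load_delegate("libteflon.so")])  # assumed name
        interp.allocate_tensors()
        inp = interp.get_input_details()[0]
        out = interp.get_output_details()[0]
        while True:
            frame = frames.get()
            if frame is None:
                break
            interp.set_tensor(inp["index"], frame)        # CPU: feed the input
            interp.invoke()                               # NPU: one core runs the job
            results.put(interp.get_tensor(out["index"]))  # CPU: post-processing

    threads = [threading.Thread(target=worker, daemon=True) for _ in range(WORKERS)]
    for t in threads:
        t.start()

    # Stand-in for the camera capture loop: push dummy 320x320 RGB frames.
    for _ in range(100):
        frames.put(np.zeros((1, 320, 320, 3), dtype=np.uint8))

While one thread is busy on the CPU side, the others keep the three cores fed, which is where the extra throughput comes from.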

> I am planning to buy the Orange Pi 5 Pro to try out your ideas, so I really appreciate your updates and sharing your hard work. Thanks!

You are welcome, hope you have lots of fun.

Anonymous said...

Is your open source kernel driver compatible with the proprietary rknn and rkllm SDKs?

Tomeu Vizoso said...

> Is your open source kernel driver compatible with the proprietary rknn and rkllm SDKs?

No, it's not and it couldn't be, as the UABI that Rockchip chose wouldn't be acceptable in the mainline Linux kernel.

Anonymous said...

Hi Tomeu, very nice work. We were able to run a custom YOLOv5 model using the RKNN framework, but for YOLOv8 and above no examples were given, since we were using their RKNN SDK. With your open source driver, YOLOv8 and above should be possible. Could you give/post an example implementation of YOLOv8?

Tomeu Vizoso said...

> Hi Tomeu, very nice work. We were able to run a custom YOLOv5 model using the RKNN framework, but for YOLOv8 and above no examples were given, since we were using their RKNN SDK. With your open source driver, YOLOv8 and above should be possible. Could you give/post an example implementation of YOLOv8?

Hi, unfortunately I haven't yet worked on YOLO support. Some additional operations would need to be implemented, and there may also be bugs that prevent it from working. It is open source though, so you can either do that work yourselves, or we can talk about some other arrangement.

In the latter case, you can send me an email at tomeu@tomeuvizoso.net. Good luck!

Anonymous said...

> Hi, unfortunately I haven't yet worked on YOLO support. Some additional operations would need to be implemented, and there may also be bugs that prevent it from working.

Can additional operations be implemented solely in Mesa, or would they need to be added to the kernel driver as well?

Tomeu Vizoso said...

> Can additional operations be implemented solely in Mesa, or would they need to be added to the kernel driver as well?

Solely in Mesa.

Anonymous said...

"Has anyone successfully run real-time YOLO object detection on Armbian OS using the Rockchip 3588S2 (Radxa CM5)? Specifically, I'm curious if FFmpeg can be used for real-time processing and whether the NPU and GPU on this setup are accessible and capable of handling detection tasks effectively. Would love to hear your insights or experiences!"

Anonymous said...

Hi Tomeu,

Are you pushing to get this mainlined? It would be a great addition to the RK3588.

Many thanks,

Tomeu Vizoso said...

> Are you pushing to get this mainlined? It would be a great addition to the RK3588.

Yes, I'm now working again on this.

Anonymous said...

Hello,

Currently, the only viable option for utilizing the NPU is rknn-toolkit2, but while it can be made to "work," it is far from being able to "fully utilize its potential." It has numerous bugs and is not very practical for real-world use...

Would it be possible to make this driver more versatile, allowing it to support not just MobileDet, but also LLMs and Stable Diffusion?
What would be required to achieve this?

Tomeu Vizoso said...

> Would it be possible to make this driver more versatile, allowing it to support not just MobileDet, but also LLMs and Stable Diffusion?

Sure, that is something I would like to work on.

> What would be required to achieve this?

Some funding. I provide development services as an independent contractor, and the product of my work is released as open-source and integrated in upstream projects such as the Linux kernel.

For the other NPUs I write drivers for, companies that are negatively affected by the low quality of the proprietary drivers contract with me to develop the drivers further.

For Rockchip NPUs, that hasn't happened yet.

Anonymous said...

Thank you for your quick response. That makes sense.

However, one thing is certain: Rockchip is unlikely to provide funding or contract work for this. They have historically provided minimal support for their own hardware, including their kernel and GPU. So, I don't think we can expect any financial backing from them.

That being said, if the goal is to make this driver more universally usable, what would be the most reasonable approach? Would reverse engineering the rknn-toolkit2 be the best route, or is there a more elegant alternative? As far as I know, this device has always struggled with optimization issues in shared libraries.

Tomeu Vizoso said...

> However, one thing is certain: Rockchip is unlikely to provide funding or contract work for this. They have historically provided minimal support for their own hardware, including their kernel and GPU. So, I don't think we can expect any financial backing from them.

Yes, that is my impression as well.

> That being said, if the goal is to make this driver more universally usable, what would be the most reasonable approach? Would reverse engineering the rknn-toolkit2 be the best route, or is there a more elegant alternative? As far as I know, this device has always struggled with optimization issues in shared libraries.

I think one just needs to reverse engineer the hardware. Then build a standard stack on top. For now I'm focusing on tensorflow-lite, but other frontends could be added on top of the hardware-specific drivers.

I would like to reproduce the success story of open-source GPU drivers with the NPUs.

Anonymous said...

I see.

The issue is that NPUs, unlike GPUs, do not have standardized specifications like OpenGL or Vulkan, and they lack compatibility with other hardware. This is probably why Rockchip created their own rknn.

However, trying to implement it in the same way as rknn requires a significant amount of work. (Moreover, the rknn-toolkit is full of bugs, and it's quite inconvenient to write rknn code for each application.)

Given this, would it be a more practical approach to directly support an open-source and versatile framework like OpenVINO?

Tomeu Vizoso said...

> The issue is that NPUs, unlike GPUs, do not have standardized specifications like OpenGL or Vulkan, and they lack compatibility with other hardware. This is probably why Rockchip created their own rknn.

I think this is a fair characterisation of the situation.

> However, trying to implement it in the same way as rknn requires a significant amount of work. (Moreover, the rknn-toolkit is full of bugs, and it's quite inconvenient to write rknn code for each application.)

I also think that would be a bad idea. The plan I'm currently executing is to have the hardware-specific drivers in Mesa3D, beneath the Gallium hardware abstraction layer. Above it I implement hardware-independent framework backends, such as a TensorFlow Lite delegate.

> Given this, would it be a more practical approach to directly support an open-source and versatile framework like OpenVINO?

OpenVINO and ROCm are open source but not community-led, which would mean we would be subjected to the whims of their corporate overlords. It wouldn't really be that practical. That said, maybe there is a way of using the Mesa3D drivers from those frameworks.

Anonymous said...

Does that mean it avoids using an intermediate format like RKNN? If that is possible, that would be quite promising.
By processing directly without an intermediate layer, the implementation would be simpler, making it more flexible and extensible across different APIs. If the Gallium frontend could be generalized, it might even be adaptable to other devices.

Additionally, if this could be integrated with Python, many existing projects could run with minimal code modifications.

However, running inference efficiently with INT4/INT8/INT16 usually requires precompilation, which is why formats like .rknn or .engine are commonly used.
It would be great if there were a way to achieve this without relying on such formats...

Tomeu Vizoso said...

> Does that mean it avoids using an intermediate format like RKNN? If that is possible, that would be quite promising.

Yep. You instruct TensorFlow Lite (LiteRT) to use the delegate that Mesa3D provides, and everything else works without changes.

The same could be done for ExecuTorch.
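
As a sketch of what that looks like from the application side (the delegate library name and model file below are assumptions on my part, not something shipped by the driver), the only change is how the interpreter is created; the .tflite file itself stays untouched:

    import numpy as np
    from tflite_runtime.interpreter import Interpreter, load_delegate

    # Same .tflite file you would use on the CPU; no .rknn conversion step.
    interpreter = Interpreter(
        model_path="detect.tflite",
        experimental_delegates=[load_delegate("libteflon.so")])  # assumed name
    interpreter.allocate_tensors()

    inp = interpreter.get_input_details()[0]
    out = interpreter.get_output_details()[0]
    interpreter.set_tensor(inp["index"],
                           np.zeros(inp["shape"], dtype=inp["dtype"]))
    interpreter.invoke()                         # supported ops run on the NPU
    print(interpreter.get_tensor(out["index"]).shape)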

> By processing directly without an intermediate layer, the implementation would be simpler, making it more flexible and extensible across different APIs. If the Gallium frontend could be generalized, it might even be adaptable to other devices.

It is intended to be hardware-independent. I have implementations for both VeriSilicon-based NPUs and NVDLA-based ones (Rockchip's so far). It is a private interface, so it can be changed and extended as we add support for more hardware and for more frameworks.

> However, running inference efficiently with INT4/INT8/INT16 usually requires precompilation, which is why formats like .rknn or .engine are commonly used.

Compilation can take quite some CPU time, but with TensorFlow Lite you can do it either online (transparently on the first execution) or ahead of time, shipping the precompiled model.
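
For the online path, a quick way to see that cost (assuming, as in the earlier sketches, a delegate named libteflon.so and a placeholder model) is to time the first execution against the following ones:

    import time

    import numpy as np
    from tflite_runtime.interpreter import Interpreter, load_delegate

    interp = Interpreter(
        model_path="detect.tflite",                               # placeholder
        experimental_delegates=[load_delegate("libteflon.so")])  # assumed name
    interp.allocate_tensors()
    inp = interp.get_input_details()[0]
    interp.set_tensor(inp["index"], np.zeros(inp["shape"], dtype=inp["dtype"]))

    for i in range(5):
        start = time.monotonic()
        interp.invoke()
        print(f"run {i}: {(time.monotonic() - start) * 1000:.1f} ms")
    # If compilation happens transparently on the first execution, run 0 should
    # be noticeably slower than the rest.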