In the past few weeks I have been working on, among other things, a kernel driver for the NPU in the Rockchip RK3588 SoC, written from the ground up.
It is now fully working and after a good amount of polishing I sent it yesterday to the kernel mailing lists, for review. Those interested can see the code and follow the review process at this link.
The kernel driver is able to fully use the three cores in the NPU, giving us the possibility of running 4 simultaneous object detection inferences such as the one below on a stream, at almost 30 frames per second.
The userspace driver is in a less polished state, but it is fully featured at this stage. I will be working on it in the next few days so it can be properly submitted for review.
This is the first accelerator-only driver for an edge NPU submitted to the mainline kernel, and hopefully it can serve as a template for the next ones to come, as the differences among NPUs of different vendors are relatively superficial.
Did you happen to modify the HDMI_RX input for better support for realtime raw video for computer vision?
Hi Liam, so far I have worked strictly on the NPU of the RK3588. For my demo I just used a USB webcam.
Hi, thanks for sharing your updates and progress. Sorry for the newbie questions, I am just starting to learn about NPUs and related topics.
May I ask which board you are using for the RK3588 development?
Also, how do you know the NPU itself has 3 cores? I searched online and in the datasheet and did not find that reference.
Finally, why is it that you can run 4 streams simultaneously on 3 cores? Is it thanks to 3 NPU cores + 1 GPU?
I am planning to buy the Orange Pi 5 Pro to try out your ideas, so I really appreciate your updates and sharing your hard work. Thanks!
> May I ask which board you are using for the RK3588 development?
I am using a QuartzPro64 that Pine64 sent me.
> Also, how do you know the NPU itself has 3 cores? I searched online and in the datasheet and did not find that reference.
I think I first saw it in the source code of their kernel driver.
> Finally, why is it that you can run 4 streams simultaneously on 3 cores? Is it thanks to 3 NPU cores + 1 GPU?
The kernel contains a job queue and dispatches jobs from it to the 3 cores. Because part of running the model happens outside of the NPU, running one more thread than the number of cores gives us more total throughput without a degradation in inferences per second.
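As a rough illustration of why one thread more than the number of cores can help, here is a toy discrete-event model in Python. It is not the driver's code: the `simulate` function, the FIFO dispatch policy, and the 10 ms of CPU work plus 30 ms of NPU work per inference are all made-up assumptions, chosen only to show the effect.

```python
import heapq


def simulate(num_threads, num_cores=3, cpu_ms=10, npu_ms=30, duration_ms=12_000):
    """Count inferences completed within duration_ms (toy model, hypothetical timings).

    Each thread repeats: cpu_ms of work outside the NPU, then one NPU job of
    npu_ms that must wait in a FIFO queue until a core is free.
    """
    # Min-heap of the times at which each NPU core becomes free.
    core_free_at = [0] * num_cores
    heapq.heapify(core_free_at)

    # Min-heap of (time the thread's next job is submitted, thread id).
    ready = [(cpu_ms, thread) for thread in range(num_threads)]
    heapq.heapify(ready)

    completed = 0
    while True:
        submit_time, thread = heapq.heappop(ready)   # oldest queued job
        core_time = heapq.heappop(core_free_at)      # earliest free core
        start = max(submit_time, core_time)          # wait for whichever is later
        end = start + npu_ms
        if end > duration_ms:
            break                                    # job would finish too late
        completed += 1
        heapq.heappush(core_free_at, end)
        # The thread does its CPU-side work before submitting again.
        heapq.heappush(ready, (end + cpu_ms, thread))
    return completed


for threads in (3, 4, 5):
    print(threads, "threads:", simulate(threads), "inferences")
```

With these assumed timings, three threads leave every core idle while its thread does the CPU-side work, whereas a fourth thread keeps the cores busy almost continuously. Going beyond that cannot raise throughput further in this model, because the three cores are already saturated.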
> I am planning to buy the Orange Pi 5 Pro to try out your ideas, so I really appreciate your updates and sharing your hard work. Thanks!
You are welcome, hope you have lots of fun.
Is your open source kernel driver compatible with the proprietary rknn and rkllm SDKs?
> Is your open source kernel driver compatible with the proprietary rknn and rkllm SDKs?
No, it's not and it couldn't be, as the UABI that Rockchip chose wouldn't be acceptable in the mainline Linux kernel.
Hi Tomeu, very nice work. We were able to run a custom Yolo v5 model using the rknn framework, but no examples were given for Yolo v8 and above, since we were using their rknn SDK. With your open source driver, Yolo v8 and above should be possible. Could you post an example implementation of Yolo v8?
> Hi Tomeu, very nice work. We were able to run a custom Yolo v5 model using the rknn framework, but no examples were given for Yolo v8 and above, since we were using their rknn SDK. With your open source driver, Yolo v8 and above should be possible. Could you post an example implementation of Yolo v8?
Hi, unfortunately I haven't worked yet on Yolo support. Some additional operations would need to be implemented, and there may also be bugs that prevent it from working. It is open source though, so you can either do that work yourselves, or we can talk about some other arrangement.
In the latter case, you can send me an email at tomeu@tomeuvizoso.net. Good luck!
> Hi, unfortunately I haven't worked yet on Yolo support. Some additional operations would need to be implemented, and there may also be bugs that prevent it from working.
Can additional operations be implemented solely in Mesa, or would they need to be added to the kernel driver as well?
> Can additional operations be implemented solely in Mesa, or would they need to be added to the kernel driver as well?
Solely in Mesa.