Monday, July 28, 2025

Rockchip NPU update 6: We are in mainline!

The kernel portion of the Linux driver for the Rockchip NPUs has been merged into the maintainer tree, and will be sent in the next pull request to Linus. The userspace portion of the driver has just been merged as well, in the main Mesa repository.


This means that in the next few weeks the two components of the Rocket driver will be in official releases of the Linux and Mesa projects, and Linux distributions will start to pick them up and package them. Once that happens, we will have seamless accelerated inference on one more category of hardware.

It has been a bit over a year since I started working on the driver, though the actual feature implementation took just over two months of that. The rest of the time was spent waiting for reviews and reacting to excellent feedback from many contributors to the Linux kernel. The driver is now much better because of that frank feedback.

What I see in the near future for this driver is support for other Rockchip SoCs and some performance work, to match that of the proprietary driver. But of course, with it being open source, contributors can just start hacking on it and sending patches over for review and merging.

I'm now working on further improvements to the Etnaviv driver for the Vivante NPUs, and have started work with Arm engineers on a new driver for their Ethos line of NPUs.

So stay tuned for more news on accelerated inference on the edge in mainline Linux!


8 comments:

Anonymous said...

Hi, and first of all great and amazing work! I do not understand much of this but followed your blog about the rk3588 and the upstream progress from the beginning.

I have two questions:
- Assuming I have installed a recent kernel and Mesa that support the driver, how can I use it to run inference? I saw you used TFLite in your MobileNet example. Is this example available somewhere?
- Is your driver also able to run LLMs? Rockchip has two different toolkits, rknn-toolkit2 for vision and rkllm for LLMs. Is your implementation able to handle both?

Anonymous said...

Hi! You can test with MobileNetV1 as per the Teflon docs: https://docs.mesa3d.org/teflon.html

Or with some experimentation you can reproduce an object detection demo with: https://github.com/tomeuv/TensorFlow-Lite-Object-Detection-on-Android-and-Raspberry-Pi/

But it's just a TFLite external delegate, so you can also follow the official TensorFlow Lite documentation.
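In practice, loading the delegate from Python looks roughly like this. A minimal sketch: the path to libteflon.so and the model file name depend on your distribution and setup, so adjust both.

```python
import numpy as np
import tflite_runtime.interpreter as tflite

# Load Teflon as a TFLite external delegate. The .so path depends on
# where your distribution installs Mesa's libteflon.so.
delegate = tflite.load_delegate("/usr/lib/libteflon.so")

interpreter = tflite.Interpreter(
    model_path="mobilenet_v1_1.0_224_quant.tflite",
    experimental_delegates=[delegate],
)
interpreter.allocate_tensors()

# Feed a dummy input just to exercise the NPU path.
inp = interpreter.get_input_details()[0]
interpreter.set_tensor(inp["index"], np.zeros(inp["shape"], dtype=inp["dtype"]))
interpreter.invoke()

out = interpreter.get_tensor(interpreter.get_output_details()[0]["index"])
print(out.argmax())  # index of the top-scoring class
```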

Anonymous said...

Hello! I'm looking forward to using the NPU to build and convert some LLMs and other transformer-based models to run on it. I'm tired of the shortcomings of Rockchip's RKNN and RKLLM, of being unable to fix or hack anything there myself, and of waiting for support for more modules/layers to arrive. I wonder, given this driver, is it now possible to make the NPU work with something like ONNX Runtime, or maybe tinygrad (I mention it here since AFAIK it only requires implementing several ops to add support for a new device)? Where would you suggest starting with that? Maybe you have some examples of implementing backend support?

Anonymous said...

Hi Anonymous, you can look at the current implementation of the TensorFlow Lite delegate: https://gitlab.freedesktop.org/mesa/mesa/-/blob/main/src/gallium/frontends/teflon/tfl_device.c?ref_type=heads

A frontend basically implements the backend API for the framework of choice, then transforms the graph representation from the framework API to Gallium's representation.

The rest is forwarding calls to compile and execute the graph.

It's not much, which is why the TFLite backend fits in a single file.
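To make that pattern concrete, here is a toy sketch in Python of the partition/compile/execute flow; every name in it is invented for illustration, and the real code is C, in the tfl_device.c linked above.

```python
import numpy as np

# Ops this toy "driver" claims to support, mirroring the answer above
# (additions, ReLU, and so on). Everything else stays on the
# framework's own CPU path.
SUPPORTED_OPS = {"add", "relu"}

def partition(graph):
    """Split the framework's graph into delegated and fallback ops."""
    delegated = [op for op in graph if op[0] in SUPPORTED_OPS]
    fallback = [op for op in graph if op[0] not in SUPPORTED_OPS]
    return delegated, fallback

class ToyDelegate:
    """Stand-in for the frontend: compile once, then execute."""

    def compile(self, ops):
        # A real frontend would lower these to Gallium's graph
        # representation and call into the driver's compiler here.
        self.compiled = ops

    def execute(self, tensors):
        # A real frontend forwards this to the driver; here we just
        # interpret the ops on the CPU.
        for name, srcs, dst in self.compiled:
            if name == "add":
                tensors[dst] = tensors[srcs[0]] + tensors[srcs[1]]
            elif name == "relu":
                tensors[dst] = np.maximum(tensors[srcs[0]], 0)

# A tiny graph: (op name, input tensor names, output tensor name).
graph = [("add", ("a", "b"), "t0"), ("relu", ("t0",), "out")]
delegated, fallback = partition(graph)  # fallback is empty here

delegate = ToyDelegate()
delegate.compile(delegated)
tensors = {"a": np.array([-2.0, 1.0]), "b": np.array([1.0, 1.0])}
delegate.execute(tensors)
print(tensors["out"])  # [0. 2.]
```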

Krisztián Szegi said...

Hi Tomeu!

Can you tell me the most accurate model that runs on the RK3588?
AFAIU ReLU6 is supported and SSDLite+MobileDet works (as you demonstrated), but something like YOLOv8 with SiLU is not implemented?
Am I correct?

Tomeu Vizoso said...

So far I have only tested with MobileNetV1/2 and MobileDet, which means most convolution configurations, tensor additions and ReLU. Nothing else is implemented at the moment, but it shouldn't be that difficult now that we have the basics. It is possible, though, that the hardware cannot properly accelerate SiLU and that a different activation function should be chosen when generating the YOLOv8 model.
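For anyone generating their own models, a minimal sketch of staying within those ops, assuming TensorFlow is installed; the layers are purely illustrative, not a YOLOv8 recipe.

```python
import tensorflow as tf

# Toy model that sticks to ops the driver currently handles:
# convolutions, tensor additions and the ReLU family. ReLU6 stands in
# where an architecture like YOLOv8 would normally use SiLU.
inputs = tf.keras.Input(shape=(224, 224, 3))
x = tf.keras.layers.Conv2D(16, 3, padding="same")(inputs)
x = tf.keras.layers.ReLU(max_value=6.0)(x)
y = tf.keras.layers.Conv2D(16, 3, padding="same")(x)
x = tf.keras.layers.Add()([x, y])
x = tf.keras.layers.ReLU(max_value=6.0)(x)
model = tf.keras.Model(inputs, x)

# The NPU works on quantized models; for a real deployment you would
# also provide a representative_dataset for full integer quantization.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
with open("toy_relu6.tflite", "wb") as f:
    f.write(converter.convert())
```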

Juan-Esf91 said...

Hi Tomeu, great work.
The Allwinner T527 and A733 have a Vivante NPU, and the Radxa team has released documents for it. NPU docs available: https://gitlab.com/tina5.0_aiot/product/docs/-/tree/be1db4639682f80d27a7c01693eaf05991435574/Software%20%E8%BD%AF%E4%BB%B6%E7%B1%BB%E6%96%87%E6%A1%A3/SDK%E6%A8%A1%E5%9D%97%E5%BC%80%E5%8F%91%E6%8C%87%E5%8D%97/NPU%E6%A8%A1%E5%9D%97%E5%BC%80%E5%8F%91%E6%8C%87%E5%8D%97

Anonymous said...

Hi Juan, unfortunately those docs are about how to use the downstream drivers. Allwinner probably doesn't have access to the hardware documentation for the NPU from VeriSilicon.