Monday, July 28, 2025

Rockchip NPU update 6: We are in mainline!

The kernel portion of the Linux driver for the Rockchip NPUs has been merged into the maintainer tree, and will be sent in the next pull request to Linus. The userspace portion of the driver has just been merged as well, in the main Mesa repository.


This means that in the next few weeks the two components of the Rocket driver will be in official releases of the Linux and Mesa projects, and Linux distributions will start to pick them up and package them. Once that happens, we will have seamless accelerated inference on one more category of hardware.

It has been a bit over a year since I started working on the driver, though the actual feature implementation took just over two months of that. The rest of the time was spent waiting for reviews and reacting to excellent feedback from many contributors to the Linux kernel. The driver is now much better because of that frank feedback.

What I see in the near future for this driver is support for other Rockchip SoCs and some performance work, to match that of the proprietary driver. But of course, with it being open source, contributors can just start hacking on it and sending patches over for review and merging.

I'm now working on further improvements to the Etnaviv driver for the Vivante NPUs, and have started work with Arm engineers on a new driver for their Ethos line of NPUs.

So stay tuned for more news on accelerated inference on the edge in mainline Linux!


8 comments:

Anonymous said...

Hi, and first of all great and amazing work! I do not understand much of this but followed your blog about the rk3588 and the upstream progress from the beginning.

I have two questions:
- Assuming I have installed a recent kernel and Mesa that support the driver, how can I use it to run inference? I saw you used TFLite in your MobileNet example. Is this example available somewhere?
- Is your driver also able to run LLMs? Rockchip has two different toolkits, rknn-toolkit2 for vision and rkllm for LLMs. Is your implementation able to handle both?

Anonymous said...

Hi! You can test with MobileNetV1 as per the Teflon docs: https://docs.mesa3d.org/teflon.html

Or with some experimentation you can reproduce an object detection demo with: https://github.com/tomeuv/TensorFlow-Lite-Object-Detection-on-Android-and-Raspberry-Pi/

But it's just a TFLite external delegate, so you can also follow the official TensorFlow Lite documentation.
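In practice, loading the delegate from Python looks roughly like this. A minimal sketch: the path to libteflon.so and the model file name depend on your distribution and setup, so adjust both.

```python
import numpy as np
import tflite_runtime.interpreter as tflite

# Load Teflon as a TFLite external delegate. The .so path depends on
# where your distribution installs Mesa's libteflon.so.
delegate = tflite.load_delegate("/usr/lib/libteflon.so")

interpreter = tflite.Interpreter(
    model_path="mobilenet_v1_1.0_224_quant.tflite",
    experimental_delegates=[delegate],
)
interpreter.allocate_tensors()

# Feed a dummy input just to exercise the NPU path.
inp = interpreter.get_input_details()[0]
interpreter.set_tensor(inp["index"], np.zeros(inp["shape"], dtype=inp["dtype"]))
interpreter.invoke()

out = interpreter.get_tensor(interpreter.get_output_details()[0]["index"])
print(out.argmax())  # index of the top-scoring class
```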

Anonymous said...

Hello! I'm looking forward to using the NPU to build and convert some LLMs and other transformer-based models to run on it. I'm tired of the shortcomings of Rockchip's RKNN and RKLLM, of being unable to fix or hack anything there myself, and of waiting for support for more modules/layers to arrive. I wonder, given this driver, is it now possible to make the NPU work with something like ONNX Runtime, or maybe tinygrad (I mention it here since AFAIK it only requires implementing several ops to add support for a new device)? Where would you suggest starting with that? Maybe you have some examples of implementing backend support?

Anonymous said...

Hi Anonymous, you can look at the current implementation of the TensorFlow Lite delegate: https://gitlab.freedesktop.org/mesa/mesa/-/blob/main/src/gallium/frontends/teflon/tfl_device.c?ref_type=heads

A frontend basically implements the backend API for the framework of choice, then transforms the graph representation from the framework API to Gallium's representation.

The rest is forwarding calls to compile and execute the graph.

It's not much, which is why the TFLite backend fits in a single file.
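To make that pattern concrete, here is a toy sketch in Python of the partition/compile/execute flow; every name in it is invented for illustration, and the real code is C, in the tfl_device.c linked above.

```python
import numpy as np

# Ops this toy "driver" claims to support, mirroring the answer above
# (additions, ReLU, and so on). Everything else stays on the
# framework's own CPU path.
SUPPORTED_OPS = {"add", "relu"}

def partition(graph):
    """Split the framework's graph into delegated and fallback ops."""
    delegated = [op for op in graph if op[0] in SUPPORTED_OPS]
    fallback = [op for op in graph if op[0] not in SUPPORTED_OPS]
    return delegated, fallback

class ToyDelegate:
    """Stand-in for the frontend: compile once, then execute."""

    def compile(self, ops):
        # A real frontend would lower these to Gallium's graph
        # representation and call into the driver's compiler here.
        self.compiled = ops

    def execute(self, tensors):
        # A real frontend forwards this to the driver; here we just
        # interpret the ops on the CPU.
        for name, srcs, dst in self.compiled:
            if name == "add":
                tensors[dst] = tensors[srcs[0]] + tensors[srcs[1]]
            elif name == "relu":
                tensors[dst] = np.maximum(tensors[srcs[0]], 0)

# A tiny graph: (op name, input tensor names, output tensor name).
graph = [("add", ("a", "b"), "t0"), ("relu", ("t0",), "out")]
delegated, fallback = partition(graph)  # fallback is empty here

delegate = ToyDelegate()
delegate.compile(delegated)
tensors = {"a": np.array([-2.0, 1.0]), "b": np.array([1.0, 1.0])}
delegate.execute(tensors)
print(tensors["out"])  # [0. 2.]
```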

Krisztián Szegi said...

Hi Tomeu!

Can you tell me the most accurate model that runs on the RK3588?
AFAIU ReLU6 is supported and SSDLite+MobileDet works (as you demonstrated), but something like YOLOv8 with SiLU is not implemented?
Am I correct?

Tomeu Vizoso said...

So far I have only tested with MobileNetV1/2 and MobileDet, which means most convolution configurations, tensor additions and ReLU. Nothing else is implemented at the moment, but it shouldn't be that difficult now that we have the basics. It is possible, though, that the hardware cannot properly accelerate SiLU and that a different activation function should be chosen when generating the YOLOv8 model.
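For anyone generating their own models, a minimal sketch of staying within those ops, assuming TensorFlow is installed; the layers are purely illustrative, not a YOLOv8 recipe.

```python
import tensorflow as tf

# Toy model that sticks to ops the driver currently handles:
# convolutions, tensor additions and the ReLU family. ReLU6 stands in
# where an architecture like YOLOv8 would normally use SiLU.
inputs = tf.keras.Input(shape=(224, 224, 3))
x = tf.keras.layers.Conv2D(16, 3, padding="same")(inputs)
x = tf.keras.layers.ReLU(max_value=6.0)(x)
y = tf.keras.layers.Conv2D(16, 3, padding="same")(x)
x = tf.keras.layers.Add()([x, y])
x = tf.keras.layers.ReLU(max_value=6.0)(x)
model = tf.keras.Model(inputs, x)

# The NPU works on quantized models; for a real deployment you would
# also provide a representative_dataset for full integer quantization.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
with open("toy_relu6.tflite", "wb") as f:
    f.write(converter.convert())
```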

Juan-Esf91 said...

Hi Tomeu, great work.
The Allwinner T527 and A733 have a Vivante NPU, and the Radxa team has released documents for it. NPU docs available: https://gitlab.com/tina5.0_aiot/product/docs/-/tree/be1db4639682f80d27a7c01693eaf05991435574/Software%20%E8%BD%AF%E4%BB%B6%E7%B1%BB%E6%96%87%E6%A1%A3/SDK%E6%A8%A1%E5%9D%97%E5%BC%80%E5%8F%91%E6%8C%87%E5%8D%97/NPU%E6%A8%A1%E5%9D%97%E5%BC%80%E5%8F%91%E6%8C%87%E5%8D%97

Anonymous said...

Hi Juan, unfortunately those docs are about how to use the downstream drivers. Allwinner probably doesn't have access to the hardware documentation for the NPU from VeriSilicon.