Wednesday, January 24, 2024

Etnaviv NPU update 15: We are upstream!

Today the initial merge request for Teflon was merged into Mesa, along with the first hardware driver, for VeriSilicon's Vivante NPU.

For those who don't know, Teflon is a TensorFlow Lite delegate that aims to support several AI accelerators (also called NPUs, TPUs, APUs, NNAs, etc). Teflon is and will always be open-source, and is released under the MIT license.
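To make the word "delegate" concrete: TensorFlow Lite lets an application hand a model's supported operations to an external shared library when the interpreter is created. Below is a minimal sketch using the `tflite_runtime` Python package. The helper names `load_interpreter` and `median_latency_ms` are made up for illustration, and both the model path and the path to the delegate library are placeholders to be filled in from your own build.

```python
import statistics
import time


def load_interpreter(model_path, delegate_path=None):
    """Create a TFLite interpreter, optionally backed by an external delegate."""
    # Imported lazily so the timing helper below works without TFLite installed.
    import tflite_runtime.interpreter as tflite

    delegates = None
    if delegate_path is not None:
        # delegate_path is a placeholder: point it at the delegate shared
        # library produced by your own build.
        delegates = [tflite.load_delegate(delegate_path)]

    interpreter = tflite.Interpreter(model_path=model_path,
                                     experimental_delegates=delegates)
    interpreter.allocate_tensors()
    return interpreter


def median_latency_ms(invoke, runs=10, warmup=1):
    """Call `invoke` repeatedly and return the median wall-clock time in ms."""
    for _ in range(warmup):
        invoke()  # warm-up calls are not timed
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        invoke()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)
```

Creating one interpreter with a delegate path and one without, then passing each one's `invoke` method to `median_latency_ms`, gives a rough accelerator-versus-CPU comparison. No input data is set in this sketch, so it says something about speed only, not about the correctness of the results.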

This will have the following advantages for the project:

  1. The userspace driver will be automatically packaged by distros such as Debian, Ubuntu, Fedora and Yocto, when they update to the next stable version: 24.1.0, which should be out around May 2024. See the release calendar.
  2. Contribution to the project will happen within the development process of Mesa. This is a well-established process in which employees from companies such as Google, Valve, Imagination, Intel, Microsoft and AMD work together on their GPU drivers.
  3. The project has great technical infrastructure, maintained by awesome sysadmins.
  4. More importantly, the Mesa codebase also has infrastructure that will be very useful to NPU drivers:
    • The NIR intermediate representation with loads of lowering passes. This will be immediately useful for lowering operations in models to programmable cores, but in the future I want to explore representing whole models with this, for easier manipulation and lowering.
    • The Gallium internal API that decouples HW-independent frontends from HW-specific drivers. This will be critical as we add support for more NPUs, and also when we expose them to other frameworks such as Android NNAPI.
  5. And lastly, Mesa is part of a great yearly conference that allows contributors to discuss their work with others in a high-bandwidth environment: XDC.

The story so far

In 2022, while still at Collabora, I started adding OpenCL support to the Etnaviv driver in Mesa. Etnaviv is a userspace and kernel driver for VeriSilicon's Vivante NPUs.

The goal was to accelerate machine learning workloads. But once I had left Collabora to focus on the project and had implemented enough of the OpenCL specification to run a popular object classification model, I realized that I was never going to get close to the performance of the proprietary driver by using the programmable part of the NPU.

I dug a bit deeper into how the proprietary driver was doing its thing and realized that almost all operations weren't running as shaders, but on "fixed-function" hardware units (systolic arrays, as I realized later).

Fortunately, all these accelerators that support matrix multiplications as individual instructions are very similar in their fundamentals, and the state of the art has been well documented in scientific publications since Google released their first TPU.

With all this wealth of information and with the help of VeriSilicon's own debugging output and open-source kernel driver, I had a very good start at reverse engineering the hardware. The rest was done by observing how the proprietary userspace driver interacted with the kernel, with the help of existing tools from the Etnaviv project and others that I wrote, and by staring for long hours at all the produced data in spreadsheets.

During the summer and with Libre Computer's sponsorship, I chipped away at documenting the interface to the convolution units and implementing support for them in my Mesa branch.

By autumn I was able to run that same object classification model (MobileNet V1) 3 times faster than the CPU was able to. A month later I learned to use the other systolic array in the NPU, for tensor manipulation operations, and got it running 6 times faster than the CPU and only twice as slow as the proprietary driver.

Afterwards I got to work on object detection models, and by the start of 2024 I managed to run SSDLite MobileDet at 56 milliseconds per inference, which is around 3 times slower than what the proprietary driver achieves, but still pretty darn useful in many situations!
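To put that latency in more familiar terms, here is a quick back-of-the-envelope conversion to inferences per second. The 56 ms figure is the one quoted above; the proprietary driver's latency is only an estimate derived from the "around 3 times" ratio, not a measurement.

```python
def inferences_per_second(latency_ms):
    """Convert a per-inference latency in milliseconds to a throughput."""
    return 1000.0 / latency_ms


# Figure quoted in the post for SSDLite MobileDet.
teflon_ms = 56.0

# Rough estimate only: "around 3 times slower" taken as exactly 3x.
proprietary_ms_estimate = teflon_ms / 3.0

print(f"Teflon:      {teflon_ms:.1f} ms, "
      f"{inferences_per_second(teflon_ms):.1f} inferences/s")
print(f"Proprietary: {proprietary_ms_estimate:.1f} ms (estimated), "
      f"{inferences_per_second(proprietary_ms_estimate):.1f} inferences/s")
```

At roughly 18 inferences per second, the open driver already gets close to what a modest video stream would need for detection.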

The rest of the time until now has been spent polishing the driver, improving its test suite and reacting to code reviews from the Mesa community.

Next steps

Now that the codebase is part of upstream Mesa, my work will progress in smaller batches, and I expect to spend time reviewing other people's contributions and steering the project. People want to get this running on other variants of the VeriSilicon NPU IP and I am certainly not going to be able to do it all!

I also know of people wanting to put this together with other components in demos and solutions, so I will be supporting them so we can showcase the usefulness of all this.

There are some other use cases that this hardware is well-suited for, such as more advanced image classification, pose estimation, audio classification, depth estimation, and image segmentation. I will be looking at what the most useful models require in terms of operations and implementing them.

There is quite a bit of low-hanging fruit for improving performance, so I expect to be implementing support for zero-compression, more advanced tiling, better use of the SRAM in the device, and a few others.

And at some point I should start looking at other NPU IP to add support for. The ones I'm currently leaning the most towards are Rockchip's own IP, MediaTek's, Cadence's and Amlogic's.


One doesn't just start writing an NPU driver all alone, and even less so without any documentation, so I need to thank the following people who have helped me greatly in this effort:

Collabora for allowing me to start playing with this while I still worked with them.

Libre Computer and specifically Da Xue for supporting me financially for most of 2023. They are a very small company, so I really appreciate that they believed in the project and put aside some money so I could focus on it.

Igalia for letting Christian Gmeiner spend time reviewing all my code and answering my questions about Etnaviv.

Embedded Recipes for giving me the opportunity to present my work last autumn in Paris.

Lucas Stach from Pengutronix for answering my questions and listening to my problems when I suspected something was wrong in the Etnaviv kernel driver.

Neil Armstrong from Linaro for supporting me in the hardware enablement of the NPU driver on the Amlogic SoCs.

And a collective thanks to the DRI/Mesa community for being so awesome!


Cajer said...

Any chance of adding a lightweight super resolution model to the list of models you’re planning to implement?

Tomeu Vizoso said...

I would like to. Do you have in mind any specific model?

Cajer said...

I believe the NPU should be powerful enough to do one of the lightweight models in real time. The one detailed in this git and paper should be runnable and is designed to be run on the Coral TPU.

Something else that would be interesting is a noise reduction code similar to this:

These could be good for allowing edge devices to clean up some video for further classification.

Tomeu Vizoso said...

These are very good ideas. I think the hardware should be able to do a good job there and it shouldn't be that much work to add the missing operations.


Jas said...

I made progress on the RK3588 NPU. It's different from the Vivante NPU IP; more info on my blog.

Tomeu Vizoso said...

That is amazing, Jas. I would love to give a hand on this. Please join #ml-mainline at OFTC, or send me an email so we can coordinate. Thanks!

Serhii M. said...


First of all, thank you for this project. It's a shame that Amlogic has not been able to properly implement and open source the A311D software for interaction with NPUs in 5 years, and everything available is very outdated.

Our goal is to run real-time neural networks on the A311D while using an up-to-date kernel. As it turned out, devboard manufacturers currently support 4.x.x kernels only. The final goal is to use NPU-accelerated OpenCV (the supported kernel for the outdated TIM-VX's prebuilt closed-source galcore.ko is really unstable).

That is why I was happy when I heard about your project. Currently, I'm trying to run it and have a few questions. I will be very grateful if you help.

We use a BananaPi CM4 IO devboard, which is based on the A311D chip, with a custom-built Debian (Linux Armbian 24.5.0-trunk) bookworm (6.8.6 Linux kernel).

Following this tutorial, I have changed and compiled the DTB and built Mesa.

Now I am having trouble working out how to install the kernel-level hardware driver for VeriSilicon's Vivante NPU. Running `meson install -C build` seems to install user-space libraries only, and no .ko modules are copied. Please give me a tip on how to install and load the kernel-side etnaviv driver, and how to verify that it is the correct one and that it is running.

The second question is about the `Do some inference with MobileNetV1` part of the tutorial. There is no src/gallium/frontends/teflon/tests/ in, it seems you have it locally only. It would be great if you could share it.

Thank you!

Tomeu Vizoso said...

Hi Serhii,

Everything you need is in the 6.8.6 Linux kernel; you may need to manually insert the etnaviv module by running `modprobe etnaviv`.
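One way to check that the module actually ended up loaded is to look for it in `/proc/modules`. A minimal sketch follows, with the helper names made up for illustration. It assumes etnaviv was built as a loadable module; a driver compiled directly into the kernel would not be listed there, and this check does not cover that case.

```python
from pathlib import Path


def is_module_loaded(name, proc_modules_text):
    """Return True if `name` appears as a module in /proc/modules-style text."""
    for line in proc_modules_text.splitlines():
        fields = line.split()
        # The first whitespace-separated field of each line is the module name.
        if fields and fields[0] == name:
            return True
    return False


def etnaviv_loaded_on_this_system():
    """Check the running kernel; only meaningful on Linux."""
    return is_module_loaded("etnaviv", Path("/proc/modules").read_text())
```

Calling `etnaviv_loaded_on_this_system()` on the board after `modprobe etnaviv` should return `True` if the module was inserted.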

Sorry about not having pushed the script mentioned in the docs. Until it gets to git, you can just use this script from the official TensorFlow documentation:

Serhii M. said...

Hello! Thank you for the help!

I managed to run it, but there is a performance problem.

To begin with, running inference on the NPU caused the following problem:

root@bananapicm4io:~/mesa# TEFLON_DEBUG=verbose ETNA_MESA_DEBUG=ml_dbgs python3 …/ -i ~/grace_hopper.bmp -m …/mobilenet_v1_1.0_224_quant.tflite -l …/labels_mobilenet_quant_v1_224.txt -e …/
Loading external delegate from build/src/gallium/targets/teflon/ with args: {}
Teflon delegate: loaded etnaviv driver
python3: ../src/gallium/drivers/etnaviv/etnaviv_state.c:565: etna_vertex_elements_state_create: Assertion `buffer_idx < screen->specs.stream_count' failed.


There were no further errors after removing this check (in etnaviv_state.c).

However, NPU inference on the standard grace_hopper.bmp with mobilenet_v1_1.0_224_quant.tflite (the one you have provided in the mesa repo) takes 355.632 ms, while CPU (all-cores) inference takes only 16 ms, which is obviously incorrect behavior for the NPU.

Gist with program logs (no new dmesg logs on running), edited device tree:

I would appreciate it if you could help figure out what's wrong.

Thank you!

Tomeu Vizoso said...

> I would appreciate it if you could help figure out what's wrong.

Hi Serhii, can you please try with latest mesa/main? I have fixed a few issues. Should work with latest 6.9-rc, or with earlier kernels if you backport patches.

Pitounet said...

Hello, I'm currently trying with a Radxa Zero 2 Pro, which is based on the Amlogic A311D.
I've managed to build an Armbian with the 6.9.0-rc6 kernel. I'm new to kernel building, so it took me the whole weekend. I'm now trying your tutorial; hopefully it works.