For those who don't know, Teflon is a TensorFlow Lite delegate that aims to support several AI accelerators (also called NPUs, TPUs, APUs, NNAs, etc). Teflon is and will always be open-source, and is released under the MIT license.
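To give a feel for how it plugs in, here is a minimal sketch of loading Teflon as an external delegate from Python. The delegate path and the model file name are assumptions for illustration; the actual location depends on how your distro packages Mesa.

```python
import numpy as np
import tflite_runtime.interpreter as tflite

# Paths are illustrative: adjust to wherever your Mesa build installs
# the Teflon delegate, and to your own quantized .tflite model.
delegate = tflite.load_delegate("/usr/lib/libteflon.so")

interpreter = tflite.Interpreter(
    model_path="mobilenet_v1_1.0_224_quant.tflite",
    experimental_delegates=[delegate],
)
interpreter.allocate_tensors()

# Feed a dummy input and run one inference, offloaded to the NPU.
inp = interpreter.get_input_details()[0]
interpreter.set_tensor(inp["index"], np.zeros(inp["shape"], dtype=inp["dtype"]))
interpreter.invoke()

out = interpreter.get_output_details()[0]
print(interpreter.get_tensor(out["index"]).shape)
```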
This will have the following advantages for the project:
- The userspace driver will be automatically packaged by distros such as Debian, Ubuntu, Fedora and Yocto, when they update to the next stable version: 24.1.0, which should be out around May 2024. See the release calendar.
- Contribution to the project will happen within the development process of Mesa. This is a well-established process in which employees from companies such as Google, Valve, Imagination, Intel, Microsoft and AMD work together on their GPU drivers.
- The project has great technical infrastructure, maintained by awesome sysadmins:
  - a well-maintained GitLab instance,
  - extensive CI, for both build and runtime testing, on real hardware,
  - a mailing list, web server, etc.
- More importantly, the Mesa codebase also has infrastructure that will be very useful to NPU drivers:
  - The NIR intermediate representation, with loads of lowering passes. This will be immediately useful for lowering operations in models to the programmable cores, but in the future I want to explore representing whole models with it, for easier manipulation and lowering.
  - The Gallium internal API, which decouples hardware-agnostic frontends from hardware-specific drivers. This will be critical as we add support for more NPUs, and also when we expose the hardware to other frameworks such as Android NNAPI.
- And lastly, Mesa is part of a great yearly conference that allows contributors to discuss their work with others in a high-bandwidth environment: XDC.
The story so far
The goal was to accelerate machine learning workloads, but once I left Collabora to focus on the project and had implemented enough of the OpenCL specification to run a popular object classification model, I realized that there was no way I was ever going to get close to the performance of the proprietary driver by using the programmable part of the NPU.
I dug a bit deeper into how the proprietary driver was doing its thing and realized that almost all operations weren't running as shaders, but on "fixed-function" hardware units (systolic arrays, as I realized later).
Fortunately, all these accelerators that support matrix multiplications as individual instructions are very similar in their fundamentals, and the state of the art has been well documented in scientific publications since Google released their first TPU.
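As a mental model of what such a unit computes (a toy simulation only, not how VeriSilicon's hardware is actually wired), here is an output-stationary systolic array multiplying two matrices: operands stream through a grid of multiply-accumulate cells, skewed so that matching elements meet at the right cell on the right cycle.

```python
import numpy as np

def systolic_matmul(A, B):
    """Simulate an output-stationary systolic array computing C = A @ B.

    Each cell (i, j) keeps its own accumulator. Rows of A stream in from
    the left and columns of B from the top, each skewed by one cycle per
    row/column, so operand pair s reaches cell (i, j) at cycle i + j + s.
    """
    n, k = A.shape
    k2, m = B.shape
    assert k == k2
    C = np.zeros((n, m))
    for t in range(n + m + k - 2):      # cycles until the array drains
        for i in range(n):
            for j in range(m):
                s = t - i - j           # which operand pair arrives now
                if 0 <= s < k:
                    C[i, j] += A[i, s] * B[s, j]
    return C

A = np.random.rand(4, 3)
B = np.random.rand(3, 5)
assert np.allclose(systolic_matmul(A, B), A @ B)
```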
With all this wealth of information, and with the help of VeriSilicon's own debugging output and open-source kernel driver, I had a very good start at reverse engineering the hardware. The rest was done by observing how the proprietary userspace driver interacted with the kernel, with the help of existing tools from the Etnaviv project and others that I wrote, and by staring for long hours at all the produced data in spreadsheets.
During the summer and with Libre Computer's sponsorship, I chipped away at documenting the interface to the convolution units and implementing support for them in my Mesa branch.
By autumn I was able to run that same object classification model (MobileNet V1) 3 times faster than the CPU could. A month later I learned to use the other systolic array in the NPU, for tensor manipulation operations, and got it running 6 times faster than the CPU, and only half as fast as the proprietary driver.
Afterwards I got to work on object detection models, and by the start of 2024 I managed to run SSDLite MobileDet at 56 milliseconds per inference, which is around 3 times slower than what the proprietary driver achieves, but still pretty darn useful in many situations!
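Numbers like these come from a plain timing loop around the interpreter. A minimal sketch, with the same assumed paths and file names as in the earlier snippet:

```python
import time
import numpy as np
import tflite_runtime.interpreter as tflite

# Delegate location and model path are assumptions, as before.
delegate = tflite.load_delegate("/usr/lib/libteflon.so")
interpreter = tflite.Interpreter(model_path="ssdlite_mobiledet.tflite",
                                 experimental_delegates=[delegate])
interpreter.allocate_tensors()

inp = interpreter.get_input_details()[0]
interpreter.set_tensor(inp["index"], np.zeros(inp["shape"], dtype=inp["dtype"]))
interpreter.invoke()  # warm up, so one-time setup costs aren't measured

runs = 100
start = time.perf_counter()
for _ in range(runs):
    interpreter.invoke()
print(f"{(time.perf_counter() - start) / runs * 1000:.1f} ms per inference")
```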
The rest of the time until now has been spent polishing the driver, improving its test suite and reacting to code reviews from the Mesa community.
Now that the codebase is part of upstream Mesa, my work will progress in smaller batches, and I expect to spend time reviewing other people's contributions and steering the project. People want to get this running on other variants of the VeriSilicon NPU IP, and I am certainly not going to be able to do it all!
I also know of people wanting to put this together with other components in demos and solutions, so I will be supporting them in showcasing the usefulness of all this.
There are some other use cases that this hardware is well-suited for, such as more advanced image classification, pose estimation, audio classification, depth estimation, and image segmentation. I will be looking at what the most useful models require in terms of operations, and implementing those operations.
There is quite a lot of low-hanging fruit for improving performance, so I expect to be implementing support for zero-compression, more advanced tiling, better use of the SRAM in the device, and a few other things.
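To illustrate why tiling and SRAM usage matter (a toy NumPy sketch, not the scheme the hardware uses): blocking a matrix multiply lets each tile be loaded into fast on-chip memory once and reused many times, instead of being re-fetched from DRAM for every output element.

```python
import numpy as np

def tiled_matmul(A, B, tile=64):
    """Blocked matmul: each (tile x tile) block of A and B is small enough
    to stay resident in on-chip SRAM while it is reused, cutting the
    off-chip memory traffic that usually dominates inference time.
    """
    n, k = A.shape
    _, m = B.shape
    C = np.zeros((n, m), dtype=A.dtype)
    for i in range(0, n, tile):
        for j in range(0, m, tile):
            for s in range(0, k, tile):
                C[i:i+tile, j:j+tile] += (
                    A[i:i+tile, s:s+tile] @ B[s:s+tile, j:j+tile]
                )
    return C

A = np.random.rand(200, 300)
B = np.random.rand(300, 100)
assert np.allclose(tiled_matmul(A, B), A @ B)
```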
And at some point I should start looking at other NPU IP to add support for. The ones I'm currently leaning the most towards are Rockchip's own IP, Mediatek's, Cadence's and Amlogic's.
One doesn't just start writing an NPU driver all by oneself, much less without any documentation, so I need to thank the following people, who have helped me greatly in this effort:
Collabora for allowing me to start playing with this while I still worked with them.
Libre Computer and specifically Da Xue for supporting me financially for most of 2023. They are a very small company, so I really appreciate that they believed in the project and put aside some money so I could focus on it.
Embedded Recipes for giving me the opportunity to present my work last autumn in Paris.
Lucas Stach from Pengutronix for answering my questions and listening to my problems when I suspected an issue in the Etnaviv kernel driver.
Neil Armstrong from Linaro for supporting me in the hardware enablement of the NPU driver on the Amlogic SoCs.
And a collective thanks to the DRI/Mesa community for being so awesome!