Cannot believe it has been years since my last update here!
There are two things that I would like to tell people about:
The first is that I no longer work at Collabora. It has been almost 13 years full of excitement and recently I came to believe that I wanted a proper change.
They are great folks to work with, so if you are thinking of a career change and want to do open-source stuff upstream, I recommend that you consider them.
And the other topic is what I have been working on lately: a free software driver for the NPUs that VeriSilicon sells to SoC vendors.
tomeu@arm-64:~/tensorflow/build/examples/label_image$ SMALLER_SOFTMAX=1 RUSTICL_ENABLE=etnaviv LD_LIBRARY_PATH=/home/tomeu/opencl/lib LIBGL_DRIVERS_PATH=/home/tomeu/opencl/lib/dri/ ./label_image --gpu_backend=cl --use_gpu=true --verbose 1 --tflite_model ../../../assets/mobilenet_quant_v1_224.tflite --labels ../../../assets/labels.txt --image ../../../assets/grace_hopper.bmp --warmup_runs 1 -c 1
INFO: average time: 1261.99 ms
INFO: 0.666667: 458 bow tie
INFO: 0.294118: 653 military uniform
INFO: 0.0117647: 835 suit
INFO: 0.00784314: 611 jersey
INFO: 0.00392157: 922 book jacket
That is TensorFlow Lite's OpenCL delegate detecting objects with Etnaviv from Grace Hopper's portrait in military uniform.
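The scores in that output look like multiples of 1/255, which is what one would expect from a quantized model that emits 8-bit values. Here is a minimal Python sketch to check that, assuming the log format shown above. The `parse_labels` helper is hypothetical and not part of TensorFlow Lite, and the 1/255 scaling is my assumption, not something read from the model file.

```python
import re

# Output lines copied from the label_image run above.
LOG = """\
INFO: 0.666667: 458 bow tie
INFO: 0.294118: 653 military uniform
INFO: 0.0117647: 835 suit
INFO: 0.00784314: 611 jersey
INFO: 0.00392157: 922 book jacket
"""

# One result line: "INFO: <score>: <class index> <label>"
LINE_RE = re.compile(r"^INFO: (?P<score>[0-9.]+): (?P<index>\d+) (?P<label>.+)$")


def parse_labels(log):
    """Return (score, class index, label) tuples from label_image output."""
    results = []
    for line in log.splitlines():
        match = LINE_RE.match(line)
        if match:
            results.append(
                (float(match["score"]), int(match["index"]), match["label"])
            )
    return results


for score, index, label in parse_labels(LOG):
    # Assumption: scores are uint8 values scaled by 1/255.
    raw = round(score * 255)
    print(f"{label} (class {index}): {score} ~ {raw}/255")
```

Lines that do not match the pattern, such as the average time, are simply skipped.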
The story behind this work
Many years ago, when I was working on the operating system for the One Laptop Per Child project, I became painfully aware of the problems caused by IP vendors not providing the source code for their drivers.
This and other instances of the same problem motivated me to help out on the Panfrost project, writing a free software driver for the Mali GPUs by Arm. That gave me a great opportunity to learn about reverse engineering from Alyssa Rosenzweig.
Nowadays the Mesa project contains drivers for most GPUs out there, some maintained by the same companies that develop the IP, some by their customers and hobbyists alike. So the problem of the availability of source code for GPU drivers is pretty much solved.
Except that, with the advent of machine learning at the edge, we are reliving this problem with the drivers that accelerate those workloads on NPUs, TPUs, etc.
Vivante's NPU IP is very closely based on their GPUs. And it is pretty popular, being included in SoCs by Amlogic, Rockchip, NXP, Broadcom and more.
We already have a reasonably complete driver (Etnaviv) for their GPU IP, so I started by looking at what the differences were and how much of the existing userspace and kernel drivers we could reuse.
The kernel driver works with almost no changes; it just took me some time to implement the hardware initialization properly upstream. As of Linux 6.3 the driver loads correctly on Khadas' VIM3, but for a chance at decent performance this patch is needed:
Due to its experimental status, the NPU is disabled by default in the device tree. To enable it, add the following to arch/arm64/boot/dts/amlogic/meson-g12b-a311d-khadas-vim3.dts:
status = "okay";
Enabling Etnaviv for other boards with this IP should be relatively straightforward: it comes down to describing how the hardware is initialized, which can be worked out by inspecting the downstream kernel sources for the board in question.
Mesa has seen most of the work, as this IP is compute-only and the userspace driver only targeted OpenGL ES.
The first step was wiring up the existing driver to Mesa's OpenCL implementation, and then I focused on getting the simplest kernel to run correctly. For this and all the subsequent work, the reverse-engineering tools used by the Etnaviv community have been of great use.
Once I didn't have a day job getting in the way anymore, I started adding the features needed to run the label_image example in TensorFlow Lite.
And eventually we got to this point. 1.2 seconds to run that inference is a lot of time, so the next step for me will be to figure out what the biggest causes of the low performance are.
With the goal in mind of providing a free software driver that companies can use to run inference on their products containing Vivante's NPU IP, I need those tasks to perform within at least the same order of magnitude as the closed source solution provided by Vivante.
Right now Etnaviv is about 1.6 times as slow as Vivante's driver when running label_image with the OpenCL delegate. The solution that they provide, however, uses a special delegate that makes better use of their hardware and is far faster than either.
Current performance situation (label_image):
- OpenCL delegate with Etnaviv: 1261.99 ms
- OpenCL delegate with Galcore: 787.733 ms
- CPU: 149.19 ms
- TIM-VX delegate: 2.567 ms (!)
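To put those numbers in perspective, here is a small Python sketch that works out how many times slower Etnaviv currently is than each of the other options. It only uses the timings listed above. The `slowdown` helper is a name made up for this illustration.

```python
# Average label_image times in milliseconds, copied from the list above.
timings_ms = {
    "OpenCL delegate with Etnaviv": 1261.99,
    "OpenCL delegate with Galcore": 787.733,
    "CPU": 149.19,
    "TIM-VX delegate": 2.567,
}


def slowdown(name, baseline):
    """Return how many times slower `name` is than `baseline`."""
    return timings_ms[name] / timings_ms[baseline]


etnaviv = "OpenCL delegate with Etnaviv"
for baseline in timings_ms:
    if baseline != etnaviv:
        ratio = slowdown(etnaviv, baseline)
        print(f"Etnaviv vs {baseline}: {ratio:.1f}x slower")
```

The gap to Galcore with the same OpenCL delegate is the one to close first. The gap to the TIM-VX delegate shows how much of the hardware the OpenCL path leaves unused.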
The plan is to first see why we are slower with the OpenCL delegate and fix it, and afterwards the real fun stuff will start: seeing how we can use more of the HW capabilities through the OpenCL API and with upstream TensorFlow Lite.
In parallel to upstreaming, I will be looking at what is needed to get closer to the performance of the closed source driver for ML acceleration.
There are a lot of people besides the ones mentioned above who have made this possible. Some of them are:
- The Mesa community, for having put together such a great framework for GPU drivers. Their CI system has been great to track progress and avoid regressions.
- The Etnaviv community, for all the previous reverse engineering work that documented most of the OpenCL specifics, for a great pair of drivers to base the work on, and for the very useful tooling around them.
- And the Linux kernel community, which made it so easy to get the hardware recognized and the Etnaviv driver probed on it.
Last but not least, there are some individuals to whom I was able to turn when I needed help:
- Christian Gmeiner (austriancoder)
- Lucas Stach (lynxeye)
- Neil Armstrong (narmstrong)
- Faith Ekstrand (gfxstrand)
- Karol Herbst (karolherbst)