During these last two weeks I have been working towards adding support for more operations and kinds of convolutions so we can run more interesting models. As a first target, I'm aiming to MobileDet, which though a bit old by now (it was introduced in 2020) is still the state of the art in object detection in mobile, used in for example Frigate NVR.
I haven't mentioned it in a few updates, but all this work keeps being sponsored by Libre Computer, who are aiming to be the first manufacturer of single board computers to provide accelerated machine learning with open source components. Check out Alta and Solitude for the first such boards in the market.
Igalia's Christian Gmeiner has been giving me great feedback at the merge request, and as part of that I submitted a patch to the kernel to retrieve some parameters that are needed when programming the hardware and that are best not left hardcoded.
This means that upstreaming to Mesa loses some urgency as we are anyway going to have to wait for the merge window for 6.8 opens, after 6.7 final is out.
Convolutions with 5x5 weights
Until now I had implemented support only for weights with dimensions 1x1 (aka pointwise convolutions) and 3x3 (the most common by far). Some of the convolutions in MobileDet use 5x5 weight tensors though, so I had to implement support for them. It was a matter of adding some extra complexity to the code that compresses the weight tensors in the format that the hardware expects.
I implemented this for all kind of supported convolutions: depthwise, strided, with padding, etc.
I observed that the vendor blob implements addition operations with convolution jobs, so I looked deeper and saw that it was implementing the addition of two input tensors by placing them as the two channels of a single tensor, then passing them through a 1x1 convolution with a specially crafted weight tensor and bias vector.
This is working with hardcoded values for some specific input image dimensions, but I still need to gather more data so I can come up with a generic expression.
One more missing operation commonly used in models for mobile is pooling, in its different kinds: average, max, etc.
The blob implements these operations on the programmable core, with CL-like kernels.
So I undusted the work that I did in the first half of 2023 and added code to Teflon for passing these operations to the Gallium drivers. Then added a new kind of operation to the ML backend in Etnaviv to make use of the programmable core.
Things work fine, even if for now I am storing the kernel machine code in a blob inside the C code. The next step will be to implement the kernel in NIR and generate the machine code using the existing compiler in Etnaviv.
With this piece of work, we are now able to use all the hardware units in the NPU, and even if the programmable core in this configuration is really underpowered, it will allow us to keep the model in memory close to the NPU, instead of having to ping-pong between the NPU and CPU domains.
A new test suite
With new operations and kinds of convolutions being added, I was starting to have trouble testing all the possible combinations in a practical way, as the test suite that I had was taking more than 20 minutes for a full run.
To get around that, I reimplemented the tests in C++ with GoogleTest, which is supported by Emma Anholt's deqp-runner and will allow me to run the tests in parallel, making full use of the CPU cores in the board.
That made a big difference, but with so many testing combinations being added (+3000 as of now), it was still not fast enough for me. So I remembered an approach that we were considering to speed up execution of Vulkan and OpenGL conformance tests: caching the golden images that are used to compare and check that the output from the hardware is correct.
With that, the bottleneck is the network, as I store the cache in NFS, and I can run the full test suite in less than 3 minutes.
Only that I started finding some tests that were randomly failing, specially when the cache of test results had been already brought into the filesystem cache in the board. After a lot of scratching my head, I came to realize that the Etnaviv kernel driver was trying to submit up to 4 jobs at the same time to the hardware, if userspace was fast enough to enqueue that many jobs before the previous ones had finished.
There is a kernel module parameter to set the number of jobs that are submitted to the hardware at any given point, and setting that to 1 took me back to rock solid test results, which is an absolute need for keeping the driver author's sanity.
I have quickly added support for a lot of new operations and parameter combinations and the code is not as clean as I would like, in part due to the need for some refactoring.
So in the next days I will be investing some time in cleaning things up, and afterwards will move to more operations in MobileDet.