Monday, June 26, 2023

Etnaviv NPU update 3: Deeper into the convolution units

What a couple of weeks!

Programming of the convolution units

Picking up where I left off in the last update, I made progress in understanding the format of the buffer that contains the weights and biases.

The bit of knowledge that made a difference was realising that the format is optimized so that each NN core can efficiently access the portion of it that it needs, without having to do any parsing or decoding. Knowing that also helped in guessing what some fields in the parameter structure are for.

With that, I was able to correctly run a convolution on a small matrix with arbitrary weights and biases.

The biggest roadblock in this area currently is understanding how I need to program the output unit in the NN so the output data is in the desired scale. There is a series of fields that influence how the output values are processed before being placed in the output buffer, and I don't really know how they work yet. They are called post_shift and post_mult, and the first correlates moderately (r=0.78) with the quantization scale of the output. I know that the post_shift field does what its name says, shifting to the right, but to work out what value I need in each situation I feel I need to understand better how the hardware works and what the values could be at the end of the convolution, before they reach the output unit. I will be reading a bunch of research papers about NN-accelerating silicon in the summer.
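For reference, a common way for integer-only pipelines to rescale a 32-bit accumulator is to multiply by an integer and then shift right with rounding. The sketch below shows that generic scheme only. It is not a description of what the Vivante output unit does, and mapping the names post_mult and post_shift onto `mult` and `shift` is an assumption. The 15-bit multiplier width is likewise made up for illustration.

```python
def to_mult_shift(real_scale, mult_bits=15):
    """Approximate 0 < real_scale < 1 as mult / 2**shift with an integer mult.

    mult_bits is a hypothetical multiplier width, not a known hardware value.
    """
    if not 0.0 < real_scale < 1.0:
        raise ValueError("real_scale must be in (0, 1)")
    shift = 0
    # Keep increasing the shift while the multiplier still fits in mult_bits.
    while real_scale * (1 << (shift + 1)) < (1 << mult_bits):
        shift += 1
    mult = round(real_scale * (1 << shift))
    return mult, shift


def requantize(acc, mult, shift, zero_point=0, lo=-128, hi=127):
    """Rescale an accumulator to int8: round(acc * mult / 2**shift) + zero_point."""
    prod = acc * mult
    if shift > 0:
        # Add half of the divisor before shifting so the result rounds to nearest.
        prod = (prod + (1 << (shift - 1))) >> shift
    # Saturate to the output range.
    return max(lo, min(hi, prod + zero_point))


# Illustrative numbers: an effective scale of 0.002 applied to an accumulator of 10000.
# 10000 * 0.002 is 20, and the integer path lands on the same value.
mult, shift = to_mult_shift(0.002)
result = requantize(10000, mult, shift)
```

In this scheme the shift grows as the effective scale shrinks, which would be consistent with a shift field tracking the output quantization scale, but that is speculation rather than something measured here.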

That said, replacing the OpenCL kernels in TensorFlow Lite's GPU delegate that do convolutions with the fixed-function units turned out to be a worse idea than I initially thought. This is because that delegate is completely oriented towards float-first hardware such as GPUs, whereas this accelerator is integer-only.

A consequence of this is that TFLite inserts a dequantize operation at the start of the graph and a quantize operation at the end, to match the desired input and output formats of a fully quantized model while feeding floats to the GPU. We need integers, so we would have to quantize again right after TFLite's dequantization, and dequantize before its final quantization. Also, the other operations in the graph expect floats as well... This is certainly the wrong path to take for performance on a bandwidth-constrained device, as all embedded boards are, so I had to go back to the drawing board.
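To make the redundancy concrete, here is a minimal sketch of affine int8 quantization in the form used by TFLite's quantization scheme, where real = scale * (q - zero_point). The scale, zero point and tensor values are made up for illustration. Dequantizing and immediately quantizing again just returns the original integers, so on an integer-only unit the round trip would be pure overhead.

```python
def dequantize(q, scale, zero_point):
    """int8 values -> floats, using real = scale * (q - zero_point)."""
    return [scale * (v - zero_point) for v in q]


def quantize(x, scale, zero_point, lo=-128, hi=127):
    """floats -> int8 values, rounding to nearest and saturating."""
    return [max(lo, min(hi, round(v / scale) + zero_point)) for v in x]


scale, zero_point = 0.05, -3          # illustrative quantization parameters
tensor_i8 = [-128, -3, 0, 42, 127]    # what a fully quantized model provides

as_float = dequantize(tensor_i8, scale, zero_point)   # what the float path inserts
back_to_i8 = quantize(as_float, scale, zero_point)    # what an integer unit would need

# The round trip reproduces the input exactly.
round_trip_is_identity = back_to_i8 == tensor_i8
```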

A new Gallium frontend: Teflon

If TF Lite's GPU delegate is such a bad match for this HW, what can we do to run inferences at reasonable speeds? The same thing that VeriSilicon did: write our own delegate.

TF Lite's operation description matches relatively well what we currently know of the configuration of the NN units. So we will not need to write complex shaders to implement the operations, but "just" translate the description of the operation to the HW configuration.

Of course, there is no HW with fixed-function units that accelerate all the operations built into TF Lite, or even all those that the most commonly used models contain. VeriSilicon's delegate deals with that by having a library of optimized OpenCL kernels that run on their programmable shader core(s).

But we want to avoid getting into the business of writing dozens of kernels that will need to be tweaked and made more complex so they run efficiently on other NPUs out there.

Fortunately, the delegate infrastructure in TF Lite is designed for this very scenario of imperfect HW: we can have a simple delegate that implements the operations supported by the HW, and the rest will execute in other delegates based on their capabilities.
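The partitioning idea can be sketched as follows. This is not TF Lite's actual API or algorithm: the `SUPPORTED` set, the `partition` function and the example graph are hypothetical stand-ins that only illustrate how a linear list of operations could be split into runs that a delegate claims and runs that fall back to something else.

```python
from itertools import groupby

# Hypothetical set of operations that the NPU delegate would claim.
SUPPORTED = {"CONV_2D", "DEPTHWISE_CONV_2D"}


def partition(ops, supported=SUPPORTED):
    """Group consecutive ops into (target, [ops]) runs."""
    return [
        ("npu" if on_npu else "fallback", list(run))
        for on_npu, run in groupby(ops, key=lambda op: op in supported)
    ]


# Illustrative linear graph, in execution order.
graph = ["CONV_2D", "DEPTHWISE_CONV_2D", "SOFTMAX", "CONV_2D", "RESHAPE", "LOGISTIC"]
parts = partition(graph)

# Each boundary between runs is a hand-over between delegates.
handovers = len(parts) - 1
```

Every boundary between runs is a point where tensors have to be handed from one executor to another, so the fewer and longer the runs, the less hand-over work there is.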

How fast that will be is a big unknown right now, as switching between delegates will have a cost in terms of synchronization and data sharing. But that is something we can probably improve in the TF Lite code base, as the kernel already has all the mechanisms for efficient synchronization and data sharing.

Another possibility that the TF Lite delegate mechanism gives us is offloading the operations we don't support to a different delegate that can accelerate them. For example, in the case of a board with an Amlogic A311D or S905D3, we could use the GPU delegate to run those operations on its Mali GPU, via the OpenCL driver that Alyssa is writing in Mesa.

And if that is still slower than the proprietary stack, one could always write an optimized kernel in NIR to run on the programmable core in the Vivante NPU. That is the beauty of free software: we can address the needs we have ourselves and, importantly, do it by pooling work with others!

Because this frontend is implemented in terms of Gallium, we leverage the infrastructure in there for memory management, synchronization and execution. I think this will work well for adding support to other NN engines such as those from Rockchip, Cadence, Mediatek, etc.

Next steps

I need to crack the nut of the post-processing of the raw output so it is in the expected scale, and afterwards I will be looking at handling multiple feature maps (kernel z > 1).

After that I don't see much else in the way of running convolutions as expected by TF Lite, so hopefully I will be running some models and measuring the performance. I expect that we will want to do the same for accelerating tensor operations with the TP units. And we will probably want to take a look at using the SRAM to reduce bandwidth and memory access latency. That is still some way off though, and the summer is just starting!
