Monday, June 26, 2023

Etnaviv NPU update 3: Deeper into the convolution units

What a couple of weeks!

Programming of the convolution units

Picking up where I left off in the last update, I made progress in understanding the format of the buffer that contains the weights and biases.

The bit of knowledge that made a difference was realising that the format is optimized so that each NN core can efficiently access the portion of it that it needs, without having to do any parsing or decoding. Knowing that also helped in guessing what some fields in the parameter structure are for.

With that, I was able to correctly run a convolution on a small matrix with arbitrary weights and biases.
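To check what the hardware writes for a small case like this, a CPU reference is handy. The following is a minimal sketch, not code from the driver: it assumes a single input channel, a single kernel, stride 1, no padding and the usual zero-point-corrected integer accumulation. The name conv2d_ref and its parameters are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Reference 2D convolution on the CPU (hypothetical helper, not driver code).
 * Assumes one input channel, one kernel, stride 1 and no padding, so the
 * output holds (in_w - k_w + 1) x (in_h - k_h + 1) raw int32 accumulators. */
static void conv2d_ref(const uint8_t *in, int in_w, int in_h,
                       const uint8_t *kernel, int k_w, int k_h,
                       int32_t bias, int32_t in_zero_point,
                       int32_t kernel_zero_point, int32_t *out)
{
   assert(k_w > 0 && k_h > 0 && in_w >= k_w && in_h >= k_h);

   int out_w = in_w - k_w + 1;
   int out_h = in_h - k_h + 1;

   for (int oy = 0; oy < out_h; oy++) {
      for (int ox = 0; ox < out_w; ox++) {
         /* Each output element starts from the bias. */
         int32_t acc = bias;
         for (int ky = 0; ky < k_h; ky++) {
            for (int kx = 0; kx < k_w; kx++) {
               int32_t i = in[(oy + ky) * in_w + (ox + kx)] - in_zero_point;
               int32_t w = kernel[ky * k_w + kx] - kernel_zero_point;
               acc += i * w;
            }
         }
         out[oy * out_w + ox] = acc;
      }
   }
}
```

The result is the raw accumulator, before any scaling back to 8 bits, which makes it a convenient baseline when studying what the output unit does afterwards.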

The biggest roadblock in this area currently is understanding how I need to program the output unit in the NN so the output data is in the desired scale. There is a series of fields that influence how the output values are processed before being placed in the output buffer, and I don't really know how they work yet. They are called post_shift and post_mult, and the first correlates moderately (r=0.78) with the quantization scale of the output. I know that the post_shift field does what its name says, shifting to the right, but to understand what value I need in each situation I feel I need to understand better how the hardware works and what the values could look like at the end of the convolution, before they reach the output unit. I will be reading a bunch of research papers about NN-accelerating silicon over the summer.
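To make the question concrete, here is one way such an output stage could work. This is purely a guess for illustration and not confirmed hardware behaviour: it assumes the raw accumulator is multiplied by post_mult, shifted right by post_shift with rounding, offset by the output zero point and saturated to 8 bits. The function name and the 16-bit width chosen for post_mult are assumptions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the output stage, NOT confirmed hardware behaviour:
 * scale a raw int32 accumulator by post_mult / 2^post_shift, round to
 * nearest, add the output zero point and saturate to uint8.
 * The 16-bit width of post_mult is an arbitrary choice for this sketch. */
static uint8_t requantize_guess(int32_t acc, uint16_t post_mult,
                                uint32_t post_shift, int32_t out_zero_point)
{
   assert(post_shift < 32);

   int64_t v = (int64_t)acc * (int64_t)post_mult;
   if (post_shift > 0) {
      /* Add half of the divisor before shifting so the result is rounded.
       * Right-shifting a negative value is implementation-defined in C;
       * this relies on the arithmetic shift that common compilers use. */
      v = (v + ((int64_t)1 << (post_shift - 1))) >> post_shift;
   }

   v += out_zero_point;
   if (v < 0)
      v = 0;
   if (v > 255)
      v = 255;
   return (uint8_t)v;
}
```

Under this model, raising post_shift by one halves the effective output scale, which would be consistent with the field correlating with the quantization scale, but the real unit may well behave differently.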

That said, replacing the OpenCL kernels in TensorFlow Lite's GPU delegate that do convolutions with the fixed units turned out to be a worse idea than I initially thought. This is because that delegate is completely oriented towards float-first hardware such as GPUs and this accelerator is integer only.

A consequence of this is that TFLite inserts a dequantize operation at the start of the graph and a quantize at the end, to match the desired input and output formats of a fully quantized model while feeding floats to the GPU. We need integers, so we would have to quantize after TFLite's dequantization and vice versa. Also, the other operations in the graph expect floats as well... This is certainly the wrong path to take for performance in a bandwidth-constrained device, as all embedded boards are, so I had to go back to the drawing board.
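The round trip that this would force can be sketched as follows. Fully quantized TF Lite models use affine quantization, real = scale * (quantized - zero_point); the helper names below are made up for this illustration and are not TF Lite functions.

```c
#include <assert.h>
#include <stdint.h>

/* Affine dequantization: real = scale * (quantized - zero_point). */
static float dequantize_u8(uint8_t q, float scale, int32_t zero_point)
{
   return scale * (float)((int32_t)q - zero_point);
}

/* Inverse operation: divide by the scale, round, add the zero point and
 * saturate to the uint8 range. */
static uint8_t quantize_u8(float real, float scale, int32_t zero_point)
{
   assert(scale > 0.0f);

   float scaled = real / scale;
   /* Round half away from zero without pulling in libm. */
   int32_t rounded = (int32_t)(scaled + (scaled >= 0.0f ? 0.5f : -0.5f));
   int32_t q = rounded + zero_point;
   if (q < 0)
      q = 0;
   if (q > 255)
      q = 255;
   return (uint8_t)q;
}
```

Running quantize_u8(dequantize_u8(q, s, z), s, z) just gives back q, so on an integer-only accelerator the pair is pure overhead: every tensor would be expanded to 32-bit floats and squeezed back to 8 bits without changing the data.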

A new Gallium frontend: Teflon

If TF Lite's GPU delegate is such a bad match for this HW, what can we do to run inferences at reasonable speeds? The same thing that VeriSilicon did: write our own delegate.

TF Lite's operation description matches relatively well what we currently know of the configuration of the NN units. So we will not need to write complex shaders to implement the operations, but "just" translate the description of the operation to the HW configuration.

Of course, there is no HW with fixed-function units that accelerate all the operations built into TF Lite, or even all those that the most commonly used models contain. VeriSilicon's delegate deals with that by having a library of optimized OpenCL kernels that run on their programmable shader core(s).

But we want to avoid getting into the business of writing dozens of kernels that will need to be tweaked and made more complex so they run efficiently on other NPUs out there.

Fortunately, the delegate infrastructure in TF Lite is designed for this very scenario of imperfect HW and we can have a simple delegate that will implement the operations supported by the HW and the rest will execute in other delegates based on their capabilities.

How fast that will be is a big unknown right now, as switching between delegates will have a cost in terms of synchronization and data sharing, but that is something that we can probably improve in the TF Lite code base, as the kernel already has all the mechanisms for efficient synchronization and data sharing.
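A rough way to reason about that switching cost is to count how many contiguous runs of operations end up on each side. The sketch below is not TF Lite code: the op_kind enum, the npu_supports() policy and the treatment of the graph as a flat list are all simplifying assumptions (real graphs are DAGs and TF Lite does its own partitioning).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical operation kinds; these are not TF Lite's builtin op codes. */
enum op_kind { OP_CONV_2D, OP_DEPTHWISE_CONV_2D, OP_ADD, OP_SOFTMAX, OP_RESHAPE };

/* Assumed policy for this sketch: only plain convolutions run on the NPU. */
static bool npu_supports(enum op_kind op)
{
   return op == OP_CONV_2D;
}

/* Counts the contiguous runs ("partitions") of operations that execute in
 * the same place. Every boundary between two partitions is a point where
 * execution moves between delegates and data has to be shared. */
static size_t count_partitions(const enum op_kind *ops, size_t n)
{
   assert(ops != NULL || n == 0);

   size_t partitions = 0;
   for (size_t i = 0; i < n; i++) {
      if (i == 0 || npu_supports(ops[i]) != npu_supports(ops[i - 1]))
         partitions++;
   }
   return partitions;
}
```

A graph that alternates between supported and unsupported operations produces many small partitions, which is the worst case for synchronization overhead; long runs of convolutions are the best case.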

Another possibility that the TF Lite delegate mechanism gives us is offloading the operations we don't support to a different delegate that can accelerate them. For example, in the case of a board with Amlogic A311D or S905D3, we could use the GPU delegate to run those operations on the Mali GPU on it, via the OpenCL driver that Alyssa is writing in Mesa.

And if that is still slower than with the proprietary stack, one could always write an optimized kernel in NIR to run on the programmable core in the Vivante NPU. That is the beauty of free software: we can address the needs we have ourselves and, importantly, do so by pooling work with others!

Because this frontend is implemented in terms of Gallium, we leverage the infrastructure in there for memory management, synchronization and execution. I think this will work well for adding support to other NN engines such as those from Rockchip, Cadence, Mediatek, etc.

Next steps

I need to crack the nut of the post-processing of the raw output so it is in the expected scale, and afterwards I will be looking at handling multiple feature maps (kernel z > 1).

After that I don't see much else in the way of running convolutions as expected by TF Lite, so hopefully I will be running some models and measuring the performance. I expect that we will want to do the same for accelerating tensor operations with the TP units. And we will probably want to take a look at using the SRAM to reduce bandwidth and memory access latency. That is still some way off though, and the summer is just starting!

Saturday, June 10, 2023

Etnaviv NPU update 2: Diving into the convolution units

In the previous update I explained that the programmable core in this NPU (VIPNano-QI) is too slow to run inference workloads substantially faster than the CPUs. The vendor stack achieves acceptable inference rates by running most of the work on fixed-function units that can perform different kinds of convolutions and transformations of tensors.

Most of the work is done by the convolution units that VeriSilicon calls NN cores, so this is what I have been focusing on at this stage. I think that even if we still do all tensor transformation on the programmable core, by using the NN units we could already achieve usable performance.

By looking around in the ioctls that VeriSilicon's userspace stack sends to the kernel, it was clear that in the NN jobs there was little more than a pointer to a structure that configures the NN fixed-function units. Luckily I didn't need to reverse engineer it from zero, as VeriSilicon's out-of-tree kernel driver is GPL and contains two instances of programming this HW with a trivial job (a 2x2x1 kernel with a single bias value).

Took some boring work to translate what the code does to a C struct, but this was the initial one:

struct etna_nn_params {
   uint32_t op_type : 1; /* conv: 0 fully_connected: 1 */
   uint32_t no_z_offset : 1;
   uint32_t kernel_x_size : 4;
   uint32_t kernel_z_size : 14; /* & 0x3FFF */
   uint32_t kernels_per_core : 7;
   uint32_t zero1 : 2;
   uint32_t zero2 : 1;
   uint32_t zero3 : 1;
   uint32_t nn_layer_flush : 1;

   uint32_t kernel_data_type : 2; /* UINT8 0x2 INT8 0x0 */
   uint32_t in_image_data_type : 2; /* UINT8 0x2 INT8 0x0 */
   uint32_t out_image_data_type : 2; /* UINT8 0x2 INT8 0x0 */
   uint32_t in_image_x_size : 13;
   uint32_t in_image_y_size : 13;

   uint32_t zero4 : 3;
   uint32_t zero5 : 3;
   uint32_t unused0 : 1;
   uint32_t zero6 : 16;
   uint32_t zero7 : 1;
   uint32_t enable_relu : 1;
   uint32_t zero9 : 1;
   uint32_t post_shift : 6;

   uint32_t unused1 : 2;
   uint32_t zero10 : 1;
   uint32_t zero11 : 1;
   uint32_t unused2 : 2;
   uint32_t out_image_x_size : 13;
   uint32_t out_image_y_size : 13;

   uint32_t out_image_z_size : 14;
   uint32_t zero12 : 2; /* 0x0 */
   uint32_t zero13 : 1; /* (0 >> 3) & 0x1 */
   uint32_t zero14 : 1; /* (0 >> 3) & 0x1 */
   uint32_t unk0 : 7;  /* 1 */
   uint32_t unk1 : 7;  /* 1 */

   uint32_t kernel_address : 26; /* >> 6 */
   uint32_t kernel_z_size2 : 6; /* >> 14 */

   uint32_t in_image_address;

   uint32_t out_image_address;

   uint32_t unused3 : 12;
   uint32_t kernel_y_size : 4;
   uint32_t out_image_y_size2 : 16;  /* maybe stride? */

   uint32_t zero15;

   uint32_t zero16;

   uint32_t zero17;

   uint32_t kernel_cache_end_address;

   uint32_t zero19;

   uint32_t image_end_address;

   uint32_t zero20 : 2;
   uint32_t zero21 : 16;
   uint32_t kernel_data_type_bit_2 : 1;
   uint32_t in_image_data_type_bit_2 : 1;
   uint32_t out_image_data_type_bit_2 : 1;
   uint32_t zero22 : 6;
   uint32_t post_shift_bit_5_6 : 2;
   uint32_t unused4 : 3;

   uint32_t in_image_stride : 16;
   uint32_t in_image_y_size2 : 16; /* again? */

   uint32_t out_image_stride : 16;
   uint32_t unused5 : 8;
   uint32_t zero23 : 8;

   uint32_t zero24 : 26; /* 0 >> 6 */
   uint32_t zero25 : 1;
   uint32_t zero26 : 1;
   uint32_t zero27 : 1; /* 0 >> 4 */
   uint32_t zero28 : 1; /* 0 >> 4 */
   uint32_t zero29 : 1;
   uint32_t kernel_data_type_bit_3 : 1;

   uint32_t unk2 : 26; /* 0xFFFFFFFF >> 6 */
   uint32_t unused6 : 4;
   uint32_t zero30 : 1;
   uint32_t in_image_data_type_bit_3 : 1;

   uint32_t zero31 : 26; /* 0 >> 6 */
   uint32_t out_image_data_type_bit_3 : 1;
   uint32_t unused7 : 5;

   uint32_t unk3 : 26; /* 0xFFFFFFFF >> 6 */
   uint32_t unused8 : 6;

   uint32_t coef_zero_point : 8;
   uint32_t out_zero_point : 8;
   uint32_t zero32 : 1;
   uint32_t zero33 : 1;
   uint32_t zero34 : 8;
   uint32_t unused9 : 6;

   uint32_t zero35;

   uint32_t zero36 : 4;
   uint32_t zero37 : 28;  /* 0 >> 4 */

   uint32_t zero38 : 4;
   uint32_t zero39 : 28;  /* 0 >> 4 */

   uint32_t further1;
   uint32_t further2;
   uint32_t further3;
   uint32_t further4;
   uint32_t further5;
   uint32_t further6;
   uint32_t further7;
   uint32_t further8;
};

As you can see, there are a lot of "zero" and "unused" fields. I think most of them will actually be used for something, as HW engineers don't tend to like wasting bits. By adding instrumentation for dumping these structs to the reverse engineering tooling, I will get a better idea of what each field means and does.
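Some fields hold only part of a value, as the comments in the struct hint. The helpers below are a small sketch of how I read those comments: the kernel Z size seems to be split into its low 14 bits (kernel_z_size) and the bits above them (kernel_z_size2), and kernel_address stores the address shifted right by 6, which would imply 64-byte alignment. The helper names are made up and the interpretation is an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Value for the 14-bit kernel_z_size field, per the "& 0x3FFF" comment. */
static uint32_t kernel_z_size_low(uint32_t z)
{
   assert(z < (1u << 20)); /* 14 + 6 bits are available in total */
   return z & 0x3FFF;
}

/* Value for the 6-bit kernel_z_size2 field, per the ">> 14" comment. */
static uint32_t kernel_z_size_high(uint32_t z)
{
   assert(z < (1u << 20));
   return z >> 14;
}

/* Value for the 26-bit kernel_address field, per the ">> 6" comment.
 * Dropping the low 6 bits only works for 64-byte aligned addresses. */
static uint32_t kernel_address_field(uint32_t address)
{
   assert((address & 0x3F) == 0);
   return address >> 6;
}
```

For example, a Z size of 0x4001 would be stored as 1 in kernel_z_size and 1 in kernel_z_size2, and a weights buffer at 0x1000 would be stored as 0x40 in kernel_address.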

I got GPU hangs the first time that I submitted a job with the same configuration as the kernel's trivial reset job, and looking further showed that the buffer that contains the convolution filters must follow a specific format.

After looking again at the kernel driver sources, I used the same kernel/filter buffer as it does and the GPU didn't hang anymore. The weights in that kernel were all zeroes, and indeed my output buffer was now full of zeroes.

Then I tried to put my weights into the format that I inferred from the kernel driver source code, but I wasn't able to get any job to run to completion without hangs, and the output buffer was unchanged.

To figure out what I was missing about how the weights (and the biases) need to be placed in the buffer, I added code to the reverse engineering tooling to dump the weights buffer. With that buffer and after playing some with the sizes of the output, input and kernel buffers, I finally got a job to run with non-zero weights.

What I am doing right now is slowly zeroing out the weights buffer to figure out which bits are data, which are control, and what effect the changes have on the output.
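That kind of probing can be automated along these lines. This is only a sketch of the idea and not the actual reverse engineering tooling: run_job_fn is a hypothetical hook that would submit a job to the NPU, fake_job is a stand-in so the code can run without hardware, and it works byte by byte rather than bit by bit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical hook that submits a job with the given weights buffer and
 * fills in the output buffer; real tooling would talk to the NPU here. */
typedef void (*run_job_fn)(const uint8_t *weights, size_t weights_size,
                           uint8_t *output, size_t output_size);

enum { MAX_WEIGHTS = 256, MAX_OUTPUT = 256 };

/* Zeroes one weights byte at a time and records in affects[i] whether the
 * output differs from the baseline run. Returns how many bytes changed the
 * output. Bytes that are already zero can never show an effect. */
static size_t probe_weights(run_job_fn run, const uint8_t *weights,
                            size_t weights_size, size_t output_size,
                            uint8_t *affects)
{
   uint8_t scratch[MAX_WEIGHTS];
   uint8_t baseline[MAX_OUTPUT];
   uint8_t output[MAX_OUTPUT];
   size_t count = 0;

   assert(weights_size <= MAX_WEIGHTS && output_size <= MAX_OUTPUT);

   run(weights, weights_size, baseline, output_size);

   for (size_t i = 0; i < weights_size; i++) {
      memcpy(scratch, weights, weights_size);
      scratch[i] = 0;
      run(scratch, weights_size, output, output_size);
      affects[i] = memcmp(baseline, output, output_size) != 0;
      count += affects[i];
   }
   return count;
}

/* Stand-in for the hardware so the sketch runs anywhere: the single output
 * byte depends only on weights[1] and weights[3]. */
static void fake_job(const uint8_t *weights, size_t weights_size,
                     uint8_t *output, size_t output_size)
{
   assert(weights_size >= 4 && output_size >= 1);
   memset(output, 0, output_size);
   output[0] = (uint8_t)(weights[1] + weights[3]);
}
```

On real hardware a bad control value is more likely to hang the job than to change the output, so a real version would also need a timeout and a way to reset the device between runs.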

Hope that by the next update I will have documented the format of the weights buffer and will be able to run at least one kind of convolution!