Thursday, August 24, 2023

Etnaviv NPU update 5: Harder convolutions!

Progress

Managed to squeeze some time between holidaying to hack on the NPU driver and got something out of it.

Since the last update I have:

  • implemented support for strided convolutions with more than one input channel, and
  • implemented support for more than one output channel, though for now only with a single input channel.

Next steps are to support convolutions with multiple input and output channels, and padding. Then I will see what is still missing so we can run MobileNet v1, and check the performance when using the NN units while doing the rest on the CPU.
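For reference, this is what "multiple input and output channels" means in tensor terms. The sketch below is illustrative pure Python, not the driver code (which programs the NN units instead of computing on the CPU); all names are mine:

```python
def conv2d_channels(inputs, weights):
    """Multi-channel convolution sketch (stride 1, no padding).

    inputs:  [in_ch][h][w]
    weights: [out_ch][in_ch][k][k]
    Each output channel sums one per-input-channel convolution, which
    is the combination the driver still needs to support.
    """
    out_ch, in_ch = len(weights), len(inputs)
    h, w, k = len(inputs[0]), len(inputs[0][0]), len(weights[0][0])
    oh, ow = h - k + 1, w - k + 1
    out = [[[0] * ow for _ in range(oh)] for _ in range(out_ch)]
    for o in range(out_ch):
        for y in range(oh):
            for x in range(ow):
                acc = 0
                for c in range(in_ch):
                    for ky in range(k):
                        for kx in range(k):
                            acc += inputs[c][y + ky][x + kx] * weights[o][c][ky][kx]
                out[o][y][x] = acc
    return out
```
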

As a reminder, I'm pushing all the code to this branch: https://gitlab.freedesktop.org/tomeu/mesa/-/commits/teflon/.

IRC channel

A bunch of us have started to gather in the #ml-mainline IRC channel on OFTC to discuss doing accelerated ML with mainline kernels on embedded hardware.

For those of you who may not have an IRC bouncer set up yet, you can easily join with the web chat UI, but in case others aren't in front of their keyboards when you type your question, I recommend using element.io with the Matrix IRC bridge:

https://blog.christophersmart.com/2022/03/21/joining-a-bridged-irc-network-on-element-matrix/

Embedded recipes

I have been invited to give a talk about all this ML-with-mainline effort at Embedded Recipes 2023, in Paris on 28-29 September. Slides and a recording will be published after the conference ends.

Sponsor

Last but not least, if I am able to invest so much effort in this, it is because the folks at LibreComputer have been supporting me financially these last couple of months.

Thanks to Da Xue for his support; it is greatly appreciated! It is awesome to see SBC vendors investing in the Linux upstream ecosystem.

Monday, August 7, 2023

Etnaviv NPU update 4: It's convoluting!

Summer has kept me busy with holidays, but I have managed to find a bit of time to keep hacking on the driver for the VeriSilicon NPU since the last update.

TL;DR

The issue with placing the output to the right scale is solved now, and simple convolution operations are working just fine.

3D tensors are now supported as inputs, and we support strided convolutions as well, but only on 2D inputs for now.

The test workloads are now running fast and stably, so I feel I have pretty solid ground beneath my feet.

There are three features left before I can run a real, full-fledged commercially interesting model:

  1. 3D inputs for strided convolutions
  2. Multiple output channels
  3. Padded convolutions

Re-quantization

The last update in this blog left off at my attempt at figuring out how the raw convolution outputs had to be processed, via fields called post_shift and post_multiplier, so I could get the right values in the final output.

After spending more time than I probably should have in a spreadsheet trying to find correlations, some desperate googling brought me to research papers about optimizing quantization operations on integer-only hardware.

These explain the meaning of the shift and multiplier: they are the operations we can use to approximate a floating point division on integer-only hardware.

But to actually understand what the hardware was trying to do with them, it was useful to look at the QNNPACK implementation of requantization.
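My current understanding of the scheme, as a sketch. The field names follow the command-stream dump below; the exact rounding behavior is my assumption, not something confirmed by hardware documentation:

```python
def requantize(acc, post_multiplier, post_shift, out_zero_point):
    """Scale a raw int32 accumulator back into the uint8 output range.

    The real (floating point) scale factor is approximated as
    post_multiplier * 2**-post_shift, so the whole operation stays in
    integer arithmetic.
    """
    product = acc * post_multiplier
    # Rounding right shift: add half of the divisor before shifting.
    rounding = 1 << (post_shift - 1) if post_shift > 0 else 0
    scaled = (product + rounding) >> post_shift
    # Re-center on the output zero point and clamp to the uint8 range.
    return max(0, min(255, scaled + out_zero_point))
```

In the dump below, post_shift is 23 and the multiplier bits appear to be split across the post_multiplier, post_multiplier_1_to_6 and post_multiplier_7_to_14 fields.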

3D input tensor

This was pretty much straightforward: it was basically a matter of updating the code to take the added dimension into account, and to reorder the tensor elements into the depth-first order the hardware expects.
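The reordering can be sketched like this, assuming the source layout is planar (one full 2D plane per channel) and the target stores all channel values of each pixel contiguously; this is my reading of the layout, not a verbatim copy of the driver code:

```python
def to_depth_first(tensor, height, width, depth):
    """Reorder a planar tensor (flat list laid out as [c][y][x]) into
    depth-first order (flat list laid out as [y][x][c]), so that all
    depth values for a given pixel are adjacent in memory.
    """
    out = []
    for y in range(height):
        for x in range(width):
            for c in range(depth):
                # Index into the planar source: c-th plane, row y, col x.
                out.append(tensor[c * height * width + y * width + x])
    return out
```
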

This was made much easier by some improvements to the scripts I use to observe the behavior of the closed source stack, by intercepting the communication with the kernel's GPL driver.

For example, this is the output when Mesa has generated a cmd stream that is functionally equivalent to what the blob sends to the kernel:

+ diff -u -U 100 /home/tomeu/mesa.txt /home/tomeu/galcore.txt
--- /home/tomeu/mesa.txt    2023-08-07 18:28:29.939750225 +0200
+++ /home/tomeu/galcore.txt    2023-08-07 18:28:42.116625362 +0200
@@ -1,176 +1,273 @@
 {
-    0x0801028a, /* LOAD_STATE (1) Base: 0x00A28 Size: 1 Fixp: 0 */
-    0x00000011, /*   PA.SYSTEM_MODE := PROVOKING_VERTEX_LAST=1,HALF_PIXEL_CENTER=1 */
-    0x08010e13, /* LOAD_STATE (1) Base: 0x0384C Size: 1 Fixp: 0 */
-    0x00000002, /*   GL.API_MODE := OPENCL */
+    0x00000000, /* UNKNOWN (0) */
+    0x00000000, /*  */
+    0x00000000, /* UNKNOWN (0) */
+    0x00000000, /*  */
+    0x00000000, /* UNKNOWN (0) */
+    0x00000000, /*  */
     0x00000000, /* UNKNOWN (0) */
     0x00000000, /*  */
     0x08010e4f, /* LOAD_STATE (1) Base: 0x0393C Size: 1 Fixp: 0 */
     0x00000000, /*   GL.OCB_REMAP_START := 0x0 */
     0x08010e50, /* LOAD_STATE (1) Base: 0x03940 Size: 1 Fixp: 0 */
     0x00000000, /*   GL.OCB_REMAP_END := 0x0 */
     0x08010e4c, /* LOAD_STATE (1) Base: 0x03930 Size: 1 Fixp: 0 */
     0x00000010, /*   GL.NN_CONFIG := UNK0=0x0,DISABLE_ZDPN=0,DISABLE_SWTILING=0,SMALL_BATCH=1,DDR_BURST_SIZE=0x0,UNK7=0,NN_CORE_COUNT=0x0,UNK12=0 */
     0x08010428, /* LOAD_STATE (1) Base: 0x010A0 Size: 1 Fixp: 0 */
-    0xffff3000, /*   PS.NN_INST_ADDR := *0xffff3000 */
+    0x3348e780, /*   PS.NN_INST_ADDR := *0x3348e780 */
     0x08010429, /* LOAD_STATE (1) Base: 0x010A4 Size: 1 Fixp: 0 */
     0x00000000, /*   0x010A4 */
     0x08010e03, /* LOAD_STATE (1) Base: 0x0380C Size: 1 Fixp: 0 */
     0x00000c23, /*   GL.FLUSH_CACHE := DEPTH=1,COLOR=1,TEXTURE=0,PE2D=0,TEXTUREVS=0,SHADER_L1=1,SHADER_L2=0,UNK10=1,UNK11=1,DESCRIPTOR_UNK12=0,DESCRIPTOR_UNK13=0 */
     0x08010e03, /* LOAD_STATE (1) Base: 0x0380C Size: 1 Fixp: 0 */
     0x00000c23, /*   GL.FLUSH_CACHE := DEPTH=1,COLOR=1,TEXTURE=0,PE2D=0,TEXTUREVS=0,SHADER_L1=1,SHADER_L2=0,UNK10=1,UNK11=1,DESCRIPTOR_UNK12=0,DESCRIPTOR_UNK13=0 */
     0x00000000, /* UNKNOWN (0) */
     0x00000000, /*  */
 }
 map->layer_type = 0x0;  /* (0) */
 map->no_z_offset = 0x0;  /* (0) */
 map->kernel_xy_size = 0x2;  /* (2) */
 map->kernel_z_size = 0x4;  /* (4) */
 map->kernels_per_core = 0x1;  /* (1) */
 map->pooling = 0x0;  /* (0) */
 map->pooling_xy_size = 0x1;  /* (1) */
 map->prelu = 0x0;  /* (0) */
 map->nn_layer_flush = 0x1;  /* (1) */
 map->kernel_data_type = 0x0;  /* (0) */
 map->in_image_data_type = 0x0;  /* (0) */
 map->out_image_data_type = 0x0;  /* (0) */
 map->in_image_x_size = 0x4;  /* (4) */
 map->in_image_y_size = 0x4;  /* (4) */
 map->in_image_x_offset = 0x0;  /* (0) */
 map->in_image_y_offset = 0x0;  /* (0) */
 map->unused0 = 0x0;  /* (0) */
 map->brick_mode = 0x0;  /* (0) */
 map->brick_distance = 0x0;  /* (0) */
 map->relu = 0x0;  /* (0) */
 map->unused1 = 0x0;  /* (0) */
 map->post_multiplier = 0x0;  /* (0) */
 map->post_shift = 0x17;  /* (23) */
 map->unused2 = 0x0;  /* (0) */
 map->no_flush = 0x0;  /* (0) */
 map->unused3 = 0x0;  /* (0) */
 map->out_image_x_size = 0x3;  /* (3) */
 map->out_image_y_size = 0x3;  /* (3) */
 map->out_image_z_size = 0x1;  /* (1) */
 map->rounding_mode = 0x1;  /* (1) */
 map->in_image_x_offset_bit_3 = 0x0;  /* (0) */
 map->in_image_y_offset_bit_3 = 0x0;  /* (0) */
 map->out_image_tile_x_size = 0x3;  /* (3) */
 map->out_image_tile_y_size = 0x3;  /* (3) */
-map->kernel_address = 0x3fffd00;  /* (67108096) */
+map->kernel_address = 0xcd237f;  /* (13443967) */
 map->kernel_z_size2 = 0x0;  /* (0) */
-map->in_image_address = 0xffff6000;  /* (4294926336) */
-map->out_image_address = 0xffff7000;  /* (4294930432) */
+map->in_image_address = 0x3348e240;  /* (860414528) */
+map->out_image_address = 0x89ffc500;  /* (2315240704) */
 map->image_caching_mode = 0x0;  /* (0) */
 map->kernel_caching_mode = 0x1;  /* (1) */
 map->partial_cache_data_unit = 0x0;  /* (0) */
 map->kernel_pattern_msb = 0x0;  /* (0) */
 map->kernel_y_size = 0x2;  /* (2) */
 map->out_image_y_stride = 0x3;  /* (3) */
 map->kernel_pattern_low = 0x0;  /* (0) */
 map->kernel_pattern_high = 0x0;  /* (0) */
 map->kernel_cache_start_address = 0x800;  /* (2048) */
 map->kernel_cache_end_address = 0xa00;  /* (2560) */
 map->image_start_address = 0x0;  /* (0) */
 map->image_end_address = 0x800;  /* (2048) */
 map->in_image_border_mode = 0x0;  /* (0) */
 map->in_image_border_const = 0x7d;  /* (125) */
 map->unused4 = 0x0;  /* (0) */
 map->kernel_data_type_bit_2 = 0x0;  /* (0) */
 map->in_image_data_type_bit_2 = 0x0;  /* (0) */
 map->out_image_data_type_bit_2 = 0x0;  /* (0) */
 map->post_multiplier_1_to_6 = 0x1f;  /* (31) */
 map->post_shift_bit_5_6 = 0x0;  /* (0) */
 map->unused5 = 0x0;  /* (0) */
 map->in_image_x_stride = 0x4;  /* (4) */
 map->in_image_y_stride = 0x4;  /* (4) */
 map->out_image_x_stride = 0x3;  /* (3) */
 map->unused6 = 0x0;  /* (0) */
 map->post_multiplier_7_to_14 = 0x61;  /* (97) */
 map->out_image_circular_buf_size = 0x0;  /* (0) */
 map->unused7 = 0x0;  /* (0) */
 map->per_channel_post_mul = 0x0;  /* (0) */
 map->out_image_circular_buf_end_addr_plus_1 = 0x3ffffff;  /* (67108863) */
 map->unused8 = 0x0;  /* (0) */
 map->in_image_circular_buf_size = 0x0;  /* (0) */
 map->unused9 = 0x0;  /* (0) */
 map->in_image_circular_buf_end_addr_plus_1 = 0x3ffffff;  /* (67108863) */
 map->unused10 = 0x0;  /* (0) */
 map->coef_zero_point = 0x80;  /* (128) */
 map->out_zero_point = 0x77;  /* (119) */
 map->kernel_direct_stream_from_VIP_sram = 0x0;  /* (0) */
 map->depthwise = 0x0;  /* (0) */
 map->unused11 = 0x0;  /* (0) */
 map->unused12 = 0x0;  /* (0) */
 map->unused13 = 0x0;  /* (0) */
 map->unused14 = 0x0;  /* (0) */
 map->unused15 = 0x0;  /* (0) */
 map->unused16 = 0x0;  /* (0) */
 map->further1 = 0x0;  /* (0) */
 map->further2 = 0x0;  /* (0) */
 map->further3 = 0x3ffffff;  /* (67108863) */
 map->further4 = 0x7f800000;  /* (2139095040) */
 map->further5 = 0xff800000;  /* (4286578688) */
 map->further6 = 0x0;  /* (0) */
 map->further7 = 0x0;  /* (0) */
 map->further8 = 0x0;  /* (0) */
   0x40, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x01, 0x00, 0x2c, 0x99, 0x0e, 0x00, 0x00,
   0x40, 0xea, 0x2c, 0xeb, 0x80, 0xaf, 0x80, 0x9b, 0x99, 0x80, 0x80, 0x13,
   0x80, 0x80, 0x80, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00
   0x69, 0xd3, 0x2d, 0x92, 0x07, 0x00, 0x64, 0x00, 0x0c, 0x22, 0x90, 0xd6,
   0x53, 0xc9, 0xe2, 0x48, 0xe6, 0x4c, 0xa8, 0xeb, 0xd2, 0xf3, 0xb0, 0xf4,
   0x2d, 0xa4, 0x3e, 0xf4, 0x0f, 0x7b, 0x98, 0x01, 0x41, 0x84, 0x92, 0x7e,
   0xfa, 0x19, 0xf5, 0xda, 0xb3, 0x5a, 0xb7, 0xf3, 0x97, 0x95, 0x12, 0xe7,
   0x51, 0x94, 0xcb, 0x5a, 0x1f, 0xa9, 0xc6, 0xc4, 0x1c, 0xa9, 0x92, 0x1f,
   0xf7, 0x64, 0xc3, 0xca
   0x77, 0x77, 0x77, 0x77, 0x77, 0x77, 0x77, 0x77, 0x77

This corresponds to a convolution with the following parameters:

  • 8x8x1 input tensor
  • 3x3x1 weight tensor
  • stride == 2
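These parameters are consistent with the standard output-size formula for a valid (unpadded) convolution. A quick sanity check, illustrative only:

```python
def conv_out_size(in_size, kernel_size, stride):
    # Output size of a valid (unpadded) convolution.
    return (in_size - kernel_size) // stride + 1

# Original parameters: 8x8 input, 3x3 kernel, stride 2.
assert conv_out_size(8, 3, 2) == 3
# The dump appears to show the stride-lowered form instead: a 4x4
# input with a 2x2 kernel at stride 1, which yields the same 3x3
# output (out_image_x_size = 0x3).
assert conv_out_size(4, 2, 1) == 3
```
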

The differences are due to different addresses being allocated between runs, plus some due to how Mesa's code is structured, but none of that should affect the end result.

At the top we have the payload of the submit IOCTL, followed by a struct with the configuration for the NN units themselves, and then the buffers for the weights, input and output.

When running a convolution configuration that isn't yet supported, we will spot more differences and hopefully be able to figure out the logic behind them.

Strided convolutions

The hardware doesn't really support strided convolutions, so these are "lowered" to 1-stride convolutions with added channels, following an approach described in a research paper.

By implementing the algorithm from the paper, we match the behavior of the blob, just as with requantization. The paper only covers 2D input tensors, so I will need to check how the blob behaves with 3D inputs and figure out the logic behind that.

For now I have chosen to do the tensor manipulation on the CPU, but later on we will be able to use the TP units in the HW for this, reducing latency.
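The core of the lowering is a space-to-depth rearrangement of the input: each stride x stride block of pixels becomes one "pixel" with stride*stride channels, after which a stride-1 convolution produces the same result. A sketch of the idea (not the driver's actual implementation):

```python
def space_to_depth(image, height, width, stride):
    """Rearrange a flat, row-major 2D image so that a strided
    convolution can be expressed as a stride-1 convolution over a
    smaller image with more channels. Output layout is [y][x][c] with
    stride*stride channels per pixel.
    """
    out = []
    for y in range(height // stride):
        for x in range(width // stride):
            # Gather the stride x stride block at (y, x) as channels.
            for dy in range(stride):
                for dx in range(stride):
                    out.append(image[(y * stride + dy) * width
                                     + x * stride + dx])
    return out
```

The weights have to be rearranged (and zero-padded) to match; for the 8x8 stride-2 example above, the 3x3x1 kernel would become 2x2x4, which lines up with kernel_xy_size = 2 and kernel_z_size = 4 in the dump.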

Test suite

With so many different convolution parameters supported, I felt the need for a comfortable way of keeping regressions in check.

I wrote a simple pytest module that generates a TFLite model with a single convolution operation, varying the parameters and payloads across the different configurations that we support.
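The structure looks roughly like the following sketch. The names and the parameter grid are hypothetical; the real module builds a one-operation TFLite model per combination and compares the NPU output against the CPU result, rather than using the toy reference below:

```python
import itertools

import pytest


def reference_conv2d(image, kernel, stride):
    """Naive valid convolution over 2D nested lists, as a golden
    reference to compare device output against."""
    oh = (len(image) - len(kernel)) // stride + 1
    ow = (len(image[0]) - len(kernel[0])) // stride + 1
    return [[sum(image[y * stride + ky][x * stride + kx] * kernel[ky][kx]
                 for ky in range(len(kernel))
                 for kx in range(len(kernel[0])))
             for x in range(ow)]
            for y in range(oh)]


@pytest.mark.parametrize("in_size,kernel_size,stride",
                         itertools.product([4, 8], [2, 3], [1, 2]))
def test_convolution(in_size, kernel_size, stride):
    image = [[1] * in_size for _ in range(in_size)]
    kernel = [[1] * kernel_size for _ in range(kernel_size)]
    out = reference_conv2d(image, kernel, stride)
    # With all-ones inputs every output equals the kernel element count.
    assert all(v == kernel_size * kernel_size for row in out for v in row)
```
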

At some point I will add a CI job, probably before sending the initial merge request.