Thursday, September 7, 2023

Etnaviv NPU update 6: Almost there!

Progress

This week started quite fruitfully; these features were added:

  • Convolutions with multiple input and output channels (input and output feature maps)
  • "Same" padding in convolutions

And with this we should have all the features we need to run a model such as MobileNet v1 and get some performance numbers to guide the next steps.
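For reference, the "same" padding amounts follow the usual TensorFlow/TFLite convention, which can be sketched in a few lines of Python (just an illustration of the convention, not driver code):

import math

def same_padding(in_size, kernel_size, stride):
    # Total padding needed for one spatial dimension so that the output
    # size becomes ceil(in_size / stride), with the extra pixel going at
    # the end, as TensorFlow/TFLite do it.
    out_size = math.ceil(in_size / stride)
    total = max((out_size - 1) * stride + kernel_size - in_size, 0)
    before = total // 2
    return before, total - before

print(same_padding(224, 3, 2))   # (0, 1)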

One more roadblock

Only that the NPU hangs when I try to use the 8th core... and this is required to run most detection models, as they start by convolving the input into 32 feature maps.

I have checked that we are sending bit-identical command streams and input buffers to the kernel, so I suspect the problem is somewhere in the kernel.

So I plan to instrument the out-of-tree kernel driver and get some register and command stream dumps, in the hope that there is some bit in a magic register somewhere that I need to flip.

Want to try it out?

I'm not really looking forward to such work, so I decided to first invest some time cleaning things up a bit to make it easier for other people to play with this if they wish.

I have removed from my branch everything from my previous attempt at using OpenCL and have written some documentation about how to run the TensorFlow Lite delegate:

https://gitlab.freedesktop.org/tomeu/mesa/-/blob/teflon/docs/teflon.rst

You will need a VIM3 board, a recent mainline kernel and a Debian testing rootfs.
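For a quick test, an external TFLite delegate can be loaded from Python roughly like this (the library name "libteflon.so" and the model path are assumptions on my part; the documentation linked above is the authoritative reference):

import numpy as np
import tensorflow as tf

# Load the delegate built from the teflon branch and let TFLite hand the
# supported operations to it.
delegate = tf.lite.experimental.load_delegate("libteflon.so")
interpreter = tf.lite.Interpreter(model_path="mobilenet_quant_v1_224.tflite",
                                  experimental_delegates=[delegate])
interpreter.allocate_tensors()

inp = interpreter.get_input_details()[0]
interpreter.set_tensor(inp["index"], np.zeros(inp["shape"], dtype=inp["dtype"]))
interpreter.invoke()

out = interpreter.get_output_details()[0]
print(interpreter.get_tensor(out["index"]))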


Thursday, August 24, 2023

Etnaviv NPU update 5: Harder convolutions!

Progress

Managed to squeeze some time between holidaying to hack on the NPU driver and got something out of it.

Since the last update I have:

  • implemented support for strided convolutions with more than one input channel, and
  • implemented support for more than one output channel, but for now only for a single input channel.

Next steps are to support convolutions with multiple input and output channels, and padding. Then I will see what is still missing so we can run MobileNet v1 and check the performance when using the NN units and doing the rest on the CPU.
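A trivial CPU reference is handy for checking these convolution variants; a naive NumPy convolution with multiple input and output channels (a sketch for illustration, not the code in the branch) could look like:

import numpy as np

def conv2d_reference(x, w, stride=1):
    # x: (H, W, Cin) input, w: (Kh, Kw, Cin, Cout) weights, "valid" padding.
    h, wid, cin = x.shape
    kh, kw, _, cout = w.shape
    oh = (h - kh) // stride + 1
    ow = (wid - kw) // stride + 1
    out = np.zeros((oh, ow, cout), dtype=np.float32)
    for i in range(oh):
        for j in range(ow):
            patch = x[i * stride:i * stride + kh, j * stride:j * stride + kw, :]
            for c in range(cout):
                out[i, j, c] = np.sum(patch * w[:, :, :, c])
    return out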

As a reminder, I'm pushing all the code to this branch: https://gitlab.freedesktop.org/tomeu/mesa/-/commits/teflon/.

IRC channel

A bunch of us have started to gather in the #ml-mainline IRC channel on OFTC to discuss matters related to doing accelerated ML with mainline on embedded devices.

For those of you who may not have an IRC bouncer set up yet, you can easily join with the web chat UI, but in case others aren't in front of the keyboard when you type your question, I recommend using element.io with the Matrix IRC bridge:

https://blog.christophersmart.com/2022/03/21/joining-a-bridged-irc-network-on-element-matrix/

Embedded recipes

I have been invited to give a talk about all this ML with mainline effort at Embedded Recipes 2023, in Paris on 28-29 September. Slides and a recording will be published after the conference ends.

Sponsor

Last but not least, if I am able to invest so much effort in this, it is because the folks at LibreComputer have been supporting me financially these last couple of months.

Thanks to Da Xue for his support, it is greatly appreciated! It is awesome to see SBC vendors investing in the Linux upstream ecosystem.

Monday, August 7, 2023

Etnaviv NPU update 4: It's convoluting!

Summer has kept me busy with holidays, but I have managed to find a bit of time to keep hacking on the driver for the VeriSilicon NPU since the last update.

TL;DR

The issue with placing the output to the right scale is solved now, and simple convolution operations are working just fine.

3D tensors are now supported as inputs, and we support strided convolutions as well, but only on 2D inputs for now.

The test workloads are running fast and stably now, so I now feel I have pretty solid ground beneath my feet.

There are three features left before I can run a real, full-fledged commercially interesting model:

  1. 3D inputs for strided convolutions
  2. Multiple output channels
  3. Padded convolutions

Re-quantization

The last update in this blog left off at my attempt to figure out how the raw convolution outputs had to be processed, with fields called post_shift and post_multiplier, so I could get the right values in the final output.

After spending more time than I probably should have in a spreadsheet trying to find correlations, some desperate googling brought me to research papers about optimizing quantization operations on integer-only hardware.

Those papers explain the meaning of the shift and multiplier: they are the operations we can use to approximate the floating-point division on integer-only hardware.

But to actually understand what the hardware was trying to do with them, it was useful to look at the QNNPACK implementation of requantization.
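In essence, the raw int32 accumulators have to be rescaled by a floating-point factor (for a quantized convolution, typically input_scale * weight_scale / output_scale), and that factor is approximated on integer hardware with a multiplier plus a right shift. A rough sketch of the idea (not QNNPACK's actual code, and the 16-bit multiplier width here is an assumption):

import math

def decompose_scale(scale, bits=16):
    # Approximate scale as multiplier / 2**shift, so that
    # x * scale ~= (x * multiplier) >> shift.
    mantissa, exponent = math.frexp(scale)   # scale = mantissa * 2**exponent
    multiplier = int(round(mantissa * (1 << bits)))
    return multiplier, bits - exponent

def requantize(acc, multiplier, shift, out_zero_point):
    # Rounding right shift of the scaled accumulator, then add the output
    # zero point and clamp to the uint8 range.
    rounded = (acc * multiplier + (1 << (shift - 1))) >> shift
    return max(0, min(255, rounded + out_zero_point))

multiplier, shift = decompose_scale(0.0037)
print(requantize(1234, multiplier, shift, out_zero_point=119))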

3D input tensor

This was pretty much straightforward, as it was basically a matter of updating the code to take the added dimension into account, and of reordering the tensor elements into the depth-first order that the hardware expects.
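To illustrate what that means (assuming "depth-first" is the layout where all depth/channel values of a pixel sit consecutively in memory, which is my reading of it):

import numpy as np

# Planar layout: one full 2D plane per channel, shape (C, H, W).
planar = np.arange(2 * 4 * 4).reshape(2, 4, 4)

# Depth-first layout: for each (y, x) position all channel values are
# stored back to back, shape (H, W, C).
depth_first = planar.transpose(1, 2, 0).copy()

print(depth_first.flatten()[:8])   # [ 0 16  1 17  2 18  3 19]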

This was made much easier by some improvements to the scripts I use to observe the behavior of the closed source stack, by intercepting the communication with the kernel's GPL driver.

For example, this is the output when Mesa has generated a cmd stream that is functionally equivalent to what the blob sends to the kernel:

+ diff -u -U 100 /home/tomeu/mesa.txt /home/tomeu/galcore.txt
--- /home/tomeu/mesa.txt    2023-08-07 18:28:29.939750225 +0200
+++ /home/tomeu/galcore.txt    2023-08-07 18:28:42.116625362 +0200
@@ -1,176 +1,273 @@
 {
-    0x0801028a, /* LOAD_STATE (1) Base: 0x00A28 Size: 1 Fixp: 0 */
-    0x00000011, /*   PA.SYSTEM_MODE := PROVOKING_VERTEX_LAST=1,HALF_PIXEL_CENTER=1 */
-    0x08010e13, /* LOAD_STATE (1) Base: 0x0384C Size: 1 Fixp: 0 */
-    0x00000002, /*   GL.API_MODE := OPENCL */
+    0x00000000, /* UNKNOWN (0) */
+    0x00000000, /*  */
+    0x00000000, /* UNKNOWN (0) */
+    0x00000000, /*  */
+    0x00000000, /* UNKNOWN (0) */
+    0x00000000, /*  */
     0x00000000, /* UNKNOWN (0) */
     0x00000000, /*  */
     0x08010e4f, /* LOAD_STATE (1) Base: 0x0393C Size: 1 Fixp: 0 */
     0x00000000, /*   GL.OCB_REMAP_START := 0x0 */
     0x08010e50, /* LOAD_STATE (1) Base: 0x03940 Size: 1 Fixp: 0 */
     0x00000000, /*   GL.OCB_REMAP_END := 0x0 */
     0x08010e4c, /* LOAD_STATE (1) Base: 0x03930 Size: 1 Fixp: 0 */
     0x00000010, /*   GL.NN_CONFIG := UNK0=0x0,DISABLE_ZDPN=0,DISABLE_SWTILING=0,SMALL_BATCH=1,DDR_BURST_SIZE=0x0,UNK7=0,NN_CORE_COUNT=0x0,UNK12=0 */
     0x08010428, /* LOAD_STATE (1) Base: 0x010A0 Size: 1 Fixp: 0 */
-    0xffff3000, /*   PS.NN_INST_ADDR := *0xffff3000 */
+    0x3348e780, /*   PS.NN_INST_ADDR := *0x3348e780 */
     0x08010429, /* LOAD_STATE (1) Base: 0x010A4 Size: 1 Fixp: 0 */
     0x00000000, /*   0x010A4 */
     0x08010e03, /* LOAD_STATE (1) Base: 0x0380C Size: 1 Fixp: 0 */
     0x00000c23, /*   GL.FLUSH_CACHE := DEPTH=1,COLOR=1,TEXTURE=0,PE2D=0,TEXTUREVS=0,SHADER_L1=1,SHADER_L2=0,UNK10=1,UNK11=1,DESCRIPTOR_UNK12=0,DESCRIPTOR_UNK13=0 */
     0x08010e03, /* LOAD_STATE (1) Base: 0x0380C Size: 1 Fixp: 0 */
     0x00000c23, /*   GL.FLUSH_CACHE := DEPTH=1,COLOR=1,TEXTURE=0,PE2D=0,TEXTUREVS=0,SHADER_L1=1,SHADER_L2=0,UNK10=1,UNK11=1,DESCRIPTOR_UNK12=0,DESCRIPTOR_UNK13=0 */
     0x00000000, /* UNKNOWN (0) */
     0x00000000, /*  */
 }
 map->layer_type = 0x0;  /* (0) */
 map->no_z_offset = 0x0;  /* (0) */
 map->kernel_xy_size = 0x2;  /* (2) */
 map->kernel_z_size = 0x4;  /* (4) */
 map->kernels_per_core = 0x1;  /* (1) */
 map->pooling = 0x0;  /* (0) */
 map->pooling_xy_size = 0x1;  /* (1) */
 map->prelu = 0x0;  /* (0) */
 map->nn_layer_flush = 0x1;  /* (1) */
 map->kernel_data_type = 0x0;  /* (0) */
 map->in_image_data_type = 0x0;  /* (0) */
 map->out_image_data_type = 0x0;  /* (0) */
 map->in_image_x_size = 0x4;  /* (4) */
 map->in_image_y_size = 0x4;  /* (4) */
 map->in_image_x_offset = 0x0;  /* (0) */
 map->in_image_y_offset = 0x0;  /* (0) */
 map->unused0 = 0x0;  /* (0) */
 map->brick_mode = 0x0;  /* (0) */
 map->brick_distance = 0x0;  /* (0) */
 map->relu = 0x0;  /* (0) */
 map->unused1 = 0x0;  /* (0) */
 map->post_multiplier = 0x0;  /* (0) */
 map->post_shift = 0x17;  /* (23) */
 map->unused2 = 0x0;  /* (0) */
 map->no_flush = 0x0;  /* (0) */
 map->unused3 = 0x0;  /* (0) */
 map->out_image_x_size = 0x3;  /* (3) */
 map->out_image_y_size = 0x3;  /* (3) */
 map->out_image_z_size = 0x1;  /* (1) */
 map->rounding_mode = 0x1;  /* (1) */
 map->in_image_x_offset_bit_3 = 0x0;  /* (0) */
 map->in_image_y_offset_bit_3 = 0x0;  /* (0) */
 map->out_image_tile_x_size = 0x3;  /* (3) */
 map->out_image_tile_y_size = 0x3;  /* (3) */
-map->kernel_address = 0x3fffd00;  /* (67108096) */
+map->kernel_address = 0xcd237f;  /* (13443967) */
 map->kernel_z_size2 = 0x0;  /* (0) */
-map->in_image_address = 0xffff6000;  /* (4294926336) */
-map->out_image_address = 0xffff7000;  /* (4294930432) */
+map->in_image_address = 0x3348e240;  /* (860414528) */
+map->out_image_address = 0x89ffc500;  /* (2315240704) */
 map->image_caching_mode = 0x0;  /* (0) */
 map->kernel_caching_mode = 0x1;  /* (1) */
 map->partial_cache_data_unit = 0x0;  /* (0) */
 map->kernel_pattern_msb = 0x0;  /* (0) */
 map->kernel_y_size = 0x2;  /* (2) */
 map->out_image_y_stride = 0x3;  /* (3) */
 map->kernel_pattern_low = 0x0;  /* (0) */
 map->kernel_pattern_high = 0x0;  /* (0) */
 map->kernel_cache_start_address = 0x800;  /* (2048) */
 map->kernel_cache_end_address = 0xa00;  /* (2560) */
 map->image_start_address = 0x0;  /* (0) */
 map->image_end_address = 0x800;  /* (2048) */
 map->in_image_border_mode = 0x0;  /* (0) */
 map->in_image_border_const = 0x7d;  /* (125) */
 map->unused4 = 0x0;  /* (0) */
 map->kernel_data_type_bit_2 = 0x0;  /* (0) */
 map->in_image_data_type_bit_2 = 0x0;  /* (0) */
 map->out_image_data_type_bit_2 = 0x0;  /* (0) */
 map->post_multiplier_1_to_6 = 0x1f;  /* (31) */
 map->post_shift_bit_5_6 = 0x0;  /* (0) */
 map->unused5 = 0x0;  /* (0) */
 map->in_image_x_stride = 0x4;  /* (4) */
 map->in_image_y_stride = 0x4;  /* (4) */
 map->out_image_x_stride = 0x3;  /* (3) */
 map->unused6 = 0x0;  /* (0) */
 map->post_multiplier_7_to_14 = 0x61;  /* (97) */
 map->out_image_circular_buf_size = 0x0;  /* (0) */
 map->unused7 = 0x0;  /* (0) */
 map->per_channel_post_mul = 0x0;  /* (0) */
 map->out_image_circular_buf_end_addr_plus_1 = 0x3ffffff;  /* (67108863) */
 map->unused8 = 0x0;  /* (0) */
 map->in_image_circular_buf_size = 0x0;  /* (0) */
 map->unused9 = 0x0;  /* (0) */
 map->in_image_circular_buf_end_addr_plus_1 = 0x3ffffff;  /* (67108863) */
 map->unused10 = 0x0;  /* (0) */
 map->coef_zero_point = 0x80;  /* (128) */
 map->out_zero_point = 0x77;  /* (119) */
 map->kernel_direct_stream_from_VIP_sram = 0x0;  /* (0) */
 map->depthwise = 0x0;  /* (0) */
 map->unused11 = 0x0;  /* (0) */
 map->unused12 = 0x0;  /* (0) */
 map->unused13 = 0x0;  /* (0) */
 map->unused14 = 0x0;  /* (0) */
 map->unused15 = 0x0;  /* (0) */
 map->unused16 = 0x0;  /* (0) */
 map->further1 = 0x0;  /* (0) */
 map->further2 = 0x0;  /* (0) */
 map->further3 = 0x3ffffff;  /* (67108863) */
 map->further4 = 0x7f800000;  /* (2139095040) */
 map->further5 = 0xff800000;  /* (4286578688) */
 map->further6 = 0x0;  /* (0) */
 map->further7 = 0x0;  /* (0) */
 map->further8 = 0x0;  /* (0) */
   0x40, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x01, 0x00, 0x2c, 0x99, 0x0e, 0x00, 0x00,
   0x40, 0xea, 0x2c, 0xeb, 0x80, 0xaf, 0x80, 0x9b, 0x99, 0x80, 0x80, 0x13,
   0x80, 0x80, 0x80, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00
   0x69, 0xd3, 0x2d, 0x92, 0x07, 0x00, 0x64, 0x00, 0x0c, 0x22, 0x90, 0xd6,
   0x53, 0xc9, 0xe2, 0x48, 0xe6, 0x4c, 0xa8, 0xeb, 0xd2, 0xf3, 0xb0, 0xf4,
   0x2d, 0xa4, 0x3e, 0xf4, 0x0f, 0x7b, 0x98, 0x01, 0x41, 0x84, 0x92, 0x7e,
   0xfa, 0x19, 0xf5, 0xda, 0xb3, 0x5a, 0xb7, 0xf3, 0x97, 0x95, 0x12, 0xe7,
   0x51, 0x94, 0xcb, 0x5a, 0x1f, 0xa9, 0xc6, 0xc4, 0x1c, 0xa9, 0x92, 0x1f,
   0xf7, 0x64, 0xc3, 0xca
   0x77, 0x77, 0x77, 0x77, 0x77, 0x77, 0x77, 0x77, 0x77

This corresponds to a convolution with the following parameters:

  • 8x8x1 input tensor
  • 3x3x1 weight tensor
  • stride == 2

The differences are due to different addresses being allocated between runs, plus some that come from how Mesa's code is structured, but those shouldn't affect the end result.

At the top we have the payload of the submit IOCTL, followed by a struct with the configuration for the NN units themselves and then the buffers for the weights, input and output.
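As a sanity check, those values look consistent with the stride-2 convolution having been lowered to a stride-1 one (see the next section): the 8x8x1 input would become 4x4x4 and the padded 3x3 kernel 2x2x4, which is my reading of in_image_x_size=4, kernel_xy_size=2, kernel_z_size=4 and out_image_x_size=3 in the dump above. A quick back-of-the-envelope check:

in_size, kernel, stride = 8, 3, 2

lowered_in = in_size // stride                    # 4
lowered_kernel = (kernel + stride - 1) // stride  # 2
lowered_depth = stride * stride                   # 4
out_size = lowered_in - lowered_kernel + 1        # 3
print(lowered_in, lowered_kernel, lowered_depth, out_size)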

When running a convolution configuration that isn't yet supported, we will spot more differences and hopefully will be able to figure out the logic behind them.

Strided convolutions

The hardware doesn't really support strided convolutions, so these are "lowered" to 1-stride convolutions with added channels, as per this research paper.

By implementing the algorithm in the paper we match the behavior of the blob, as happened with requantization. The paper only covers 2D input tensors, so I will need to check how the blob behaves with 3D inputs and figure out the logic behind that.

For now I have chosen to do the tensor manipulation on the CPU, but later on we will be able to use the TP units in the HW for this, reducing latency.
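As an illustration, the generic space-to-depth reshuffle looks like the sketch below (the exact layout the blob and the paper use may well differ; this is just the idea):

import numpy as np

def space_to_depth(x, block):
    # Rearrange an (H, W, C) tensor into (H/block, W/block, C*block*block),
    # so that a stride-`block` convolution on x becomes a stride-1
    # convolution on the result (with a correspondingly reshuffled kernel).
    h, w, c = x.shape
    x = x.reshape(h // block, block, w // block, block, c)
    return x.transpose(0, 2, 1, 3, 4).reshape(h // block, w // block,
                                              c * block * block)

x = np.arange(8 * 8 * 1).reshape(8, 8, 1)
print(space_to_depth(x, 2).shape)   # (4, 4, 4)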

Test suite

With so many different convolution parameters supported, I felt the need for a comfortable way of keeping regressions in check.

I wrote a simple pytest module that generates a TFLite model with a single convolution operation, with the parameters and payloads varied across the different configurations that we support.
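The gist of it is something like the following (a simplified sketch; the helper names and parameter sets here are made up, the real module lives in the branch):

import numpy as np
import pytest
import tensorflow as tf

def build_conv_model(in_ch, out_ch, kernel_size, stride, padding):
    # Build a fully int8-quantized TFLite model containing a single Conv2D.
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(out_ch, kernel_size, strides=stride,
                               padding=padding, input_shape=(8, 8, in_ch)),
    ])
    converter = tf.lite.TFLiteConverter.from_keras_model(model)
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    converter.representative_dataset = lambda: (
        [np.random.rand(1, 8, 8, in_ch).astype(np.float32)] for _ in range(8))
    converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
    converter.inference_input_type = tf.int8
    converter.inference_output_type = tf.int8
    return converter.convert()

@pytest.mark.parametrize("in_ch,out_ch", [(1, 1), (1, 32), (3, 8)])
@pytest.mark.parametrize("stride", [1, 2])
@pytest.mark.parametrize("padding", ["valid", "same"])
def test_conv2d(in_ch, out_ch, stride, padding):
    tflite_model = build_conv_model(in_ch, out_ch, 3, stride, padding)
    assert len(tflite_model) > 0
    # ...then run it once on the CPU kernels and once through the delegate,
    # and compare the two outputs within a small tolerance.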

At some point I will add a CI job, probably before sending the initial merge request.

Monday, June 26, 2023

Etnaviv NPU update 3: Deeper into the convolution units

What two weeks!

Programming of the convolution units

Picking up from where I left off at the last update, I made progress in understanding the format of the buffer that contains the weights and biases.

The bit of knowledge that made a difference was realising that the format is optimized so that each NN core can efficiently access the portion of it that it needs, without having to do any parsing or decoding. Knowing that also helped in guessing what some fields in the parameter structure are for.

With that, I was able to correctly run a convolution on a small matrix with arbitrary weights and biases.

The biggest roadblock in this area currently is understanding how I need to program the output unit in the NN so the output data is in the desired scale. There are a series of fields that influence how the output values are processed before being placed in the output buffer, and I don't really know how they work yet. They are called post_shift and post_mult, and the former correlates moderately (r=0.78) with the quantization scale of the output. I know that the post_shift field does what it says, shifting to the right, but to understand what value I need in each situation I feel I need to understand better how the hardware works and what the values could be at the end of the convolution, before the output unit. I will be reading a bunch of research papers about NN-accelerating silicon over the summer.

That said, replacing the convolution OpenCL kernels in TensorFlow Lite's GPU delegate with the fixed-function units turned out to be a worse idea than I initially thought. This is because that delegate is completely oriented towards float-first hardware such as GPUs, while this accelerator is integer-only.

A consequence of this is that TFLite inserts a dequantize operation at the start of the graph and a quantize at the end, to match the desired input and output formats of a fully quantized model while feeding floats to the GPU. We need integers, so we would have to quantize again after TFLite's dequantization, and vice versa. Also, the other operations in the graph expect floats as well... This is certainly the wrong path to take for performance on a bandwidth-constrained device, as all embedded boards are, so I had to go back to the drawing board.
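For reference, the (de)quantize operations that TFLite inserts are just the affine mapping between real values and int8, so each round trip means a full extra pass over the tensor (a minimal sketch of the standard formulas):

import numpy as np

def dequantize(q, scale, zero_point):
    # int8 -> float32, as inserted by TFLite at the start of the graph.
    return scale * (q.astype(np.float32) - zero_point)

def quantize(r, scale, zero_point):
    # float32 -> int8, as inserted by TFLite at the end of the graph.
    q = np.round(r / scale) + zero_point
    return np.clip(q, -128, 127).astype(np.int8)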

A new Gallium frontend: Teflon

If TF Lite's GPU delegate is such a bad match for this HW, what can we do to run inferences at reasonable speeds? The same thing VeriSilicon did: write our own delegate:

https://gitlab.freedesktop.org/tomeu/mesa/-/commits/teflon/

TF Lite's operation description matches relatively well what we currently know of the configuration of the NN units. So we will not need to write complex shaders to implement the operations, but "just" translate the description of the operation to the HW configuration.

Of course, no HW has fixed-function units that accelerate all the operations built into TF Lite, or even all of those that the most commonly used models contain. VeriSilicon's delegate deals with that by having a library of optimized OpenCL kernels that run on their programmable shader core(s).

But we want to avoid getting in the business of writing dozens of kernels that will need to be tweaked and made more complex so they run efficiently on other NPUs out there.

Fortunately, the delegate infrastructure in TF Lite is designed for this very scenario of imperfect HW and we can have a simple delegate that will implement the operations supported by the HW and the rest will execute in other delegates based on their capabilities.

How fast that will be is a big unknown right now, as switching between delegates will have a cost in terms of synchronization and data sharing, but that is something we can probably improve in the TF Lite code base, as the kernel already has all the mechanisms needed for efficient synchronization and data sharing.

Another possibility the TF Lite delegate mechanism gives us is offloading the operations we don't support to a different delegate that can accelerate them. For example, in the case of a board with Amlogic A311D or S905D3, we could use the GPU delegate to run those operations on the Mali GPU on it, via the OpenCL driver that Alyssa is writing in Mesa.

And if that is still slower than with the proprietary stack, one could always write an optimized kernel in NIR to run on the programmable core in the Vivante NPU. That is the beauty of free software: we can address our needs ourselves and, importantly, do it by pooling work with others!

Because this frontend is implemented in terms of Gallium, we leverage the infrastructure in there for memory management, synchronization and execution. I think this will work well for adding support to other NN engines such as those from Rockchip, Cadence, Mediatek, etc.

Next steps

I need to crack the nut of the post-processing of the raw output so it is in the expected scale, and afterwards I will be looking at handling multiple feature maps (kernel z > 1).

After that I don't see much else in the way of running convolutions as expected by TF Lite, so hopefully I will be running some models and measuring the performance. I expect that we will want to do the same for accelerating tensor operations with the TP units. And we will probably want to take a look at using the SRAM to reduce bandwidth and memory access latency. That is still some way off though, and the summer is just starting!

Saturday, June 10, 2023

Etnaviv NPU update 2: Diving into the convolution units

In the previous update I explained that the programmable core in this NPU (VIPNano-QI) is too slow to run inference workloads substantially faster than the CPUs. The vendor stack achieves acceptable inference rates by running most of the work on fixed-function units that can perform different kinds of convolutions and transformations of tensors.

Most of the work is done by the convolution units that VeriSilicon calls NN cores, so this is what I have been focusing on at this stage. I think that even if we still do all tensor transformation on the programmable core, by using the NN units we could already achieve usable performance.

By looking around in the ioctls that VeriSilicon's userspace stack sends to the kernel, it was clear that in the NN jobs there was little more than a pointer to a structure that configures the NN fixed-function units. Luckily I didn't need to reverse engineer it from zero, as VeriSilicon's out-of-tree kernel driver is GPL and contains two instances of programming this HW with a trivial job (a 2x2x1 kernel with a single bias value).

It took some boring work to translate what the code does into a C struct, but this was the initial one:

struct etna_nn_params {
   uint32_t op_type : 1; /* conv: 0 fully_connected: 1 */
   uint32_t no_z_offset : 1;
   uint32_t kernel_x_size : 4;
   uint32_t kernel_z_size : 14; /* & 0x3FFF */
   uint32_t kernels_per_core : 7;
   uint32_t zero1 : 2;
   uint32_t zero2 : 1;
   uint32_t zero3 : 1;
   uint32_t nn_layer_flush : 1;

   uint32_t kernel_data_type : 2; /* UINT8 0x2 INT8 0x0 */
   uint32_t in_image_data_type : 2; /* UINT8 0x2 INT8 0x0 */
   uint32_t out_image_data_type : 2; /* UINT8 0x2 INT8 0x0 */
   uint32_t in_image_x_size : 13;
   uint32_t in_image_y_size : 13;

   uint32_t zero4 : 3;
   uint32_t zero5 : 3;
   uint32_t unused0 : 1;
   uint32_t zero6 : 16;
   uint32_t zero7 : 1;
   uint32_t enable_relu : 1;
   uint32_t zero9 : 1;
   uint32_t post_shift : 6;

   uint32_t unused1 : 2;
   uint32_t zero10 : 1;
   uint32_t zero11 : 1;
   uint32_t unused2 : 2;
   uint32_t out_image_x_size : 13;
   uint32_t out_image_y_size : 13;

   uint32_t out_image_z_size : 14;
   uint32_t zero12 : 2; /* 0x0 */
   uint32_t zero13 : 1; /* (0 >> 3) & 0x1 */
   uint32_t zero14 : 1; /* (0 >> 3) & 0x1 */
   uint32_t unk0 : 7;  /* 1 */
   uint32_t unk1 : 7;  /* 1 */

   uint32_t kernel_address : 26; /* >> 6 */
   uint32_t kernel_z_size2 : 6; /* >> 14 */

   uint32_t in_image_address;

   uint32_t out_image_address;

   uint32_t unused3 : 12;
   uint32_t kernel_y_size : 4;
   uint32_t out_image_y_size2 : 16;  /* maybe stride? */

   uint32_t zero15;

   uint32_t zero16;

   uint32_t zero17;

   uint32_t kernel_cache_end_address;

   uint32_t zero19;

   uint32_t image_end_address;

   uint32_t zero20 : 2;
   uint32_t zero21 : 16;
   uint32_t kernel_data_type_bit_2 : 1;
   uint32_t in_image_data_type_bit_2 : 1;
   uint32_t out_image_data_type_bit_2 : 1;
   uint32_t zero22 : 6;
   uint32_t post_shift_bit_5_6 : 2;
   uint32_t unused4 : 3;

   uint32_t in_image_stride : 16;
   uint32_t in_image_y_size2 : 16; /* again? */

   uint32_t out_image_stride : 16;
   uint32_t unused5 : 8;
   uint32_t zero23 : 8;

   uint32_t zero24 : 26; /* 0 >> 6 */
   uint32_t zero25 : 1;
   uint32_t zero26 : 1;
   uint32_t zero27 : 1; /* 0 >> 4 */
   uint32_t zero28 : 1; /* 0 >> 4 */
   uint32_t zero29 : 1;
   uint32_t kernel_data_type_bit_3 : 1;

   uint32_t unk2 : 26; /* 0xFFFFFFFF >> 6 */
   uint32_t unused6 : 4;
   uint32_t zero30 : 1;
   uint32_t in_image_data_type_bit_3 : 1;

   uint32_t zero31 : 26; /* 0 >> 6 */
   uint32_t out_image_data_type_bit_3 : 1;
   uint32_t unused7 : 6;

   uint32_t unk3 : 26; /* 0xFFFFFFFF >> 6 */
   uint32_t unused8 : 6;

   uint32_t coef_zero_point : 8;
   uint32_t out_zero_point : 8;
   uint32_t zero32 : 1;
   uint32_t zero33 : 1;
   uint32_t zero34 : 8;
   uint32_t unused9 : 6;

   uint32_t zero35;

   uint32_t zero36 : 4;
   uint32_t zero37 : 28;  /* 0 >> 4 */

   uint32_t zero38 : 4;
   uint32_t zero39 : 28;  /* 0 >> 4 */

   uint32_t further1;
   uint32_t further2;
   uint32_t further3;
   uint32_t further4;
   uint32_t further5;
   uint32_t further6;
   uint32_t further7;
   uint32_t further8;
};

As you can see there are a lot of "zero" and "unused" fields; I think most of them are actually used for something, as HW engineers don't tend to like wasting bits. By adding instrumentation for dumping these structs to the reverse engineering tooling, I will get a better idea of what each field means and does.
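As an example of what that instrumentation can look like, a bit-field struct like this can be mirrored in the Python tooling with ctypes; here is a sketch covering only the first 32-bit word (and assuming the compiler's usual LSB-first bit-field ordering):

import ctypes

class EtnaNNParamsWord0(ctypes.LittleEndianStructure):
    # First 32-bit word of struct etna_nn_params, as ctypes bit fields.
    _fields_ = [
        ("op_type", ctypes.c_uint32, 1),
        ("no_z_offset", ctypes.c_uint32, 1),
        ("kernel_x_size", ctypes.c_uint32, 4),
        ("kernel_z_size", ctypes.c_uint32, 14),
        ("kernels_per_core", ctypes.c_uint32, 7),
        ("zero1", ctypes.c_uint32, 2),
        ("zero2", ctypes.c_uint32, 1),
        ("zero3", ctypes.c_uint32, 1),
        ("nn_layer_flush", ctypes.c_uint32, 1),
    ]

def dump_word0(raw):
    # Parse 4 bytes captured from the command stream and print every field,
    # so dumps from the blob and from Mesa can be diffed.
    word = EtnaNNParamsWord0.from_buffer_copy(raw)
    for name, _, _ in word._fields_:
        print(f"{name} = {getattr(word, name):#x}")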

I got GPU hangs the first time that I submitted a job with the same configuration as the kernel's trivial reset job, and looking further showed that the buffer that contains the convolution filters must follow a specific format.

Looking again at the kernel driver sources, I used the same kernel/filter buffer and the GPU didn't hang anymore. That kernel had all zeroes as its weights, and indeed my output buffer was now full of zeroes.

Then I tried to put my weights into the format that I inferred from the kernel driver source code, but I wasn't able to get any job to run to completion without hangs, and the output buffer was unchanged.

To figure out what I was missing about how the weights (and the biases) need to be placed in the buffer, I added code to the reverse engineering tooling to dump the weights buffer. With that buffer and after playing some with the sizes of the output, input and kernel buffers, I finally got a job to run with non-zero weights.

What I am doing right now is slowly zeroing out the weights buffer to figure out which bits are data, which are control, and what effect the changes have on the output.

Hope that by the next update I will have documented the format of the weights buffer and will be able to run at least one kind of convolution!

Monday, May 29, 2023

Etnaviv NPU update 1: Planning for performance

As I wrote in the last update, my OpenCL branch is able to correctly run MobileNet v1 with the GPU delegate in TensorFlow-Lite, albeit much slower than with VeriSilicon's proprietary stack.

In the weeks since, I have been investigating the performance difference, understanding better how the HW works and what the explanation could be. Inference with Etnaviv took 1200 ms, while the proprietary stack did the same in less than 10 ms (120x faster!).

When trying to understand the big performance difference I discovered that the existing reverse engineering tools that I had been using to understand how to run OpenCL workloads weren't working. They detected a single OpenCL kernel at the end of the execution, and there was no way that single kernel could be executing the whole network.

After a lot of fumbling around on the internet I stumbled upon a commit that included an interestingly-named environment variable: VIV_VX_DISABLE_TP_NN_EVIS. With it, VeriSilicon's OpenVX implementation will execute the network without using the TP or NN fixed-function units, nor the EVIS instruction set (which helps reduce memory bandwidth use by allowing operations on packed int8 and int16 types).

With that environment variable OpenVX was using regular OpenCL to run the inference, and the performance difference was interesting: 398.428 ms. Still much better than our time, but also more than 50 times slower than when fully using the capabilities of the hardware. The reason for this is that there is only one core in the NPU that is able to run programmable kernels. The rest are fixed-function units as I'm going to explain next.

Digging further into VeriSilicon's kernel driver and their marketing documents, I gathered that this particular NPU has 8 convolution cores (they call them NN cores) and 4 cores for accelerating some tensor operations (TP cores). Whatever these units cannot do has to be done on the single, slow programmable core.

Next step was to understand how the proprietary stack made use of the fixed function units in the NPU.

The MobileNet v1 model I used contains these operations, as output by TFLite's model analyzer:

  Op#0 CONV_2D(T#88, T#6, T#4[28379, 17476, 18052, -2331, 17431, ...]) -> [T#5]
  Op#1 DEPTHWISE_CONV_2D(T#5, T#33, T#32[-249, 165, 173, -2, 158, ...]) -> [T#31]
...

[12 more pairs of CONV_2D and DEPTHWISE_CONV_2D]

...

  Op#27 AVERAGE_POOL_2D(T#29) -> [T#0]
  Op#28 CONV_2D(T#0, T#3, T#2[-5788, -4159, 2282, -6706, -9783, ...]) -> [T#1]
  Op#29 RESHAPE(T#1, T#86[-1, 1001]) -> [T#85]
  Op#30 SOFTMAX(T#85) -> [T#87]

As can be seen, it is basically a bunch of convolutions with a final reshaping and a SOFTMAX operation at the end. 

By using some of the environment variables that are mentioned in this issue in GitHub, we can get some information on how the proprietary stack plans the execution on the hardware:

  operation_name:VXNNE_OPERATOR_TENSOR_TRANS operation_target:VXNNE_OPERATION_TARGET_TP
  operation_name:VXNNE_OPERATOR_RESHUFFLE operation_target:VXNNE_OPERATION_TARGET_TP
  operation_name:VXNNE_OPERATOR_CONVOLUTION operation_target:VXNNE_OPERATION_TARGET_NN
...

[34 more VXNNE_OPERATOR_CONVOLUTION on VXNNE_OPERATION_TARGET_NN] 

...

  operation_name:VXNNE_OPERATOR_POOLING operation_target:VXNNE_OPERATION_TARGET_SH
  operation_name:VXNNE_OPERATOR_FULLYCONNECTED operation_target:VXNNE_OPERATION_TARGET_TP
  operation_name:VXNNE_OPERATOR_SOFTMAX operation_target:VXNNE_OPERATION_TARGET_SH

From that we can see that the TP units are used to prepare the input tensor, all convolution operations go to the NN cores, and then the output of the convolutions is passed through a pooling operation on the programmable core, whose output goes to the TP cores for further processing, before finishing with SOFTMAX on the programmable core.

So in this case, only a small part of the network is actually run on the programmable cores, via OpenCL...

Next steps 

What I will be working on next:

  1. Adapt the existing RE tooling to dump information regarding NN and TP workflows
  2. Start to fill the data structures by reading the code of VeriSilicon's kernel driver, which executes some trivial workloads to, presumably, reset the HW between context switches to prevent information leaks.
  3. Write some simple OpenVX graphs that exercise each of the operations that the documentation claims to be supported by the NPU.
  4. Observe the data that VeriSilicon's userspace stack passes to the kernel, and infer from there the exact layout of the configuration buffers that program the fixed-function units.
  5. Hack Mesa to send a NN job if the name of the CL kernel contains "convolution".
  6. Get things working for this specific network and measure performance.

If performance is at least 3x faster than running the inference on the CPU, I would call this good enough to be useful and I will switch to upstreaming. The Mesa side of it doesn't look that bad, but I think the bigger challenge will be getting something merged in TensorFlow that can run fast on this hardware.

The most reasonable approach I have been able to think of would be adding new CL C and SPIR-V vendor extensions that add a new intrinsic for the whole convolution operation (with parameters similar to those of the vxConvolutionLayer node).

The GPU delegate in TensorFlow Lite would use it on the Vivante NPU and Mesa would have a robust way of knowing that this kernel should be run with a NN job, and with what configuration.

That's a lot of work, but I would say at this point that afterwards I will start looking at making fuller use of the NPU's capabilities by doing something similar with the operations that the TP cores can accelerate.

Wednesday, April 26, 2023

A long overdue update

Cannot believe it has been years since my last update here!

There are two things that I would like to tell people about:

The first is that I no longer work at Collabora. It has been almost 13 years full of excitement and recently I came to believe that I wanted a proper change.

They are great folks to work with, so if you are thinking of a career change and want to do open-source stuff upstream, I recommend you consider them.

And the other topic is what I have been working on lately: a free software driver for the NPUs that VeriSilicon sells to SoC vendors.

TL;DR

tomeu@arm-64:~/tensorflow/build/examples/label_image$ SMALLER_SOFTMAX=1 RUSTICL_ENABLE=etnaviv LD_LIBRARY_PATH=/home/tomeu/opencl/lib LIBGL_DRIVERS_PATH=/home/tomeu/opencl/lib/dri/ ./label_image --gpu_backend=cl --use_gpu=true --verbose 1 --tflite_model ../../../assets/mobilenet_quant_v1_224.tflite --labels ../../../assets/labels.txt --image ../../../assets/grace_hopper.bmp --warmup_runs 1 -c 1

[snip]
INFO: invoked
INFO: average time: 1261.99 ms
INFO: 0.666667: 458 bow tie
INFO: 0.294118: 653 military uniform
INFO: 0.0117647: 835 suit
INFO: 0.00784314: 611 jersey
INFO: 0.00392157: 922 book jacket

That is TensorFlow Lite's OpenCL delegate detecting objects with Etnaviv from Grace Hopper's portrait in military uniform.

The story behind this work

Many years ago, when I was working on the operating system for the One Laptop Per Child project, I became painfully aware of the problems that derive from IP vendors not providing the source code for their drivers.

This and other instances of the same problem motivated me to help out on the Panfrost project, writing a free software driver for the Mali GPUs by Arm. That gave me a great opportunity to learn about reverse engineering from Alyssa Rosenzweig.

Nowadays the Mesa project contains drivers for most GPUs out there, some maintained by the same companies that develop the IP, some by their customers and hobbyists alike. So the problem of the availability of source code for GPU drivers is pretty much solved.

Only that, with the advent of machine learning in the edge, we are reliving this problem with the drivers for accelerating those workloads with NPUs, TPUs, etc.

Vivante's NPU IP is very closely based on their GPUs. And it is pretty popular, being included in SoCs by Amlogic, Rockchip, NXP, Broadcom and more.

We already have a reasonably complete driver (Etnaviv) for their GPU IP, so I started by looking at what the differences were and how much of the existing userspace and kernel drivers we could reuse.

The kernel driver works with almost no changes; it just took me some time to implement the hardware initialization properly upstream. As of Linux 6.3 the driver loads correctly on Khadas' VIM3, but for a chance at decent performance this patch is needed:

[PATCH] arm64: dts: VIM3: Set the rates of the clocks for the NPU

Due to its experimental status, it is disabled by default in the device tree. To enable it, add the below to arch/arm64/boot/dts/amlogic/meson-g12b-a311d-khadas-vim3.dts:

&npu {
       status = "okay";
};

Enabling Etnaviv for other boards with this IP should be relatively straightforward, by describing how the HW is initialized by inspecting the downstream kernel sources for the board in question.

Mesa has seen most of the work, as this IP is compute-only and the userspace driver only targeted OpenGL ES.

First step was wiring up the existing driver to Mesa's OpenCL implementation, and then I focused on getting the simplest kernel to correctly run. For this and all the subsequent work, the reverse-engineering tools used by the Etnaviv community have been of great use.

At that point I had to pause the work to focus on other unrelated stuff, but Collabora's Italo Nicola and Faith Ekstrand did great work to extend the existing compiler to generate OpenCL kernels.

Once I didn't have a day job getting in the way anymore, I started adding the features needed to run the label_image example in TensorFlow Lite.

And eventually we got to this point. 1.2 seconds to run that inference is a lot of time, so the next step for me will be to figure out the biggest causes of the low performance.

With the goal in mind of providing a free software driver that companies can use to run inference on their products containing Vivante's NPU IP, I need those tasks to perform within at least the same order of magnitude as the closed-source solution provided by Vivante.

Right now Etnaviv is about twice as slow as running label_image with the OpenCL delegate on Vivante's driver, but the solution they provide uses a special delegate that is able to make better use of their hardware and is several times faster.

Current performance situation (label_image):

  • OpenCL delegate with Etnaviv: 1261.99 ms
  • OpenCL delegate with Galcore: 787.733 ms
  • CPU: 149.19 ms
  • TIM-VX delegate: 2.567 ms (!)

The plan is to first see why we are slower with the OpenCL delegate and fix it, and afterwards the real fun stuff will start: seeing how we can use more of the HW capabilities through the OpenCL API and with upstream TensorFlow Lite.

Next steps

Italo is cleaning up an initial submission for inclusion in Mesa upstream. Once that is done I will rebase my branch and start submitting features.

In parallel to upstreaming, I will be looking at what is needed to get closer to the performance of the closed source driver, for ML acceleration.

Thanks

There are a lot of people besides the ones mentioned above that have made this possible. Some of them are:

  • The Mesa community, for having put together such a great framework for GPU drivers. Their CI system has been great to track progress and avoid regressions.
  • The Etnaviv community, for all the previous reverse engineering work that documented most of the OpenCL specificities, for a great pair of drivers to base the work on and the very useful tooling around it.
  • And the Linux kernel community, that made it so easy to get the hardware recognized and the Etnaviv driver probed on it.

Last but not least, there are some individuals to whom I was able to turn when I needed help:

  • Christian Gmeiner (austriancoder)
  • Lucas Stach (lynxeye)
  • Neil Armstrong (narmstrong)
  • Faith Ekstrand (gfxstrand)
  • Karol Herbst (karolherbst)
A big thanks, it has been a lot of fun!