Summer has kept me busy with holidays, but I have managed to find a bit of time to keep hacking on the driver for the VeriSilicon NPU since the last update.
TL;DR
The issue with scaling the output correctly is now solved, and simple convolution operations work fine.
3D tensors are now supported as inputs, and we support strided convolutions as well, but only on 2D inputs for now.
The test workloads now run fast and stably, so I feel I have pretty solid ground beneath my feet.
There are three features left before I can run a real, full-fledged commercially interesting model:
- 3D inputs for strided convolutions
- Multiple output channels
- Padded convolutions
Re-quantization
The last update on this blog ended with my attempt to figure out how the raw convolution outputs had to be processed with the fields called post_shift and post_multiplier to get the right values in the final output.
After spending more time than I should probably have in a spreadsheet trying to find correlations, some desperate googling brought me to some research papers about optimizing quantization operations on integer-only hardware:
- Integer-Only Neural Network Quantization Scheme Based on Shift-Batch-Normalization
- Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference
That explains the meaning of the shift and multiplier, as these are the operations we can use to approximate the floating point division on integer hardware.
But to actually understand what the hardware was trying to do with them, it was useful to look at the QNNPACK implementation of requantization.
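As a rough illustration of the idea in those papers, here is a minimal Python sketch of integer-only requantization: a real-valued scale is approximated as an integer multiplier times a power of two, so the hardware only needs a multiply and a right shift. The function names, the 15-bit multiplier width, and the round-half-up behaviour are assumptions made for this sketch, not the confirmed behaviour of the NPU or of QNNPACK.

```python
import math


def quantize_scale(real_scale, multiplier_bits=15):
    """Approximate real_scale (0 < real_scale < 1) as multiplier * 2**-shift.

    multiplier_bits is an assumed width, chosen only for illustration.
    """
    if not 0.0 < real_scale < 1.0:
        raise ValueError("this sketch only handles scales in (0, 1)")
    # real_scale == mantissa * 2**exponent, with 0.5 <= mantissa < 1
    mantissa, exponent = math.frexp(real_scale)
    multiplier = round(mantissa * (1 << multiplier_bits))
    shift = multiplier_bits - exponent
    if multiplier == (1 << multiplier_bits):
        # Rounding pushed the mantissa up to 1.0; renormalize.
        multiplier >>= 1
        shift -= 1
    return multiplier, shift


def requantize(acc, multiplier, shift, zero_point, qmin=0, qmax=255):
    """Scale a 32-bit accumulator down to a quantized 8-bit output value."""
    rounding = 1 << (shift - 1)  # adds 0.5 before the shift (round half up)
    scaled = (acc * multiplier + rounding) >> shift
    return max(qmin, min(qmax, scaled + zero_point))


multiplier, shift = quantize_scale(0.0037)
print(multiplier, shift, requantize(1000, multiplier, shift, zero_point=119))
```

The floating point multiplication by the scale never happens at inference time: the multiplier and shift are computed once, ahead of time, and only integer operations remain.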
3D input tensor
This was pretty straightforward, as it was basically a matter of updating the code to take the added dimension into account, and of reordering the tensor elements, because the hardware expects depth-first order.
This was made much easier by some improvements to the scripts I use to observe the behavior of the closed source stack, by intercepting the communication with the kernel's GPL driver.
For example, this is the output when Mesa has generated a cmd stream that is functionally equivalent to what the blob sends to the kernel:
+ diff -u -U 100 /home/tomeu/mesa.txt /home/tomeu/galcore.txt
--- /home/tomeu/mesa.txt 2023-08-07 18:28:29.939750225 +0200
+++ /home/tomeu/galcore.txt 2023-08-07 18:28:42.116625362 +0200
@@ -1,176 +1,273 @@
{
- 0x0801028a, /* LOAD_STATE (1) Base: 0x00A28 Size: 1 Fixp: 0 */
- 0x00000011, /* PA.SYSTEM_MODE := PROVOKING_VERTEX_LAST=1,HALF_PIXEL_CENTER=1 */
- 0x08010e13, /* LOAD_STATE (1) Base: 0x0384C Size: 1 Fixp: 0 */
- 0x00000002, /* GL.API_MODE := OPENCL */
+ 0x00000000, /* UNKNOWN (0) */
+ 0x00000000, /* */
+ 0x00000000, /* UNKNOWN (0) */
+ 0x00000000, /* */
+ 0x00000000, /* UNKNOWN (0) */
+ 0x00000000, /* */
0x00000000, /* UNKNOWN (0) */
0x00000000, /* */
0x08010e4f, /* LOAD_STATE (1) Base: 0x0393C Size: 1 Fixp: 0 */
0x00000000, /* GL.OCB_REMAP_START := 0x0 */
0x08010e50, /* LOAD_STATE (1) Base: 0x03940 Size: 1 Fixp: 0 */
0x00000000, /* GL.OCB_REMAP_END := 0x0 */
0x08010e4c, /* LOAD_STATE (1) Base: 0x03930 Size: 1 Fixp: 0 */
0x00000010, /* GL.NN_CONFIG := UNK0=0x0,DISABLE_ZDPN=0,DISABLE_SWTILING=0,SMALL_BATCH=1,DDR_BURST_SIZE=0x0,UNK7=0,NN_CORE_COUNT=0x0,UNK12=0 */
0x08010428, /* LOAD_STATE (1) Base: 0x010A0 Size: 1 Fixp: 0 */
- 0xffff3000, /* PS.NN_INST_ADDR := *0xffff3000 */
+ 0x3348e780, /* PS.NN_INST_ADDR := *0x3348e780 */
0x08010429, /* LOAD_STATE (1) Base: 0x010A4 Size: 1 Fixp: 0 */
0x00000000, /* 0x010A4 */
0x08010e03, /* LOAD_STATE (1) Base: 0x0380C Size: 1 Fixp: 0 */
0x00000c23, /* GL.FLUSH_CACHE := DEPTH=1,COLOR=1,TEXTURE=0,PE2D=0,TEXTUREVS=0,SHADER_L1=1,SHADER_L2=0,UNK10=1,UNK11=1,DESCRIPTOR_UNK12=0,DESCRIPTOR_UNK13=0 */
0x08010e03, /* LOAD_STATE (1) Base: 0x0380C Size: 1 Fixp: 0 */
0x00000c23, /* GL.FLUSH_CACHE := DEPTH=1,COLOR=1,TEXTURE=0,PE2D=0,TEXTUREVS=0,SHADER_L1=1,SHADER_L2=0,UNK10=1,UNK11=1,DESCRIPTOR_UNK12=0,DESCRIPTOR_UNK13=0 */
0x00000000, /* UNKNOWN (0) */
0x00000000, /* */
}
map->layer_type = 0x0; /* (0) */
map->no_z_offset = 0x0; /* (0) */
map->kernel_xy_size = 0x2; /* (2) */
map->kernel_z_size = 0x4; /* (4) */
map->kernels_per_core = 0x1; /* (1) */
map->pooling = 0x0; /* (0) */
map->pooling_xy_size = 0x1; /* (1) */
map->prelu = 0x0; /* (0) */
map->nn_layer_flush = 0x1; /* (1) */
map->kernel_data_type = 0x0; /* (0) */
map->in_image_data_type = 0x0; /* (0) */
map->out_image_data_type = 0x0; /* (0) */
map->in_image_x_size = 0x4; /* (4) */
map->in_image_y_size = 0x4; /* (4) */
map->in_image_x_offset = 0x0; /* (0) */
map->in_image_y_offset = 0x0; /* (0) */
map->unused0 = 0x0; /* (0) */
map->brick_mode = 0x0; /* (0) */
map->brick_distance = 0x0; /* (0) */
map->relu = 0x0; /* (0) */
map->unused1 = 0x0; /* (0) */
map->post_multiplier = 0x0; /* (0) */
map->post_shift = 0x17; /* (23) */
map->unused2 = 0x0; /* (0) */
map->no_flush = 0x0; /* (0) */
map->unused3 = 0x0; /* (0) */
map->out_image_x_size = 0x3; /* (3) */
map->out_image_y_size = 0x3; /* (3) */
map->out_image_z_size = 0x1; /* (1) */
map->rounding_mode = 0x1; /* (1) */
map->in_image_x_offset_bit_3 = 0x0; /* (0) */
map->in_image_y_offset_bit_3 = 0x0; /* (0) */
map->out_image_tile_x_size = 0x3; /* (3) */
map->out_image_tile_y_size = 0x3; /* (3) */
-map->kernel_address = 0x3fffd00; /* (67108096) */
+map->kernel_address = 0xcd237f; /* (13443967) */
map->kernel_z_size2 = 0x0; /* (0) */
-map->in_image_address = 0xffff6000; /* (4294926336) */
-map->out_image_address = 0xffff7000; /* (4294930432) */
+map->in_image_address = 0x3348e240; /* (860414528) */
+map->out_image_address = 0x89ffc500; /* (2315240704) */
map->image_caching_mode = 0x0; /* (0) */
map->kernel_caching_mode = 0x1; /* (1) */
map->partial_cache_data_unit = 0x0; /* (0) */
map->kernel_pattern_msb = 0x0; /* (0) */
map->kernel_y_size = 0x2; /* (2) */
map->out_image_y_stride = 0x3; /* (3) */
map->kernel_pattern_low = 0x0; /* (0) */
map->kernel_pattern_high = 0x0; /* (0) */
map->kernel_cache_start_address = 0x800; /* (2048) */
map->kernel_cache_end_address = 0xa00; /* (2560) */
map->image_start_address = 0x0; /* (0) */
map->image_end_address = 0x800; /* (2048) */
map->in_image_border_mode = 0x0; /* (0) */
map->in_image_border_const = 0x7d; /* (125) */
map->unused4 = 0x0; /* (0) */
map->kernel_data_type_bit_2 = 0x0; /* (0) */
map->in_image_data_type_bit_2 = 0x0; /* (0) */
map->out_image_data_type_bit_2 = 0x0; /* (0) */
map->post_multiplier_1_to_6 = 0x1f; /* (31) */
map->post_shift_bit_5_6 = 0x0; /* (0) */
map->unused5 = 0x0; /* (0) */
map->in_image_x_stride = 0x4; /* (4) */
map->in_image_y_stride = 0x4; /* (4) */
map->out_image_x_stride = 0x3; /* (3) */
map->unused6 = 0x0; /* (0) */
map->post_multiplier_7_to_14 = 0x61; /* (97) */
map->out_image_circular_buf_size = 0x0; /* (0) */
map->unused7 = 0x0; /* (0) */
map->per_channel_post_mul = 0x0; /* (0) */
map->out_image_circular_buf_end_addr_plus_1 = 0x3ffffff; /* (67108863) */
map->unused8 = 0x0; /* (0) */
map->in_image_circular_buf_size = 0x0; /* (0) */
map->unused9 = 0x0; /* (0) */
map->in_image_circular_buf_end_addr_plus_1 = 0x3ffffff; /* (67108863) */
map->unused10 = 0x0; /* (0) */
map->coef_zero_point = 0x80; /* (128) */
map->out_zero_point = 0x77; /* (119) */
map->kernel_direct_stream_from_VIP_sram = 0x0; /* (0) */
map->depthwise = 0x0; /* (0) */
map->unused11 = 0x0; /* (0) */
map->unused12 = 0x0; /* (0) */
map->unused13 = 0x0; /* (0) */
map->unused14 = 0x0; /* (0) */
map->unused15 = 0x0; /* (0) */
map->unused16 = 0x0; /* (0) */
map->further1 = 0x0; /* (0) */
map->further2 = 0x0; /* (0) */
map->further3 = 0x3ffffff; /* (67108863) */
map->further4 = 0x7f800000; /* (2139095040) */
map->further5 = 0xff800000; /* (4286578688) */
map->further6 = 0x0; /* (0) */
map->further7 = 0x0; /* (0) */
map->further8 = 0x0; /* (0) */
0x40, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x01, 0x00, 0x2c, 0x99, 0x0e, 0x00, 0x00,
0x40, 0xea, 0x2c, 0xeb, 0x80, 0xaf, 0x80, 0x9b, 0x99, 0x80, 0x80, 0x13,
0x80, 0x80, 0x80, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00
0x69, 0xd3, 0x2d, 0x92, 0x07, 0x00, 0x64, 0x00, 0x0c, 0x22, 0x90, 0xd6,
0x53, 0xc9, 0xe2, 0x48, 0xe6, 0x4c, 0xa8, 0xeb, 0xd2, 0xf3, 0xb0, 0xf4,
0x2d, 0xa4, 0x3e, 0xf4, 0x0f, 0x7b, 0x98, 0x01, 0x41, 0x84, 0x92, 0x7e,
0xfa, 0x19, 0xf5, 0xda, 0xb3, 0x5a, 0xb7, 0xf3, 0x97, 0x95, 0x12, 0xe7,
0x51, 0x94, 0xcb, 0x5a, 0x1f, 0xa9, 0xc6, 0xc4, 0x1c, 0xa9, 0x92, 0x1f,
0xf7, 0x64, 0xc3, 0xca
0x77, 0x77, 0x77, 0x77, 0x77, 0x77, 0x77, 0x77, 0x77
This corresponds to a convolution with the following parameters:
- 8x8x1 input tensor
- 3x3x1 weight tensor
- stride == 2
The differences are due to different addresses being allocated between runs, plus a few that come from how Mesa's code is structured and shouldn't affect the end result.
At the top we have the payload of the submit IOCTL, followed by a struct with the configuration for the NN units themselves and then the buffers for the weights, input and output.
When running a convolution configuration that isn't yet supported, we will spot more differences and hopefully will be able to figure out the logic behind them.
Strided convolutions
The hardware doesn't really support strided convolutions, so these are "lowered" to 1-stride convolutions with added channels, as per this research paper:
By implementing the algorithm in the paper, we match the behavior of the blob, as with requantization. The paper covers only 2D input tensors, so I will need to check how the blob behaves with 3D inputs and figure out the logic behind it.
For now I have chosen to do the tensor manipulation on the CPU, but later on we will be able to use the TP units in the HW for this, reducing latency.
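To make the lowering more concrete, here is a small pure-Python sketch of one rearrangement that turns a strided 2D convolution into a 1-stride convolution with more channels. It is consistent with the shapes in the dump above (a 4x4 input with 4 channels and a 2x2 kernel for an 8x8 input, 3x3 weights and stride 2), but whether it is exactly what the paper or the blob does is an assumption. All function names are made up for this sketch, and it assumes that the input dimensions are multiples of the stride.

```python
def conv2d(image, kernel, stride=1):
    """Direct 'valid' 2D convolution (cross-correlation) on nested lists."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = (len(image) - kh) // stride + 1
    out_w = (len(image[0]) - kw) // stride + 1
    return [
        [
            sum(
                image[y * stride + i][x * stride + j] * kernel[i][j]
                for i in range(kh)
                for j in range(kw)
            )
            for x in range(out_w)
        ]
        for y in range(out_h)
    ]


def split_into_channels(plane, stride):
    """Turn one HxW plane into stride*stride planes of (H/stride)x(W/stride)."""
    h, w = len(plane) // stride, len(plane[0]) // stride
    return [
        [[plane[y * stride + dy][x * stride + dx] for x in range(w)] for y in range(h)]
        for dy in range(stride)
        for dx in range(stride)
    ]


def pad_to_multiple(kernel, stride):
    """Zero-pad the kernel bottom/right so each side is a multiple of stride."""
    kh = -(-len(kernel) // stride) * stride
    kw = -(-len(kernel[0]) // stride) * stride
    return [
        [kernel[i][j] if i < len(kernel) and j < len(kernel[0]) else 0 for j in range(kw)]
        for i in range(kh)
    ]


def lowered_conv2d(image, kernel, stride):
    """Same result as conv2d(image, kernel, stride), using only 1-stride convolutions."""
    in_channels = split_into_channels(image, stride)
    k_channels = split_into_channels(pad_to_multiple(kernel, stride), stride)
    partial = [conv2d(i, k, 1) for i, k in zip(in_channels, k_channels)]
    # A multi-channel convolution sums the per-channel results.
    return [
        [sum(p[y][x] for p in partial) for x in range(len(partial[0][0]))]
        for y in range(len(partial[0]))
    ]


image = [[(y * 8 + x) % 7 - 3 for x in range(8)] for y in range(8)]
kernel = [[1, -2, 3], [0, 4, -1], [2, 1, -3]]
assert lowered_conv2d(image, kernel, 2) == conv2d(image, kernel, 2)
```

The rearrangement only moves data around, so it is cheap but not free, which is why moving it off the CPU later should help with latency.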
Test suite
With so many different convolution parameters supported, I felt the need for a comfortable way of keeping regressions in check.
I wrote a simple pytest module that generates a TFLite model with a single convolution operation, varying the convolution parameters and the payloads across the configurations that we support.
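The part that enumerates configurations could look roughly like the sketch below. It covers only the parameter grid, not the TFLite model generation or the comparison of outputs, and the parameter values and helper names are hypothetical rather than taken from the actual module. The resulting list is the kind of thing one would hand to pytest.mark.parametrize.

```python
import itertools

# Hypothetical values, picked only to illustrate the grid.
INPUT_SIZES = [4, 8]
INPUT_CHANNELS = [1, 4]
KERNEL_SIZES = [2, 3]
STRIDES = [1, 2]


def is_supported(input_size, channels, kernel_size, stride):
    """Mirror the current limitations: strided convolutions only on 2D inputs."""
    if stride > 1 and channels > 1:
        return False
    return kernel_size <= input_size


def conv_cases():
    """Yield every (input_size, channels, kernel_size, stride) we can test today."""
    grid = itertools.product(INPUT_SIZES, INPUT_CHANNELS, KERNEL_SIZES, STRIDES)
    return [case for case in grid if is_supported(*case)]


def output_size(input_size, kernel_size, stride):
    """Expected output width/height of an unpadded convolution."""
    return (input_size - kernel_size) // stride + 1


for case in conv_cases():
    print(case, "->", output_size(case[0], case[2], case[3]))
```

Keeping the "is this supported yet" check in one place means that lifting a limitation, such as strided convolutions on 3D inputs, immediately widens the set of generated tests.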
At some point I will add a CI job, probably before sending the initial merge request.