In the previous update I explained that the programmable core in this NPU (VIPNano-QI) is too slow to run inference workloads substantially faster than the CPUs. The vendor stack achieves acceptable inference rates by running most of the work on fixed-function units that can perform different kinds of convolutions and transformations of tensors.
Most of the work is done by the convolution units that VeriSilicon calls NN cores, so that is what I have been focusing on at this stage. I think that even if we keep doing all the tensor transformations on the programmable core, just using the NN units could already get us usable performance.
By looking at the ioctls that VeriSilicon's userspace stack sends to the kernel, it became clear that the NN jobs contain little more than a pointer to a structure that configures the NN fixed-function units. Luckily I didn't need to reverse engineer it from scratch, as VeriSilicon's out-of-tree kernel driver is GPL and contains two instances of programming this hardware with a trivial job (a 2x2x1 kernel with a single bias value).
It took some tedious work to translate what that code does into a C struct, but this was the initial version:
struct etna_nn_params {
    uint32_t op_type : 1; /* conv: 0 fully_connected: 1 */
    uint32_t no_z_offset : 1;
    uint32_t kernel_x_size : 4;
    uint32_t kernel_z_size : 14; /* & 0x3FFF */
    uint32_t kernels_per_core : 7;
    uint32_t zero1 : 2;
    uint32_t zero2 : 1;
    uint32_t zero3 : 1;
    uint32_t nn_layer_flush : 1;
    uint32_t kernel_data_type : 2; /* UINT8 0x2 INT8 0x0 */
    uint32_t in_image_data_type : 2; /* UINT8 0x2 INT8 0x0 */
    uint32_t out_image_data_type : 2; /* UINT8 0x2 INT8 0x0 */
    uint32_t in_image_x_size : 13;
    uint32_t in_image_y_size : 13;
    uint32_t zero4 : 3;
    uint32_t zero5 : 3;
    uint32_t unused0 : 1;
    uint32_t zero6 : 16;
    uint32_t zero7 : 1;
    uint32_t enable_relu : 1;
    uint32_t zero9 : 1;
    uint32_t post_shift : 6;
    uint32_t unused1 : 2;
    uint32_t zero10 : 1;
    uint32_t zero11 : 1;
    uint32_t unused2 : 2;
    uint32_t out_image_x_size : 13;
    uint32_t out_image_y_size : 13;
    uint32_t out_image_z_size : 14;
    uint32_t zero12 : 2; /* 0x0 */
    uint32_t zero13 : 1; /* (0 >> 3) & 0x1 */
    uint32_t zero14 : 1; /* (0 >> 3) & 0x1 */
    uint32_t unk0 : 7; /* 1 */
    uint32_t unk1 : 7; /* 1 */
    uint32_t kernel_address : 26; /* >> 6 */
    uint32_t kernel_z_size2 : 6; /* >> 14 */
    uint32_t in_image_address;
    uint32_t out_image_address;
    uint32_t unused3 : 12;
    uint32_t kernel_y_size : 4;
    uint32_t out_image_y_size2 : 16; /* maybe stride? */
    uint32_t zero15;
    uint32_t zero16;
    uint32_t zero17;
    uint32_t kernel_cache_end_address;
    uint32_t zero19;
    uint32_t image_end_address;
    uint32_t zero20 : 2;
    uint32_t zero21 : 16;
    uint32_t kernel_data_type_bit_2 : 1;
    uint32_t in_image_data_type_bit_2 : 1;
    uint32_t out_image_data_type_bit_2 : 1;
    uint32_t zero22 : 6;
    uint32_t post_shift_bit_5_6 : 2;
    uint32_t unused4 : 3;
    uint32_t in_image_stride : 16;
    uint32_t in_image_y_size2 : 16; /* again? */
    uint32_t out_image_stride : 16;
    uint32_t unused5 : 8;
    uint32_t zero23 : 8;
    uint32_t zero24 : 26; /* 0 >> 6 */
    uint32_t zero25 : 1;
    uint32_t zero26 : 1;
    uint32_t zero27 : 1; /* 0 >> 4 */
    uint32_t zero28 : 1; /* 0 >> 4 */
    uint32_t zero29 : 1;
    uint32_t kernel_data_type_bit_3 : 1;
    uint32_t unk2 : 26; /* 0xFFFFFFFF >> 6 */
    uint32_t unused6 : 4;
    uint32_t zero30 : 1;
    uint32_t in_image_data_type_bit_3 : 1;
    uint32_t zero31 : 26; /* 0 >> 6 */
    uint32_t out_image_data_type_bit_3 : 1;
    uint32_t unused7 : 6;
    uint32_t unk3 : 26; /* 0xFFFFFFFF >> 6 */
    uint32_t unused8 : 6;
    uint32_t coef_zero_point : 8;
    uint32_t out_zero_point : 8;
    uint32_t zero32 : 1;
    uint32_t zero33 : 1;
    uint32_t zero34 : 8;
    uint32_t unused9 : 6;
    uint32_t zero35;
    uint32_t zero36 : 4;
    uint32_t zero37 : 28; /* 0 >> 4 */
    uint32_t zero38 : 4;
    uint32_t zero39 : 28; /* 0 >> 4 */
    uint32_t further1;
    uint32_t further2;
    uint32_t further3;
    uint32_t further4;
    uint32_t further5;
    uint32_t further6;
    uint32_t further7;
    uint32_t further8;
};
As you can see, there are a lot of "zero" and "unused" fields. I think most of them are actually used for something, as HW engineers don't tend to like wasting bits. By adding instrumentation for dumping these structs to the reverse engineering tooling, I will get a better idea of what each field means and does.
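As a rough illustration of how some of these fields seem to fit together, here is a minimal sketch of my own (not code from the driver) that fills in a few of them for a trivial 2x2x1 convolution, assuming the comments above about the shifted kernel address and the split kernel_z_size are correct:

/* Minimal sketch, my own guess based on the struct above: fill the
 * fields needed for a trivial 2x2x1 convolution job. */
static void
fill_trivial_conv(struct etna_nn_params *p, uint32_t kernel_addr,
                  uint32_t in_addr, uint32_t out_addr, uint32_t kernel_z)
{
    p->op_type = 0;                        /* 0 = convolution */
    p->kernel_x_size = 2;
    p->kernel_y_size = 2;
    p->kernel_z_size = kernel_z & 0x3FFF;  /* low 14 bits */
    p->kernel_z_size2 = kernel_z >> 14;    /* high 6 bits */
    p->kernel_address = kernel_addr >> 6;  /* stored shifted, so 64-byte aligned */
    p->in_image_address = in_addr;
    p->out_image_address = out_addr;
}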
The first time I submitted a job with the same configuration as the kernel's trivial reset job I got GPU hangs, and looking further showed that the buffer containing the convolution filters must follow a specific format.
Going back to the kernel driver sources, I used the same kernel/filter buffer and the GPU didn't hang anymore. That kernel had all zeroes as the weights, and indeed my output buffer was now full of zeroes.
Then I tried to lay out my own weights in the format that I inferred from the kernel driver source code, but I wasn't able to get any job to run to completion without hangs, and the output buffer remained unchanged.
To figure out what I was missing about how the weights (and the biases) need to be placed in the buffer, I added code to the reverse engineering tooling to dump the weights buffer. With that buffer, and after playing a bit with the sizes of the output, input and kernel buffers, I finally got a job to run with non-zero weights.
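The dumping code itself is nothing special; a small sketch of the kind of helper I am talking about (the name and signature are mine, not the actual tooling) would be something like:

#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

/* Illustrative only: print a mapped buffer as rows of 16 bytes, so the
 * layout of the weights can be compared between the blob's runs and mine. */
static void
dump_buffer(const char *tag, const uint8_t *buf, size_t size)
{
    for (size_t i = 0; i < size; i++) {
        if (i % 16 == 0)
            printf("\n%s %08zx:", tag, i);
        printf(" %02x", buf[i]);
    }
    printf("\n");
}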
What I am doing right now is slowly zeroing out the weights buffer to figure out which bits are data, which are control, and what effect the changes have on the output.
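In case it helps to picture the process, here is a sketch of that probing loop; run_job() is a hypothetical placeholder for submitting the NN job and waiting for it, and reference holds the output of an untouched run:

#include <stdint.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical sketch, not the actual tooling: zero one byte of the
 * weights buffer at a time, re-run the job, and report whether the
 * output moved away from a reference run. */
static void
probe_weights(uint8_t *weights, size_t weights_size,
              const uint8_t *output, const uint8_t *reference,
              size_t output_size, void (*run_job)(void))
{
    for (size_t i = 0; i < weights_size; i++) {
        uint8_t saved = weights[i];

        weights[i] = 0;
        run_job();                 /* submit the NN job and wait for completion */

        if (memcmp(output, reference, output_size) != 0)
            printf("byte %zu changes the output\n", i);

        weights[i] = saved;        /* restore before probing the next byte */
    }
}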
I hope that by the next update I will have documented the format of the weights buffer and will be able to run at least one kind of convolution!