Monday, May 29, 2023

Etnaviv NPU update 1: Planning for performance

As I wrote in the last update, my OpenCL branch is able to correctly run MobileNet v1 with the GPU delegate in TensorFlow Lite, albeit much slower than with VeriSilicon's proprietary stack.

In the weeks since then I have been investigating the performance difference, getting a better understanding of how the HW works and of what the explanation could be. Inference with Etnaviv took 1200 ms, while the proprietary stack did the same in less than 10 ms (120x faster!).
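
For those who want to take this kind of measurement themselves, below is a minimal sketch using TensorFlow Lite's C API with the GPU delegate attached. The model path is illustrative and tensor filling and error handling are omitted, so treat it as a starting point rather than the exact benchmark I ran:

  /* Minimal sketch: load a .tflite model, attach the GPU delegate and time a
   * single inference.  The model path is illustrative and the input tensor
   * is left unfilled; error handling is omitted for brevity. */
  #include <stdio.h>
  #include <time.h>

  #include "tensorflow/lite/c/c_api.h"
  #include "tensorflow/lite/delegates/gpu/delegate.h"

  int main(void) {
    TfLiteModel *model = TfLiteModelCreateFromFile("mobilenet_v1.tflite");

    TfLiteGpuDelegateOptionsV2 gpu_opts = TfLiteGpuDelegateOptionsV2Default();
    TfLiteDelegate *delegate = TfLiteGpuDelegateV2Create(&gpu_opts);

    TfLiteInterpreterOptions *options = TfLiteInterpreterOptionsCreate();
    TfLiteInterpreterOptionsAddDelegate(options, delegate);

    TfLiteInterpreter *interpreter = TfLiteInterpreterCreate(model, options);
    TfLiteInterpreterAllocateTensors(interpreter);

    /* The input tensor would be filled with preprocessed image data here. */

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    TfLiteInterpreterInvoke(interpreter);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    printf("inference: %.3f ms\n",
           (t1.tv_sec - t0.tv_sec) * 1000.0 +
           (t1.tv_nsec - t0.tv_nsec) / 1000000.0);

    TfLiteInterpreterDelete(interpreter);
    TfLiteInterpreterOptionsDelete(options);
    TfLiteGpuDelegateV2Delete(delegate);
    TfLiteModelDelete(model);
    return 0;
  }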

When trying to understand the big performance difference, I discovered that the existing reverse-engineering tools I had been using to learn how to run OpenCL workloads weren't working as expected. They detected only a single OpenCL kernel at the end of the execution, and there was no way that single kernel could be executing the whole network.

After a lot of fumbling around in the internets I stumbled upon a commit that included an interestingly-named environment variable: VIV_VX_DISABLE_TP_NN_EVIS. With it set, VeriSilicon's OpenVX implementation will execute the network without using the TP and NN fixed-function units or the EVIS instruction set (which helps reduce memory bandwidth use by allowing operations on packed int8 and int16 types).

With that environment variable set, OpenVX used regular OpenCL to run the inference, and the performance difference was interesting: 398.428 ms. Still much better than our time, but also more than 50 times slower than when the capabilities of the hardware are fully used. The reason is that there is only one core in the NPU that is able to run programmable kernels; the rest are fixed-function units, as I will explain next.
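
For the record, the variable just needs to be present in the environment of the process before the OpenVX context gets created. A minimal C sketch, assuming the proprietary libOpenVX checks it at context creation time (exporting it from the shell before launching the benchmark does the same thing):

  #include <stdlib.h>
  #include <VX/vx.h>

  int main(void) {
    /* Assumption: the proprietary OpenVX library reads this variable when the
     * context is created, so it has to be set before vxCreateContext(). */
    setenv("VIV_VX_DISABLE_TP_NN_EVIS", "1", 1);

    vx_context context = vxCreateContext();
    /* ... build and process the graph as usual ... */
    vxReleaseContext(&context);
    return 0;
  }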

Digging further into VeriSilicon's kernel driver and their marketing documents, I gathered that this particular NPU has 8 convolution cores (they call them NN cores) and 4 cores for accelerating some tensor operations (TP cores). Whatever these units cannot do has to be done on the single slow programmable core.

The next step was to understand how the proprietary stack makes use of the fixed-function units in the NPU.

The MobileNet v1 model I used contains these operations, as output by TFLite's model analyzer:

  Op#0 CONV_2D(T#88, T#6, T#4[28379, 17476, 18052, -2331, 17431, ...]) -> [T#5]
  Op#1 DEPTHWISE_CONV_2D(T#5, T#33, T#32[-249, 165, 173, -2, 158, ...]) -> [T#31]
...

[12 more pairs of CONV_2D and DEPTHWISE_CONV_2D]

...

  Op#27 AVERAGE_POOL_2D(T#29) -> [T#0]
  Op#28 CONV_2D(T#0, T#3, T#2[-5788, -4159, 2282, -6706, -9783, ...]) -> [T#1]
  Op#29 RESHAPE(T#1, T#86[-1, 1001]) -> [T#85]
  Op#30 SOFTMAX(T#85) -> [T#87]

As can be seen, it is basically a bunch of convolutions, with average pooling, a reshape and a SOFTMAX operation at the end.

By using some of the environment variables that are mentioned in this issue in GitHub, we can get some information on how the proprietary stack plans the execution on the hardware:

  operation_name:VXNNE_OPERATOR_TENSOR_TRANS operation_target:VXNNE_OPERATION_TARGET_TP
  operation_name:VXNNE_OPERATOR_RESHUFFLE operation_target:VXNNE_OPERATION_TARGET_TP
  operation_name:VXNNE_OPERATOR_CONVOLUTION operation_target:VXNNE_OPERATION_TARGET_NN
...

[34 more VXNNE_OPERATOR_CONVOLUTION on VXNNE_OPERATION_TARGET_NN] 

...

  operation_name:VXNNE_OPERATOR_POOLING operation_target:VXNNE_OPERATION_TARGET_SH
  operation_name:VXNNE_OPERATOR_FULLYCONNECTED operation_target:VXNNE_OPERATION_TARGET_TP
  operation_name:VXNNE_OPERATOR_SOFTMAX operation_target:VXNNE_OPERATION_TARGET_SH

From that we can see that the TP units are used to prepare the input tensor, that all the convolution operations go to the NN cores, and that the output of the convolutions is then run through a pooling operation on the programmable core, whose output goes to the TP cores for the fully-connected layer, before finishing with SOFTMAX back on the programmable core.

So in this case, only a small part of the network is actually run on the programmable core, via OpenCL...

Next steps 

What I will be working on next:

  1. Adapt the existing RE tooling to dump information regarding NN and TP workflows
  2. Start to fill in the data structures by reading the code of VeriSilicon's kernel driver, which executes some trivial workloads, presumably to reset the HW between context switches to prevent information leaks.
  3. Write some simple OpenVX graphs that exercise each of the operations that the documentation claims are supported by the NPU (see the convolution sketch after this list).
  4. Observe the data that VeriSilicon's userspace stack passes to the kernel, and infer from there the exact layout of the configuration buffers that program the fixed-function units.
  5. Hack Mesa to send an NN job if the name of the CL kernel contains "convolution" (also sketched after this list).
  6. Get things working for this specific network and measure performance.
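
For step 3, the plan is to build the smallest possible OpenVX graphs, one operation at a time, so that whatever ends up in the configuration buffers can be attributed to a single known operation. Below is a sketch of what such a graph could look like for a convolution, using the vx_khr_nn extension; the tensor dimensions, data types and convolution parameters are illustrative only:

  /* Sketch of step 3: a one-node OpenVX graph exercising vxConvolutionLayer
   * from the vx_khr_nn extension, so the blobs programming the NN units can
   * be matched to a single known operation. */
  #include <VX/vx.h>
  #include <VX/vx_khr_nn.h>

  int main(void) {
    vx_context ctx = vxCreateContext();
    vx_graph graph = vxCreateGraph(ctx);

    vx_size in_dims[4]  = {224, 224, 3, 1};   /* W, H, C, N */
    vx_size w_dims[4]   = {3, 3, 3, 32};      /* kW, kH, Cin, Cout */
    vx_size b_dims[1]   = {32};
    vx_size out_dims[4] = {112, 112, 32, 1};

    vx_tensor input   = vxCreateTensor(ctx, 4, in_dims, VX_TYPE_INT8, 0);
    vx_tensor weights = vxCreateTensor(ctx, 4, w_dims, VX_TYPE_INT8, 0);
    vx_tensor biases  = vxCreateTensor(ctx, 1, b_dims, VX_TYPE_INT32, 0);
    vx_tensor output  = vxCreateTensor(ctx, 4, out_dims, VX_TYPE_INT8, 0);

    vx_nn_convolution_params_t params = {
      .padding_x = 1,
      .padding_y = 1,
      .overflow_policy = VX_CONVERT_POLICY_SATURATE,
      .rounding_policy = VX_ROUND_POLICY_TO_ZERO,
      .down_scale_size_rounding = VX_NN_DS_SIZE_ROUNDING_FLOOR,
      .dilation_x = 0,
      .dilation_y = 0,
    };

    vxConvolutionLayer(graph, input, weights, biases,
                       &params, sizeof(params), output);

    /* Verification is presumably where the proprietary stack decides how to
     * map the graph onto the NN/TP/SH units; processing submits the job. */
    vxVerifyGraph(graph);
    vxProcessGraph(graph);

    vxReleaseGraph(&graph);
    vxReleaseContext(&ctx);
    return 0;
  }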
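
And for step 5, the hack really is as crude as it sounds: look at the kernel name at launch time and divert to an NN job if it looks like a convolution. The type and helpers in this sketch are made-up placeholders, not existing Mesa or Etnaviv code:

  /* Hypothetical sketch of step 5 -- none of these types or helpers exist in
   * Mesa today; they stand in for wherever the kernel launch path lives. */
  #include <stdio.h>
  #include <string.h>

  struct cl_kernel_info {                    /* stand-in for Mesa's kernel object */
     const char *name;
  };

  static void submit_nn_job(const struct cl_kernel_info *k)     { printf("NN job: %s\n", k->name); }
  static void submit_shader_job(const struct cl_kernel_info *k) { printf("SH job: %s\n", k->name); }

  static void launch_kernel(const struct cl_kernel_info *kernel)
  {
     /* Crude heuristic: anything that looks like a convolution goes to the NN
      * units, everything else keeps using the programmable (SH) core. */
     if (strstr(kernel->name, "convolution"))
        submit_nn_job(kernel);
     else
        submit_shader_job(kernel);
  }

  int main(void)
  {
     struct cl_kernel_info k = { .name = "depthwise_convolution_3x3" };
     launch_kernel(&k);
     return 0;
  }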

If inference is at least 3x faster than on the CPU, I would call this good enough to be useful and will switch to upstreaming. The Mesa side of it doesn't look that bad, but I think the bigger challenge will be getting something merged in TensorFlow that can run fast on this hardware.

The most reasonable approach I have been able to think of would be new CL C and SPIR-V vendor extensions that add an intrinsic for the whole convolution operation (with parameters similar to those of the vxConvolutionLayer node).

The GPU delegate in TensorFlow Lite would use it on the Vivante NPU, and Mesa would have a robust way of knowing that the kernel should be run as an NN job, and with what configuration.
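
To make that idea a bit more concrete, the kernel that the delegate emits for a convolution could then reduce to little more than a call to the new built-in. The following is a purely hypothetical sketch: the extension name, the built-in function and its parameters are all invented here, loosely mirroring the vxConvolutionLayer parameters:

  /* Hypothetical OpenCL C -- this extension and built-in do not exist (yet);
   * the idea is that Mesa would lower the call into the configuration buffer
   * of a single NN job instead of compiling it to shader code. */
  #pragma OPENCL EXTENSION cl_viv_nn_convolution : enable

  __kernel void convolution_layer(__global const char *input,
                                  __global const char *weights,
                                  __global const int  *biases,
                                  __global char       *output)
  {
      viv_nn_convolution(input, weights, biases, output,
                         /* padding_x, padding_y   */ 1, 1,
                         /* stride_x, stride_y     */ 2, 2,
                         /* dilation_x, dilation_y */ 1, 1);
  }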

That's a lot of work, but at this point I would say that afterwards I will start looking at making fuller use of the NPU's capabilities by doing something similar for the operations that the TP cores can accelerate.