Monday, May 29, 2023

Etnaviv NPU update 1: Planning for performance

As I wrote in the last update, my OpenCL branch is able to correctly run MobileNet v1 with the GPU delegate in TensorFlow-Lite, albeit much slower than with VeriSilicon's proprietary stack.

In the weeks that passed I have been investigating the performance difference, understanding better how the HW works and what could the explanation be. Inference with Etnaviv took 1200 ms, while the proprietary stack did the same in less than 10 ms (120x faster!).

When trying to understand the big performance difference I discovered that the existing reverse engineering tools that I had been using to understand how to run OpenCL workloads weren't working. They detected a single OpenCL kernel at the end of the execution, and there was no way that single kernel could be executing the whole network.

After a lots of fumbling around in the internets I stumbled upon a commit that included an interestingly-named environment variable: VIV_VX_DISABLE_TP_NN_EVIS. With it, VeriSilicon's OpenVX implementation will execute the network without using nor the TP or NN fixed-function units, nor the EVIS instruction set (which helps with reducing memory bandwith use by allowing operations on packed int8 and int16 types).

With that environment variable OpenVX was using regular OpenCL to run the inference, and the performance difference was interesting: 398.428 ms. Still much better than our time, but also more than 50 times slower than when fully using the capabilities of the hardware. The reason for this is that there is only one core in the NPU that is able to run programmable kernels. The rest are fixed-function units as I'm going to explain next.

Digging further in VeriSilicon's kernel driver and on marketing documents I gathered that this particular NPU has 8 convolution cores (they call them NN cores) and 4 cores for accelerating some tensor operations (TP cores). What these units cannot do, has to be done in the single slow programmable core.

Next step was to understand how the proprietary stack made use of the fixed function units in the NPU.

The MobileNet v1 model I used contains these operations, as output by TFLite's model analyzer:

  Op#0 CONV_2D(T#88, T#6, T#4[28379, 17476, 18052, -2331, 17431, ...]) -> [T#5]
  Op#1 DEPTHWISE_CONV_2D(T#5, T#33, T#32[-249, 165, 173, -2, 158, ...]) -> [T#31]
...

[12 more pairs of CONV_2D and DEPTHWISE_CONV_2D]

...

  Op#27 AVERAGE_POOL_2D(T#29) -> [T#0]
  Op#28 CONV_2D(T#0, T#3, T#2[-5788, -4159, 2282, -6706, -9783, ...]) -> [T#1]
  Op#29 RESHAPE(T#1, T#86[-1, 1001]) -> [T#85]
  Op#30 SOFTMAX(T#85) -> [T#87]

As can be seen, it is basically a bunch of convolutions with a final reshaping and a SOFTMAX operation at the end. 

By using some of the environment variables that are mentioned in this issue in GitHub, we can get some information on how the proprietary stack plans the execution on the hardware:

  operation_name:VXNNE_OPERATOR_TENSOR_TRANS operation_target:VXNNE_OPERATION_TARGET_TP
  operation_name:VXNNE_OPERATOR_RESHUFFLE operation_target:VXNNE_OPERATION_TARGET_TP
  operation_name:VXNNE_OPERATOR_CONVOLUTION operation_target:VXNNE_OPERATION_TARGET_NN
...

[34 more VXNNE_OPERATOR_CONVOLUTION on VXNNE_OPERATION_TARGET_NN] 

...

  operation_name:VXNNE_OPERATOR_POOLING operation_target:VXNNE_OPERATION_TARGET_SH
  operation_name:VXNNE_OPERATOR_FULLYCONNECTED operation_target:VXNNE_OPERATION_TARGET_TP
  operation_name:VXNNE_OPERATOR_SOFTMAX operation_target:VXNNE_OPERATION_TARGET_SH

From that we can see that the TP units are used to prepare the input tensor, then all convolution operations are going to the NN cores, and then the output of the convolutions is passed through a pooling operation in the programmable core, passing its input to the TP cores for further processing and then finishing with SOFTMAX on the programmable cores.

So in this case, only a small part of the network is actually ran on the programmable cores, via OpenCL...

Next steps 

What I will be working on next:

  1. Adapt the existing RE tooling to dump information regarding NN and TP workflows
  2. Start to fill the data structures by reading the code of VeriSilicon's kernel driver, which executes some trivial workloads to, presumably, reset the HW between context switches to prevent information leaks.
  3. Write some simple OpenVX graphs that exercise each of the operations that the documentation claims to be supported by the NPU.
  4. Observe the data that VeriSilicon's userspace stack passes to the kernel, and infer from there the exact layout of the configuration buffers that program the fixed-function units.
  5. Hack Mesa to send a NN job if the name of the CL kernel contains "convolution".
  6. Get things working for this specific network and measure performance.

If performance is at least 3x faster than running the inference on the CPU, I would call this good enough to be useful and I will switch to upstreaming. The Mesa side of it doesn't look that bad, but I think the bigger challenge will be getting something merged in TensorFlow that can run fast on this hardware.

The most reasonable approach I have been able to think of would be adding new CL C and SPIR-V vendor extensions that add a new intrinsic for the whole convolution operation (with parameters similar to those of the vxConvolutionLayer node).

The GPU delegate in TensorFlow Lite would use it on the Vivante NPU and Mesa would have a robust way of knowing that this kernel should be run with a NN job, and with what configuration.

That's a lot of work, but I would say at this point that afterwards I will start looking at making fuller use of the NPU's capabilities by doing something similar with the operations that the TP cores can accelerate.

Wednesday, April 26, 2023

A long overdue update

Cannot believe it has been years since my last update here!

There are two things that I would like to tell people about:

The first is that I no longer work at Collabora. It has been almost 13 years full of excitement and recently I came to believe that I wanted a proper change.

They are great folks to work with, so if you are thinking of a career change and want to do open-source stuff upstream, I recommend you to consider them.

And the other topic is what I have been working on lately: a free software driver for the NPUs that VeriSilicon sells to SoC vendors.

TL;DR

tomeu@arm-64:~/tensorflow/build/examples/label_image$ SMALLER_SOFTMAX=1 RUSTICL_ENABLE=etnaviv LD_LIBRARY_PATH=/home/tomeu/opencl/lib LIBGL_DRIVERS_PATH=/home/tomeu/opencl/lib/dri/ ./label_image --gpu_backend=cl --use_gpu=true --verbose 1 --tflite_model ../../../assets/mobilenet_quant_v1_224.tflite --labels ../../../assets/labels.txt --image ../../../assets/grace_hopper.bmp --warmup_runs 1 -c 1

[snip]
INFO: invoked
INFO: average time: 1261.99 ms
INFO: 0.666667: 458 bow tie
INFO: 0.294118: 653 military uniform
INFO: 0.0117647: 835 suit
INFO: 0.00784314: 611 jersey
INFO: 0.00392157: 922 book jacket

That is TensorFlow Lite's OpenCL delegate detecting objects with Etnaviv from Grace Hopper's portrait in military uniform.

The story behind this work

Many years ago, when I was working on the operating system for the One Laptop Per Child project, I became painfully aware of the problems derived by IP vendors not providing the source code for their drivers.

This and other instances of the same problem motivated me to help out on the Panfrost project, writing a free software driver for the Mali GPUs by Arm. That gave me a great opportunity to learn about reverse engineering from Alyssa Rosenzweig.

Nowadays the Mesa project contains drivers for most GPUs out there, some maintained by the same companies that develop the IP, some by their customers and hobbyists alike. So the problem of the availability of source code for GPU drivers is pretty much solved.

Only that, with the advent of machine learning in the edge, we are reliving this problem with the drivers for accelerating those workloads with NPUs, TPUs, etc.

Vivante's NPU IP is very closely based on their GPUs. And it is pretty popular, being included in SoCs by Amlogic, Rockchip, NXP, Broadcom and more.

We already have a reasonably complete driver (Etnaviv) for their GPU IP, so I started by looking at what the differences were and how much of the existing userspace and kernel drivers we could reuse.

The kernel driver works with almost no changes, just took me some time to implement the hardware initialization properly in upstream. As of Linux 6.3 the driver loads correctly on Khadas' VIM3, but for a chance at decent performance this patch is needed:

[PATCH] arm64: dts: VIM3: Set the rates of the clocks for the NPU

Due to its experimental status, it is disabled by default in the device tree. To enable it, add the below to arch/arm64/boot/dts/amlogic/meson-g12b-a311d-khadas-vim3.dts:

&npu {
       status = "okay";
};

Enabling Etnaviv for other boards with this IP should be relatively straightforward, by describing how the HW is initialized by inspecting the downstream kernel sources for the board in question.

Mesa has seen most of the work, as this IP is compute-only and the userspace driver only targeted OpenGL ES.

First step was wiring up the existing driver to Mesa's OpenCL implementation, and then I focused on getting the simplest kernel to correctly run. For this and all the subsequent work, the reverse-engineering tools used by the Etnaviv community have been of great use.

At that point I had to pause the work to focus on other unrelated stuff, but Collabora's Italo Nicola and Faith Ekstrand did great work to extend the existing compiler to generate OpenCL kernels.

Once I didn't have a day job getting in the way anymore, I started adding the features needed to run the label_image example in TensorFlow Lite.

And eventually we got to this point. 1.2 seconds to run that inferrence is a lot of time, so the next steps for me will be to figure out what are the biggest causes for the low performance.

With the goal in mind of providing a free software driver that companies can use to run inferrence on their products containing Vivante's NPU IP, I need for those tasks to be performanced at at least the same order of magnitude as the closed source solution provided by Vivante.

Right now Etnaviv is about twice as slow as running label_image with the OpenCL delegate on Vivante's driver, but the solution that they provide uses a special delegate that is able to better use their hardware is several times faster.

Current performance situation (label_image):

  • OpenCL delegate with Etnaviv: 1261.99 ms
  • OpenCL delegate with Galcore: 787.733 ms
  • CPU: 149.19 ms
  • TIM-VX delegate: 2.567 ms (!)

The plan is to first see why we are slower with the OpenCL delegate and fix it, and afterwards the real fun stuff will start: seeing how we can use more of the HW capabilities through the OpenCL API and with upstream TensorFlow Lite.

Next steps

Italo is cleaning up an initial submission for inclusion in Mesa upstream. Once that is done I will rebase my branch and start submitting features.

In parallel to upstreaming, I will be looking at what is needed to get closer to the performance of the closed source driver, for ML acceleration.

Thanks

There is a lot of people besides the ones mentioned above that have made this possible. Some of they are:

  • The Mesa community, for having put together such a great framework for GPU drivers. Their CI system has been great to track progress and avoid regressions.
  • The Etnaviv community, for all the previous reverse engineering work that documented most of the OpenCL specificities, for a great pair of drivers to base the work on and the very useful tooling around it.
  • And the Linux kernel community, that made it so easy to get the hardware recognized and the Etnaviv driver probed on it.

Last but not least, there are some individuals to whom I was able to turn when I needed help:

  • Christian Gmeiner (austriancoder)
  • Lucas Stach (lynxeye)
  • Neil Armstrong (narmstrong)
  • Faith Ekstrand (gfxstrand)
  • Karol Herbst (karolherbst)
A big thanks, it has been a lot of fun!