Thursday, December 21, 2023

Etnaviv NPU update 13: Don't cross the tensors

"Don't cross the streams. It would be bad."

IR refactorings

A big part of what I have been up to in the past two weeks has been a serious refactoring of the data structures that hold the model data in the different phases until the HW configuration is generated.

What we had was enough for models with trivial control flow such as MobileNetV1, but more recent models for object classification and detection make use of more operations, and those are linked to each other non-sequentially.

The image below shows six of the more than a hundred operations in the SSDLite MobileDet model:

A small subsection of SSDLite MobileDet

The adds will be "lowered" or converted to a special case of convolution in which the two input tensors are concatenated together as two channels of a single tensor. The last convolution in the fragment will need to have its input tensor processed to remove the stride, as the HW doesn't support strided convolutions natively. The processing of this tensor will be performed in an additional job that will run on the TP (tensor processing) cores in the NPU.

As you can probably imagine, the modifications to the operation graph will be far from trivial without the right data structures, so I looked at ways of refactoring the code that translates the model as given by TensorFlow Lite to the HW operations.

For now I have settled on having a separate data structure for the tensors, and having the operations refer to their input and output tensors by their indices in that list. In the future, I think we should move to an intermediate representation more akin to what is used in compilers, to support more complex lowerings of operations and reorganizations of the operations inside the model.
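To make the idea a bit more concrete, here is a minimal sketch of that kind of arrangement; the names are made up for the example and don't match the actual code in the branch:

#include <stdint.h>

struct ml_tensor {
   unsigned dims[4];           /* e.g. NHWC */
   uint8_t *data;              /* NULL for intermediate tensors */
};

struct ml_operation {
   enum { OP_CONV2D, OP_DEPTHWISE_CONV2D, OP_ADD } type;
   unsigned inputs[2];         /* indices into the tensor list */
   unsigned input_count;
   unsigned output;            /* index into the tensor list */
};

struct ml_model {
   struct ml_tensor *tensors;
   unsigned tensor_count;
   struct ml_operation *operations;
   unsigned operation_count;
};

/* A lowering can then append new intermediate tensors to the list and rewire
 * operations by just updating indices, without chasing pointers all over the
 * graph. */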

I will be thinking about this later next year, once I get object detection with SSDLite MobileDet running at a useful performance level. Ideally I would like to reuse NIR so drivers can do all the lowerings and optimizations they need without having to reinvent so much of an IR, but if it turns out that operations on tensors aren't a good fit for NIR, then I will be thinking of doing something similar just for it.

For NPUs with programmable cores it could be very interesting to have a pipeline of transformations that can go from very high level operations to GPGPU instructions, probably starting from a standard such as MLIR.

Tensor addition

I also put some time into pulling together all the information I had gathered about how the proprietary driver interacts with the HW when submitting tensor addition jobs, and spent a substantial amount of time looking at the different parameter combinations in a spreadsheet, with liberal use of CORREL() to get a hint of which parameters of the high-level operations are used as inputs in the formulas that produce the HW configuration.

Lowering the strides

Similarly to the above, there was a lot of staring at a spreadsheet, this time for the parameters of the TP jobs that transform the input tensor of a convolution with a stride different from one.

Status and next steps

Below is a rendering of the whole operation graph for the SSDLite MobileDet model, so people can get an idea of the dimensions and complexity of a modern model for edge object detection.

The model is currently running without anything exploding too badly, and all the convolutions are running correctly when run independently. But when run together, I see some bad results starting to flow around the middle of the graph, so that is what I will be debugging next.

The whole of SSDLite MobileDet

 

Wednesday, December 6, 2023

Etnaviv NPU update 12: Towards SSDLite MobileDet

During these last two weeks I have been working towards adding support for more operations and kinds of convolutions so we can run more interesting models. As a first target, I'm aiming at MobileDet, which, though a bit old by now (it was introduced in 2020), is still the state of the art in object detection on mobile, used for example in Frigate NVR.

I haven't mentioned it in a few updates, but all this work keeps being sponsored by Libre Computer, who are aiming to be the first manufacturer of single board computers to provide accelerated machine learning with open source components. Check out Alta and Solitude for the first such boards in the market.

Upstreaming

Igalia's Christian Gmeiner has been giving me great feedback on the merge request, and as part of that I submitted a patch to the kernel to retrieve some parameters that are needed when programming the hardware and that are best not left hardcoded.

This means that upstreaming to Mesa loses some urgency, as we are anyway going to have to wait until the merge window for 6.8 opens, after 6.7 final is out.

Convolutions with 5x5 weights

Until now I had implemented support only for weights with dimensions 1x1 (aka pointwise convolutions) and 3x3 (the most common by far). Some of the convolutions in MobileDet use 5x5 weight tensors though, so I had to implement support for them. It was a matter of adding some extra complexity to the code that compresses the weight tensors into the format that the hardware expects.

I implemented this for all kinds of supported convolutions: depthwise, strided, with padding, etc.

Tensor addition

I observed that the vendor blob implements addition operations with convolution jobs, so I looked deeper and saw that it was implementing the addition of two input tensors by placing them as the two channels of a single tensor, then passing them through a 1x1 convolution with a specially crafted weight tensor and bias vector.

This is working with hardcoded values for some specific input image dimensions, but I still need to gather more data so I can come up with a generic expression.
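Leaving the quantization parameters aside (which are exactly the part I still need to generalize), the equivalence that the blob exploits can be sketched like this; it's just an illustration, not the driver code:

/* Sketch of the lowering idea: adding two tensors element-wise is the same
 * as stacking them as the two channels of a single tensor and running a 1x1
 * convolution with weights {1, 1}.  The real HW job additionally has to fold
 * the input/output scales and zero points into the weights and bias, which
 * is what the hardcoded values currently paper over. */
static void add_as_1x1_convolution(const float *a, const float *b, float *out,
                                   unsigned elements)
{
   const float weights[2] = { 1.0f, 1.0f };   /* one 1x1x2 filter */
   const float bias = 0.0f;

   for (unsigned i = 0; i < elements; i++) {
      const float input[2] = { a[i], b[i] };  /* "concatenated" channels */

      out[i] = input[0] * weights[0] + input[1] * weights[1] + bias;
   }
}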

Softmax pooling

One more missing operation commonly used in models for mobile is pooling, in its different kinds: average, max, etc.

The blob implements these operations on the programmable core, with CL-like kernels.

So I dusted off the work that I did in the first half of 2023 and added code to Teflon for passing these operations to the Gallium drivers. Then I added a new kind of operation to the ML backend in Etnaviv to make use of the programmable core.

Things work fine, even if for now I am storing the kernel machine code in a blob inside the C code. The next step will be to implement the kernel in NIR and generate the machine code using the existing compiler in Etnaviv.

With this piece of work, we are now able to use all the hardware units in the NPU, and even if the programmable core in this configuration is really underpowered, it will allow us to keep the model in memory close to the NPU, instead of having to ping-pong between the NPU and CPU domains.

A new test suite

With new operations and kinds of convolutions being added, I was starting to have trouble testing all the possible combinations in a practical way, as the test suite that I had was taking more than 20 minutes for a full run.

To get around that, I reimplemented the tests in C++ with GoogleTest, which is supported by Emma Anholt's deqp-runner and will allow me to run the tests in parallel, making full use of the CPU cores in the board.

That made a big difference, but with so many testing combinations being added (+3000 as of now), it was still not fast enough for me. So I remembered an approach that we were considering to speed up execution of Vulkan and OpenGL conformance tests: caching the golden images that are used to compare and check that the output from the hardware is correct.

With that, the bottleneck is the network, as I store the cache in NFS, and I can run the full test suite in less than 3 minutes.
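The caching scheme itself is nothing fancy; roughly it works like the sketch below, where the paths and helper names are made up for the example:

/* Rough sketch of the golden-output cache: the reference result for a given
 * combination of test parameters is stored in a file named after those
 * parameters (on NFS in my setup), so subsequent runs just read it back
 * instead of recomputing it on the CPU. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

static bool get_golden(const char *cache_dir, const char *test_id,
                       uint8_t *golden, size_t size,
                       void (*compute_reference)(uint8_t *golden, size_t size))
{
   char path[512];
   snprintf(path, sizeof(path), "%s/%s.bin", cache_dir, test_id);

   FILE *f = fopen(path, "rb");
   if (f) {
      bool hit = fread(golden, 1, size, f) == size;
      fclose(f);
      if (hit)
         return true;                 /* cache hit, skip the reference run */
   }

   compute_reference(golden, size);   /* slow path: run the reference */

   f = fopen(path, "wb");
   if (f) {
      fwrite(golden, 1, size, f);
      fclose(f);
   }
   return true;
}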

Only that I started finding some tests that were randomly failing, especially when the cache of test results had already been brought into the filesystem cache on the board. After a lot of scratching my head, I came to realize that the Etnaviv kernel driver was trying to submit up to 4 jobs at the same time to the hardware, if userspace was fast enough to enqueue that many jobs before the previous ones had finished.

There is a kernel module parameter to set the number of jobs that are submitted to the hardware at any given point, and setting that to 1 took me back to rock solid test results, which is an absolute need for keeping the driver author's sanity.

Next steps

I have quickly added support for a lot of new operations and parameter combinations and the code is not as clean as I would like, in part due to the need for some refactoring.

So in the next days I will be investing some time in cleaning things up, and afterwards will move to more operations in MobileDet.


Friday, November 17, 2023

Etnaviv NPU update 11: Now twice as fast!

Progress

 
This update's highlight is that last week I finally got the TP jobs working, which allows us to do the tensor manipulation in the HW, removing 18 ms from the tensor preprocessing. We can currently use them for transposing tensors from the format that TensorFlow prefers to the one the HW expects and the other way around, and for lowering strided convolutions to regular ones.
 
This makes our image classification benchmark twice as fast, as expected:

tomeu@arm-64:~/mesa$ ETNA_MESA_DEBUG=ml_msgs python3.10 classification.py -i grace_hopper.bmp -m mobilenet_v1_1.0_224_quant.tflite -l labels_mobilenet_quant_v1_224.txt -e libteflon.so
Loading external delegate from build/src/gallium/targets/teflon/libteflon.so with args: {}
Running the NN job took 13 ms.
0.866667: military uniform
0.031373: Windsor tie
0.015686: mortarboard
0.007843: bow tie
0.007843: academic gown
time: 15.650ms

60 FPS is already quite interesting for many use cases, but the proprietary driver is able to do the same in around 8 ms, so there is still plenty of room for improvement.
 
Some preliminary testing indicates that enabling zero-run length compression in the weight buffers will make the biggest difference, so that is what I will be working on when I get back to performance work.

Additionally, I also got some experimental jobs running on the programmable core in this NPU, which will allow us to run more advanced models, which tend to use operations that the hardware couldn't have been designed for back then.

Upstreaming is going well; those interested can follow it here:
 
 

Next steps

 

These will be my priorities during the next couple of weeks, in order:

  1. Upstreaming
  2. Get the Mobilenet SSD V1 model running on the HW, for object detection
  3. Performance

Monday, November 6, 2023

Etnaviv NPU update 10: Upstreaming and TP jobs update

 If you remember the last update two weeks ago, I got MobileNetV1 working with good performance, and I was planning to move to upstreaming my changes to the Linux kernel and Mesa.

One of the kernel patches is now queued for the 6.7 release of the Linux kernel, and the other one has just been resent for reviews.

Regarding Mesa, I have made several cleanups and have started getting great review comments from Christian Gmeiner.

While waiting for feedback, I have started work on using the TP cores for tensor manipulation, which should be many times faster than the naive code I was running on the CPU for this.

Got some jobs producing the correct results, but I'm facing a problem with the GPU hanging right afterwards. I have already made a pass at the whole set of data that is sent to the HW (unit configuration, command stream and registers), but haven't found the problem yet. I will next improve the tooling around this and get a better view of the differences.

I hacked Mesa to use the out-of-tree driver and my code works that way, so it has to be something in the kernel driver.

During the next weeks I will keep incorporating feedback and see how I can fix the GPU hang on TP jobs.


Monday, October 23, 2023

Etnaviv NPU update 9: We got there!

Progress

Since the last update I finally got the whole of MobileNetV1 running at full accuracy on the NPU with Mesa:
tomeu@arm-64:~/mesa$ python3.10 classification.py -i grace_hopper.bmp -m mobilenet_v1_1.0_224_quant.tflite -l labels_mobilenet_quant_v1_224.txt -e libteflon.so
Loading external delegate from libteflon.so with args: {}
Processing the input took 18 ms.
Running the NN job took 13 ms.
Processing the output took 1 ms.
0.866667: military uniform
0.031373: Windsor tie
0.015686: mortarboard
0.007843: bow tie
0.007843: academic gown
time: 33.094ms
That takes us to a performance level around 3 times faster than running the same inference on the CPUs on the A311D SoC.

Most of the time (18 ms) is spent in my naive manipulation of the input tensor, transposing and reshuffling it to match what the HW expects. Once we learn to do these operations on the 4 tensor manipulation cores, this time should be brought close to zero.
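For reference, this is the kind of naive CPU shuffling that is eating those 18 ms; HWC to CHW is used here just as a stand-in, the exact layout the HW wants isn't shown:

#include <stdint.h>

/* Plain transpose of the input tensor between layouts, done element by
 * element on the CPU.  The TP cores can do this kind of reordering in
 * hardware. */
static void transpose_hwc_to_chw(const uint8_t *in, uint8_t *out,
                                 unsigned height, unsigned width,
                                 unsigned channels)
{
   for (unsigned c = 0; c < channels; c++)
      for (unsigned y = 0; y < height; y++)
         for (unsigned x = 0; x < width; x++)
            out[(c * height + y) * width + x] =
               in[(y * width + x) * channels + c];
}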

The 13 ms that the convolutions take on the NPU is still noticeably higher than the 8 ms that the blob achieves, but the optimizations mentioned in previous updates in this blog should bring us pretty close.
 

Next steps

Now that we have something that people can use in their products, I will switch to upstreaming mode.

I want to do a few cleanups to the Mesa code and then I will ask for people to review and ack so it can be merged. In the meantime, the draft merge request can be found here.

I would also like to have a CI job running to make sure it doesn't regress. But given that we don't use NIR as of yet and the dependencies on the rest of Mesa are minimal, there is probably little need as long as I'm the only person contributing to the code.


Friday, October 6, 2023

Etnaviv NPU update 8: Finally some inference

Progress

Last week I was a bit distracted with the trip to Paris for the Embedded Recipes conference, but afterwards I found some time for hacking and got some interesting results out of it.

Refactored the Gallium front-end

As commented in the previous update, I had found some limits in my testing due to the naive way that the front-end was scheduling jobs to the Gallium hardware-dependent driver.

I got to basically rewrite it (and removed any C++ remnants along the way) and moved to a model in which the drivers compile the operation blocks that they support to a format that can be quickly sent to the hardware.

As a side effect, I got proper memory management of the workload which allowed me to expand the testing I can do in a reasonable amount of time.

Also took the chance to rewrite the higher level scheduling data structure so all jobs in the same model partition are sent to the hardware in a single batch, for decreased latency.

Unfortunately I didn't get to remove copies of input and output tensors because the TensorFlow Lite API for this (TfLiteAsyncKernel) is undocumented and far from trivial. They seem to just be adding stuff on top to abstract whatever the Android folks may end up wanting to do.

Got MobileNet V1 to run

As part of the refactoring from above, I got multiple operations in the same model to work, which got us to correctly running some inferences, even if at low accuracy rates:

by Julien Langlois CC BY-SA 3.0

tomeu@arm-64:~/mesa$ LD_PRELOAD=libtensorflow_lite.so python3.10 class_device.py -i hen.bmp -m mobilenet_v1_0.25_224_quant.tflite -l labels_mobilenet_quant_v1_224.txt -e libteflon.so
Loading external delegate from build/src/gallium/targets/teflon/libteflon.so with args: {}
tflite_plugin_create_delegate
Teflon delegate: loaded etnaviv driver
INFO: Initialized TensorFlow Lite runtime.
PrepareDelegate
VERBOSE: Replacing 27 out of 31 node(s) with delegate (Teflon Delegate) node, yielding 2 partitions for the whole graph.
0.960784: hen
0.015686: cock
0.007843: goose
0.003922: Pembroke
0.003922: Ibizan hound
time: 22.802ms
tflite_plugin_destroy_delegate

This matched bit by bit the output from the blob, even if I was doing some tensor operations by hand, on the CPU. That also causes it to run far too slowly. We should be able to get that down to around 5ms once we learn how to drive the TP units for tensor manipulation.

Presented this work at Embedded Recipes 2023

Tired of only writing about all this in this blog, I took the chance given to me by Kevin Hilman to present it in front of a captive audience.


You can find the slides here, and listen to the talk at:



Next steps

The previous update went more in depth into what is left to do in the medium term, so I will just mention what I plan to do in the immediate future:

  1. Get input and output channels working at the 512 level, so we can run a higher accuracy version of the MobileNet V1 network
  2. Learn to use the TP units to remove those costly transpositions and reshuffles on the CPU (at this point, we would have something useful to people in the field)
  3. Upstream changes to the Linux kernel
  4. Propose Teflon to the Mesa folks

Tuesday, September 26, 2023

Etnaviv NPU update 7: Summer is over

Progress

With the kids back in school I have been able to work on the Vivante VIP NPU driver full-time during the two weeks after the last update, with quite some work coming out of the pipeline:

Found the problem with enabling the 8th NN core

Though I don't yet know exactly what the problem is, I found that by going back to a previous brute-force approach to powering up the NPU, the 8th core works just fine.

For now this unblocks the work and gets me closer to the initial goal of running a MobileNetv1 inference and seeing what the performance is like, so I'm leaving a proper fix for this for later.

I bet there's either a register that is being written in the wrong order, or a delay between register writes that is too short. Will have to delve into the power domain subsystem and/or the common clock framework in the Linux kernel to fix this one.

Added support for depthwise convolutions

MobileNetV1 introduced Separable Depthwise Convolutions (see the linked paper for an in-depth description), which are layers that contain a depthwise convolution to process each depth level separately, plus a pointwise convolution to rejoin them again. This offers the same result with 23x fewer multiplications, so it's very attractive for mobile use cases.

This hardware doesn't support depthwise convolutions directly, but we can lower them to regular convolutions after modifying the weight tensor to cover each IFM/depth separately.
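A sketch of that weight transformation follows; it's an illustration of the idea rather than the actual driver code, and it ignores quantization zero points (for quantized weights, the weight zero point would take the place of the literal zeroes):

#include <stdint.h>
#include <string.h>

/* Expand a depthwise KxKxC weight tensor (one KxK kernel per channel) into a
 * regular-convolution weight tensor of shape CxKxKxC that is zero everywhere
 * except where the input and output channel match, so each output channel
 * only "sees" its own input channel. */
static void expand_depthwise_weights(const int8_t *dw,   /* [K][K][C] */
                                     int8_t *full,       /* [C][K][K][C] */
                                     unsigned k, unsigned c)
{
   memset(full, 0, (size_t)c * k * k * c);

   for (unsigned oc = 0; oc < c; oc++)
      for (unsigned y = 0; y < k; y++)
         for (unsigned x = 0; x < k; x++)
            full[((oc * k + y) * k + x) * c + oc] = dw[(y * k + x) * c + oc];
}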

Added support for pointwise convolutions

For the second half of a Separable Depthwise Convolution, I just had to take into account that 1x1 kernels are packed in a different format in memory, as otherwise it would be very inefficient for each NN core to pull each 1-byte kernel separately from the memory bus.

Added support for unsigned weights

TensorFlow Lite has moved towards implementing a new quantization specification which gives preference to signed weights because of convenience, as symmetric quantization is simpler to implement. Unfortunately for us, our hardware works natively with unsigned weights so we would need to convert them if we were to use TFLite's new quantization.

But the models that Google themselves publish make use of the ancient tooling that still supports the old, unsigned quantization scheme, so I had to find a way of producing models with unsigned quantization for our test suite, to match what MobileNetV1 does.

That also implied moving to per-tensor quantization, instead of per-axis.
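For reference, the conversion that would be needed to feed TFLite's new signed weights to this unsigned-only hardware is just the textbook offset; a sketch, not code from the driver:

#include <stddef.h>
#include <stdint.h>

/* An int8 weight w with zero point zp represents the same real value as the
 * uint8 weight (w + 128) with zero point (zp + 128), since the scale is
 * unchanged: real = scale * (q - zp). */
static void signed_to_unsigned_weights(const int8_t *in, uint8_t *out,
                                       size_t count, int32_t *zero_point)
{
   for (size_t i = 0; i < count; i++)
      out[i] = (uint8_t)(in[i] + 128);

   *zero_point += 128;
}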

Added support for higher IFMs and OFMs (up to 256 each)

In the previous update I explained how support for multiple input and output channels (or feature maps) was added, but I wasn't able to test with more than 7 output channels because the 8th NN core was MIA.

With that solved, I was able to see what would be needed for convolutions with higher channel counts, such as those that MobileNetV1 uses (32, 64, 128, 256, 512 and 1024).

Each level implied revisiting the tiled format in which weights and biases are laid out in memory, making it more and more complex.

I got to 256, with 512 and 1024 bringing more changes in the tiled format that I still need to reverse engineer.


Next steps

Model partition compilation and resource management

I'm facing problems with testing coverage as we support so many different parameters that need to be tested in combination, with an explosion in the number of individual tests. Because of the hacky current state of the TFLite delegate (and Gallium state tracker) I'm not able to run all the tests, because I don't have proper resource management implemented and so we reach OOM before the end.

So my next task after I get back from Embedded Recipes will be to refactor the delegate implementation so we have a proper compilation of the model partitions. These will own the weight+bias buffers as well as the intermediate tensors, with each inference just feeding an input tensor to the partition and retrieving an output tensor at the end.

This will allow me to scale up the automated testing further, so I can keep adding new features with confidence, knowing that I'm not adding regressions.

Move development to Cottonwood A311D board

Da Xue of LibreComputer has got Etnaviv and Teflon working on the new boards that his company is releasing soon. One of them contains an A311D SoC, the same as the VIM3 I'm currently using for development. I will be initially targeting that one, and later make sure that it also works on the Cottonwood boards that will have the S905D3 SoC, which has a VIP Pico instead of a VIP Nano.

Besides being in general a great FOSS champion and specifically being supportive of ML inference with open source, Da is directly sponsoring this work, so I look forward to meeting him in Paris this week and exchanging notes.

Bigger coefficient tensors

The last known features missing before being able to run MobileNetV1 are IFMs and OFMs of 512 and 1024, each.

Hopefully it will only require some further tweaking of the tiled memory representation of the coefficient buffer.

Medium term goals

I don't expect performance to be that great yet, so I plan on switching the focus to it after the above has been accomplished. I expect the features below to make the most impact in improving performance:
  1. Avoid copies in and out of the model partition, by mapping user buffers to the NPU
  2. Use the TP units for tensor manipulation (transposing, mostly)
  3. Properly configuring the automatic caching of kernels and images in the internal on-chip SRAM
  4. Use the external SRAM for intermediate tensor data
  5. Chain all TP and NN jobs in a model partition in the same command stream
  6. Enable zero-run-length compression in the coefficient buffer
  7. Tune the tiling parameters for reduced memory bandwidth usage

Thursday, September 7, 2023

Etnaviv NPU update 6: Almost there!

Progress

This week started quite fruitfully; these features were added:

  • Convolutions with multiple input and output channels (input and output feature maps)
  • "Same" padding in convolutions

And with this we should have all the features we need to run a model such as MobileNet v1 and get some performance numbers to guide the next steps.

One more roadblock

Only that the NPU hangs when I try to use the 8th core... and this is required to run most detection models, as they start by convolving the input to 32 feature maps.

Have checked and we are sending to the kernel bit-identical command streams and input buffers, so I suspect the problem will be somewhere in the kernel.

So I plan to instrument the out-of-tree kernel driver and get some register and command stream dumps, in the hope that there is some bit in a magic register somewhere that I need to flip.

Want to try it out?

I'm not really looking forward to such work, so I decided to first invest some time cleaning things up a bit to make it easier for other people to play with this if they wish.

I have removed from my branch everything from my previous attempt at using OpenCL and have written some documentation about how to run the TensorFlow Lite delegate:

https://gitlab.freedesktop.org/tomeu/mesa/-/blob/teflon/docs/teflon.rst

You will need a VIM3 board, a recent mainline kernel and a Debian testing rootfs.


Thursday, August 24, 2023

Etnaviv NPU update 5: Harder convolutions!

Progress

Managed to squeeze in some time between holidays to hack on the NPU driver and got something out of it.

Since the last update I have:

  • implemented support for strided convolutions with more than one input channel, and
  • implemented support for more than one output channel, but for now only for a single input channel.

Next steps are to support convolutions with multiple input and output channels, and padding. Then see what is still missing so we can run MobileNet v1 and check the performance when using the NN units and doing the rest on the CPU.

As a reminder, I'm pushing all the code to this branch: https://gitlab.freedesktop.org/tomeu/mesa/-/commits/teflon/.

IRC channel

A bunch of us have started to gather in the #ml-mainline IRC channel on OFTC to discuss matters about doing accelerated ML with mainline, on embedded.

For those of you that may not have an IRC bouncer set up yet, you can easily join with the web chat UI, but in case others aren't in front of the keyboard when you type your question, I recommend using element.io with the Matrix IRC bridge:

https://blog.christophersmart.com/2022/03/21/joining-a-bridged-irc-network-on-element-matrix/

Embedded recipes

I have been invited to give a talk about all this ML with mainline effort at Embedded Recipes 2023, Paris 28-29 September. Slides and a recording will be published after the conference ends.

Sponsor

Last but not least, if I am able to invest so much effort in this, it is because the folks at LibreComputer have been supporting me financially these last couple of months.

Thanks to Da Xue for his support, it is greatly appreciated! It is awesome to see SBC vendors investing in the Linux upstream ecosystem.

Monday, August 7, 2023

Etnaviv NPU update 4: It's convoluting!

Summer has kept me busy with holidays, but I have managed to find a bit of time to keep hacking on the driver for the VeriSilicon NPU since the last update.

TL;DR

The issue with placing the output to the right scale is solved now, and simple convolution operations are working just fine.

3D tensors are now supported as inputs, and we support strided convolutions as well, but only on 2D inputs for now.

The test workloads are running fast and stably now, so I now feel I have pretty solid ground beneath my feet.

There are three features left before I can run a real, full-fledged commercially interesting model:

  1. 3D inputs for strided convolutions
  2. Multiple output channels
  3. Padded convolutions

Re-quantization

The last update in this blog left off at my attempt to figure out how the convolution raw outputs had to be processed with fields called post_shift and post_multiplier so I could get the right values in the final output.

After spending more time than I probably should have in a spreadsheet trying to find correlations, some desperate googling brought me to some research papers about optimizing quantization operations on integer-only hardware:

That explains the meaning of the shift and multiplier, as these are the operations we can use to approximate the floating point division on integer hardware.

But to actually understand what the hardware was trying to do with them, it was useful to look at the QNNPACK implementation of requantization.
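My reading of those two fields, following the papers and the QNNPACK code, is roughly the following; this is a sketch of the scheme, not a confirmed description of what the hardware does internally:

#include <stdint.h>

/* The floating point rescale factor between the accumulator scale and the
 * output scale is approximated as multiplier * 2^-shift, applied to the
 * 32-bit accumulator with a rounding right shift. */
static int32_t requantize(int32_t accumulator, int32_t multiplier,
                          unsigned shift)
{
   int64_t product = (int64_t)accumulator * multiplier;
   int64_t rounding = shift > 0 ? (int64_t)1 << (shift - 1) : 0;

   return (int32_t)((product + rounding) >> shift);
}

The output zero point (presumably the out_zero_point field that shows up in the dumps below) would then be added on top before clamping to the 8-bit range.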

3D input tensor

This was pretty much straightforward, as it was basically a matter of updating the code to take into account the added dimension, and also reordering the tensor elements, as the hardware expects depth-first order.

This was made much easier by some improvements to the scripts I use to observe the behavior of the closed source stack, by intercepting the communication with the kernel's GPL driver.

For example, this is the output when Mesa has generated a cmd stream that is functionally equivalent to what the blob sends to the kernel:

+ diff -u -U 100 /home/tomeu/mesa.txt /home/tomeu/galcore.txt
--- /home/tomeu/mesa.txt    2023-08-07 18:28:29.939750225 +0200
+++ /home/tomeu/galcore.txt    2023-08-07 18:28:42.116625362 +0200
@@ -1,176 +1,273 @@
 {
-    0x0801028a, /* LOAD_STATE (1) Base: 0x00A28 Size: 1 Fixp: 0 */
-    0x00000011, /*   PA.SYSTEM_MODE := PROVOKING_VERTEX_LAST=1,HALF_PIXEL_CENTER=1 */
-    0x08010e13, /* LOAD_STATE (1) Base: 0x0384C Size: 1 Fixp: 0 */
-    0x00000002, /*   GL.API_MODE := OPENCL */
+    0x00000000, /* UNKNOWN (0) */
+    0x00000000, /*  */
+    0x00000000, /* UNKNOWN (0) */
+    0x00000000, /*  */
+    0x00000000, /* UNKNOWN (0) */
+    0x00000000, /*  */
     0x00000000, /* UNKNOWN (0) */
     0x00000000, /*  */
     0x08010e4f, /* LOAD_STATE (1) Base: 0x0393C Size: 1 Fixp: 0 */
     0x00000000, /*   GL.OCB_REMAP_START := 0x0 */
     0x08010e50, /* LOAD_STATE (1) Base: 0x03940 Size: 1 Fixp: 0 */
     0x00000000, /*   GL.OCB_REMAP_END := 0x0 */
     0x08010e4c, /* LOAD_STATE (1) Base: 0x03930 Size: 1 Fixp: 0 */
     0x00000010, /*   GL.NN_CONFIG := UNK0=0x0,DISABLE_ZDPN=0,DISABLE_SWTILING=0,SMALL_BATCH=1,DDR_BURST_SIZE=0x0,UNK7=0,NN_CORE_COUNT=0x0,UNK12=0 */
     0x08010428, /* LOAD_STATE (1) Base: 0x010A0 Size: 1 Fixp: 0 */
-    0xffff3000, /*   PS.NN_INST_ADDR := *0xffff3000 */
+    0x3348e780, /*   PS.NN_INST_ADDR := *0x3348e780 */
     0x08010429, /* LOAD_STATE (1) Base: 0x010A4 Size: 1 Fixp: 0 */
     0x00000000, /*   0x010A4 */
     0x08010e03, /* LOAD_STATE (1) Base: 0x0380C Size: 1 Fixp: 0 */
     0x00000c23, /*   GL.FLUSH_CACHE := DEPTH=1,COLOR=1,TEXTURE=0,PE2D=0,TEXTUREVS=0,SHADER_L1=1,SHADER_L2=0,UNK10=1,UNK11=1,DESCRIPTOR_UNK12=0,DESCRIPTOR_UNK13=0 */
     0x08010e03, /* LOAD_STATE (1) Base: 0x0380C Size: 1 Fixp: 0 */
     0x00000c23, /*   GL.FLUSH_CACHE := DEPTH=1,COLOR=1,TEXTURE=0,PE2D=0,TEXTUREVS=0,SHADER_L1=1,SHADER_L2=0,UNK10=1,UNK11=1,DESCRIPTOR_UNK12=0,DESCRIPTOR_UNK13=0 */
     0x00000000, /* UNKNOWN (0) */
     0x00000000, /*  */
 }
 map->layer_type = 0x0;  /* (0) */
 map->no_z_offset = 0x0;  /* (0) */
 map->kernel_xy_size = 0x2;  /* (2) */
 map->kernel_z_size = 0x4;  /* (4) */
 map->kernels_per_core = 0x1;  /* (1) */
 map->pooling = 0x0;  /* (0) */
 map->pooling_xy_size = 0x1;  /* (1) */
 map->prelu = 0x0;  /* (0) */
 map->nn_layer_flush = 0x1;  /* (1) */
 map->kernel_data_type = 0x0;  /* (0) */
 map->in_image_data_type = 0x0;  /* (0) */
 map->out_image_data_type = 0x0;  /* (0) */
 map->in_image_x_size = 0x4;  /* (4) */
 map->in_image_y_size = 0x4;  /* (4) */
 map->in_image_x_offset = 0x0;  /* (0) */
 map->in_image_y_offset = 0x0;  /* (0) */
 map->unused0 = 0x0;  /* (0) */
 map->brick_mode = 0x0;  /* (0) */
 map->brick_distance = 0x0;  /* (0) */
 map->relu = 0x0;  /* (0) */
 map->unused1 = 0x0;  /* (0) */
 map->post_multiplier = 0x0;  /* (0) */
 map->post_shift = 0x17;  /* (23) */
 map->unused2 = 0x0;  /* (0) */
 map->no_flush = 0x0;  /* (0) */
 map->unused3 = 0x0;  /* (0) */
 map->out_image_x_size = 0x3;  /* (3) */
 map->out_image_y_size = 0x3;  /* (3) */
 map->out_image_z_size = 0x1;  /* (1) */
 map->rounding_mode = 0x1;  /* (1) */
 map->in_image_x_offset_bit_3 = 0x0;  /* (0) */
 map->in_image_y_offset_bit_3 = 0x0;  /* (0) */
 map->out_image_tile_x_size = 0x3;  /* (3) */
 map->out_image_tile_y_size = 0x3;  /* (3) */
-map->kernel_address = 0x3fffd00;  /* (67108096) */
+map->kernel_address = 0xcd237f;  /* (13443967) */
 map->kernel_z_size2 = 0x0;  /* (0) */
-map->in_image_address = 0xffff6000;  /* (4294926336) */
-map->out_image_address = 0xffff7000;  /* (4294930432) */
+map->in_image_address = 0x3348e240;  /* (860414528) */
+map->out_image_address = 0x89ffc500;  /* (2315240704) */
 map->image_caching_mode = 0x0;  /* (0) */
 map->kernel_caching_mode = 0x1;  /* (1) */
 map->partial_cache_data_unit = 0x0;  /* (0) */
 map->kernel_pattern_msb = 0x0;  /* (0) */
 map->kernel_y_size = 0x2;  /* (2) */
 map->out_image_y_stride = 0x3;  /* (3) */
 map->kernel_pattern_low = 0x0;  /* (0) */
 map->kernel_pattern_high = 0x0;  /* (0) */
 map->kernel_cache_start_address = 0x800;  /* (2048) */
 map->kernel_cache_end_address = 0xa00;  /* (2560) */
 map->image_start_address = 0x0;  /* (0) */
 map->image_end_address = 0x800;  /* (2048) */
 map->in_image_border_mode = 0x0;  /* (0) */
 map->in_image_border_const = 0x7d;  /* (125) */
 map->unused4 = 0x0;  /* (0) */
 map->kernel_data_type_bit_2 = 0x0;  /* (0) */
 map->in_image_data_type_bit_2 = 0x0;  /* (0) */
 map->out_image_data_type_bit_2 = 0x0;  /* (0) */
 map->post_multiplier_1_to_6 = 0x1f;  /* (31) */
 map->post_shift_bit_5_6 = 0x0;  /* (0) */
 map->unused5 = 0x0;  /* (0) */
 map->in_image_x_stride = 0x4;  /* (4) */
 map->in_image_y_stride = 0x4;  /* (4) */
 map->out_image_x_stride = 0x3;  /* (3) */
 map->unused6 = 0x0;  /* (0) */
 map->post_multiplier_7_to_14 = 0x61;  /* (97) */
 map->out_image_circular_buf_size = 0x0;  /* (0) */
 map->unused7 = 0x0;  /* (0) */
 map->per_channel_post_mul = 0x0;  /* (0) */
 map->out_image_circular_buf_end_addr_plus_1 = 0x3ffffff;  /* (67108863) */
 map->unused8 = 0x0;  /* (0) */
 map->in_image_circular_buf_size = 0x0;  /* (0) */
 map->unused9 = 0x0;  /* (0) */
 map->in_image_circular_buf_end_addr_plus_1 = 0x3ffffff;  /* (67108863) */
 map->unused10 = 0x0;  /* (0) */
 map->coef_zero_point = 0x80;  /* (128) */
 map->out_zero_point = 0x77;  /* (119) */
 map->kernel_direct_stream_from_VIP_sram = 0x0;  /* (0) */
 map->depthwise = 0x0;  /* (0) */
 map->unused11 = 0x0;  /* (0) */
 map->unused12 = 0x0;  /* (0) */
 map->unused13 = 0x0;  /* (0) */
 map->unused14 = 0x0;  /* (0) */
 map->unused15 = 0x0;  /* (0) */
 map->unused16 = 0x0;  /* (0) */
 map->further1 = 0x0;  /* (0) */
 map->further2 = 0x0;  /* (0) */
 map->further3 = 0x3ffffff;  /* (67108863) */
 map->further4 = 0x7f800000;  /* (2139095040) */
 map->further5 = 0xff800000;  /* (4286578688) */
 map->further6 = 0x0;  /* (0) */
 map->further7 = 0x0;  /* (0) */
 map->further8 = 0x0;  /* (0) */
   0x40, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x01, 0x00, 0x2c, 0x99, 0x0e, 0x00, 0x00,
   0x40, 0xea, 0x2c, 0xeb, 0x80, 0xaf, 0x80, 0x9b, 0x99, 0x80, 0x80, 0x13,
   0x80, 0x80, 0x80, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00
   0x69, 0xd3, 0x2d, 0x92, 0x07, 0x00, 0x64, 0x00, 0x0c, 0x22, 0x90, 0xd6,
   0x53, 0xc9, 0xe2, 0x48, 0xe6, 0x4c, 0xa8, 0xeb, 0xd2, 0xf3, 0xb0, 0xf4,
   0x2d, 0xa4, 0x3e, 0xf4, 0x0f, 0x7b, 0x98, 0x01, 0x41, 0x84, 0x92, 0x7e,
   0xfa, 0x19, 0xf5, 0xda, 0xb3, 0x5a, 0xb7, 0xf3, 0x97, 0x95, 0x12, 0xe7,
   0x51, 0x94, 0xcb, 0x5a, 0x1f, 0xa9, 0xc6, 0xc4, 0x1c, 0xa9, 0x92, 0x1f,
   0xf7, 0x64, 0xc3, 0xca
   0x77, 0x77, 0x77, 0x77, 0x77, 0x77, 0x77, 0x77, 0x77

This corresponds to a convolution with the following parameters:

  • 8x8x1 input tensor
  • 3x3x1 weight tensor
  • stride == 2

The differences are due to different addresses being allocated between runs, and to how Mesa's code is structured, but that shouldn't affect the end result.

At the top we have the payload of the submit IOCTL, followed by a struct with the configuration for the NN units themselves and then the buffers for the weights, input and output.

When running a convolution configuration that isn't yet supported, we will spot more differences and hopefully will be able to figure out the logic behind them.

Strided convolutions

The hardware doesn't really support strided convolutions, so these are "lowered" to 1-stride convolutions with added channels, as per this research paper:

By implementing the algorithm in the paper, we match the behavior of the blob, as with requantization. It refers only to 2D input tensors, so I will need to check how the blob behaves with 3D inputs and figure out the logic behind it.

For now I have chosen to do the tensor manipulation on the CPU, but later on we will be able to use the TP units in the HW for this, reducing latency.
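For 2D inputs, the reshuffle itself looks roughly like the sketch below (the matching reordering of the weights is not shown, and this is an illustration rather than the actual driver code):

#include <stdint.h>

/* Lowering a stride-s convolution to a stride-1 one: each s*s block of
 * pixels of the input is folded into s*s channels, so that what used to be
 * strided accesses become accesses to neighbouring channels.  This is the
 * CPU version that I want to eventually move to the TP units. */
static void fold_stride_into_channels(const uint8_t *in, uint8_t *out,
                                      unsigned height, unsigned width,
                                      unsigned s)
{
   unsigned out_h = height / s, out_w = width / s;

   for (unsigned y = 0; y < out_h; y++)
      for (unsigned x = 0; x < out_w; x++)
         for (unsigned dy = 0; dy < s; dy++)
            for (unsigned dx = 0; dx < s; dx++)
               out[((y * out_w + x) * s + dy) * s + dx] =
                  in[(y * s + dy) * width + (x * s + dx)];
}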

Test suite

With so many different convolution parameters supported, I felt the need for a comfortable way of keeping regressions in check.

I wrote a simple pytest module that will generate a TFLite model with a single convolution operation, and the parameters and payloads will be changed according to the different parameters that we support.

At some point I will add a CI job, probably before sending the initial merge request.

Monday, June 26, 2023

Etnaviv NPU update 3: Deeper into the convolution units

What two weeks!

Programming of the convolution units

Picking up from where I left off at the last update, I made progress in understanding the format of the buffer that contains the weights and biases.

The bit of knowledge that made a difference was realising that the format is optimized so that each NN core can efficiently access the portion of it that it needs, without having to do any parsing or decoding. Knowing that also helped in guessing what some fields in the parameter structure are for.

With that, I was able to correctly run a convolution on a small matrix with arbitrary weights and biases.

The biggest roadblock in this area currently is understanding how I need to program the output unit in the NN so the output data is in the desired scale. There is a series of fields that influence how the output values are processed before being placed in the output buffer, and I don't really know how they work yet. They are called post_shift and post_mult, and the first correlates moderately (r=0.78) with the quantization scale of the output. I know that the post_shift field does what its name says (a shift to the right), but to understand what value I need in each situation I feel I need to understand better how the hardware works and what the values could be at the end of the convolution, before the output unit. I will be reading a bunch of research papers about NN-accelerating silicon over the summer.

That said, replacing the OpenCL kernels in TensorFlow Lite's GPU delegate that do convolutions with the fixed units turned out to be a worse idea than I initially thought. This is because that delegate is completely oriented towards float-first hardware such as GPUs and this accelerator is integer only.

A consequence of this is that TFLite inserts a dequantize operation at the start of the graph and a quantize at the end, to match the desired input and output formats of a fully quantized model while feeding floats to the GPU. We need integers, so we would have to quantize after TFLite's dequantization and vice versa. Also, the other operations in the graph expect floats as well... This is certainly the wrong path to take for performance in a bandwidth-constrained device, as all embedded boards are, so I had to go back to the drawing board.

A new Gallium frontend: Teflon

If TF Lite's GPU delegate is such a bad match for this HW, what can we do to run inferences at reasonable speeds? The same as VeriSilicon did: write our own delegate:

https://gitlab.freedesktop.org/tomeu/mesa/-/commits/teflon/

TF Lite's operation description matches relatively well what we currently know of the configuration of the NN units. So we will not need to write complex shaders to implement the operations, but "just" translate the description of the operation to the HW configuration.

Of course, there is no HW that has fixed function units that accelerate all operations that are built into TF Lite or even that the most commonly used models contain. VeriSilicon's delegate deals with that by having a library of optimized OpenCL kernels that run on their programmable shader core(s).

But we want to avoid getting in the business of writing dozens of kernels that will need to be tweaked and made more complex so they run efficiently on other NPUs out there.

Fortunately, the delegate infrastructure in TF Lite is designed for this very scenario of imperfect HW and we can have a simple delegate that will implement the operations supported by the HW and the rest will execute in other delegates based on their capabilities.

How fast that will be is a big unknown right now, as switching between delegates will have a cost in terms of synchronization and data sharing, but that is something that we probably can improve in the TF Lite code base, as the kernel already has all the mechanisms for efficient synchronization and data sharing.

Another possibility that we have with the TF Lite delegate mechanism is offloading the operations we don't support to a different delegate that can accelerate them. For example, in the case of a board with an Amlogic A311D or S905D3, we could use the GPU delegate to run those operations on the Mali GPU on it, via the OpenCL driver that Alyssa is writing in Mesa.

And if that is still slower than with the proprietary stack, one could always write an optimized kernel in NIR to run on the programmable core in the Vivante NPU. That is the beauty of free software: we can address the needs we have ourselves, and, importantly, do it by pooling work with others!

Because this frontend is implemented in terms of Gallium, we leverage the infrastructure in there for memory management, synchronization and execution. I think this will work well for adding support to other NN engines such as those from Rockchip, Cadence, Mediatek, etc.

Next steps

I need to crack the nut of the post-processing of the raw output so it is in the expected scale, and afterwards I will be looking at handling multiple feature maps (kernel z > 1).

After that I don't see much else in the way of running convolutions as expected by TF Lite, so hopefully I will be running some models and measuring the performance. I expect that we will want to do the same for accelerating tensor operations with the TP units. And we will probably want to have a look at using the SRAM to reduce bandwidth and memory access latency. That is still some way off though, and the summer is just starting!

Saturday, June 10, 2023

Etnaviv NPU update 2: Diving into the convolution units

In the previous update I explained that the programmable core in this NPU (VIPNano-QI) is too slow to run inference workloads substantially faster than the CPUs. The vendor stack achieves acceptable inference rates by running most of the work on fixed-function units that can perform different kinds of convolutions and transformations of tensors.

Most of the work is done by the convolution units that VeriSilicon calls NN cores, so this is what I have been focusing on at this stage. I think that even if we still do all tensor transformation on the programmable core, by using the NN units we could already achieve usable performance.

By looking around in the ioctls that VeriSilicon's userspace stack sends to the kernel, it was clear that in the NN jobs there was little more than a pointer to a structure that configures the NN fixed-function units. Luckily I didn't need to reverse engineer it from zero, as VeriSilicon's out-of-tree kernel driver is GPL and contains two instances of programming this HW with a trivial job (a 2x2x1 kernel with a single bias value).

It took some boring work to translate what the code does into a C struct, but this was the initial one:

struct etna_nn_params {
   uint32_t op_type : 1; /* conv: 0 fully_connected: 1 */
   uint32_t no_z_offset : 1;
   uint32_t kernel_x_size : 4;
   uint32_t kernel_z_size : 14; /* & 0x3FFF */
   uint32_t kernels_per_core : 7;
   uint32_t zero1 : 2;
   uint32_t zero2 : 1;
   uint32_t zero3 : 1;
   uint32_t nn_layer_flush : 1;

   uint32_t kernel_data_type : 2; /* UINT8 0x2 INT8 0x0 */
   uint32_t in_image_data_type : 2; /* UINT8 0x2 INT8 0x0 */
   uint32_t out_image_data_type : 2; /* UINT8 0x2 INT8 0x0 */
   uint32_t in_image_x_size : 13;
   uint32_t in_image_y_size : 13;

   uint32_t zero4 : 3;
   uint32_t zero5 : 3;
   uint32_t unused0 : 1;
   uint32_t zero6 : 16;
   uint32_t zero7 : 1;
   uint32_t enable_relu : 1;
   uint32_t zero9 : 1;
   uint32_t post_shift : 6;

   uint32_t unused1 : 2;
   uint32_t zero10 : 1;
   uint32_t zero11 : 1;
   uint32_t unused2 : 2;
   uint32_t out_image_x_size : 13;
   uint32_t out_image_y_size : 13;

   uint32_t out_image_z_size : 14;
   uint32_t zero12 : 2; /* 0x0 */
   uint32_t zero13 : 1; /* (0 >> 3) & 0x1 */
   uint32_t zero14 : 1; /* (0 >> 3) & 0x1 */
   uint32_t unk0 : 7;  /* 1 */
   uint32_t unk1 : 7;  /* 1 */

   uint32_t kernel_address : 26; /* >> 6 */
   uint32_t kernel_z_size2 : 6; /* >> 14 */

   uint32_t in_image_address;

   uint32_t out_image_address;

   uint32_t unused3 : 12;
   uint32_t kernel_y_size : 4;
   uint32_t out_image_y_size2 : 16;  /* maybe stride? */

   uint32_t zero15;

   uint32_t zero16;

   uint32_t zero17;

   uint32_t kernel_cache_end_address;

   uint32_t zero19;

   uint32_t image_end_address;

   uint32_t zero20 : 2;
   uint32_t zero21 : 16;
   uint32_t kernel_data_type_bit_2 : 1;
   uint32_t in_image_data_type_bit_2 : 1;
   uint32_t out_image_data_type_bit_2 : 1;
   uint32_t zero22 : 6;
   uint32_t post_shift_bit_5_6 : 2;
   uint32_t unused4 : 3;

   uint32_t in_image_stride : 16;
   uint32_t in_image_y_size2 : 16; /* again? */

   uint32_t out_image_stride : 16;
   uint32_t unused5 : 8;
   uint32_t zero23 : 8;

   uint32_t zero24 : 26; /* 0 >> 6 */
   uint32_t zero25 : 1;
   uint32_t zero26 : 1;
   uint32_t zero27 : 1; /* 0 >> 4 */
   uint32_t zero28 : 1; /* 0 >> 4 */
   uint32_t zero29 : 1;
   uint32_t kernel_data_type_bit_3 : 1;

   uint32_t unk2 : 26; /* 0xFFFFFFFF >> 6 */
   uint32_t unused6 : 4;
   uint32_t zero30 : 1;
   uint32_t in_image_data_type_bit_3 : 1;

   uint32_t zero31 : 26; /* 0 >> 6 */
   uint32_t out_image_data_type_bit_3 : 1;
   uint32_t unused7 : 6;

   uint32_t unk3 : 26; /* 0xFFFFFFFF >> 6 */
   uint32_t unused8 : 6;

   uint32_t coef_zero_point : 8;
   uint32_t out_zero_point : 8;
   uint32_t zero32 : 1;
   uint32_t zero33 : 1;
   uint32_t zero34 : 8;
   uint32_t unused9 : 6;

   uint32_t zero35;

   uint32_t zero36 : 4;
   uint32_t zero37 : 28;  /* 0 >> 4 */

   uint32_t zero38 : 4;
   uint32_t zero39 : 28;  /* 0 >> 4 */

   uint32_t further1;
   uint32_t further2;
   uint32_t further3;
   uint32_t further4;
   uint32_t further5;
   uint32_t further6;
   uint32_t further7;
   uint32_t further8;
};

As you can see there are a lot of "zero" and "unused" fields, most of which I think will actually be used for something, as HW engineers don't tend to like wasting bits. By adding instrumentation for dumping these structs to the reverse engineering tooling, I will be getting a better idea of what each field means and does.
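The instrumentation doesn't need to be anything more sophisticated than printing the decoded fields in a stable format, so that dumps from the blob and from Mesa can be diffed side by side; a trivial sketch, using field names from the struct above:

#include <stdio.h>

static void dump_nn_params(const struct etna_nn_params *map)
{
   printf("map->kernel_x_size = 0x%x;  /* (%u) */\n",
          (unsigned)map->kernel_x_size, (unsigned)map->kernel_x_size);
   printf("map->kernel_z_size = 0x%x;  /* (%u) */\n",
          (unsigned)map->kernel_z_size, (unsigned)map->kernel_z_size);
   printf("map->in_image_x_size = 0x%x;  /* (%u) */\n",
          (unsigned)map->in_image_x_size, (unsigned)map->in_image_x_size);
   printf("map->post_shift = 0x%x;  /* (%u) */\n",
          (unsigned)map->post_shift, (unsigned)map->post_shift);
   /* ... and so on for the rest of the fields */
}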

I got GPU hangs the first time that I submitted a job with the same configuration as the kernel's trivial reset job, and looking further showed that the buffer that contains the convolution filters must follow a specific format.

By looking again at the kernel driver sources, I used the same kernel/filter buffer and the GPU didn't hang anymore. That kernel had all zeroes as the weights, and indeed my output buffer was now full of zeroes.

Then I tried to put my weights into the format that I inferred from the kernel driver source code, but I wasn't able to get any job to run to completion without hangs, and the output buffer was unchanged.

To figure out what I was missing about how the weights (and the biases) need to be placed in the buffer, I added code to the reverse engineering tooling to dump the weights buffer. With that buffer and after playing some with the sizes of the output, input and kernel buffers, I finally got a job to run with non-zero weights.

What I am doing right now is slowly zeroing out the weights buffer to figure out which bits are data, which are control, and what effect the changes have on the output.

Hope that by the next update I will have documented the format of the weights buffer and will be able to run at least one kind of convolution!

Monday, May 29, 2023

Etnaviv NPU update 1: Planning for performance

As I wrote in the last update, my OpenCL branch is able to correctly run MobileNet v1 with the GPU delegate in TensorFlow-Lite, albeit much slower than with VeriSilicon's proprietary stack.

In the weeks that passed I have been investigating the performance difference, understanding better how the HW works and what the explanation could be. Inference with Etnaviv took 1200 ms, while the proprietary stack did the same in less than 10 ms (120x faster!).

When trying to understand the big performance difference I discovered that the existing reverse engineering tools that I had been using to understand how to run OpenCL workloads weren't working. They detected a single OpenCL kernel at the end of the execution, and there was no way that single kernel could be executing the whole network.

After a lot of fumbling around on the internet I stumbled upon a commit that included an interestingly-named environment variable: VIV_VX_DISABLE_TP_NN_EVIS. With it, VeriSilicon's OpenVX implementation will execute the network without using either the TP or NN fixed-function units, nor the EVIS instruction set (which helps with reducing memory bandwidth use by allowing operations on packed int8 and int16 types).

With that environment variable OpenVX was using regular OpenCL to run the inference, and the performance difference was interesting: 398.428 ms. Still much better than our time, but also more than 50 times slower than when fully using the capabilities of the hardware. The reason for this is that there is only one core in the NPU that is able to run programmable kernels. The rest are fixed-function units as I'm going to explain next.

Digging further in VeriSilicon's kernel driver and in marketing documents, I gathered that this particular NPU has 8 convolution cores (they call them NN cores) and 4 cores for accelerating some tensor operations (TP cores). What these units cannot do has to be done on the single, slow programmable core.

Next step was to understand how the proprietary stack made use of the fixed function units in the NPU.

The MobileNet v1 model I used contains these operations, as output by TFLite's model analyzer:

  Op#0 CONV_2D(T#88, T#6, T#4[28379, 17476, 18052, -2331, 17431, ...]) -> [T#5]
  Op#1 DEPTHWISE_CONV_2D(T#5, T#33, T#32[-249, 165, 173, -2, 158, ...]) -> [T#31]
...

[12 more pairs of CONV_2D and DEPTHWISE_CONV_2D]

...

  Op#27 AVERAGE_POOL_2D(T#29) -> [T#0]
  Op#28 CONV_2D(T#0, T#3, T#2[-5788, -4159, 2282, -6706, -9783, ...]) -> [T#1]
  Op#29 RESHAPE(T#1, T#86[-1, 1001]) -> [T#85]
  Op#30 SOFTMAX(T#85) -> [T#87]

As can be seen, it is basically a bunch of convolutions with a final reshaping and a SOFTMAX operation at the end. 

By using some of the environment variables that are mentioned in this issue in GitHub, we can get some information on how the proprietary stack plans the execution on the hardware:

  operation_name:VXNNE_OPERATOR_TENSOR_TRANS operation_target:VXNNE_OPERATION_TARGET_TP
  operation_name:VXNNE_OPERATOR_RESHUFFLE operation_target:VXNNE_OPERATION_TARGET_TP
  operation_name:VXNNE_OPERATOR_CONVOLUTION operation_target:VXNNE_OPERATION_TARGET_NN
...

[34 more VXNNE_OPERATOR_CONVOLUTION on VXNNE_OPERATION_TARGET_NN] 

...

  operation_name:VXNNE_OPERATOR_POOLING operation_target:VXNNE_OPERATION_TARGET_SH
  operation_name:VXNNE_OPERATOR_FULLYCONNECTED operation_target:VXNNE_OPERATION_TARGET_TP
  operation_name:VXNNE_OPERATOR_SOFTMAX operation_target:VXNNE_OPERATION_TARGET_SH

From that we can see that the TP units are used to prepare the input tensor, then all convolution operations go to the NN cores, and then the output of the convolutions is passed through a pooling operation on the programmable core, whose output then goes to the TP cores for the fully-connected layer, finishing with SOFTMAX back on the programmable core.

So in this case, only a small part of the network is actually run on the programmable core, via OpenCL...

Next steps 

What I will be working on next:

  1. Adapt the existing RE tooling to dump information regarding NN and TP workflows
  2. Start to fill the data structures by reading the code of VeriSilicon's kernel driver, which executes some trivial workloads to, presumably, reset the HW between context switches to prevent information leaks.
  3. Write some simple OpenVX graphs that exercise each of the operations that the documentation claims to be supported by the NPU.
  4. Observe the data that VeriSilicon's userspace stack passes to the kernel, and infer from there the exact layout of the configuration buffers that program the fixed-function units.
  5. Hack Mesa to send an NN job if the name of the CL kernel contains "convolution".
  6. Get things working for this specific network and measure performance.

If performance is at least 3x faster than running the inference on the CPU, I would call this good enough to be useful and I will switch to upstreaming. The Mesa side of it doesn't look that bad, but I think the bigger challenge will be getting something merged in TensorFlow that can run fast on this hardware.

The most reasonable approach I have been able to think of would be adding new CL C and SPIR-V vendor extensions that add a new intrinsic for the whole convolution operation (with parameters similar to those of the vxConvolutionLayer node).

The GPU delegate in TensorFlow Lite would use it on the Vivante NPU and Mesa would have a robust way of knowing that this kernel should be run with a NN job, and with what configuration.

That's a lot of work, but I would say at this point that afterwards I will start looking at making fuller use of the NPU's capabilities by doing something similar with the operations that the TP cores can accelerate.

Wednesday, April 26, 2023

A long overdue update

Cannot believe it has been years since my last update here!

There are two things that I would like to tell people about:

The first is that I no longer work at Collabora. It has been almost 13 years full of excitement and recently I came to believe that I wanted a proper change.

They are great folks to work with, so if you are thinking of a career change and want to do open-source stuff upstream, I recommend you to consider them.

And the other topic is what I have been working on lately: a free software driver for the NPUs that VeriSilicon sells to SoC vendors.

TL;DR

tomeu@arm-64:~/tensorflow/build/examples/label_image$ SMALLER_SOFTMAX=1 RUSTICL_ENABLE=etnaviv LD_LIBRARY_PATH=/home/tomeu/opencl/lib LIBGL_DRIVERS_PATH=/home/tomeu/opencl/lib/dri/ ./label_image --gpu_backend=cl --use_gpu=true --verbose 1 --tflite_model ../../../assets/mobilenet_quant_v1_224.tflite --labels ../../../assets/labels.txt --image ../../../assets/grace_hopper.bmp --warmup_runs 1 -c 1

[snip]
INFO: invoked
INFO: average time: 1261.99 ms
INFO: 0.666667: 458 bow tie
INFO: 0.294118: 653 military uniform
INFO: 0.0117647: 835 suit
INFO: 0.00784314: 611 jersey
INFO: 0.00392157: 922 book jacket

That is TensorFlow Lite's OpenCL delegate detecting objects with Etnaviv from Grace Hopper's portrait in military uniform.

The story behind this work

Many years ago, when I was working on the operating system for the One Laptop Per Child project, I became painfully aware of the problems derived from IP vendors not providing the source code for their drivers.

This and other instances of the same problem motivated me to help out on the Panfrost project, writing a free software driver for the Mali GPUs by Arm. That gave me a great opportunity to learn about reverse engineering from Alyssa Rosenzweig.

Nowadays the Mesa project contains drivers for most GPUs out there, some maintained by the same companies that develop the IP, some by their customers and hobbyists alike. So the problem of the availability of source code for GPU drivers is pretty much solved.

Only that, with the advent of machine learning at the edge, we are reliving this problem with the drivers for accelerating those workloads with NPUs, TPUs, etc.

Vivante's NPU IP is very closely based on their GPUs. And it is pretty popular, being included in SoCs by Amlogic, Rockchip, NXP, Broadcom and more.

We already have a reasonably complete driver (Etnaviv) for their GPU IP, so I started by looking at what the differences were and how much of the existing userspace and kernel drivers we could reuse.

The kernel driver works with almost no changes; it just took me some time to implement the hardware initialization properly upstream. As of Linux 6.3 the driver loads correctly on Khadas' VIM3, but for a chance at decent performance this patch is needed:

[PATCH] arm64: dts: VIM3: Set the rates of the clocks for the NPU

Due to its experimental status, it is disabled by default in the device tree. To enable it, add the below to arch/arm64/boot/dts/amlogic/meson-g12b-a311d-khadas-vim3.dts:

&npu {
       status = "okay";
};

Enabling Etnaviv for other boards with this IP should be relatively straightforward, by describing how the HW is initialized by inspecting the downstream kernel sources for the board in question.

Mesa has seen most of the work, as this IP is compute-only and the userspace driver only targeted OpenGL ES.

First step was wiring up the existing driver to Mesa's OpenCL implementation, and then I focused on getting the simplest kernel to correctly run. For this and all the subsequent work, the reverse-engineering tools used by the Etnaviv community have been of great use.

At that point I had to pause the work to focus on other unrelated stuff, but Collabora's Italo Nicola and Faith Ekstrand did great work to extend the existing compiler to generate OpenCL kernels.

Once I didn't have a day job getting in the way anymore, I started adding the features needed to run the label_image example in TensorFlow Lite.

And eventually we got to this point. 1.2 seconds to run that inference is a lot of time, so the next steps for me will be to figure out what the biggest causes for the low performance are.

With the goal in mind of providing a free software driver that companies can use to run inference on their products containing Vivante's NPU IP, I need those tasks to perform at at least the same order of magnitude as the closed source solution provided by Vivante.

Right now Etnaviv is about twice as slow as running label_image with the OpenCL delegate on Vivante's driver, but the solution that they provide uses a special delegate that is able to make better use of their hardware and is several times faster.

Current performance situation (label_image):

  • OpenCL delegate with Etnaviv: 1261.99 ms
  • OpenCL delegate with Galcore: 787.733 ms
  • CPU: 149.19 ms
  • TIM-VX delegate: 2.567 ms (!)

The plan is to first see why we are slower with the OpenCL delegate and fix it, and afterwards the real fun stuff will start: seeing how we can use more of the HW capabilities through the OpenCL API and with upstream TensorFlow Lite.

Next steps

Italo is cleaning up an initial submission for inclusion in Mesa upstream. Once that is done I will rebase my branch and start submitting features.

In parallel to upstreaming, I will be looking at what is needed to get closer to the performance of the closed source driver, for ML acceleration.

Thanks

There are a lot of people besides the ones mentioned above that have made this possible. Some of them are:

  • The Mesa community, for having put together such a great framework for GPU drivers. Their CI system has been great to track progress and avoid regressions.
  • The Etnaviv community, for all the previous reverse engineering work that documented most of the OpenCL specificities, for a great pair of drivers to base the work on and the very useful tooling around it.
  • And the Linux kernel community, that made it so easy to get the hardware recognized and the Etnaviv driver probed on it.

Last but not least, there are some individuals to whom I was able to turn when I needed help:

  • Christian Gmeiner (austriancoder)
  • Lucas Stach (lynxeye)
  • Neil Armstrong (narmstrong)
  • Faith Ekstrand (gfxstrand)
  • Karol Herbst (karolherbst)
A big thanks, it has been a lot of fun!