Monday, May 29, 2023

Etnaviv NPU update 1: Planning for performance

As I wrote in the last update, my OpenCL branch is able to correctly run MobileNet v1 with the GPU delegate in TensorFlow-Lite, albeit much slower than with VeriSilicon's proprietary stack.

In the weeks since then I have been investigating the performance difference, getting a better understanding of how the HW works and what the explanation could be. Inference with Etnaviv took 1200 ms, while the proprietary stack did the same in less than 10 ms (120x faster!).

When trying to understand the big performance difference, I discovered that the existing reverse engineering tools I had been using to learn how to run OpenCL workloads weren't working: they detected a single OpenCL kernel at the end of the execution, and there was no way that single kernel could be executing the whole network.

After a lot of fumbling around on the internet I stumbled upon a commit that included an interestingly-named environment variable: VIV_VX_DISABLE_TP_NN_EVIS. With it, VeriSilicon's OpenVX implementation will execute the network without using the TP or NN fixed-function units, and without the EVIS instruction set (which helps reduce memory bandwidth use by allowing operations on packed int8 and int16 types).

With that environment variable set, OpenVX used regular OpenCL to run the inference, and the result was interesting: 398.428 ms. Still much better than our time, but also more than 50 times slower than when fully using the capabilities of the hardware. The reason is that there is only one core in the NPU that is able to run programmable kernels; the rest are fixed-function units, as I'm going to explain next.

Digging further into VeriSilicon's kernel driver and their marketing documents, I gathered that this particular NPU has 8 convolution cores (they call them NN cores) and 4 cores for accelerating some tensor operations (TP cores). Whatever these units cannot do has to be done on the single, much slower programmable core.

The next step was to understand how the proprietary stack makes use of the fixed-function units in the NPU.

The MobileNet v1 model I used contains these operations, as output by TFLite's model analyzer:

  Op#0 CONV_2D(T#88, T#6, T#4[28379, 17476, 18052, -2331, 17431, ...]) -> [T#5]
  Op#1 DEPTHWISE_CONV_2D(T#5, T#33, T#32[-249, 165, 173, -2, 158, ...]) -> [T#31]
...

[12 more pairs of CONV_2D and DEPTHWISE_CONV_2D]

...

  Op#27 AVERAGE_POOL_2D(T#29) -> [T#0]
  Op#28 CONV_2D(T#0, T#3, T#2[-5788, -4159, 2282, -6706, -9783, ...]) -> [T#1]
  Op#29 RESHAPE(T#1, T#86[-1, 1001]) -> [T#85]
  Op#30 SOFTMAX(T#85) -> [T#87]

As can be seen, it is basically a bunch of convolutions with a final reshaping and a SOFTMAX operation at the end. 

By using some of the environment variables that are mentioned in this issue on GitHub, we can get some information on how the proprietary stack plans the execution on the hardware:

  operation_name:VXNNE_OPERATOR_TENSOR_TRANS operation_target:VXNNE_OPERATION_TARGET_TP
  operation_name:VXNNE_OPERATOR_RESHUFFLE operation_target:VXNNE_OPERATION_TARGET_TP
  operation_name:VXNNE_OPERATOR_CONVOLUTION operation_target:VXNNE_OPERATION_TARGET_NN
...

[34 more VXNNE_OPERATOR_CONVOLUTION on VXNNE_OPERATION_TARGET_NN] 

...

  operation_name:VXNNE_OPERATOR_POOLING operation_target:VXNNE_OPERATION_TARGET_SH
  operation_name:VXNNE_OPERATOR_FULLYCONNECTED operation_target:VXNNE_OPERATION_TARGET_TP
  operation_name:VXNNE_OPERATOR_SOFTMAX operation_target:VXNNE_OPERATION_TARGET_SH

From that we can see that the TP units are used to prepare the input tensor, all convolution operations go to the NN cores, the output of the convolutions is then passed through a pooling operation on the programmable core (the SH target), its output goes to the TP cores for the fully-connected operation, and execution finishes with SOFTMAX back on the programmable core.

So in this case, only a small part of the network is actually run on the programmable core, via OpenCL...

Next steps 

What I will be working on next:

  1. Adapt the existing RE tooling to dump information regarding NN and TP workflows
  2. Start to fill the data structures by reading the code of VeriSilicon's kernel driver, which executes some trivial workloads to, presumably, reset the HW between context switches to prevent information leaks.
  3. Write some simple OpenVX graphs that exercise each of the operations that the documentation claims are supported by the NPU (see the sketch right after this list).
  4. Observe the data that VeriSilicon's userspace stack passes to the kernel, and infer from there the exact layout of the configuration buffers that program the fixed-function units.
  5. Hack Mesa to send an NN job if the name of the CL kernel contains "convolution".
  6. Get things working for this specific network and measure performance.
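
As a reference for item 3, this is roughly what one of those OpenVX graphs could look like. It is only a sketch written from memory of the OpenVX NN extension API (vxCreateTensor, vxConvolutionLayer and friends), so names and parameters should be double-checked against the Khronos headers, and the tensor dimensions are arbitrary:

#include <VX/vx.h>
#include <VX/vx_khr_nn.h>

int main(void)
{
    vx_context context = vxCreateContext();
    vx_graph graph = vxCreateGraph(context);

    /* Arbitrary quantized dimensions for illustration: a 224x224x3 input,
     * 32 3x3 filters and a stride of 2 (in OpenVX the stride is implied by
     * the ratio between input and output sizes). */
    vx_size in_dims[4]  = {224, 224, 3, 1};
    vx_size w_dims[4]   = {3, 3, 3, 32};
    vx_size b_dims[1]   = {32};
    vx_size out_dims[4] = {112, 112, 32, 1};

    vx_tensor input   = vxCreateTensor(context, 4, in_dims,  VX_TYPE_UINT8, 0);
    vx_tensor weights = vxCreateTensor(context, 4, w_dims,   VX_TYPE_UINT8, 0);
    vx_tensor biases  = vxCreateTensor(context, 1, b_dims,   VX_TYPE_INT32, 0);
    vx_tensor output  = vxCreateTensor(context, 4, out_dims, VX_TYPE_UINT8, 0);

    vx_nn_convolution_params_t params = {
        .padding_x = 1,
        .padding_y = 1,
        .overflow_policy = VX_CONVERT_POLICY_SATURATE,
        .rounding_policy = VX_ROUND_POLICY_TO_ZERO,
        .down_scale_size_rounding = VX_NN_DS_SIZE_ROUNDING_FLOOR,
        .dilation_x = 0,
        .dilation_y = 0,
    };

    vxConvolutionLayer(graph, input, weights, biases,
                       &params, sizeof(params), output);

    /* Verifying and processing the graph is what makes the blob program the
     * NN units, which is the traffic we want to capture and compare. */
    if (vxVerifyGraph(graph) == VX_SUCCESS)
        vxProcessGraph(graph);

    vxReleaseContext(&context);
    return 0;
}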

If that ends up at least 3x faster than running the inference on the CPU, I would call it good enough to be useful and will switch to upstreaming. The Mesa side of it doesn't look that bad, but I think the bigger challenge will be getting something merged in TensorFlow that can run fast on this hardware.

The most reasonable approach I have been able to think of would be adding new CL C and SPIR-V vendor extensions that add a new intrinsic for the whole convolution operation (with parameters similar to those of the vxConvolutionLayer node).

The GPU delegate in TensorFlow Lite would use it on the Vivante NPU and Mesa would have a robust way of knowing that this kernel should be run with an NN job, and with what configuration.
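
To make that more concrete, below is a purely hypothetical sketch of what a kernel using such an intrinsic could look like. Nothing in it exists today: the extension name, the viv_convolution_2d built-in and its parameters are made up for illustration, loosely mirroring the vxConvolutionLayer parameters.

/* Hypothetical OpenCL C: the extension and built-in below do not exist. */
#pragma OPENCL EXTENSION cl_viv_nn_convolution : enable

__kernel void mobilenet_conv0(__global const uchar *input,
                              __global const uchar *weights,
                              __global const int   *biases,
                              __global uchar       *output)
{
    /* A single built-in covering the whole convolution, so Mesa could map
     * it directly to one NN job instead of compiling generic code for the
     * programmable core. */
    viv_convolution_2d(input, weights, biases, output,
                       /* padding_x  */ 1,
                       /* padding_y  */ 1,
                       /* dilation_x */ 0,
                       /* dilation_y */ 0,
                       /* overflow   */ VIV_CONVERT_SATURATE,
                       /* rounding   */ VIV_ROUND_TO_ZERO);
}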

That's a lot of work, but at this point I would say that afterwards I will start looking at making fuller use of the NPU's capabilities by doing something similar with the operations that the TP cores can accelerate.

Wednesday, April 26, 2023

A long overdue update

Cannot believe it has been years since my last update here!

There are two things that I would like to tell people about:

The first is that I no longer work at Collabora. It has been almost 13 years full of excitement and recently I came to believe that I wanted a proper change.

They are great folks to work with, so if you are thinking of a career change and want to do open-source stuff upstream, I recommend you consider them.

And the other topic is what I have been working on lately: a free software driver for the NPUs that VeriSilicon sells to SoC vendors.

TL;DR

tomeu@arm-64:~/tensorflow/build/examples/label_image$ SMALLER_SOFTMAX=1 RUSTICL_ENABLE=etnaviv LD_LIBRARY_PATH=/home/tomeu/opencl/lib LIBGL_DRIVERS_PATH=/home/tomeu/opencl/lib/dri/ ./label_image --gpu_backend=cl --use_gpu=true --verbose 1 --tflite_model ../../../assets/mobilenet_quant_v1_224.tflite --labels ../../../assets/labels.txt --image ../../../assets/grace_hopper.bmp --warmup_runs 1 -c 1

[snip]
INFO: invoked
INFO: average time: 1261.99 ms
INFO: 0.666667: 458 bow tie
INFO: 0.294118: 653 military uniform
INFO: 0.0117647: 835 suit
INFO: 0.00784314: 611 jersey
INFO: 0.00392157: 922 book jacket

That is TensorFlow Lite's OpenCL delegate detecting objects with Etnaviv from Grace Hopper's portrait in military uniform.

The story behind this work

Many years ago, when I was working on the operating system for the One Laptop Per Child project, I became painfully aware of the problems that derive from IP vendors not providing the source code for their drivers.

This and other instances of the same problem motivated me to help out on the Panfrost project, writing a free software driver for the Mali GPUs by Arm. That gave me a great opportunity to learn about reverse engineering from Alyssa Rosenzweig.

Nowadays the Mesa project contains drivers for most GPUs out there, some maintained by the same companies that develop the IP, some by their customers and hobbyists alike. So the problem of the availability of source code for GPU drivers is pretty much solved.

Except that, with the advent of machine learning at the edge, we are reliving this problem with the drivers for the NPUs, TPUs, etc. that accelerate those workloads.

Vivante's NPU IP is very closely based on their GPUs. And it is pretty popular, being included in SoCs by Amlogic, Rockchip, NXP, Broadcom and more.

We already have a reasonably complete driver (Etnaviv) for their GPU IP, so I started by looking at what the differences were and how much of the existing userspace and kernel drivers we could reuse.

The kernel driver works with almost no changes; it just took me some time to implement the hardware initialization properly upstream. As of Linux 6.3 the driver loads correctly on Khadas' VIM3, but for a chance at decent performance this patch is needed:

[PATCH] arm64: dts: VIM3: Set the rates of the clocks for the NPU

Due to its experimental status, it is disabled by default in the device tree. To enable it, add the below to arch/arm64/boot/dts/amlogic/meson-g12b-a311d-khadas-vim3.dts:

&npu {
       status = "okay";
};

Enabling Etnaviv for other boards with this IP should be relatively straightforward: inspect the downstream kernel sources for the board in question to see how the HW is initialized there, and describe the same in the device tree.

Mesa has seen most of the work, as this IP is compute-only and the userspace driver only targeted OpenGL ES.

The first step was wiring up the existing driver to Mesa's OpenCL implementation; then I focused on getting the simplest kernel to run correctly. For this and all the subsequent work, the reverse-engineering tools used by the Etnaviv community have been of great use.
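
As an idea of what "the simplest kernel" means here, a trivial copy kernel is representative of the kind of test used for this sort of bring-up (illustrative only, not the literal kernel I used):

/* Trivial bring-up kernel: each work-item copies one element, which is
 * enough to exercise the compiler, the command stream and memory access. */
__kernel void copy_buffer(__global const uint *in, __global uint *out)
{
    size_t i = get_global_id(0);
    out[i] = in[i];
}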

At that point I had to pause the work to focus on other unrelated stuff, but Collabora's Italo Nicola and Faith Ekstrand did great work to extend the existing compiler to generate OpenCL kernels.

Once I didn't have a day job getting in the way anymore, I started adding the features needed to run the label_image example in TensorFlow Lite.

And eventually we got to this point. 1.2 seconds to run that inference is a lot of time, so the next step for me will be to figure out the biggest causes of the low performance.

With the goal of providing a free software driver that companies can use to run inference on their products containing Vivante's NPU IP, those tasks need to perform within at least the same order of magnitude as the closed source solution provided by Vivante.

Right now Etnaviv is about twice as slow as running label_image with the OpenCL delegate on Vivante's driver, but the solution they actually ship uses a special delegate that is able to make better use of their hardware and is several hundred times faster.

Current performance situation (label_image):

  • OpenCL delegate with Etnaviv: 1261.99 ms
  • OpenCL delegate with Galcore: 787.733 ms
  • CPU: 149.19 ms
  • TIM-VX delegate: 2.567 ms (!)

The plan is to first see why we are slower with the OpenCL delegate and fix it, and afterwards the real fun stuff will start: seeing how we can use more of the HW capabilities through the OpenCL API and with upstream TensorFlow Lite.

Next steps

Italo is cleaning up an initial submission for inclusion in Mesa upstream. Once that is done I will rebase my branch and start submitting features.

In parallel to upstreaming, I will be looking at what is needed to get closer to the performance of the closed source driver, for ML acceleration.

Thanks

There are a lot of people besides the ones mentioned above who have made this possible. Some of them are:

  • The Mesa community, for having put together such a great framework for GPU drivers. Their CI system has been great to track progress and avoid regressions.
  • The Etnaviv community, for all the previous reverse engineering work that documented most of the OpenCL specificities, for a great pair of drivers to base the work on, and for the very useful tooling around them.
  • And the Linux kernel community, that made it so easy to get the hardware recognized and the Etnaviv driver probed on it.

Last but not least, there are some individuals to whom I was able to turn when I needed help:

  • Christian Gmeiner (austriancoder)
  • Lucas Stach (lynxeye)
  • Neil Armstrong (narmstrong)
  • Faith Ekstrand (gfxstrand)
  • Karol Herbst (karolherbst)

A big thanks, it has been a lot of fun!

Monday, March 4, 2019

Panfrost update: a new kernel driver

The video

Below you can see the same scene that I recorded in January, which was rendered by Panfrost in Mesa but using Arm's kernel driver. This time, Panfrost is using a new kernel driver that is in a form close to being acceptable for the mainline kernel:

The history behind it

During the past two months Rob Herring and I have been working on a new driver for Midgard and Bifrost GPUs that could be accepted into mainline.

Arm already maintains a driver out of tree with an acceptable open source license, but it doesn't implement the DRM ABI and several design considerations make it unsuitable for inclusion in mainline Linux.

The absence of a driver in mainline prevents users from keeping their kernels up-to-date and hurts integration with other parts of the free software stack. It also discourages SoC and BSP vendors from submitting their code to mainline, and hurts their ability to track mainline closely.

Besides the code of the driver itself, there's one more condition for mainline inclusion: an open source implementation of the userspace library needs to exist, so other kernel contributors can help verify, debug and maintain the kernel driver. It's an enormous pile of difficult work to reverse engineer the inner workings of a GPU and then implement a compiler and command submission infrastructure, so big thanks to Alyssa Rosenzweig for leading that effort.

Upstream status

Most of the Panfrost code is already part of mainline Mesa, with the code that directly interacts with the new DRM driver being in the review stage. The currently targeted GPUs are the T760 and T860, with the RK3399 being the SoC most often used for testing.

The kernel driver is being developed in the open and, though we are trying to follow the best practices displayed by other DRM drivers, there are a number of tasks that need to be done before we consider it ready for submission.

The work ahead

In the kernel:
- Make MMU code more complete for correctness and better performance
- Handle errors and hangs and correctly reset the GPU
- Improve fence handling
- Test with compute shaders (to check completeness of the ABI)
- Lots of cleanups and bug fixing!

In Mesa:
- Get GNOME Shell working
- Get Chromium working with accelerated WebGL
- Get all of glmark2 working
- Get a decent subset of dEQP passing and use it in CI
- Keep refactoring the code
- Support more hardware

Get the code

The exact bits used for the demo recorded above are in various stages of getting upstreamed to the various upstreams, but here they are in branches for easier reproduction:


Monday, January 7, 2019

A Panfrost milestone

The video

Below you can see glmark2 running as a Wayland client in Weston, on a NanoPC-T4 (so an RK3399 SoC with a Mali-T860 GPU). It's much smoother than in the video, which is limited to 5 FPS by the webcam.


Weston is running with the DRM backend and the GL renderer.

The history behind it


For more than 10 years, at Collabora we have been happily helping our customers to make the most of their hardware by running free software.

One area some of us have especially enjoyed working on has been open drivers for GPUs, which for a long time have been considered the next frontier in the quest to have a full software platform that companies and individuals can understand, improve and fix without having to ask for permission first.

Something that has saddened me a bit has been our reduced ability to help those customers that for one reason or another had chosen a hardware platform with ARM Mali GPUs, as no open driver was available for those.

While our biggest customers were able to get a high level of support from the vendors in order to have the Mali graphics stack well integrated with the rest of their product, the smaller ones had a much harder time in achieving that level of integration, which manifested in reduced performance, increased power consumption and slipped milestones.

That's why we have been following with great interest the several efforts that aimed to come up with an open driver for GPUs in the Mali family, one similar to those already existing for Qualcomm, NVIDIA and Vivante.

At XDC last year we had the chance to meet the people involved in the latest effort to develop such a driver: Panfrost. And in the months that followed I made some room in my backlog to come up with a plan to give the effort a boost.

At that point, Panfrost was only able to get its bits on the screen through an elaborate hack that involved copying each frame into an X11 SHM buffer, which, besides making the setup of the development environment much more cumbersome, invalidated any performance analysis. It also limited testing to demos such as glmark2.

Due to my previous work on Etnaviv I was already familiar with the abstractions in Mesa for setups in which the display of buffers is performed by a device different from the GPU, so it was just a matter of seeing how we could get the kernel driver for the Mali GPU to play well with the rest of the stack.

So during the past month or so I have come up with a proper implementation of the winsys abstraction that makes use of ARM's kernel driver. The result is that now developers have a better base on which to work on the rendering side of things.

By properly creating, exporting and importing buffers, we can now run applications on GBM, from demos such as kmscube and glmark2 to compositors such as Weston, but also big applications such as Kodi. We are also supporting zero-copy display of GPU-rendered clients in Weston.
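
As an illustration of what "properly creating and exporting buffers" involves, here is a minimal standalone sketch using the GBM API directly (not the actual Mesa winsys code, and the render node path is board-dependent): allocate a buffer on the GPU device and export it as a dma-buf, so that the display device or the compositor can import it without any copies.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <gbm.h>

int main(void)
{
    /* /dev/dri/renderD128 is just the usual first render node. */
    int fd = open("/dev/dri/renderD128", O_RDWR);
    if (fd < 0)
        return 1;

    struct gbm_device *gbm = gbm_create_device(fd);

    /* A linear layout avoids tiling mismatches when another device imports
     * the buffer. */
    struct gbm_bo *bo = gbm_bo_create(gbm, 1920, 1080, GBM_FORMAT_XRGB8888,
                                      GBM_BO_USE_RENDERING | GBM_BO_USE_LINEAR);

    /* Exporting as a dma-buf fd is what enables zero-copy sharing with the
     * display device or the compositor. */
    int dmabuf_fd = gbm_bo_get_fd(bo);
    printf("dma-buf fd: %d, stride: %u\n", dmabuf_fd, gbm_bo_get_stride(bo));

    close(dmabuf_fd);
    gbm_bo_destroy(bo);
    gbm_device_destroy(gbm);
    close(fd);
    return 0;
}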

This should make it much easier to work on the rendering side of things, and work on a proper DRM driver in the mainline kernel can proceed in parallel.

For those interested in joining the effort, Alyssa has graciously taken the time to update the instructions to build and test Panfrost. You can join us in #panfrost on Freenode and start sending merge requests on GitLab.

Thanks to Collabora for sponsoring this work and to Alyssa Rosenzweig and Lyude Paul for their previous work and for answering my questions.

Monday, November 6, 2017

Experiments with crosvm

Last week I played a bit with crosvm, a KVM monitor used within Chromium OS for application isolation. My goal is to learn more about the current limits of virtualization for isolating applications in mainline. Two of crosvm's defining characteristics are that it's written in Rust for increased security, and that it uses namespaces extensively to reduce the attack surface of the monitor itself.

It was quite easy to get it running outside Chromium OS (I have been testing with Fedora 26), with the only complication being that minijail isn't widely packaged in distros. In the instructions below we hack around the issue with linker environment variables so we don't have to install it properly. The instructions are in the form of shell commands, for illustrative purposes only.

Build kernel:
$ cd ~/src
$ git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
$ cd linux
$ git checkout v4.12
$ make x86_64_defconfig
$ make bzImage
$ cd ..
Build minijail:
$ git clone https://android.googlesource.com/platform/external/minijail
$ cd minijail
$ make
$ cd ..
Build crosvm:
$ git clone https://chromium.googlesource.com/a/chromiumos/platform/crosvm
$ cd crosvm
$ LIBRARY_PATH=~/src/minijail cargo build
Generate rootfs:
$ cd ~/src/crosvm
$ dd if=/dev/zero of=rootfs.ext4 bs=1K count=1M
$ mkfs.ext4 rootfs.ext4
$ mkdir rootfs/
$ sudo mount rootfs.ext4 rootfs/
$ sudo debootstrap testing rootfs/
$ sudo umount rootfs/
Run crosvm:
$ LD_LIBRARY_PATH=~/src/minijail ./target/debug/crosvm run -r rootfs.ext4 --seccomp-policy-dir=./seccomp/x86_64/ ~/src/linux/arch/x86/boot/compressed/vmlinux.bin
The work ahead includes figuring out the best way for Wayland clients in the guest to interact with the compositor in the host, and also for guests to make efficient use of the GPU.

Thursday, December 22, 2016

Slides on the Chamelium board

Yesterday I gave a short talk about the Chamelium board from the ChromeOS team, and thought that the slides could be useful for others as this board gets used more and more outside of Google.

https://people.collabora.com/~tomeu/Chamelium_Overview.odp


If you are interested in how this board can help you automate the testing of your display (and not only!) code and hardware, a new mailing list has been created to discuss its uses. We at Collabora will be happy to help you integrate this board in your CI lab as well.

Thanks go to Intel for sponsoring the preparation of these slides and for allowing me to share them under an open license.

And of course, thanks to Google's ChromeOS team for releasing the hardware design with an open hardware license along with the code they are running on it and with it.

Tuesday, November 8, 2016

How continuous integration can help you keep pace with the Linux kernel

Almost all of Collabora's customers use the Linux kernel on their products. Often they will use the exact code as delivered by the SBC vendors and we'll work with them in other parts of their software stack. But it's becoming increasingly common for our customers to adapt the kernel sources to the specific needs of their particular products.

A very big problem most of them have is that the kernel version they based their work on isn't getting security updates any more because it's already several years old. And the reason companies are shipping such old kernels is that they have been so heavily modified compared to the upstream versions that rebasing their trees on top of newer mainline releases is so expensive that it is very hard to budget and plan for.

To avoid that, we always recommend that our customers stay close to their upstreams, which implies rebasing often on top of new releases (typically LTS releases, which get long-term support). For the budgeting of that work to become possible, the size of the delta between mainline and downstream sources needs to be manageable, which is why we recommend contributing back any changes that aren't strictly specific to their products.

But even for those few companies that already have processes in place for upstreaming their changes and are rebasing regularly on top of new LTS releases, keeping up with mainline can be a substantial disruption of their production schedules. This is in part because the new mainline release will contain new bugs, and more bugs will appear in the downstream changes as they get applied to the new version.

Those companies that are already keeping close to their upstreams typically have advanced QA infrastructure that will detect those bugs long before production, but a long stabilization phase after every rebase can significantly slow product development.

To improve this situation and encourage more companies to keep their efforts close to upstream, we at Collabora have been working for a few years already on continuous integration of FOSS components across a diverse array of hardware. The initial work was sponsored by Bosch for one of their automotive projects, and since the start of 2016 Google has been sponsoring work on continuous integration of the mainline kernel.

One of the major efforts to continuously integrate the mainline Linux kernel codebase is kernelci.org, which builds several configurations of different trees and submits boot jobs to several labs around the world, collating the results. This is already of great help in detecting, at a very early stage, any changes that either break the builds or prevent a specific piece of hardware from completing the boot stage.

Though kernelci.org can easily detect when an update to a source code repository has introduced a bug, such updates can contain several dozen new commits, and without knowing which specific commit introduced the bug, we cannot identify culprits to notify of the problem. This means that either someone needs to monitor the dashboard for problems, or email notifications are sent to the owners of the repositories, who then have to manually look for suspicious commits before getting in contact with their authors.

To address this limitation, Google has asked us to look into improving the existing code for automatic bisection so it can be used right away when a regression is detected, so the possible culprits are notified right away without any manual intervention.

Another area in which kernelci.org is currently lacking is test coverage. Build and boot regressions are very annoying for developers because they negatively impact everybody who works on the affected configurations and hardware, but regressions in peripheral support or in other subsystems that aren't critically involved during boot can still make rebases much costlier.

At Collabora we have had a strong interest in having the DRM subsystem under continuous integration, and some time ago we started an R&D project to make the test suite in IGT generically useful for all DRM drivers. IGT started out being i915-specific, but as most of the tests exercise the generic DRM ABI, they could just as well test other drivers with a moderate amount of effort. Early in 2016 Google started sponsoring this work, and as of today submitters of new drivers are using it to validate their code.

Another related effort has been the addition to DRM of a generic ABI for retrieving CRCs of frames from different components in the graphics pipeline, so two frames can be compared when we know that they should match. And another one is adding support to IGT for the Chamelium board, which can simulate several display connections and hotplug events.

A side-effect of having continuous integration of changes in mainline is that when downstreams are sending back changes to reduce their delta, the risk of introducing regressions is much smaller and their contributions can be accepted faster and with less effort.

We believe that improved QA of FOSS components will expand the base of companies that can benefit from involvement in development upstream and are very excited by the changes that this will bring to the industry. If you are an engineer who cares about QA and FOSS, and would like to work with us on projects such as kernelci.org, LAVA, IGT and Chamelium, get in touch!