Monday, June 26, 2023

Etnaviv NPU update 3: Deeper into the convolution units

What two weeks!

Programming of the convolution units

Taking from where I left at the last update, I made progress in understanding the format of the buffer that contains the weights and biases.

The bit of knowledge that made a difference was realising that the format is optimized so that each NN core can efficiently access the portion of it that it needs, without having to do any parsing or decoding. Knowing that also helped in guessing what some fields in the parameter structure are for.

With that, I  was able to correctly run a convolution on a small matrix with arbitrary weights and biases.

The biggest roadblock in this area currently is understanding how I need to program the output unit in the NN so the output data is in the desired scale. There are a series of fields that influence how the output values are processed before being placed in the output buffer, and I don't really know how they work yet. They are called post_shift and post_mult and the first correlates moderately (r=0.78) to the quantization scale of the output. I know that the post_shift field does what it says, to the right, but to understand what value I need in each situation I feel I need to understand better how the hardware works and what could be the initial values at the end of the convolution and before the output unit. I will be reading a bunch of research papers about NN-accelerating silicon in the summer.

That said, replacing the OpenCL kernels in TensorFlow Lite's GPU delegate that do convolutions with the fixed units turned out to be a worse idea than I initially thought. This is because that delegate is completely oriented towards float-first hardware such as GPUs and this accelerator is integer only.

A consequence of this is that TFLite inserts a dequantize operation at the start of the graph and a quantize at the end, to match the desired intput and output formats of a fully quantized model while feeding floats to the GPU. We need integers, so would be having to quantize after TFLite's dequantization and vice versa. Also, the other operations in the graph expect floats as well... This is certainly the wrong path to take for performance in a bandwidth-constrained device as all embedded boards are, so I had to go back to the drawing board.

A new Gallium frontend: Teflon

If TF Lite's GPU delegate is such a bad match for this HW, what can we do to run inferences with reasonable speeds? The same that VeriSilicon did: write our own delegate:

https://gitlab.freedesktop.org/tomeu/mesa/-/commits/teflon/

TF Lite's operation description matches relatively well what we currently know of the configuration of the NN units. So we will not need to write complex shaders to implement the operations, but "just" translate the description of the operation to the HW configuration.

Of course, there is no HW that has fixed function units that accelerate all operations that are built into TF Lite or even that the most commonly used models contain. VeriSilicon's delegate deals with that by having a library of optimized OpenCL kernels that run on their programmable shader core(s).

But we want to avoid getting in the business of writing dozens of kernels that will need to be tweaked and made more complex so they run efficiently on other NPUs out there.

Fortunately, the delegate infrastructure in TF Lite is designed for this very scenario of imperfect HW and we can have a simple delegate that will implement the operations supported by the HW and the rest will execute in other delegates based on their capabilities.

How fast that will be is a big unknown right now, as switching between delegates will have a cost in terms of synchronization and data sharing, but that is something that we probably can improve in the TF Lite code base as the kernel has already all mechanisms for efficient synchronization and data sharing.

Other possibilities that we have with the TF Lite delegate mechanism is offloading the operations we don't need to a different delegate that supports accelerating them. For example, in the case of a board with Amlogic A311D or S905D3, we could use the GPU delegate to run those operations on the Mali GPU on it, via the OpenCL driver that Alyssa is writing in Mesa.

And if that is still slower than with the proprietary stack, one could always write an optimized kernel in NIR to run on the programmable core in the Vivante NPU. That is the beauty of free software, we can address the needs we have ourselves, and importantly so, do it by pooling work with others!

Because this frontend is implemented in terms of Gallium, we leverage the infrastructure in there for memory management, synchronization and execution. I think this will work well for adding support to other NN engines such as those from Rockchip, Cadence, Mediatek, etc.

Next steps

I need to crack the nut of the post-processing of the raw output so it is in the expected scale, and afterwards I will be looking at handling multiple feature maps (kernel z > 1).

After that I don't see much else in the way of running convolutions as expected by TF Lite, so hopefully I will be running some models and measuring the performance. I expect that we will want to do the same for accelerating tensor operations with the TP units. And we will probably want to give a look at using the SRAM to reduce bandwidth and memory access latency. That still some way off though, and the summer is just starting!

Saturday, June 10, 2023

Etnaviv NPU update 2: Diving into the convolution units

In the previous update I explained that the programmable core in this NPU (VIPNano-QI) is too slow to run inference workloads substantially faster than the CPUs. The vendor stack achieves acceptable inference rates by running most of the work on fixed-function units that can perform different kinds of convolutions and transformations of tensors.

Most of the work is done by the convolution units that VeriSilicon calls NN cores, so this is what I have been focusing on at this stage. I think that even if we still do all tensor transformation on the programmable core, by using the NN units we could already achieve usable performance.

By looking around in the ioctls that VeriSilicon's userspace stack sends to the kernel, it was clear that in the NN jobs there was little more than a pointer to a structure that configures the NN fixed-function units. Luckily I didn't need to reverse engineer it from zero, as VeriSilicon's out-of-tree kernel driver is GPL and contains two instances of programming this HW with a trivial job (a 2x2x1 kernel with a single bias value).

Took some boring work to translate what the code does to a C struct, but this was the initial one:

struct etna_nn_params {
   uint32_t op_type : 1; /* conv: 0 fully_connected: 1 */
   uint32_t no_z_offset : 1;
   uint32_t kernel_x_size : 4;
   uint32_t kernel_z_size : 14; /* & 0x3FFF */
   uint32_t kernels_per_core : 7;
   uint32_t zero1 : 2;
   uint32_t zero2 : 1;
   uint32_t zero3 : 1;
   uint32_t nn_layer_flush : 1;

   uint32_t kernel_data_type : 2; /* UINT8 0x2 INT8 0x0 */
   uint32_t in_image_data_type : 2; /* UINT8 0x2 INT8 0x0 */
   uint32_t out_image_data_type : 2; /* UINT8 0x2 INT8 0x0 */
   uint32_t in_image_x_size : 13;
   uint32_t in_image_y_size : 13;

   uint32_t zero4 : 3;
   uint32_t zero5 : 3;
   uint32_t unused0 : 1;
   uint32_t zero6 : 16;
   uint32_t zero7 : 1;
   uint32_t enable_relu : 1;
   uint32_t zero9 : 1;
   uint32_t post_shift : 6;

   uint32_t unused1 : 2;
   uint32_t zero10 : 1;
   uint32_t zero11 : 1;
   uint32_t unused2 : 2;
   uint32_t out_image_x_size : 13;
   uint32_t out_image_y_size : 13;

   uint32_t out_image_z_size : 14;
   uint32_t zero12 : 2; /* 0x0 */
   uint32_t zero13 : 1; /* (0 >> 3) & 0x1 */
   uint32_t zero14 : 1; /* (0 >> 3) & 0x1 */
   uint32_t unk0 : 7;  /* 1 */
   uint32_t unk1 : 7;  /* 1 */

   uint32_t kernel_address : 26; /* >> 6 */
   uint32_t kernel_z_size2 : 6; /* >> 14 */

   uint32_t in_image_address;

   uint32_t out_image_address;

   uint32_t unused3 : 12;
   uint32_t kernel_y_size : 4;
   uint32_t out_image_y_size2 : 16;  /* maybe stride? */

   uint32_t zero15;

   uint32_t zero16;

   uint32_t zero17;

   uint32_t kernel_cache_end_address;

   uint32_t zero19;

   uint32_t image_end_address;

   uint32_t zero20 : 2;
   uint32_t zero21 : 16;
   uint32_t kernel_data_type_bit_2 : 1;
   uint32_t in_image_data_type_bit_2 : 1;
   uint32_t out_image_data_type_bit_2 : 1;
   uint32_t zero22 : 6;
   uint32_t post_shift_bit_5_6 : 2;
   uint32_t unused4 : 3;

   uint32_t in_image_stride : 16;
   uint32_t in_image_y_size2 : 16; /* again? */

   uint32_t out_image_stride : 16;
   uint32_t unused5 : 8;
   uint32_t zero23 : 8;

   uint32_t zero24 : 26; /* 0 >> 6 */
   uint32_t zero25 : 1;
   uint32_t zero26 : 1;
   uint32_t zero27 : 1; /* 0 >> 4 */
   uint32_t zero28 : 1; /* 0 >> 4 */
   uint32_t zero29 : 1;
   uint32_t kernel_data_type_bit_3 : 1;

   uint32_t unk2 : 26; /* 0xFFFFFFFF >> 6 */
   uint32_t unused6 : 4;
   uint32_t zero30 : 1;
   uint32_t in_image_data_type_bit_3 : 1;

   uint32_t zero31 : 26; /* 0 >> 6 */
   uint32_t out_image_data_type_bit_3 : 1;
   uint32_t unused7 : 6;

   uint32_t unk3 : 26; /* 0xFFFFFFFF >> 6 */
   uint32_t unused8 : 6;

   uint32_t coef_zero_point : 8;
   uint32_t out_zero_point : 8;
   uint32_t zero32 : 1;
   uint32_t zero33 : 1;
   uint32_t zero34 : 8;
   uint32_t unused9 : 6;

   uint32_t zero35;

   uint32_t zero36 : 4;
   uint32_t zero37 : 28;  /* 0 >> 4 */

   uint32_t zero38 : 4;
   uint32_t zero39 : 28;  /* 0 >> 4 */

   uint32_t further1;
   uint32_t further2;
   uint32_t further3;
   uint32_t further4;
   uint32_t further5;
   uint32_t further6;
   uint32_t further7;
   uint32_t further8;
};

As you can see there are a lot of "zero" and "unused" fields, most of them I think will be actually used for something as HW engineers don't tend to like wasting bits. By adding instrumentation for dumping these structs to the reverse engineering tooling, I will be making myself a better idea of what each field means and does.

I got GPU hangs the first time that I submitted a job with the same configuration as the kernel's trivial reset job, and looking further showed that the buffer that contains the convolution filters must follow a specific format.

By looking again at the kernel driver sources, I used the same kernel/filter buffer and the GPU didn't hang anymore. That kernel was all zeroes as the weights, and indeed my output buffer was now full of zeroes.

Then I tried to put my weights into the format that I inferred from the kernel driver source code, but I wasn't able to get any job to run to completion without hangs, and the output buffer was unchanged.

To figure out what I was missing about how the weights (and the biases) need to be placed in the buffer, I added code to the reverse engineering tooling to dump the weights buffer. With that buffer and after playing some with the sizes of the output, input and kernel buffers, I finally got a job to run with non-zero weights.

What I am doing right now is slowly zeroing out the weights buffer to figure out what are data bits, what are control and what effect the changes have in the output.

Hope that by the next update I will have documented the format of the weights buffer and will be able to run at least one kind of convolution!

Monday, May 29, 2023

Etnaviv NPU update 1: Planning for performance

As I wrote in the last update, my OpenCL branch is able to correctly run MobileNet v1 with the GPU delegate in TensorFlow-Lite, albeit much slower than with VeriSilicon's proprietary stack.

In the weeks that passed I have been investigating the performance difference, understanding better how the HW works and what could the explanation be. Inference with Etnaviv took 1200 ms, while the proprietary stack did the same in less than 10 ms (120x faster!).

When trying to understand the big performance difference I discovered that the existing reverse engineering tools that I had been using to understand how to run OpenCL workloads weren't working. They detected a single OpenCL kernel at the end of the execution, and there was no way that single kernel could be executing the whole network.

After a lots of fumbling around in the internets I stumbled upon a commit that included an interestingly-named environment variable: VIV_VX_DISABLE_TP_NN_EVIS. With it, VeriSilicon's OpenVX implementation will execute the network without using nor the TP or NN fixed-function units, nor the EVIS instruction set (which helps with reducing memory bandwith use by allowing operations on packed int8 and int16 types).

With that environment variable OpenVX was using regular OpenCL to run the inference, and the performance difference was interesting: 398.428 ms. Still much better than our time, but also more than 50 times slower than when fully using the capabilities of the hardware. The reason for this is that there is only one core in the NPU that is able to run programmable kernels. The rest are fixed-function units as I'm going to explain next.

Digging further in VeriSilicon's kernel driver and on marketing documents I gathered that this particular NPU has 8 convolution cores (they call them NN cores) and 4 cores for accelerating some tensor operations (TP cores). What these units cannot do, has to be done in the single slow programmable core.

Next step was to understand how the proprietary stack made use of the fixed function units in the NPU.

The MobileNet v1 model I used contains these operations, as output by TFLite's model analyzer:

  Op#0 CONV_2D(T#88, T#6, T#4[28379, 17476, 18052, -2331, 17431, ...]) -> [T#5]
  Op#1 DEPTHWISE_CONV_2D(T#5, T#33, T#32[-249, 165, 173, -2, 158, ...]) -> [T#31]
...

[12 more pairs of CONV_2D and DEPTHWISE_CONV_2D]

...

  Op#27 AVERAGE_POOL_2D(T#29) -> [T#0]
  Op#28 CONV_2D(T#0, T#3, T#2[-5788, -4159, 2282, -6706, -9783, ...]) -> [T#1]
  Op#29 RESHAPE(T#1, T#86[-1, 1001]) -> [T#85]
  Op#30 SOFTMAX(T#85) -> [T#87]

As can be seen, it is basically a bunch of convolutions with a final reshaping and a SOFTMAX operation at the end. 

By using some of the environment variables that are mentioned in this issue in GitHub, we can get some information on how the proprietary stack plans the execution on the hardware:

  operation_name:VXNNE_OPERATOR_TENSOR_TRANS operation_target:VXNNE_OPERATION_TARGET_TP
  operation_name:VXNNE_OPERATOR_RESHUFFLE operation_target:VXNNE_OPERATION_TARGET_TP
  operation_name:VXNNE_OPERATOR_CONVOLUTION operation_target:VXNNE_OPERATION_TARGET_NN
...

[34 more VXNNE_OPERATOR_CONVOLUTION on VXNNE_OPERATION_TARGET_NN] 

...

  operation_name:VXNNE_OPERATOR_POOLING operation_target:VXNNE_OPERATION_TARGET_SH
  operation_name:VXNNE_OPERATOR_FULLYCONNECTED operation_target:VXNNE_OPERATION_TARGET_TP
  operation_name:VXNNE_OPERATOR_SOFTMAX operation_target:VXNNE_OPERATION_TARGET_SH

From that we can see that the TP units are used to prepare the input tensor, then all convolution operations are going to the NN cores, and then the output of the convolutions is passed through a pooling operation in the programmable core, passing its input to the TP cores for further processing and then finishing with SOFTMAX on the programmable cores.

So in this case, only a small part of the network is actually ran on the programmable cores, via OpenCL...

Next steps 

What I will be working on next:

  1. Adapt the existing RE tooling to dump information regarding NN and TP workflows
  2. Start to fill the data structures by reading the code of VeriSilicon's kernel driver, which executes some trivial workloads to, presumably, reset the HW between context switches to prevent information leaks.
  3. Write some simple OpenVX graphs that exercise each of the operations that the documentation claims to be supported by the NPU.
  4. Observe the data that VeriSilicon's userspace stack passes to the kernel, and infer from there the exact layout of the configuration buffers that program the fixed-function units.
  5. Hack Mesa to send a NN job if the name of the CL kernel contains "convolution".
  6. Get things working for this specific network and measure performance.

If performance is at least 3x faster than running the inference on the CPU, I would call this good enough to be useful and I will switch to upstreaming. The Mesa side of it doesn't look that bad, but I think the bigger challenge will be getting something merged in TensorFlow that can run fast on this hardware.

The most reasonable approach I have been able to think of would be adding new CL C and SPIR-V vendor extensions that add a new intrinsic for the whole convolution operation (with parameters similar to those of the vxConvolutionLayer node).

The GPU delegate in TensorFlow Lite would use it on the Vivante NPU and Mesa would have a robust way of knowing that this kernel should be run with a NN job, and with what configuration.

That's a lot of work, but I would say at this point that afterwards I will start looking at making fuller use of the NPU's capabilities by doing something similar with the operations that the TP cores can accelerate.

Wednesday, April 26, 2023

A long overdue update

Cannot believe it has been years since my last update here!

There are two things that I would like to tell people about:

The first is that I no longer work at Collabora. It has been almost 13 years full of excitement and recently I came to believe that I wanted a proper change.

They are great folks to work with, so if you are thinking of a career change and want to do open-source stuff upstream, I recommend you to consider them.

And the other topic is what I have been working on lately: a free software driver for the NPUs that VeriSilicon sells to SoC vendors.

TL;DR

tomeu@arm-64:~/tensorflow/build/examples/label_image$ SMALLER_SOFTMAX=1 RUSTICL_ENABLE=etnaviv LD_LIBRARY_PATH=/home/tomeu/opencl/lib LIBGL_DRIVERS_PATH=/home/tomeu/opencl/lib/dri/ ./label_image --gpu_backend=cl --use_gpu=true --verbose 1 --tflite_model ../../../assets/mobilenet_quant_v1_224.tflite --labels ../../../assets/labels.txt --image ../../../assets/grace_hopper.bmp --warmup_runs 1 -c 1

[snip]
INFO: invoked
INFO: average time: 1261.99 ms
INFO: 0.666667: 458 bow tie
INFO: 0.294118: 653 military uniform
INFO: 0.0117647: 835 suit
INFO: 0.00784314: 611 jersey
INFO: 0.00392157: 922 book jacket

That is TensorFlow Lite's OpenCL delegate detecting objects with Etnaviv from Grace Hopper's portrait in military uniform.

The story behind this work

Many years ago, when I was working on the operating system for the One Laptop Per Child project, I became painfully aware of the problems derived by IP vendors not providing the source code for their drivers.

This and other instances of the same problem motivated me to help out on the Panfrost project, writing a free software driver for the Mali GPUs by Arm. That gave me a great opportunity to learn about reverse engineering from Alyssa Rosenzweig.

Nowadays the Mesa project contains drivers for most GPUs out there, some maintained by the same companies that develop the IP, some by their customers and hobbyists alike. So the problem of the availability of source code for GPU drivers is pretty much solved.

Only that, with the advent of machine learning in the edge, we are reliving this problem with the drivers for accelerating those workloads with NPUs, TPUs, etc.

Vivante's NPU IP is very closely based on their GPUs. And it is pretty popular, being included in SoCs by Amlogic, Rockchip, NXP, Broadcom and more.

We already have a reasonably complete driver (Etnaviv) for their GPU IP, so I started by looking at what the differences were and how much of the existing userspace and kernel drivers we could reuse.

The kernel driver works with almost no changes, just took me some time to implement the hardware initialization properly in upstream. As of Linux 6.3 the driver loads correctly on Khadas' VIM3, but for a chance at decent performance this patch is needed:

[PATCH] arm64: dts: VIM3: Set the rates of the clocks for the NPU

Due to its experimental status, it is disabled by default in the device tree. To enable it, add the below to arch/arm64/boot/dts/amlogic/meson-g12b-a311d-khadas-vim3.dts:

&npu {
       status = "okay";
};

Enabling Etnaviv for other boards with this IP should be relatively straightforward, by describing how the HW is initialized by inspecting the downstream kernel sources for the board in question.

Mesa has seen most of the work, as this IP is compute-only and the userspace driver only targeted OpenGL ES.

First step was wiring up the existing driver to Mesa's OpenCL implementation, and then I focused on getting the simplest kernel to correctly run. For this and all the subsequent work, the reverse-engineering tools used by the Etnaviv community have been of great use.

At that point I had to pause the work to focus on other unrelated stuff, but Collabora's Italo Nicola and Faith Ekstrand did great work to extend the existing compiler to generate OpenCL kernels.

Once I didn't have a day job getting in the way anymore, I started adding the features needed to run the label_image example in TensorFlow Lite.

And eventually we got to this point. 1.2 seconds to run that inferrence is a lot of time, so the next steps for me will be to figure out what are the biggest causes for the low performance.

With the goal in mind of providing a free software driver that companies can use to run inferrence on their products containing Vivante's NPU IP, I need for those tasks to be performanced at at least the same order of magnitude as the closed source solution provided by Vivante.

Right now Etnaviv is about twice as slow as running label_image with the OpenCL delegate on Vivante's driver, but the solution that they provide uses a special delegate that is able to better use their hardware is several times faster.

Current performance situation (label_image):

  • OpenCL delegate with Etnaviv: 1261.99 ms
  • OpenCL delegate with Galcore: 787.733 ms
  • CPU: 149.19 ms
  • TIM-VX delegate: 2.567 ms (!)

The plan is to first see why we are slower with the OpenCL delegate and fix it, and afterwards the real fun stuff will start: seeing how we can use more of the HW capabilities through the OpenCL API and with upstream TensorFlow Lite.

Next steps

Italo is cleaning up an initial submission for inclusion in Mesa upstream. Once that is done I will rebase my branch and start submitting features.

In parallel to upstreaming, I will be looking at what is needed to get closer to the performance of the closed source driver, for ML acceleration.

Thanks

There is a lot of people besides the ones mentioned above that have made this possible. Some of they are:

  • The Mesa community, for having put together such a great framework for GPU drivers. Their CI system has been great to track progress and avoid regressions.
  • The Etnaviv community, for all the previous reverse engineering work that documented most of the OpenCL specificities, for a great pair of drivers to base the work on and the very useful tooling around it.
  • And the Linux kernel community, that made it so easy to get the hardware recognized and the Etnaviv driver probed on it.

Last but not least, there are some individuals to whom I was able to turn when I needed help:

  • Christian Gmeiner (austriancoder)
  • Lucas Stach (lynxeye)
  • Neil Armstrong (narmstrong)
  • Faith Ekstrand (gfxstrand)
  • Karol Herbst (karolherbst)
A big thanks, it has been a lot of fun!

Monday, March 4, 2019

Panfrost update: a new kernel driver

The video

Below you can see the same scene that I recorded in January, which was rendered by Panfrost in Mesa but using Arm's kernel driver. This time, Panfrost is using a new kernel driver that is in a form close to be acceptable in the mainline kernel:

The history behind it

During the past two months Rob Herring and I have been working on a new driver for Midgard and Bifrost GPUs that could be accepted mainline.

Arm already maintains a driver out of tree with an acceptable open source license, but it doesn't implement the DRM ABI and several design considerations make it unsuitable for inclusion in mainline Linux.

The absence of a driver in mainline prevents users from keeping their kernels up-to-date and hurts integration with other parts of the free software stack. It also discourages SoC and BSP vendors from submitting their code to mainline, and hurts their ability to track mainline closely.

Besides the code of the driver itself, there's one more condition for mainline inclusion: an open source implementation of the userspace library needs to exist, so other kernel contributors can help verifying, debugging and maintaining the kernel driver. It's an enormous pile of difficult work to reverse engineer the inner workings of a GPU and then implement a compiler and command submission infrastructure, so big thanks to Alyssa Rosenzweig for leading that effort.

Upstream status

Most of the Panfrost code is already part of mainline Mesa, with the code that directly interacts with the new DRM driver being in the review stage. Currently targeted GPUs are T760 and T860, with the RK3399 being the SoC more often used for testing.

The kernel driver is being developed in the open and though we are trying to follow the best practices as displayed by other DRM drivers, there's a number of tasks that need to be done before we consider it ready for submission.

The work ahead

In the kernel:
- Make MMU code more complete for correctness and better performance
- Handle errors and hangs and correctly reset the GPU
- Improve fence handling
- Test with compute shaders (to check completeness of the ABI)
- Lots of cleanups and bug fixing!

In Mesa:
- Get GNOME Shell working
- Get Chromium working with accelerated WebGL
- Get all of glmark2 working
- Get a decent subset of dEQP passing and use it in CI
- Keep refactoring the code
- Support more hardware

Get the code

The exact bits used for the demo recorded above are in various stages of getting upstreamed to the various upstreams, but here are in branches for easier reproduction:


Monday, January 7, 2019

A Panfrost milestone

The video

Below you can see glmark2 running as a Wayland client in Weston, on a NanoPC -T4 (so a RK3399 SoC with a Mali T-864 GPU)). It's much smoother than on the video, which is limited to 5FPS by the webcam.


Weston is running with the DRM backend and the GL renderer.

The history behind it


For more than 10 years, at Collabora we have been happily helping our customers to make the most of their hardware by running free software.

One area some of us have specially enjoyed working on has been open drivers for GPUs, which for a long time have been considered the next frontier in the quest to have a full software platform that companies and individuals can understand, improve and fix without having to ask for permission first.

Something that has saddened me a bit has been our reduced ability to help those customers that for one reason or another had chosen a hardware platform with ARM Mali GPUs, as no open driver was available for those.

While our biggest customers were able to get a high level of support from the vendors in order to have the Mali graphics stack well integrated with the rest of their product, the smaller ones had a much harder time in achieving that level of integration, which manifested in reduced performance, increased power consumption and slipped milestones.

That's why we have been following with great interest the several efforts that aimed to come up with an open driver for GPUs in the Mali family, one similar to those already existing for Qualcomm, NVIDIA and Vivante.

At XDC last year we had the chance of meeting the people involved in the latest effort to develop such a driver: Panfrost. And in the months that followed I made some room in my backlog to come up with a plan to give the effort a boost.

At that point, Panfrost was only able to get its bits in the screen by an elaborate hack that involved copying each frame into a X11 SHM buffer, which besides making the setup of the development environment much more cumbersome, invalidated any performance analysis. It also limited testing to demos such as glmark2.

Due to my previous work on Etnaviv I was already familiar with the abstractions in Mesa for setups in which the display of buffers is performed by a device different from the GPU so it was just a matter of seeing how we could get the kernel driver for the Mali GPU to play well with the rest of the stack.

So during the past month or so I have come up with a proper implementation of the winsys abstraction that makes use of ARM's kernel driver. The result is that now developers have a better base on which to work on the rendering side of things.

By properly creating, exporting and importing buffers, we can now run applications on GBM, from demos such as kmscube and glmark2 to compositors such as Weston, but also big applications such as Kodi. We are also supporting zero-copy display of GPU-rendered clients in Weston.

This should make it much easier to work on the rendering side of things, and work on a proper DRM driver in the mainline kernel can proceed in parallel.

For those interested in joining to the effort, Alyssa has graciously taken the time to update the instructions to build and test Panfrost. You can join us at #panfrost in Freenode and can start sending merge requests to Gitlab.

Thanks to Collabora for sponsoring this work and to Alyssa Rosenzweig and Lyude Paul for their previous work and for answering my questions.

Monday, November 6, 2017

Experiments with crosvm

Last week I played a bit with crosvm, a KVM monitor used within Chromium OS for application isolation. My goal is to learn more about the current limits of virtualization for isolating applications in mainline. Two of crosvm's defining characteristics is that it's written in Rust for increased security, and that uses namespaces extensively to reduce the attack surface of the monitor itself.

It was quite easy to get it running outside Chromium OS (have been testing with Fedora 26), with the only complication being that minijail isn't widely packaged in distros. In the instructions below we hack around the issue with linker environment variables so we don't have to install it properly. Instructions are in form of shell commands for illustrative purposes only.

Build kernel:
$ cd ~/src
$ git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
$ cd linux
$ git checkout v4.12
$ make x86_64_defconfig
$ make bzImage
$ cd ..
Build minijail:
$ git clone https://android.googlesource.com/platform/external/minijail
$ cd minijail
$ make
$ cd ..
Build crosvm:
$ git clone https://chromium.googlesource.com/a/chromiumos/platform/crosvm
$ cd crosvm
$ LIBRARY_PATH=~/src/minijail cargo build
Generate rootfs:
$ cd ~/src/crosvm
$ dd if=/dev/zero of=rootfs.ext4 bs=1K count=1M
$ mkfs.ext4 rootfs.ext4
$ mkdir rootfs/
$ sudo mount rootfs.ext4 rootfs/
$ debootstrap testing rootfs/
$ sudo umount rootfs/
Run crosvm:
$ LD_LIBRARY_PATH=~/src/minijail ./target/debug/crosvm run -r rootfs.ext4 --seccomp-policy-dir=./seccomp/x86_64/ ~/src/linux/arch/x86/boot/compressed/vmlinux.bin
The work ahead includes figuring out the best way for Wayland clients in the guest interact with the compositor in the host, and also for guests to make efficient use of the GPU.