Monday, January 7, 2019

A Panfrost milestone

The video

Below you can see glmark2 running as a Wayland client in Weston, on a NanoPC -T4 (so a RK3399 SoC with a Mali T-864 GPU)). It's much smoother than on the video, which is limited to 5FPS by the webcam.


Weston is running with the DRM backend and the GL renderer.

The history behind it


For more than 10 years, at Collabora we have been happily helping our customers to make the most of their hardware by running free software.

One area some of us have specially enjoyed working on has been open drivers for GPUs, which for a long time have been considered the next frontier in the quest to have a full software platform that companies and individuals can understand, improve and fix without having to ask for permission first.

Something that has saddened me a bit has been our reduced ability to help those customers that for one reason or another had chosen a hardware platform with ARM Mali GPUs, as no open driver was available for those.

While our biggest customers were able to get a high level of support from the vendors in order to have the Mali graphics stack well integrated with the rest of their product, the smaller ones had a much harder time in achieving that level of integration, which manifested in reduced performance, increased power consumption and slipped milestones.

That's why we have been following with great interest the several efforts that aimed to come up with an open driver for GPUs in the Mali family, one similar to those already existing for Qualcomm, NVIDIA and Vivante.

At XDC last year we had the chance of meeting the people involved in the latest effort to develop such a driver: Panfrost. And in the months that followed I made some room in my backlog to come up with a plan to give the effort a boost.

At that point, Panfrost was only able to get its bits in the screen by an elaborate hack that involved copying each frame into a X11 SHM buffer, which besides making the setup of the development environment much more cumbersome, invalidated any performance analysis. It also limited testing to demos such as glmark2.

Due to my previous work on Etnaviv I was already familiar with the abstractions in Mesa for setups in which the display of buffers is performed by a device different from the GPU so it was just a matter of seeing how we could get the kernel driver for the Mali GPU to play well with the rest of the stack.

So during the past month or so I have come up with a proper implementation of the winsys abstraction that makes use of ARM's kernel driver. The result is that now developers have a better base on which to work on the rendering side of things.

By properly creating, exporting and importing buffers, we can now run applications on GBM, from demos such as kmscube and glmark2 to compositors such as Weston, but also big applications such as Kodi. We are also supporting zero-copy display of GPU-rendered clients in Weston.

This should make it much easier to work on the rendering side of things, and work on a proper DRM driver in the mainline kernel can proceed in parallel.

For those interested in joining to the effort, Alyssa has graciously taken the time to update the instructions to build and test Panfrost. You can join us at #panfrost in Freenode and can start sending merge requests to Gitlab.

Thanks to Collabora for sponsoring this work and to Alyssa Rosenzweig and Lyude Paul for their previous work and for answering my questions.

Monday, November 6, 2017

Experiments with crosvm

Last week I played a bit with crosvm, a KVM monitor used within Chromium OS for application isolation. My goal is to learn more about the current limits of virtualization for isolating applications in mainline. Two of crosvm's defining characteristics is that it's written in Rust for increased security, and that uses namespaces extensively to reduce the attack surface of the monitor itself.

It was quite easy to get it running outside Chromium OS (have been testing with Fedora 26), with the only complication being that minijail isn't widely packaged in distros. In the instructions below we hack around the issue with linker environment variables so we don't have to install it properly. Instructions are in form of shell commands for illustrative purposes only.

Build kernel:
$ cd ~/src
$ git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
$ cd linux
$ git checkout v4.12
$ make x86_64_defconfig
$ make bzImage
$ cd ..
Build minijail:
$ git clone https://android.googlesource.com/platform/external/minijail
$ cd minijail
$ make
$ cd ..
Build crosvm:
$ git clone https://chromium.googlesource.com/a/chromiumos/platform/crosvm
$ cd crosvm
$ LIBRARY_PATH=~/src/minijail cargo build
Generate rootfs:
$ cd ~/src/crosvm
$ dd if=/dev/zero of=rootfs.ext4 bs=1K count=1M
$ mkfs.ext4 rootfs.ext4
$ mkdir rootfs/
$ sudo mount rootfs.ext4 rootfs/
$ debootstrap testing rootfs/
$ sudo umount rootfs/
Run crosvm:
$ LD_LIBRARY_PATH=~/src/minijail ./target/debug/crosvm run -r rootfs.ext4 --seccomp-policy-dir=./seccomp/x86_64/ ~/src/linux/arch/x86/boot/compressed/vmlinux.bin
The work ahead includes figuring out the best way for Wayland clients in the guest interact with the compositor in the host, and also for guests to make efficient use of the GPU.

Thursday, December 22, 2016

Slides on the Chamelium board

Yesterday I gave a short talk about the Chamelium board from the ChromeOS team, and thought that the slides could be useful for others as this board gets used more and more outside of Google.

https://people.collabora.com/~tomeu/Chamelium_Overview.odp


If you are interested in how this board can help you automate the testing of your display (and not only!) code and hardware, a new mailing list has been created to discuss its uses. We at Collabora will be happy to help you integrate this board in your CI lab as well.

Thanks go to Intel for sponsoring the preparation of these slides and for allowing me to share them under an open license.

And of course, thanks to Google's ChromeOS team for releasing the hardware design with an open hardware license along with the code they are running on it and with it.

Tuesday, November 8, 2016

How continuous integration can help you keep pace with the Linux kernel

Almost all of Collabora's customers use the Linux kernel on their products. Often they will use the exact code as delivered by the SBC vendors and we'll work with them in other parts of their software stack. But it's becoming increasingly common for our customers to adapt the kernel sources to the specific needs of their particular products.

A very big problem most of them have is that the kernel version they based on isn't getting security updates any more because it's already several years old. And the reason why companies are shipping kernels so old is that they have been so heavily modified compared to the upstream versions, that rebasing their trees on top of newer mainline releases is so expensive that is very hard to budget and plan for it.

To avoid that, we always recommend our customers to stay close to their upstreams, which implies rebasing often on top of new releases (typically LTS releases, with long term support). For the budgeting of that work to become possible, the size of the delta between mainline and downstream sources needs to be manageable, which is why we recommend contributing back any changes that aren't strictly specific to their products.

But even for those few companies that already have processes in place for upstreaming their changes and are rebasing regularly on top of new LTS releases, keeping up with mainline can be a substantial disruption of their production schedules. This is in part because new bugs will be in the new mainline release, and new bugs will be in the downstream changes as they get applied to the new version.

Those companies that are already keeping close to their upstreams typically have advanced QA infrastructure that will detect those bugs long before production, but a long stabilization phase after every rebase can significantly slow product development.

To improve this situation and encourage more companies to keep their efforts close to upstream we at Collabora have been working for a few years already in continuous integration of FOSS components across a diverse array of hardware. The initial work was sponsored by Bosch for one of their automotive projects, and since the start of 2016 Google has been sponsoring work on continuous integration of the mainline kernel.

One of the major efforts to continuously integrate the mainline Linux kernel codebase is kernelci.org, which builds several configurations of different trees and submits boot jobs to several labs around the world, collating the results. This is being of great help already in detecting at a very early stage any changes that either break the builds, or prevent a specific piece of hardware from completing the boot stage.

Though kernelci.org can easily detect when an update to a source code repository has introduced a bug, such updates can have several dozens of new commits, and without knowing which specific commit introduced the bug, we cannot identify culprits to notify of the problem. This means that either someone needs to monitor the dashboard for problems, or email notifications are sent to the owners of the repositories who then have to manually look for suspicious commits before getting in contact with their author.

To address this limitation, Google has asked us to look into improving the existing code for automatic bisection so it can be used right away when a regression is detected, so the possible culprits are notified right away without any manual intervention.

Another area in which kernelci.org is currently lacking is in the coverage of the testing. Build and boot regressions are very annoying for developers because they impact negatively everybody who work in the affected configurations and hardware, but the consequences of regressions in peripheral support or other subsystems that aren't involved critically during boot can still make rebases much costlier.

At Collabora we have had a strong interest in having the DRM subsystem under continuous integration and some time ago started a R&D project for making the test suite in IGT generically useful for all the DRM drivers. IGT started out being i915-specific, but as most of the tests exercise the generic DRM ABI, they could as well test other drivers with a moderate amount of effort. Early in 2016 Google started sponsoring this work and as of today submitters of new drivers are using it to validate their code.

Another related effort has been the addition to DRM of a generic ABI for retrieving CRCs of frames from different components in the graphics pipeline, so two frames can be compared when we know that they should match. And another one is adding support to IGT for the Chamelium board, which can simulate several display connections and hotplug events.

A side-effect of having continuous integration of changes in mainline is that when downstreams are sending back changes to reduce their delta, the risk of introducing regressions is much smaller and their contributions can be accepted faster and with less effort.

We believe that improved QA of FOSS components will expand the base of companies that can benefit from involvement in development upstream and are very excited by the changes that this will bring to the industry. If you are an engineer who cares about QA and FOSS, and would like to work with us on projects such as kernelci.org, LAVA, IGT and Chamelium, get in touch!

Thursday, April 21, 2016

Validating changes to KMS drivers with IGT

New DRM drivers are being added to almost each new kernel release, and because the mode setting API is so rich and complex, bugs do slip in that translate to differences in behaviour between drivers.

There have been previous attempts at writing test suites for validating changes and preventing regressions, but they have typically happened downstream and focused on the specific needs of specific products and limited to one or at most a few of different hardware platforms.

Writing these tests from scratch would have been an enormous amount of work, and gathering previous efforts and joining them wouldn't be much worth it because they were written using different test frameworks and in different programming languages. Also, there would be great overlap on the basic tests, and little would remain of the trickier stuff.

Of the existing test suites, the one with most coverage is intel-gpu-tools, used by the Intel graphics team. Though a big part is specific to the i915 driver, what uses the generic APIs is pretty much driver-independent and can be made to work with the other drivers without much effort. Also, Broadcom's Eric Anholt has already started adding tests for IOCTLs specific to the VideoCore-IV driver.

Collabora's Micah Fedke and Daniel Stone had added a facility for selecting DRM device files other than i915's and I improved the abstraction for creating buffers so it works for drivers without GEM buffers. Next I removed a bunch of superfluous dependencies on i915-only stuff and got a useful subset of tests to run on a Radxa Rock2 board (with the Rockchip 3288 SoC). Around half of these patches have been merged already and the other half are awaiting review. Meanwhile, Collabora's Robert Foss is running the ported tests on a Raspberry Pi 2 and has started sending patches to account for its peculiarities.

The next two big chunks of work are abstracting CRC checksums of frames (on drivers other than i915 this could be done with Google's Chamelium or with a board similar to Numato Opsis), and the buffer management API from libdrm that is currently i915-only (bufmgr). Something that will have to be dealt with in the future is abstracting the submittal of specific loads on the GPU as that's currently very much driver-specific.

Additionally, I will be scheduling jobs in our LAVA instance to run these tests on the boards we have in there.

Thanks to Google for sponsoring my time, to the Intel OTC folks for their support and reviews, and to Collabora for sponsoring Robert's, Micah's and Daniel's time.

Friday, May 1, 2015

Lucid sleep in the free desktop

For the past year I have been working on the kernel side to bring some ChromeOS features to upstream.

One of the areas I'm currently working on is what Google calls Lucid Sleep, which is basically the ability of performing work while the machine is in a low power state such as suspend. I'm writing this blog post because there has been interest on this in different communities and the discussion is currently a bit dispersed.

Small mobile devices have been able to do that since basically always and this feature brings it to bigger devices that traditionally have been either on or off. It's similar to what Microsoft calls InstantGo (previously Connected Standby).

A few examples of tasks that the system could perform while apparently sleeping are:
  • Checking if the battery level is so low that it would be better to completely power down the machine
  • Starting a network backup if the present connectivity allows it (a known access point may have become accessible)
  • Downloading email
  • Checking for new instant messages

With regards to functionality and leaving performance considerations aside, userspace could implement this without requiring any new support in the kernel as illustrated in this scenario:
  • We assume that a video is currently playing in YouTube
  • User closes the lid
  • PM daemon notifies userspace of an impending sleep
  • Browser pauses playback
  • Compositor switches off the screen
  • Kernel freezes userspace, suspends devices and puts the CPUs to idle
  • Time passes...
  • RTC alarm fires off
  • Kernel resumes devices and unfreezes userspace
  • Userspace realizes there hasn't been any user activity since it went to sleep last, so stays in "dark resume" mode
  • Userspace does any lucid tasks it wants, then goes back to sleep again
  • Kernel freezes userspace, suspends devices and puts the CPUs to idle
  • Time passes...
  • User opens lid
  • Kernel resumes devices and unfreezes userspace
  • PM daemon notices the SW_LID event, so notifies userspace that this is a full-on resume
  • Compositor switches screen on
  • Browser resumes playback

No changes needed in the kernel is always good news, but there's two issues.

Lost input events


Sometimes the event from the input device that woke the system up gets lost before it reaches userspace, so we don't know if we can stay dark and do our lucid stuff, or if the user expects the machine to power completely on.

This is in any case a bug, but if it needs to be fixed in the firmware, we may not be able to do much about it. At most we could get the kernel to synthesize an input event, but sometimes it may not have enough information to do so.


Performance


When the system wakes up, there tends to be a lot to do in the kernel and userspace, so it could take several seconds for the screen to come up from the moment the user opened the lid in the scenario presented above.

For ChromeOS this isn't acceptable so they are carrying some patches in their kernel that make some shortcuts possible (the screen is left on at suspend time, and the kernel knows at resume time whether it has to power it on based on which was the wakeup source, thus not having to wait for userspace).

Fortunately, there have been some changes recently in the kernel PM subsystem that can speed up resumes quite a bit and we can make use of them to offset the penalty of dropping those shortcuts.

The first is idling the CPUs instead of suspending to firmware, which on modern SoCs should be quite efficient and much faster, by a few tenths of seconds.

The other is to leave idle devices that are already in a low power state alone when suspending, which means that we don't have to wait for them to resume when the system wakes up. In every system I have seen there's always a few devices that take a long time to resume, so this can shave several tenths of seconds from the total resume time.

Both need some amount of support in either the platform or in device drivers, and that's what I'm currently working on for the Tegra-based Chromebooks.

Wednesday, August 13, 2014

Dynamic scaling of the memory bus


The problem


These days there's quite good support for CPU scaling in the mainline kernel, and many ARM SoCs are making use of it already. But in modern hardware with lots of very fast external memory, running the memory bus at its maximum frequency drastically reduces the amount of time that the device can run when on battery.

A problem that many teams are finding when trying to upstream their power management code is that there's currently no way for several clock consumers to influence the frequency of the memory bus. There has been a few tries to upstream the solutions currently in vendor trees, but so far no acceptable solution has been found.

I'm helping to upstream some of the stuff in the ChromeOS tree, and this issue is currently blocking very interesting work from reaching mainline.

The past


In the vendor tree for Tegra this is addressed by creating virtual clocks that are child of the clock that wants to be influenced. Depending on the type of the virtual clock, setting its rate will influence the rate of its parent clock by setting a floor or ceiling value.

In Qualcomm's vendor tree for the Snapdragon family of SoCs, the concept of a voter clock is introduced. Drivers can vote on the rate of a given clock by "voting" through a child clock, so not that different to how Tegra does it.

Both approaches have the critical disadvantage of adding clk instances for things that aren't real clocks, thus making the API considerably more confusing for relatively little gain.

Both vendor trees have additional API for registering bandwidth needs: tegra_isomgr and msm_bus_scale. They bear quite some resemblance with each other and with pm_qos_interface, but both are tightly tied to specificities of their platforms.

The discussion was brought back to life a couple of months ago when a patch was posted for allowing the tegra-drm driver to set the frequency rate of the external memory controller based on the amount of bandwidth that was needed by the display controller for refreshing the display. Of course, that patch was rejected because there are other components that need to have a say in the frequency rate of the memory bus.

But in that discussion some kind of plan took form and I have been working on making something from it that can be merged upstream.

A possible future


There's so far two main additions to existing frameworks, with the rationale being explained further below:
  • Add per-user floor and ceiling constraints to the Common Clock Framework, so drivers can set maximum and minimum frequency rates that the clock should respect. Patchset here.
  • Add a PM_QOS_MEMORY_BANDWIDTH class to pm_qos, for drivers to register their expected bandwidth needs. Patchset here.
The idea is for the following agents to be able to influence the current frequency of the memory bus:
  • Thermal: a cooling device would call clk_set_ceiling_rate to cap the memory bus to a frequency based on the current temperature.
  • Power: a battery driver would set a ceiling in the same way, based on the remaining capacity.
  • Devfreq: a devfreq driver wrapping a power management unit such as the ACTMON on Tegra or the PPMU on Exynos would set a floor frequency based on the current load stats.
  • Cpufreq: a cpufreq driver would set a floor frequency based on the current CPU frequency.
  • Devices that can anticipate how much memory bandwidth will need (such as the display controller, the camera, multimedia codecs, an ISP, USB, etc) would register their requirements in the PM_QOS_MEMORY_BANDWIDTH class. The EMC driver would be listening for notifications and setting a floor frequency based on the aggregated bandwidth that is needed.
The impression so far is that this approach matches the needs of the Tegra and Exynos SoCs, and people working on Rockchip upstreaming are evaluating it. Others working on other SoCs are very welcome to look at it and comment, so the result is also useful to them and they can improve their power management in mainline without having to refactor things later.