Friday, June 28, 2024

Etnaviv NPU update 19: Ideas On Board sponsors support for the NXP i.MX 8M Plus SoC

Last week I started work on adding support to the Etnaviv driver for the NPU inside the NXP i.MX 8M Plus SoC (VeriSilicon's VIPNano-SI+).

This work is sponsored by the open source consultancy Ideas On Board, and will include the same level of support as for the Amlogic A311D SoC, which means full acceleration for the SSDLite MobileDet object detection model.


Right now all kinds of basic convolutions are supported, and work on strided convolutions is well under way.

For basic convolutions, most of the work was switching to a totally different way of encoding weights. At the low level, the weights are Huffman-encoded, with zero run-length encoding on top. This low-level encoding has already been reverse engineered and implemented by Philipp Zabel of Pengutronix, as mentioned in my previous update on the variant of this NPU shipped inside the Amlogic S905D3.
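To give an idea of what the zero run-length layer does, here is a minimal sketch in Python. It is not the hardware's actual bitstream format: the (zero_run, value) pair representation, the `max_run` limit and the function names are all assumptions made for illustration, and the Huffman stage is left out entirely.

```python
def zrl_encode(values, max_run=15):
    """Encode a list of ints as (zero_run, value) pairs.

    zero_run is the number of zeros that precede value.
    Hypothetical format, for illustration only.
    """
    pairs = []
    run = 0
    for v in values:
        if v == 0 and run < max_run:
            run += 1                  # keep extending the current zero run
        else:
            pairs.append((run, v))    # flush the run together with this value
            run = 0
    if run:
        # Trailing zeros: the last zero is emitted as the literal value.
        pairs.append((run - 1, 0))
    return pairs


def zrl_decode(pairs):
    """Inverse of zrl_encode."""
    values = []
    for run, v in pairs:
        values.extend([0] * run)
        values.append(v)
    return values


weights = [0, 0, 5, 0, 0, 0, 3, 0]
encoded = zrl_encode(weights)
decoded = zrl_decode(encoded)
```

Sparse weight tensors compress well under a scheme like this, because a long run of zeros collapses into a single pair.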

How the weights are laid out on top of that encoding is also different, so I had to reverse engineer that and implement it in the Mesa driver. That, plus some changes to how tiling is computed, got basic convolutions working, and then I moved on to strided convolutions. Pointwise convolutions got supported at the same time as basic convolutions, as they are not any different on this particular hardware.

Strided convolutions are still not natively supported by the hardware, so I reused the code that lowers them to basic convolutions. But the existing jobs that use the tensor manipulation cores to transform the input tensor for strides contained many assumptions that don't hold on this hardware.
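The general idea behind this kind of lowering can be shown with a small 1-D sketch. It assumes a single input channel and a stride of 2, and the function names are made up for this example; the real driver works on multi-dimensional tensors and its exact reshuffle may differ. The input is split into even and odd phases that become two input channels, the kernel taps are split the same way, and a stride-1 convolution over the reshuffled data gives the same result as the strided one.

```python
def conv1d_strided(x, w, stride):
    """Reference: valid 1-D convolution (correlation) with a stride."""
    n_out = (len(x) - len(w)) // stride + 1
    return [sum(w[k] * x[i * stride + k] for k in range(len(w)))
            for i in range(n_out)]


def conv1d_multichannel(xs, ws):
    """Stride-1 valid convolution, summed over input channels."""
    n_out = min(len(x) - len(w) + 1 for x, w in zip(xs, ws))
    return [sum(w[k] * x[i + k]
                for x, w in zip(xs, ws)
                for k in range(len(w)))
            for i in range(n_out)]


def lower_stride2(x, w):
    """Reshuffle input and weights so stride 2 becomes stride 1.

    Even-indexed and odd-indexed samples become two input channels,
    and the kernel taps are split in the same way.
    """
    x_phases = [x[0::2], x[1::2]]
    w_phases = [w[0::2], w[1::2]]
    return x_phases, w_phases


x = [1, 2, 3, 4, 5, 6, 7, 8]
w = [1, 0, -1]

reference = conv1d_strided(x, w, stride=2)
x_phases, w_phases = lower_stride2(x, w)
lowered = conv1d_multichannel(x_phases, w_phases)
```

In this sketch `lower_stride2` plays the role of the input reshuffle that the tensor manipulation cores perform, while the weight reshuffle can be done once when the weights are prepared.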

So I have been reverse engineering these differences, and now I have all kinds of strided convolutions supported up to 32 output channels. I expect the remaining cases to work once I address a couple of details about how the tensor reshuffle jobs are distributed among the available TP cores.

Afterwards I will look at depthwise convolutions, which may be supported natively by the hardware, while on the A311D these were lowered to basic convolutions.
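One way to lower a depthwise convolution to a basic one, sketched below, is to expand the per-channel kernels into a full set of weights that is zero wherever the output channel differs from the input channel. This is an illustration under assumptions, not necessarily what the driver does: the nested-list layout and the `depthwise_to_full` name are hypothetical, and in a quantized model the filler value would be the weight zero point rather than a literal 0.

```python
def depthwise_to_full(dw_weights, filler=0):
    """Expand depthwise kernels into regular convolution weights.

    dw_weights[c] is the list of kernel taps for channel c.
    The result is indexed as full[out_channel][in_channel][tap] and
    holds the filler value wherever out_channel != in_channel.
    """
    channels = len(dw_weights)
    taps = len(dw_weights[0])
    return [[list(dw_weights[o]) if o == i else [filler] * taps
             for i in range(channels)]
            for o in range(channels)]


dw = [[1, 2], [3, 4]]
full = depthwise_to_full(dw)
```

The expanded weights grow with the square of the channel count, which is one reason native depthwise support in the hardware would be attractive.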

Then on to tensor addition operations, and that should be all that is needed to get SSDLite MobileDet running, hopefully close to the performance of the closed source driver.

I'm very grateful to Ideas On Board for sponsoring this work, for their trust in me to get it done, and for their vision of a fully featured mainline platform that all companies can base their products on without being held captive by any single vendor.

I'm testing all this on a Verdin iMX8M Plus board that was kindly offered by Daniel Lang at Toradex, thanks!