Found the problem with enabling the 8th NN core
Though I don't yet know exactly what the problem is, I found that going back to a previous brute-force approach to powering up the NPU makes the 8th core work just fine.
For now this unblocks the work and gets me closer to the initial goal of running a MobileNetv1 inference and seeing what the performance is like, so I'm leaving a proper fix for this for later.
I bet there's either a register write happening in the wrong order, or a delay between register writes that is too short. I will have to delve into the power-domain subsystem and/or the common clock framework in the Linux kernel to fix this one.
Added support for depthwise convolutions
MobileNetV1 introduced depthwise separable convolutions (see the linked paper for an in-depth description): layers that combine a depthwise convolution, which processes each channel separately, with a pointwise convolution that joins the channels back together. This gives very similar results with many times fewer multiplications (8-9x for the 3x3 kernels MobileNetV1 uses), so it's very attractive for mobile use-cases.
This hardware doesn't support depthwise convolutions directly, but we can lower them to regular convolutions after modifying the weight tensor to cover each IFM/depth separately.
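The lowering can be sketched in NumPy. This is purely an illustration of the transformation, not the driver's actual code: the depthwise weights become a regular-convolution weight tensor that only connects each input channel to its matching output channel, with zeros everywhere else.

```python
import numpy as np

def conv2d(x, w):
    """Naive 'valid' convolution. x: (H, W, Cin), w: (Kh, Kw, Cin, Cout)."""
    kh, kw, _, cout = w.shape
    out = np.zeros((x.shape[0] - kh + 1, x.shape[1] - kw + 1, cout))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # contract the (Kh, Kw, Cin) patch against the weights
            out[i, j] = np.tensordot(x[i:i+kh, j:j+kw, :], w, axes=3)
    return out

def depthwise_conv2d(x, w):
    """Depthwise convolution: each channel gets its own (Kh, Kw) kernel."""
    kh, kw, c = w.shape
    out = np.zeros((x.shape[0] - kh + 1, x.shape[1] - kw + 1, c))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i+kh, j:j+kw, :] * w, axis=(0, 1))
    return out

def lower_depthwise(w_dw):
    """Expand (Kh, Kw, C) depthwise weights into (Kh, Kw, C, C) regular-conv
    weights that only connect input channel c to output channel c."""
    kh, kw, c = w_dw.shape
    w = np.zeros((kh, kw, c, c))
    for ch in range(c):
        w[:, :, ch, ch] = w_dw[:, :, ch]
    return w
```

The zeros in the expanded tensor are wasted work for the MACs, which is part of why the coefficient-compression work mentioned further down is interesting.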
Added support for pointwise convolutions
For the second half of a depthwise separable convolution, I just had to take into account that 1x1 kernels are packed in a different format in memory; otherwise it would be very inefficient for each NN core to pull each 1-byte kernel separately from the memory bus.
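As a reminder of what a pointwise convolution actually computes (this is a conceptual sketch, not the driver's code): since the kernels are 1x1, there is no spatial window at all, and the whole layer collapses into a single matrix multiply over the channel dimension, applied independently at every pixel. Each kernel is just one byte per input/output channel pair, hence the packing concern.

```python
import numpy as np

def pointwise_conv2d(x, w):
    """A 1x1 convolution is a per-pixel matrix multiply over channels.
    x: (H, W, Cin), w: (Cin, Cout)."""
    h, wid, cin = x.shape
    return (x.reshape(-1, cin) @ w).reshape(h, wid, -1)
```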
Added support for unsigned weights
TensorFlow Lite has moved to a new quantization specification that prefers signed weights, as symmetric quantization is simpler to implement. Unfortunately for us, our hardware works natively with unsigned weights, so we would need to convert them if we were to use TFLite's new quantization.
But the models that Google themselves publish were made with the ancient tooling that still supports the old, unsigned quantization scheme, so I had to find a way of producing models with unsigned quantization for our test suite, to match what MobileNetV1 does.
That also implied moving to per-tensor quantization, instead of per-axis.
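The relationship between the two representations is simple: a quantized value stands for real = scale * (q - zero_point), so int8 weights can be turned into equivalent uint8 weights by offsetting both the values and the zero point by 128. A sketch of the idea (my illustration, not TFLite's or the delegate's code):

```python
import numpy as np

def signed_to_unsigned(q_s, zp_s):
    """Shift int8 quantized values into uint8 by offsetting both the
    values and the zero point by 128. Because real = scale * (q - zp),
    the represented real numbers are unchanged."""
    q_u = (q_s.astype(np.int16) + 128).astype(np.uint8)
    return q_u, zp_s + 128
```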
Added support for higher IFMs and OFMs (up to 256 each)
In the previous update I explained how support for multiple input and output channels (or feature maps) was added, but I wasn't able to test with more than 7 output channels because the 8th NN core was MIA.
With that solved, I was able to see what would be needed for convolutions with higher channel counts, such as those that MobileNetV1 uses (32, 64, 128, 256, 512 and 1024).
Each level implied revisiting the tiled format in which weights and biases are laid out in memory, making it more and more complex.
I got to 256, with 512 and 1024 bringing more changes in the tiled format that I still need to reverse engineer.
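To give a flavour of what "tiling" means here without claiming anything about the actual Vivante layout (which is exactly what still needs reverse engineering), here is a toy scheme that deals output channels round-robin across the 8 NN cores, so each core can stream its own contiguous slice of the coefficients:

```python
import numpy as np

NUM_CORES = 8  # NN cores in the VIP Nano

def tile_ofms(w, num_cores=NUM_CORES):
    """Toy layout, for illustration only: core k gets output channels
    k, k + 8, k + 16, ...  The real hardware format is more involved
    and changes with the channel count."""
    assert w.shape[-1] % num_cores == 0
    return [np.ascontiguousarray(w[..., k::num_cores])
            for k in range(num_cores)]

def untile_ofms(tiles):
    """Reassemble the per-core slices back into one weight tensor."""
    num_cores = len(tiles)
    cout = num_cores * tiles[0].shape[-1]
    w = np.zeros(tiles[0].shape[:-1] + (cout,), dtype=tiles[0].dtype)
    for k, t in enumerate(tiles):
        w[..., k::num_cores] = t
    return w
```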
Model partition compilation and resource management
I'm facing problems with test coverage: we support so many different parameters that need to be tested in combination that the number of individual tests explodes. Because of the hacky current state of the TFLite delegate (and Gallium state tracker), I'm not able to run all the tests: I don't have proper resource management implemented yet, so we reach OOM before the end.
So my next task after I get back from Embedded Recipes will be to refactor the delegate implementation so we have a proper compilation of the model partitions. These will own the weight+bias buffers as well as the intermediate tensors, with each inference just feeding an input tensor to the partition and retrieving an output tensor at the end.
This will allow me to scale up the automated testing further, so I can keep adding new features with confidence, knowing that I'm not adding regressions.
Move development to Cottonwood A311D board
Da Xue of LibreComputer has got Etnaviv and Teflon working on the new boards that his company is releasing soon. One of them contains an A311D SoC, the same as the VIM3 I'm currently using for development. I will initially target that one, and later make sure it also works on the Cottonwood boards that will have the S905D3 SoC, which has a VIP Pico instead of a VIP Nano.
Besides being in general a great FOSS champion and specifically being supportive of ML inference with open source, Da is directly sponsoring this work, so I look forward to meeting him in Paris this week and exchanging notes.
Bigger coefficient tensors
The last known features missing before being able to run MobileNetV1 are support for IFM and OFM counts of 512 and 1024.
Hopefully it will only require some further tweaking of the tiled memory representation of the coefficient buffer.
Medium term goals
I don't expect performance to be that great yet, so I plan to switch my focus to it once the above has been accomplished. I expect the features below to make the most impact in improving performance:
- Avoid copies in and out of the model partition, by mapping user buffers to the NPU
- Use the TP units for tensor manipulation (transposing, mostly)
- Properly configure the automatic caching of kernels and images in the internal on-chip SRAM
- Use the external SRAM for intermediate tensor data
- Chain all TP and NN jobs in a model partition in the same command stream
- Enable zero-run-length compression in the coefficient buffer
- Tune the tiling parameters for reduced memory bandwidth usage
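On that last-but-one item: zero-run-length compression is conceptually simple, even though the exact bitstream the hardware expects is its own thing. Weight tensors (especially the block-diagonal ones produced by depthwise lowering) are full of zeros, so each value is stored along with the count of zeros preceding it. A sketch of the general idea, not the hardware format:

```python
def zrl_encode(values):
    """Encode a stream as (zero_run, value) pairs: each stored value is
    preceded by the count of zeros that came before it."""
    pairs, run = [], 0
    for v in values:
        if v == 0:
            run += 1
        else:
            pairs.append((run, v))
            run = 0
    if run:  # trailing zeros: emit run - 1 zeros plus a literal zero
        pairs.append((run - 1, 0))
    return pairs

def zrl_decode(pairs):
    """Expand the (zero_run, value) pairs back into the flat stream."""
    out = []
    for run, v in pairs:
        out.extend([0] * run)
        out.append(v)
    return out
```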