Since the last update I finally got the whole of MobileNet V1 running at full accuracy on the NPU with Mesa:
    tomeu@arm-64:~/mesa$ python3.10 classification.py -i grace_hopper.bmp -m mobilenet_v1_1.0_224_quant.tflite -l labels_mobilenet_quant_v1_224.txt -e libteflon.so
    Loading external delegate from libteflon.so with args: {}
    Processing the input took 18 ms.
    Running the NN job took 13 ms.
    Processing the output took 1 ms.
    0.866667: military uniform
    0.031373: Windsor tie
    0.015686: mortarboard
    0.007843: bow tie
    0.007843: academic gown
    time: 33.094ms
That is around 3 times faster than running the same inference on the CPU cores of the A311D SoC.
Most of the time (18 ms) is spent in my naive manipulation of the input tensor, transposing and reshuffling it to match what the HW expects. Once we learn to do these operations on the 4 tensor manipulation cores, this time should drop to close to zero.
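As a rough illustration of the kind of CPU-side shuffling involved (the layout the NPU actually expects is not shown here; the transpose, channel padding and tile size below are made up for the example):

    import numpy as np

    def reorder_input_for_hw(input_nhwc, tile=4):
        # Illustrative only: transpose NHWC to NCHW and pad the channel
        # dimension to a multiple of `tile`. The layout the NPU really
        # wants may differ; this just shows why a naive CPU pass costs
        # milliseconds per inference.
        n, h, w, c = input_nhwc.shape
        nchw = np.transpose(input_nhwc, (0, 3, 1, 2))
        padded_c = (c + tile - 1) // tile * tile
        out = np.zeros((n, padded_c, h, w), dtype=input_nhwc.dtype)
        out[:, :c] = nchw
        return out

    # The 224x224 RGB input of the quantized MobileNet V1 model
    x = np.zeros((1, 224, 224, 3), dtype=np.uint8)
    print(reorder_input_for_hw(x).shape)  # (1, 4, 224, 224)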
The 13 ms that the convolutions take in the NPU is still noticeably higher than the 8 ms that the blob achieves, but the optimizations mentioned in previous updates in this blog should bring us pretty close.
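For anyone who wants to try this out, the delegate is loaded through TensorFlow Lite's standard external-delegate mechanism. A minimal sketch of what a script like classification.py does, assuming the tflite_runtime and Pillow packages (the real script also handles timing and label lookup), could look like this:

    import numpy as np
    from PIL import Image
    from tflite_runtime.interpreter import Interpreter, load_delegate

    # Load the Teflon delegate, just like passing -e libteflon.so above.
    delegate = load_delegate("libteflon.so")
    interpreter = Interpreter(model_path="mobilenet_v1_1.0_224_quant.tflite",
                              experimental_delegates=[delegate])
    interpreter.allocate_tensors()

    # The quantized model takes a 1x224x224x3 uint8 tensor.
    image = Image.open("grace_hopper.bmp").convert("RGB").resize((224, 224))
    input_index = interpreter.get_input_details()[0]["index"]
    interpreter.set_tensor(input_index,
                           np.expand_dims(np.asarray(image, dtype=np.uint8), 0))

    interpreter.invoke()

    # Print the top 5 classes; the uint8 scores map to [0, 1] when divided by 255.
    output = interpreter.get_tensor(interpreter.get_output_details()[0]["index"])[0]
    for i in np.argsort(output)[-5:][::-1]:
        print(f"{output[i] / 255.0:.6f}: class {i}")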
Next steps
Now that we have something that people can use in their products, I will switch to upstreaming mode.
I want to do a few cleanups to the Mesa code and then I will ask people to review and ack it so it can be merged. In the meantime, the draft merge request can be found here.
I would also like to have a CI job running to make sure it doesn't regress. But given that we don't use NIR yet and the dependencies on the rest of Mesa are minimal, there is probably little need for it as long as I'm the only person contributing to the code.
Last week I was a bit distracted by the trip to Paris for the Embedded Recipes conference, but afterwards I found some time for hacking and got some interesting results out of it.
Refactored the Gallium front-end
As mentioned in the previous update, I had hit some limits in my testing due to the naive way the front-end was scheduling jobs to the hardware-dependent Gallium driver.
I basically rewrote it (removing any C++ remnants along the way) and moved to a model in which the drivers compile the operation blocks they support to a format that can be quickly sent to the hardware.
As a side effect, I got proper memory management of the workload, which allowed me to expand the testing I can do in a reasonable amount of time.
I also took the chance to rewrite the higher-level scheduling data structure so that all jobs in the same model partition are sent to the hardware in a single batch, reducing latency.
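The real code is C inside Mesa, but the model is simple enough to sketch in Python: supported operation blocks are compiled once at delegate-preparation time, and at inference time all compiled jobs of a partition are submitted in one batch instead of one by one. All names below are made up for illustration:

    class CompiledJob:
        # Stands in for an operation block already lowered to a
        # hardware-ready command stream (not the actual Mesa types).
        def __init__(self, name, commands):
            self.name = name
            self.commands = commands

    def prepare_partition(ops, compile_op):
        # Runs once, when the delegate prepares the graph: each supported
        # operation block is compiled to something the hardware consumes directly.
        return [CompiledJob(op, compile_op(op)) for op in ops]

    def run_partition(jobs, submit_batch):
        # Runs per inference: all jobs of the partition go out in a single
        # batch, instead of one submission (and one wait) per job.
        submit_batch([job.commands for job in jobs])

    # Toy usage
    jobs = prepare_partition(["conv2d", "depthwise_conv2d", "add"],
                             compile_op=lambda op: f"commands({op})")
    run_partition(jobs, submit_batch=lambda batch: print("submit", batch))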
Unfortunately, I didn't get to remove the copies of the input and output tensors because the TensorFlow Lite API for this (TfLiteAsyncKernel) is undocumented and far from trivial. They seem to just be adding stuff on top to abstract whatever the Android folks may end up wanting to do.
Got MobileNet V1 to run
As part of the refactoring above, I got multiple operations in the same model working, which got us to correctly running some inferences, even if at low accuracy:
    Loading external delegate from build/src/gallium/targets/teflon/libteflon.so with args: {}
    tflite_plugin_create_delegate
    Teflon delegate: loaded etnaviv driver
    INFO: Initialized TensorFlow Lite runtime.
    PrepareDelegate
    VERBOSE: Replacing 27 out of 31 node(s) with delegate (Teflon Delegate) node, yielding 2 partitions for the whole graph.
    0.960784: hen
    0.015686: cock
    0.007843: goose
    0.003922: Pembroke
    0.003922: Ibizan hound
    time: 22.802ms
    tflite_plugin_destroy_delegate
This matched the output from the blob bit by bit, even though I was still doing some tensor operations by hand on the CPU, which also makes it run far too slowly. We should be able to get that down to around 5 ms once we learn how to drive the TP units for tensor manipulation.
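Since the model is fully quantized, the comparison against the blob can be an exact one rather than within a tolerance. Assuming the output tensors of both runs are dumped to raw files (the file names below are invented), the check is short:

    import numpy as np

    # Hypothetical raw dumps of the output tensor from both runs.
    teflon_out = np.fromfile("output_teflon.bin", dtype=np.uint8)
    blob_out = np.fromfile("output_blob.bin", dtype=np.uint8)

    # Fully quantized model: outputs should be identical, not merely close.
    np.testing.assert_array_equal(teflon_out, blob_out)
    print("outputs match bit by bit")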
Presented this work at Embedded Recipes 2023
Tired of only writing about all this in this blog, I took the chance given to me by Kevin Hilman to present it in front of a captive audience.
You can find the slides here, and listen to the talk at:
Next steps
The previous update went into more depth about what is left to do in the medium term, so I will just mention what I plan to do in the immediate future:
Get input and output channels working at the 512 level, so we can run a higher accuracy version of the MobileNet V1 network
Learn to use the TP units to remove those costly transpositions and reshuffles on the CPU (at this point, we would have something useful to people in the field)