Progress
For the last couple of weeks I have kept chipping at a new userspace driver for the NPU in the Rockchip RK3588 SoC.
I am very happy to report that the work has gone really smooth and I reached my first milestone: running the MobileNetV1 model with all convolutions accelerated by the NPU.
And it not only runs flawlessly, but at the same performance level as the blob.
It has been great having access to the register list as disclosed by Rockchip in their TRM, and to the NVDLA and ONNC documentation and source code. This has allowed for the work to proceed at a pace several times faster than with my previous driver for the VeriSilicon NPU, for which a lot of painstaking reverse engineering had to be done.
by Julien Langlois CC BY-SA 3.0 |
tomeu@arm-64:~/mesa$ TEFLON_DEBUG=verbose python3.10 classification.py -i hens.jpg -m mobilenet_v1_1.0_224_quant.tflite -l labels_mobilenet_quant_v1_224.txt -e libteflon.so
Loading external delegate from libteflon.so with args: {}
Teflon delegate: loaded rknpu driver
teflon: compiling graph: 89 tensors 27 operations
...
teflon: compiled graph, took 413 ms
teflon: invoked graph, took 11 ms
teflon: invoked graph, took 11 ms
teflon: invoked graph, took 11 ms
teflon: invoked graph, took 10 ms
teflon: invoked graph, took 10 ms
0.984314: hen
0.019608: cock
0.000000: toilet tissue
0.000000: sea cucumber
0.000000: wood rabbit
time: 10.776ms
Notice how nothing in the invocation refers to the specific driver that TensorFlow Lite is using, that is completely abstracted by Mesa. Once all these bits are upstream and packaged by distros, one will be able to just download a model in INT8 quantization format and get accelerated inferences going fast irrespective of the hardware.
Thanks to TL Lim of PINE64 for sending me a QuartzPro64 board for me to hack on.
Next steps
I want to go back and get my last work on performance for the VeriSilicon driver upstreamed, so it is packaged in distros sooner rather than later.
After that, I'm a bit torned between working further on the userspace driver and implementing more operations and control flow, or start writing a kernel driver for mainline.