Progress
For the last couple of weeks I have kept chipping away at a new userspace driver for the NPU in the Rockchip RK3588 SoC.
I am very happy to report that the work has gone really smoothly and that I have reached my first milestone: running the MobileNetV1 model with all convolutions accelerated by the NPU.
And it not only runs flawlessly, but also at the same performance level as the blob.
It has been great having access to the register list as disclosed by Rockchip in their TRM, and to the NVDLA and ONNC documentation and source code. This has allowed the work to proceed several times faster than on my previous driver for the VeriSilicon NPU, which required a lot of painstaking reverse engineering.
[Photo of hens by Julien Langlois, CC BY-SA 3.0]
tomeu@arm-64:~/mesa$ TEFLON_DEBUG=verbose python3.10 classification.py -i hens.jpg -m mobilenet_v1_1.0_224_quant.tflite -l labels_mobilenet_quant_v1_224.txt -e libteflon.so
Loading external delegate from libteflon.so with args: {}
Teflon delegate: loaded rknpu driver
teflon: compiling graph: 89 tensors 27 operations
...
teflon: compiled graph, took 413 ms
teflon: invoked graph, took 11 ms
teflon: invoked graph, took 11 ms
teflon: invoked graph, took 11 ms
teflon: invoked graph, took 10 ms
teflon: invoked graph, took 10 ms
0.984314: hen
0.019608: cock
0.000000: toilet tissue
0.000000: sea cucumber
0.000000: wood rabbit
time: 10.776ms
Notice how nothing in the invocation refers to the specific driver that TensorFlow Lite is using; that is completely abstracted away by Mesa. Once all these bits are upstream and packaged by distros, one will be able to just download a model in INT8 quantization format and quickly get accelerated inference, irrespective of the hardware.
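For those wanting to try something similar, here is a minimal sketch of what such an invocation looks like from Python using the tflite_runtime API. The model and delegate file names are taken from the invocation above; everything else (the zero-filled input standing in for the decoded hens.jpg, the final print) is illustrative rather than the exact script:

import numpy as np
from tflite_runtime.interpreter import Interpreter, load_delegate

# Load Teflon as an external delegate; Mesa picks the right hardware
# driver underneath, so nothing below is specific to the RK3588.
delegate = load_delegate('libteflon.so')

interpreter = Interpreter(
    model_path='mobilenet_v1_1.0_224_quant.tflite',
    experimental_delegates=[delegate])
interpreter.allocate_tensors()

# MobileNetV1 quant expects a 1x224x224x3 uint8 tensor; a zero image
# stands in here for a real decoded photo.
inp = interpreter.get_input_details()[0]
interpreter.set_tensor(inp['index'], np.zeros(inp['shape'], dtype=np.uint8))
interpreter.invoke()

out = interpreter.get_output_details()[0]
scores = interpreter.get_tensor(out['index'])[0]
print(scores.argmax(), scores.max())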
Thanks to TL Lim of PINE64 for sending me a QuartzPro64 board to hack on.
Next steps
I want to go back and get my latest performance work on the VeriSilicon driver upstreamed, so it is packaged in distros sooner rather than later.
After that, I'm a bit torn between working further on the userspace driver to implement more operations and control flow, or starting to write a kernel driver for mainline.
Unbelievable how fast you got this done! Thank you for your work and for making it open source so we can all use it.
Wow, things are moving really fast! As someone who has been trying to patch the proprietary Rockchip NPU kernel driver for mainline Linux, thanks for your work on the open source driver! Excited about your upcoming development releases.
Read about your work (this project) on Phoronix. Super dope work, my man. Much appreciated.
Very great stuff here. When you run your mobilenet example, is this targeted through Vulkan/OpenCL?

> Very great stuff here. When you run your mobilenet example, is this targeted through Vulkan/OpenCL?

No, my drivers go straight from the TFLite representation of the network to programming the hardware.
Is there a README on reproducing this? I'm interested in getting it working with other frameworks.
That's a great idea, Jorge; I must confess it didn't occur to me. I guess that, besides USB, if one is able to design their own board, it could be made to interface with the host computer via PCIe, like Coral does.

Though I would probably choose a SoC that gives the most TOPS per $. The RK3588 has way more powerful CPUs than would be needed to just drive the NPU and interface with the host.

I'm seeing the RK3588 retail from 40 USD, and I guess an older SoC with less powerful CPUs, such as Amlogic's A311D, should cost quite a bit less. I guess there are other SoCs (or even MCUs) out there with NPUs similarly powerful to the Coral Edge that are much cheaper.

Maybe somebody who designs boards would see an opportunity here?
I have been searching for alternatives to the Coral. It may be a better fit to use OCuLink, Thunderbolt, or USB 4 instead. As far as hardware goes, Radxa has a new AI module coming out, but it has a strange 141p b2b interface. 4x PCIe in there somewhere though.
What would be the advantage of focusing on the userspace driver over working on a mainline kernel driver? Speed of development?
Awesome work by the way!

> What would be the advantage of focusing on the userspace driver over working on a mainline kernel driver? Speed of development?

The most important reason, IMHO, is that I need to do as much as possible to get the UABI right on the first try. And to verify that the UABI is right, I would like to have all the functionality exercised by the userspace driver.
I think the only thing missing at the moment is using multiple cores; once I have the userspace driver doing that, I can move on to the kernel driver.