This week started quite fruitfully; these features were added:
- Convolutions with multiple input and output channels (input and output feature maps)
- "Same" padding in convolutions
With these in place, we should have all the features we need to run a model such as MobileNet v1 and get some performance numbers to guide the next steps.
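For reference, the two features above can be sketched in plain NumPy. This is not the delegate's actual implementation, just a minimal illustration of a stride-1 convolution over multiple input and output channels with "same" (zero) padding, which keeps the output's spatial size equal to the input's:

```python
import numpy as np

def conv2d_same(x, w):
    """Direct convolution, stride 1, "same" zero padding.

    x: input feature maps, shape (H, W, Cin)
    w: weights, shape (KH, KW, Cin, Cout)
    Returns output feature maps, shape (H, W, Cout).
    """
    H, W, Cin = x.shape
    KH, KW, _, Cout = w.shape
    # "Same" padding for odd kernel sizes: half the kernel on each side,
    # so the output keeps the input's height and width.
    ph, pw = KH // 2, KW // 2
    xp = np.pad(x, ((ph, ph), (pw, pw), (0, 0)))
    out = np.zeros((H, W, Cout))
    for i in range(H):
        for j in range(W):
            patch = xp[i:i + KH, j:j + KW, :]  # (KH, KW, Cin)
            # Contract over the kernel window and all input channels,
            # producing one value per output channel.
            out[i, j] = np.tensordot(patch, w, axes=([0, 1, 2], [0, 1, 2]))
    return out
```

A 32-feature-map first layer, as mentioned below, would simply be a weight tensor with `Cout == 32` here. Real implementations tile and vectorize this loop nest, and even kernel sizes need asymmetric padding, but the arithmetic is the same.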
One more roadblock
The only problem is that the NPU hangs when I try to use the 8th core... and that core is required to run most detection models, as they start by convolving the input to 32 feature maps.
I have checked that we are sending bit-identical command streams and input buffers to the kernel, so I suspect the problem lies somewhere in the kernel itself.
So I plan to instrument the out-of-tree kernel driver and get some register and command stream dumps, in the hope that there is some bit in a magic register somewhere that I need to flip.
Want to try it out?
I'm not really looking forward to that kind of work, so I decided to first invest some time in cleaning things up a bit, to make it easier for other people to play with this if they wish.
I have removed from my branch everything from my previous attempt at using OpenCL and have written some documentation about how to run the TensorFlow Lite delegate:
You will need a VIM3 board, a recent mainline kernel and a Debian testing rootfs.