In the last update I explained how compression of zero weights gave our driver such a big performance improvement.
Since then, I have explored further what could take us closer to the performance of the proprietary driver and saw the opportunity to gather some of the proverbial low-hanging fruit.
TL;DR
Our driver's performance on SSD MobileDet went from 32.7 ms to 24.8 ms, against the proprietary driver's 19.5 ms.
On MobileNetV1, our driver went from 9.9 ms to 6.6 ms, against the proprietary driver's 5.5 ms. Pretty close!
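To put those numbers in perspective, here is a small Python sketch that turns the latencies above into ratios. The dictionary layout and the `summarize` helper are only names made up for this illustration; the figures themselves are the ones quoted above.

```python
# Latencies in milliseconds, copied from the figures quoted above.
results = {
    "SSD MobileDet": {"before": 32.7, "after": 24.8, "proprietary": 19.5},
    "MobileNetV1": {"before": 9.9, "after": 6.6, "proprietary": 5.5},
}


def summarize(before, after, proprietary):
    """Return (speedup over our previous result, latency relative to proprietary)."""
    return before / after, after / proprietary


for model, ms in results.items():
    speedup, gap = summarize(ms["before"], ms["after"], ms["proprietary"])
    print(f"{model}: {speedup:.2f}x faster than before, "
          f"{gap:.2f}x the proprietary driver's latency")
```

This works out to roughly 1.32x faster on SSD MobileDet (now at about 1.27x the proprietary latency) and 1.50x faster on MobileNetV1 (now at about 1.20x).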
Enable more convolutions
Our driver was rejecting convolutions whose number of output channels is not divisible by the number of convolution cores in the NPU, because at the start of development the code that lays the weights out in memory didn't support that. That caused TensorFlow Lite to run those convolutions on the CPU, and some of them were big enough to take a few milliseconds, several times longer than on the NPU.
When implementing support for bigger kernels I had to improve the tiling of the convolutions, and that included adding support for these other convolutions. So just by removing the rejection, we got a nice speedup on SSD MobileDet: from 32.7 ms to 27 ms!
That didn't help on MobileNetV1 because that one has all its convolutions with neat numbers of output channels.
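As a rough illustration of the check that was removed, the Python sketch below contrasts the old divisibility rule with one possible way of spreading an uneven number of output channels across cores. This is not the driver's code: the function names, the core count of 8, and the "first cores get one extra channel" policy are all assumptions made for the example.

```python
def old_check_accepts(output_channels, cores):
    """The early rule described above: only evenly divisible channel counts."""
    return output_channels % cores == 0


def split_output_channels(output_channels, cores):
    """Hand out output channels to cores as evenly as possible.

    Assumed policy: the first `output_channels % cores` cores each take
    one extra channel. The real weight layout may distribute them differently.
    """
    base, extra = divmod(output_channels, cores)
    return [base + 1 if core < extra else base for core in range(cores)]


CORES = 8  # hypothetical core count, for illustration only

for channels in (64, 100):
    print(channels,
          old_check_accepts(channels, CORES),
          split_output_channels(channels, CORES))
```

With these assumed values, 64 channels pass the old check and split into eight groups of 8, while 100 channels would have been rejected and sent to the CPU, even though they can be split into four groups of 13 and four of 12.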
Caching of the input tensor
So far we were only caching the kernels in the on-chip SRAM. I spent some time looking at how the proprietary driver sets the various caching fields and found a way of getting us to cache a portion of the input tensor in the remaining internal SRAM.
That got us the rest of the performance improvement mentioned above, but I am having trouble with some combinations of parameters when the input tensor caching is enabled, so I need to get to the bottom of it before I submit it for review.
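The budgeting idea can be sketched in a few lines of Python: kernels get SRAM first, and whatever is left over holds as much of the input tensor as fits. This is only a conceptual model under assumed numbers. The `plan_sram` helper and the byte sizes are hypothetical, and it says nothing about how the hardware's caching fields are actually programmed.

```python
def plan_sram(sram_bytes, kernel_bytes, input_bytes):
    """Split on-chip SRAM between the kernel cache and the input tensor cache.

    Kernels are placed first; the input tensor gets whatever remains,
    so only a portion of it may end up cached.
    Returns (kernel_cached_bytes, input_cached_bytes).
    """
    kernel_cached = min(kernel_bytes, sram_bytes)
    remaining = sram_bytes - kernel_cached
    input_cached = min(input_bytes, remaining)
    return kernel_cached, input_cached


# Hypothetical sizes, chosen only to show a partially cached input tensor.
kernel_cached, input_cached = plan_sram(
    sram_bytes=256 * 1024,
    kernel_bytes=96 * 1024,
    input_bytes=300 * 1024,
)
print(f"kernels cached: {kernel_cached} bytes, input cached: {input_cached} bytes")
```

In this made-up case the kernels take 96 KiB and the remaining 160 KiB cache a bit more than half of the 300 KiB input tensor.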
Next steps
At this point I am pretty confident that we can get quite close to the performance of the proprietary driver without much additional work, as a few major performance features remain to be implemented, and I know that I still need to take a pass at tuning some of the previous performance work.
But after getting the input tensor caching finished and before I move to any other improvements, I think I will invest some time in adding some profiling facilities so I can better direct the efforts and get the best returns.