|"Don't cross the streams. It would be bad."
A big part of what I have been up to in the past two weeks has been a serious refactoring of the data structures that hold the model data in the different phases until the HW configurations is generated.
What we had was enough for models with trivial control flow such as MobileNetV1, but more recent models for object classification and detection make use of more operations and those are linked between each other non-sequentially.
The image below shows six of the more than a hundred operations in the SSDLite MobileDet model:
|A small subsection of SSDLite MobileDet
The adds will be "lowered" or converted to a special case of convolution in which the two input tensors are concatenated together as two channels of a single tensor, and the last convolution in the fragment will need to have its input tensor processed to remove the stride as the HW doesn't support those natively. The processing of this tensor will be performed in an additional job that will run in the TP (tensor processing) cores in the NPU.
As you can probably imagine, the modifications to the operation graph will be far from trivial without the right data structures, so I looked at ways of refactoring the code that translates the model as given by TensorFlow Lite to the HW operations.
For now I have settled into having a separate data structure for the tensors, and having the operations refer to its input and output tensors from the indices in that list. In the future, I think we should move to intermediate representations more akin to what is used in compilers, to support more complex lowerings of operations and reorganizations of the operations inside the model.
I will be thinking about this later next year, once I get object detection with SSDLite MobileDet running at a useful performance level. Ideally I would like to reuse NIR so drivers can do all the lowerings and optimizations they need without having to reinvent so much of a IR, but if it turns out that operations on tensors aren't a good fit for NIR, then I will be thinking of doing something similar just for it.
For NPUs with programmable cores it could be very interesting to have a pipeline of transformations that can go from very high level operations to GPGPU instructions, probably starting from a standard such as MLIR.
Also put some time in putting together all the information I gathered about how the proprietary driver interacts with the HW when submitting tensor addition jobs, and spent a substantial amount of time looking at the different parameter combinations in a spreadsheet, with liberal use of CORREL() to get a hint of what parameters of the high-level operations are used as inputs in the formulas that produce the HW configuration.
Lowering the strides
Similarly to the above, there was a lot of staring to a spreadsheet for the parameters of the TP jobs that transform the input tensor of a convolution with stride different than one.
Status and next steps
Below is a rendering of the whole operation graph for the SSDLite MobileDet model, so people can get an idea of the dimensions and complexity of a modern model for edge object detection.
The model is currently running without anything exploding too badly, and all the convolutions are running correctly when run independently. But when run together, I see some bad results starting to flow around the middle of the graph, so that is what I will be debugging next.
|The whole of SSDLite MobileDet