One of the kernel patches is now queued for the 6.7 release of the Linux kernel, and the other one has just been resent for review.
While waiting for feedback, I have started work on using the TP cores for tensor manipulation, which should be many times faster than the naive code I was running on the CPU for this.
I got some jobs producing the correct results, but the GPU hangs right afterwards. I have already gone through the whole set of data that is sent to the HW (unit configuration, command stream and registers), but haven't found the problem yet. Next I will improve the tooling around this to get a better view of the differences.
I hacked Mesa to use the out-of-tree driver and my code works that way, so the problem has to be somewhere in the kernel driver.
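Since the same userspace code works with one kernel driver and hangs with the other, one way to look at the differences is to compare, word by word, what each driver ends up sending to the HW. Below is a minimal sketch of that idea, not the tooling I actually use. It assumes each submission has been dumped as whitespace-separated 32-bit hex words; that dump format and the function names `parse_dump`, `diff_words` and `format_diff` are made up for illustration.

```python
from itertools import zip_longest


def parse_dump(text):
    """Parse a dump given as whitespace-separated 32-bit hex words.

    The format is an assumption made for this sketch, not the output
    of any particular tool.
    """
    return [int(token, 16) for token in text.split()]


def diff_words(good, bad):
    """Return (index, good_word, bad_word) for every differing position.

    If one dump is shorter than the other, the missing side is None.
    """
    diffs = []
    for index, (g, b) in enumerate(zip_longest(good, bad)):
        if g != b:
            diffs.append((index, g, b))
    return diffs


def format_diff(diffs):
    """Render the differences as human-readable lines."""
    def fmt(word):
        return "--------" if word is None else f"{word:08x}"

    return [
        f"word {index:4d} (byte offset 0x{index * 4:04x}): "
        f"{fmt(g)} != {fmt(b)}"
        for index, g, b in diffs
    ]


if __name__ == "__main__":
    # Placeholder values, not real command stream contents.
    working = parse_dump("08010e01 00000001 deadbeef")
    hanging = parse_dump("08010e01 00000003")
    for line in format_diff(diff_words(working, hanging)):
        print(line)
```

Reporting the byte offset next to the word index makes it easier to match a difference against a hexdump of the same buffer.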
Over the next few weeks I will keep incorporating feedback and look into how to fix the GPU hang on TP jobs.