z0m3le You have mentioned a few times that Ampere FLOPS don't do half precision and thus Maxwell FLOPS punch above their weight a bit. I was wondering with this chart, can a game use the Tensor cores for bulk 16FP computations that Maxwell performs on the ALUs? That would mean that Ampere FLOPS would punch above their weight as well to some extent. Or are the Tensor cores much more restricted in the form of input they can handle such that they cannot in any way be used for the GPU tasks that Maxwell uses 16FP for?
I was wrong; the tensor cores do handle half precision. Someone who has looked into this more, like Thraktor, might be a better person to give correct numbers here, as I still don't have a tensor-core-powered piece of hardware to play around with, so I'm not as familiar with them.
It’s a little complex, as an Ampere SM has quite a bit more going on inside than the Maxwell SM in Switch.
For the Maxwell SMs on the TX1 in Switch, each SM can run, per clock, either:
128 FP32 operations OR 256 FP16 operations (or some INT operations; I'm not quite sure what the INT pipeline is like on Maxwell)
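As a rough sketch, the TX1's peak shader throughput follows directly from those per-clock figures. The 2 SMs and 768 MHz docked clock below are the commonly cited Switch specs (assumptions for illustration, not from this post), with an FMA counted as 2 FLOPs, as in the usual Tflops convention:

```python
# Peak shader throughput for the 2-SM Maxwell GPU (TX1) in Switch.
# Assumed: 2 SMs, 768 MHz docked clock; FMA counted as 2 FLOPs.
SMS = 2
CLOCK_GHZ = 0.768

def gflops(ops_per_clock_per_sm):
    # SMs * ops/clock * 2 FLOPs per FMA * clock in GHz -> GFLOPS
    return SMS * ops_per_clock_per_sm * 2 * CLOCK_GHZ

fp32 = gflops(128)  # ~393 GFLOPS FP32
fp16 = gflops(256)  # ~786 GFLOPS FP16 (same pipes, so it's either/or)
print(f"FP32: {fp32:.0f} GFLOPS or FP16: {fp16:.0f} GFLOPS")
```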
For Ampere, it's a bit more complicated, partly because the configuration differs between A100, the desktop cards, and Orin, and we don't know exactly which configuration Nintendo would use. For the desktop cards, each SM can run per clock:
64 FP32 ops
AND
64 FP32 ops OR 64 INT32 ops
AND
128 FP16 ops OR 512 FP16 tensor ops (or a variety of lower precision ops)
For Orin, as far as I can tell, each SM would run per clock:
64 FP32 ops
AND
64 FP32 ops OR 64 INT32 ops
AND
256 FP16 ops OR 1024 FP16 tensor ops (or a variety of lower precision ops)
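To line the three configurations up, here's a quick sketch tabulating the per-SM, per-clock numbers above (taken straight from this post, so treat them as estimates rather than confirmed specs). Note that all three peak at 128 FP32 ops per clock per SM, but on Ampere the FP16/tensor pipe runs concurrently with the FP32 pipes, whereas on Maxwell FP16 replaces FP32:

```python
# Per-SM, per-clock peak ops as laid out above. Key structural difference:
# Maxwell's FP32 and FP16 share one pipe (either/or), while an Ampere SM
# has separate pipes that can all issue in the same clock.
configs = {
    "Maxwell (TX1)":    {"fp32": 128, "fp32_or_int32": 0,  "fp16": 256, "fp16_tensor": 0},
    "Ampere (desktop)": {"fp32": 64,  "fp32_or_int32": 64, "fp16": 128, "fp16_tensor": 512},
    "Ampere (Orin)":    {"fp32": 64,  "fp32_or_int32": 64, "fp16": 256, "fp16_tensor": 1024},
}

# Peak FP32 per clock if the shared FP32/INT32 pipe runs FP32
max_fp32 = {name: c["fp32"] + c["fp32_or_int32"] for name, c in configs.items()}

for name, c in configs.items():
    print(f"{name}: {max_fp32[name]} FP32/clk, {c['fp16']} FP16/clk, "
          f"{c['fp16_tensor']} tensor FP16/clk")
```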
So if we assume that Nintendo's SoC also uses the Orin configuration, then a single SM could run both 128 FP32 and 256 FP16 ops per clock, rather than choosing between them as on the TX1. Looking at the 4 SM @1.1GHz config posted previously, it could in theory hit 1.1 Tflops of FP32 and 2.2 Tflops of FP16 simultaneously, whereas on TX1 a mixed-precision workload would have had to split the GPU, getting around 200 Gflops FP32 and 400 Gflops FP16.
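A quick back-of-the-envelope check of those numbers. The 4 SM @ 1.1 GHz configuration is hypothetical, as in the post, and the TX1 figures assume its 2 SMs at the commonly cited 768 MHz docked clock, with an FMA counted as 2 FLOPs:

```python
def tflops(sms, ops_per_clock_per_sm, clock_ghz):
    # ops/clock * 2 FLOPs per FMA, scaled to Tflops
    return sms * ops_per_clock_per_sm * 2 * clock_ghz / 1000.0

# Hypothetical Orin-style config: 4 SMs @ 1.1 GHz, FP32 and FP16 concurrent
ampere_fp32 = tflops(4, 128, 1.1)   # ~1.1 Tflops (64 + 64 FP32 pipes)
ampere_fp16 = tflops(4, 256, 1.1)   # ~2.2 Tflops, at the same time

# TX1: FP32 and FP16 share the pipes, so a mixed workload splits the GPU,
# e.g. one SM on each precision (2 SMs @ 768 MHz docked, assumed)
tx1_fp32 = tflops(1, 128, 0.768)    # ~0.2 Tflops FP32
tx1_fp16 = tflops(1, 256, 0.768)    # ~0.4 Tflops FP16
print(f"Ampere: {ampere_fp32:.2f} + {ampere_fp16:.2f}; "
      f"TX1 split: {tx1_fp32:.2f} + {tx1_fp16:.2f} (Tflops)")
```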
Of course the problem is the "in theory" part. If DLSS is running, then it's going to make pretty heavy use of the tensor cores, so the FP16 performance available to shaders would drop accordingly, which is also true for any other applications of the tensor cores. There's also going to be some amount of integer operations cutting into the FP32 performance. Finally, there could be many other bottlenecks limiting performance aside from floating point capability. In PC games we don't see a jump in performance in line with the raw Tflops numbers between Turing and Ampere, because raw floating point performance wasn't the only bottleneck; running on Ampere, a game might instead be limited by texture units, ROPs, bandwidth, local data storage, etc. In fact, on PC I'd say floating point performance is probably almost never the bottleneck with Ampere so far.
On a console like Switch, I would expect developers to work around those bottlenecks as best they can if they're putting a reasonable amount of work into optimising a game for the device, so they may choose approaches which require more floating point performance but less bandwidth (let's say), whereas on another system with lots of bandwidth they may choose a different approach. I'd argue that the Ampere SM gives a lot of scope for optimisation in a console setting, as there's a lot of floating point performance to work with and quite a bit of flexibility in how to use it, but there are still many other potential bottlenecks that developers would have to work around.