Do we know why Ampere SMs are so much less efficient at attaining their theoretical peak performance numbers than the Turing SMs? Normally architectural efficiency improves the ratio between theoretical and actual performance. From what I'm reading, it seems they put double the number of cores in each SM. Is that the cause of the decreased efficiency?
It's important to note that when people talk about Gflops or Tflops, they're talking about them as a proxy for real-world performance (or at least they should be). The actual performance of a GPU (ie the speed at which it renders a given frame) is going to be limited by a number of potential bottlenecks, including floating point performance, integer performance, memory bandwidth, TMU performance, ROP performance, register file size, cache, etc. Even within the rendering of a given frame, those bottlenecks can change, so for example one millisecond it could be integer-limited, the next it could be ROP-limited. As such, the actual render time for a frame is going to depend on a number of different bottlenecks impacting different parts of the rendering pipeline.
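To make that "shifting bottlenecks" point concrete, here's a toy sketch in Python. All the numbers are entirely made up, and real GPUs overlap work across passes, so treat it purely as an illustration of the idea, not a model of any actual hardware:

```python
# Toy model: a frame is a sequence of passes, and each pass takes as long
# as its most contended resource, so different parts of the frame can
# have different bottlenecks.

# Work each pass demands from each resource (arbitrary, made-up units).
passes = [
    {"fp32": 4.0, "int32": 1.0, "bandwidth": 2.0, "rop": 0.5},  # geometry
    {"fp32": 2.0, "int32": 3.0, "bandwidth": 1.0, "rop": 0.5},  # shading
    {"fp32": 1.0, "int32": 0.5, "bandwidth": 3.0, "rop": 4.0},  # post/blend
]

# Throughput the GPU offers for each resource (units per millisecond).
gpu = {"fp32": 2.0, "int32": 1.0, "bandwidth": 1.5, "rop": 2.0}

def frame_time_ms(passes, gpu):
    # Each pass is limited by whichever resource takes longest for it;
    # the frame time is the sum over all passes.
    return sum(max(work / gpu[res] for res, work in p.items()) for p in passes)

print(f"frame time: {frame_time_ms(passes, gpu):.2f} ms")

# Doubling only the FP32 rate doesn't come close to halving the frame
# time, because only the FP-limited pass speeds up.
gpu_2x_fp = dict(gpu, fp32=4.0)
print(f"2x FP32 only: {frame_time_ms(passes, gpu_2x_fp):.2f} ms")
```

With these particular numbers the first pass is FP-limited, the second integer-limited and the third ROP-limited, so doubling flops alone only trims the frame time rather than halving it.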
So when I compare, say, a 1Tflop Turing GPU to a 2Tflop Turing GPU, I'm really comparing the first GPU to one where every bottleneck is doubled, because alongside those extra FP ALUs in the added SMs, there's twice the ROP performance, twice the TMU performance, etc. (Bandwidth isn't strictly linked to the number of SMs and clock like this, but you would expect any GPU designer to avoid a bandwidth-limited design if at all possible.) So the second GPU isn't twice as powerful as the first because it's capable of 2Tflops; it's twice as powerful because a whole range of bottlenecks have been doubled, just one of which is the floating point performance. We just use the floating point performance as a proxy because it's a relatively simple and straightforward measure. It's also generally a very big number (in the trillions these days), so it's a useful advertising tool for GPU makers.
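For reference, the headline flops figure itself is just arithmetic over the configuration: SMs × FP32 ALUs per SM × 2 (an FMA counts as two operations) × clock. A quick sketch, using the 64 FP32 ALUs per Turing SM; the SM counts and clock are illustrative picks of my own, not real parts:

```python
def theoretical_tflops(sms, fp32_alus_per_sm, clock_ghz):
    # An FMA (fused multiply-add) counts as two floating-point ops,
    # hence the factor of 2.
    return sms * fp32_alus_per_sm * 2 * clock_ghz / 1000.0

# Hypothetical "1Tflop" vs "2Tflop" Turing parts: the second simply has
# twice the SMs, which also brings twice the TMUs, schedulers, etc.
print(theoretical_tflops(sms=8,  fp32_alus_per_sm=64, clock_ghz=1.0))   # ~1.02
print(theoretical_tflops(sms=16, fp32_alus_per_sm=64, clock_ghz=1.0))   # ~2.05
```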
The issue with using flops as a measure of relative performance is that it only really makes sense when comparing two GPUs on the same architecture. If you compare a 1Tflop Maxwell GPU to a 1Tflop Turing GPU, they're not going to have the same level of performance, because the new architecture includes changes to many of those other bottlenecks which haven't scaled linearly alongside the floating point performance. We would only expect the two GPUs to have the same performance in a game which is entirely floating-point bottlenecked 100% of the time, which is never the case in the real world.
What makes this even less useful for Ampere is the way they changed the shader ALUs in the Ampere SMs over Turing. In a Turing SM there was one bank of floating point ALUs ("cores") and one bank of integer ALUs which could operate simultaneously. The integer ALUs weren't being very heavily utilised, so Nvidia changed the bank of integer ALUs to one which could either operate on integers or floating point numbers. So, assuming no integer workloads, an Ampere SM could operate at double the theoretical floating-point performance of a Turing SM. However, this was a relatively isolated change, and Nvidia didn't double everything else in the Ampere architecture alongside this. What this means is that you can't make any meaningful comparisons of floating point performance between Turing and Ampere. In fact, it's likely that Ampere will basically never be floating-point limited, which means using Tflops as a measure of an Ampere GPU's performance at all is rather pointless.
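To put rough numbers on that, here's a simplified steady-state model of one SM's FP32 rate as a function of the integer-to-FP instruction ratio. The model is my own simplification (it assumes every lane can always be kept fed), and the 0.36 ratio is the roughly 36 integer instructions per 100 FP instructions Nvidia quoted for typical games when introducing Turing:

```python
def fp32_ops_per_clock(arch, int_per_fp):
    # Steady-state FP32 ops per clock for one SM, given the ratio of
    # integer to floating-point instructions in the shader workload.
    if arch == "turing":
        # 64 FP32 lanes plus a separate 64-wide INT32 bank: integer work
        # runs alongside FP32 and costs no FP throughput (up to a 1:1 mix).
        return 64.0
    if arch == "ampere":
        # 64 dedicated FP32 lanes plus a 64-wide bank that runs either
        # INT32 or FP32. The shared bank covers the integer work first,
        # so solve fp = 64 + (64 - fp * int_per_fp) for fp.
        return 128.0 / (1.0 + int_per_fp)
    raise ValueError(arch)

for ratio in (0.0, 0.36, 1.0):
    print(f"int:fp {ratio:.2f} -> Turing {fp32_ops_per_clock('turing', ratio):3.0f}, "
          f"Ampere {fp32_ops_per_clock('ampere', ratio):3.0f}")
```

With no integer work Ampere hits its doubled paper figure (128 vs 64), but under a Turing-like instruction mix it lands closer to ~1.5x per SM, which is exactly why comparing paper Tflops across the two architectures is so misleading.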
This is why I'm talking about SMs and clock speeds rather than "cores" and Gflops/Tflops when I talk about any potential Ampere-powered future Nintendo device. An Ampere GPU with a given number of SMs at a given clock should outperform a Turing GPU with a similar configuration, but the comparison gives you a much better idea of the real-world performance than comparing theoretical Tflop counts. When I suggested that a 6 SM Ampere GPU at 1.3GHz in docked mode would be needed to facilitate DLSS at 4K/60, saying "a 2Tflop Ampere GPU" instead would have given a very unrealistic expectation of performance compared to any existing non-Ampere GPU. In reality it's a GPU that's 50% "bigger" than the one in the original Switch, running on a newer architecture and at a higher clock speed. Certainly a decent jump in performance over the original model, but nothing as crazy as 2 Tflops would imply if that were the only thing you were looking at.
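For what it's worth, here's the arithmetic behind why I avoid quoting that configuration in Tflops (same formula as above; the "dedicated lanes only" line is my own rough lower bound rather than anything official):

```python
# 6 Ampere SMs at 1.3GHz, counting all 128 FP32 lanes per SM, with an
# FMA counted as two operations:
print(6 * 128 * 2 * 1.3e9 / 1e12)  # ~2.0 "paper" Tflops
# Counting only the 64 dedicated FP32 lanes per SM, i.e. assuming the
# shared bank is fully occupied with integer work:
print(6 * 64 * 2 * 1.3e9 / 1e12)   # ~1.0 Tflops
```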
Right. It seems prone to problems, unless it were treated as purely an SSD cache.
I think it's super interesting that Microsoft has basically created the market for CFExpress Type B cards. It makes it a lot easier for Nintendo to maybe do the same.
I wonder if Orin has PCIe lanes. If it does, then we might be seeing an on-board NVMe SSD as the main storage, with a CFExpress Type B slot for expanded storage. Even CFExpress Type A would probably be sufficient, especially if it were PCIe 4.0.
Whatever they do, the price target really shouldn't be more than $0.22/GB.
I wonder if camera enthusiasts will shuck the Xbox ones to use in their cameras if they're 1/2 to 1/3 the price of the ones made for cameras.
I don't think the Xbox expansion cards are literally CFExpress Type B cards. They may use the same interface (which is basically just NVMe over PCIe), but the physical size seems different, the pin layout could be different, etc. In theory you may be able to make an adaptor, but keep in mind that the cameras currently supporting CFExpress Type B are professional cameras for the most part, and if a card isn't 100% guaranteed to work with your camera, it's basically useless to a professional. A day of work lost because of a card failure is worth a lot more than the difference in price between this and a "real" CFExpress card, which is one of the reasons CFExpress cards are still so expensive: they're genuinely worth that much to a professional.
What I think this does indicate, though, is the huge difference in price between memory cards aimed at a small professional market and those aimed at a mass-market consumer audience. The Xbox card made by Seagate is basically exactly the same thing as a CFExpress Type B card (actually faster than the ones currently on the market), but is priced far lower because they're expecting to sell to a very large group of people who are very price-conscious. The irritating thing for photographers (and Xbox owners, to be honest) is that they didn't just put an actual CFExpress Type B slot on there and partner with Seagate to bring out a branded 1TB CFExpress card at launch for an aggressive price point (ie exactly the same thing in a different case). This would have given photographers a much better-priced option for their cameras, and in the long term would have resulted in competition for the Xbox end of the market from Sandisk, Samsung, etc., who could have undercut the official card on price and/or offered different capacities.
That's basically what I'd like to see happen with the Switch 2 and CFExpress Type A. The PCIe 4 version will allow speeds up to 1.7GB/s, in a form factor smaller than standard SD cards (actually about the same size and shape as Switch game cards, as it happens). There may be a variety of mid and high-end cameras out there with CFExpress Type A by that time, but cards would likely still be relatively expensive, as they'd be serving a smaller, less price-conscious audience. Nintendo themselves would be big enough to force a lot more competition into the market just by releasing a device which supports the card format. They could partner with Samsung or Sandisk or whoever and make sure there's a range of competitively priced (and Nintendo-branded) cards out by launch, and then rely on competition to keep prices and capacities going in the right direction over the life of the device.