I'm not a game programmer, but I do develop GPU-accelerated scientific models, so I have a few years' experience optimizing for different GPU architectures. I have a sense of what makes something fast or slow, as I've seen it play out on various hardware.
The notion of a 'sustained TF rate' doesn't really make sense. The TF number is a peak theoretical figure that will in all likelihood never actually be hit on either console (possibly occasionally for a few milliseconds at a time, but generally they won't be computing at that rate). It's not a benchmark; it's just a way of folding component counts and clocks into a single number, much like the way a business estimates the number of 'man hours' a task will take. The actual computational throughput will be determined by things like thread occupancy and how much shared and local memory each thread requires. Since both consoles use the same architecture, in theory this affects both equally, but it's not nonsense to suggest the higher clock rate gives the PS5 a bit of help here: it can work through local data (the data sitting in the GPU cache) and shuffle it out for the next piece of data it needs more readily. It's an effective bandwidth increase on the RAM->cache->computation pipeline.
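To make that concrete: the TF figure really is just multiplication. A quick sketch using the publicly stated CU counts and clocks for the two consoles (64 ALUs per CU and 2 ops per clock for a fused multiply-add are the standard assumptions baked into the marketing number):

```python
# Peak TFLOPS is pure arithmetic: units x lanes x ops-per-clock x clock.
# 2 ops per clock assumes every ALU issues a fused multiply-add (FMA)
# every single cycle, which is exactly the "never actually hit" ideal case.
def peak_tflops(cus, clock_ghz, alus_per_cu=64, ops_per_clock=2):
    return cus * alus_per_cu * ops_per_clock * clock_ghz / 1000.0

xsx = peak_tflops(52, 1.825)  # ~12.15 TF
ps5 = peak_tflops(36, 2.23)   # ~10.28 TF
```

Nothing in that formula knows whether the ALUs actually have data to chew on, which is the whole point below.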
More technical version:
The biggest factor in speeding up a GPU is how much you can saturate the compute units (those CUs we keep hearing about). If you can get a concurrent thread on each ALU within each CU, with minimal or no need to reach back to VRAM mid-kernel to swap data in and out of the CU cache, then you can get pretty close to your peak throughput. In practice this is not common, as the available local storage within a CU is tiny: a few tens of KB shared between all the ALUs. For example, in the Nvidia Volta architecture (which I'm most familiar with), there is a single 256KB block of memory (arranged in 32-bit registers) for every thread running on that SM to use for data exclusive to that thread. In a perfect world, every one of the 64 CUDA cores in a single Volta SM would have its own thread, meaning each one gets ~4KB to store useful data. (There is 96KB of shared memory as well, but I'll ignore this for the moment; it's extremely useful but somewhat immaterial for this explanation.)

This is not generally practical, in my experience, so you're left with two options, not mutually exclusive: you can reduce the number of concurrent threads, or you can periodically swap data in and out of registers by calling back to VRAM. The former is what Cerny was alluding to when he said it's harder to fill more CUs than fewer, although I somewhat disagree with his characterization in the case where all CUs on both platforms have access to the same relative register and shared memory. I'm not familiar with RDNA2, though, so I don't want to comment on that too much. In the latter case, which is almost always necessary to some degree, you can think of an analogy to screen tearing as to what happens under the hood.
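The per-thread register budget is simple division. A sketch of the numbers above (the 2048-thread figure is Volta's hardware maximum for resident threads per SM, which I'm adding here for contrast; everything else is from the text):

```python
# Volta SM register budget: a 256 KB register file shared by
# however many threads are resident on that SM at once.
REGISTER_FILE_KB = 256
CUDA_CORES_PER_SM = 64
MAX_THREADS_PER_SM = 2048  # Volta's limit on resident threads per SM

def regs_per_thread_kb(resident_threads):
    return REGISTER_FILE_KB / resident_threads

one_per_core = regs_per_thread_kb(CUDA_CORES_PER_SM)      # 4.0 KB each
fully_occupied = regs_per_thread_kb(MAX_THREADS_PER_SM)   # 0.125 KB = 32 registers
```

Run the occupancy up to hide latency and the per-thread budget collapses from 4KB to 32 registers; that tension is exactly the tradeoff described above.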
Several threads are going about their business making computations, and thread A says 'oh, I need something from VRAM'. Thread A then gets paused, and thread B gets moved into its place to keep computing while the data for thread A is fetched. Meanwhile, thread B finishes its work, and thread A either is ready to keep going or isn't; this is determined by the latency of access to VRAM. If thread A isn't ready, most likely another thread gets moved into thread B's place and keeps going. If thread A is ready, it gets shuffled back in to pick up the computation where it left off with its new data. The analogy to screen tearing is this: if you have ever played with v-sync on a 60Hz monitor, and then on a 144Hz monitor, you've probably noticed that screen tearing is far less noticeable than at the slower refresh rate. That's because the gap between getting data and being able to use it is smaller. A similar analogy holds with clock speeds in a GPU: a faster clock will generally lead to less 'down time' in any given ALU, as it is more likely to be ready to go sooner when the requisite data is available.
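The refresh-rate analogy can be sketched as a toy model: data comes back from VRAM at some arbitrary instant, but a stalled thread can only resume on a clock edge, so on average it sits idle for about half a cycle. The clocks below are the public console figures; the arrival times are made up, and the model is a deliberate oversimplification (real warp/wavefront schedulers are far more sophisticated), just to show that a shorter cycle means a shorter average gap:

```python
import math
import random

def resume_gap_ns(data_ready_ns, clock_ghz):
    """Dead time between data arriving and the next clock edge,
    assuming a stalled thread can only resume on an edge."""
    period_ns = 1.0 / clock_ghz
    next_edge = math.ceil(data_ready_ns / period_ns) * period_ns
    return next_edge - data_ready_ns

# Average the gap over many random arrival instants; it hovers around
# half a clock period, so the faster clock wastes less time per stall.
random.seed(0)  # made-up arrival times, fixed for repeatability
arrivals = [random.uniform(0.0, 100.0) for _ in range(10_000)]
avg_xsx = sum(resume_gap_ns(t, 1.825) for t in arrivals) / len(arrivals)
avg_ps5 = sum(resume_gap_ns(t, 2.23) for t in arrivals) / len(arrivals)
```

Roughly 0.27ns of average dead time per stall at 1.825GHz versus roughly 0.22ns at 2.23GHz: tiny per event, but it compounds across millions of stalls per frame.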
What I want to point out is that NONE of this shows up in a TF metric. The underlying reality of swapping data in and out, and all the bottlenecks and tradeoffs that come with it, is presumed to essentially not exist when discussing TF. Yet this is one of the biggest considerations when optimizing, because you have to account for these facts of life about threads 'stalling', so to speak.
Will this make the PS5 faster in computations than the XSX? In a few cases, possibly, but in general, no. However, it does mean the story of what that gap is isn't as simple as many here are claiming. I expect the PS5 may generally run at a slightly lower resolution (some quick calculations put 3504x1971 at roughly a 17% reduction in pixel count), but in many cases I think the gap will be closer than you'd expect from a raw TF count, because the higher clock speed helps 'in the real world' in a somewhat non-linear fashion compared to raw TF numbers: it shrinks the penalty of moving data in and out of local storage. It's not a HUGE difference (at least not in most cases), but it's not nothing.
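For what it's worth, the pixel-count arithmetic checks out, assuming a 3840x2160 baseline against the author's speculative 3504x1971 target:

```python
# Pixel counts for the two render targets from the text.
full = 3840 * 2160      # 8,294,400 pixels (native 4K)
reduced = 3504 * 1971   # 6,906,384 pixels (speculated PS5 target)
reduction = 1 - reduced / full  # ~0.167, i.e. about a 17% cut
```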
This was a good read, thanks for that.
It's just sad that posts like these get buried beneath the huge pile that is console warring, drive-by snark and concern trolling.