
Mr.Black

Member
Apr 28, 2018
114
Finland
I'm not a game programmer, but I do develop GPU-accelerated scientific models, so I have a few years' experience optimizing different GPU architectures to accelerate computation. So I have a sense of what makes something fast or slow, as I've seen it play out on various hardware.

The notion of a 'sustained TF rate' doesn't really make sense. The TF number is a peak theoretical number that will in all likelihood never actually be hit on either console (possibly occasionally for a few milliseconds at a time, but generally they won't be computing at that rate). It's not a benchmark, it's just a way of counting components and clocks within a single number. It's actually a lot like the way a business counts the number of 'man hours' it'll take to perform a task. The actual computational throughput will be determined by things like thread occupancy and how much shared and local memory each thread requires. Since both consoles use the same architecture, this in theory affects both equally, but it's not nonsense to suggest the higher clock rate gives the PS5 a bit of help here, as it's able to use local data (the data sitting in the GPU cache) and shuffle it out for the next piece of data it needs more readily. It is an effective bandwidth increase on the RAM->cache->computation pipeline.
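To make the 'counting components and clocks' point concrete, here's a quick sketch (Python) of where the headline numbers come from. The CU counts and clocks are the publicly quoted figures; the 64 ALUs per CU and 2 FLOPs per FMA are the standard RDNA bookkeeping assumptions, nothing more:

```python
# Peak TF is just bookkeeping: ALU count x 2 FLOPs per FMA x clock.
# The CU counts and clocks below are the publicly quoted figures, treated as
# inputs; this is not a benchmark of anything.

def peak_tflops(cus: int, clock_ghz: float, alus_per_cu: int = 64) -> float:
    """Theoretical FP32 peak: every ALU retiring one fused multiply-add per cycle."""
    return cus * alus_per_cu * 2 * clock_ghz / 1000.0  # GFLOPS -> TFLOPS

print(f"XSX: {peak_tflops(52, 1.825):.2f} TF")  # ~12.15 TF
print(f"PS5: {peak_tflops(36, 2.23):.2f} TF")   # ~10.28 TF
```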

More technical version:
The biggest factor in speeding up a GPU is how much you can saturate the compute units (those CUs that we keep hearing about). If you can get a concurrent thread on each ALU within each CU, and have minimal or no need to reach back to VRAM within a kernel call to swap data in and out of the CU cache, then you can get pretty close to your peak throughput. This is, in practice, not common, as the available local storage within the CU is tiny, a few tens of KB shared between all the ALUs. For example, in the Nvidia Volta architecture (which I'm most familiar with), there is a single 256KB block of memory (arranged in 32-bit registers) for every thread running on that SM to use for data exclusive to that thread. In a perfect world, every one of the 64 CUDA cores in a single Volta SM would have its own thread, meaning each one gets ~4KB per thread to store useful data. (There is 96KB of shared memory as well, but I'll ignore this for the moment. It's extremely useful but somewhat immaterial for this explanation.) This is not generally practical, in my experience, so you're left with two options, not mutually exclusive: you can reduce the number of concurrent threads, or you can periodically swap data in and out of registers by calling back to VRAM. The former is what Cerny was alluding to when he said it's hard to fill more CUs than fewer, although I somewhat disagree with his characterization in the case where all CUs on both platforms have access to the same relative register and shared memory. I'm not familiar with RDNA2, though, so I don't want to comment on that too much. In the latter case, which is almost always necessary to some degree, you can think of an analogy to screen tearing as to what happens under the hood.
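To put rough numbers on that register-budget trade-off, here's a back-of-envelope sketch (Python) using the simplified Volta figures above (256KB register file, one thread per CUDA core as the ideal case). It's only an illustration of why higher occupancy shrinks per-thread storage, not a real occupancy calculator:

```python
# A fixed register file divided among however many threads you keep resident.
# Real GPUs allocate registers in quanta and schedule in warps; this ignores
# all of that and just shows the basic trade-off.

REGISTER_FILE_KB = 256  # per-SM register file in the simplified Volta example

for resident_threads in (64, 128, 256):
    per_thread_kb = REGISTER_FILE_KB / resident_threads
    print(f"{resident_threads:4d} resident threads -> {per_thread_kb:.1f} KB of registers each")
# 64 threads (one per CUDA core) -> 4.0 KB each; keep more threads resident to
# hide latency and each one gets less private storage.
```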

Several threads are going about their business, making computations, and thread A says 'oh, I need something from VRAM'. Thread A then gets paused, and thread B gets moved into its place to keep computing while the data for thread A is fetched. Meanwhile, thread B finishes its work, and thread A is either ready to keep going or isn't. This is determined by the latency of access to VRAM. Now if thread A isn't ready, most likely another thread gets moved into thread B's place and keeps going. If thread A is ready, then it'll get shuffled back in to pick up the computation where it left off with its new data. The analogy to screen tearing is this: if you have ever played without v-sync on a 60 Hz monitor, and then on a 144 Hz monitor, you've probably noticed that screen tearing is far less noticeable than on the slower refresh. This is because the gap between getting data and being able to use it is smaller. A similar analogy holds with clock speeds in a GPU. A faster clock speed will generally lead to less 'down time' in any given ALU, as it is more likely to be ready to go sooner when the requisite data is available.
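If it helps, here's a toy model (Python) of that thread-swapping: one ALU, a handful of resident threads, each alternating fixed-length compute bursts with a fixed wall-clock memory fetch. The burst length, fetch latency, and switching policy are all made up, and this is nowhere near a real warp scheduler; it just shows that the ALU sits idle unless enough threads are resident, which is exactly the kind of thing a peak TF figure never captures:

```python
# Toy latency-hiding model: not a real scheduler, just an illustration.

def alu_busy_fraction(threads: int, clock_ghz: float,
                      burst_cycles: int = 100, fetch_ns: float = 400.0,
                      bursts_per_thread: int = 4) -> float:
    burst_ns = burst_cycles / clock_ghz            # wall-clock length of one compute burst
    ready_at = [0.0] * threads                     # when each thread's data arrives
    bursts_left = [bursts_per_thread] * threads
    t = busy = 0.0
    while any(bursts_left):
        runnable = [i for i in range(threads) if bursts_left[i] and ready_at[i] <= t]
        if runnable:                               # run whichever thread has its data
            i = runnable[0]
            t += burst_ns
            busy += burst_ns
            bursts_left[i] -= 1
            ready_at[i] = t + fetch_ns             # it immediately issues its next fetch
        else:                                      # nothing is ready: the ALU idles
            t = min(ready_at[i] for i in range(threads) if bursts_left[i])
    return busy / t

for n in (2, 4, 8):
    print(f"{n} resident threads: ALU busy {alu_busy_fraction(n, 2.23):.0%} of the time")
```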

What I want to point out is that NONE of this shows up in a TF metric. The underlying reality of swapping data in and out, the various bottlenecks, tradeoffs, etc.: all of that is presumed to essentially not exist when discussing TF. However, this is one of the biggest considerations when doing optimization, as you have to take into account these facts of life about having threads 'stall', so to speak.

Will this make the PS5 faster in computations than the XSX? In a few cases, possibly, but in general, no, it won't. However, it does mean that the story of what that gap is isn't as simple as many here are claiming. I expect that the PS5 may generally run at a slightly lower resolution (some quick calculations would put a resolution of 3504x1971 at a hair more than a 16% reduction in pixel count), but in many cases I think the gap will be closer than you'd expect from a raw TF count, because the higher clock speed does help 'in the real world' in a somewhat non-linear fashion compared to raw TF numbers, in that it makes the penalty of moving data in and out of local storage smaller. It's not a HUGE difference (at least not in most cases), but it's not nothing.
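For anyone who wants to check the resolution figure, the arithmetic is below (assuming the comparison is against a native 3840x2160 target, which the numbers imply rather than state); the raw peak TF gap is shown alongside for reference:

```python
# Quick arithmetic check; 3840x2160 as the baseline is an assumption on my part.
native = 3840 * 2160
scaled = 3504 * 1971
print(f"pixel count reduction: {1 - scaled / native:.1%}")  # ~16.7%
print(f"raw peak TF gap:       {1 - 10.28 / 12.15:.1%}")    # ~15.4%
```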

This was a good read, thanks for that.

It's just sad that posts like these get buried beneath the huge pile that is console warring, drive-by snark and concern trolling.
 

True_fan

Banned
Mar 19, 2020
391
Articles like this would not be necessary if there wasn't such a massive defense of a poor initial presentation by Sony. We have supposedly neutral journalists and gaming fans moving goalposts and changing the importance of teraflops. MSFT screwed up with the X1 and they had to eat it; they were crucified for "secret sauce". The PS5 is not some revolutionary system in comparison to the Series X, it's inferior in most areas. Sony is being vague and getting a pass from most of the media.
 

Deleted member 22750

Oct 28, 2017
13,267
When you are talking about specifications, you have to be, you know, specific. Saying it runs most of the time at this clock but it is variable is not specific. "Typical" and "most of the time" are not specifics, nobody gives specifications like that. You have to give the range of frequency which is the standard practice for every piece of component being sold out there. The fact that he didn't is very telling.

The presentation obviously went through PR before going public, and it is apparent that Sony thinks of it as a weakness and doesn't want it to be seen as below 10TF. That is what would happen if they gave the actual frequency range. The thing is, it doesn't really matter in practice if it is, say, 9.5TF compared with 10.2; it is just that a single-digit number looks bad. It is purely marketing and it is really silly. Think $9.99 versus $11.

What I actually have issues with is Cerny calling out wider GPUs in his presentation as being difficult to work with. I get his point in trying to pitch his own design, but c'mon. Every high-end GPU design goes wide instead of higher clocked because GPUs are parallel processors and vector operations are easily parallelized; that is the entire point of a GPU design. To imply Microsoft is going to have difficulty extracting performance from their 52 CUs, just to make your console look closer, is nonsense. That is all.
Agree
 

Betelgeuse

Member
Nov 2, 2017
2,941
NAVI 10 lite, and people already picked up that Cerny never once mentioned variable rate shading, which is associated with RDNA 2. If you don't mention that in a developer briefing, then when?
Matt says they both have VRS:
www.resetera.com

PS5 and Xbox Series speculation |OT11| Cry Havoc and Let Slip the Dogs of War [NEW NEWS, NEW THREAD - CHECK OUT THE STAFF POST]

Uh that sounds like the plan? Reads to me like "see you at E3". They specifically saved stuff like the price and release date which they'll announce at E3 obviously
 

nib95

Contains No Misinformation on Philly Cheesesteaks
Banned
Oct 28, 2017
18,498
2. Cerny also stated that upping clocks would work wonders and allow for the GPU to punch above its weight. We know from RX 5700 that upping clocks does not lead to a commensurate return in gains. What are the chances that the same is true in RDNA2?

Do we?

Here's an article where they overclocked their 5700 from the default 1,850 MHz to 2,005 MHz, a clock frequency increase of 8.4%.

This was their gaming performance increase across different titles.

3D Mark: Increase in GPU score of 12.5%
Assassin's Creed Odyssey: 12% fps increase
Far Cry New Dawn: 10.1% fps increase
F1 2019: 27.4% fps increase
Metro Exodus: 11.8% fps increase
Shadow of the Tomb Raider: 10% fps increase.

These results show that the gaming performance percentage increase actually exceeds the overclock percentage increase, which discredits your notion and gives further credence to why Sony chose a higher clock speed. Remember, the GPU in the PS5 is clocked 22% faster at peak than the Series X's (2.23 GHz vs 1.825 GHz). We don't really know how RDNA2 will fare.

www.pcgamesn.com

AMD hid the true power of the Radeon RX 5700… here’s how to unlock it

Unlocking the frequency and power shackles on the second-string Navi GPU makes it a fantastic card

Further to that, a 5700 at a core clock speed of 1,920 MHz gets a Fire Strike GPU score of 25,205, whilst a 5700 with a core clock speed of 2,194 MHz (closer to the PS5's speed) gets a GPU score of 28,806.

Meaning, a clock speed increase of 14.3% leads to a Fire Strike GPU benchmark score increase of 14.3%. Obviously this isn't a gaming test, but it is interesting nonetheless and shows an almost perfect correlation between increases in clock frequency and actual benchmark (not TFLOP) results.
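For anyone who wants to double-check me, the percentages above come straight from the quoted figures; a quick sanity check in Python, nothing more:

```python
# Recomputes the percentages quoted in this post from the figures given above.
def pct_increase(new: float, old: float) -> float:
    return (new / old - 1) * 100

print(f"5700 overclock:        {pct_increase(2005, 1850):.1f}%")   # ~8.4% clock increase
print(f"Fire Strike clocks:    {pct_increase(2194, 1920):.1f}%")   # ~14.3% clock increase
print(f"Fire Strike GPU score: {pct_increase(28806, 25205):.1f}%") # ~14.3% score increase
print(f"PS5 vs XSX peak clock: {pct_increase(2230, 1825):.1f}%")   # ~22.2%
```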

www.3dmark.com

I scored 23 590 in Fire Strike

AMD Ryzen 5 3600, AMD Radeon RX 5700 x 1, 16384 MB, 64-bit Windows 10

www.3dmark.com

I scored 23 737 in Fire Strike

Intel Core i7-6700K Processor, AMD Radeon RX 5700 x 1, 16384 MB, 64-bit Windows 10

On a side note, a 5700 at near-PS5 clock speeds actually gets a similar GPU score to a 2080 Super, which gets 29.3k, and that's a 2,055 MHz clocked GPU at 12.6 Turing TFLOPS that costs a whopping £600-£700 alone.

www.3dmark.com

I scored 25 400 in Fire Strike

AMD Ryzen 7 3800X, NVIDIA GeForce RTX 2080 SUPER x 1, 16384 MB, 64-bit Windows 10
 
Last edited:

asmith906

Member
Oct 27, 2017
27,354
If Microsoft had a 14 teraflop machine, don't you think they'd be screaming that from the heavens?
 

LebGuns

Member
Oct 25, 2017
4,127
Am I missing something? I thought all GPUs work on a variable clock. Like... my current desktop GPU isn't going full blast while I'm posting/browsing here.
That's what I thought as well. NX Gamer had a good video about this too, showing that as standard on PC, the CPU and GPU work at variable frequencies.
 

mutantmagnet

Member
Oct 28, 2017
12,401
Articles like this would not be necessary if there wasn't such a massive defense of a poor initial presentation by Sony. We have supposedly neutral journalists and gaming fans moving goalposts and changing the importance of teraflops. MSFT screwed up with the X1 and they had to eat it; they were crucified for "secret sauce". The PS5 is not some revolutionary system in comparison to the Series X, it's inferior in most areas. Sony is being vague and getting a pass from most of the media.
It isn't so much that they are getting a pass as that they are showing their own ignorance as tech journalists. The WCCF article is good, as are the Digital Foundry videos, but so far I've been disappointed by the other tech websites that have commented on this.

Very few people in tech media are really putting in the effort to be well versed in the technologies they are reporting on.
 

FancyPants

Banned
Nov 1, 2017
707
Honestly, all of these numerical differences probably won't mean much for multiplat titles, as the lowest common denominator will be the target.

There hasn't been a single console generation that didn't take advantage of the more powerful console. Games on the XOX in many cases look significantly better than PS4 Pro games.
 

Phellps

Member
Oct 25, 2017
10,799
Wow, people are really desperate to keep the narrative that the PS5 is supposedly that much weaker based on numbers alone.
If spec numbers were all there was to hardware performance, people wouldn't read GPU reviews before buying anything. Whenever someone claims that we will only know the whole story once these consoles are out, people just brush them aside.
And then we got people here questioning the neutrality of a journalist that made his name by revealing stuff publishers would rather keep under wraps, as if now he's just a Sony shill.
Why is it so important to people to validate their console of choice?
This is so exhausting; there isn't a single thread about next gen that isn't riddled with fanboys going in circles about the same thing.
 

radiotoxic

Member
Oct 27, 2017
1,019
And then we got people here questioning the neutrality of a journalist that made his name by revealing stuff publishers would rather keep under wraps, as if now he's just a Sony shill.
Please understand, no one is a bigger Sony shill than the guy that made everyone realize Naughty Dog (wow! is that a Sony studio?!) is one of the worst places to work for in the entire gaming industry (if you are a human being, of course).
 

Shpeshal Nick

Banned
Oct 25, 2017
7,856
Melbourne, Australia
12 TFLOPS is not an average performance figure either, though. By your logic, Microsoft is misleading people too. In reality the TFLOP figure is peak performance only, not average. In fact, neither console will have its clock frequencies at their maximum simultaneously very often at all.

You got a source for this? Because Microsoft mentioned nothing about their clocks being variable. 52 CUs at 1,850 MHz (or whatever the number was) is a constant.
 

mordecaii83

Avenger
Oct 28, 2017
6,854
Do we?

Here's an article where they overclocked their 5700 from the default 1,850 MHz to 2,005 MHz, a clock frequency increase of 8.4%. …
I fully expect your post to be ignored by people with an agenda to push, even though you just 100% refuted the claim that clock speed gains are lower than expected.
 

Lady Gaia

Member
Oct 27, 2017
2,476
Seattle
You got a source for this? Because Microsoft mentioned nothing about their clocks being variable. 52 CUs at 1,850 MHz (or whatever the number was) is a constant.

Teraflops were always a theoretical peak performance figure, which is why anyone who actually understands the underlying technology has been saying over and over for ages that it's a terrible measure of performance. You don't actually get 12.1 trillion floating point operations per second out of a Series X GPU, because not every instruction is a FMAD, and not all the operands are available without stalling computation pipelines waiting for cache or RAM fetches.

So yes, the theoretical peak is constant. The actual throughput in practice is not, so the "constantness" of the theoretical peak is not really all that relevant.
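One way to picture that: achieved throughput is the peak figure scaled by how often the ALUs actually issue and by how much of the instruction mix is FMAs. The fractions in the sketch below are invented purely for illustration; they are not measurements of either console:

```python
# Illustrative only: the issue rate and instruction mix are assumed numbers.
peak_tf = 12.15                 # peak assumes 2 FLOPs (one FMA) per ALU per cycle
issue_fraction = 0.7            # share of cycles the ALUs aren't stalled (assumed)
mix = {"fma": 0.5, "other_fp": 0.3, "non_fp": 0.2}   # assumed instruction mix

flops_per_issued_instr = 2 * mix["fma"] + 1 * mix["other_fp"]
achieved_tf = peak_tf * issue_fraction * flops_per_issued_instr / 2
print(f"~{achieved_tf:.1f} TF achieved out of a {peak_tf} TF theoretical peak")
```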
 

mordecaii83

Avenger
Oct 28, 2017
6,854
Teraflops were always a theoretical peak performance figure, which is why anyone who actually understands the underlying technology has been saying over and over for ages that it's a terrible measure of performance. You don't actually get 12.1 trillion floating point operations per second out of a Series X GPU, because not every instruction is a FMAD, and not all the operands are available without stalling computation pipelines waiting for cache or RAM fetches.

So yes, the theoretical peak is constant. The actual throughput in practice is not, so the "constantness" of the theoretical peak is not really all that relevant.
That's a really excellent way of explaining things, thank you. :)
 

Dezzy

Member
Oct 25, 2017
3,432
USA
PS5 is powerful enough, and I'd bet they'll do a Pro after a few years too.
Look at games like Horizon: Zero Dawn on the PS4 Pro. That already looks more than good enough for me.

I think people being negative towards the PS5 are probably feeling that way because the numbers are lower than on the Series X. It's not like they're actually bad though. They'll do amazing things with the PS5 regardless.
 

Paronth

Member
Oct 25, 2017
268
If you read the article from Digital Foundry, Cerny makes it clear that rather than the CPU and GPU clocks constantly sitting at their maximum (like on the Xbox Series X or previous consoles), they adapt to what the game needs, so if it needs 10.3 TFLOPS it can get there; this flexibility probably makes it easier to regulate the cooling.

Put simply, the PlayStation 5 is given a set power budget tied to the thermal limits of the cooling assembly. "It's a completely different paradigm," says Cerny. "Rather than running at constant frequency and letting the power vary based on the workload, we run at essentially constant power and let the frequency vary based on the workload."

"Rather than look at the actual temperature of the silicon die, we look at the activities that the GPU and CPU are performing and set the frequencies on that basis - which makes everything deterministic and repeatable," Cerny explains in his presentation.
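A conceptual sketch of the two paradigms Cerny is contrasting, in Python. The power model (power proportional to load and to frequency cubed), the 200 W budget, and the constant k are all invented for illustration; this is not Sony's or Microsoft's actual power-management logic:

```python
# Toy contrast of "fixed frequency, variable power" vs "fixed power, variable
# frequency". All constants are made up for illustration.

F_FIXED = 1.825    # GHz, fixed-clock design
F_MAX = 2.23       # GHz, cap for the power-budget design
BUDGET_W = 200.0   # assumed power budget
K = 30.0           # assumed constant in the toy power model: P = K * load * f^3

def fixed_frequency(load: float) -> tuple[float, float]:
    """Constant clock: power follows the workload."""
    return F_FIXED, K * load * F_FIXED ** 3

def fixed_power(load: float) -> tuple[float, float]:
    """Constant power budget: clock drops just enough to stay inside it."""
    f = min(F_MAX, (BUDGET_W / (K * max(load, 1e-6))) ** (1 / 3))
    return f, K * load * f ** 3

for load in (0.5, 0.9, 1.0):
    f1, p1 = fixed_frequency(load)
    f2, p2 = fixed_power(load)
    print(f"load {load:.1f}: fixed-freq {f1:.3f} GHz / {p1:.0f} W | "
          f"fixed-power {f2:.3f} GHz / {p2:.0f} W")
```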
 
Last edited:

Sei

Member
Oct 28, 2017
5,705
LA
Still have no idea how the power gets handled in both systems; they've only said a few buzzwords. The 30 TFLOPS figure is just as accurate as this article.
 

tzare

Banned
Oct 27, 2017
4,145
Catalunya
I'm not a game programmer, but I do develop GPU-accelerated scientific models, so I have a few years' experience optimizing different GPU architectures to accelerate computation. …
Interesting explanation. Makes sense.
 

gremlinz1982

Member
Aug 11, 2018
5,331
Do we?

Here's an article where they overclocked their 5700 from the default 1,850 MHz to 2,005 MHz, a clock frequency increase of 8.4%.

This was their gaming performance increase across different titles.

3D Mark: Increase in GPU score of 12.5%
Assassin's Creed Odyssey: 12% fps increase
Far Cry New Dawn: 10.1% fps increase
F1 2019: 27.4% fps increase
Metro Exodus: 11.8% fps increase
SoT Tomb Raider: 10% fps increase.

These results show that the gaming performance percentage increase, actually exceeds the overclock percentage increase, which discredits your notion and gives further credence to why Sony chose a higher clockspeed. Remember, the GPU in the PS5 is clocked 22% faster at peak than the Series X's (2.23Ghz vs 1.825GHz). We don't really know how RDNA2 will fare.

www.pcgamesn.com

AMD hid the true power of the Radeon RX 5700… here’s how to unlock it

Unlocking the frequency and power shackles on the second-string Navi GPU makes it a fantastic card

Further to that, a 5700 at a core clockspeed of 1,920 MHz gets a Fire Strike GPU score of 25,205, whilst a 5700 with a core clockspeed of 2,194 MHz (closer to the PS5's speed) gets a GPU score of 28,806.

Meaning, a clock speed increase of 14.3% leads to a Fire Strike GPU benchmark score increase of 14.3%. Obviously this isn't a gaming test, but it is interesting nonetheless and shows an almost perfect correlation with increases in frequency clock to actual benchmarking (not Tflop) results.

www.3dmark.com

I scored 23 590 in Fire Strike

AMD Ryzen 5 3600, AMD Radeon RX 5700 x 1, 16384 MB, 64-bit Windows 10}

www.3dmark.com

I scored 23 737 in Fire Strike

Intel Core i7-6700K Processor, AMD Radeon RX 5700 x 1, 16384 MB, 64-bit Windows 10}

On a side note, a 5700 at near PS5 clock speeds actually gets a similar GPU score as a 2080 Super which gets 29.3k, and that's a 2,055 MHz clocked GPU at 12.6 turing Tflops that costs a whopping £600-£700 alone.

www.3dmark.com

I scored 25 400 in Fire Strike

AMD Ryzen 7 3800X, NVIDIA GeForce RTX 2080 SUPER x 1, 16384 MB, 64-bit Windows 10}
You are looking at the wrong things and coming to a wrong conclusion.

An 18% overclock to 2.1 GHz, which is close to the frequency Sony is aiming for, got worse returns across the board.
 

nib95

Contains No Misinformation on Philly Cheesesteaks
Banned
Oct 28, 2017
18,498
You are looking at the wrong things and coming to a wrong conclusion.

An 18% overclock to 2.1 GHz, which is close to the frequency Sony is aiming for, got worse returns across the board.

In my post I specifically compare a 5700 that is overclocked to 2,194 MHz, which is even closer to the PS5's clock and shows a correlating improvement in performance.

It just looks like the 5700 responds much better to overclocking than the 5700 XT does; perhaps that's due to the XT hitting power limitations. But since your original post only mentioned the 5700, that's the one I looked into.

But that is interesting nonetheless. I guess now it's a case of seeing how RDNA2 responds to OC'ing, but presumably Sony is happy enough with the results that they did it.
 

Dee Harp

Banned
Nov 7, 2017
98
Why is the majority of speculation about the PS5 super negative? It's getting ridiculous.


Because the Microsoft defense for the last 6 years has been that the only reason the PS4 sold more is that it was more powerful. And now that we are in a world where people bought PlayStation because it has the games they want to play, it seems like some people want to recreate the power narrative, or at least push sustained clock speeds as the new thing.
 

Fatmanp

Member
Oct 27, 2017
4,438
I fully expect your post to be ignored by people with an agenda to push, even though you just 100% refuted the claim that clock speed gains are lower than expected.

The post is discussing an identical chip at different clock speeds. The XSX and PS5 GPUs are different chips. One has nearly 50% more CUs, which, unless I am mistaken, is more like a 2080 vs an obscenely overclocked 2060.
 

nib95

Contains No Misinformation on Philly Cheesesteaks
Banned
Oct 28, 2017
18,498
You got a source for this? Because Microsoft mentioned nothing about their clocks being variable. 52 CUs at 1,850 MHz (or whatever the number was) is a constant.

And they're being truthful, but they're referring to peak performance; that's what TFLOPS refers to. And the XSX could indeed function at sustained peak CPU/GPU clocks were the load to require it.

And whilst I don't have a source, you can test it with PC GPUs, which show this in action. Essentially, GPUs have DTM states that regulate clocks and voltage depending on demand/load (for efficiency savings). There's no reason to assume that consoles wouldn't have similar efficiency-saving features, especially when we can actually see the huge variations in power/wattage usage whilst running console games. That is why, for example, when playing Gears of War 4 on an Xbox One X, power usage in testing jumps from well below 100 W all the way up to a 172 W peak, but averages around 107 W. When the load doesn't require it, the voltages and frequencies change to save energy (and presumably heat, and for hardware longevity too).

I should clarify, this isn't me saying it's the same as what Cerny is describing; it's not. What Cerny is talking about with the PS5 is different: instead it's a potential minor downclock in the event of hitting a peak power threshold (e.g. a 2% downclock to get back 10% power). The XSX in this uncommon scenario would just keep running at max clocks, but have a power and/or heat spike instead.
 
Last edited:

Vinx

Member
Sep 9, 2019
1,411
I'm not a game programmer, but I do develop GPU-accelerated scientific models, so I have a few years' experience optimizing different GPU architectures to accelerate computation. …
This is an excellent post and should probably be posted in every topic dealing with these systems; however, it will be ignored and forgotten by tomorrow.

People with little to no understanding of this have been throwing around "Tflops" so much over the past couple of days that it has lost all meaning. Might as well say the Xbox Series X has eleventy flippity flappy floos.

So, again, excellent post, but you wasted your time typing that up in an effort to educate people here on the subject. Unfortunately, no one cares. They only want to hear that their box of choice is the one that's the best.
 

Asbsand

Banned
Oct 30, 2017
9,901
Denmark
Meh, with that SSD it's basically 30 TFLOPS anyway so it doesn't matter.
It's graphics processing vs asset loading.

The GPU is the speed of rendering. The drive is the speed of streaming in the data that needs to be rendered. On Xbox Series X the drive will "bottleneck" the GPU, but not by much. The PS5's GPU will bottleneck the extra speed of its SSD, but it all depends on where developers set the bar for graphics.

PS5 will still have faster load times and shit, but if what you want is higher image quality, you'll go with Xbox this time.
 

zombiejames

Member
Oct 25, 2017
11,918
I'm not a game programmer, but I do develop GPU-accelerated scientific models, so I have a few years' experience optimizing different GPU architectures to accelerate computation. …
We need more posts and insights like this one from people with actual experience in these matters. Really good to see.
 

test_account

Member
Oct 25, 2017
4,645
"What that means is that the console will not run the GPU at 2.23 GHz all the time. Since Microsoft's clock is a static number, just because PS5's clock rate is variable makes the 10.28 TFLOPs number uncomparable to the Xbox Series X and misleading. This is because XSX is displaying the "sustained TFLOPs" figure while PS5 is displaying the "peak TFLOPs" figure. To give you some context, when the industry uses the term TFLOPs, it is usually referring to the sustained TFLOPs figure."

Can someone explain the difference to me? So the XSX GPU will run constantly at 12 TFLOPS with no problem at all, while the PS5's GPU won't be able to run at 10.3 TFLOPS all the time? If so, why not?
 

rokkerkory

Banned
Jun 14, 2018
14,128
"What that means is that the console will not run the GPU at 2.23 GHz all the time. Since Microsoft's clock is a static number, just because PS5's clock rate is variable makes the 10.28 TFLOPs number uncomparable to the Xbox Series X and misleading. This is because XSX is displaying the "sustained TFLOPs" figure while PS5 is displaying the "peak TFLOPs" figure. To give you some context, when the industry uses the term TFLOPs, it is usually referring to the sustained TFLOPs figure."

Can someone explain the difference to me? So the XSX GPU will run constantly at 12 TFLOPS with no problem at all, while the PS5's GPU won't be able to run at 10.3 TFLOPS all the time? If so, why not?

As I understand it, the XSX is similar to past consoles, with fixed performance and varying power (wattage) consumption. This is why the fan in the system sometimes kicks up higher under heavier load.

PS5 is designed to run at constant power consumption and will flex CPU/GPU speeds as needed to keep it at that constant power. Cerny said he expects the 'peak' speeds to be reached most of the time.
 

Munstre

Member
Mar 7, 2020
380
"What that means is that the console will not run the GPU at 2.23 GHz all the time. Since Microsoft's clock is a static number, just because PS5's clock rate is variable makes the 10.28 TFLOPs number uncomparable to the Xbox Series X and misleading. This is because XSX is displaying the "sustained TFLOPs" figure while PS5 is displaying the "peak TFLOPs" figure. To give you some context, when the industry uses the term TFLOPs, it is usually referring to the sustained TFLOPs figure."

Can someone explain the difference to me? So the XSX GPU will run constantly at 12 TFLOPS with no problem at all, while the PS5's GPU won't be able to run at 10.3 TFLOPS all the time? If so, why not?
The article is nonsense. Read the post by guitarNINJA to understand what's going on.
 

gundamkyoukai

Member
Oct 25, 2017
21,087
I'm not a game programmer, but I do develop GPU-accelerated scientific models, so I have a few years' experience optimizing different GPU architectures to accelerate computation. …

Thanks for the write-up.


This is an excellent post and should probably be posted in every topic dealing with these systems; however, it will be ignored and forgotten by tomorrow.

People with little to no understanding of this have been throwing around "Tflops" so much over the past couple of days that it has lost all meaning. Might as well say the Xbox Series X has eleventy flippity flappy floos.

So, again, excellent post, but you wasted your time typing that up in an effort to educate people here on the subject. Unfortunately, no one cares. They only want to hear that their box of choice is the one that's the best.

I'm happy that he posted it.
Yeah, most will ignore it, but some of us are happy to gain knowledge and info from people who know more.
Still, like you said, it's sad that people aren't willing to learn and we go around in circles.
 

Iwao

Member
Oct 25, 2017
11,778
Articles like this would not be necessary if there wasn't such a massive defense of a poor initial presentation by Sony. We have supposedly neutral journalists and gaming fans moving goalposts and changing the importance of teraflops. MSFT screwed up with the X1 and they had to eat it; they were crucified for "secret sauce". The PS5 is not some revolutionary system in comparison to the Series X, it's inferior in most areas. Sony is being vague and getting a pass from most of the media.
This is an embarrassing take. Teraflops are not what you think they are, and devs agree. That actual neutral journalist is only repeating what developers who are working on the machines are saying. If PS5 wasn't some revolutionary system, developers would not be calling it revolutionary. What is this then? Developer bias? Outta here.





Why is this thread still open?
Agreed. They are misleading people in an article about something they think is misleading.
 

DSP

Member
Oct 25, 2017
5,120
"What that means is that the console will not run the GPU at 2.23 GHz all the time. Since Microsoft's clock is a static number, just because PS5's clock rate is variable makes the 10.28 TFLOPs number uncomparable to the Xbox Series X and misleading. This is because XSX is displaying the "sustained TFLOPs" figure while PS5 is displaying the "peak TFLOPs" figure. To give you some context, when the industry uses the term TFLOPs, it is usually referring to the sustained TFLOPs figure."

Can someone explain the difference to me? So the XSX GPU will run constantly at 12 TFLOPS with no problem at all, while the PS5's GPU won't be able to run at 10.3 TFLOPS all the time? If so, why not?

The peak TF value is a function of frequency. On Series X, the peak is always 12 because it is running at a fixed clock. The peak on PS5 is going to vary because its clock varies; basically, this 10.3 figure is the peak of the peak. That's the point the author is trying to make: these are not directly comparable. If you want to compare, you need some kind of average for the PS5 that we can't get, so he's trying to estimate it.
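To be clear about what "the peak is a function of frequency" means in numbers, here's the same bookkeeping at a few hypothetical PS5 clock points (36 CUs, 64 ALUs per CU, 2 FLOPs per FMA); the lower clocks are arbitrary examples, not claims about how far the PS5 actually drops:

```python
# The headline TF figure simply tracks whatever clock the GPU is at.
for clock_ghz in (2.23, 2.10, 2.00):   # 2.10 and 2.00 are arbitrary examples
    tf = 36 * 64 * 2 * clock_ghz / 1000
    print(f"{clock_ghz:.2f} GHz -> {tf:.2f} TF")
```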
 