
Cantaim

Member
Oct 25, 2017
33,320
The Stussining
Posting this on behalf of Krejlooc

When you talk about the time operations take on computers, you tend to use prefixes like "nano" and "micro" and "milli" before the word "second," which makes it pretty hard for the average person to comprehend how much latency some operations really have. I think a good discussion could stem from this:
UgnJ6y0.jpg

This is a chart that translates average times for various operations into "human" time, beginning by translating a single CPU cycle at .3 nanoseconds into 1 second. From there, various other operations are extrapolated out, which really shows how slow, or how fast, some technologies really are. I figured that, given recent discussions about cloud computing, this would be a good topic starter to get people to understand just how enormous the gulf in time between accessing local parts of your system and accessing remote parts really is.



For some extra clarity, I'm going to include a follow-up question that Nick C asked him in the adopt-a-user thread:
Any way you can build onto this? Like, where does the .3 ns = 1 s equation come in? Being cloud-based computing, I would assume that this requires a constant connection to the internet. Is this correct? If so, I would also assume that connection speeds would play a big role in "real-world" speeds as well?

I'm obviously very interested in this, but there are some things that I'm still unclear on. The infographic is both fascinating and damning at the same time, but an OP like this could use a little more fleshing out. To put a scale on 32 millennia: that is roughly 6 times longer than recorded human history, and about a millennium shy of when humans first domesticated dogs.

Here was Krejlooc's response:
The figures come from a company called TidalScale, which published their research in a lecture at Flash Memory Summit 2015 entitled "What enables the solution? The Memory Hierarchy in Human Terms".

Unfortunately their lecture isn't available online save for slides, which is where that comes from.

I actually saw the slide about a year ago while conversing with some friends on twitter about latency. I really wish I could find the original lecture online, but regardless, I trust the people I was conversing with and the figures don't sound outlandish.

EDIT: Further clarification, the ".3ns = 1 second" isn't a formula, that's the conversion scale they're using. It's saying that, if 1 CPU cycle is .3ns in length, here are relative values for other operations on the left hand side. And on the right hand side is the conversion, i.e. "let's assume 1 CPU cycle was 1 second long, instead of .3ns long, here's how long those other operations would take in comparison."

They use 1 CPU cycle as the base unit, because that's the fastest unit of computing. The comparison is meant to give you a better grasp of how relatively slow other operations are.
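To put rough numbers on that scale (my own back-of-the-envelope figures, not from the slide): if a 0.3 ns cycle is stretched to 1 second, then a main memory access of roughly 100 ns stretches to about 100 / 0.3 ≈ 333 "seconds", a bit over five minutes, and an internet round trip of roughly 100 ms stretches to about 333 million "seconds", on the order of a decade.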

EDIT AGAIN: Also, to your point about connection speed -- this goes into a bit about what exactly latency is. There is a common misunderstanding about the role of concurrency in latency. When you have faster connection speeds, you are still bound by physical properties like the speed of light. When they say you went from, say, 14.4 Kbps to 28.8 Kbps, this does not mean the speed at which information travels increased. That's not really possible. Instead, it means that on every single trip, more data can be sent concurrently.

There is a classic problem in computer science that explains this: 9 women can't make a baby in 1 month. Some processes take time to perform regardless of whether it's happening once or whether it's happening 10 times at once. Let's say I want to send a kid down the street to get a coke from the store. A kid can only carry 1 can of coke. It takes him 10 minutes to walk to the store. My throughput is thus 1 coke per 10 minutes. But if I send 10 kids at once, then in 10 minutes, I can have a throughput of 10 cokes per 10 minutes. But that's not quite the same as 1 coke per 1 minute, as 1 coke still takes 10 minutes to retrieve.

So while connection bandwidth has increased, and thus our internet concurrency has increased, the time for 1 single operation is still the same.
 
OP
Cantaim

Member
Oct 25, 2017
33,320
The Stussining
This was honestly a really fascinating read. I don't know much about computer processes, but it's interesting to see how they compare operations with their "real time" counterparts.
 

Deleted member 12790

User requested account closure
Banned
Oct 27, 2017
24,537
This was honestly a really fascinating read. I don't know much about computer process but it's interesting to see how they compare operations with their "real time" counter part

What's really cool is realizing this stuff as you learn more about computer programming and optimization. Things like L1-L3 cache work well if you take locality of reference into account when writing code. For example, say you are iterating through objects in memory every call: it's dramatically faster to access it all through cache rather than through successive hits to RAM. When you call a location in memory and your computer fetches it, it'll also fetch the surrounding data on the assumption that you'll be accessing it on your next command, to save on these hits. Depending on how you handle your memory, you can exploit spatial locality to increase performance. For example, I'm currently writing some code that handles collision detection between objects. I have built an object handler that manages memory for me, and thus lets me order my objects in memory any way I wish. By using a Z-order curve to populate my memory, I can ensure that sequential memory follows a spatial pattern like so:

542px-Four-level_Z.svg.png


which allows me to iterate through immediately surrounding spaces without having to check useless conditions (i.e. it's not possible for two objects across the screen from each other to collide at all, so never check that condition). By ordering my memory this way, when I access it and it's moved to cache, I have a better chance of my next iteration being already in cache to begin with.
 
OP
Cantaim

Member
Oct 25, 2017
33,320
The Stussining
Whats really cool is realizing this stuff as you learn more about computer programming and optimization. Things like L1-L3 cache work well if you take into account Locality of reference when writing code. For example, say you are iterating through objects in memory every call, it's exponentially faster to access it all through cache rather than successive hits to RAM. When you call a location in memory and your computer fetches it, to account for this, it'll fetch surrounding data on the assumption that you'll be accessing it on your next command, to save on these hits. Depending on how you handle your memory, you can exploit spacial locality to increase performance. For example, I'm currently writing some code that handles collision detection between objects. I have built an object handler that manages memory for me, and thus lets me order my objects in memory in any way I wish. By using a z-ordered curve to populate my memory, I can ensure that sequential memory follows a spacial pattern like so:

542px-Four-level_Z.svg.png


which allows me to iterate through immediately surrounding spaces without having to check useless conditions (i.e. it's not possible for two objects across the screen from each other to collide at all, so never check that condition). By ordering my memory this way, when I access it and it's moved to cache, I have a better chance of my next iteration being already in cache to begin with.
This shit is fascinating
 

ss_lemonade

Member
Oct 27, 2017
6,648
I barely know anything about programming at that level, but in web development, stuff closer to your executing server will always run several times faster (e.g. web server to a nearby memcached server versus web server to a distant database server). It's how large sites like Facebook run extremely well - by stuffing a ton of data into cache servers that can be accessed by other servers. Does this also fall into the category of "locality of reference"?
 
Last edited:
Oct 25, 2017
1,713
I barely know anything about programming at that level but in web development, stuff closer to your executing server will always run several times better (eg. web server to a nearby memcached server versus web server to a distant database server). It's how large sites like Facebook run extremely well - by stuffing a ton of data into cache server that can be accessed by other servers. Does this also fall into the category of "locality of reference"?

Same concept on a huge scale. Pretty cool stuff.
 

Deleted member 12790

User requested account closure
Banned
Oct 27, 2017
24,537
I barely know anything about programming at that level but in web development, stuff closer to your executing server will always run several times better (eg. web server to a nearby memcached server versus web server to a distant database server). It's how large sites like Facebook run extremely well - by stuffing a ton of data into cache server that can be accessed by other servers. Does this also fall into the category of "locality of reference"?

Sorta kinda the same principle, but locality of reference is usually more abstract. Like, locality of reference as I described it refers to objects that are supposed to reside next to each other in spatial or temporal representation residing next to each other physically in local memory. What you're referring to is merely locality - i.e. computer components that need to travel shorter distances to communicate do so quicker :P

to understand what I mean, imagine I have a screen full of objects, like so:

83rzGEv.png


Our screen is a 2D grid that is 4x8 big (for simplicity), and each circle is a single pixel object. They are each labeled A-Z, plus !, @, $, %, ^, and &. Additionally, their X, Y coordinates are given on the left side of each object. However, while our screen is a 2D grid, computer memory is a 1D array (i.e. just a list of sequential values in one dimension, not two), like so:

Code:
[A] [B] [C] [D] [E] [F] [G] [H] [I] [J] [K] [L] [M] [N] [O] [P] [Q] [R] [S] [T] [U] [V] [W] [X] [Y] [Z] [!] [@] [$] [%] [^] [&]

Say each "box" of memory in this example can hold one object. The above is 32 "boxes" of memory. If I access the first box, I get object A, which is at 0,0. If I access the second box of memory, I get object B, which is at 0,1. This is a nested height by width representation of the objects in memory, like so:

XN2F0OV.png

izNLU1e.png

PbC6S9x.png

taakoQG.png


Say for example you want to determine object collision from the perspective of the red circle at 1,1. A logical way to do this is to start at the first memory box at address 1 and compare its X and Y coordinates to the X and Y coordinates of the red circle to determine if that object collides, then move on to the next box of memory at address 2 to examine object B, then to the next box at address 3 to examine object C. We keep the index of the box we're examining as an incrementing value that walks all 32 boxes. This is a relatively simple kind of loop. The problem, however, is that we are examining instances that are impossible for collision. For example, we eventually examine objects H, P, X, and &, which are all the way across the other end of the screen, and thus not really able to collide with our red circle. We end up doing more than twice the number of checks we need to, but we do so because going through all the checks with a single incrementing value is rather easy (and, when you get really low level, there are reasons why auto-incrementing indexes like this are used).
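In code, that simple loop would look something like this (a rough sketch; the struct and names are just for illustration):

Code:
#include <cstdlib>

struct Obj { int x, y; };  // one object per "memory box"

// Naive pass: walk all 32 boxes in order and test every object against
// the red circle, even the ones on the far side of the screen.
int count_hits(const Obj boxes[32], const Obj& red) {
    int hits = 0;
    for (int i = 0; i < 32; i++) {
        int dx = std::abs(boxes[i].x - red.x);
        int dy = std::abs(boxes[i].y - red.y);
        if (dx <= 1 && dy <= 1)  // touching or overlapping the red circle
            hits++;
    }
    return hits;
}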

What we want to do, instead, is limit our search to a range like so:

BbScUAI.png


Locality of reference is a way to order your memory so that you can examine only those items while still using a single incrementing value for an index. There are many ways to accomplish this, and the method I'm using is a Z-order curve. Basically, instead of ordering my objects in memory sequentially according to the width and height of the 2D screen, I order my memory in tiny Z-shapes. Since each Z-shape covers 4 elements, my selection range has to be rounded up to a power of 4, so we bump our 9-element range up to the next power of 4 (16) and fill it with Z-shapes like so:

MxnGFVQ.png

Code:
[A] [B]

TM3Ab5z.png

Code:
[A] [B] [I]

CxVmO0v.png

Code:
[A] [B] [I] [J]

gnMOwjZ.png

Code:
[A] [B] [I] [J] [C]

96sheQX.png

Code:
[A] [B] [I] [J] [C] [D]

octcvtU.png

Code:
[A] [B] [I] [J] [C] [D] [K]

neYDlBJ.png

Code:
[A] [B] [I] [J] [C] [D] [K] [L]

Tc6WVal.png

Code:
[A] [B] [I] [J] [C] [D] [K] [L] [Q]

all the way to:

KDOruSp.png

Code:
[A] [B] [I] [J] [C] [D] [K] [L] [Q] [R] [Y] [Z] [S] [T] [!] [@] [E] [F] [M] [N] [G] [H] [O] [P] [U] [V] [$] [%] [W] [X] [^] [&]

This type of ordering is called Z-ordering, and there are many different ways to automatically populate your memory this way. The way I use is called Morton encoding, which is a way of projecting an n-dimensional space onto a one-dimensional array through bit interleaving.

For a bit more explanation about Morton encoding:
Basically, the process is that you take the X and Y coordinates for any given screen space object, convert them into binary, then interleave their bits into one stream, like so:

Object A: 0, 0 -> X: 0000, Y: 0000 -> Result: 00000000
Object B: 0, 1 -> X: 0000, Y: 0001 -> Result: 00000001
...
Object R: 2, 1 -> X: 0010, Y: 0001 -> Result: 00001001
....
Object %: 3, 5 -> X: 0011, Y: 0101 -> Result: 00011011

The resulting bitstrings of these interleaved coordinates, when ordered sequentially and iterated from 00000000 -> 11111111, yield the Z-order pattern.
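A minimal sketch of that interleaving in C++, matching the 4-bit examples above (the function name is just for illustration):

Code:
#include <cstdint>

// Interleave two 4-bit coordinates into one 8-bit Morton index.
// Matches the examples above: Y's bits land in the even positions and
// X's bits in the odd positions, so (X=2, Y=1) -> 00001001 and
// (X=3, Y=5) -> 00011011.
uint8_t morton_index(uint8_t x, uint8_t y) {
    uint8_t result = 0;
    for (int i = 0; i < 4; i++) {
        result |= ((y >> i) & 1u) << (2 * i);      // bit i of Y -> bit 2i
        result |= ((x >> i) & 1u) << (2 * i + 1);  // bit i of X -> bit 2i + 1
    }
    return result;
}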

Ordering your memory in this fashion has many benefits. Primarily, locality of reference reduces the number of objects we need to examine. If we want to examine the 9 elements surrounding our red circle, we only have to examine the following memory addresses:

Code:
[A] [B] [I] [J] [C] [D] [K] [L] [Q] [R] [Y] [Z] [S] [T] [!] [@]

Which is half as many addresses as we would have to examine if we didn't exhibit locality of reference. With regards to cache, this can be beneficial in drastically speeding up access. Let's assume, for example, that we only have 2 levels of cache, and level 1 of cache is 2 memory boxes big, and level 2 cache is 16 memory boxes big. When we tell our computer that we want to examine the memory box at location A, because direct access from RAM is so much slower than access from cache is, it'll instead move 16 boxes to level 2 cache first, then 2 boxes from level 2 cache to level 1 cache to make the comparison. Because we used locality of reference to ensure that the next 16 memory boxes we access hold every object surrounding the object we want to check and nothing more, we never need to access ram again and can work directly out of cache instead. And, as you can see in the opening post, doing so is massively speedier.

Hopefully that makes sense.
 
Last edited:

tokkun

Member
Oct 27, 2017
5,400
This stuff is a lot of my day-to-day work. I'm supposed to take an operation that has 7us latency and get it down to 5us by the end of the year.
 

Deleted member 12790

User requested account closure
Banned
Oct 27, 2017
24,537
This stuff is a lot of my day-to-day work. I'm supposed to take an operation that has 7us latency and get it down to 5us by the end of the year.

What type of operation? The most latency-sensitive technologies I've worked with have been vision tracking stuff and EEG work (related to sampling for FFT).
 
Oct 27, 2017
1,283
SoCal
Didn't see this posted the first time, but thanks for sharing, Krejlooc! It is interesting to see the magnitudes of difference represented in more human terms to hammer home how damning a cache miss can be.
 

Deleted member 12790

User requested account closure
Banned
Oct 27, 2017
24,537
Didn't see this posted the first time, but thanks for sharing, Krejlooc! It is interesting to see the magnitudes of difference represented in more human terms to hammer home how damning a cache miss can be.

It also helps you understand just how many billions of operations our computers can perform in a second. The amount of math a computer can do all at once is staggering. Take any single frame from a modern AAA game and try to calculate all the math by hand, and it could very well take an entire lifetime.

I like to imagine what time would be like on this scale, if we could shrink down and slow time enough to see the individual electrical impulses leaving the CPU. It would take eons to accomplish anything. When we use computers, it's like watching the Grand Canyon carve itself out before us in fast motion. Computers work because of mind-bogglingly huge numbers of operations, each one doing a tiny amount of work.
 

Deleted member 32101

User requested account closure
Banned
Nov 9, 2017
335
Currently taking a Computer Architecture class. Or rather, "How can we keep the pipeline busy while data is being fetched"-101, plus other dirty tricks that use the billions of transistors to make CPUs and GPUs faster, which gets too damn complex faster than you'd ever wish. Theoretical CompSci is easier, imo.
 

Deleted member 12790

User requested account closure
Banned
Oct 27, 2017
24,537
Currently taking a Computer Architecture class. Or rather "How can we keep the pipeline busy while data is being fetched"-101 and other dirty tricks to utilize the billions of transistors to make CPUs and GPUs faster which gets too damn complex faster than you ever wished for. Theoretical CompSci is easier, imo.

It's the most interesting shit.

EDIT: It's more fun to learn an old CPU/system, IMO. The m68k is fascinating to study.
 

Deleted member 12790

User requested account closure
Banned
Oct 27, 2017
24,537
I agree to some extent. I 100% acknowledge the importance of mastering the basics of it to be a competent Computer / Data Scientist / Engineer. Still gonna pursue an MSc far away from Low-Level. :p

I get a rush out of making hardware do as much as possible, as fast as possible, with as little effort as possible. Even if it's for trivial things. I guess it's how I'm wired.

Like, going back to the above: for Morton encoding, I actually have several different ways of encoding bitstring representations of X,Y coordinates, optimized for different architectures, because a lot of the code I'm writing lately is multiplatform.

You can do Morton encoding using a couple of nested for-loops, but you get n^2 time, which is pretty terrible. With Haswell and newer processors, there are actually a couple of BMI2 instructions - PEXT (parallel bit extract) and PDEP (parallel bit deposit) - that are perfect for interleaving bits, but I have options if I am using pre-Haswell processors. Most x86 and ARM processors will give me access to a carry-less multiply instruction, and a property of carry-less multiplying a number by itself is that it produces that number's bits interleaved with 0s. Some simple logic and I can create Morton strings nearly as fast as the BMI2 instructions. But my code also runs on really old and exotic processors, namely the Hitachi SH-4 (Sega Dreamcast), which has no carry-less multiply instruction. Using LUTs I can actually approach speeds nearing the BMI2 instructions at a cost of a few kilobytes of memory.
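For a rough idea of what the Haswell path looks like (a sketch, not his actual code; the fallback below is the classic shift-and-mask spreading rather than the carry-less-multiply or LUT tricks he mentions):

Code:
#include <cstdint>
#include <immintrin.h>  // _pdep_u32; requires BMI2 (compile with -mbmi2)

// BMI2 path: deposit Y's bits into the even positions and X's bits into
// the odd positions, matching the interleaving convention used earlier.
inline uint32_t morton2d_bmi2(uint16_t x, uint16_t y) {
    return _pdep_u32(y, 0x55555555u) | _pdep_u32(x, 0xAAAAAAAAu);
}

// Portable fallback: spread the bits of a 16-bit value apart with zeros
// between them using shifts and masks, then OR the two spread values.
inline uint32_t spread16(uint32_t v) {
    v &= 0x0000FFFFu;
    v = (v | (v << 8)) & 0x00FF00FFu;
    v = (v | (v << 4)) & 0x0F0F0F0Fu;
    v = (v | (v << 2)) & 0x33333333u;
    v = (v | (v << 1)) & 0x55555555u;
    return v;
}

inline uint32_t morton2d_portable(uint16_t x, uint16_t y) {
    return spread16(y) | (spread16(x) << 1);
}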

I get a kick out of my code running fast and efficiently no matter where it's running. To do that, you gotta know how it all works.

EDIT RE RE: WRT x86 64, thank god modern compilers are what they are today.
 

tokkun

Member
Oct 27, 2017
5,400
What type of operation? The most latent-sensitive technologies I've worked with have been vision tracking stuff and EEG work (related to sampling for FFT).

I work on stuff related to high-performance storage and networking technologies.

The particular operation I am working on optimizing is the client-side CPU time spent doing a single remote data read in a distributed storage system. Like the stuff you are working on, a lot of it comes down to optimizing memory accesses and avoiding cache misses or other sources of stalls. However, it's a multi-threaded client, so the issues I'm facing are less to do with spatial and temporal locality and more to do with interactions between threads on different cores.

For instance, there are some reference-counted objects that are shared across threads - think std::shared_ptr if you are a C++ programmer. Any time a reference-counted object is copied or goes out of scope, it requires an atomic read-modify-write operation on the address storing the object's reference count. Since it includes a write, you have to invalidate the cache line holding this address in all the other cores, so they are likely to miss when they next read it. Making things worse, an atomic RMW operation on x86 processors requires that the processor acquire a lock on the memory location, which may further require that the CPU assert a bus lock, stalling any accesses to memory during the operation. When decrementing the reference count, the CPU also needs to enforce sequential consistency in the memory model, so it adds a memory fence, further stalling the processor by preventing it from reordering memory accesses.

Lots of stuff going on simply from something like

{
std::shared_ptr<Foo> a = b;
}

That can cost hundreds of nanoseconds on every core that touches 'b'.
 

Sniffynose

Member
Oct 30, 2017
313
It makes sense, aside from maybe adding Canadian internet and having it take from the Big Bang until the present lol
 

johan

Member
Oct 29, 2017
1,554
This is a great topic! Thanks Krejlooc. I have been learning about data locality recently and how it pertains to game development.

The book Game Programming Patterns helped me a lot in understanding this. I feel like real world analogues to these software workings help a lot in understanding; the computer time vs human time table is very helpful and awe inspiring.
 

ss_lemonade

Member
Oct 27, 2017
6,648
Sorta kinda the same principle, but locality of reference is usually more abstract. Like, locality of reference as I described it refers to objects that are supposed to reside next to each other in spacial or temporal representation, residing next to each other physically in local memory. What you're referring to is merely locality - i.e. computer components that needs to travel shorter distances to communicate do so quicker :P

...

Hopefully that makes sense.

Though not exactly applicable in my current field of work/web development, that does make sense with the clear examples. Thanks!
 

ray_caster

Member
Nov 7, 2017
663
Whats really cool is realizing this stuff as you learn more about computer programming and optimization. Things like L1-L3 cache work well if you take into account Locality of reference when writing code. For example, say you are iterating through objects in memory every call, it's exponentially faster to access it all through cache rather than successive hits to RAM. When you call a location in memory and your computer fetches it, to account for this, it'll fetch surrounding data on the assumption that you'll be accessing it on your next command, to save on these hits. Depending on how you handle your memory, you can exploit spacial locality to increase performance. For example, I'm currently writing some code that handles collision detection between objects. I have built an object handler that manages memory for me, and thus lets me order my objects in memory in any way I wish. By using a z-ordered curve to populate my memory, I can ensure that sequential memory follows a spacial pattern like so:

which allows me to iterate through immediately surrounding spaces without having to check useless conditions (i.e. it's not possible for two objects across the screen from each other to collide at all, so never check that condition). By ordering my memory this way, when I access it and it's moved to cache, I have a better chance of my next iteration being already in cache to begin with.

Another good use for Morton ordering is simply ordering texels inside textures. Textures with texels stored linearly perform very well when the texture is also accessed linearly. But geometry with textures mapped to it is rarely, if ever, oriented in such a way that texels get accessed in linear order, so rendering suffers from excessive cache misses. Of course, there are plenty of other texel ordering schemes that address this issue. Nowadays people generally don't need to worry too much about how texels are ordered in memory unless they work with graphics drivers and/or hardware.
 

Durante

Dark Souls Man
Member
Oct 24, 2017
5,074
I work on stuff related to high-performance storage and networking technologies.

The particular operation I am working on optimizing is the client-side CPU time spent doing a single remote data read in a distributed storage system. Like the stuff you are working on, a lot of it comes down to optimizing memory accesses and avoiding cache misses or other sources of stalls. However, it's a multi-threaded client, so the issues I'm facing are less to do with spatial and temporal locality and more to do with interactions between threads on different cores.

For instance, there are some reference-counted objects that are shared across threads - think std::shared_ptr if you are C++ programmer. Any time a referenced-counted object is copied or goes out of scope it requires an atomic read-modify-write operation to the address storing the object's reference count. Since it includes a write, you have to invalidate the cache line holding this address in all the other cores, so they are likely to miss when they next read it. Making things worse, an atomic RMW operation on x86 processors requires that the processor acquire a lock on the memory location, which may further require that the CPU assert a bus lock, stalling any accesses to memory during the operation. When decrementing the reference count, the CPU also needs to enforce sequential consistency in the memory model so it adds a memory fence, further stalling the processor by preventing it from reordering memory accesses.

Lots of stuff going on simply from something like

{
std::shared_ptr<Foo> a = b;
}

That can cost hundreds of nanoseconds on every core that touches 'b'.
As someone whose day job is in HPC, my first question would be "do those objects actually need reference-counted lifetime semantics".
It's a platitude that the fastest operation is the one you don't perform, but it's also true.
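And when the objects really do need shared ownership, the cheapest fix is often just to stop copying the shared_ptr on hot paths (a sketch; Foo is a made-up type):

Code:
#include <memory>

struct Foo { int value = 0; };

// Copying b bumps the atomic reference count on entry and drops it on
// exit: two atomic RMWs plus the cross-core cache-line traffic above.
int by_value(std::shared_ptr<Foo> b) { return b->value; }

// Passing by const reference touches no reference count at all; it is
// safe as long as the caller's shared_ptr outlives the call.
int by_ref(const std::shared_ptr<Foo>& b) { return b->value; }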
 

Deleted member 12790

User requested account closure
Banned
Oct 27, 2017
24,537
Another good use for Morton ordering is simply ordering texels inside textures. Textures with texels stored linearly perform very well when rendering linearly (as in, the texture is accessed linearly). Geometry with textures mapped to it is rarely, if ever, oriented in such a way that it results in an access order where rendering performs well due to excessive amounts of cache misses. Of course, there are plenty of other texel ordering schemes that address this issue. Nowadays people generally don't need to worry too much about how texels are ordered in memory unless they work with graphics drivers and/or hardware.

An even more straightforward use for Morton encoding: the Dreamcast PVR core uses deferred tile-based rasterization, and the tile drawing pattern follows the Z-order. By arranging your texels in Z-order, you can ensure the core is properly fed.

This type of Z-ordering of texels is sometimes known as twiddling, or swizzling. I have been writing a book on Dreamcast development that talks about this.
 

Deleted member 12790

User requested account closure
Banned
Oct 27, 2017
24,537
This is a great topic! Thanks Krejlooc. I have been learning about data locality recently and how it pertains to game development.

The book Game Programming Patterns helped me a lot in understanding this. I feel like real world analogues to these software workings help a lot in understanding; the computer time vs human time table is very helpful and awe inspiring.

An aside, but this is a great book.
 

Izayoi

Member
Oct 25, 2017
828
Hey man, this is a great thread. I know next to nothing about the nitty-gritty details of how this works, but I appreciate (and understand) the conceptual picture that you have painted. I am fascinated by computers and how they work, and while I would never dare to get involved with the mathematical side of it, I love learning and talking about the theory and concepts.

Thank you!
 

Weltall Zero

Game Developer
Banned
Oct 26, 2017
19,343
Madrid
Whats really cool is realizing this stuff as you learn more about computer programming and optimization. Things like L1-L3 cache work well if you take into account Locality of reference when writing code. For example, say you are iterating through objects in memory every call, it's exponentially faster to access it all through cache rather than successive hits to RAM. When you call a location in memory and your computer fetches it, to account for this, it'll fetch surrounding data on the assumption that you'll be accessing it on your next command, to save on these hits. Depending on how you handle your memory, you can exploit spacial locality to increase performance. For example, I'm currently writing some code that handles collision detection between objects. I have built an object handler that manages memory for me, and thus lets me order my objects in memory in any way I wish. By using a z-ordered curve to populate my memory, I can ensure that sequential memory follows a spacial pattern like so:

542px-Four-level_Z.svg.png


which allows me to iterate through immediately surrounding spaces without having to check useless conditions (i.e. it's not possible for two objects across the screen from each other to collide at all, so never check that condition). By ordering my memory this way, when I access it and it's moved to cache, I have a better chance of my next iteration being already in cache to begin with.

This is very interesting! Unless I'm missing some crucial bit, this Z-sorting method can expand more or less trivially to any number of dimensions, so you could use it for collisions in 3D space, right? Although my intuition is that the correlation between physical adjacency and list adjacency may be worse for higher-dimensional spaces? Even in its 2D form, this correlation seems considerably better for elements at the top and bottom of the list (i.e. the ones near the starting and ending "corners" of the 2D space).

Funnily, it reminded me a bit of a kind of fanfiction I made in my head in the late 90s about an (existing) PSX game called Zero Divide (where the main characters are programs), in that I envisioned a kind of three-dimensional memory (physically literal in my case: memory was made of actual, physical cells arranged in a 3D matrix) where each memory address was determined by three coordinates and code could "snake" around without memory jumps.

Again, fascinating read; thanks Krejlooc!

Edit: Hadn't read your second, in-depth post when I read this, but the Wikipedia article was descriptive enough for me to get it. :)
 
Last edited:

Deleted member 12790

User requested account closure
Banned
Oct 27, 2017
24,537
This is very interesting! Unless I'm missing some crucial bit, this Z-sorting method can expand more or less trivially to any number of dimensions, so you could use it for collisions in 3D space, right? Although my intuition is that the correlation between physical adjacency and list adjacency may be worse for higher-dimensional spaces? Even in its 2D form, this correlation seems considerably better for elements at the top and bottom of the list (i.e. the ones near the starting and ending "corners" of the 2D space).

Funnily, it reminded me a bit of a kind of fanfiction I made in my head in the late 90s about an (existing) PSX game called Zero Divide (where the main characters are programs), in that I envisioned a kind of three-dimensional memory (physically literal in my case: memory was made of actual, physical cells arranged in a 3D matrix) where each memory address was determined by three coordinates and code could "snake" around without memory jumps.

Again, fascinating read; thanks Krejlooc!

Yes, a Z-order curve can fill any n-dimensional space:

FeatureDet_fig1.png
 

Arebours

Member
Oct 27, 2017
2,656
visualization of latencies

This is also a reason why compilers can only do so much with optimization. Compilers can't change the structure of your data, so even just using a linked list in the wrong place might kill your cache coherency, resulting in bad performance. It's never OK to not understand how CPUs work if you are a programmer; code runs on hardware, not on software "platforms".

Here's a good read on memory: What Every Programmer Should Know About Memory
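A quick way to see the data-layout effect for yourself (a toy sketch, not a rigorous benchmark):

Code:
#include <chrono>
#include <iostream>
#include <list>
#include <numeric>
#include <vector>

// Sum the same values stored contiguously (vector) and as heap-scattered
// nodes (list). The vector walk is prefetcher-friendly; the list walk
// chases pointers and misses the cache far more often.
int main() {
    const int N = 1000000;
    std::vector<int> v(N, 1);
    std::list<int> l(v.begin(), v.end());

    auto time_sum = [](const auto& c, const char* name) {
        auto t0 = std::chrono::steady_clock::now();
        long long sum = std::accumulate(c.begin(), c.end(), 0LL);
        auto t1 = std::chrono::steady_clock::now();
        std::cout << name << ": sum=" << sum << " in "
                  << std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count()
                  << " us\n";
    };

    time_sum(v, "vector");
    time_sum(l, "list");
}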
 
Last edited:

Deleted member 12790

User requested account closure
Banned
Oct 27, 2017
24,537
Visualization of latencies

this is also why compilers only can do so much with optimization. compilers can't change the structure of your data so even just doing something like using a linked list in the wrong place might kill your cache coherency resulting in too slow a program. it's never ok to not understand how cpus work if you are programmer, code runs on hardware not software "platforms".

Here's a good read on memory: What Every Programmer Should Know About Memory

This is also why languages that completely expose memory to the programmer, like C, are so incredibly powerful.

Completely agree with your point about needing to know hardware to know software well. I've advocated learning a retro machine for years.
 

smash_robot

Member
Oct 27, 2017
994
I agree to some extend. I 100% acknowledge the importance of mastering the basics of it to be a competent Computer / Data Scientist / Engineer. Still gonna pursue a MSc far away from Low-Level. :p

EDIT Reply: We have x86 64 Architecture. It's as messy as it gets.
If you understand the fundamentals at a low level, you'll be much better off. I've met plenty of developers who struggle with even simple things like multi-threading, or even just allocating on the stack vs the heap - and it is usually down to not really understanding how computers work.
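For anyone reading along who hasn't seen it, the stack vs heap distinction in its simplest form (a trivial C++ sketch):

Code:
#include <memory>

void example() {
    int on_stack = 42;              // lives in this function's stack frame;
                                    // gone automatically when the function returns
    auto on_heap =                  // allocated from the heap at runtime;
        std::make_unique<int>(42);  // freed when the unique_ptr is destroyed
    *on_heap += on_stack;
}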
 

Deleted member 12790

User requested account closure
Banned
Oct 27, 2017
24,537
If you understand the fundamentals at a low level, you'll be much better off. I've met plenty of developers who struggle with even simple things like multi-threading, or even just allocating on the stack vs the heap - and it is usually down to not really understanding how computers work.

Funnily enough, I was writing a special kind of parser for a scripting language and had to implement a stack. I was remarking to a friend how familiarity with m68k asm made doing so easier.

I remember in one of my first university comp sci classes, the professor asked how many people considered themselves programmers, and the entire class raised their hands. Then he asked how many knew the difference between the stack and the heap, and about half the class put their hands down. The professor laughed and wondered out loud whether that was a success or a failure of modern programming paradigms.
 

Deleted member 12790

User requested account closure
Banned
Oct 27, 2017
24,537
Hey man, this is a great thread. I know next to nothing about the nitty gritty details of how this works, but I appreciate (and understand) the conceptual picture that you have painted. I am fascinated by computer and how they work, and while I would never dare to get involved with the mathematical side of it, I love learning and talking about the theory and concept.

Thank you!

To be fair, while computer science can be very heavily math based, most of what is being discussed in this topic is more related to arithmetic - the manipulation of numbers - than actual mathematics

Of course arithmetic is very closely related to mathematics, and on a logical level all mathematics in computing is accomplished by literal, physical arithmetic (which is why games like minecraft that simulate physics can be used to build things like calculators).
 

Weltall Zero

Game Developer
Banned
Oct 26, 2017
19,343
Madrid
I remember in one of my firsr university comp Sci classes, the professor asked how many people considered themselves programmers, and the entire class raised their hand. Then he asked how many knew what the difference between stack and heap was, and about half the class put their hands down. The professor laughed and wondered outloud if that was a success or a failure of modern programming paradigms.

That's an amazing anecdote, and I completely understand the teacher's point, but one has to wonder to what extent it isn't also a failure of education itself. Although if it was an introductory class, I guess that's to be expected.

They say sometimes having haphazardly learnt stuff yourself can be harmful, and when you start getting a proper formal education you need to "unlearn" it, so you actually start from a worse position than a complete beginner.
 

smash_robot

Member
Oct 27, 2017
994
Funnily enough, i was writing a special kind of parser for a scripting language and had to implement a stack. I was remarking to a friend how familiarity with m68k asm made doing so easier.

I remember in one of my firsr university comp Sci classes, the professor asked how many people considered themselves programmers, and the entire class raised their hand. Then he asked how many knew what the difference between stack and heap was, and about half the class put their hands down. The professor laughed and wondered outloud if that was a success or a failure of modern programming paradigms.
I remember doing some embedded work with the 68k as the microcontroller for an in-house hardware design - it was a combination of C and asm. I enjoyed embedded work, but I don't miss having to do things like stick a scope on the memory bus to find out why half the memory was missing. Although seeing the damn thing boot and having the scheduler blink an LED on and off is a great feeling lol
 

Deleted member 12790

User requested account closure
Banned
Oct 27, 2017
24,537
That's an amazing anecdote, and I completely understand the teacher's point, but one has to wonder to what extent it isn't also a failure of education itself. Although if it was an introductory class, I guess that's to be expected.

They say sometimes having hapharadly learnt stuff yourself can be harmful and when you start getting a proper formal education you need to "unlearn" that, so you actually start at a worse position than a complete virgin.

I think it's overall a good thing that the barrier to entry for development has been lowered. The more you need to know to program, the fewer people can do it. In certain fields, that can be problematic because it forces your programmer to be a jack of all trades. I've taken on projects from doctors and different types of engineers before where the vast majority of the early project time is spent giving me and others on the team a crash course in their industry, because the problems the programs are meant to solve are intimate to their field. Usually, the programming stuff is actually pretty rote.

Imagine how many books and how much cumulative knowledge we would miss out on if being able to write written language was a unique skill, for example.

Of course, every project is different. If you don't know how memory works, for example, you really have no business trying to do, say, graphics programming.
 

smash_robot

Member
Oct 27, 2017
994
That's an amazing anecdote, and I completely understand the teacher's point, but one has to wonder to what extent it isn't also a failure of education itself. Although if it was an introductory class, I guess that's to be expected.

They say sometimes having hapharadly learnt stuff yourself can be harmful and when you start getting a proper formal education you need to "unlearn" that, so you actually start at a worse position than a complete virgin.
Krejlooc made the point about modern programming languages and he is right. I do enterprise work, which is pretty much all in Java nowadays - you can be competent (if not great) at your job without really needing to care about details like that, for the most part.

It's also harder nowadays. I grew up in the UK's computing golden age - BBC Micros and the like. Things were simple enough that a child (like I was) could understand pretty much how the entire thing worked, because they were very simple architecturally and you had access to the hardware directly so you could play around - this is impossible to do today, even with something like a Raspberry Pi.
 

Deleted member 12790

User requested account closure
Banned
Oct 27, 2017
24,537
I remember doing some embedded work with the 68k being the microcontroller for an in-house hardware design - was a combination of C and asm. I enjoyed embedded work, but I don't miss having to do things stick a scope on the memory bus to find out why half the memory was missing. Although seeing the damn thing boot and have the scheduler blink a led on and off is a great feeling lol

The m68k is just overall a fantastic CPU to learn with. So many registers, and they are all huge, plus the architecture has many RISC-like properties, which makes it a great stepping stone.

Hands down my favorite CPU of all time.
 

Spectrum

Member
Oct 27, 2017
343
That was a very interesting read. Thanks.

Sadly I don't have many Computer Science courses in my university's electronic engineering program. While I know about computer architectures, my knowledge about data structures and ordering is quite basic. Do you have any good references?
 

Deleted member 12790

User requested account closure
Banned
Oct 27, 2017
24,537
It's also harder now-a-days. I grew up in the UK's computing golden age - BBC micros and the like. Things were simple enough that a child (like I was) could understand pretty much how the entire thing worked because they were very simple architecturally and you had access hardware directly so you could play around - this is impossible to do today, even with something like a raspberry pi.

I find working with modern GPUs feels very much like retro console programming; many concepts are similar, and you can even see where certain modern GPU features evolved from. Like Amiga copper programming - remarkably similar in concept to a fragment shader, only per raster line instead of per fragment. Even though you don't technically interact with the hardware directly, it still feels like you do lol. Especially with much lower-level APIs like Vulkan.
 

Deleted member 12790

User requested account closure
Banned
Oct 27, 2017
24,537
That was a very interesting read. Thanks.

Sadly I don't have many Computer Science courses in my university's electronic engineering program. While I know about computer architectures, my knowledge about data structures and ordering is quite basic. Do you have any good references?

I took an actual data structures course in college my... junior year, iirc? Don't remember the name of the book, but it was an awful class by an awful professor. Luckily, I was pretty much self-taught at that point.

I learned all this stuff from Sonic the Hedgehog hacking. This stuff is useful all over, so I'd say pick a concentration. Maybe study the tile image format of the Sega Genesis, and see how the Sonic games use quantization and custom structures to save space. Learning how to render a level in Sonic will teach you everything you need to know about memory and data structures.

pP7woPBr.jpg
 
Last edited:

smash_robot

Member
Oct 27, 2017
994
I find working with modern gpus feels very much like retro console programming, many concepts are similar and you can even see where certain modern gpu features evolved from. Like amiga copper programming - remarkably similar in concept to a fragment shader, only per raster line instead of per fragment. Even though you dont technically interact with hardware directly, it still feels like you do lol. Especially with much lower level apis like vulkan.
I've never worked with GPUs. The nearest I ever got to that kind of specialized programming was doing DSP work back in the 90s - and I expect that's probably closer to the stuff I do today than it is to GPU dev. I bet it's really interesting, but I doubt I have the maths required for it anymore.

Edit: Oh god, I sound like those guys who I used to know early in my career who I'd roll my eyes at when they talked about tech from the 70's.
 

Deleted member 12790

User requested account closure
Banned
Oct 27, 2017
24,537
I've never worked with GPU's. Nearest I ever got to that kind specialized programming was doing DSP work back in the 90's - and I expect that's probably closer to he stuff I do today than it is to GPU dev. I bet it's really interesting, but I doubt I have the maths required for it anymore.

In a lot of ways, it's just like talking to any other piece of hardware, which is why I said it's so applicable. I.e. use command ports, command buses, etc.

GPUs aren't really only for graphics these days anymore. They're like thousands of tiny computers inside your main computer, all running in parallel. It's fun learning to, for example, think of textures and framebuffer targets as nothing more than abstract data that can be read or written to as any other data is used.

Edit: Oh god, I sound like those guys who I used to know early in my career who I'd roll my eyes at when they talked about tech from the 70's.

I learned from people like that. I had a mentor at a PC Users club I belonged to circa 1998 who got me into SDL. He was a NASA programmer who specialized in FORTRAN. Over the course of a summer, he and I sat down and built an Ultima engine that ran off the original data files:

RqOmgKZ.png


8fXEBVk.png


Brilliant dude; every bit of advice he gave me regarding computing and graphics wound up being solid gold.
 

Sqrt

Member
Oct 26, 2017
5,880
This is a very interesting topic. I have never coded at such a low level of optimization, but as an anecdote, I can share one big problem that I had. For a numerical experiment, I went the lazy route to parallelize a process: I just made a single-threaded program and executed it several times with different setups until saturating the threads. The programs were taking much longer than expected (as in weeks), so I checked system usage: less than 3% for the CPU! The PC had 8GB of RAM and was using ~9GB, meaning the programs were hitting the disk drive to fetch data (virtual memory)... such a dumb issue. :S

This is also why languages that completely expose memory to the programmer, like C, are so incredibly powerful.

Completely agree with your point about needing to know hardware to know software well. I've advocated learning a retro machine for years.
I mainly code in Haskell, but my understanding is that no Random access memory takes care of the issue. Is that correct!? :O
 

Spectrum

Member
Oct 27, 2017
343
I took an actual data structures course in college my... junior year iirc? Dont remember the name of the book, but it was an awful class by an awful professor. Luckily, i was pretty much self taught at that point.

I learned all this stuff from sonic the hedgehog hacking. This stuff is useful all over, so id say pick a concentration. Maybe study the tile image format of the sega genesis, and see how the sonic games use quantization and custom structures to save space. Learning how to render a level in sonic will teach you every thing you need to know about memory and data structures.

pP7woPBr.jpg
I'm also mostly self-taught. Lately I've been doing some Windows executable modding, but it's mostly a bunch of hard-to-follow compiler-optimized instructions, so I've been meaning to get into programming some retro console, the NES or GB perhaps.