
Deleted member 12790

User requested account closure
Banned
Oct 27, 2017
24,537
Several years ago, I wrote a GPU-accelerated tilemap system using OpenGL shaders. I wanted the entire rendering process to happen outside of the CPU, on the GPU. The concept was a simple atlas system. I would have tiles that were 8x8 pixels in a large tilemap in VRAM, then pass in a second texture called a block map. The block map was a series of what looked like random pixels. Each pixel would be point sampled by the GPU, iterating left to right, top to bottom. The red subchannel of each pixel would be a lookup table value for drawing the appropriate tile. So, like, pixel 1 on the block map might return a red color value of "1", meaning it would draw the first 8x8 tile in the tilemap. Pixel 2 might return a red color value of "3", which meant it would draw the third 8x8 tile in the tilemap. So far, so good. I would repeat this step: 8x8 tiles -> 64x64 blocks -> 256x256 chunks -> 1024x1024 level segments. This let me build extremely large levels while using just a tiny amount of VRAM; I could swap level segments in and out to create huge, seemingly endless levels with no breaks between them.

It worked fine on AMD and Nvidia GPUs, and seemingly most Intel GPUs. However, one day, while testing on a random integrated Intel GPU, I started noticing black lines running through my maps. Now, this is a fairly common error, caused by floating point precision when you hit a fringe area. The way the GPU reads pixels is through a process known as point sampling: you tell the GPU an area of a texture in memory to look at, but not in precise terms like "look at the pixel at location (1,1)". The way sampling works on GPUs is that you feed in a ratio and tell it to look at the sample there, i.e. look "1%" across and "1%" down. This is commonly known as UV sampling. I thought it might be a weird rounding error, but I was stumped as to why it would only show up on a single type of integrated graphics core, and only Intel.
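To make the lookup concrete, here's roughly what that block-map indirection looks like, written out as plain C instead of a shader -- all the names, sizes, and atlas layout here are my own illustration, not the actual code:

Code:
/* Sketch of the block-map -> tile-atlas lookup described above.
 * Names, sizes, and layout are illustrative assumptions. */
#define TILE_SIZE   8     /* each tile is 8x8 pixels    */
#define ATLAS_TILES 32    /* tiles per row in the atlas */

/* UVs are ratios, not pixel coordinates, so aim at texel centers
 * with the +0.5 offset. */
static float texel_center(int x, int tex_size) {
    return ((float)x + 0.5f) / (float)tex_size;
}

/* For an output pixel, find which block-map texel covers it, read
 * the red channel as a tile index, then compute the UV of the
 * matching pixel inside the tile atlas. */
void tile_uv(int px, int py,
             const unsigned char *blockmap_red, int bm_w,
             float *out_u, float *out_v)
{
    int tile = blockmap_red[(py / TILE_SIZE) * bm_w + (px / TILE_SIZE)];
    int tx = (tile % ATLAS_TILES) * TILE_SIZE + (px % TILE_SIZE);
    int ty = (tile / ATLAS_TILES) * TILE_SIZE + (py % TILE_SIZE);
    *out_u = texel_center(tx, ATLAS_TILES * TILE_SIZE);
    *out_v = texel_center(ty, ATLAS_TILES * TILE_SIZE);
}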

MONTHS of testing to fix this problem. I consulted many, many graphics engineers -- people from Blizzard, Valve, etc. People were honestly stumped by it. Eventually, I stumbled upon the cause: a very, very weird bug involving a very, very specific driver for an uncommon Intel graphics core, where point sampling is off by a small percentage. The first thing everyone suggested was "aim for the center of the pixel," which is what I was already doing. I'd sample at non-whole-number increments, i.e. not looking at the pixel at (1,1) but rather at (0.5, 0.5), but that didn't fix the problem.

Normally, when hunting down bugs like this on the CPU, you can use all sorts of debugging tools, like gdb, to stop execution and look at memory. But GPUs don't work like that; they are basically separate machines running side by side with your computer. You, the programmer, control the CPU and send commands to the GPU, but the GPU is its own box. Now, there ARE tools that let you debug the GPU, like RenderDoc, but this specific Intel driver would crash RenderDoc, so it was useless. The only way I could debug the GPU was with guerrilla methods.

When you have no debug tools available on the CPU side of things, you usually resort to placing "printf" statements throughout your code -- a call that makes your program print text to a console. You don't have printf on your GPU. The best equivalent is drawing single pixels to the screen. So I would draw a pixel to the screen whose color channels represented the values I wanted to inspect. If I wanted to see what I was sampling at a certain point in detail, I would have to relay that data as a series of pixels, take a screenshot, open it in GIMP, and read the colors with the color picker tool to find the numbers hidden inside.
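For a feel of how "printf via pixels" can work -- this is my guess at the mechanics, not the actual code -- you can pack a value into the 8-bit color channels of one output pixel, screenshot it, and decode the number from the sampled color:

Code:
/* Hypothetical sketch of "printf via pixels": pack a float in [0,1)
 * into 8-bit color channels so it survives a screenshot, then decode
 * it from the color picked out in an image editor. */
#include <stdint.h>
#include <stdio.h>

typedef struct { uint8_t r, g, b, a; } Rgba8;

/* Peel off successive base-256 "digits" of the fraction into R,G,B. */
static Rgba8 encode_debug(float v) {
    Rgba8 px;
    float x = v;
    x *= 256.0f; px.r = (uint8_t)x; x -= (float)px.r;
    x *= 256.0f; px.g = (uint8_t)x; x -= (float)px.g;
    x *= 256.0f; px.b = (uint8_t)x;
    px.a = 255;
    return px;
}

static float decode_debug(Rgba8 px) {
    return px.r / 256.0f
         + px.g / (256.0f * 256.0f)
         + px.b / (256.0f * 256.0f * 256.0f);
}

int main(void) {
    Rgba8 px = encode_debug(0.337f);
    printf("decoded: %f\n", decode_debug(px)); /* ~0.337 */
    return 0;
}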

Long story short, the sampling position on this driver was off by 0.25 in both directions. If I sampled at (0.5, 0.5), it would act as if I were sampling at (0.75, 0.75). That was juuuust enough, in very weird situations, to hit what is known as the diamond exit rule. So even though my math worked on other drivers, it wouldn't work right on that specific Intel driver. And while newer Intel drivers weren't subject to this error, not all of the older Intel integrated GPUs could run the newer driver.
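To put numbers on the failure mode (this arithmetic example is mine, not from the original post): point sampling effectively picks texel floor(u * N). A sample aimed dead-center in a texel has plenty of margin to absorb a quarter-texel bias, but a sample that interpolation has already pushed near a texel edge gets shoved across the boundary -- which is how you end up with stray lines at tile seams:

Code:
/* Illustrative only: a 0.25-texel bias flips which texel point
 * sampling selects when the coordinate is already near a seam. */
#include <math.h>
#include <stdio.h>

int main(void) {
    const float N = 8.0f;               /* texels per row            */
    float center = (3.0f + 0.5f) / N;   /* aimed at texel 3's center */
    float edge   = (3.0f + 0.9f) / N;   /* interpolated near a seam  */
    float bias   = 0.25f / N;           /* the buggy driver's offset */

    printf("center: %g -> %g\n",
           floorf(center * N), floorf((center + bias) * N)); /* 3 -> 3 */
    printf("edge:   %g -> %g\n",
           floorf(edge * N), floorf((edge + bias) * N));     /* 3 -> 4 */
    return 0;
}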

So, after months of figuring out the bug, the solution was... it was unsolvable. Drawing maps this way would have required a custom shader for that one specific graphics core. I said fuck that, and instead wrote a software rasterizer from scratch and presented it as an option, because if a bug like that could exist in one driver, it could exist in multiple drivers, and I might never catch them all. So, in case of extreme fringe errors, I provide a software fallback. After all that debugging and error hunting, I still had to basically write an entirely separate graphics module to solve it.
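The nice thing about a software fallback for a tilemap is that it can be a plain nested loop with integer indexing -- no UVs at all, so there's no driver sampling math left to go wrong. A bare-bones sketch (my own, with an assumed one-row atlas layout, not the actual rasterizer):

Code:
/* Minimal sketch of a CPU tilemap blit -- the rough shape of a
 * software fallback path, with an assumed one-row tile atlas. */
#include <stdint.h>

#define TILE 8

void blit_tiles(uint32_t *dst, int dst_w, int dst_h,
                const uint32_t *atlas, int atlas_w,
                const uint8_t *blockmap_red, int bm_w)
{
    for (int y = 0; y < dst_h; y++) {
        for (int x = 0; x < dst_w; x++) {
            int tile = blockmap_red[(y / TILE) * bm_w + (x / TILE)];
            int sx = tile * TILE + (x % TILE);  /* pixel in atlas */
            int sy = y % TILE;
            dst[y * dst_w + x] = atlas[sy * atlas_w + sx];
        }
    }
}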

This was for a "simple" 2D game.

Debugging is anything but simple. Any time someone says "lazy devs" when talking about bugs, I want to punch my computer monitor.
 

EVA UNIT 01

Member
Oct 27, 2017
6,735
CA
Did they fix that bug where I can't be in a loving mutual asexual relationship with Parvati?
Cause it would have meant the world to me.
=/
 

SJRB

The Fallen
Oct 28, 2017
4,861
Man, figuring this out must've felt SO GOOD.

I love reads like this.
 

Deleted member 61326

User requested account closure
Banned
Nov 12, 2019
614
I don't get the explanation. Yes, no new interactions can be started while in conversation, but what about when the conversation ends?

Interaction 1: enter ladder
Conversation starts, NPC can not exit ladder because it's a new interaction
Conversation ends
Interaction 2: sweet, let's go
 

TronLight

Member
Jun 17, 2018
2,457
I don't get the explanation. Yes, no new interactions can be started while in conversation, but what about when the conversation ends?

Interaction 1: enter ladder
Conversation starts, NPC can not exit ladder because it's a new interaction
Conversation ends
Interaction 2: sweet, let's go
That's the bug: the NPC doesn't stop climbing while you're in the conversation and gets way too high above the ground; then you stop talking, the NPC gets off the ladder, and they fall to their death.
 

Deleted member 12790

User requested account closure
Banned
Oct 27, 2017
24,537
I don't get the explanation. Yes, no new interactions can be started while in conversation, but what about when the conversation ends?

Interaction 1: enter ladder
Conversation starts, NPC can not exit ladder because it's a new interaction
Conversation ends
Interaction 2: sweet, let's go

There exists what is known as an event handler in games that works as a queue. Event handlers are good because they let multiple parts of the game throw all their actions into a giant list at once, then let different modules "gobble" up each event as it goes from one handler to another. This is how things like joystick polling are done: your joystick throws an event, then the game says, like, "are we in a menu? If so, menu handler, gobble this event. If not, are we on the ground? If so, ground interaction handler, gobble this event. If not, are we in the air? If so, air handler, gobble this event," and so forth. It's done so that the queue shrinks as different modules handle the events relevant to them. Events are "gobbled" (i.e. removed from the queue, and that specific handler has code deciding what to do when it gobbles such an event).

He says that they block the event when in a conversation, which reads to me as the conversation handler "gobbling" all events regarding furniture interactions to prevent them from happening. So when a character enters a ladder, an event is put into the queue, and the "ladder handler" says, "OK, make the character start climbing." When the character reaches the top of the ladder, it throws an event, and the "ladder handler" is supposed to say, "OK, now stop climbing." What happens is that the "conversation handler" gobbles the event first, which it's supposed to do, and says "do nothing." So the event that causes the character to stop climbing never reaches the right handler. Apparently the event to stop climbing only fires when you reach the top of the ladder. Since the character never exits the ladder, it keeps climbing and never throws the event again. It just climbs infinitely, never reaching another point where it would send another event to stop climbing. The conversation handler ate the ladder handler's lunch.
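If it helps to see the shape of it in code, here's a toy version of that handler chain -- completely made-up names, not the actual game's code. Notice how the conversation handler eats EV_LADDER_STOP before the ladder handler ever sees it:

Code:
/* Toy chain-of-handlers event queue: the conversation handler
 * "gobbles" furniture events, eating the stop-climbing event the
 * ladder handler was waiting for. All names are invented. */
#include <stdbool.h>
#include <stdio.h>

typedef enum { EV_LADDER_START, EV_LADDER_STOP } EventType;

static bool in_conversation = true;   /* the problem state */
static bool climbing = false;

/* Each handler returns true if it gobbled (consumed) the event. */
static bool conversation_handler(EventType ev) {
    if (in_conversation &&
        (ev == EV_LADDER_START || ev == EV_LADDER_STOP))
        return true;                  /* blocked: do nothing */
    return false;
}

static bool ladder_handler(EventType ev) {
    if (ev == EV_LADDER_START) { climbing = true;  return true; }
    if (ev == EV_LADDER_STOP)  { climbing = false; return true; }
    return false;
}

static void dispatch(EventType ev) {
    if (conversation_handler(ev)) return;   /* first in the chain */
    if (ladder_handler(ev))       return;
}

int main(void) {
    climbing = true;            /* NPC is already on the ladder       */
    dispatch(EV_LADDER_STOP);   /* fired once, at the top             */
    printf("climbing: %d\n", climbing);  /* still 1: event was eaten  */
    return 0;
}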
 

MegaXZero

One Winged Slayer
Member
Jun 21, 2018
5,079
Wow, I remember this bug, but I thought it was because I didn't do her companion quest fast enough. This explains it.
 

Deleted member 61326

User requested account closure
Banned
Nov 12, 2019
614
There exists what is known as an event handler in games that works as a queue. […] The conversation handler ate the ladder handler's lunch.

Thanks! Makes perfect sense. Fun to see the same architecture/design problems in games that I have encountered myself.

Using (asynchronous) events that other systems can act on is a great way to decouple systems, but it creates the problem of "what if the event gets lost for some reason?" That could be due to network errors, a message broker that screwed up, a programming error, etc. Sometimes that's OK, because you can expect a new event in the near future.

For important one-off events, we therefore also implement polling whenever possible. The event is used 99.99% of the time to make processes real-time, with a polling solution (usually once per hour or so) to catch glitches and automatically fix them. Twice the work, but simultaneously more robust and more effective.
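The pattern is easy to sketch (my own illustration with invented names): the event is the fast path, and a slow reconciliation pass re-derives the desired state from the source of truth and repairs anything a lost event left stale:

Code:
/* Sketch of "events for real-time, polling as a safety net".
 * Names and the Task shape are invented for illustration. */
#include <stdbool.h>

typedef struct {
    bool should_be_done;   /* source of truth                */
    bool marked_done;      /* state normally set by an event */
} Task;

/* Fast path: an event arrives and updates state immediately. */
void on_task_done_event(Task *t) {
    t->marked_done = true;
}

/* Slow path, run e.g. hourly: repair tasks whose event was lost. */
void reconcile(Task *tasks, int n) {
    for (int i = 0; i < n; i++) {
        if (tasks[i].should_be_done && !tasks[i].marked_done)
            tasks[i].marked_done = true;
    }
}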
 

Deleted member 12790

User requested account closure
Banned
Oct 27, 2017
24,537
For important one-off events, we therefore also implement polling whenever possible. The event is used 99.99% of the time to make processes real-time, with a polling solution (usually once per hour or so) to catch glitches and automatically fix them. Twice the work, but simultaneously more robust and more effective.

Usually in game development, since so much is dependent on framerate, you are limited in how much you can do in any one frame. Continuous polling might be safer and fine for standard applications, but it could be too slow for a game, which needs to operate at a high frequency. You'll usually have multiple systems operating at different tick rates in games to solve this, but something like this is probably non-critical and thus doesn't get polled more than once. You have to pick and choose how and when to handle boundary cases in game development. Sometimes you think you don't need any boundary checks because you assume something will work right every time -- until it doesn't, and you wind up with stuff like this. You assume there's no way that event will ever be lost, and when you run into weird collisions like this, you end up hunting for seemingly impossible situations. It can be super, super frustrating when you think your logic is airtight in isolation but can't account for all sorts of other systems blocking each other.
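Those different tick rates usually come from a fixed-timestep loop with one accumulator per system; here's a bare-bones sketch with made-up rates (not from any particular engine):

Code:
/* Bare-bones sketch of systems ticking at different rates in one
 * loop: "physics" at 60 Hz, a slow 1 Hz "janitor" poll for stuck
 * state. Rates and names are invented for illustration. */
#include <stdio.h>
#include <time.h>

/* clock() is CPU time; a real game would use a monotonic wall clock. */
static double now_seconds(void) { return (double)clock() / CLOCKS_PER_SEC; }
static void tick_physics(void) { /* movement, collisions, ... */ }
static void tick_janitor(void) { puts("janitor: checking for stuck NPCs"); }

int main(void) {
    const double PHYS_DT = 1.0 / 60.0, JAN_DT = 1.0;
    double phys_acc = 0.0, jan_acc = 0.0, last = now_seconds();

    for (long frame = 0; frame < 100000000L; frame++) {
        double t = now_seconds();
        phys_acc += t - last;
        jan_acc  += t - last;
        last = t;

        while (phys_acc >= PHYS_DT) { tick_physics(); phys_acc -= PHYS_DT; }
        while (jan_acc  >= JAN_DT)  { tick_janitor(); jan_acc  -= JAN_DT; }
    }
    return 0;
}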

Oh man, and then you throw threading into the mix with certain systems, which literally block each other, and out of order execution, and so forth. It can be really insanely complex.
 

Kaguya

Member
Jun 19, 2018
6,408
Debugging is nothing simple at all. Anytime someone says "lazy devs" when talking about bugs, I want to punch my computer monitor.
Every time someone says "Did they not have QA?" or "How did QA pass this?", I wanna strangle people.
People don't expect games to ship bug-free, especially huge ones; I don't remember seeing any complaints about the state OW shipped in. There are still other games that shouldn't have shipped when they did, and the first thing most people blame is the publisher not giving QA enough time, rather than the devs.
 

Deleted member 12790

User requested account closure
Banned
Oct 27, 2017
24,537
Dunno, I see it thrown around a lot (mostly stupidly) at other things like graphics, shitty animation, missing Pokémon, etc... but rarely at bugs; it's always QA and publishers that get blamed for those!

I saw it pretty recently on this very forum, when people were talking about how Bethesda are "lazy devs" for "reusing the same bug-filled engine" instead of "fixing the bugs."

You usually see it when people are talking about how "shitty" an engine is. It pops up when they don't understand that those "shitty" bugs are likely a compromise made to avoid even bigger, shittier bugs that would be far more egregious.
 

Mazz

Member
Nov 11, 2017
374
I spend most of my time debugging software; I can only imagine what kind of hell it is debugging a massive game with edge cases like this.
 

Deleted member 12790

User requested account closure
Banned
Oct 27, 2017
24,537
I spend most of my time debugging software; I can only imagine what kind of hell it is debugging a massive game with edge cases like this.

It depends on how you write your stuff. Monolithic pieces of code can be very hard to debug, so keeping things broken into pieces can be a good thing. On the other hand, if you break things into pieces that are too small, following the execution of your code becomes like tracing spaghetti, and that gets confusing (it can also hurt performance, due to the way jumping around in memory during execution works).

Debugging tools are worth their weight in gold.
 

Princess Bubblegum

I'll be the one who puts you in the ground.
On Break
Oct 25, 2017
10,293
A Cavern Shaped Like Home
Several years ago, I wrote a GPU-accelerated tilemap system using OpenGL shaders. […]

Debugging is anything but simple. Any time someone says "lazy devs" when talking about bugs, I want to punch my computer monitor.
As someone with even just some basic programming experience, I empathize. Debugging is a fucking nightmare.
 
Oct 27, 2017
4,534
This resonates with my work as a QA Lead in the finance industry. Sometimes you find weird-as-fuck bugs that make no sense until after tons of investigating and trying odd little edge-case scenarios.
 

Filipus

Prophet of Regret
Avenger
Dec 7, 2017
5,131
I've had some crazy testing classes this last semester in college, so I'm full of empathy for this guy. Great read.