• mozz
    link
    fedilink
    29
    edit-2
    11 months ago

    The game was crashing. Not a lot, but about once a day the game would randomly just explode. No particular rhyme or reason to the stack trace or what was going on at the time. Just, boom!

    We obviously can’t ship that way. It was getting kinda late in the development cycle, enough that this made everyone nervous, and I was tasked with tracking the thing down. They said we hope you find it fast, but fast or slow, you have to find it because we don’t really have a choice.

    I spent about a week trying to see if it was related to any particular thing. I started disabling subsystems to track it down. No audio, wireframe graphics, simple test level, remove all equipment or controllers we don’t need. Nope. At this stage you start having those tense conversations like “Maybe we don’t NEED level 6.3” when fiddly bugs start to show up, but it was irrelevant, because nothing we could remove would solve it. I made some level of attempt to run back in the source control to see if I could binary-search to where it first showed up, but the game crashing isn’t exactly unheard of and it was fiddly to even reproduce the thing in the first place. That was what made it so insanely slow to track down. No luck from any of this.

    After several days, I understood that I wasn’t in for a typical debugging session. What I decided on was to embark on making a build of the game that was byte-for-byte identical from run to run. No audio, no control input, constant RNG seed, track down anything that might make it deviate or anything time- or IO-dependent. That took about another week, but at the end, I had my build. Every byte of memory was the exact same from run to run; every address, every value. (This was a console game so this task was easier than it otherwise would have been.)

    So we’re two weeks in now. By dumb luck, I managed to replicate the bug almost immediately once I had the repeatable build. The morbidly simple level and setup that reproduced it quickly was: Start the repeatable build with a script running that would spawn the player in a bare room with a floating gun. The gun fires, killing the player. The player drops to the ground, and then the game crashes. Every time.

    I got you now, you son of a bitch. Happy that the thing was trapped now, unable to flee into its stochastic wilderness, I began to live in an endless world of execution-style slayings.

    What followed was actually the longest part. As I said, the crash had no real pattern, but I now had the ability to go backwards in time. I could control everything. I took any wrong-looking values from the memory at the time of the crash, and started setting watchpoints on their memory addresses. How had it gotten this way? What touched this piece of memory? The process of single-stepping through, as a single piece of memory was allocated, used, freed, and then reallocated, trying to understand it all and find the wrong bit, was intense.

    My memory of the following couple of weeks is honestly a little fuzzy. I know I was able to track down, by single-stepping through every single place the offending memory value was touched, to definitively work out what value it was that had had an impossible value, how it had caused the crash, and every single place that value had been touched, until I arrived at the offending code. The problem was… the bug wasn’t in the code I was looking at. It had had some weird impossible value in some of its data that had caused it to set up something else wrong.

    Oh. Oh no. Do I have to do it again? By now I was three or four weeks in.

    I did do it again. Like I say I don’t remember clearly, but I remember that it took me about 6 weeks in total and I filled up the majority of one little notebook that I kept next to the computer to write notes in. Little hex addresses, stack traces, clues that I might need to flip back to later. No matter. After tracing back all the breakage, I finally arrived at the code that was actually causing it. Once I actually saw what was happening, it was easy to work out.

    At some point, someone had added a cache for some expensive-to-compute data for actors in the world. The cache had a mapping of actors to cached-value matrices, and each actor had a single value that was a pointer to its cached value, or NULL if it had none. I think they were too big to store indefinitely? There was a little LRU cache. But, if an actor was destroyed, it wasn’t removed from the cache. So its cached data would sit there in the LRU cache, like a little time bomb, until the thing expired, and then the last part of its removal from the cache would be… unsetting the pointer to cached data in the now-dead actor back to NULL. The actor’s memory had already been freed, and so one int, somewhere at some random location in memory, would get set to 0. And once in a while, it would happen to be at a location where that would cause serious problems (usually leading quickly to a crash).

    The fix was 1 line.

    • @[email protected]
      link
      fedilink
      411 months ago

      Do you think address or thread sanitizer would have been able to catch this? It sounds like it would.

      • mozz
        link
        fedilink
        411 months ago

        Hm… not sure it would have. Something simple like overwriting freed memory with an identifiable pattern wouldn’t have helped, because by the time the crash happened, the actor’s memory had been reallocated and was being used for something else.

        This was on the Wii, which had some pretty bare-bones development tools, so it wouldn’t have been real easy to try to use something like valgrind to do more sophisticated memory debugging or anything. It was basically single-threaded, too, BTW (I think audio and input were based on interrupt handlers), and I made it more so with the repeatable build.

      • mozz
        link
        fedilink
        211 months ago

        Yeah. I was proud of myself. 😃 IDK how easy it is now with all these friendly toolkits, but back at that time and definitely for console development, there wasn’t a lot of way around really knowing your shit. Even having a degree, it definitely helped, but even that wasn’t really a full preparation for some of the difficult stuff I ran into. I wish CS prepared people better for real-world engineering; in particular I wish reading big codebases was more of the curriculum. I actually came out of school still with fairly bad working-with-production-code abilities.

        Separate related story, I was disagreeing with my boss about whether to hire some guy. My boss said, but his code is really bad. I said, yeah but he’s right out of school so it’s to be expected. When I first graduated my code was really bad. My boss got sort of distant for a second while he was remembering, and then his face came alert again and he said, all animated, “Yeah! It was. It was terrible!” I said yeah, I know, then I learned what I was doing and it was better. It’s just sort of how it works.

    • @luxyr42
      link
      fedilink
      English
      111 months ago

      Mem stomps are the worst. At least nowadays we have address sanitizer to find em. We recently had one where the same 4 byte pattern was being written randomly in different places in memory, would happen all over the place. Always the same 4 byte pattern, just different places. Eventually, it would write to a spot that was being used and cause a crash. Different callstacks almost every time, but the same memory footprint wherever the crash happened. An array size, a memory address, a string mangled, etc. Eventually we got our ASan build working after about a month of trying to track it down, digging through callstacks and core dumps. We found that it was a dangling pointer in our AI system, when an AI was removed, there was a situation where the pointer wouldn’t always be cleaned up, then later when another AI was removed, a boolean and an enum were written to the address of the dangling pointer, always the same format/value. which had haunted us for so long.