Hey, folks --
I got this via PM, and it provided such interesting insight into the bug situation that I thought I'd share it (with the sender's permission, of course):
Sorry for the delayed response, but I actually had to spend some time and put a reasoned reply down for you. Unheard of on the internet. It is rather long, but I don't have the time to go over it enough times to reduce it. My apologies up front.
Let's assume a reasonably standard application development environment. We have workstations with servers and a central repository of our product. We also have a list of licensed libraries that provide various levels of functionality we either don't have the expertise or the resources (money, time, desire) to develop ourselves.
Each developer is responsible for a subset of the work, and they work to a set of design documents to provide it. APIs are designed so functionality (internal or external) can support whatever combination of developers we have. Cool.
We build the product using the basic engine, with enhancements to support the crazy/neat feature set for our game.
There is another huge (these days I really, really mean huge) department that does art. Design, layout, textures, animation, character design, weapons and whatever else can be seen by the user. All this has to be put together to build the actual game environment.
Now we get to start sending it out to QA, which is play testing and not a whole lot else as far as I can tell.
In the OS world we tested with corner cases, automated scripts, and specific tests to verify that a particular bug was not re-introduced, as well as performance testing for extreme cases and memory-size variations. That is a lot of work. Add a new hardware card and all of this gets tested again. Large server companies hammered on Intel (probably AMD as well) to quit pushing new chips out every quarter. Why? Because it took them six months to test the last one. With a new processor every quarter their customers were getting grumpy about being two CPUs behind.
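To make the "a particular bug is not re-introduced" part concrete, here is a rough sketch of what one of those tests looks like. parse_count() and the bug it pins down are made up for illustration, but the shape is real: the failing input from the bug report becomes a permanent automated check.

    /* regression_check.c - sketch of a bug-pinning test. parse_count() and
       the bug it once had are hypothetical; the point is that the input
       from a bug report becomes a permanent, automated test. */
    #include <assert.h>
    #include <stdio.h>
    #include <string.h>

    /* Hypothetical fixed routine: count comma-separated items inside "[...]". */
    static int parse_count(const char *s)
    {
        if (s == NULL || s[0] != '[')
            return -1;               /* reject malformed input */
        if (strcmp(s, "[]") == 0)
            return 0;                /* the case the old code got wrong */
        int count = 1;
        for (const char *p = s; *p != '\0'; ++p)
            if (*p == ',')
                ++count;
        return count;
    }

    int main(void)
    {
        /* Hypothetical bug: "[]" used to report 1 item instead of 0.
           This assert keeps that bug from ever coming back unnoticed. */
        assert(parse_count("[]") == 0);
        assert(parse_count("[a]") == 1);     /* and the fix must not break */
        assert(parse_count("[a,b,c]") == 3); /* ordinary input             */
        assert(parse_count(NULL) == -1);
        puts("regression checks passed");
        return 0;
    }

Run that in the build every time and the old bug stays dead.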
When I transitioned into consulting I still had those habits. If my application could take an input file, one of my tests was to send the executable through as input. It could complain and give a million errors, or give up after some maximum number of errors, but it could not crash. And it had to start processing proper input afterwards without having to restart the application. I never delivered a bug to a customer. I wouldn't release the product to the customer until the bugs were gone.
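That executable-as-input test is easy to automate. Here is a rough sketch of the idea; load_config() is just a stand-in for whatever parser the application actually has, but the rule it enforces is the one above: garbage gets error codes, never a crash, and good input must still work afterwards in the same run.

    /* robustness_check.c - sketch: feed the parser a garbage file (here,
       the program's own executable), demand a clean error instead of a
       crash, then confirm good input still works in the same run.
       load_config() is a stand-in for the real application's parser. */
    #include <stdio.h>
    #include <stdlib.h>

    /* Hypothetical parser: 0 on success, nonzero error code otherwise. */
    static int load_config(const char *path)
    {
        FILE *f = fopen(path, "rb");
        if (f == NULL)
            return 1;                    /* can't open: report, don't crash */
        int errors = 0;
        int c;
        while ((c = fgetc(f)) != EOF) {
            if (c == '\0' || c > 126) {  /* binary junk is not valid config */
                if (++errors > 1000)     /* give up after a maximum, as above */
                    break;
            }
        }
        fclose(f);
        return errors ? 2 : 0;
    }

    int main(int argc, char **argv)
    {
        const char *garbage = argv[0];   /* the executable itself as input */
        const char *good = (argc > 1) ? argv[1] : "settings.cfg";

        if (load_config(garbage) == 0) {
            fprintf(stderr, "FAIL: binary garbage accepted as valid input\n");
            return EXIT_FAILURE;
        }
        /* Same process, no restart: proper input must still be usable. */
        printf("garbage rejected; '%s' -> %d\n", good, load_config(good));
        return EXIT_SUCCESS;
    }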
Now on to the current world, where we have a game to get out by Xmas or before we run out of money. The game has data in the GBs, and objects that are fixed in time and space as well as some that can be moved, destroyed, sold and even left in a different world than the one where they entered the environment.
Scripting a test solution will never cover what a human can do. We had a technician named Mary who could crash code consistently. She never knew how she did it. It took us forever to finally get a system in place that logged enough of what she did to repeat some of her bugs. She followed a script, and her fellow techs could follow the same script and never get a failure. Mary starts and, bang, crash!
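For what it's worth, "log enough of what she did" can be as simple as this kind of thing: append every raw input event to disk with an immediate flush, so whatever Mary did right before the crash is still there to replay. The event layout below is invented for illustration.

    /* input_log.c - sketch of a crash-replay input log; the event fields
       are hypothetical. Every raw input event is appended and flushed
       immediately, so the last few seconds before a crash are on disk and
       can be fed back through the game loop to reproduce it. */
    #include <stdio.h>
    #include <stdint.h>

    typedef struct {
        uint32_t frame;      /* frame counter when the event arrived */
        uint32_t device;     /* keyboard / mouse / pad identifier    */
        uint32_t code;       /* key or button code                   */
        int32_t  x, y;       /* pointer or stick position, if any    */
    } input_event_t;

    static FILE *g_log;

    static int input_log_open(const char *path)
    {
        g_log = fopen(path, "wb");
        return g_log ? 0 : -1;
    }

    static void input_log_event(const input_event_t *ev)
    {
        if (g_log == NULL)
            return;
        fwrite(ev, sizeof *ev, 1, g_log);
        fflush(g_log);       /* flush per event: slow, but the whole point
                                is that the data survives the crash */
    }

    int main(void)
    {
        if (input_log_open("input.log") != 0)
            return 1;
        input_event_t ev = { .frame = 1, .device = 0, .code = 42, .x = 10, .y = 20 };
        input_log_event(&ev);    /* the real game would call this per raw event */
        return 0;
    }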
Video games are completely driven by the millions of Marys (of either sex), and they can be very frustrating to resolve things for.
Those were/are the basics. Now to answer your question: How do the same bugs show up time after time?
Hire a team, drive a product to completion and fire the team. Start another team to drive the next product and we get the same bugs repeated.
Keep the team but change the framework, or just some of the libraries, and what worked with the old library will cause the new one to fail. It might even look like the same bug.
Don't document fixes. Don't have a coding standard. Don't let people get enough sleep. Don't develop complete automated test cases for internally developed functions. Don't have a good centralized bug database or a content control system.
Funny thing about operating systems: the developers never trust anybody. Check all error codes, verify all parameters and memory allocations. Those are standard practice in good OS coding, and not done in most other cases. Memory always works, writing to the hard drive always works, that library has never thrown an exception. Those assumptions will take a big bite out of your butt every time. Randomly, if you pissed off the wrong gods.
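Here is a small sketch of what that "never trust anybody" habit looks like in code. The names are generic, not from any particular codebase, but every parameter, allocation and write gets checked, because the one-in-a-million failure is exactly the crash QA can never reproduce.

    /* paranoid_save.c - sketch of OS-style "trust nobody" coding: validate
       the parameters, check the allocation, check every stage of the write.
       The function names are generic, not from any particular codebase. */
    #include <errno.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int save_blob(const char *path, const void *data, size_t len)
    {
        if (path == NULL || (data == NULL && len != 0))
            return EINVAL;                   /* validate all parameters */

        unsigned char *copy = malloc(len ? len : 1);
        if (copy == NULL)
            return ENOMEM;                   /* memory does NOT always work */
        if (len)
            memcpy(copy, data, len);

        FILE *f = fopen(path, "wb");
        if (f == NULL) {
            free(copy);
            return errno ? errno : EIO;      /* opening the file can fail too */
        }
        int rc = 0;
        if (fwrite(copy, 1, len, f) != len)  /* the disk does NOT always work */
            rc = EIO;
        if (fclose(f) != 0 && rc == 0)       /* even close can lose your data */
            rc = EIO;
        free(copy);
        return rc;
    }

    int main(void)
    {
        int rc = save_blob("demo.bin", "hello", 5);
        printf("save_blob returned %d (%s)\n", rc, rc ? strerror(rc) : "ok");
        return rc ? 1 : 0;
    }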
I bought Rage at release (discounted from Amazon, but on release day) because I had gotten 7.5 hours of discussions from Carmack over the last five years, plus another 10-15 hours of interviews. I figured I had at least 17 hours of time into the game before I bought it. Okay, pay the man. I wished for a more cohesive game, but I paid the man. One of the things he talked about was static checking of the code using an MS tool from the Xbox development toolset. It flagged probable bugs and produced a lot of coding-standard output, essentially a list of "don't do this: it can cause problems". Most game developers aren't anywhere close to doing this on their projects.
I don't use static checkers because I use rigid coding standards (I have tested them, and they don't improve my code). I never vary the coding style. I may update the standards, but once updated they never vary until it is time to update them again. That includes all the weird OS habits from above (never trust anybody, et al.).
Now assume we are in crunch mode and trying to get to a gold master. We have three major releases to get working together. Sony has an 8-9 GB limit (I think) on what is loaded onto the drive, MS has a different one, and PCs basically don't have a limit. However, all systems may have significant restrictions. Consoles have no memory. PCs have wildly divergent video, sound and input support, plus an unknown number of programs that may be running simultaneously with yours.
We have a huge art department, significant animation and game-logic teams, and probably two or three different teams driving the product releases. Developer Bob finds an issue on the PC and fixes it. It may or may not impact the console teams, and he may or may not inform them. Of course he could be fixing a couple dozen things a day, and the one really important one doesn't get documented clearly enough (or at all), so the console guys miss it.
id keeps their teams around. Their people grow and get better. Runic Games of Torchlight fame has a small team and intends to keep it small. Their code bases are pretty solid. Carmack's take on PC problems was interesting, though: on the PC you are too far from the hardware to consistently get 60 fps, while the less powerful consoles hit 60 fps more easily.
Now for the ultimate disappointment: I cannot fix this. Only good project management and reasonable solutions that let developers have a life can fix it. Bring in young engineers, train them, grow them. Build solid testing solutions for every module. Write design documents for the actual software, not just the game. Assume nothing, trust no one.
That is why I rant. I don't repeat bugs. I learn, find the root cause and teach myself not to do that again. Change the coding style to reflect it if necessary.
Sometimes I need to think more like you. Waiting until 2/7/12 or later to get the New Vegas DLCs is probably cutting off my nose to spite my face. I will have to rethink it after this discussion.
PS: On the Skyrim and PS3 issues: I would be surprised if we ever actually find out the real failure mode. We will just hear about a patch that fixed it. The problem, I suspect, is related to Sony encrypting the entire hard drive. They tie up one of the SPEs for it, and as the save files get bigger it takes longer and longer (probably exponentially, as that is a common mistake) to process the save file. So: is the hard drive nearly full and the save file large? The software mind says: if so, where is the temporary file stored? The QA mind says: does the QA department actually have a machine with a hard drive that is nearly full?
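On the "probably exponentially" guess: in my experience the common mistake is usually quadratic rather than truly exponential, i.e. something per save that touches the entire ever-growing save file, like copying the whole thing through a temporary just to append a little new state. A purely hypothetical sketch of the pattern, not Skyrim's or Sony's actual code:

    /* slow_save.c - sketch of the classic super-linear save bug: every save
       re-reads and rewrites the entire save file just to append a little
       new state, so save N costs as much as everything saved before it.
       This is a made-up illustration of the pattern. */
    #include <stdio.h>
    #include <stdlib.h>

    static int save_game(const char *path, const char *tmp_path, size_t new_bytes)
    {
        FILE *in = fopen(path, "rb");          /* may not exist on first save */
        FILE *out = fopen(tmp_path, "wb");     /* and where does THIS live when
                                                  the drive is nearly full?   */
        if (out == NULL) {
            if (in) fclose(in);
            return -1;
        }
        int c;
        if (in) {
            while ((c = fgetc(in)) != EOF)     /* copy ALL previous save data */
                fputc(c, out);
            fclose(in);
        }
        for (size_t i = 0; i < new_bytes; ++i) /* then append the new state */
            fputc(0, out);
        fclose(out);
        remove(path);                          /* swap the temp file in */
        return rename(tmp_path, path);
    }

    int main(void)
    {
        /* Ten saves: each one copies everything written so far, so the cost
           per save keeps climbing. O(total size) per save, O(N^2) over N. */
        for (int i = 0; i < 10; ++i)
            if (save_game("save.dat", "save.tmp", 1024 * 1024) != 0)
                return 1;
        return 0;
    }

Compile that and watch each save take longer than the last, and you have the shape of the bug without ever seeing the real code.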