On 3/26/11, Forrst went down at around 7:00pm CST and stayed down until around 9:45pm CST. It appears that at around 7:00pm last night our Redis server became unavailable to our front-end server for around 45 seconds. We’re still investigating why Redis went away, but for whatever reason, it was gone.
Normally, you would all have seen the Error Bear for about 45 seconds and things would have gone back to normal, but instead the white screen of death was showing. This happened because Redis went away in the middle of a request (the person making that request most likely saw the Error Bear). This caused PHP to shut down incorrectly without calling our custom shutdown handler, which we believe left a file pointer to one of our log files open.
That open file pointer then caused our PHP framework (Magnus) to die during startup, because it couldn’t write to the log file in question. It was a simple oversight on our part not to check whether our log file was writable (which we have now remedied).
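As a rough illustration of that kind of check (this is not Magnus code, which is PHP and isn’t shown here), here is a minimal Python sketch. The function name `build_logger`, the logger name, and the choice to fall back to stderr are all assumptions made for the example; the idea is only that an unwritable log file should not stop the application from starting.

```python
import logging
import sys


def build_logger(log_path: str, name: str = "app") -> logging.Logger:
    """Return a logger that writes to log_path, or to stderr if that fails.

    Hypothetical helper for illustration; not part of any real framework.
    """
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()  # start from a clean slate on repeated calls

    try:
        # FileHandler opens the file immediately, so an unwritable path
        # (bad permissions, missing directory) raises OSError right here.
        handler = logging.FileHandler(log_path)
    except OSError:
        # Degrade instead of dying: keep logging, just somewhere else.
        handler = logging.StreamHandler(sys.stderr)

    logger.addHandler(handler)
    return logger


if __name__ == "__main__":
    log = build_logger("/var/log/example/app.log")  # assumed path
    log.info("bootstrapping continues even if the log file is unusable")
```

The design choice is that a logging failure degrades to a noisier but working state, rather than taking the whole startup path down with it.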
It took us almost 3 hours to diagnose this problem because we initially believed it to be a networking issue between servers, as that had happened to us earlier in the year, and we started triaging based on that (incorrect) assumption.
So, going forward: we’ve fixed the issue where Magnus dies when it can’t write to a log file (for example, one with bad or wrong permissions), and we’re going to reduce the memory footprint that Redis is using (the probable cause of the crash).
We’re extremely sorry for the downtime and have worked to ensure that this issue won’t occur again.
TL;DR: Redis went away for about 45 seconds, our PHP framework couldn’t write to its log file, and it died during bootstrapping, causing the white screen of death.