5 hours of downtime

My server had 5 hours of downtime today. It was completely my fault and I’m very sorry.

I don’t take this lightly. I care very much about uptime: in most months my uptime is above 99.9%, and even in “bad” months I keep it above 99%. An outage this long has happened only once before since I stopped using shared hosting (that one was my fault too; I learned from it and haven’t made that mistake again).
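To put those percentages in perspective, here is a quick back-of-the-envelope sketch. It assumes a 30-day month, and the function names are made up for illustration; nothing here comes from a monitoring tool.

```python
HOURS_PER_MONTH = 30 * 24  # assumption: a 30-day month


def allowed_downtime_minutes(target_percent, hours_in_month=HOURS_PER_MONTH):
    """Minutes of downtime a month can absorb while still meeting target_percent."""
    return hours_in_month * 60 * (1 - target_percent / 100)


def uptime_after_outage(downtime_hours, hours_in_month=HOURS_PER_MONTH):
    """Monthly uptime percentage if the only downtime was a single outage."""
    return 100 * (1 - downtime_hours / hours_in_month)


print(round(allowed_downtime_minutes(99.9), 1))  # about 43 minutes
print(round(allowed_downtime_minutes(99.0), 1))  # about 432 minutes (7.2 hours)
print(round(uptime_after_outage(5), 2))          # a 5-hour outage on its own
```

Under that assumption, a single 5-hour outage uses up far more than a 99.9% month allows, while still leaving the month above 99% if nothing else goes wrong.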

This is the story of what happened. It is not an excuse, just an account of the events.

I had a lot of updates (security and others) to install, so I tried to do the responsible thing:

  • I chose to do it on Sunday, when it’s very early in the morning in the US (where most of my customers are)
  • I checked every update to make sure it was safe to install on the server
  • I made sure I could reboot and troubleshoot the server remotely before I started
  • I prepared backups
  • I know from experience that the server takes about one minute to reboot, and I know I can install updates while the server is running, so I thought I could install everything without any noticeable downtime.

I was very wrong.

Everything started out according to plan: the updates were installed while the server was running, and I rebooted to apply them.

Then everything went bad.

First, after a minute of rebooting, the server came up and showed “Configuring updates: Stage 3 of 3 100% Do not turn off your computer” for over 2 hours.

Then it just stopped responding to anything, practically disappearing from the internet.

Remember I said I could troubleshoot and reboot the server remotely? This is the point in the story where the system that allowed me to do so crashed.

Then the server started rebooting continuously (a sign of life, good, something I can work with) and the remote control system came back to life (even better).

I took the server offline and started restoring it from backup (the restore failed, obviously). I then tried to restart the server, and it resumed the reboot cycle, so I tried to take it offline again. At that point the remote control system crashed again.

The next time the remote control system came back, only partially and showing error messages, it started the server, which surprisingly just started working.

So, what was the cause of the downtime? Probably one of the updates I installed destabilized the system. That means I made a mistake when I checked that the updates were safe, and I should do a better job of it next time.

Also, I should have installed the updates in small batches, so that applying them wouldn’t have taken several hours.

And I did get some things right: I had backups, and I did it at a time when most of my customers are asleep (I didn’t get a single e-mail complaining about the downtime, so I hope no customers were affected).

I’ve learned my lessons. Next time:

  • More backups, and backups of the backups, and another working copy of everything on my computer.
  • Go slow – install the updates in tiny batches.
  • Be much more careful when checking that the updates are safe to install.
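The “go slow” idea can be sketched roughly as follows. This is only an illustration of the batching logic, not the procedure used on the server: `install` and `is_healthy` are hypothetical callbacks standing in for whatever actually applies a batch (including any reboot) and whatever check confirms the server is still responding.

```python
def batches(items, size):
    """Yield consecutive slices of `items`, each with at most `size` entries."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def install_in_batches(updates, size, install, is_healthy):
    """Apply updates a few at a time, stopping at the first unhealthy batch.

    `install` and `is_healthy` are hypothetical callbacks (assumptions for
    this sketch). Returns (updates applied cleanly, the suspect batch);
    the suspect batch is an empty list if everything went fine.
    """
    updates = list(updates)
    applied = []
    for batch in batches(updates, size):
        install(batch)
        if not is_healthy():
            # Stop here: the problem is somewhere in this small batch,
            # and the remaining updates are left untouched.
            return applied, batch
        applied.extend(batch)
    return applied, []
```

If something breaks, only one small batch needs to be investigated or rolled back, instead of guessing which of dozens of updates was responsible.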

And again, if you were affected, I’m very sorry.

posted @ Monday, April 2, 2012 12:45 AM

