A few days back, we had an outage at Relio.

Outage
Check out the dip on Friday
Simply put, Rob took down one of the 2 routers we use to upgrade it from FreeBSD to OpenBSD for better IP usage. With OpenBSD we can save 2 IPs per allocation. Currently, each time we allocate an IP block, we set aside 3 IPs: one for router 1, one for router 2, and one for the gateway.

With OpenBSD on the routers, we only need to use 1 IP. How it works, I leave to the sysadmins, but it has the potential to save us loads of IPs in the long run, and since we are getting new IPs from ARIN soon, it's worth doing now.
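To put rough numbers on that saving, here is a minimal sketch of the arithmetic. The 3 and 1 reserved addresses come from the setup described above; the /28 block size, the 192.0.2.0 documentation range, the count of 100 allocations, and the function names are all made up for illustration, and the sketch assumes the network and broadcast addresses of each block are unusable as well.

```python
import ipaddress


def customer_addresses(block: str, reserved: int) -> int:
    """Addresses left for customers after the infrastructure reservations."""
    network = ipaddress.ip_network(block)
    # Assumption: network and broadcast addresses cannot be handed out.
    usable = network.num_addresses - 2
    return usable - reserved


def addresses_saved(allocations: int, before: int = 3, after: int = 1) -> int:
    """Total addresses freed when each allocation reserves `after` instead of `before`."""
    return allocations * (before - after)


# Hypothetical /28 block: 16 addresses, 14 usable.
print(customer_addresses("192.0.2.0/28", reserved=3))  # 11 with the old setup
print(customer_addresses("192.0.2.0/28", reserved=1))  # 13 with the new setup
print(addresses_saved(100))                            # 200 over 100 allocations
```

On small blocks the difference is noticeable: in this made-up /28, two extra addresses is roughly an 18% gain in what can actually be handed out.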

So, one of the routers was taken down, which meant that while the upgrade was being performed, all traffic was routed through the second router and a single provider (Time Warner, TW). And that's when trouble happened: all of a sudden, midway through the upgrade, the TW link went dead, and with the other router down, we had no way to go out through the other providers as usual.

What are the odds of it happening? Very slim, but it happened. So the guys had to get on the phone with TW and get the link back up. We were dead in the water for about an hour; eventually the TW link was restored and we were back in business. Since then the second router has been upgraded and we are back at 100%, with a bruised ego.

It just goes to show: you work with a system for years, it's so reliable that you think "I can rely on it for a few hours" while you take another system down for an upgrade, and BINGO, your uber-reliable system goes down the one time you need it up.

Is it karma? Cosmic misalignment? An ex-wife's curse? Who knows.

One thing this episode made me realize: we have (or had) a big flaw in our system. Even though the US DC was the only one down, the UK went down as well. Why? Because even though we have a great DNS system, with a separate physical server for each DNS as the book recommends, both DNS servers sit in the same datacenter. With it down, it was a matter of minutes before some ISPs started to drop the sites in the UK: the sites were up, but they had no DNS. So we are now moving the DNS2 server to the UK to prevent it from ever happening again.
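The flaw is easy to see once you write it down. Here is a small sketch, not our actual tooling: the server names `dns1`/`dns2`, the location labels `US`/`UK`, and the function names are just labels picked for illustration. It checks which nameservers would still answer if a given datacenter dropped off the net.

```python
from typing import Dict, List


def surviving_nameservers(ns_locations: Dict[str, str], down_dc: str) -> List[str]:
    """Nameservers that stay reachable when one datacenter goes offline."""
    return sorted(ns for ns, dc in ns_locations.items() if dc != down_dc)


def is_single_point_of_failure(ns_locations: Dict[str, str]) -> bool:
    """True when every nameserver lives in the same datacenter."""
    return len(set(ns_locations.values())) < 2


# Layout during the outage: separate boxes, same building.
before = {"dns1": "US", "dns2": "US"}
# Planned layout: DNS2 moved to the UK.
after = {"dns1": "US", "dns2": "UK"}

print(surviving_nameservers(before, "US"))  # [] -> nobody answers DNS
print(surviving_nameservers(after, "US"))   # ['dns2']
print(is_single_point_of_failure(before))   # True
print(is_single_point_of_failure(after))    # False
```

Separate physical servers protect against a box dying, but not against the whole location losing connectivity, which is exactly what bit us.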

The good thing was that the forums at http://www.relioforums.com sit both off-network and off-DNS, so they were up and people were able to get information during the outage.

After 7-8 years in this business, you think you know it all and have seen it all, then something comes along that smacks you right in the face and tells you never to forget that SH*T happens and you always need to be prepared.
