The importance of failure
Henry Wadsworth Longfellow once said “Sometimes we may learn more from a man’s errors, than from his virtues.” I guess we can adapt that quote to our times: these days we learn more from systems when they fail than when they are working fine .. such was the case today.
We had already tested the VPS Cloud self healing measure by shutting down a Hypervisor’s services, and every time the VPS nodes residing there would automatically boot up on a different Hypervisor in the Cloud … always in under 40 seconds (yes, 40 seconds) … but we decided that was no fun, it was time to go to the datacenter and pull some cables … yes .. I love my job 🙂
So there we were, Paul (head of IT @ UK2) and I. We located the Hypervisor we wanted to test and enabled monitors for the main NIC and the VPSs’ IPs … all systems go, the anticipation of success in the air, our palms sweating, giggling like schoolgirls .. and “click” … we pulled the cable …
We then started counting until the VPSs were back up elsewhere in the VPS Cloud … 1, 2, 3 … 40 .. ok, any time now … 60 … 90? .. wait a second … what is wrong here?
3 minutes later, we looked at each other in horror … the unthinkable had happened … the VPS Cloud self healing feature .. one of the cornerstones of our offer .. had failed!
Luckily we are still in beta testing .. but what had happened? It always worked when we shut the services down, so why did it fail when we pulled a NIC cable instead? What was the difference? .. and so began today’s “must know” task.
We started by looking at the logs. Nothing strange there, actually, nothing there at all for the past few minutes .. that’s when it hit us .. there was indeed NOTHING in there: the logs showed that no downtime had been detected for any Hypervisor. We returned to the Admin CP, and sure enough .. it still reported that system as being up .. but how?
Well, as is so often the case, you look through a few hundred lines of code until you decide to look at the obvious instead … could the internal monitor daemon have failed? … and if it had failed .. why were we not notified? .. wait .. what monitors the monitor?
A simple schoolboy error. We had all sorts of monitors, bells and whistles .. if anything in the VPS Cloud fails, it gets detected in 5 seconds or less .. a true example of monitoring excellence … but what we forgot was .. what if the monitor itself fails?
This brings me back to the topic of this post, the importance of failure. There was no difference between our tests .. pulling the cable or shutting down the services should both lead the VPS Cloud monitor to kick in and do its job … it was just a coincidence that this time the monitor daemon had hung. Had the failure not happened, this simple oversight could have caused trouble later on, so yes, failure is good .. as long as it happens during beta testing ;).
So we rewrote parts of the daemon to make it more robust (we found what caused it to hang and fixed it) and implemented extra monitoring procedures, so now we monitor the monitor too 😉
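For the curious, here is roughly what “monitoring the monitor” can look like. This is only a minimal sketch in Python of a generic heartbeat watchdog, not our actual daemon code: the names `Heartbeat` and `watchdog_check`, the 15 second limit and the alert callback are all made up for illustration. The idea is that the monitor records a timestamp after every completed check cycle, and a separate, independent process raises an alert when that timestamp goes stale.

```python
import time
from typing import Callable


class Heartbeat:
    """Records when the monitor daemon last completed a check cycle.

    Hypothetical helper for illustration; the clock is injectable so the
    logic can be exercised without waiting in real time.
    """

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._last_beat = clock()

    def beat(self) -> None:
        # Called by the monitor daemon at the end of every check cycle.
        self._last_beat = self._clock()

    def age(self) -> float:
        # Seconds elapsed since the monitor last reported in.
        return self._clock() - self._last_beat


def watchdog_check(hb: Heartbeat, max_age: float,
                   alert: Callable[[str], None]) -> bool:
    """Return True if the monitor looks alive, otherwise alert and return False.

    Meant to run from a separate process (or host) than the monitor itself,
    so that a hung monitor cannot also silence its own watchdog.
    """
    age = hb.age()
    if age > max_age:
        alert(f"monitor daemon silent for {age:.0f}s (limit {max_age:.0f}s)")
        return False
    return True


if __name__ == "__main__":
    # Simulated run: the monitor stops beating, the watchdog notices.
    fake_now = [0.0]
    hb = Heartbeat(clock=lambda: fake_now[0])
    fake_now[0] = 180.0  # three minutes of silence, like our cable test
    watchdog_check(hb, max_age=15.0, alert=print)
```

With something like this in place, a hung monitor shows up as a stale heartbeat within the chosen limit, instead of as three minutes of suspiciously empty logs.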
This emphasizes the importance of beta testing and fault simulation. So often we see companies go live with untested, ground breaking products and then have a miserable first quarter or two of constant failures and bug fixing, many times driving them to closure .. if you don’t test properly, it’s the small things that will get you in the end.