What a crappy night.

One of our client’s d-servers decided to act up at 3:00 AM: most commands would fail with an Input/Output Error message.

The initial consensus was that this was a bad HD, but this exact server had its HD replaced on April 11. Although not impossible, the prospect of a 2-month-old HD failing already was hard to believe. But that was not the real problem. The real problem was .. the client had not done any backups.

Even though it’s not my responsibility to make sure customers have backups, I knew that somehow, in the end, we would take the heat for it. So I attempted to back up the one site this server hosts, and it failed: midway through the backup, the server simply .. CRASHED. So we moved to repair mode. While the tech ran a lengthy FSCK, I started to plan the move to another disk and to contemplate the terrible day I had ahead of me. This is an online pharmacy, which demands an uptime SLA a lot higher than the usual customer does.

The first FSCK failed for some reason (I now believe due to tech error), so we moved on to a second one, using a few extra options .. and thanks to my lucky stars .. it seems to have done it.

I checked the logs, and there were no EXT3 or HD-related errors, not even from earlier, when it was giving the I/O error .. so I’m uncertain whether the HD was bad, or whether it was just a corrupt sector of data where part of the kernel or a major OS file happened to be.
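
For what it’s worth, the kind of log check I mean can be roughly sketched in Python. This is only a sketch: the patterns are message fragments the Linux kernel commonly logs for ext3 and block-device trouble (my assumption, not an exhaustive list), the function names are made up for illustration, and the default log path is a guess that varies by distro.

```python
import re

# Assumed patterns: fragments the Linux kernel commonly logs for ext3 and
# block-device trouble. This list is illustrative, not exhaustive.
DISK_ERROR_PATTERNS = [
    r"EXT3-fs error",
    r"Buffer I/O error",
    r"end_request: I/O error",
]

_DISK_ERROR_RE = re.compile("|".join(DISK_ERROR_PATTERNS))


def find_disk_errors(lines):
    """Return the log lines that match any of the assumed error patterns."""
    return [line.rstrip("\n") for line in lines if _DISK_ERROR_RE.search(line)]


def scan_log(path="/var/log/messages"):
    """Scan one log file. The default path is an assumption; adjust per distro."""
    with open(path, errors="replace") as handle:
        return find_disk_errors(handle)
```

An empty result only means nothing matching those patterns was logged; it does not prove the disk is healthy, which is exactly why I’m still uncertain.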

Regardless, this has given me a stable system for now, and my first command to it was “BACKUP”.

I have let it run for now, as the HD seems to be behaving. If it fails again, with a backup on the side I can restore it in an hour, versus trying to harvest data from a bad drive, which could have taken days.

I really should have chosen a less stressful profession, like dismantling live bombs… at least those guys get to sleep.
