A few days ago I was bragging that things were just peachy.
I spoke too soon; today was just hell.
About 1:00 PM I got an email notifying me that server #60 was down .. I asked Ali to reboot it .. after 15 minutes it was still down, so I asked him what the problem was (I was fearing a hacker) .. and he tells me it’s just going through an FSCK. I thought that was weird: these are ext3 machines, and this server was rebooted not too long ago.
When it finally came back up, I restarted postgres so H-Sphere could run. It seemed to start fine, but then I realized mysql was not running .. so I went to /var/lib .. and to my horror .. mysql was nowhere to be seen.
It was a catastrophic corruption of /var .. all of mysql was gone, and worse yet, the H-Sphere pgsql DB was corrupt. I then went to /backup, only to find that the backups for mysql and pgsql were MIA too .. it seems the HD had filled up and the DBs had not been backed up.
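For what it’s worth, a backup that quietly stops when the disk fills up is the kind of thing a small pre-flight check can catch. Below is a rough sketch, not what we actually run: the 5 GiB threshold and the function names are made up for illustration, and only the /backup path comes from our setup.

```python
import os
import shutil

# Hypothetical threshold: refuse to back up with less than 5 GiB free.
MIN_FREE_BYTES = 5 * 1024**3


def has_room(free_bytes, min_free_bytes=MIN_FREE_BYTES):
    """Return True when the free space meets the minimum we require."""
    return free_bytes >= min_free_bytes


def backup_looks_sane(path):
    """Return True when the dump file exists and is not empty."""
    return os.path.isfile(path) and os.path.getsize(path) > 0


def preflight(backup_dir="/backup"):
    """Raise loudly instead of silently skipping the backup on a full disk."""
    free = shutil.disk_usage(backup_dir).free
    if not has_room(free):
        raise RuntimeError(
            f"only {free} bytes free on {backup_dir}, not backing up"
        )
    return free
```

The idea would be to call preflight() before the dump and backup_looks_sane() on the resulting file afterwards, and to send an email when either one fails, so a missing backup gets noticed before the day it is needed.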
After a few more FSCKs .. the mysql directory finally showed up in lost+found .. but now the real problem .. the H-Sphere database was corrupt. The tech tried to move it to another server .. to no avail, the main table was in bad shape.
At the same time, I got notice that #32 still won’t send email to AOL .. the idiot postmasters still don’t see the RDNS entry for it .. great job AOL
And I still have #88 to attach to the #43 cluster .. Miguel is after me to get it done, but with the problems we had today, it will have to wait till tomorrow.
Eventually Vlodimir came to the rescue and manually restored the table bit by bit .. 20 account details were lost, but that’s not a big deal .. I know the client can get those back ..
Now Vlodimir is moving the data from the old server with the bad HD onto a new one .. crossing my fingers that the worst is behind us. The users’ data is on a separate HD .. so none of that is lost.