Dealing with corruption on drives - reiserfsck --rebuild-tree how to

It is really uncanny that I run across these problems during the holidays. On Dec. 24 one of my boxes was reporting a "read only filesystem" on the /var partition. My alerts didn't notify me because the services were still up, but mail wasn't being delivered and MySQL was reporting crashed tables.

The first time I was able to reboot it, but the second time, I wasn't so lucky. Here is the procedure in case I run across this next Christmas.

I wanted to unmount the partition so I could run fsck on it, but since it was in a crashed state and still in use, a normal umount wouldn't work, so I did a lazy unmount:

umount -l /dev/hda8

I ran reiserfsck and it did report corruption, so I needed to run it again with --rebuild-tree:

reiserfsck --rebuild-tree /dev/hda8
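Since --rebuild-tree is the heavy option, it can help to let the exit status of a plain check say which repair is actually needed. Here is a minimal sketch, assuming the exit codes listed in the reiserfsck man page (verify them against your version). The name describe_fsck_status is a hypothetical helper of my own, not part of reiserfsprogs.

```shell
# Hypothetical helper: translate a reiserfsck exit status into the next step.
# The codes below are assumed from reiserfsck(8); check your local man page.
describe_fsck_status() {
    case "$1" in
        0)  echo "no errors" ;;
        1)  echo "errors corrected" ;;
        2)  echo "reboot needed" ;;
        4)  echo "run --rebuild-tree" ;;
        6)  echo "run --fix-fixable" ;;
        8)  echo "operational error" ;;
        16) echo "usage error" ;;
        *)  echo "unknown status $1" ;;
    esac
}

# Usage (as root, with the partition unmounted):
#   reiserfsck --check /dev/hda8
#   describe_fsck_status $?
```

The function only interprets a status code, so it is safe to source on its own; the actual reiserfsck calls are left as comments because they need root and an unmounted partition.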

This process takes a long time; I would guess about an hour for my /var partition. After the corruption was fixed, I ran reiserfsck again (which also takes a long time) and it reported no errors. I didn't see any drive I/O errors in the logs before the partition was set to read-only, but both times it happened at approximately 3 AM, so I suspected something in cron.daily. I did have dirvish processes running, which is a backup system that generates a lot of I/O, so I stopped it temporarily. I will be testing with heavier I/O loads, but the primary goal is to keep the services up.

I wasn't able to remount /dev/hda8 on /var. It was complaining that the filesystem was already mounted or busy. Looking at top, I noticed some processes that were still running, so I ran:

lsof | grep /var

That showed:

syslog-ng  1699     root    5u     unix 0x00000000       0t0       3028 /var/run/syslog-ng.ctl
fail2ban-  1843     root    3u     unix 0x00000000       0t0       3588 /var/run/fail2ban/fail2ban.sock
fail2ban-  1843     root    4u     unix 0x00000000       0t0   10501109 /var/run/fail2ban/fail2ban.sock
sh        27519     root    3u     unix 0x00000000       0t0       3588 /var/run/fail2ban/fail2ban.sock
sh        27519     root    4u     unix 0x00000000       0t0   10501109 /var/run/fail2ban/fail2ban.sock
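To get from a listing like that to the set of processes to stop, the PID column can be pulled out and de-duplicated. This is only a sketch: pids_from_lsof is a hypothetical helper name, and it assumes the default lsof column layout shown above, where the PID is the second field.

```shell
# Hypothetical helper: read lsof-style lines on stdin and print the
# unique PIDs (second column), sorted numerically.
pids_from_lsof() {
    awk '$2 ~ /^[0-9]+$/ { print $2 }' | sort -un
}

# Usage (as root):
#   lsof | grep /var | pids_from_lsof
```

The numeric match on the second field skips the lsof header line if it happens to slip through the grep.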

I then killed those processes. Even with them gone, I was still unable to mount the partition by hand, but I was able to run shutdown -r now, and the system rebooted and mounted the partition normally.
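Because the services stayed up, nothing alerted on the partition going read-only. A check along these lines could catch it sooner. It is a rough sketch that assumes the standard /proc/mounts field order (device, mount point, type, options); readonly_mounts is a hypothetical name.

```shell
# Hypothetical helper: read /proc/mounts-format lines on stdin and print
# the mount points whose option list contains "ro".
readonly_mounts() {
    awk '{
        n = split($4, opts, ",")
        for (i = 1; i <= n; i++)
            if (opts[i] == "ro") print $2
    }'
}

# Usage, for example from cron:
#   readonly_mounts < /proc/mounts
```

Some pseudo-filesystems are legitimately mounted read-only, so in practice the output would need filtering down to the partitions that matter, such as /var.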