Reading Docs == Not Waking Up Early On Sundays

So, a little over a week ago, I upgraded from CentOS 5.3 to CentOS 5.4 on two of my home servers, one of which has a simple RAID1 array with two 1 TB drives. On the server lacking RAID, the upgrade went very smoothly, with no complications whatsoever. On the server with RAID, however, I saw a few issues.

Last Sunday, I woke up to the sound of my cell phone alerting me to issues with the RAID array (thanks Nagios!). For no apparent reason, all of the arrays decided to resync, and with one of those arrays at 914 GB in size, it takes a little over three hours for all the arrays to resync (3 hours, 15 minutes, 40 seconds, to be exact). When SMART isn’t issuing any alerts, and there’s nothing to signal drive failure in /var/log/messages or in the output of dmesg, you’ve got to wonder what the hell is going on. I found out what was going on after I got the alerts again this morning.
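As a rough sanity check on that resync time, here is the back-of-the-envelope arithmetic in Python. It assumes the 914 GB array accounts for essentially the whole run and that GB means 10^9 bytes. Both are my assumptions, and the smaller arrays are ignored.

```python
# Rough average resync rate.
# Assumptions (not measured): decimal gigabytes, and the 914 GB array
# dominating the 3 h 15 min 40 s total.
ARRAY_BYTES = 914 * 10**9


def duration_seconds(hours, minutes, seconds):
    """Convert an h/m/s duration to seconds."""
    return hours * 3600 + minutes * 60 + seconds


def average_rate_mb_per_s(total_bytes, elapsed_seconds):
    """Average throughput in decimal megabytes per second."""
    return total_bytes / elapsed_seconds / 10**6


elapsed = duration_seconds(3, 15, 40)
rate = average_rate_mb_per_s(ARRAY_BYTES, elapsed)
print(f"{elapsed} s elapsed, about {rate:.1f} MB/s on average")
```

That comes out a bit under 80 MB/s, so the three-hour figure is about what you would expect from reading the whole array end to end.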

As it turns out, Red Hat added data scrubbing to RHEL 5.4, which of course led to the feature appearing in CentOS 5.4. What is data scrubbing? Per this:

When you have multiple copies of data, you can use data scrubbing to
actively scan for corrupt data and clean up the corruption by replacing
the corrupt data with correct data from a surviving copy.

Normally, raid passively detects bad blocks. When you attempt to read a
block, if a read error occurs, the data is reconstructed from the rest of
the array and the bad block is rewritten. If the block can not be
rewritten the defective disk is kicked out of the active array.

During raid reconstruction, if you run accross a previously undetected bad
block, you may not be able to reconstruct your array without data
corruption. The larger the disk, the higher the odds that passive bad
block detection will be inadaquate. Therefore, with today's large disks it
is important to actively perform data scrubing on your array.
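For the curious, the Linux md driver reports scrub state through sysfs: `sync_action` shows what the array is doing (for example `idle`, `check`, or `resync`) and `mismatch_cnt` holds the mismatch counter from the last check. Here is a minimal Python sketch that reads both. The function names and the `/sys/block/md0/md` path are illustrative assumptions of mine, not anything taken from Red Hat's script.

```python
import os


def read_md_scrub_state(md_sysfs_dir):
    """Read scrub-related state from an md array's sysfs directory.

    md_sysfs_dir is assumed to look like /sys/block/md0/md (hypothetical
    device name; substitute your own array).
    """
    with open(os.path.join(md_sysfs_dir, "sync_action")) as f:
        action = f.read().strip()
    with open(os.path.join(md_sysfs_dir, "mismatch_cnt")) as f:
        mismatches = int(f.read().strip())
    return {"sync_action": action, "mismatch_cnt": mismatches}


def describe(state):
    """Turn the raw state into a one-line, human-readable summary."""
    if state["sync_action"] != "idle":
        return "array busy: " + state["sync_action"]
    if state["mismatch_cnt"] > 0:
        return "idle, %d mismatches reported by last check" % state["mismatch_cnt"]
    return "idle, no mismatches reported"
```

On a real box you would call `describe(read_md_scrub_state("/sys/block/md0/md"))`. Writing `check` to the same `sync_action` file as root is what kicks off a scrub by hand.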

OK, sounds good to me. Red Hat made a note about this change in their Technical Notes (warning: PDF format), but that seemed to be the extent of the notification to users. This RAID check is set to run as part of cron.weekly, with execution starting for me at a little after 4 AM on Sunday mornings, a time at which I’m certainly not actively reading my log files. Without Logwatch and Nagios, chances are good that I might never have noticed this process occurring, so I have those programs to thank for my education on this subject.

But if only I had read the documentation in the first place, I wouldn’t have needed Nagios to wake me up two Sundays in a row. And for what it’s worth, these are just my home servers; they aren’t doing any real “enterprise” stuff, though they are configured in something resembling enterprise fashion. Certainly a good way to learn without having a whole lot of money, jobs, or lives on the line…
