Badblocks

Posted by Thoughts and Ramblings on Saturday, August 10, 2024

I was looking into ways to test new hard drives and ran across the program badblocks, which can perform such testing, and even a script that wraps it. It seems, though, that badblocks stores the block count in only a 32-bit integer, so if you want to test a drive bigger than 16TB, you need to use a block size of 8192 or larger. On further examination of the documentation, I saw there is an option for the number of blocks tested at a time, and this value defaults to 64. I searched to see whether increasing this value has a speed impact, and many commented that it doesn’t seem to have any effect.
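
To make the block-size arithmetic concrete, here is a minimal Python sketch. It assumes the limit is exactly 2^32 blocks and that block sizes are powers of two starting at 1024; the function names are made up for illustration and are not part of badblocks. Under that assumption, the "16TB" threshold is really 16 TiB (about 17.6 TB in decimal units).

```python
# Assumption: the block count must fit in a 32-bit integer, so a run can
# address at most 2**32 blocks. The exact boundary in badblocks may differ by one.
MAX_BLOCKS = 2**32


def max_drive_bytes(block_size: int) -> int:
    """Largest drive (in bytes) addressable with the given block size."""
    return MAX_BLOCKS * block_size


def min_block_size(drive_bytes: int, start: int = 1024) -> int:
    """Smallest power-of-two block size (>= start) that covers the whole drive."""
    size = start
    while drive_bytes > max_drive_bytes(size):
        size *= 2
    return size


if __name__ == "__main__":
    print(max_drive_bytes(4096) // 2**40)  # 16 (TiB reachable with 4096-byte blocks)
    print(min_block_size(16 * 10**12))     # 4096 (a decimal 16 TB drive still fits)
    print(min_block_size(18 * 10**12))     # 8192
```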

In all the discussion around using this program to test a modern hard drive, a thought dawned on me: What about caches? The source for badblocks makes an effort to bypass the Linux disk cache, but modern hard drives also have a cache on the controller board. The typical options for a modern hard drive would result in the program writing 512kB (64 blocks of 8192 bytes), reading the freshly written data, moving to the next 512kB of the disk, and repeating until it reaches the end of the drive. So, what will a modern hard drive do when it is told to write 512kB and immediately read that same 512kB back, when its on-board cache (256MB) is over 500 times that size? Wouldn’t it just read the data from the cache instead of the physical disk? Why does no one in all of the discussions of badblocks seem to have noticed that the read/write cycle involves far less data than can be stored in the disk’s on-board cache? Does this do anything real at all when it comes to testing the physical layer of the disk?
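
The numbers behind that question are easy to check. This short Python sketch uses the figures from the paragraph above (8192-byte blocks, 64 blocks at a time, a 256MB cache); the cache size is an assumed typical value, not a measurement of any particular drive.

```python
BLOCK_SIZE = 8192             # bytes per block, as needed for large drives
BLOCKS_AT_A_TIME = 64         # the default number of blocks tested at once
CACHE_BYTES = 256 * 1024**2   # assumed on-board cache size (256 MiB)

# Amount of data written and then read back in one cycle.
chunk_bytes = BLOCK_SIZE * BLOCKS_AT_A_TIME

# How many such chunks fit in the drive's cache at once.
chunks_in_cache = CACHE_BYTES // chunk_bytes

print(chunk_bytes // 1024)  # 512 (KiB per cycle)
print(chunks_in_cache)      # 512
```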

I did come across mention of a vastly better alternative: set up whole-disk encryption, write zeros to the encrypted container (so that encrypted zeros are written to the disk) until the entire disk is full, and then read the entire container back to check that it returns zeros. Combine this with checking the SMART data before and after the operation. This exhausts the cache and both writes and reads the entire disk. Then you simply let the drive’s internal diagnostics notice issues, which will be reported through SMART and tell you the condition of the disk. This seems like a much better approach. If you want multiple passes, repeat the whole process with a different encryption key, and that’ll write a completely different pattern to the disk.
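
The shape of that procedure (fill everything first, only then read everything back) can be sketched in Python. This is not the encrypted-container method itself: as a stand-in for encrypted zeros, it derives a keyed pseudorandom pattern with SHAKE-256, and it targets an ordinary file, not a real device. The names `pattern`, `write_pass`, and `verify_pass` are hypothetical, and the sketch makes no attempt to bypass the operating system's cache or to read SMART data.

```python
import hashlib
import os

CHUNK = 1024 * 1024  # bytes per I/O operation; an arbitrary choice


def pattern(key: bytes, index: int, length: int) -> bytes:
    """Deterministic pseudorandom bytes for chunk `index` under `key`."""
    return hashlib.shake_256(key + index.to_bytes(8, "big")).digest(length)


def write_pass(path: str, total_bytes: int, key: bytes) -> None:
    """Fill the whole target with the keyed pattern before any read happens."""
    with open(path, "wb") as f:
        offset, index = 0, 0
        while offset < total_bytes:
            n = min(CHUNK, total_bytes - offset)
            f.write(pattern(key, index, n))
            offset += n
            index += 1
        f.flush()
        os.fsync(f.fileno())  # push the data out of Python and OS write buffers


def verify_pass(path: str, total_bytes: int, key: bytes) -> list:
    """Read the whole target back; return start offsets of mismatching chunks."""
    bad = []
    with open(path, "rb") as f:
        offset, index = 0, 0
        while offset < total_bytes:
            n = min(CHUNK, total_bytes - offset)
            if f.read(n) != pattern(key, index, n):
                bad.append(offset)
            offset += n
            index += 1
    return bad
```

Calling `write_pass` and then `verify_pass` with the same key should return an empty list on a healthy target. A second pass with a different key writes a completely different pattern, mirroring the change of encryption key described above.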