Can Data on Hard Drives Degrade Without a Warning About the Damage?

We all worry about keeping our data and files safe and intact, but is it possible for data to become damaged and be accessed by a user without a notification or warning of any kind about the problem? Today's SuperUser Q&A post has the answer to a worried reader's question.

Today’s Question & Answer session comes to us courtesy of SuperUser—a subdivision of Stack Exchange, a community-driven grouping of Q&A web sites.

Photo courtesy of generalising (Flickr).

The Question

SuperUser reader topo morto wants to know if data on hard drives can degrade and be accessed without a warning about the damage:

Is it possible that physical degradation of a hard drive could cause bits to "flip" in a file's contents without the operating system noticing the change and notifying the user about it when reading the file? For example, could a "p" (binary 01110000) in an ASCII text file change to a "q" (binary 01110001), then when a user opens the file, they see "q" without being aware that a failure has occurred?
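The flip in that example is a change to a single bit. A minimal Python sketch makes this visible; the variable names are illustrative only.

```python
# Flip the least significant bit of ASCII "p" and show what a reader would see.
original = ord("p")        # 0b01110000
flipped = original ^ 0b1   # XOR with 1 toggles the lowest bit

print(f"{original:08b} -> {chr(original)}")
print(f"{flipped:08b} -> {chr(flipped)}")
```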

I am interested in answers relating to FAT, NTFS, or ReFS (if it makes a difference). I want to know if operating systems protect users from this, or if we should be checking our data for variances between copies over time.

Can data on hard drives degrade and be accessed without a warning about the damage?

The Answer

SuperUser contributor Guntram Blohm has the answer for us:

Yes, there is a thing called bit rot. But no, it will not affect a user unnoticed.

When a hard drive writes a sector to the platters, it does not simply write the bits as they are stored in RAM. It uses an encoding that ensures there are no overly long runs of the same bit. It also adds error-correcting codes (ECC) that allow it to repair errors that affect a few bits and detect errors that affect more than a few bits.
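The "repair a few, detect more" idea can be illustrated with a toy code. The sketch below is an extended Hamming (8,4) code, which corrects any single flipped bit and detects any two flipped bits in an 8-bit codeword. It is an assumption chosen for brevity: real drives use far stronger codes over whole sectors, and the function names `encode` and `decode` are hypothetical.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming codeword (list of bits)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8  # index 0 = overall parity, indexes 1..7 = Hamming(7,4)
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]   # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]   # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]   # covers positions 4, 5, 6, 7
    code[0] = sum(code[1:]) % 2             # makes the total parity even
    return code


def decode(code):
    """Return (status, nibble); status is 'ok', 'corrected', or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos                 # XOR of the positions of all set bits
    parity_odd = sum(code) % 2 == 1
    if syndrome == 0 and not parity_odd:
        status = "ok"
    elif parity_odd:
        code[syndrome] ^= 1                 # one bit flipped; syndrome is its position
        status = "corrected"
    else:
        return "uncorrectable", None        # two bits flipped: detected, not repairable
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble


word = encode(0b1011)
one_flip = list(word)
one_flip[5] ^= 1
two_flips = list(word)
two_flips[2] ^= 1
two_flips[6] ^= 1

print(decode(word))       # ('ok', 11)
print(decode(one_flip))   # ('corrected', 11)
print(decode(two_flips))  # ('uncorrectable', None)
```

Note that the decoder learns which bit failed only when the damage is small enough to correct. Beyond that, it knows only that the codeword is bad.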

When the hard drive reads the sector, it checks these ECC codes and repairs the data if necessary (and if possible). What happens next depends on the circumstances and the firmware of the hard drive, which is influenced by the designation of the drive.

If a sector can be read and has no ECC code problems, then it is passed on to the operating system.

If a sector can be repaired easily, the repaired version may be written to disk, read back, then verified to determine if the error was a random one (e.g., caused by cosmic rays) or if there is a systematic error with the media.

If the hard drive determines that there is an error with the media, it reallocates the sector.

If a sector can be neither read nor corrected after a few read attempts (on a hard drive that is designated as a RAID hard drive), then the hard drive will give up, reallocate the sector, and tell the controller that there was a problem. It relies on the RAID controller to reconstruct the sector from the other RAID members and write it back to the failed hard drive, which then stores it in the reallocated sector (that hopefully does not have a problem).

If a sector cannot be read or corrected on a desktop's hard drive, then the hard drive will engage in more attempts to read it. Depending on the quality of the hard drive, this might involve repositioning the head, checking to see if there are any bits that flip when read repeatedly, checking which bits are the weakest, and a few other things. If any of these attempts succeed, the hard drive will reallocate the sector and write back the repaired data.
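The cases above can be summarized as a small decision function. This is a rough sketch of the logic as described here, not real firmware: the function `handle_read`, its parameters, and the action strings are all hypothetical.

```python
def handle_read(ecc_result, drive_class="desktop",
                media_fault=False, deep_retry_succeeds=False):
    """Return the actions a drive might take for one sector read.

    ecc_result: "clean", "corrected", or "unreadable" (illustrative labels).
    drive_class: "desktop" or "raid".
    """
    if ecc_result == "clean":
        return ["pass data to OS"]

    if ecc_result == "corrected":
        # Rewrite and read back to tell a random error from a media problem.
        actions = ["rewrite and verify sector"]
        if media_fault:
            actions.append("reallocate sector")
        actions.append("pass data to OS")
        return actions

    if ecc_result != "unreadable":
        raise ValueError(f"unknown ecc_result: {ecc_result!r}")

    if drive_class == "raid":
        # Give up quickly and let the RAID controller rebuild the sector.
        return ["reallocate sector", "report error to controller"]

    # Desktop drive: keep trying harder before admitting defeat.
    if deep_retry_succeeds:
        return ["reallocate sector", "write back repaired data",
                "pass data to OS"]
    return ["report error to controller"]


print(handle_read("unreadable", drive_class="raid"))
print(handle_read("unreadable", drive_class="desktop", deep_retry_succeeds=True))
```

In every branch the outcome is either good data or a reported error, never silently wrong data.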

This is one of the main differences between hard drives that are sold as "desktop", "NAS/RAID", or "video surveillance" hard drives. A RAID hard drive can just give up quickly and make the controller repair the sector to avoid latency on the user's side. A desktop hard drive will continue trying again and again because having the user wait a few seconds is probably better than telling them the data is lost. And a video hard drive values constant data rates more than error recovery as a damaged frame will typically not even be noticed.

At any rate, the hard drive will know if there has been bit rot and will typically recover from it. If it cannot, it will tell the controller, which will in turn tell the driver, which will then tell the operating system. Then, it is up to the operating system to present the error to the user and act on it. This is why cybernard says:

I have never witnessed a single bit error myself, but I have seen plenty of hard drives where entire sectors have failed.

The hard drive will know if there is something wrong with a sector, but it will not know which bits have failed. A single bit that has failed will always be caught by ECC.

Please note that chkdsk and file systems that automatically repair themselves do not address repairing data within files. These are targeted at corruption within the structure of the file system itself, like a difference in a file's size between the directory entry and the number of allocated blocks. The self-healing feature of NTFS will detect structural damage and prevent it from affecting your data further, but it will not repair any data that is already damaged.

There are, of course, other reasons why data may become damaged. For example, bad RAM on a controller may alter data before it is even sent to the hard drive. In that case, no mechanism on the hard drive will detect or repair the data, and this may be one reason why the structure of a file system is damaged. Other reasons include software bugs, blackouts while writing to the hard drive (although this is addressed by file system journaling), or bad file system drivers (the NTFS driver on Linux defaulted to read-only for a long time since NTFS was reverse engineered, not documented, and the developers did not trust their own code).

I once had a case where an application saved all of its files to two different servers in two different data centers in order to keep a working copy of the data available under all circumstances. After a few months, we noticed that about 0.1 percent of all the copied files did not match the MD5 checksum that the application stored in its database. It turned out to be a faulty fiber cable between the server and the SAN.

These other reasons are why some file systems, like ZFS, keep additional checksum information in order to detect errors. They are designed to protect you from a lot more things that can go wrong than just bit rot.
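On a file system without such checksums, an application can do a similar check itself, much like the MD5 comparison in the anecdote above. The following is a minimal sketch using Python's standard `hashlib`. The helper names `file_digest` and `find_mismatches` are hypothetical, SHA-256 is an assumed default, and this is not how ZFS works internally.

```python
import hashlib


def file_digest(path, algorithm="sha256", chunk_size=1 << 20):
    """Hash a file in chunks so large files do not need to fit in memory."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_mismatches(expected):
    """Return the paths whose current digest differs from the recorded one.

    expected: dict mapping file path -> hex digest recorded when the file
    was known to be good.
    """
    return [path for path, digest in expected.items()
            if file_digest(path) != digest]
```

Recording digests when files are written and re-running `find_mismatches` periodically would catch damage introduced anywhere along the path, including in RAM, cables, or controllers. It only detects damage, though; repairing it still requires a good copy.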

Have something to add to the explanation? Sound off in the comments. Want to read more answers from other tech-savvy Stack Exchange users? Check out the full discussion thread here.