Managing Errors in NAND Flash Based Devices
NAND flash is the proper choice for mass storage devices: it allows large capacities at relatively low cost. NOR flash was traditionally used in embedded systems to hold boot code and data; however, it is inefficient for holding large amounts of data.
In today’s systems, the typical design is a (protected) NOR flash BIOS that supports booting from USB or SATA/IDE drives, SD cards, and CF cards, and those boot devices are typically NAND flash based.
NAND flash comes in different technologies:
- SLC (Single Level Cell)
- MLC (Multi-Level Cell, 2 bits per cell – can be used as pseudo-SLC)
- TLC (Triple Level Cell)
- 3D NAND (cells stacked in an X-Y-Z grid array)
All of these technologies are viable and offer a long service life when managed properly (details are noted in Flash Endurance). They require sophisticated interfacing, both hardware and firmware, to operate correctly. Raw bit error rates in MLC/TLC are quite high and require heavy-duty ECC algorithms for error detection and correction; correction strength greater than 96 bits is about the average for MLC/TLC.
Source of Errors
NAND flash, by its nature (charge based cells), is inherently prone to errors. Here is a list of prominent error possibilities:
- Cell Charge Loss or Gain
- Read Disturbs
- Program Disturbs
- Excessive Program/Erase cycles
The use of a sophisticated, well designed NAND controller can make or break a storage product. The controller should have a properly designed Flash Translation Layer (FTL) in its hardware/firmware to minimize error occurrence.
The FTL is responsible for logical-to-physical mapping: logical sectors (LBAs) to physical sectors in the flash (blocks -> pages -> sectors). Mapping is the relatively easy part. Most controllers use either block-based or page-based mapping; both have advantages and disadvantages, and comprehensive information about FTL mapping is a simple internet search away for anyone interested in researching that topic further. The most difficult task of the FTL is error handling. As you can imagine, this gets more critical with MLC/TLC and very small die process geometries.
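The out-of-place nature of NAND writes is what makes this mapping necessary: a page cannot be overwritten, so each write of a logical sector lands in a fresh physical page and the old one becomes stale. A minimal page-based mapping sketch, with hypothetical names and structure (real FTLs add garbage collection, wear leveling, and persistence of the map itself):

```python
PAGES_PER_BLOCK = 64

class PageFTL:
    """Toy page-based FTL map: logical page -> (block, page)."""

    def __init__(self, num_blocks):
        self.l2p = {}            # logical page -> physical location
        self.next = (0, 0)       # next free physical page
        self.invalid = set()     # stale pages awaiting garbage collection
        self.num_blocks = num_blocks

    def _advance(self):
        blk, pg = self.next
        pg += 1
        if pg == PAGES_PER_BLOCK:
            blk, pg = blk + 1, 0
        if blk == self.num_blocks:
            raise RuntimeError("out of space (GC not modeled here)")
        self.next = (blk, pg)

    def write(self, lba):
        # NAND pages cannot be rewritten in place: every write of an LBA
        # goes to a fresh page and the old mapping becomes stale.
        if lba in self.l2p:
            self.invalid.add(self.l2p[lba])
        self.l2p[lba] = self.next
        self._advance()

    def lookup(self, lba):
        return self.l2p.get(lba)
```

Rewriting the same LBA moves it to a new physical page and marks the old page invalid, which is exactly the behavior that later forces garbage collection and wear leveling.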
Blocks wear out after many P/E cycles; ORT testing can help determine wear. The first consideration is avoiding premature wear-out of blocks, which can have a life as low as 3,000 P/E cycles in small-process MLC/TLC devices. Wear leveling makes sure that all blocks receive approximately the same number of P/E cycles. The best wear leveling is of the static type, whereby data blocks circulate even if rarely written to (static data is moved).
A good FTL will keep an erase count on each active block, and trigger a wear level once a preset number of erases is reached. Therefore, wear leveling is the first basic step toward error handling.
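The erase-count trigger described above can be sketched as follows; the threshold value is an illustrative assumption, not a figure from any particular controller:

```python
# Static wear-leveling trigger: when the gap between the most- and
# least-erased blocks exceeds a preset threshold, migrate the static
# (cold) data from the least-erased block into the most-worn one, so
# the cold block's low erase count can absorb future writes.

WEAR_THRESHOLD = 100   # illustrative; tuned per device in practice

def wear_level_candidates(erase_counts, threshold=WEAR_THRESHOLD):
    """erase_counts: dict of block -> erase count.
    Returns (cold_block, hot_block) to swap, or None if balanced."""
    cold = min(erase_counts, key=erase_counts.get)
    hot = max(erase_counts, key=erase_counts.get)
    if erase_counts[hot] - erase_counts[cold] > threshold:
        return cold, hot
    return None
```

Because the check only compares the extremes of the erase-count table, it is cheap enough to run after every erase.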
Error detection and correction is normally performed in hardware. Errors will occur when data is read from the flash; there is no escaping that. In the best FTL designs, a combination of ECC and CRC is used, preventing false detection and miscorrection of data. There are different algorithms in use, but the most effective is the BCH algorithm, which requires about 13 bits of overhead for each bit of correction desired.
The ECC parity data is kept in the spare area of the flash, hence the amount of correction used depends on the bytes available in that spare area, and flash vendors size it with this in mind. A good FTL will also keep track of how many bits needed correction on each read, even in cases where the data is fully corrected. There is a good reason for this, as we shall see.
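The 13-bits-per-corrected-bit figure makes the spare-area budget easy to work out. A quick worked example (the correction strengths shown are typical values, used here only for illustration):

```python
# Spare-area cost of BCH correction, using the ~13 parity bits per
# correctable bit quoted in the text (m = 13 for typical codewords).

def bch_parity_bytes(t, m=13):
    """Parity bytes needed for a BCH code correcting t bits per codeword."""
    return (m * t + 7) // 8   # round bits up to whole bytes

# Correcting 40 bits costs 13 * 40 = 520 bits, i.e. 65 spare bytes.
# A 96-bit-correcting code needs 13 * 96 = 1248 bits, i.e. 156 bytes.
```

This is why correction strength is ultimately bounded by spare-area size: a 96-bit code already consumes over 150 bytes of spare per codeword.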
Read Disturb Errors
This phenomenon is commonly overlooked by many controller vendors, and it shouldn’t be. Read disturb errors account for many mysterious issues in the field.
Read disturb errors occur when too many reads (about 1M) are made to a page without an intervening block erase. Static wear leveling helps to some degree, but it is not enough: wear leveling is a protective mechanism for preventing too many program/erase cycles on a block. What is needed is a count of the number of reads from each block/page. When a threshold is reached, the data needs to be moved to a freshly erased block; the old block, once erased, can then be put back into service.
The issue with read disturbs is that they affect not just the page being read, but adjacent pages as well. This is dangerous because the errors that happen when reading the affected pages may not be correctable.
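The read-count bookkeeping described above might look like the following sketch; the 1M figure comes from the text, while the 80% refresh margin is an illustrative assumption:

```python
# Per-block read-disturb accounting: count reads since the block's last
# erase and flag the block for refresh (copy out, erase, reuse) well
# before the disturb limit is reached.

READ_LIMIT = 1_000_000            # approximate disturb limit from the text
REFRESH_AT = int(READ_LIMIT * 0.8)  # illustrative safety margin

class ReadDisturbTracker:
    def __init__(self):
        self.reads = {}           # block -> reads since last erase

    def on_read(self, block):
        """Bump the block's read count; True means refresh the block."""
        self.reads[block] = self.reads.get(block, 0) + 1
        return self.reads[block] >= REFRESH_AT

    def on_erase(self, block):
        """An erase resets the block's disturb exposure."""
        self.reads[block] = 0
```

Counting per block rather than per page keeps the table small; a controller short on RAM can coarsen this further at the cost of earlier refreshes.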
Program Disturb Errors
Program disturb errors normally occur when too many partial-page programs are made and adjacent pages are disturbed. One way to avoid this is to limit the number of partial-page programs. There is no defined limit as with read disturbs, but other mechanisms can be used to minimize the effect. The disturb shows up when an affected page is read and the error is detected. Sometimes this can be corrected; however, the block is still in trouble.
So, what can be done? One operation that helps is to keep track of the number of bits needing correction on each page read. When approaching a percentage of available correction capability, the block needs to be erased after moving the data to another good block, then the block can be returned to use.
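This corrected-bit threshold check can be sketched as follows; the 96-bit strength matches the figure quoted earlier, and the 75% margin is an illustrative choice:

```python
# After each ECC decode, compare the number of bits that needed
# correction against a fraction of the code's total capability and
# schedule a refresh (move data, erase block, reuse) when the margin
# gets thin.

ECC_STRENGTH = 96        # correctable bits per codeword (from the text)
REFRESH_FRACTION = 0.75  # illustrative margin

def needs_refresh(bits_corrected, strength=ECC_STRENGTH):
    """True when a read consumed too much of the correction budget."""
    return bits_corrected >= strength * REFRESH_FRACTION
```

The point of refreshing early is that the data is still fully correctable when it is moved; waiting until the budget is exhausted risks an uncorrectable read.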
Program errors are less common than erase errors. When a program error occurs, it is not always necessary to take the block out of service; doing so would be inefficient and unnecessary. Simply pick another block, copy the data from the old block to the new block, and erase the block that caused the error. It is then ready to return to service.
In the case that the erase fails, the result is more severe. Erase errors require that the block be retired and replaced (Remapped).
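The program-error versus erase-error policy above can be written as a small decision routine. The callback names are hypothetical; a real controller issues actual flash commands at each step:

```python
# Program failure: move the data, try to erase the failing block, and
# only retire (remap) the block if the erase itself fails.

def handle_program_error(block, copy_data, erase, retire):
    """copy_data(block) moves valid data to a known-good block;
    erase(block) returns False on failure; retire(block) remaps it."""
    copy_data(block)              # salvage the data first
    if erase(block):
        return "returned-to-service"
    retire(block)                 # erase failure is terminal for a block
    return "retired"
```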
Handling Bad Blocks – Bad Block Remapping
Flash devices come from the factory with bad blocks already marked. Normally, all good blocks are erased to 0xFF; the manufacturer marks bad blocks by writing data other than 0xFF in the first few pages of the block. The first order of business for any controller is to scan for these marks and build a defect table; marked blocks are excluded from logical-to-physical mapping. Manufacturers typically specify that initial plus later-occurring (grown) bad blocks will not exceed a percentage of total blocks, typically 2%. It is important to establish a pool of spare blocks to replace blocks as more go bad, hence bad block remapping is essential.
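A factory bad-block scan along these lines might look like the following; the exact marker byte location varies by device, so the layout here is an assumption:

```python
# Scan every block's factory marker byte and record anything that is
# not 0xFF in the defect table. The defect table is built once, before
# any logical-to-physical mapping is established.

GOOD_MARKER = 0xFF

def build_defect_table(read_marker, num_blocks):
    """read_marker(block) returns the marker byte for that block
    (device-specific location, typically in a spare area)."""
    return {b for b in range(num_blocks) if read_marker(b) != GOOD_MARKER}
```

The scan must run before the first erase of any block, because erasing a factory-marked block destroys the only record that it was bad.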
Excessive Use of Copy Page (Copy Back)
NAND flash has a feature whereby data can be internally copied from one block to another. This can improve performance by roughly 20%, since the data never leaves the flash device. However, the big drawback is that there is no error correction along the way: do this enough times and bit errors accumulate. So, what can be done? Limit the number of copy-backs performed by using a counter. When the threshold is exceeded, read the data out through the controller to correct any errors, then reset the counter. This turns out to be a good compromise between speed and reliability.
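That copy-back limiter can be sketched with a simple counter; the threshold of 4 is an illustrative assumption:

```python
# Allow a limited run of fast on-chip copy-backs, then force one copy
# through the controller so ECC can scrub accumulated bit errors.

COPYBACK_LIMIT = 4   # illustrative; tuned per device in practice

class CopyBackPolicy:
    def __init__(self, limit=COPYBACK_LIMIT):
        self.limit = limit
        self.count = 0

    def next_copy_mode(self):
        """'internal' = fast on-chip copy-back (no ECC on the data);
        'external' = read through the controller, correcting errors,
        which also resets the accumulation counter."""
        if self.count >= self.limit:
            self.count = 0
            return "external"
        self.count += 1
        return "internal"
```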
Reporting Errors to the Host Device
We now see that there are several ways errors are introduced when using NAND flash. The methods mentioned above should all be used to minimize the chance of encountering errors.
However, errors will occur anyway. Take read errors from the flash: above, we mentioned keeping track of how many bits are corrected on each read. By watching this counter, we can tell if a block is degrading. When it reaches 75% of the code's correction ability, a refresh (moving the data to a fresh block, then erasing) will help.
Even at that, let’s say the controller reads a page and gets an uncorrectable error. It doesn’t know whether this is a hard or soft error, perhaps from a read or program disturb that occurred in the past. So the controller should retry the read a few times in succession to see if a correction is achieved. If so, the block should be scheduled for a refresh, so the data is moved to another block and the block erased. If the erase fails, the block should be remapped.
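That retry policy, sketched below; the retry count is an illustrative assumption:

```python
# On an uncorrectable read, retry a few times. If any retry decodes
# cleanly, treat it as a soft error and schedule the block for a
# refresh instead of reporting a failure to the host.

MAX_RETRIES = 3   # illustrative

def read_with_retry(read_page, block, page, retries=MAX_RETRIES):
    """read_page(block, page) returns (data, ok).
    Returns (data, action) where action tells the FTL what to do next."""
    for _ in range(retries):
        data, ok = read_page(block, page)
        if ok:
            return data, "schedule-refresh"   # soft error: data recovered
    return None, "report-UNC"                 # hard error: tell the host
```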
If a correction is not achievable, a UNC (uncorrectable) error is returned to the host, with no data. This scenario can be minimized if the techniques mentioned above are used. At the system level, any critical data should be kept in multiple places on the drive.
For write errors that involve flash program errors, the controller should move the existing data to a new block, program the new pages there, and schedule the old block for a refresh. If the block fails an erase, it must be remapped.
Recovery from Power Down During Drive Writes
If power is lost (power fail) during a write cycle, with any storage device, there is a good chance of data corruption. From a controller/disk perspective, there is a limit to what can be done; what happens depends on how much data is in the sector buffers and how much data is still queued behind it.
If there is enough capacitance on the disk power supply rails, the controller can detect the power fail at the beginning, flush the sector buffers to the flash, and present a busy signal to the host interface. At the same time, write protect the flash and halt the CPU.
It is not always possible to implement this scenario. From a disk perspective, all that can be done is to minimize corruption; upon detection of power down, even if the full sequence above cannot be completed, at minimum the flash should be write protected.
Power-up recovery is where repairs are usually attempted. Keep in mind that any reasonable controller design writes new data to fresh flash locations, not touching the existing data. So power-up recovery involves reading these staged buffers and, if errors occur, falling back to the old data. This is not ideal, but it at least guarantees that no corrupt or nonsense data is stored to the drive.
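That recovery flow can be sketched as follows; the data structures are hypothetical, and the point is simply that the old copy is only replaced when the staged copy reads back clean:

```python
# Power-up recovery over journaled writes: new data was staged in fresh
# locations without touching the old copy, so each staged buffer either
# commits (if it reads back clean) or is discarded in favor of the old
# data (stale but not corrupt).

def recover(staged, committed):
    """staged: lba -> (data, ok) for writes in flight at power loss;
    committed: lba -> old data. Returns the post-recovery view."""
    result = dict(committed)
    for lba, (data, ok) in staged.items():
        if ok:
            result[lba] = data   # the new write survived intact
        # else: keep the old data rather than store garbage
    return result
```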
The host bears the brunt of the responsibility for recovering from a power failure. Upon detection of power loss, the host needs enough time to finish writing the data it started to write, and then halt. It then needs to resolve data conflicts by reading back what was intended to be written and making corrections where possible. This is not a trivial task; further research on this process is recommended.
Carmine C. Cupani, MSEE
CTech Electronics LLC