Considerations Beyond Specs for Industrial Storage Solutions
Multiple flash memory options exist on the market, including triple-level cell (TLC), multi-level cell (MLC), single-level cell (SLC), and pseudo single-level cell (pSLC), which leaves developers facing many decisions when selecting storage for their applications. The key to choosing the right storage is understanding how flash storage works and how different storage-level effects impact the longevity and reliability of the memory. By understanding the mechanism through which flash memory functions, engineers and other developers will know what questions they should ask their suppliers as they choose storage solutions.
For embedded industrial systems, there are multiple application aspects to consider when selecting memory, such as:
- Read and write speed
- Endurance (how long the flash media lasts)
- Retention (how long the memory stores data)
- Vibration resistance
- Temperature resistance
- Power failures and how data is secured in the event of a failure
- BOM control and the long-term availability of the memory product
One aspect of flash memory that is easy for developers to overlook is how NAND chips age. NAND flash cells do not have an infinite life. Instead, they can tolerate only a limited number of block erase cycles. The reason is that hot electrons with increased energy levels become trapped in the oxide layer that insulates the storage gate. The programming voltage accelerates the electrons, and over time, the threshold voltage shifts to the point that the cell can no longer be read. As read errors pile up, the complete block of cells is written off as a bad block.
Accelerated Aging in NAND Flash
A second kind of aging also involves the oxide layer. It occurs when conductive paths form through the oxide, allowing the cell to lose its charge and, in turn, its stored bit. This effect is stronger at high temperatures, and a relatively minor increase in temperature escalates it dramatically. For instance, after operating for five years at 55 °C, a 25 nm MLC NAND flash device has retention of about 75%. If that same device is operated for the same time period at 85 °C, the retention falls below 10%.
There can be a considerable loss of data retention in NAND flash as the memory device gets closer to the end of its life. Every kind of NAND flash has a maximum expected number of program-erase (P/E) cycles it can complete, as well as a projected lifetime. However, the period for which data is retained can be severely shortened by the number of P/E cycles already completed. In the case of an MLC NAND flash device with an expected data retention of 10 years, retention can drop to as little as one year after 3,000 P/E cycles.
For TLC NAND flash, the data retention loss is even more dramatic. TLC flash needs to use 8 different charge levels to store three bits of data on each cell. This creates challenges for both the charge state and threshold voltage that lead to accelerated aging, to the degree that data retention falls from one year to three months after a mere 500 P/E cycles.
SLC NAND flash is not immune to these challenges, but it is more resilient. Most SLC devices do not begin to lose data retention capabilities until about 10,000 P/E cycles are completed. This longer-term data retention capacity is the reason that many industrial application developers choose to use SLC NAND flash.
The Role of pSLC
A newer NAND flash development is pSLC, which is intended to be a middle ground between MLC and SLC flash storage. Designers of the pSLC format recognized that memory becomes less robust as the number of charge levels required for storage grows. pSLC therefore uses MLC chips but stores only the first, strong data bit in each cell, which improves data retention over traditional MLC. This approach allows pSLC to operate much faster than MLC flash typically does while boosting the number of P/E cycles that are possible before degradation occurs. Compared to standard MLC, which can undergo 3,000 P/E cycles before degradation begins, pSLC can complete 20,000 P/E cycles. Although endurance is about 6.7 times better than that of standard MLC, the cost per bit stored is usually only about twice as high. For some developers, this strikes the right balance of performance and price.
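The trade-off described above can be put into numbers. The following is a minimal sketch that uses only the figures quoted in this article (3,000 and 20,000 P/E cycles, and a cost per bit of roughly two times that of MLC). The function name and the "P/E cycles per unit of relative cost" metric are illustrative assumptions, not an industry-standard measure.

```python
def endurance_per_cost(pe_cycles: int, relative_cost_per_bit: float) -> float:
    """P/E cycles obtained per unit of relative cost per bit (illustrative metric)."""
    return pe_cycles / relative_cost_per_bit


# Figures quoted in the article; MLC cost per bit is normalized to 1.0.
MLC_PE_CYCLES = 3_000
PSLC_PE_CYCLES = 20_000
PSLC_RELATIVE_COST = 2.0  # "roughly two times more" per bit than MLC

endurance_ratio = PSLC_PE_CYCLES / MLC_PE_CYCLES
value_ratio = (
    endurance_per_cost(PSLC_PE_CYCLES, PSLC_RELATIVE_COST)
    / endurance_per_cost(MLC_PE_CYCLES, 1.0)
)

print(f"Endurance gain: {endurance_ratio:.1f}x")         # Endurance gain: 6.7x
print(f"Endurance per unit cost: {value_ratio:.1f}x")    # Endurance per unit cost: 3.3x
```

Under these assumptions, pSLC delivers roughly 3.3 times as many P/E cycles per unit of cost as standard MLC.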
How Workload Impacts Endurance Specs
For developers, having the information about endurance and other specs is helpful, but that information often doesn’t tell the full story. Instead, it is essential to understand what exactly the specifications stated by the manufacturer really mean and what kind of workload they are referring to.
There are two measurements that provide particularly helpful insight into the endurance of a solid-state drive (SSD). The first is terabytes written, or TBW, which states how much data can be written to a device during its lifetime. The second is drive writes per day, or DWPD, which states how many times the full capacity of the device can be written each day during the period it is under warranty. Although having this information is helpful to developers, there is no way of knowing exactly how relevant it is to the application on which they are working. The values achieved in practice are determined by how the device is being used.
An example of these spec variations in action is the testing of a 480GB SSD. This SSD had TBW values of 1,360, 912, and 140 under three different workloads. The highest value was associated with sequential writing. In a second test, a client workload was simulated to represent a typical PC user’s behavior that involved mostly sequential data access operations, and the TBW value dropped to 912. The third test mimicked an enterprise workload in a multi-server environment with random access of 80% of the data. With this usage, the SSD delivered 140 TBW.
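To see what these TBW figures mean in day-to-day terms, they can be converted into DWPD using the common relation DWPD = TBW / (capacity in TB × 365 × warranty years). The sketch below applies it to the three values quoted for the 480GB SSD. The three-year warranty period is an assumption made for illustration only, since the test description does not state one, and the capacity is treated as decimal gigabytes.

```python
def dwpd(tbw: float, capacity_gb: float, warranty_years: float) -> float:
    """Convert terabytes written (TBW) into drive writes per day (DWPD)."""
    capacity_tb = capacity_gb / 1000  # decimal units: 1 TB = 1000 GB
    return tbw / (capacity_tb * 365 * warranty_years)


CAPACITY_GB = 480
WARRANTY_YEARS = 3  # assumption: not stated in the test description

workloads = [
    ("sequential", 1360),
    ("client", 912),
    ("enterprise", 140),
]

for label, tbw in workloads:
    print(f"{label:>10}: {tbw:>5} TBW = {dwpd(tbw, CAPACITY_GB, WARRANTY_YEARS):.2f} DWPD")
```

With these assumptions, the same drive ranges from well over two full drive writes per day under sequential writing to roughly a quarter of a drive write per day under the enterprise workload.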
Testing flash memory devices in this way is common, thanks to the guidelines set for endurance testing by the JEDEC standardizing organization. These guidelines set a baseline for comparability between products from different manufacturers. The issue for developers is that the varied workload result specs are not usually in the datasheets that they see. Instead, manufacturer data may include the highest levels for endurance values, even though these levels are usually only seen when sequential writing is used in a limited number of applications. The actual performance when it is put into an application can vary widely, which developers have to keep in mind when evaluating different solutions.
Erasure and Flash Memory Aging
Memory cells age with every erasure, yet block erasures are unavoidable because a block must be erased before it can be written again. This doesn’t mean that read-only applications, such as boot media, automatically keep data safe for longer. In reality, reading causes errors of its own, and correcting those errors also wears down NAND cells.
When writing processes occur, the cell being programmed may not be the only one under stress. Neighboring cells can also be exposed to increased voltage transferred from the programmed cell. This is known as a program disturb. Reading can similarly cause read disturbs, in which adjacent pages pick up increased voltage. Read disturbs eventually lead to read errors, which disappear once the block is erased. Although the lower voltage used in reading compared to writing means that the effect of excess charge is weaker, it can still cause bit errors. Error-correcting code (ECC) corrects these bit errors, and once they accumulate, the controller copies the data elsewhere and erases the entire block.
In applications that read the same data repeatedly, the effect of increased voltage during read cycles is intensified. As such, even when memory is only used for reading, pages still have to be written due to blocks being deleted in error corrections, so aging still occurs.
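The way a read-only workload still consumes erase cycles can be shown with a small simulation. This is a deliberately simplified sketch: the `Block` class and the `READ_DISTURB_LIMIT` threshold are hypothetical, and real controllers typically decide when to refresh based on measured bit error rates rather than a fixed read count.

```python
READ_DISTURB_LIMIT = 100_000  # hypothetical number of reads tolerated before a refresh


class Block:
    """Toy model of one flash block that is only ever read by the host."""

    def __init__(self) -> None:
        self.reads_since_erase = 0
        self.erase_count = 0

    def read(self) -> None:
        self.reads_since_erase += 1
        if self.reads_since_erase >= READ_DISTURB_LIMIT:
            self.refresh()

    def refresh(self) -> None:
        # The data is copied to another block and this block is erased,
        # which costs one P/E cycle even though the host never wrote anything.
        self.erase_count += 1
        self.reads_since_erase = 0


block = Block()
for _ in range(250_000):
    block.read()

print(block.erase_count)  # 2
```

After 250,000 reads, the block has been erased twice without a single host write, which is the aging effect described above.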
Other Aspects of Flash Memory Aging
Another aging consideration is that wear is caused not just by the application itself, but also by the internal processes of the controller and firmware. These processes often go unnoticed and are not accounted for when considering speed and endurance, but they are major contributors to overall performance.
Wear leveling is one such internal process, and it increases endurance. After the failure of a cell, the entire block has to be written off. Wear leveling ensures that physical memory addresses are used evenly, which delays the marking of bad blocks for as long as possible. It works alongside garbage collection, which copies still-valid data elsewhere so that blocks can be erased and released for use.
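The idea of using physical blocks evenly can be sketched in a few lines. The `WearLeveler` class below is a hypothetical, minimal illustration that always selects the block with the fewest erase cycles. Production firmware is considerably more involved and also has to handle static data, garbage collection, and bad block management.

```python
class WearLeveler:
    """Minimal sketch: always reuse the least-worn block next."""

    def __init__(self, num_blocks: int) -> None:
        self.erase_counts = [0] * num_blocks

    def pick_block(self) -> int:
        # Choose the block with the lowest erase count (ties go to the lowest index).
        return min(range(len(self.erase_counts)), key=self.erase_counts.__getitem__)

    def erase_and_write(self) -> int:
        block = self.pick_block()
        self.erase_counts[block] += 1
        return block


leveler = WearLeveler(num_blocks=4)
for _ in range(10):
    leveler.erase_and_write()

print(leveler.erase_counts)  # [3, 3, 2, 2]
```

Because the erase counts never differ by more than one in this sketch, no single block reaches its P/E limit long before the others.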
Mapping between logical and physical addresses, which is at the core of data storage, is complemented by these processes. How efficiently a flash controller handles this mapping is expressed by the write amplification factor, or WAF: the ratio of the amount of data actually written into flash memory to the amount of user data arriving from the host.
Lower WAFs are associated with higher endurance. However, the WAF is not a fixed property of the memory device. It depends on how the firmware handles the workload. Relevant factors include the size of data blocks, the blocks’ sizes as compared to the pages, and the differences between random and sequential access. As such, firmware has to be considered when choosing memory, rather than just the specs of the storage devices themselves.
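The link between WAF and endurance can be made concrete with a short calculation. The sketch below computes the WAF as flash bytes written divided by host bytes written, and then uses the common idealized estimate that supported host writes equal capacity × P/E cycles / WAF. That estimate assumes perfect wear leveling and ignores over-provisioning, and the example values (a 480GB device with 3,000 P/E cycles) are chosen for illustration only.

```python
def write_amplification_factor(flash_bytes_written: int, host_bytes_written: int) -> float:
    """WAF = data physically written to flash / data sent by the host."""
    return flash_bytes_written / host_bytes_written


def host_writes_supported_tb(capacity_gb: float, pe_cycles: int, waf: float) -> float:
    """Idealized estimate of total host writes (in TB) before the P/E budget is used up."""
    return capacity_gb * pe_cycles / waf / 1000


# Example: the host updates 4 KiB, but the controller has to rewrite 16 KiB of flash.
waf = write_amplification_factor(flash_bytes_written=16 * 1024, host_bytes_written=4 * 1024)
print(waf)  # 4.0

# Illustrative device: 480 GB, 3,000 P/E cycles.
print(host_writes_supported_tb(480, 3_000, waf=1.0))  # 1440.0
print(host_writes_supported_tb(480, 3_000, waf=waf))  # 360.0
```

In this idealized model, a WAF of 4 cuts the amount of host data the device can absorb to a quarter.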
Strategies for Increasing Efficiency
Manufacturers work carefully to maximize the efficiency of NAND flash memory. Understanding the basics of how flash memory works can help developers understand how manufacturers do this. At its simplest, flash memory requires the pages within a block of cells to be programmed one after the other, while erasure is only possible for complete blocks. In standard instances, mapping between logical and physical addresses happens in blocks, which is efficient for storing sequential data, since the pages of a block have to be written in succession. Video data that is collected on an ongoing basis is an example of a format in which this kind of block-based allocation works best.
Random data is where manufacturers run into complications. Pages of random data are stored in different blocks, so every internal reprogramming means entire blocks have to be erased. This leads to a high WAF and an associated loss of endurance. To make up for this issue, page mapping replaces block-based allocation, so that data from different sources can be saved in order on the pages of a single block. Page mapping cuts erasures and increases write performance, which can improve endurance, but it also enlarges the allocation table of the flash translation layer. To deal with the larger allocation table, manufacturers integrate dynamic random-access memory (DRAM).
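A rough calculation shows why page mapping calls for DRAM. The sketch below estimates the size of the allocation table for both mapping schemes on a 480 GB device. The page size (16 KiB), block size (4 MiB), and 4 bytes per table entry are assumptions chosen for illustration and vary between products.

```python
def mapping_table_bytes(capacity_bytes: int, unit_bytes: int, entry_bytes: int = 4) -> int:
    """Approximate table size: one entry per mapped unit (a page or a block)."""
    entries = -(-capacity_bytes // unit_bytes)  # ceiling division
    return entries * entry_bytes


CAPACITY = 480 * 10**9        # 480 GB in decimal bytes
PAGE_SIZE = 16 * 1024         # assumption: 16 KiB pages
BLOCK_SIZE = 4 * 1024 * 1024  # assumption: 4 MiB blocks (256 pages per block)

block_table = mapping_table_bytes(CAPACITY, BLOCK_SIZE)
page_table = mapping_table_bytes(CAPACITY, PAGE_SIZE)

print(f"Block mapping: {block_table / 1024:.0f} KiB")
print(f"Page mapping:  {page_table / 1024**2:.0f} MiB")
```

Under these assumptions, the block-based table fits in a few hundred kibibytes, while the page-based table needs on the order of a hundred mebibytes, which is more than a controller's on-chip memory typically holds.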
Page mapping can also be used if high utilization demands in the data medium cause an increase in WAF. As the flash holds additional data, bits have to be moved back and forth more frequently. As such, manufacturers have to prevent the media from becoming overloaded. This is achieved with over-provisioning.
Over-provisioning refers to the part of the flash that is reserved for internal activities only. Usually, it is 7% of the total area. Reserving 12% of the total area for over-provisioning rather than 7% increases endurance.
Managing Data
Although wear leveling and error correction are found universally in flash products, manufacturers incorporate additional protections against data loss and system failure in industrial SSDs. Combining ECC, read disturb management, and auto read refresh keeps stored data safe and monitored, and refreshes it as required. Note that this kind of data integrity should be ensured without help from the host application, so that the processes can work independently within the memory device itself.
Good data care management does not wait for the application to request data before checking for possible errors. Instead, written pages are read and refreshed as required, and several conditions can trigger this correction. These triggers include repeated power-on events, the number of P/E cycles, temperature increases, repeated reads, and the volume of data read.
Choosing a Memory Solution
It’s essential for developers to consider the broader scope of their applications when choosing memory, rather than selecting a flash memory based on the specs alone. Long-term availability, power-fail protections, rugged processing capacities, and other factors should all be considered. Developers should also work closely with manufacturers to understand how their products can work in different applications. You can narrow down your memory choices by considering these questions:
- Does my application have requirements for vibration resistance? Typically, industrial grade materials will have been tested for more extreme conditions.
- Will the device be exposed to high temperatures? Data care protections that refresh data regularly will reduce the risks imposed by high temperatures.
- Will the data be stored for an extended period of time? Generally, SLC NAND flash memory is best suited to this kind of usage.
- Does my application mainly read or mainly write? Data care management is recommended for read-heavy applications, while applications that are mostly write-based need either block-based mapping or page mapping, depending on whether the data is sequential or random.
- Will the application need the full capacity of the memory? Over-provisioning can help to increase endurance.
- What kind of workload do the specs refer to? The manufacturer can explain the workload benchmark for their products.
- Does my application need extra data loss protection? Most industrial users require power fail protection to preserve sensitive data.
- Will the memory medium be available for long-term use? Ask the manufacturer to explain their device longevity and replacement cycles.