Flash Controller Hardware and Software Design for Optimum Performance

General:

This article describes hardware and software optimizations in a NAND FLASH controller that support high-speed data transfer.

Modes of Data Transfer

Data transfer modes vary by application, and often are the key factor in overall system throughput.

A properly designed industrial flash storage device achieves higher overall performance by optimizing transfer speeds in each mode.

Transfers, both read and write, are either sequential or random. Write speeds, and random write speeds in particular, are the most difficult to deal with. This results from the nature of NAND FLASH itself.

NAND FLASH Basic Architecture

NAND FLASH is structured as cells arranged as follows:

Sectors → Pages → Blocks/Planes

The Block is the largest element in the flash. A block consists of multiple pages. A page consists of multiple sectors plus overhead bytes. A sector consists of 512 Bytes plus overhead bytes. The size of the overhead area varies with the type of Flash, and it is large enough to hold the error correction data and checksums. The overhead is largest in Flash that requires a lot of correction, such as small process MLC and TLC.

Blocks are split into 2 planes by block number: even-numbered blocks in one plane and odd-numbered blocks in the other. There is a purpose behind this. Flash supports simultaneous operations on one block in each plane. This is referred to as 2 Plane operation, and there is a set of special commands to use the feature. For example, a block in each plane of a device can be erased at the same time. The same holds for almost all commands.

Basic NAND FLASH Operation

NAND Flash is written at the page level and can only be erased at the block level. Pages within a block must be written in sequential order, lowest to highest. In all but SLC flash, a page can only be written once before its block is erased. This presents a complicated task for the NAND FLASH controller hardware and firmware if good performance is to be expected.
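The write and erase rules above can be captured in a small toy model. This is only a sketch for illustration, not device code: the class name `NandBlock`, the 64 pages per block, and the strict "next page only" check are assumptions made for the example.

```python
class NandBlock:
    """Toy model of one NAND block (illustrative only, not a driver)."""

    def __init__(self, pages_per_block=64):  # 64 is an assumed value
        self.pages = [None] * pages_per_block  # None means "erased"
        self.next_page = 0                     # lowest page still writable

    def program(self, page, data):
        """Write one page, enforcing write-once and ascending order."""
        if page >= len(self.pages):
            raise ValueError("page number is outside the block")
        if page < self.next_page:
            raise ValueError("page already written; erase the block first")
        if page != self.next_page:
            raise ValueError("pages must be written in sequential order")
        self.pages[page] = data
        self.next_page += 1

    def erase(self):
        """Erase acts on the whole block, never on a single page."""
        self.pages = [None] * len(self.pages)
        self.next_page = 0
```

Rewriting page 0 after pages 0 and 1 have been programmed raises an error in this model. The only way to reuse the page is to erase the whole block, which is exactly the constraint the firmware has to work around.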

Logical to Physical flash Mapping Schemes – Flash Translation Layer

Flash is a physical medium in which logical data must be stored. The basic unit of storage is the sector, addressed by its LBA (Logical Block Address). Each LBA must be mapped to a physical sector in the Flash. This is one of the tasks of the FTL.
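Before any table lookup, the FTL has to split an LBA into its position within the sector, page and block hierarchy. A minimal sketch of that arithmetic follows. The geometry (8 sectors per page, 64 pages per block) and the function name `split_lba` are assumptions chosen for the example, not values from any particular device.

```python
SECTORS_PER_PAGE = 8   # assumed: 4 KiB page of 512-byte sectors
PAGES_PER_BLOCK = 64   # assumed

def split_lba(lba):
    """Split an LBA into (logical block, page in block, sector in page)."""
    logical_page, sector = divmod(lba, SECTORS_PER_PAGE)
    logical_block, page = divmod(logical_page, PAGES_PER_BLOCK)
    return logical_block, page, sector

print(split_lba(519))  # (1, 0, 7): last sector of the first page of logical block 1
```

The mapping scheme then decides which of these fields are looked up in a table and which are carried over unchanged to the physical address.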

So how can this be done? There are 2 basic schemes for mapping. One is called Block Based Mapping (BBM), the other Page Based Mapping (PBM). Of the 2 schemes, Block Based Mapping is the easier to implement.

BBM (Block Based Mapping)

This scheme has the lower performance of the two for random data writes, which are the most common.

This is because, when a page needs to be overwritten, the old data in the block must be merged with the new data and the result written to a freshly erased block. The old block is then erased and returned to service. One can imagine how transfer speeds drop as the storage device becomes well used. Random write speed is not the strong point of BBM. Sequential write and sequential read speeds are generally good. A further advantage of BBM is that it needs no garbage collection, which can introduce periodic overhead even when done in the background.

The mapping tables for BBM are much smaller than those for PBM. This is because a mapping is basically a block number and a page offset.
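The merge-on-overwrite behavior of block based mapping can be sketched as follows. This is a simplified model under stated assumptions, not a production FTL: the class `BlockMappedFtl`, the 4-page block size, and the counters are all made up for the example, and wear leveling and bad-block handling are left out.

```python
PAGES_PER_BLOCK = 4  # assumed; kept small so the example is easy to follow

class BlockMappedFtl:
    """Toy block-mapped FTL: one table entry per logical block."""

    def __init__(self, num_physical_blocks):
        self.flash = [[None] * PAGES_PER_BLOCK for _ in range(num_physical_blocks)]
        self.free = list(range(num_physical_blocks))  # erased blocks
        self.table = {}               # logical block -> physical block
        self.flash_page_writes = 0
        self.erases = 0

    def write_page(self, logical_page, data):
        lblock, offset = divmod(logical_page, PAGES_PER_BLOCK)
        old = self.table.get(lblock)
        if old is not None and all(v is None for v in self.flash[old][offset:]):
            # Target page and all higher pages are still erased: write in place.
            self.flash[old][offset] = data
            self.flash_page_writes += 1
            return
        new = self.free.pop(0)        # freshly erased block
        for p in range(PAGES_PER_BLOCK):
            # Merge: new data for the target page, old data everywhere else.
            value = data if p == offset else (
                self.flash[old][p] if old is not None else None)
            if value is not None:
                self.flash[new][p] = value
                self.flash_page_writes += 1
        self.table[lblock] = new
        if old is not None:
            self.flash[old] = [None] * PAGES_PER_BLOCK  # erase the old block
            self.erases += 1
            self.free.append(old)

    def read_page(self, logical_page):
        lblock, offset = divmod(logical_page, PAGES_PER_BLOCK)
        pblock = self.table.get(lblock)
        return None if pblock is None else self.flash[pblock][offset]
```

In this model, filling a block sequentially costs one flash write per page, while overwriting a single page in a full block costs a full block of page writes plus an erase. The table holds only one entry per logical block, which is why it stays small.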

PBM (Page Based Mapping)

With this mapping technique there are far fewer block erases. Logical data is written sequentially to pages within a block. When a page must be overwritten, the new data is written into a vacant page, and the old page is marked invalid in the mapping tables. Random writes involve much less work, since little data is moved and few blocks are erased, so random write performance is much better. The drawback is that the flash blocks become fragmented, and at some point they must be cleaned up to reclaim the pages holding old data (Garbage Collection). This can be time consuming and adds periodic overhead that slows performance. Read performance can also suffer, especially just before a garbage collection, because sector and page locations are fragmented. Mapping table accesses increase even for sectors that were written sequentially at some point in the past.
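The out-of-place update described above can be sketched as follows. This is a toy model under stated assumptions: the class `PageMappedFtl`, the 4-page block size, and the `stale_pages` helper are invented for illustration, and garbage collection itself is deliberately left out.

```python
PAGES_PER_BLOCK = 4  # assumed; kept small so the example is easy to follow

class PageMappedFtl:
    """Toy page-mapped FTL: one table entry per logical page."""

    def __init__(self, num_physical_blocks):
        total = num_physical_blocks * PAGES_PER_BLOCK
        self.flash = [None] * total   # physical pages, None means erased
        self.valid = [False] * total  # True while a page holds live data
        self.table = {}               # logical page -> physical page
        self.write_ptr = 0            # next erased physical page
        self.flash_page_writes = 0

    def write_page(self, logical_page, data):
        if self.write_ptr >= len(self.flash):
            raise RuntimeError("no erased pages left; garbage collection needed")
        old = self.table.get(logical_page)
        if old is not None:
            self.valid[old] = False   # map out the stale copy, no erase yet
        self.flash[self.write_ptr] = data
        self.valid[self.write_ptr] = True
        self.table[logical_page] = self.write_ptr
        self.write_ptr += 1
        self.flash_page_writes += 1

    def read_page(self, logical_page):
        ppage = self.table.get(logical_page)
        return None if ppage is None else self.flash[ppage]

    def stale_pages(self):
        """Count written pages that garbage collection would reclaim."""
        return sum(1 for i in range(self.write_ptr) if not self.valid[i])
```

Every host write costs exactly one flash page write in this model, whether it is new data or an overwrite. The price is the growing count of stale pages and a table with one entry per logical page.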

Write Amplification Factor (WAF) comes into play with both mapping schemes. BBM is by far the worse of the 2, with a much higher WAF. WAF is defined as the ratio of data written to the flash to data written by the host. A WAF approaching 1.0 is the goal.
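As a small worked example of the definition, the ratio can be computed directly. The function name `write_amplification` and the 64-page block used in the numbers are assumptions for illustration only.

```python
def write_amplification(flash_pages_written, host_pages_written):
    """WAF = pages written to flash / pages written by the host."""
    if host_pages_written <= 0:
        raise ValueError("host_pages_written must be positive")
    return flash_pages_written / host_pages_written

# Assumed case: the host overwrites 1 page in a full 64-page block.
# A block-mapped merge rewrites all 64 pages of the block.
print(write_amplification(64, 1))  # 64.0
# A page-mapped update writes only the new page (before garbage collection).
print(write_amplification(1, 1))   # 1.0
```

Garbage collection later adds flash writes to the page-mapped case as well, so its long-run WAF is above 1.0 in practice.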

The mapping schemes and FTL are part of the firmware. So, the choice of scheme is flexible, and important in determining final performance.

Flash Controller HW and FW Considerations for High Performance Design

We have seen how critical the choice of mapping scheme is to final performance. But this is just one aspect determining performance.

Since the controller and its firmware together form an embedded system, one cannot design either in a vacuum. There are integration decisions and tradeoffs to make. Some involve cost vs. performance.

There is a simple rule we have learned through years of design experience. Whatever can be done in HW will perform better than the same function done in FW. As an example, the core of error detection and correction (ECC) should be implemented in HW with minimal FW support.

In fact, some functions demand a HW implementation. One such function is data encryption, which is necessary in today’s controllers and carries too much overhead for a FW-only implementation.

FLASH Bus Considerations

This is a very critical HW consideration during design. All communication with the Flash devices occurs on this bus. There are important performance affecting items to consider. Once committed to HW, changes are very expensive to make. So, it’s important to spend time to get this correct. This is one area where HW/FW integration tradeoffs are made.

Single or Multichannel FLASH Buses

It should be intuitive that having more than a single Flash bus will improve performance. This comes at the expense of cost and power consumption. Having multiple Flash buses, each supporting several FLASH devices, allows HW interleaving, in which Flash pages are processed simultaneously in each channel. The more channels, the higher the speed. This also offloads the CPU, adding a further improvement in performance.
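The effect of adding channels can be illustrated with an idealized model. This is a sketch under simplifying assumptions: pages are distributed round-robin, every page takes the same time, and bus contention and CPU overhead are ignored. The function names and the 100 µs page time are hypothetical.

```python
def stripe_pages(num_pages, num_channels):
    """Assign consecutive pages to channels in round-robin order."""
    queues = [[] for _ in range(num_channels)]
    for page in range(num_pages):
        queues[page % num_channels].append(page)
    return queues

def transfer_time_us(num_pages, num_channels, page_time_us):
    """Channels run in parallel, so the busiest channel sets the total time."""
    busiest = max(len(q) for q in stripe_pages(num_pages, num_channels))
    return busiest * page_time_us

print(stripe_pages(8, 4))           # [[0, 4], [1, 5], [2, 6], [3, 7]]
print(transfer_time_us(8, 1, 100))  # 800
print(transfer_time_us(8, 4, 100))  # 200
```

Real controllers fall short of this linear scaling, but the model shows why channel count is one of the first things fixed in the HW design.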

Interleaving

Interleaving within each channel adds further improvement in performance. It involves overlapping page accesses and block erases across multiple flash devices within each channel. Because of power and other limitations, flash devices in a channel are usually accessed in pairs. This is mostly a FW implementation.

Within each Flash device, interleaving between planes is possible using the 2 Plane command set. This is a FW function and should always be used.
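The three levels of interleaving (channel, device within a channel, plane within a device) can be pictured as an address decomposition. The sketch below is one possible ordering, not a description of any specific controller: the name `locate`, the `FlashTarget` fields, and the default geometry of 4 channels, 2 devices per channel and 2 planes are all assumptions.

```python
from typing import NamedTuple

class FlashTarget(NamedTuple):
    channel: int
    die: int     # flash device within the channel
    plane: int
    row: int     # page position within that plane

def locate(page_index, channels=4, dies_per_channel=2, planes=2):
    """Spread consecutive pages over channels first, then dies, then planes."""
    page_index, channel = divmod(page_index, channels)
    page_index, die = divmod(page_index, dies_per_channel)
    row, plane = divmod(page_index, planes)
    return FlashTarget(channel, die, plane, row)

print(locate(0))   # FlashTarget(channel=0, die=0, plane=0, row=0)
print(locate(4))   # FlashTarget(channel=0, die=1, plane=0, row=0)
print(locate(8))   # FlashTarget(channel=0, die=0, plane=1, row=0)
```

With this ordering, 16 consecutive pages land on 16 different channel, die and plane combinations before any of them is reused, so all of them can be in flight at once in the ideal case.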

Additional Performance Enhancements

When designing controller HW, it is important to consider these items carefully:

Sector Buffers– These hold incoming data for each Flash Bus channel until it’s time to commit to Flash. There is a tradeoff here as to how big these volatile RAM buffers should be. Too big and the potential for data loss during a power failure becomes excessive. Too small and performance is impacted.

Direct Flash Access– This provides a direct path from the sector buffers to the Flash devices, offloading the CPU. Used judiciously, this can be a powerful tool to increase performance.

Host interface– Properly designed, this interface, usually implemented as a HW state machine, evaluates incoming commands and uses Direct Memory Access to move incoming data into the sector buffers and outgoing data out of them. Interrupts are generally used to signal the CPU when processing is required. The interface must be configurable via registers so that some functional changes are possible.

Conclusion

Controller HW and FW must be carefully co-designed to ensure high performance. As much functionality as practical should be implemented in hardware, balanced against FW tradeoffs. Multiple flash bus interleaving, Intra-Channel interleaving and Intra-Device interleaving (using 2 plane commands) all add to the final performance of the controller and storage device.