
Delkin Blog

NAND Flash Based SSD Drives and the Flash Controller

 

General

Solid State Disk drives (SSD Drives) are becoming more and more common in personal computers and enterprise server systems, and in industrial applications. They either replace mechanical drives or can be used in a mixed system using both types of drives, depending on factors such as cost and reliability, which often are a tradeoff.

Mechanical disk drives have good speed performance and data retention, but can fail for mechanical reasons. Industrial grade drives can be very expensive. When mechanical drives are used in enterprise or server applications in general, a RAID (Redundant Array of Independent Disks) configuration is mandatory. This is used for fault tolerance and/or a speed increase. Typically, there is a minimum of 4 drives in a RAID 10 configuration. This can get very expensive with industrial drives.

From experience, a single mechanical drive has a life of about 5 years with moderate use. So, we now see the SSD coming to life for such applications.

 

Disk Drive Basics

Disk drives, no matter the type (PATA, SATA, etc.), are functionally the same as far as their job is concerned: they take requests from a host computer to store data to, or read data from, the disk. Data is transferred with a defined protocol in units of Logical Blocks, addressed by LBA (Logical Block Address) and most often referred to as sectors, each 512 bytes in size. The term sector originated with the mechanical disk. A spinning disk is made up of multiple platters. Each platter surface is divided into concentric tracks, each track is divided into sectors, and the set of tracks at the same position across all platters is called a cylinder. There are multiple heads, one per platter surface, so the tracks of a cylinder can be accessed without moving the head assembly. In the days of old, addressing the disk to read or store data could be done by either the LBA or the CHS addressing scheme. CHS stands for Cylinder, Head, Sector. Using simple arithmetic, the total capacity of a drive in sector units (LBAs) is C x H x S, where C is the number of cylinders, H the number of heads and S the number of sectors per track. CHS addressing is still used in legacy systems, and must still be supported by modern drives, at least up to a certain size.
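The CHS arithmetic can be sketched in a few lines of Python. The geometry used here (1,024 cylinders, 16 heads, 63 sectors per track) is an assumption chosen only for illustration, and the function names are hypothetical. The conversion uses the conventional formula in which cylinder and head numbers start at 0 and sector numbers start at 1.

```python
SECTOR_BYTES = 512  # sector size used throughout this article

def chs_capacity_sectors(cylinders, heads, sectors_per_track):
    """Total capacity in sectors (LBAs): C x H x S."""
    return cylinders * heads * sectors_per_track

def chs_to_lba(c, h, s, heads, sectors_per_track):
    """Conventional CHS-to-LBA conversion; sector numbers start at 1."""
    return (c * heads + h) * sectors_per_track + (s - 1)

# Hypothetical geometry, for illustration only.
CYLINDERS, HEADS, SPT = 1024, 16, 63

total = chs_capacity_sectors(CYLINDERS, HEADS, SPT)
print(total, "sectors =", total * SECTOR_BYTES, "bytes")

print(chs_to_lba(0, 0, 1, HEADS, SPT))       # 0: first sector of the drive
print(chs_to_lba(1, 0, 1, HEADS, SPT))       # 1008: first sector of cylinder 1
print(chs_to_lba(1023, 15, 63, HEADS, SPT))  # 1032191: last sector (total - 1)
```

With this geometry the drive holds 1,032,192 sectors, and the last CHS address maps to LBA 1,032,191, so the two addressing schemes cover exactly the same range.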

We offer the above explanation because matching these types of operations with an SSD using NAND Flash is not a simple task. This is due to the totally different physical nature of the storage media (NAND Flash).

 

NAND Flash Based SSD Drives

SSD drives eliminate mechanical failures caused by shock, vibration and other sources of malfunction. Depending on configuration, they can also take much less space and power than mechanical drives. However, using NAND Flash (SLC, MLC or TLC) presents a challenge to the NAND Flash controller designer: matching the speed and corrected bit error rate of mechanical spinning drives. This is due to the nature of the architecture of NAND Flash. As we shall see, this becomes more difficult as process geometries shrink, especially with MLC and TLC types of Flash. The makeup of an SSD is simple in hardware architecture:

NAND Flash Controller

NAND Flash devices (of type SLC, MLC (PSLC), TLC)

Power Supply Regulation

PCB usually at least 4 layers.

Enclosure – depends on physical form factor.

NAND Flash Controllers (HW)- Next to the NAND Flash itself, the controller is the most important component in the SSD. Both the HW (Hardware) and FW (Firmware) work together to get a very difficult task accomplished.

The controller hardware contains:

Host Interface (PATA, SATA, SD, SDIO, MMC, eMMC, USB, PCIe, etc.)- Communicates in both directions with the host device. The Host interface performs the required protocol and uses Direct Memory and/or Flash Access to offload the main FW.

Flash Bus Interface(s) HW- Interfaces with one or more NAND FLASH devices, be it SLC, MLC, PSLC, TLC or rarely a combination of such. This is usually a multi-channel bus (often with interleaving within each channel) which creates parallelism for data transfers increasing throughput.

Direct Memory and Flash Access (DMA/DFA) HW- Sector buffers can be directly transferred to and from RAM and Flash without CPU intervention. This also increases throughput, since it offloads the CPU.

Error Detection and Correction HW (ECC)- This HW block sits between the Flash and the CPU. It is used to detect and correct data that contains bit errors “on the fly”. This is a critical piece of HW and works hardest with MLC/TLC Flash built on small process geometries, where 96 or more bits of correction are required.

Data Scrambler HW- Better designed controllers use a dedicated scrambler unit required for modern MLC/TLC Flash.  Some controllers use their encryption unit to accomplish this task. Use of this with Direct memory and Flash Access really speeds things up.

Data Encryption HW- Data Encryption/Decryption to and from Flash, using a technique such as AES, is becoming a requirement for secure applications. This must be done in HW. Keys are generated and used to access the secure data, even the FW.

ISO 7816 Secure Serial Data Port (Optional but becoming required in applications such as SD and SDIO cards)

SRAM (With Parity becoming a must)- FW runs in SRAM. Mapping tables are cached to SRAM. Temporary data storage and sector buffers reside here as well. SRAM is very limited in most controllers but 256K+ is common. FW overlays are therefore common, loaded as needed by the resident FW.

 

NAND Flash Controller FW Basics

At the top of the importance list is the NAND Flash controller FW. Having the most sophisticated HW does no good if the FW isn’t written properly and doesn’t make optimum use of the HW features.

Every controller has a built-in ROM (or a Flash that is locked after programming).

The ROM code acts like the BIOS in a PC. It performs basic CPU initialization, places the host port in a busy state, and performs a basic Flash reset and initialization. It then scans for the presence of initialized Flash devices and searches them for a key indicating that the FW and the basic structures it needs have been installed. This installation would already have been done by a preformatting process performed prior to use. The preformatting process is provided by the controller vendor.

If all goes right, the ROM loads the resident part of the FW that usually resides in the first Flash device.

At this point, the initialization part of the FW completes its power up procedure, scanning for Flash errors in the last written data, correcting what is needed, and then releases the busy at the host port. Now the controller is able to receive commands from the host. It is important that this power up initialization be kept as short as possible to avoid host/device issues.

 

 FW Structure

 FW should be written in modular form, for example:

Power-up Procedures

Host Interface Procedures

Flash Translation Layer (FTL)*

Flash Read/Write procedures (aided by DMA and DFA HW)

Encryption/Decryption (aided by HW)

Hooks for customer specific add ons

Debugging procedures

Flash procedures include a common part that is applicable to all Flash, and Flash specific (SLC, MLC, TLC, PSLC – and various vendors) procedures loaded at preformat time.

Although all aspects of FW affect performance, the FTL is the most critical. It can make or break an SSD performance and life. The remainder of this document will cover FTL basics.

Flash Translation Layer (FTL)- The FTL bears the brunt of the work of Controller FW. It is made up of the following:

Logical to Physical Mapping Procedures- The basic unit of transfer in disk drives is referred to as an LBA (Logical Block Address), or Sector. This must be stored in physical media which is, in the case of an SSD, the NAND Flash. As mentioned previously, this can be Single Level Cell (SLC, 1 bit per cell), Multi-Level Cell (MLC, 2 bits per cell), Triple Level Cell (TLC, 3 bits per cell), or Pseudo SLC (PSLC, a version of MLC made to simulate SLC, for special usage).

Logical to Physical Mapping Tables- These tables hold the information that allows locating and placing LBAs (Logical Blocks) in the PBAs (Physical Blocks) of the NAND Flash. These tables can be quite large, depending on the mapping scheme used. There are 2 basic mapping schemes, Block Based mapping and Page Based mapping, and many variants of each.

Defective Flash Block Tables- These tables hold the initial manufacturer marked defects and could also be augmented with additional defects as blocks go bad. Some controllers only hold manufacturer defects here, and simply remove the dynamic defects from the mapping tables as they occur.

Generally, Flash vendors guarantee no more than 2 % of the total blocks in the Flash device will be defective. This includes initial defects plus dynamic defects.

Flash Log Blocks- Part of the management tables, these blocks hold a history of the latest transactions as they occur. They are usually loaded in RAM and flushed to Flash after a transaction is completed.

Spare Flash Block Tables- These hold physical addresses of spare blocks used to replace dynamic defects as they occur. Enough spares must be allocated to cover anything over the initial defects.

Wear Leveling- This function ensures that all blocks in Flash devices are used as evenly as possible. This is as important as the mapping scheme used, or more so. Without it, the SSD would not last long, because NAND Flash by itself has a pretty limited life span. The life of Flash is expressed in Block Program/Erase (P/E) cycles. The Flash is made up of Blocks of cells, and these Blocks can only be erased and programmed a limited number of times.

As typical examples:

PLANAR 2D NAND FLASH:

SLC FLASH: (1 bit per Flash cell) 50K to 100K P/E cycles

MLC FLASH: (2 bits per Flash Cell) 3K to 10K P/E cycles

TLC FLASH: (3 bits per Flash cell) 300 to 1K P/E cycles

3D NAND FLASH:

SLC FLASH: Not Manufactured

MLC FLASH: 30K to 35K P/E cycles

TLC FLASH: 15K to 30K P/E cycles

Depending on the mapping scheme used and the state of the SSD, writing one small file can take multiple P/E cycles on the Flash, so without wear leveling an SSD would wear out quickly under heavy workloads, such as a system with a heavy random write load. Wear Leveling effectively multiplies these figures to very large numbers. Using Static wear leveling (preferred), the basic 100K P/E cycles of an SLC block is multiplied by the number of blocks in the Flash device. So, for example, a Flash with 8,192 blocks can absorb a total of about 819,200,000 block P/E cycles before it reaches end of life. Quite a difference in Flash endurance.
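The endurance arithmetic, and the basic idea of spreading erases evenly, can be sketched as follows. This is a minimal illustration, not any vendor's algorithm: the class and function names are hypothetical, and the allocator shown is the simpler dynamic flavor of wear leveling, which only picks the least-worn free block. Static wear leveling additionally relocates rarely changed data so that those blocks also take their share of erases; that part is not modeled here.

```python
def total_block_erases(pe_cycles_per_block, num_blocks):
    """Upper bound on block erases across the device with ideal wear leveling."""
    return pe_cycles_per_block * num_blocks

class LeastWornAllocator:
    """Toy allocator: always hands out the free block with the fewest erases."""

    def __init__(self, num_blocks):
        self.erase_counts = [0] * num_blocks
        self.free = set(range(num_blocks))

    def allocate(self):
        # Ties are broken by block number so the result is deterministic.
        block = min(self.free, key=lambda b: (self.erase_counts[b], b))
        self.free.remove(block)
        return block

    def erase_and_release(self, block):
        self.erase_counts[block] += 1
        self.free.add(block)

print(total_block_erases(100_000, 8192))  # 819200000

alloc = LeastWornAllocator(4)
for _ in range(12):
    blk = alloc.allocate()
    alloc.erase_and_release(blk)
print(alloc.erase_counts)  # [3, 3, 3, 3]: erases are spread evenly
```

Without the least-worn selection, the same 12 write cycles could all land on one block, which would then reach its P/E limit while the others were still unused.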

 

Error Detection and Correction (ECC)

This function is handled in HW, on the fly. The supporting FTL procedures are invoked and handle the administration of errors when they occur. The better controllers read the number of bits corrected on each read (HW is designed to contain counters for this.) When the errors reach a percentage of the correction capability, the block is scheduled for a refresh (Erase and Swap). Usually an interrupt is used to alert the CPU that this has occurred.
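The refresh decision described above can be sketched as a simple threshold check. The 75% threshold and all names here are assumptions made for illustration; actual firmware would tune the threshold to the Flash in use, and the check would typically run from the ECC interrupt handler.

```python
ECC_CAPABILITY_BITS = 96   # correction strength quoted in this article
REFRESH_PERCENT = 75       # assumed threshold; real firmware tunes this

refresh_queue = []         # blocks scheduled for refresh (erase and swap)

def needs_refresh(corrected_bits, capability=ECC_CAPABILITY_BITS,
                  percent=REFRESH_PERCENT):
    """True when the corrected-bit count reaches the threshold percentage."""
    return corrected_bits * 100 >= capability * percent

def on_read_complete(block, corrected_bits):
    """Called with the HW corrected-bit counter after each page read."""
    if needs_refresh(corrected_bits) and block not in refresh_queue:
        refresh_queue.append(block)

on_read_complete(7, 10)   # healthy page, nothing scheduled
on_read_complete(9, 80)   # 80 of 96 bits corrected, block 9 is scheduled
on_read_complete(9, 90)   # already queued, not added twice
print(refresh_queue)      # [9]
```

Scheduling the refresh before the error count reaches the full correction capability is what leaves a margin for the data to be read back correctly one more time and moved.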

By doing this, errors caused by read/write disturbs and soft errors can be handled more efficiently. This is especially required with MLC or TLC Flash built on ever-shrinking processes.

There are many types of ECC algorithms. One of the best is the BCH algorithm, which lends itself nicely to a hardware implementation. All ECC types require overhead in the Flash, and Flash vendors provide an overhead area in each page to hold it. The choice of ECC type must take the size of this area into account, to make sure the desired number of bits of correction can be supported.

The BCH type requires 13 bits of overhead for each bit of correction. So, to achieve 96 bits of correction, 1,248 bits, or 156 bytes, of the overhead area of each page are used. Since all ECC types can give false results, a CRC is also computed over the data and overhead area of the page. This CRC is checked first; if it is good, the ECC then does its job on the data. The CRC adds another 2 to 4 bytes of overhead.
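The overhead arithmetic is easy to check with a short calculation. The function name and parameters are hypothetical; the 13 bits per corrected bit and the 2 to 4 byte CRC are the figures given in the text.

```python
import math

def bch_overhead_bytes(correctable_bits, bits_per_correction=13, crc_bytes=0):
    """Bytes of overhead area needed for BCH parity plus an optional CRC."""
    parity_bits = correctable_bits * bits_per_correction
    return math.ceil(parity_bits / 8) + crc_bytes

print(bch_overhead_bytes(96))               # 156 bytes of parity
print(bch_overhead_bytes(96, crc_bytes=4))  # 160 bytes with a 4-byte CRC
```

A controller designer would compare this result against the overhead area the Flash vendor actually provides per page to confirm that the chosen correction strength fits.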

 

Read Disturb Management

NAND Flash blocks support a limited number of reads without an intervening P/E cycle. This function maintains a read counter for each block. When the number of reads reaches a set percentage of this limit, the data is moved to a block with fewer reads and the original block is erased. Typically, 1M reads is specified for SLC, less for MLC or TLC.
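A per-block read counter of this kind might look like the sketch below. The 90% trigger point and all names are assumptions for illustration; only the 1M read figure for SLC comes from the text.

```python
READ_LIMIT = 1_000_000   # SLC figure quoted above
RELOCATE_PERCENT = 90    # assumed trigger point; real firmware tunes this

class ReadDisturbTracker:
    """Counts reads per block since that block's last erase."""

    def __init__(self, num_blocks, read_limit=READ_LIMIT,
                 percent=RELOCATE_PERCENT):
        self.reads = [0] * num_blocks
        self.trigger = read_limit * percent // 100

    def record_read(self, block):
        """Count one read; return True when the block should be relocated."""
        self.reads[block] += 1
        return self.reads[block] >= self.trigger

    def relocated(self, block):
        """Call after the data has been moved and the block erased."""
        self.reads[block] = 0

# Small limit so the behavior is visible: trigger is 8 reads.
tracker = ReadDisturbTracker(num_blocks=4, read_limit=10, percent=80)
results = [tracker.record_read(2) for _ in range(8)]
print(results)           # seven False values, then True on the 8th read
tracker.relocated(2)
print(tracker.reads[2])  # 0
```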

 

NAND Flash and Logical to Physical Mapping

NAND Flash, regardless of type (SLC, MLC, TLC), consists of Flash blocks. Each block, depending on density, contains a number of pages. Pages contain a number of sectors (512 bytes each), plus overhead. Page sizes vary: 4K (8 sectors), 8K (16 sectors), 16K (32 sectors), and so on.

Examples of SLC block size:

64 pages of 2,048+OH bytes each for a block size of 128KB

64 pages of 4,096+OH bytes each for a block size of 256KB

128 pages of 4,096+OH bytes each for a block size of 512KB

256 pages of 4,096+OH bytes each for a block size of 1MB

256 pages of 8,192+OH bytes each for a block size of 2MB
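The block sizes listed above follow directly from pages per block times page data size (the overhead bytes are not counted in the nominal block size). A short check, with a hypothetical function name:

```python
SECTOR_BYTES = 512

def block_size_bytes(pages_per_block, page_data_bytes):
    """Nominal block size; the per-page overhead (OH) area is excluded."""
    return pages_per_block * page_data_bytes

geometries = [(64, 2048), (64, 4096), (128, 4096), (256, 4096), (256, 8192)]
for pages, page_bytes in geometries:
    size_kb = block_size_bytes(pages, page_bytes) // 1024
    sectors_per_page = page_bytes // SECTOR_BYTES
    print(f"{pages} pages x {page_bytes} bytes = {size_kb} KB "
          f"({sectors_per_page} sectors per page)")
```

The same arithmetic shows why the erase unit matters so much later on: in the largest example, changing a single 512-byte sector in place would involve a block of 4,096 sectors.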

 

Blocks are divided into 2 planes, even Blocks in Plane 0, odd Blocks in Plane 1.

A well-designed controller makes use of this by using the 2Plane Flash commands for all Flash access. This allows writing simultaneously to Block 0 and Block 1, for example, or using the 2 Plane erase command.

The above holds for SLC (PSLC), TLC or MLC.

These pages must be written in sequential order. For SLC, partial page writes are allowed, up to 4. For MLC or TLC, partial page writes are not allowed.

A Flash page cannot be overwritten without first erasing the whole block. This presents an interesting challenge to the FTL. It must be designed around this large obstacle in order to produce reasonable performance.

 

Managing Data Transfers and Storage with the Host

In scenarios of disk replacement, where a mechanical drive is replaced by an SSD, the Host has no knowledge of what medium the disk is made of. It transfers data to and from the disk in one or more Logical Blocks (using LBA). The mechanical disk was designed to handle this type of transfer by its physical make up of Cylinders, Heads and Sectors.

On the other hand, we have the SSD, using devices structured in a totally different way, and having many more restrictions than the mechanical media.

So, we have logical sectors sent into the drive, and these must be stored in the Flash at known locations, so that they can be transferred back to the Host on Read commands. This is where the critical function of Logical to Physical mapping comes in.

 

Logical to Physical Translation and Mapping

There are 2 basic types of LogToPhys mapping schemes. The first is Block Based Mapping. The second is Page (or even sub-page) Based Mapping. There are variants of these that further improve performance.

 

Block Based Mapping

In Block Based Mapping, the logical block address (LBA) is mapped to a Flash block number and a page offset within that block. This is relatively simple to implement. The mapping tables are relatively small, and a large part of the table can be held in RAM and flushed to Flash when transactions end.
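A block based translation can be sketched as two integer divisions and one table lookup. The geometry (4 KB pages, 64 pages per block) and the table contents are assumptions for illustration, and the names are hypothetical.

```python
SECTORS_PER_PAGE = 8    # assumed 4 KB page of 512-byte sectors
PAGES_PER_BLOCK = 64    # assumed block geometry
SECTORS_PER_BLOCK = SECTORS_PER_PAGE * PAGES_PER_BLOCK

# One table entry per block: logical block number -> physical block number.
block_map = {0: 37, 1: 5}   # hypothetical contents

def translate(lba):
    """Return (physical block, page offset, sector within page) for an LBA."""
    logical_block, sector_in_block = divmod(lba, SECTORS_PER_BLOCK)
    page, sector_in_page = divmod(sector_in_block, SECTORS_PER_PAGE)
    return block_map[logical_block], page, sector_in_page

print(translate(0))    # (37, 0, 0)
print(translate(520))  # (5, 1, 0): second logical block, page 1
```

Because the page offset is computed rather than stored, the table needs only one entry per block, which is why block based tables stay small.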

For sequential writing, performance is good until the Flash becomes full or a page must be overwritten. This requires a block erase and a data merge, slowing things down.

Random write performance is also poor, since the chance of hitting a block containing data is high. A block erase and data merge is required quite often.

Using MLC or TLC Flash for this type of mapping is not optimum. The restrictions of these Flash types, along with the slower write and block erase times add to the slow performance.

Write amplification, defined roughly as the number of Flash writes divided by the number of host writes, is quite high for this mapping scheme, especially for random writes. The ideal write amplification is as close to 1 as you can get. Values above 100 are not uncommon for Block Based Mapping.
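To see how such large values arise, consider a worst case in which the host overwrites one 512-byte sector and the resulting data merge rewrites the whole block. The block geometry below (64 pages of 4,096 bytes) is an assumed example and the function name is hypothetical.

```python
def write_amplification(flash_bytes_written, host_bytes_written):
    """Flash writes divided by host writes; 1.0 is the ideal."""
    return flash_bytes_written / host_bytes_written

SECTOR_BYTES = 512
BLOCK_BYTES = 64 * 4096   # assumed 256 KB block

# One sector overwritten, whole block rewritten by the merge.
print(write_amplification(BLOCK_BYTES, SECTOR_BYTES))  # 512.0

# A full block written sequentially needs no merge.
print(write_amplification(BLOCK_BYTES, BLOCK_BYTES))   # 1.0
```

Real workloads fall between these two extremes, but a random write pattern of small transfers pushes the figure toward the first case.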

 

In Summary, for Block Based Mapping:

The advantages are:

Small mapping tables- less Flash and Ram space used

Low data loss on power downs

No Garbage collection

Good Sequential and Random read performance

The disadvantages are:

High Write Amplification

Poor Random Write Performance

 

Page Based Mapping

What would happen if we can write sequentially to Flash pages as often as possible?

This can be made possible by using a page, or sub-page, mapping scheme. As Logical Blocks (LBAs) come into the disk, they are written sequentially to physical pages of the NAND Flash block, and entries are made in the mapping tables.

No immediate block erase is needed. The LogToPhys map is a map of pages across all the Flash, rather than a simpler block number and page offset map. As one can imagine, the mapping tables are large. Sub-page mapping is often used, down to the sector level, which expands the map even more. Where partial page writes are not allowed, as is the case with Flash other than SLC, space can be wasted.

What happens if this data is overwritten by the host at the logical level (LBA)? In the case of Block Based mapping, a block erase and data merge is required. With Page Based mapping, the next blank pages are used to write the new data, and no immediate erase is required. However, the pages holding the old data must be marked in the mapping tables as used (stale), so they cannot be written again until the block is erased. This produces fragmentation in the Flash: the old data pages are now garbage that must be cleaned up at some point. Garbage Collection takes care of this.
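The overwrite behavior can be sketched with a toy page-mapped FTL. This is a simplified model under stated assumptions: 4 pages per block, one logical page per physical page, and no garbage collection. The class name and the "free", "valid" and "stale" state labels are hypothetical.

```python
PAGES_PER_BLOCK = 4   # kept tiny so the example is easy to follow

class PageMappedFTL:
    """Toy page-mapped FTL: always writes to the next blank page."""

    def __init__(self, num_blocks):
        self.l2p = {}  # logical page -> (physical block, physical page)
        self.state = [["free"] * PAGES_PER_BLOCK for _ in range(num_blocks)]
        self.write_ptr = (0, 0)

    def write(self, logical_page):
        block, page = self.write_ptr
        if block >= len(self.state):
            raise RuntimeError("no blank pages left; garbage collection needed")
        old = self.l2p.get(logical_page)
        if old is not None:
            # The old copy cannot be reused until its block is erased.
            self.state[old[0]][old[1]] = "stale"
        self.state[block][page] = "valid"
        self.l2p[logical_page] = (block, page)
        page += 1
        if page == PAGES_PER_BLOCK:
            block, page = block + 1, 0
        self.write_ptr = (block, page)

ftl = PageMappedFTL(num_blocks=2)
for lp in [10, 11, 10, 12, 10]:   # logical page 10 is overwritten twice
    ftl.write(lp)

print(ftl.state[0])  # ['stale', 'valid', 'stale', 'valid']
print(ftl.state[1])  # ['valid', 'free', 'free', 'free']
print(ftl.l2p[10])   # (1, 0): the newest copy of logical page 10
```

Five host writes produced exactly five Flash page writes and no erase, but block 0 now holds two stale pages that occupy space until the block is reclaimed.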

Garbage collection is inevitable at some point. It requires block erases, data movement and map updates that can be large. This overhead can slow down the disk when Garbage Collection is invoked. If done properly, it has a smaller impact on speed.

This is often done in a background task, preferably using disk idle periods.
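One common way to choose what to collect is a greedy policy: reclaim the block with the most stale pages, since it yields the most free space for the least copying. The sketch below assumes that policy, which is not stated in the text, and uses hypothetical names and page state labels. The mapping table update for the moved pages is omitted.

```python
def pick_victim(blocks):
    """Index of the block with the most stale pages, or None if none exist."""
    best = max(range(len(blocks)), key=lambda b: blocks[b].count("stale"))
    return best if blocks[best].count("stale") > 0 else None

def collect(blocks, victim, spare):
    """Move valid pages from victim to spare, erase victim, return pages moved."""
    moved = 0
    for state in blocks[victim]:
        if state == "valid":
            slot = blocks[spare].index("free")  # next blank page in the spare
            blocks[spare][slot] = "valid"
            moved += 1
    blocks[victim] = ["free"] * len(blocks[victim])  # block erase
    return moved

blocks = [
    ["stale", "valid", "stale", "valid"],
    ["valid", "stale", "valid", "valid"],
    ["free", "free", "free", "free"],
]
victim = pick_victim(blocks)
moved = collect(blocks, victim, spare=2)

print(victim, moved)  # 0 2: block 0 chosen, two valid pages copied
print(blocks[0])      # ['free', 'free', 'free', 'free']
print(blocks[2])      # ['valid', 'valid', 'free', 'free']
```

The two copied pages are extra Flash writes that the host never requested, which is the write amplification cost of garbage collection. Running it during idle periods hides the time cost but not the wear.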

The speed advantages far outweigh this overhead. In fact, write amplification is much lower, and random write performance much higher, than with Block Based mapping.

As with any algorithm, there are trade-offs. Because of fragmentation, sequential read speeds can be lower, and often are. They do improve after a garbage collection, if it is done correctly. Also, since the map is so large and is usually cached in RAM, there is a higher risk of data loss during random power downs. This can be minimized by using additional techniques.

So, Page Based mapping is required for write intensive applications, such as data logging, and especially when using Flash other than SLC.

In summary, for Page Based Mapping:

The advantages are:

Low Write Amplification

Higher sequential and random write speeds

The disadvantages are:

Lower sequential read speeds

Garbage collection required

Large mapping tables using more FLASH and RAM space (Less user data space)

Higher Risk of data loss during random power downs

 

Article Contributor:

Carmine C. Cupani, MSEE

CTech Electronics LLC