Reliability of SSDs – technical advances in the latest generations

If you have been using computers for more than a few years, you may remember when SSDs first emerged onto the market. They were expensive, had small storage capacities, and were unreliable. In fact, a very vocal opposition to SSDs would remind anyone who would listen about their short lifetimes and warn that when your SSD failed, all of your precious data would be lost forever.

Since then, the cost of SSDs has dropped and capacities have soared (following Moore’s law closely), but whatever happened to those reliability concerns? Why aren’t forums full of posts from users whose SSDs just ‘bricked’, and Google full of articles about how to prevent it?

Figure: SSD prices over time
Source: https://www.extremetech.com/computing/153879-storage-pricewatch-hdds-back-to-pre-flood-prices-ssds-grow-as-gb-holds-steady

The short version is that technical improvements in flash memory controllers allow superior error correction, resulting in improved reliability and prolonged SSD life. In fact, SSDs now have expected lifespans similar to those of their HDD competitors, with many vendors offering 3- or 5-year warranties.

The remainder of this article is the long version.

When a consumer researches options for a new laptop or components to upgrade their existing drive, their choice is between an HDD (hard disk drive) and an SSD (solid-state drive). So, what actually are the differences?

HDDs use a spinning magnetic disk, a design that in its simplest form dates from the mid-1950s and has supported the computer industry for six decades. HDDs read and write by physically spinning a magnetized disk and positioning a read/write head over the correct location. HDD speed is fundamentally limited by the rotational speed of the disk, as there will always be a delay while the disk spins around before the desired data can be read or written. Moreover, the read/write head can only be in one place at a time, further limiting speed. Modern HDD controllers are fast, but not as fast as a solution with no moving parts such as solid-state memory.

The fundamental building blocks of solid-state memory are floating-gate transistors. Transistors were first used at scale in CPUs, but as their size and cost have continued to fall, they have become a viable option for storage. A floating-gate transistor is effectively a sandwich of semiconductor and insulator layers that can store discrete levels of charge even when powered down. These discrete charge levels encode one or more stored bits. Early SSDs used NOR flash, but since 2009 most SSDs have been built from NAND flash memory. (Some enterprise drives have recently started to use volatile DRAM instead, but I won’t go into a comparison.) In general, modern NAND-flash-based SSDs have advantages over HDDs that include higher performance, larger capacities, ultra-quiet operation, lower power consumption, and improved reliability.

The reliability claim is difficult to validate without comprehensive lifespan experiments and depends greatly on how one defines reliability. Hard drives need to survive the occasional drop, power outage, minor overheating incident etc. as well as aging gracefully. But what actually goes on inside a hard drive that causes it to age and eventually fail?

‘Data degradation’ is a broad term for the deterioration of data stored in any medium. Modern storage devices (i.e. HDDs and SSDs) are effectively immune to the physical material degradation that was once rife in paper media, which could literally rot (hence the term ‘bit rot’). Within digital storage devices, data degradation occurs randomly at the bit level, as so-called ‘soft’ errors. Errors in magnetic media occur when individual bits lose their magnetic orientation or have it disrupted, whereas errors in NAND flash memory occur because the inside of a NAND chip is very noisy, the signals are weak, and the insulation is imperfect. Although individual bit flips are often not fatal, the gradual accumulation of these errors can corrupt the data.

As consumer expectations and NAND flash technology have evolved, manufacturers have consistently tried to develop products with larger storage capacities. There are two basic methods to do so:

  1. physically squeeze more transistors into the space, or
  2. use each transistor to store more bits of information.

Together, these two methods have made NAND flash the cheapest memory on the market. However, both also increase the rate at which soft errors occur. Smaller transistors placed closer together, and thus separated by less insulating material, are more likely to have their charge dissipate. If that charge encodes more bits, a small charge dissipation is more likely to be read incorrectly as a neighboring discrete state (distinguishing 0 from 3 is easier than distinguishing among 0, 1, 2 and 3).
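
The effect of squeezing more levels into the same charge range can be illustrated with a rough model. The sketch below assumes a charge window normalized to 1.0, evenly spaced levels, and Gaussian noise with a hypothetical standard deviation `sigma`. None of these values come from a real NAND datasheet; only the trend matters.

```python
import math

def misread_probability(bits_per_cell: int, sigma: float, window: float = 1.0) -> float:
    """Approximate chance that noise pushes a cell's charge across a decision threshold.

    Assumes 2**bits_per_cell evenly spaced levels across `window` and Gaussian
    noise with standard deviation `sigma` (both hypothetical, illustrative values).
    """
    levels = 2 ** bits_per_cell
    spacing = window / (levels - 1)      # distance between neighboring levels
    margin = spacing / 2                 # distance from a level to the threshold
    # One-sided Gaussian tail: P(noise > margin)
    tail = 0.5 * math.erfc(margin / (sigma * math.sqrt(2)))
    # Inner levels can be misread in two directions, the two outer levels in one
    average_sides = (2 * levels - 2) / levels
    return average_sides * tail

for bits, name in [(1, "SLC"), (2, "MLC"), (3, "TLC")]:
    print(f"{name}: {misread_probability(bits, sigma=0.05):.2e}")
```

With the same noise, each extra bit per cell shrinks the margin between neighboring states, so the estimated misread probability rises sharply.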

Further compounding the problem of soft errors, the physical material of each transistor deteriorates with each program/erase (P/E) cycle. High voltage is used to erase the current state and allow new data to be written. As each individual transistor degrades, its ability to reliably hold a given charge diminishes and its soft error rate rises. The culmination of degradation occurs when the transistor fails to hold any charge (referred to as a ‘hard’ error) and the memory cell becomes unusable.

Neither SSDs nor HDDs are perfect

Both HDDs and SSDs use error correction codes (ECC) to correct soft errors and ensure reliability. In an aging SSD, ECC also allows cells to weaken yet still be read reliably until a hard error occurs. For this reason, NAND flash memory is rated for a number of P/E cycles after which the first irreparable hard error is expected (typically 100,000 and 10,000 for single-level and multi-level NAND flash respectively). ECC corrects soft errors in place and does not require the data to be rewritten. Therefore, ECC improves data reliability and lengthens the life of the storage device by subjecting each cell to fewer P/E cycles.

The principles of ECC are the same within both HDDs and SSDs (and many other forms of data transmission). Storage devices use parity bits that are recorded on the drive along with the raw data. ECC works by recomputing the parity from the data the controller reads and comparing it with the stored parity bits; if the two do not match, the ECC algorithm tries to correct the data. Sophisticated ECC strategies employ a variety of methods to determine what the correct data is likely to be. If ECC cannot recover the correct data, a replication scheme may be needed to retrieve the correct data from a backup.
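
The parity idea can be illustrated with a Hamming(7,4) code, which stores 3 parity bits alongside every 4 data bits and can correct any single flipped bit. This is a minimal teaching sketch only: it is not the BCH or LDPC code an SSD controller uses, and the function names are hypothetical.

```python
def hamming74_encode(data):
    """Encode 4 data bits as a 7-bit codeword laid out as p1 p2 d1 p3 d2 d3 d4."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4   # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # covers positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4   # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_correct(codeword):
    """Recompute the parity checks and repair at most one flipped bit.

    Returns (data_bits, syndrome). A syndrome of 0 means every check passed;
    otherwise it is the 1-based position of the bit that was flipped back.
    """
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]], syndrome

stored = hamming74_encode([1, 0, 1, 1])
stored[4] ^= 1                      # simulate a soft error in one cell
data, syndrome = hamming74_correct(stored)
print(data, syndrome)
```

The mismatch between stored and recomputed parity (the syndrome) both detects the error and points at the bit to repair. Two or more flips in the same codeword would defeat this small code, which is why storage devices use much stronger codes.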

There are several common mechanisms for ECC in storage devices. The TL;DR version is that the transition from using Bose-Chaudhuri-Hocquenghem (BCH) ECC to low-density parity-check (LDPC) has greatly improved SSD reliability and lifetime.

BCH

BCH error correction worked well for large-geometry NAND flash, but as the industry moved towards much smaller and denser NAND flash (which, as described previously, is more susceptible to ‘soft’ errors), BCH no longer sufficed. The industry has moved towards LDPC error correction, which can correct more errors per page for the same ratio of data bits to parity bits. Large-capacity, long-lifespan SSDs would not be possible without LDPC.

LDPC

LDPC was originally proposed by Robert Gallager in the early 1960s. But it wasn’t until the 1990s (after NAND flash was already being deployed with BCH error correction) that LDPC was rediscovered; it was later applied in satellite TV, Ethernet, and Wi-Fi. There are two types of decoding in LDPC: hard-decision and soft-decision. An LDPC system starts with hard-decision decoding, which is faster and requires fewer resources. Hard-decision decoding can correct the majority of errors but can be overwhelmed by many errors at once. Soft-decision decoding then intervenes and re-reads the cells, often with more fine-grained reference voltages. Soft-decision decoding has only become viable in the latest generations of NAND flash, but it allows LDPC to operate in much noisier environments and correct more errors.

There are variations in soft-decision LDPC algorithms between NAND flash controller manufacturers, but the fundamental principle is that the memory cell is no longer treated as being in a discrete state encoding one or more bits. Instead, the charge is treated as analog, and a probabilistic estimate of how likely that charge is to correspond to each of the discrete levels is used to correct errors. Doing so requires a lot of digital signal processing, much of which is proprietary and protected by the controller manufacturers. Because of the added sophistication, LDPC ECC circuits are slower, larger, require more RAM, and consume more power than those for BCH. Naturally, the development of complex ECC systems drives up the price of NAND flash controllers and thus SSDs. However, consumers and, even more so, enterprise customers would not accept irreparable component failures within the typical computer lifetime of 3-5 years.
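
To make the difference between a hard and a soft read concrete, the sketch below assumes a single-bit cell whose two states have hypothetical mean read voltages of 1.0 V (bit 1) and 3.0 V (bit 0), with Gaussian noise of 0.5 V. These numbers and function names are illustrative assumptions; real controllers use proprietary models. The soft value is a log-likelihood ratio (LLR), the kind of per-bit confidence a soft-decision decoder takes as input.

```python
MEAN_ONE = 1.0    # hypothetical mean read voltage for a stored 1
MEAN_ZERO = 3.0   # hypothetical mean read voltage for a stored 0
SIGMA = 0.5       # hypothetical noise standard deviation

def hard_decision(voltage: float) -> int:
    """Classic read: compare against a single threshold halfway between states."""
    return 0 if voltage > (MEAN_ONE + MEAN_ZERO) / 2 else 1

def bit_llr(voltage: float) -> float:
    """ln(P(voltage | bit=0) / P(voltage | bit=1)) under Gaussian noise.

    Positive values favor 0, negative values favor 1, and the magnitude
    expresses how confident the read is.
    """
    return ((voltage - MEAN_ONE) ** 2 - (voltage - MEAN_ZERO) ** 2) / (2 * SIGMA ** 2)

for v in (2.9, 2.1, 1.9):
    print(f"{v:.1f} V -> hard bit {hard_decision(v)}, LLR {bit_llr(v):+.1f}")
```

Reads of 2.9 V and 2.1 V give the same hard bit, but the LLR shows the second is far less certain. A soft-decision decoder can use that information to decide which bits to distrust when the parity checks fail.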

SSD manufacturers also increase SSD lifespan through other techniques. One of the most common is wear leveling, where the controller writes to every NAND cell once before writing to any cell a second time. Wear leveling ensures that the entire drive wears and ages together, preventing particular cells from enduring heavy use while others remain untouched. A second method is external data buffering, which refers to a set of strategies that use RAM to reduce the number of NAND writes. Many SSDs are also over-provisioned; NAND chips typically have about 4% more memory than stated, for use by the controller and to allow for some cells to wear out. (This is typically the reason why SSDs are sold with capacities such as 120 GB rather than 128 GB, or 240 GB rather than 256 GB.) Employing LDPC ECC and these other strategies has resulted in SSDs with improved reliability and lifespans comparable to those of HDDs.
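
The wear-leveling idea can be sketched as a toy allocator that always hands out the least-worn block. This is a simplified illustration under stated assumptions, not how any particular controller works: the `WearLeveler` class is hypothetical, and real firmware also tracks logical-to-physical mappings, garbage collection, and data that rarely changes.

```python
import heapq

class WearLeveler:
    """Toy allocator: each write goes to the block with the fewest erases so far."""

    def __init__(self, num_blocks: int):
        # Min-heap of (erase_count, block_id) so the least-worn block is on top
        self._heap = [(0, block) for block in range(num_blocks)]
        heapq.heapify(self._heap)

    def write(self) -> int:
        """Pick the least-worn block, charge it one P/E cycle, and return its id."""
        erases, block = heapq.heappop(self._heap)
        heapq.heappush(self._heap, (erases + 1, block))
        return block

    def erase_counts(self) -> dict:
        """Map of block id to the number of P/E cycles it has absorbed."""
        return {block: erases for erases, block in self._heap}

leveler = WearLeveler(num_blocks=4)
for _ in range(10):
    leveler.write()
print(leveler.erase_counts())
```

Because every block is used once before any block is used twice, the erase counts never differ by more than one, which is the "wears and ages together" property described above.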

Only a few companies manufacture NAND flash controllers

There are relatively few NAND flash controller manufacturers, far fewer than SSD vendors. When there is a large market but a small number of manufacturers, vendors often describe their own products as better than the competition without quoting quantitative performance or lifetime figures that could later be used against them. One of the only metrics a consumer can compare across models is the rated number of P/E cycles.

Consumers are often frightened by the lower 10,000-cycle rating of multi-level units compared to the 100,000 cycles of single-level units. However, 10,000 cycles is more than sufficient for most users: it corresponds to over 25 years of writing and erasing the entire capacity of the drive once per day. That said, P/E cycle ratings are likely not indicative of the quality and reliability of a particular SSD model. Because of the importance of ECC, purchasing SSDs with the latest controllers developed by a top brand may result in more reliable and longer-lasting SSDs.
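
The 25-year figure follows from simple arithmetic, sketched below. It assumes ideal wear leveling (every cell receives the same number of cycles) and ignores write amplification, so it is an upper-bound estimate rather than a prediction for any real drive. The function name is hypothetical.

```python
def years_of_endurance(pe_cycles: int, full_drive_writes_per_day: float = 1.0) -> float:
    """Years until the rated P/E cycles are used up at a given daily write volume."""
    days = pe_cycles / full_drive_writes_per_day
    return days / 365.25

print(round(years_of_endurance(10_000), 1))    # multi-level rating: 27.4
print(round(years_of_endurance(100_000), 1))   # single-level rating: 273.8
```

Few consumers rewrite their whole drive every day, so in practice the rated cycles stretch even further under these assumptions.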

The transition to LDPC error correction has also enabled a transition to triple-level cell (TLC) NAND from multi-level cell (MLC). TLC stores 3 bits per transistor and thus can store 50% more information in a given number of transistors and space than MLC (2 bits per transistor). TLC SSDs have significant savings in $/GB making SSDs even more accessible to consumers.

Do SSDs fail?

A related concern about early SSDs was that if your SSD did fail, it would be bricked and you would not be able to recover your data. NAND SSDs do sometimes fail and the root of that failure may occur in the NAND itself, the controller, or the controller’s firmware. However, in many cases at least some of your data should be recoverable.

Modern SSDs might not be perfect, but they are amazing storage devices. However, they are certainly not all made equal. That unbranded SSD you found online for a bargain might actually use slow NAND and an outdated controller, and will likely not be worthwhile. Stay informed, purchase wisely, and back up your data!

Vincent Major, New York University
