Ph.D. in Computer Science at Rutgers University

Memory Hierarchy - Disk

As we know, SRAM and DRAM are volatile storage in a computer system. By contrast, a disk is a persistent storage medium that retains data even if the system is powered off for years.

Hard Disk Drive

A hard disk drive (HDD) stores digital information in magnetic material on disks called platters. The platters rotate continuously, and magnetic heads, mounted on a moving actuator arm and positioned just above each platter surface, read and write data on those surfaces.


The picture above illustrates the structure of a hard disk drive in more detail. A modern hard disk drive typically contains several platters, and both sides of each platter are used. On a single side of a platter, data are laid out in concentric circles called tracks, and each track is divided into sectors, the basic unit of a hard disk drive. Data are always stored and retrieved by "cylinder": the tracks with the same diameter on all platter surfaces form a cylinder. As the two pictures below show, locating a particular area of a hard disk drive requires its coordinates, just as in three-dimensional space. CHS addressing first locates the cylinder, then the head, and finally the sector.
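
The CHS coordinates described above map onto a single linear sector number with a well-known conversion. The following is a minimal sketch; the function names and the example geometry (16 heads, 63 sectors per track) are assumptions chosen for illustration, not taken from a specific drive.

```python
def chs_to_lba(cylinder, head, sector, heads_per_cylinder, sectors_per_track):
    """Map a (cylinder, head, sector) triple to a linear block address.

    Cylinders and heads are numbered from 0; sectors are numbered from 1.
    """
    if not 0 <= head < heads_per_cylinder:
        raise ValueError("head out of range")
    if not 1 <= sector <= sectors_per_track:
        raise ValueError("sector out of range")
    return (cylinder * heads_per_cylinder + head) * sectors_per_track + (sector - 1)


def lba_to_chs(lba, heads_per_cylinder, sectors_per_track):
    """Inverse mapping: recover (cylinder, head, sector) from a linear address."""
    cylinder, rest = divmod(lba, heads_per_cylinder * sectors_per_track)
    head, sector_index = divmod(rest, sectors_per_track)
    return cylinder, head, sector_index + 1


# Hypothetical geometry: 16 heads, 63 sectors per track.
print(chs_to_lba(0, 1, 1, 16, 63))   # first sector under the second head
print(lba_to_chs(1008, 16, 63))      # first sector of the second cylinder
```

Note that consecutive linear addresses use up every head of one cylinder before the cylinder number increases, which is the "store by cylinder" ordering.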


From the two pictures above, we can see why data is always stored by cylinder. Consider what would happen if the data were stored platter by platter, as intuition might suggest. As we know, the hard disk is the slowest part of a computer system (except for tape :) because a head seek takes up to several milliseconds. Each time a track fills up, the head must move to another track. If the data is stored by cylinder instead, once a track fills up, the drive simply switches to another head on a track in the same cylinder, which avoids moving the arm.

While the hard disk is working, the platters spin all the time. Once the head is positioned over the target track, it starts reading data. There is a problem, however: after the head finishes reading the current sector, the next sector may already have passed under it, because the platter spins faster than the head can get ready and read. To solve this problem, engineers from IBM interleaved the sectors on each track, as the following picture shows.
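
Sector interleaving can be sketched as a simple placement rule: logically consecutive sectors are written a fixed number of physical slots apart. The function name and the interleave values below are assumptions for illustration; real drives chose the factor to match their controller speed.

```python
def interleaved_layout(sectors_per_track, interleave):
    """Return the logical sector number stored in each physical slot.

    With interleave 1 the sectors are laid out consecutively; with
    interleave 2 every other physical slot is skipped, and so on.
    """
    layout = [None] * sectors_per_track
    pos = 0
    for logical in range(sectors_per_track):
        # If the target slot is already taken, slide to the next free one.
        while layout[pos] is not None:
            pos = (pos + 1) % sectors_per_track
        layout[pos] = logical
        pos = (pos + interleave) % sectors_per_track
    return layout


print(interleaved_layout(8, 1))  # [0, 1, 2, 3, 4, 5, 6, 7]
print(interleaved_layout(8, 2))  # [0, 4, 1, 5, 2, 6, 3, 7]
```

With an interleave of 2, one physical sector passes under the head between logical sector n and logical sector n+1, which gives the electronics time to get ready for the next read.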


Native Command Queuing (NCQ) is an extension of the Serial ATA protocol allowing hard disk drives to internally optimize the order in which received read and write commands are executed. This can reduce the amount of unnecessary drive head movement, resulting in increased performance (and slightly decreased wear of the drive) for workloads where multiple simultaneous read/write requests are outstanding, most often occurring in server-type applications.


NCQ allows the drive itself to determine the optimal order in which to retrieve outstanding requests. This may, as the picture above illustrates, allow the drive to fulfill all requests in fewer rotations and thus less time.
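
The benefit of reordering can be illustrated with a toy model. This sketch assumes all outstanding requests sit on the same track and ignores seek and transfer time; the function names are hypothetical, and real NCQ firmware uses proprietary scheduling that also takes head movement into account.

```python
def rotations_needed(start, requests, sectors_per_track):
    """Platter rotations needed to serve the requests in the given order."""
    pos, steps = start, 0
    for target in requests:
        # Sectors that must pass under the head before the target arrives.
        steps += (target - pos) % sectors_per_track
        pos = target
    return steps / sectors_per_track


def reorder_by_rotation(start, requests, sectors_per_track):
    """Serve requests in the order they pass under the head."""
    return sorted(requests, key=lambda s: (s - start) % sectors_per_track)


queue = [6, 2, 7, 3]                            # arrival order of requested sectors
print(rotations_needed(0, queue, 8))            # 2.375 rotations in arrival order
best = reorder_by_rotation(0, queue, 8)
print(best, rotations_needed(0, best, 8))       # [2, 3, 6, 7] 0.875
```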

Factors on hard disk performance:

Solid State Drive

If you were a Windows 7 user, you probably remember that the Windows Experience Index was capped at 5.9 as long as the system ran on a hard disk drive, no matter how powerful your CPU and GPU were or how much RAM you had. In a computer system, everything "communicates" at the speed of electrons except the hard disk drive, which has mechanical parts. Thus, the emergence of the solid state drive (SSD) broke the bottleneck of secondary storage performance.

Hardware Structure

First of all, let's look at what goes on inside a solid state drive. Unlike rotating hard disks, which read and write data magnetically, an SSD reads and writes to a medium called NAND flash memory. Flash memory is non-volatile, which means that it doesn't lose its contents when it loses power like the DRAM used in your computer's memory does. It's the same stuff that lives inside your smartphones, mp3 players, and little USB thumb drives, and it comes from the same assembly lines. What makes an SSD so much faster than a thumb drive is a combination of how the NAND chips are addressed, and the various caching and computing shortcuts that the SSD's built-in controller uses to read and write the data.

Flash memory's non-volatility comes from the type of transistor used in its makeup, namely the floating gate transistor. Normal transistors are simple things; they're essentially just electronically controlled switches. Volatile memory, like a computer's RAM, uses a transistor coupled with a capacitor to indicate a zero or a one. The transistor is used to transfer charge to or drain charge from the capacitor, and that charge must be refreshed every few milliseconds. A floating gate transistor, on the other hand, is more than just a switch, and doesn't have a needy external capacitor to hold a charge. Rather, a floating gate transistor creates a tiny cage (called the floating gate), and then encourages electrons to migrate into or out of that cage using a particular kind of quantum tunneling effect. The charge those electrons represent stays trapped inside the cage, regardless of whether the computer it's in is currently drawing power.


The pictures above illustrate the structure of a NAND cell. The floating gate is surrounded by an oxide layer. The control gates are connected to word lines (WL), while the drains are connected to bit lines (BL) and the sensing circuitry. For a read operation, a specific WL is selected and a voltage is applied to the drain; the state of the floating gate is then read and interpreted by the sensing circuitry. For a write operation, a high voltage is applied to the control gate and to the drain, which energizes the electrons in the channel so that some of them are injected into the floating gate through the tunnel oxide.

A cell as described here stores only 1 bit of information and is known as SLC (Single-Level Cell). This made SSDs extremely expensive, and not affordable for the consumer storage market, when they first came out. To decrease the cost, MLC (Multi-Level Cell) flash, which stores 2 bits per cell, was invented. Since an MLC cell has 4 states, it uses four voltage levels for 00, 01, 10 and 11. Nowadays TLC (Triple-Level Cell) flash, which stores 3 bits per cell, is widely used, and SSDs have become much less expensive than they used to be.

Each cell has a maximum number of P/E (Program/Erase) cycles, after which the cell is considered defective. In other words, NAND flash memory wears out and has a limited lifespan, and the different types of NAND flash have different lifespans. Typically, SLC lasts about 100,000 P/E cycles on average, MLC lasts 5,000 - 10,000 P/E cycles, and TLC lasts fewer still. In addition, the number of P/E cycles depends on the transistor process technology; for instance, MLC made on a 25nm process will endure more P/E cycles than MLC made on a 20nm process.
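
The relationship between bits per cell and the number of voltage levels a cell must distinguish is simply a power of two. A minimal sketch (the function name is an assumption, and the order in which real chips assign bit patterns to voltage levels is not shown here):

```python
def cell_states(bits_per_cell):
    """List every bit pattern a cell with the given number of bits can hold."""
    return [format(i, f"0{bits_per_cell}b") for i in range(2 ** bits_per_cell)]


for name, bits in (("SLC", 1), ("MLC", 2), ("TLC", 3)):
    states = cell_states(bits)
    print(f"{name}: {len(states)} levels -> {states}")
```

Each extra bit doubles the number of levels packed into the same voltage range, which is one way to see why denser cells tolerate fewer P/E cycles.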

NAND cells are arranged in an array. The picture below shows a typical flash cell array. Taking SLC as an example, the cells along a bit line are connected in series, while each word line runs across many bit lines and connects the control gates of one row of cells. When a word line is selected, all the bits in that row can be accessed in parallel.


Cells are grouped into a grid, called a block, and blocks are grouped into planes. The smallest unit through which a block can be read or written is a page. Pages cannot be erased individually; only whole blocks can be erased. The size of a NAND flash page varies, and most drives have pages of 2 KB, 4 KB, 8 KB or 16 KB. Most SSDs have blocks of 128 or 256 pages, which means that the size of a block can vary between 256 KB and 4 MB. For example, the Samsung SSD 840 EVO has blocks of 2048 KB, and each block contains 256 pages of 8 KB each.
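
The block sizes quoted above follow directly from multiplying the page count by the page size. A quick check, with a hypothetical helper name:

```python
KB = 1024


def block_size_bytes(pages_per_block, page_size_kb):
    """Size of one erase block in bytes."""
    return pages_per_block * page_size_kb * KB


print(block_size_bytes(128, 2) // KB)    # smallest combination, in KB
print(block_size_bytes(256, 16) // KB)   # largest combination, in KB
print(block_size_bytes(256, 8) // KB)    # the 840 EVO figures from the text, in KB
```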


The picture right above shows the architecture of an SSD. The RAM buffer stores the pages that need to be read or written. The SSD controller plays a very important role: it is the heart of the SSD, just as the CPU is the heart of a computer. The SSD controller manages the traffic inside the flash arrays and on the external SATA or PCIe interface. In addition, as I mentioned, flash cells have limited P/E cycles, so wear leveling is also the responsibility of the SSD controller.

I/O operations in Low-level View

Now let's see what the SSD will do when read and write operations occur:

When a read signal is raised in the SSD controller, the voltage of the word line of the selected page is set to 0, while the word lines of the other pages are set to a higher voltage. This voltage must not be too large, otherwise electrons could be pushed into the floating gates. After the word line is selected, all the bits are read from the floating gates and stored in the page buffer (RAM buffer).

For the write operation, the smallest unit that can be written is a page, but the smallest unit that can be erased is a block. In addition, an SSD cannot overwrite existing data in place as an HDD does. Let's recall the internal structure of an HDD, which is magnetic storage. If an HDD needs to modify the data in a sector, the read/write head first locates the sector and then overwrites it directly. If the original value of a bit is S and the new value is also S, it remains unchanged; if the new value is N, the bit is remagnetized to N. An SSD, however, must first erase the whole block to all 1s and only then write the data. That is to say, since the cells in a page must be erased before the page can be written again, but only entire blocks can be erased, an overwrite initiates a read-erase-modify-write cycle: the contents of the entire block are stored in cache, then the entire block is erased from the SSD, then the overwritten page is written to the cached block, and only then can the entire updated block be written to the flash medium.
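
The read-erase-modify-write cycle described above can be sketched with a toy block model. The `Block` class and `overwrite_page` function are hypothetical names for illustration, and `None` stands in for an erased (all-ones) page; a real controller does this work in firmware.

```python
ERASED = None  # stand-in for a page whose cells are all 1s


class Block:
    def __init__(self, pages_per_block):
        self.pages = [ERASED] * pages_per_block
        self.erase_count = 0

    def program(self, index, data):
        # A page can only be programmed while it is in the erased state.
        if self.pages[index] is not ERASED:
            raise ValueError("page must be erased before it is programmed")
        self.pages[index] = data

    def erase(self):
        # Erasure always covers the whole block and costs one P/E cycle.
        self.pages = [ERASED] * len(self.pages)
        self.erase_count += 1


def overwrite_page(block, index, data):
    """Naive in-place overwrite; returns the number of pages programmed."""
    cached = list(block.pages)        # 1. read the whole block into cache
    block.erase()                     # 2. erase the whole block
    cached[index] = data              # 3. modify the page in the cached copy
    programmed = 0
    for i, page in enumerate(cached): # 4. write the updated block back
        if page is not ERASED:
            block.program(i, page)
            programmed += 1
    return programmed


blk = Block(4)
blk.program(0, "a")
blk.program(1, "b")
print(overwrite_page(blk, 0, "A"), blk.pages, blk.erase_count)
```

Changing one page here costs one block erase plus a rewrite of every page that held data.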

Drawbacks and Trade-off in SSD

The SSD seems to have an overwhelming advantage over the HDD because of its low access latency. However, it still has some problems and trade-offs in design and manufacture.

Write Amplification

As we just learned above, to modify a single page, the SSD has to erase the whole block before writing the updated block back. This multiplying effect increases the number of writes required over the life of the SSD, which shortens the time it can reliably operate. The increased writes also consume bandwidth to the flash memory, which mainly reduces the random write performance of the SSD. All SSDs have a write amplification value, and it is based on both what is currently being written and what was previously written to the SSD. In order to accurately measure the value for a specific SSD, the selected test should be run for enough time to ensure the drive has reached a steady state condition. A simple formula to calculate the write amplification of an SSD is:

write amplification = (data written to the flash memory) / (data written by the host)
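
Write amplification is conventionally expressed as the ratio of the data actually written to flash to the data the host asked to write. The sketch below applies that ratio to a worst-case example; the function name and the 256-page, 8 KB geometry are assumptions for illustration.

```python
def write_amplification(flash_bytes_written, host_bytes_written):
    """Ratio of data written to flash over data written by the host."""
    if host_bytes_written <= 0:
        raise ValueError("host must have written some data")
    return flash_bytes_written / host_bytes_written


# Hypothetical worst case: the host rewrites one 8 KB page inside a fully
# valid 256-page block, and the drive rewrites the whole block.
page_kb, pages_per_block = 8, 256
print(write_amplification(pages_per_block * page_kb, page_kb))
```

A value of 1.0 would mean the drive writes exactly what the host requested; anything higher is extra wear.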


Cell Wear Off

We already know that a NAND cell has a limited number of P/E cycles, after which it is no longer usable. In fact, once a single cell in a page is worn out, the whole page is marked as broken, and the logical address mapped to this page is redirected to another page.

Since cells have limited P/E cycles, why must the whole block be erased before an overwrite? The reasons are circuit interference and granularity. Consider a single page in which some bytes are being erased while others are being written: the two operations would interfere with each other and neither would succeed. And if erasure were performed page by page instead, the circuitry would have to be much more complex.

Problem Solver

There are several techniques to minimize write amplification and to level out cell wear.

Garbage Collector

At the file system level, some files are overwritten frequently, so the cells that hold them would wear out very soon. To avoid this, updated data is written to a new location rather than back to the same block. The previous location is marked as garbage and erased later, and the logical address that was mapped to it is redirected to the new location. This method not only minimizes write amplification but also levels out cell wear.
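
The redirection described above can be sketched as a tiny logical-to-physical mapping table. `SimpleFTL` is a hypothetical name, and the sketch only covers out-of-place writes and marking stale pages; the later erase step and any real controller's policies are left out.

```python
class SimpleFTL:
    """Toy flash translation layer with out-of-place writes."""

    def __init__(self, num_pages):
        self.flash = [None] * num_pages   # physical pages
        self.mapping = {}                 # logical page -> physical page
        self.stale = set()                # physical pages holding garbage
        self.next_free = 0

    def write(self, logical, data):
        if self.next_free >= len(self.flash):
            raise RuntimeError("no free pages; garbage collection needed")
        old = self.mapping.get(logical)
        if old is not None:
            self.stale.add(old)           # old copy becomes garbage
        self.flash[self.next_free] = data # new copy goes to a fresh page
        self.mapping[logical] = self.next_free
        self.next_free += 1

    def read(self, logical):
        return self.flash[self.mapping[logical]]


ftl = SimpleFTL(num_pages=4)
ftl.write(7, "v1")
ftl.write(7, "v2")                        # same logical page, new physical page
print(ftl.read(7), ftl.mapping, ftl.stale)
```

The host keeps using logical page 7 throughout, while the physical location changes underneath it.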

As I mentioned in the last section, the SSD controller is the heart of the SSD. The algorithms for wear leveling and redirection differ from one SSD controller to another. Thus, the controller is a key point to consider when you choose an SSD.

TRIM Instruction

TRIM is a command in the ATA command set that allows the OS to inform an SSD which blocks of data are no longer considered in use and can be erased. Trimming enables the SSD to handle garbage collection more efficiently, which would otherwise slow future write operations to the involved blocks.

Because many file systems handle delete operations by merely flagging data blocks as "not in use", storage media (SSDs, but also traditional hard drives) generally DO NOT know which sectors/pages are truly in use and which can be considered free space. Unlike, for example, an overwrite operation, a delete does not involve a physical write to the sectors that contain the data. Since a common SSD has no knowledge of the file system structures, including the list of unused blocks/sectors, the storage medium remains unaware that the blocks have become available. While this often enables undelete tools to recover files from electromechanical hard disks, despite the files being reported as "deleted" by the operating system, it also means that when the operating system later performs a write operation to one of the sectors it considers free space, it effectively becomes an overwrite operation from the point of view of the storage medium.

The TRIM command enables an operating system to notify the SSD of pages which no longer contain valid data. For a file deletion operation, the operating system marks the file's sectors as free for new data and then sends a TRIM command to the SSD. After trimming, the SSD does not preserve any contents of the trimmed pages when writing new data to a page of flash memory, resulting in less write amplification (fewer writes) and higher write throughput (no need for a read-erase-modify sequence), and thus a longer drive life.
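
The effect of TRIM on garbage collection can be sketched by counting the pages the controller must copy elsewhere before it can erase a block. The function name and the page numbers are assumptions for illustration; this is a toy model, not the ATA command itself.

```python
def pages_to_relocate(block_pages, stale, trimmed=frozenset()):
    """Pages the controller must copy out before erasing the block."""
    return [p for p in block_pages if p not in stale and p not in trimmed]


block = range(8)          # physical pages in one block
stale = {0, 1}            # pages the drive already knows were superseded
deleted = {2, 3, 4}       # pages of a file the OS has deleted

# Without TRIM the drive still treats the deleted file's pages as valid.
print(len(pages_to_relocate(block, stale)))            # 6
# With TRIM the OS has told the drive those pages can be dropped.
print(len(pages_to_relocate(block, stale, deleted)))   # 3
```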


In order to keep the GC and redirection mechanisms working, an SSD can never be completely full. Most SSDs therefore set aside some reserved blocks for write redirection, and neither users nor the operating system are aware of their existence.
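
The size of this reserved area is commonly expressed as a ratio of the hidden capacity to the user-visible capacity. The sketch below uses that common definition; the function name and the 128 GB / 120 GB figures are assumptions for illustration, not the specification of a particular drive.

```python
def over_provisioning(physical_capacity, user_capacity):
    """Reserved space as a fraction of the user-visible capacity."""
    if user_capacity <= 0 or physical_capacity < user_capacity:
        raise ValueError("capacities must satisfy 0 < user <= physical")
    return (physical_capacity - user_capacity) / user_capacity


# Hypothetical drive: 128 GB of flash, 120 GB visible to the user.
print(f"{over_provisioning(128, 120):.1%}")   # 6.7%
```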