
I/O

In the Computer Architecture catalog I posted a Memory Hierarchy series that covered the hardware structure of disks. This post moves up a level and discusses I/O from the operating system's perspective: first RAID, then interrupts and DMA.


I/O Basic Concepts

I/O, abbreviation for Input/Output, is an important part of the modern Von Neumann architecture. Memory forms a hierarchy: the faster the access speed, the smaller the capacity and the higher the cost. External storage (disk/tape) enables a computer to keep data when the power is off. However, accessing the disk is slow because the data has to go through the ICH (I/O Controller Hub), previously known as the South Bridge chipset, via the DMI bus (5 GB/s in DMI 2.0 and 8 GB/s in DMI 3.0). Compared with disk access, accessing RAM is much faster because RAM is connected to the CPU directly via the memory bus (12.8 GB/s for 1600 MHz DDR3 RAM).

[Figure: Intel Nehalem (left) and Skylake (right) platform architectures]

The two pictures above show the architectures of Intel Nehalem and Skylake. Nehalem moved the memory controller into the CPU; from then on, the IOH (also known as the North Bridge chipset) was phased out, and everything that used to live in the IOH is now inside the CPU. That is why there is no IOH in the Skylake diagram.


Redundant Array of Independent Disks

RAID (redundant array of independent disks) is a data storage virtualization technology that combines multiple physical disk drive components into a single logical unit for the purposes of data redundancy, performance improvement, or both. Data is distributed across the drives in one of several ways, referred to as RAID levels, depending on the required level of redundancy and performance.

RAID Implementation

RAID can be implemented in either software or hardware. For software-based RAID there are a number of solutions, for example Linux md (administered with mdadm), LVM, and ZFS's built-in RAID-Z.

For hardware-based RAID, there is typically a dedicated RAID controller. During the early stage of system boot-up, the controller's firmware is in charge of RAID control; once the operating system has been fully loaded into RAM, the device driver takes over. Though hardware-based RAID brings additional cost, it is more efficient than software-based RAID because the RAID processing is offloaded from the host CPU.

RAID 0 Striping

RAID 0 splits data evenly across multiple disks: a set of data is divided into blocks, and these blocks are evenly distributed over the disks in a RAID 0 array. The left diagram below shows the structure of RAID 0.

[Figure: RAID 0 (left) and RAID 1 (right) layouts]

As the diagram illustrates, a set of data is divided into 8 numbered blocks. Odd-numbered blocks are stored on disk 0 and even-numbered blocks on disk 1. From this structure we can deduce that writes to a RAID 0 array can proceed in parallel, since the RAID controller splits the data and each disk performs its own I/O operation. Theoretically, RAID 0 can therefore write faster than a single disk.

The stripe length and stripe depth of a RAID 0 array are important. It is intuitive to conclude that the larger the stripe length, the higher the read and write throughput of RAID 0; however, this can lead to poor I/O concurrency, because each I/O operation then touches all the disks. To increase I/O concurrency, the stripe depth should not be too small, so that a small request can be served by a single disk.
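
To make the striping concrete, here is a minimal sketch of the address math a RAID 0 controller performs, assuming a stripe depth measured in whole blocks (the function and parameter names are mine, not from any particular implementation):

```c
#include <stdio.h>

/* Map a logical block number to (disk, offset) in a RAID 0 array.
 * num_disks:    number of member disks
 * stripe_depth: consecutive blocks written to one disk before
 *               moving on to the next (the chunk size, in blocks) */
static void raid0_map(unsigned long lba, unsigned num_disks,
                      unsigned stripe_depth,
                      unsigned *disk, unsigned long *offset)
{
    unsigned long chunk  = lba / stripe_depth;  /* which chunk overall   */
    unsigned long within = lba % stripe_depth;  /* offset inside a chunk */
    *disk   = chunk % num_disks;                /* chunks round-robin    */
    *offset = (chunk / num_disks) * stripe_depth + within;
}

int main(void)
{
    for (unsigned long lba = 0; lba < 8; lba++) {
        unsigned disk;
        unsigned long off;
        raid0_map(lba, 2, 1, &disk, &off);
        printf("block %lu -> disk %u, offset %lu\n", lba, disk, off);
    }
    return 0;
}
```

With two disks and a stripe depth of one block, the output alternates disks exactly as in the diagram above.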

RAID 1 Mirroring

RAID 1 provides pure data mirroring, without any parity or striping. Data is written identically to two drives, thereby producing a "mirrored set" of drives. The right diagram above illustrates how RAID 1 works: the data is divided into 4 blocks, and both disks store exactly the same blocks. We can deduce that RAID 1 adds overhead to write operations, while random read performance may improve, since a read can be served by whichever disk is less busy.

RAID 2 Bit-level Striping and Hamming Code

RAID 2, like RAID 0, stripes data across different disks. However, the striping in RAID 2 is at the bit level. In addition, RAID 2 uses Hamming code for error correction. The following diagram shows the RAID 2 structure.

[Figure: RAID 2 layout]

Hamming code is a linear error-correcting code that encodes a message of n data bits into a block of n + k bits by adding k parity bits. The parameters must satisfy 2^k ≥ n + k + 1.
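
As a quick check of that inequality, a few lines of C can compute the minimum k for a given n (names are mine):

```c
#include <stdio.h>

/* Smallest k such that 2^k >= n + k + 1. */
static unsigned hamming_parity_bits(unsigned n)
{
    unsigned k = 0;
    while ((1u << k) < n + k + 1)
        k++;
    return k;
}

int main(void)
{
    for (unsigned n = 1; n <= 11; n++)
        printf("n = %2u -> k = %u\n", n, hamming_parity_bits(n));
    return 0;
}
```

For n = 4 this yields k = 3, the classic Hamming(7,4) code.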

Because of the bit-level striping and the overhead of Hamming code, RAID 2 is rarely used.

RAID 3 Byte-level Striping and Parity Disk

Like RAID 2, RAID 3 is also rarely used in practice. One of the characteristics of RAID 3 is that it generally cannot service multiple requests simultaneously, which happens because any single block of data will, by definition, be spread across all members of the set and will reside in the same location. Therefore, any I/O operation requires activity on every disk and usually requires synchronized spindles.

[Figure: RAID 3 layout]

This makes it suitable for applications that demand the highest transfer rates in long sequential reads and writes, for example uncompressed video editing. Applications that make small reads and writes from random disk locations will get the worst performance out of this level.

RAID 4 Block-level Striping and Parity Disk

RAID 4 uses block-level striping with a dedicated parity disk, in contrast to RAID 3's byte-level and RAID 2's bit-level striping.

[Figure: RAID 4 layout]

RAID 4 provides good random read performance, while random write performance is low because all parity updates go to a single disk.
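
The parity used by RAID 4 (and by RAID 5 below) is a plain XOR across the data blocks of a stripe, which is what makes single-disk recovery possible. Here is a minimal sketch with a toy block size; all names are invented for illustration:

```c
#include <stdio.h>
#include <string.h>

#define BLOCK 4  /* toy block size in bytes */

/* The parity block is the XOR of all data blocks in the stripe. */
static void compute_parity(unsigned char blocks[][BLOCK], int n,
                           unsigned char parity[BLOCK])
{
    memset(parity, 0, BLOCK);
    for (int d = 0; d < n; d++)
        for (int i = 0; i < BLOCK; i++)
            parity[i] ^= blocks[d][i];
}

/* Rebuild a lost block by XOR-ing the parity with the survivors. */
static void rebuild(unsigned char blocks[][BLOCK], int n, int lost,
                    const unsigned char parity[BLOCK])
{
    memcpy(blocks[lost], parity, BLOCK);
    for (int d = 0; d < n; d++)
        if (d != lost)
            for (int i = 0; i < BLOCK; i++)
                blocks[lost][i] ^= blocks[d][i];
}

int main(void)
{
    unsigned char blocks[3][BLOCK] = { "abc", "def", "ghi" };
    unsigned char parity[BLOCK];

    compute_parity(blocks, 3, parity);
    memset(blocks[1], 0, BLOCK);                   /* simulate losing disk 1 */
    rebuild(blocks, 3, 1, parity);
    printf("recovered: %s\n", (char *)blocks[1]);  /* prints "def" */
    return 0;
}
```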

RAID 5 Block-level Striping and Distributed Parity

As noted above, RAID 4 suffers from poor random write performance because all parity data is written to the dedicated parity disk. RAID 5 improves on this by distributing the parity information across all the disks.

[Figure: RAID 5 layout]

In comparison to RAID 4, RAID 5's distributed parity evens out the load of a dedicated parity disk across all RAID members. Write performance also increases, since all members participate in serving write requests. Although writes still won't be as efficient as in a non-RAID setup (because parity must still be written), the parity disk is no longer a bottleneck.
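
To see what distributed parity means in practice, here is a sketch of one common rotation scheme, the left-asymmetric layout; real controllers may rotate parity differently:

```c
#include <stdio.h>

/* In the left-asymmetric RAID 5 layout, the parity block of stripe s
 * on an n-disk array sits on disk (n - 1 - s % n), so parity rotates
 * across all members instead of living on one dedicated disk. */
static int raid5_parity_disk(int stripe, int num_disks)
{
    return num_disks - 1 - (stripe % num_disks);
}

int main(void)
{
    for (int s = 0; s < 4; s++)
        printf("stripe %d: parity on disk %d\n",
               s, raid5_parity_disk(s, 4));
    return 0;
}
```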

RAID 6 Block-level Striping and More Distributed Parity

RAID 6 extends RAID 5 by adding another parity block; thus, it uses block-level striping with two parity blocks distributed across all member disks.

[Figure: RAID 6 layout]

RAID 6 does not have a performance penalty for read operations, but it does have a performance penalty on write operations because of the overhead associated with parity calculations. Performance varies greatly depending on how RAID 6 is implemented in the manufacturer's storage architecture—in software, firmware, or by using firmware and specialized ASICs for intensive parity calculations. RAID 6 can read up to the same speed as RAID 5 with the same number of physical drives.

RAID 01 and RAID 10

RAID 0 and RAID 1 have advantages in performance and redundancy respectively, so combining them provides both speedup and fault tolerance. But how do RAID 01 and RAID 10 differ from each other?

[Figure: RAID 01 (left) and RAID 10 (right) layouts]

The minimum configuration of this combination, as illustrated in the pictures above, is 4 drives. RAID 01 is a mirror of two striped sets, while RAID 10 is a stripe across a number of mirrored sets, and their fault resilience differs. In RAID 01, RAID 0 is on the lower layer and the mirroring RAID 1 is on the upper layer. Should one disk fail in a RAID 01, you would be forced to replace both RAID 0 disks on that side of the parent RAID 1, and the RAID 1 would then rebuild onto the two new disks. Should a second disk fail on the other side of the array, even if it held the other half of the RAID 0 data, the controller would fail the RAID 0 portion on that side as well, leaving no disks. In RAID 10, the mirroring is on the lower layer rather than the upper layer: you can lose one disk on one side of the RAID 0 and safely replace just that disk. In fact, you can lose up to two disks in total (one from each mirrored pair) and still have a functioning array. For this reason, RAID 10 is preferable to RAID 01.
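
A small sketch makes the difference in fault tolerance explicit. It assumes the minimal 4-drive setup, with disks 0 and 1 forming one side and disks 2 and 3 the other (the layout and names are mine):

```c
#include <stdio.h>
#include <stdbool.h>

/* 4-drive layout: disks 0,1 form one side, disks 2,3 the other.
 * failed[i] is true if disk i is dead. */

/* RAID 01 (mirror of stripes): alive only if at least one whole
 * striped set survives intact. */
static bool raid01_alive(const bool failed[4])
{
    bool set_a = !failed[0] && !failed[1];
    bool set_b = !failed[2] && !failed[3];
    return set_a || set_b;
}

/* RAID 10 (stripe of mirrors): alive as long as every mirrored
 * pair still has at least one working disk. */
static bool raid10_alive(const bool failed[4])
{
    bool pair_a = !failed[0] || !failed[1];
    bool pair_b = !failed[2] || !failed[3];
    return pair_a && pair_b;
}

int main(void)
{
    bool failed[4] = { true, false, true, false };  /* disks 0 and 2 die */
    printf("RAID 01 survives: %s\n", raid01_alive(failed) ? "yes" : "no");
    printf("RAID 10 survives: %s\n", raid10_alive(failed) ? "yes" : "no");
    return 0;
}
```

Losing one disk from each side (here disks 0 and 2) kills RAID 01 but leaves RAID 10 running, which is exactly the scenario described above.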


Polling vs Interrupt

Polling and interrupts are the two mechanisms by which the CPU gets notified of I/O events. In the polling method, the CPU must visit the device itself and ask for the information it needs. An interrupt, in contrast, is a signal sent to the processor to mark an event that requires immediate attention; it requests that the processor stop executing the current program and make time to run a special handler. Whenever a device needs service, it notifies the CPU by sending an interrupt signal, and upon receiving the signal the CPU interrupts whatever it is doing and serves the device. A daily-life analogy: suppose you need to boil water. You could either check every minute whether the water has boiled (polling), or use a kettle that whistles when the water is boiled (interrupt).

In polling-driven I/O, the CPU periodically checks whether the I/O operation has finished or data has arrived. In interrupt-driven I/O, the device raises a signal when the I/O operation completes. While data is being received via the NIC, the CPU can do other work instead of busy-waiting and polling; once the data has all been received into the buffer, the interrupt is raised and the CPU handles it immediately.
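
To make the contrast concrete, here is a small user-space simulation in which a POSIX SIGALRM handler plays the role of the device interrupt; this is only an analogy, not how a kernel driver is written:

```c
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

static volatile sig_atomic_t data_ready = 0;

/* Plays the role of the interrupt service routine: it runs
 * asynchronously when the "device" (the alarm) fires. */
static void device_isr(int sig)
{
    (void)sig;
    data_ready = 1;
}

int main(void)
{
    signal(SIGALRM, device_isr);   /* register the "interrupt handler" */
    alarm(1);                      /* "device" completes in ~1 second  */

    /* Polling style: busy-wait, checking the flag over and over.
     * Every iteration is work the CPU could have spent elsewhere. */
    unsigned long polls = 0;
    while (!data_ready)
        polls++;
    printf("data ready after %lu polls\n", polls);

    /* Interrupt style would instead block (e.g. with pause()) and let
     * the handler wake the process, leaving the CPU free meanwhile. */
    return 0;
}
```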


Direct-mapped I/O vs Memory-mapped I/O

There are two mechanisms for data transfer between the CPU and peripheral devices: direct-mapped I/O (also known as port-mapped I/O) and memory-mapped I/O.

Direct-mapped I/O: the I/O controller has several registers for data and control signals. The processor communicates with the device by reading and writing these registers using special I/O instructions (such as in and out on x86), which address a separate I/O port space.

Memory-mapped I/O: those registers are mapped into the address space of the processor. The CPU executes I/O requests using the standard data-transfer instructions to read and write the device-control registers at their mapped locations in physical memory.
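
Below is a sketch of what memory-mapped register access looks like from C, for a made-up UART whose register addresses and bit layout are invented; on real hardware the datasheet defines these, and the pointer would come from something like mmap on /dev/mem or ioremap in a kernel driver:

```c
#include <stdint.h>

/* Hypothetical memory-mapped UART: addresses and bits are invented. */
#define UART_BASE   ((uintptr_t)0x10000000u)
#define UART_STATUS (*(volatile uint32_t *)(UART_BASE + 0x0))
#define UART_TXDATA (*(volatile uint32_t *)(UART_BASE + 0x4))
#define TX_BUSY     0x1u

/* Transmit one byte using ordinary load/store instructions; no
 * special I/O instructions are needed, unlike port-mapped I/O. */
void uart_putc(char c)
{
    while (UART_STATUS & TX_BUSY)       /* spin until transmitter is free */
        ;
    UART_TXDATA = (uint32_t)(unsigned char)c;  /* plain store hits the device */
}
```

The key point is the last line: an ordinary store instruction reaches the device, because the device's registers live at normal physical addresses.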


PIO and DMA

In programmed I/O (PIO), the CPU itself periodically checks whether any I/O request is pending. If there isn't one, it keeps executing its normal workflow; if there is, it handles the I/O request instead. PIO is thus based on the polling mechanism.

In direct memory access (DMA), peripheral devices can access main memory without the CPU handling each transfer. When the CPU initiates a data transfer between an I/O device and main memory, it instructs the DMA controller to carry out the transfer; once the transfer is finished, an interrupt is raised to notify the CPU. That is to say, DMA bypasses the CPU to move data directly between the I/O device and main memory. DMA can also be used for memory-to-memory copying or moving of data within memory, offloading expensive memory operations, such as large copies or scatter-gather operations, from the CPU to a dedicated DMA engine.
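
At the driver level, "instructing the DMA controller" usually amounts to filling in a few device registers and waiting for the completion interrupt. The register layout below is invented for illustration; every real controller defines its own:

```c
#include <stdint.h>

/* Hypothetical DMA controller registers (layout invented). */
struct dma_regs {
    volatile uint64_t src;    /* source physical address      */
    volatile uint64_t dst;    /* destination physical address */
    volatile uint32_t len;    /* bytes to transfer            */
    volatile uint32_t ctrl;   /* control/status bits          */
};
#define DMA_START 0x1u
#define DMA_IRQEN 0x2u        /* raise an interrupt when done */

/* Kick off a transfer; the CPU is free to do other work until
 * the controller's completion interrupt fires. */
void dma_copy(struct dma_regs *dma, uint64_t src, uint64_t dst,
              uint32_t len)
{
    dma->src  = src;
    dma->dst  = dst;
    dma->len  = len;
    dma->ctrl = DMA_START | DMA_IRQEN;   /* go */
}

/* Completion handler, registered with the interrupt controller. */
volatile int dma_done;

void dma_isr(void)
{
    dma_done = 1;             /* transfer finished; buffer is ready */
}
```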

DMA has three modes of operation.

Burst Mode: An entire block of data is transferred in one contiguous sequence. Once the DMA controller is granted access to the system bus by the CPU, it transfers all bytes of the data block before releasing control of the system bus back to the CPU, which renders the CPU inactive for relatively long periods. The advantage of this mode is obvious: it is fast.

Cycle Stealing Mode: the DMA controller obtains access to the system bus the same way as in burst mode, using the BR (Bus Request) and BG (Bus Grant) signals, which are the two signals controlling the interface between the CPU and the DMA controller. However, after transferring one byte of data, it returns control of the system bus to the CPU via BG, then continually re-requests the bus via BR, transferring one byte per request, until the entire block has been transferred. By continually obtaining and releasing the bus, the DMA controller essentially interleaves its transfers with the CPU's instruction and data accesses.

Transparent Mode: the DMA controller transfers data only when the CPU is performing operations that do not use the system buses. The primary advantage of transparent mode is that the CPU never stops executing its programs and the DMA transfer is free in terms of time, while the disadvantage is that the hardware needs to determine when the CPU is not using the system buses, which can be complex.