In the last article we discussed the segmentation mechanism in x86 memory management. This article explains paging in x86 memory management in detail, including PAE and PSE-36, and what's new in x64.
Why We Still Need Paging on Top of Segmentation
From the last article, Segmentation in x86, we know that segmentation in x86 originates from the 8086, whose address bus is 20 bits wide while its data bus is only 16 bits wide. In real mode, segmentation therefore lets 16-bit values form a 20-bit address; in protected mode, segmentation provides memory isolation. Paging, however, has an entirely different motivation. It is tied to the idea of "Virtual Memory", which dates from the time when physical memory was not large enough to hold the working set of every program. At the OS level, each process has its own memory address space, and paging maps its virtual addresses to available physical addresses. For more about virtual memory, I'll write another article in the directory of Operating System - General.
Paging in x86
In x86, the address bus is 32 bits wide, and the data bus is also 32 bits wide. When the CPU executes a memory access instruction, the address in that instruction is a virtual address. Before memory can be accessed, the virtual address must be translated into a physical address. To do so, the MMU (Memory Management Unit) inside the CPU looks up the mapping between virtual addresses and physical addresses, which is called the page table. Finally, the physical address is used to fetch the data. (For simplicity I omit some parts here: Cache, MMU Cache and TLB.)
Regular Paging
In x86, a regular page is 4KB, so the offset within a page occupies the least significant 12 bits of a virtual or physical address. The remaining 20 bits represent the virtual page number in a virtual address and the physical frame number in a physical address. The page table is thus a map from one 20-bit number to the other. The 20-bit virtual page number is divided into two levels, as the picture below shows:
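As a minimal sketch of the split described above, the following Python function breaks a 32-bit virtual address into its two 10-bit indexes and its 12-bit offset. The function name and the sample address are made up for illustration.

```python
def split_virtual_address(vaddr):
    """Split a 32-bit virtual address into (directory index, table index, offset)."""
    dir_index = (vaddr >> 22) & 0x3FF    # most significant 10 bits
    table_index = (vaddr >> 12) & 0x3FF  # middle 10 bits
    offset = vaddr & 0xFFF               # least significant 12 bits
    return dir_index, table_index, offset


# Arbitrary sample address, chosen only for illustration.
print([hex(part) for part in split_virtual_address(0xC0101234)])
# ['0x300', '0x101', '0x234']
```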
The most significant 10 bits index the Page Directory, each of whose 1024 entries points to a Page Table. The middle 10 bits index the Page Table, each of whose 1024 entries points to a 4K page. Each entry is 4 bytes, so a Page Directory or a Page Table occupies exactly 4KB and is 4K-aligned. Consequently, each entry in the Page Directory (PDE) only needs the most significant 20 bits to store the physical address (Page-Table Base Address) of its Page Table. Similarly, each entry in a Page Table (PTE) stores the physical frame number (Page Base Address), which also occupies only 20 bits. The picture below shows the organization of PDE and PTE:
Using a multi-level page table greatly shrinks the page table size. With a single-level page table, the table would take 4 * 2^20 B = 4 MB, and most of its entries would never be used. With a two-level page table, we need only one page directory (4KB) and several page tables (N * 4KB); only some entries in the page directory point to a valid page table, and the rest can be set to NULL.
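The size comparison can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the helper names are invented here, and the two-level figure assumes the best case where the mapped region is contiguous and aligned to 4MB, so that it uses as few page tables as possible.

```python
ENTRY_SIZE = 4                                # bytes per PDE/PTE in regular paging
PAGE_SIZE = 4096                              # bytes per page, and per table
ENTRIES_PER_TABLE = PAGE_SIZE // ENTRY_SIZE   # 1024


def single_level_bytes():
    """Size of a flat table with one entry per 20-bit virtual page number."""
    return ENTRY_SIZE * 2**20


def two_level_bytes(mapped_bytes):
    """Best-case size: one page directory plus the page tables actually needed."""
    pages = -(-mapped_bytes // PAGE_SIZE)      # ceiling division
    tables = -(-pages // ENTRIES_PER_TABLE)
    return PAGE_SIZE + tables * PAGE_SIZE


print(single_level_bytes())                # 4194304 (4 MB)
print(two_level_bytes(8 * 1024 * 1024))    # 12288 (one directory + two tables)
```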
The process by which the MMU finds the physical frame number for a virtual page number through the multi-level page table is called a Page Walk. Let's use an example to illustrate it: suppose we have the virtual address 0x08048001 and need to get the physical address.
(1) The MMU first locates the page directory via the CR3 register, and finds the page directory entry using the most significant 10 bits (0000 1000 00 B = 0x20);
(2) It then locates the page table via the 4K-aligned physical address stored in the page directory entry, and finds the page table entry using the middle 10 bits (00 0100 1000 B = 0x48);
(3) Finally it takes the physical page base (0000 0000 0000 1111 0000 B) from the page table entry just found and appends the least significant 12 bits (0000 0000 0001 B = 0x001), forming the physical address (0000 0000 0000 1111 0000 | 0000 0000 0001 B = 0x000F0001).
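The three steps above can be sketched as a small software simulation. This is a toy model, not how any real MMU is implemented: physical memory is represented as a Python dict from the physical address of a 4-byte entry to its value, the CR3 value 0x1000 and the page table address 0x2000 are made up, and the only flag modeled is the Present bit (bit 0 of a PDE or PTE).

```python
PRESENT = 0x1  # bit 0 of a PDE/PTE: the entry is valid


def page_walk(cr3, phys_mem, vaddr):
    """Translate a 32-bit virtual address using a simulated two-level page table."""
    dir_index = (vaddr >> 22) & 0x3FF
    table_index = (vaddr >> 12) & 0x3FF
    offset = vaddr & 0xFFF

    # Step 1: read the page directory entry (4 bytes per entry).
    pde = phys_mem.get((cr3 & 0xFFFFF000) + 4 * dir_index, 0)
    if not pde & PRESENT:
        raise LookupError("page directory entry not present")

    # Step 2: read the page table entry from the table the PDE points to.
    pte = phys_mem.get((pde & 0xFFFFF000) + 4 * table_index, 0)
    if not pte & PRESENT:
        raise LookupError("page table entry not present")

    # Step 3: combine the 20-bit frame number with the 12-bit offset.
    return (pte & 0xFFFFF000) | offset


# Hypothetical layout: page directory at 0x1000, one page table at 0x2000.
memory = {
    0x1000 + 4 * 0x20: 0x00002000 | PRESENT,  # PDE 0x20 -> page table at 0x2000
    0x2000 + 4 * 0x48: 0x000F0000 | PRESENT,  # PTE 0x48 -> frame 0x000F0
}
print(f"{page_walk(0x1000, memory, 0x08048001):#010x}")  # 0x000f0001
```

Raising an exception for a missing entry stands in for the page fault that real hardware would deliver to the operating system.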
Extended Paging (PSE)
The page size in regular paging is 4KB, while in extended paging the page size is 4MB: a page directory entry points directly to a 4M page instead of pointing to a page table. This feature is called PSE (Page Size Extension). Extended paging is used to translate large contiguous linear address ranges into corresponding physical ones; in these cases, the kernel can do without intermediate Page Tables and thus save memory and preserve TLB entries.
Extended paging is made available by setting the PSE flag of the CR4 processor register; an individual Page Directory entry then maps a 4MB page when its Page Size (PS) flag is set. In this case, the paging unit divides the 32 bits of a linear address into two fields: a 10-bit Page Directory index and a 22-bit offset within the page. Because the choice is made per entry, extended paging coexists with regular paging.
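A minimal sketch of the 4MB translation, under the assumption that the PS flag is bit 7 of the page directory entry and that the entry's top 10 bits hold the 4MB-aligned page base. The function name and the sample entry are invented for illustration.

```python
PS = 0x80  # Page Size flag, bit 7 of a page directory entry


def translate_4mb(pde, vaddr):
    """Translate a virtual address through a PDE that maps a 4MB page."""
    if not pde & PS:
        raise ValueError("this PDE points to a page table, not a 4MB page")
    page_base = pde & 0xFFC00000   # top 10 bits: 4MB-aligned physical base
    offset = vaddr & 0x003FFFFF    # low 22 bits: offset inside the 4MB page
    return page_base | offset


# Hypothetical PDE mapping a 4MB page at physical address 0x00400000.
pde = 0x00400000 | PS | 0x1
print(f"{translate_4mb(pde, 0x08048001):#010x}")  # 0x00448001
```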
PAE (Physical Address Extension)
In early IA-32 CPUs, the address bus is 32 bits wide, so the addressable RAM is limited to 2^32 B = 4GB. Starting with the Pentium Pro in 1995, the address bus was extended to 36 bits, so the theoretical maximum physical memory for a 32-bit x86 machine became 2^36 B = 64GB. PAE is activated by setting the Physical Address Extension (PAE) flag in the CR4 control register. The Page Size (PS) flag in the page directory entry enables large page sizes (2MB when PAE is enabled).
In the PAE paging mechanism, the virtual address is unchanged and still 32 bits wide, while the physical address is extended to 36 bits. The page table layout must therefore be modified to map 32-bit virtual addresses to 36-bit physical addresses. The pictures below show a 4K page and a 2M page in PAE.
The up to 64GB of physical memory is divided into 2^24 physical page frames, so a 20-bit virtual page number must map to a 24-bit physical frame number. The page table entry is therefore extended from 4 bytes to 8 bytes, while a page table still occupies 4KB. Hence there are only 2^9 = 512 entries in a page table, and the same holds for a page directory. With 9 bits each for the page directory and page table indexes and 12 bits for the offset, 2 bits of the virtual address remain, so a new level called the Page Directory Pointer Table (PDPT), consisting of four 64-bit entries, has been introduced. The CR3 control register contains a 27-bit Page Directory Pointer Table base address field. Because PDPTs are stored in the first 4 GB of RAM and aligned to a multiple of 2^5 B = 32 bytes, 27 bits are sufficient to represent the base address of such tables. With a given value in CR3, it is possible to address up to 4 GB of RAM. If we want to address more RAM, we have to put a new value in CR3 or change the content of the PDPT. The main problem with PAE, however, is that linear addresses are still 32 bits long, so a single address space still cannot cover more than 4GB at once.
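To make the PAE layout concrete, here is a sketch that splits a 32-bit virtual address for the 4K-page case into a 2-bit PDPT index, two 9-bit indexes and a 12-bit offset. The function name is made up; the field widths follow from the four-entry PDPT and the 512-entry tables described above.

```python
def split_pae_address(vaddr):
    """Split a 32-bit virtual address for PAE paging with 4K pages."""
    pdpt_index = (vaddr >> 30) & 0x3     # 2 bits: one of four PDPT entries
    dir_index = (vaddr >> 21) & 0x1FF    # 9 bits: one of 512 PDEs
    table_index = (vaddr >> 12) & 0x1FF  # 9 bits: one of 512 PTEs
    offset = vaddr & 0xFFF               # 12 bits: offset inside the 4K page
    return pdpt_index, dir_index, table_index, offset


print([hex(part) for part in split_pae_address(0x08048001)])
# ['0x0', '0x40', '0x48', '0x1']
```

For a 2M page, the page table level is skipped and the low 21 bits of the address become the offset within the page.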
PSE-36 (36-bit Page Size Extension)
PSE-36 was introduced with the Pentium II Xeon in 1998. Recall that PSE uses a single 4KB page directory with 2^10 = 1024 entries, each pointing to a 4MB page (22-bit offset within the page), so a page directory entry stores a 10-bit physical frame number. Similar to PAE, PSE-36 maps a 10-bit virtual page number to a 14-bit physical frame number. The table below shows the page directory entry layout of PSE and PSE-36.
PSE-36 is an alternative to Physical Address Extension (PAE), which also allows 36-bit addressing. PSE-36 has the advantages that the hierarchy of page tables is not changed, and that page entries keep their old 32-bit format and are not extended to 64 bits. The obvious disadvantage of PSE-36 is that only large pages can be located anywhere in the 64 GB of physical memory, while small pages can still be located only in the first 4 GB. In addition, the physical address size can be extended to 40 bits on AMD-designed CPUs.
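The following sketch shows how a 36-bit physical address could be assembled from a 32-bit PSE-36 page directory entry. It assumes the layout documented in Intel's manuals as I understand it: bits 31:22 of the entry hold physical address bits 31:22, and bits 16:13 hold the four extra bits 35:32. Treat those bit positions and the function names as assumptions to check against the manual for your CPU.

```python
def pse36_page_base(pde):
    """Return the 36-bit physical base of the 4MB page a PSE-36 PDE maps."""
    low = pde & 0xFFC00000    # PDE bits 31:22 -> physical address bits 31:22
    high = (pde >> 13) & 0xF  # PDE bits 16:13 -> physical address bits 35:32 (assumed)
    return (high << 32) | low


def pse36_translate(pde, vaddr):
    """Combine the 36-bit page base with the 22-bit offset of the virtual address."""
    return pse36_page_base(pde) | (vaddr & 0x003FFFFF)


# Hypothetical PDE: page base 0x3_0040_0000, PS flag (bit 7) and Present flag set.
pde = 0x00400000 | (0x3 << 13) | 0x80 | 0x1
print(hex(pse36_translate(pde, 0x08048001)))  # 0x300448001
```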
Paging in x64
In the 32-bit IA-32 architecture, two-level paging is sufficient and reasonable for the MMU to walk. However, two-level paging is not suitable for a 64-bit architecture. Say we only use 48 bits of the 64-bit address space: besides the 12-bit page offset, 36 bits still need to be split between the Page Directory and the Page Table. Assuming each field is 18 bits, every Page Directory and every Page Table would occupy 2^18 * 8 B = 2 MB, so even the smallest process would need megabytes of paging structures. This is the same dilemma as the single-level page table in the 32-bit architecture.
Thus, x64 uses a four-level page table hierarchy, whose levels are called PML4, PDPT (Page Directory Pointer Table), PD (Page Directory) and PT (Page Table) respectively. Each level is indexed by 9 bits of the 48-bit virtual address.
The four-level page table hierarchy reduces the page table size for each process. However, it also brings some difficulties in hardware design. To walk the four-level page table, the MMU needs to access RAM four times on a TLB miss. To improve this, the MMU can keep its own cache for each level to avoid unnecessary RAM accesses.
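As an illustration of the four-level layout, this sketch splits a 48-bit virtual address into four 9-bit indexes (one per level, each selecting one of 512 eight-byte entries) and a 12-bit offset for a 4K page. The function name and the sample address are chosen only for the example.

```python
def split_x64_address(vaddr):
    """Split a 48-bit virtual address into (PML4, PDPT, PD, PT, offset)."""
    pml4_index = (vaddr >> 39) & 0x1FF
    pdpt_index = (vaddr >> 30) & 0x1FF
    pd_index = (vaddr >> 21) & 0x1FF
    pt_index = (vaddr >> 12) & 0x1FF
    offset = vaddr & 0xFFF
    return pml4_index, pdpt_index, pd_index, pt_index, offset


print([hex(part) for part in split_x64_address(0x00007FFFFFFFF123)])
# ['0xff', '0x1ff', '0x1ff', '0x1ff', '0x123']
```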
Conclusion
This post has described the paging mechanism in both 32-bit x86 and x64 in detail. For simplicity I omitted some important parts such as Cache, TLB and MMU Cache. To fully understand the whole process of a page walk, you will need to know about those as well. You can find them in the directory of Computer Architecture - General.