{{Disputed}}
[[image:Paging_Structure.gif|right|thumb|600x350px|x86 Paging Structure]]
Paging is a system which allows each process to see a full virtual address space, without actually requiring the full amount of physical memory to be available or present. 32-bit x86 processors support 32-bit virtual addresses and 4-GiB virtual address spaces, and current 64-bit processors support 48-bit virtual addressing and 256-TiB virtual address spaces. Intel has released [https://en.wikipedia.org/wiki/Intel_5-level_paging documentation] for an extension to 57-bit virtual addressing and 128-PiB virtual address spaces. Currently, implementations of x86-64 have a limit of between 4 GiB and 256 TiB of physical address space (and an architectural limit of 4 PiB of physical address space).

In addition to this, paging introduces the benefit of page-level protection. In this system, user processes can only see and modify data which is paged in on their own address space, providing hardware-based isolation. System pages are also protected from user processes. On the x86-64 architecture, page-level protection now completely supersedes [[Segmentation]] as the memory protection mechanism. On the IA-32 architecture, both paging and segmentation exist, but segmentation is now considered 'legacy'.

Once an operating system has paging, it can also make use of other benefits and workarounds, such as linear framebuffer simulation for memory-mapped IO and paging out to disk, where disk storage space is used to free up physical RAM.

== 32-bit Paging (Protected Mode) ==
Paging is achieved through the use of the [[Memory Management Unit]] (MMU). On the x86, the MMU maps memory through a series of [[Page Tables|tables]], two to be exact: the page directory (PD) and the page table (PT).

Both [[Page Tables|tables]] contain 1024 4-byte entries, making them 4 KiB each. In the page directory, each entry points to a page table. In the page table, each entry points to a 4 KiB physical page frame. Additionally, each entry has bits controlling access protection and caching features of the structure to which it points. The entire system consisting of a page directory and page tables represents a linear 4-GiB virtual memory map.

Translation of a virtual address into a physical address first involves dividing the virtual address into three parts: the most significant 10 bits (bits 22-31) specify the index of the page directory entry, the next 10 bits (bits 12-21) specify the index of the page table entry, and the least significant 12 bits (bits 0-11) specify the page offset. The MMU then walks through the paging structures, starting with the page directory, and uses the page directory entry to locate the page table. The page table entry is used to locate the base address of the physical page frame, and the page offset is added to the physical base address to produce the physical address. If translation fails for some reason (an entry is marked as not present, for example), then the processor issues a page fault.

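As a rough illustration of the split described above, the following C sketch extracts the three fields from a 32-bit virtual address. The names <code>vaddr_parts</code> and <code>split_vaddr</code> are made up for this example; only the bit ranges come from the text.

```c
#include <stdint.h>

/* Hypothetical helper type: the three fields of a 32-bit virtual address. */
typedef struct {
    uint32_t pd_index;  /* bits 22-31: index into the page directory */
    uint32_t pt_index;  /* bits 12-21: index into the page table     */
    uint32_t offset;    /* bits 0-11:  offset inside the 4 KiB page  */
} vaddr_parts;

/* Split a virtual address the same way the MMU does before walking the tables. */
static vaddr_parts split_vaddr(uint32_t vaddr) {
    vaddr_parts p;
    p.pd_index = vaddr >> 22;
    p.pt_index = (vaddr >> 12) & 0x3FF;
    p.offset   = vaddr & 0xFFF;
    return p;
}
```

For instance, 0xC0100ABC splits into page directory index 768, page table index 256, and page offset 0xABC.
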
=== Page Directory ===
The topmost paging structure is the page directory. It is essentially an array of page directory entries that take the following form.

[[Image:Page_directory_entry.png|frame|A Page Directory Entry]]

When PS=0, the page table address field represents the physical address of the page table that manages the four megabytes at that point. It is very important that this address be 4-KiB aligned, because the low 12 bits of the 32-bit value are occupied by access bits and other flags. Similarly, when PS=1, the address must be 4-MiB aligned.

* PAT, or ''''P'''age '''A'''ttribute '''T'''able'. If [https://en.wikipedia.org/wiki/Page_attribute_table PAT] is supported, then PAT along with PCD and PWT indicates the memory caching type. Otherwise, it is reserved and must be set to 0.
* G, or ''''G'''lobal', tells the processor not to invalidate the TLB entry corresponding to the page upon a MOV to CR3 instruction. Bit 7 (PGE) in CR4 must be set to enable global pages.
* PS, or ''''P'''age '''S'''ize', stores the page size for that specific entry. If the bit is set, then the PDE maps to a page that is 4 MiB in size. Otherwise, it maps to a 4 KiB page table. Please note that 4-MiB pages require PSE to be enabled.
* D, or ''''D'''irty', is used to determine whether a page has been written to.
* A, or ''''A'''ccessed', is used to discover whether a PDE or PTE was read during virtual address translation. If it was, then the bit is set; otherwise, it is not. Note that this bit will not be cleared by the CPU, so that burden falls on the OS (if it needs this bit at all).
* PCD is the 'Cache Disable' bit. If the bit is set, the page will not be cached. Otherwise, it will be.
* PWT controls the 'Write-Through' abilities of the page. If the bit is set, write-through caching is enabled. If not, then write-back is enabled instead.
* U/S, the ''''U'''ser/'''S'''upervisor' bit, controls access to the page based on privilege level. If the bit is set, then the page may be accessed by all; if the bit is not set, only the supervisor can access it. For a page directory entry, the user bit controls access to all the pages referenced by the page directory entry. Therefore, if you wish to make a page a user page, you must set the user bit in the relevant page directory entry as well as the page table entry.
* R/W, the ''''R'''ead/'''W'''rite' permissions flag. If the bit is set, the page is read/write. Otherwise, the page is read-only. The WP bit in CR0 determines whether this applies only to userland, always giving the kernel write access (the default), or to both userland and the kernel (see Intel Manuals 3A 2-20). The R/W bit of the parent tables is also checked: if any of them is 0, the page is treated as read-only.
* P, or ''''P'''resent'. If the bit is set, the page is actually in physical memory at the moment. For example, when a page is swapped out, it is not in physical memory and therefore not 'Present'. If a page is accessed but not present, a page fault will occur, and the OS should handle it. (See below.)
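
The flag bits listed above could be written down in C roughly as follows. This is only a sketch: the constant and function names are invented for illustration, and <code>user_can_write</code> only models the rule that the P, R/W and U/S bits of the page directory entry and the page table entry must all be set for a user-mode write to succeed.

```c
#include <stdint.h>

/* Hypothetical names for the low flag bits of a 32-bit PDE/PTE. */
enum {
    PG_PRESENT  = 1u << 0,  /* P   */
    PG_RW       = 1u << 1,  /* R/W */
    PG_USER     = 1u << 2,  /* U/S */
    PG_PWT      = 1u << 3,  /* PWT */
    PG_PCD      = 1u << 4,  /* PCD */
    PG_ACCESSED = 1u << 5,  /* A   */
    PG_DIRTY    = 1u << 6,  /* D   */
    PG_PS       = 1u << 7,  /* PS (in a PDE) */
    PG_GLOBAL   = 1u << 8   /* G   */
};

/* Combine a 4-KiB-aligned physical address with flag bits. */
static uint32_t make_entry(uint32_t phys, uint32_t flags) {
    return (phys & 0xFFFFF000u) | (flags & 0xFFFu);
}

/* A user-mode write needs P, R/W and U/S set at both levels. */
static int user_can_write(uint32_t pde, uint32_t pte) {
    uint32_t need = PG_PRESENT | PG_RW | PG_USER;
    return (pde & need) == need && (pte & need) == need;
}
```

With these definitions, a page table entry marked user and read/write is still read-only for user code when the R/W bit of its parent page directory entry is clear.
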

The remaining bits 9 through 11 (if PS=0, also bits 6 and 8) are not used by the processor, and are free for the OS to store some of its own accounting information. In addition, when P is not set, the processor ignores the rest of the entry and you can use all remaining 31 bits for extra information, like recording where the page has ended up in swap space. When changing the accessed or dirty bits from 1 to 0 while an entry is marked as present, it's recommended to invalidate the associated page. Otherwise, the processor may not set those bits upon subsequent reads or writes due to TLB caching.

[[Image:Page table entry.png|frame|A Page Table Entry]]

Setting the PS bit makes the page directory entry point directly to a 4-MiB page. There is no page table involved in the address translation.

Note: With 4-MiB pages, whether or not bits 20 through 13 are reserved depends on PSE being enabled and how many PSE bits are supported by the processor (PSE, PSE-36, PSE-40). [[CPUID]] should be used to determine this. In every case, the physical address must also be 4-MiB-aligned. Physical addresses above 4 GiB can only be mapped using 4 MiB PDEs.

=== Page Table ===
Each page table likewise contains 1024 entries. These are called page table entries, and are '''very''' similar to page directory entries.

The first item is, once again, a 4-KiB aligned physical address. Unlike previously, however, the address is not that of a page table, but instead that of a 4 KiB block of physical memory that is then mapped to that location in the page table and directory. Note that the PAT bit is bit 7 instead of bit 12 as in the 4 MiB PDE.

=== Example ===
Say the kernel is loaded at 0x100000 but needs to be mapped at 0xC0000000. After loading, the kernel initiates paging and sets up the appropriate tables (see [[Higher Half Kernel]]). After [[Identity Paging]] the first megabyte, it needs to create a second table (i.e. at entry #768 in the page directory) to map 0x100000 to 0xC0000000. The code may look like:
<source lang="ASM">
mov eax, 0x0
mov ebx, 0x100000
; ... (loop body at .fill_table elided in the source) ...
cmp eax, 1024
jne .fill_table
</source>
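
The same loop could be sketched in C as follows. This is an illustration under assumptions, not the article's original code: the function name <code>fill_table</code> and the flag value 0x3 (present plus read/write) are hypothetical, while the 1024 iterations and the start address 0x100000 follow the assembly above.

```c
#include <stdint.h>

#define PT_ENTRIES 1024u
#define PAGE_SIZE  4096u

/* Fill one page table so that entry i maps phys_base + i * 4 KiB.
 * The flags 0x3 (present | read/write) are an assumption for this sketch. */
static void fill_table(uint32_t table[PT_ENTRIES], uint32_t phys_base) {
    for (uint32_t i = 0; i < PT_ENTRIES; i++) {
        table[i] = (phys_base + i * PAGE_SIZE) | 0x3u;
    }
}
```

The filled table would then be installed at page directory entry 768, because 0xC0000000 >> 22 is 768.
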
== 64-Bit Paging ==
[[Image:64-bit page tables1.png|thumb|Page map table entry structure (non-page-sized)]]
Paging in [[x86-64|long mode]] is similar to 32-bit paging, except that [[PAE|Physical Address Extension]] (PAE) is required. Registers CR2 and CR3 are extended to 64 bits. In addition to the three levels of page maps used by PAE (page directory pointer table, page directory, and page table), a fourth page-map table is used: the level-4 page map table (PML4). This allows a processor to map 48-bit virtual addresses to 52-bit physical addresses. If level-5 page maps are supported and enabled, then a fifth page-map table, the level-5 page map table (PML5), allows the processor to map 57-bit virtual addresses to 52-bit physical addresses. Both the PML4 and PML5 contain 512 64-bit entries, each of which may point to a lower-level page map table. Do note that with each additional level of paging, virtual address translation becomes slower, especially in the case of TLB misses.

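Since every long-mode table holds 512 entries, each level consumes 9 bits of the virtual address, with the low 12 bits left as the page offset (4 x 9 + 12 = 48). A minimal sketch of the 4-level index extraction, with function names that are purely illustrative:

```c
#include <stdint.h>

/* Hypothetical helpers: table indices for 4-level paging with 4 KiB pages. */
static unsigned pml4_index(uint64_t va)  { return (unsigned)((va >> 39) & 0x1FF); }
static unsigned pdpt_index(uint64_t va)  { return (unsigned)((va >> 30) & 0x1FF); }
static unsigned pd_index64(uint64_t va)  { return (unsigned)((va >> 21) & 0x1FF); }
static unsigned pt_index64(uint64_t va)  { return (unsigned)((va >> 12) & 0x1FF); }
static unsigned page_offset(uint64_t va) { return (unsigned)(va & 0xFFF); }
```

With 5-level paging, one more 9-bit index (bits 48-56) would select the PML5 entry.
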
Virtual addresses in 64-bit mode must be '''canonical''', that is, the upper bits of the address must either be all 0s or all 1s. For systems supporting 48-bit virtual address spaces, the upper 16 bits must be the same as bit 47, and for systems supporting 57-bit virtual addresses, the upper 7 bits must match bit 56. Although 32-bit code running in [[x86-64|long mode]] (compatibility mode) is still limited to 32-bit virtual addresses, those addresses can still map to 52-bit physical addresses.

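The canonical-address rule can be checked in software with a few shifts. The following is a sketch only; <code>is_canonical</code> is a made-up name, and <code>va_bits</code> is assumed to be 48 or 57.

```c
#include <stdint.h>
#include <stdbool.h>

/* Returns true if bits (va_bits - 1) through 63 of va are all 0 or all 1.
 * va_bits is the virtual address width in use (assumed to be 48 or 57). */
static bool is_canonical(uint64_t va, unsigned va_bits) {
    uint64_t upper    = va >> (va_bits - 1);          /* top bit plus everything above it */
    uint64_t all_ones = UINT64_MAX >> (va_bits - 1);  /* same field with every bit set    */
    return upper == 0 || upper == all_ones;
}
```

For example, 0x0000800000000000 is not canonical with 48-bit addressing, but it is a valid lower-half address with 57-bit addressing.
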
=== Page Map Table Entries ===
[[Image:64-bit page tables2.png|thumb|Page map table entry structure (page-sized)]]
New bits have been added to page map table entries for long-mode paging:
* XD, or ''''E'''xecute '''D'''isable'. If the NXE bit (bit 11) is set in the [[CPU_Registers_x86-64#IA32_EFER|EFER register]], then instructions are not allowed to be executed at addresses within the page whenever XD is set. If the EFER.NXE bit is 0, then the XD bit is reserved and should be set to 0.
* PK, or ''''P'''rotection '''K'''ey'. The protection key is a 4-bit value corresponding to each virtual address that is used to control user-mode and supervisor-mode memory accesses. If the PKE bit (bit 22) in CR4 is set, then the PKRU register is used for determining access rights for user-mode based on the protection key. If the PKS bit (bit 24) is set in CR4, then the PKRS register is used for determining access rights for supervisor-mode based on the protection key. A protection key allows the system to enable/disable access rights for multiple page entries across different address spaces at once.

M signifies the physical address width supported by a processor using PAE. Currently, up to 52 bits are supported, but the actual supported width may be less.

Bits marked as reserved must all be set to 0; otherwise, a page fault will occur with the reserved-bit flag set in the error code.

Support for 1 GiB pages, (NX) execute disable, (PKS/PKU) protection keys for supervisor-mode and user-mode pages, shadow stack pages, (M) physical address width, virtual address width, (PAT) page attribute table, (PCID) process context identifiers, and (LA57) 5-level paging can be determined with the [[CPUID|CPUID]] instruction (EAX:0x01; EAX:0x07, ECX=0x00; EAX:0x80000001; EAX:0x80000008).

=== Process Context Identifiers ===
If process context identifiers (PCIDs) are supported and enabled, then bits 0-11 of CR3 specify the process context ID. Otherwise, bit 3 is PWT for the PML4, and bit 4 is PCD for the PML4. PCIDs are used to control TLB caching across multiple address spaces. The INVPCID instruction uses PCIDs to allow more control over page invalidation.

== Enabling ==
=== 32-bit Paging ===
Enabling paging is actually very simple. All that is needed is to load CR3 with the address of the page directory and to set the paging (PG) and protection (PE) bits of CR0.
<source lang="ASM">
mov eax, page_directory
mov cr3, eax

mov eax, cr0
or eax, 0x80000001
mov cr0, eax
</source>

Note: setting the paging flag when the protection flag is clear causes a [[Exceptions#General_Protection_Fault|general protection exception]]. Also, once paging has been enabled, any attempt to enable long mode by setting LME (bit 8) of the [[CPU_Registers_x86-64#IA32_EFER|EFER register]] will trigger a [[Exceptions#General_Protection_Fault|GPF]]. CR0.PG must first be cleared before EFER.LME can be set.

If you want to set pages as read-only for both userspace and supervisor, replace 0x80000001 above with 0x80010001, which also sets the WP bit.

To enable PSE (4 MiB pages), the following code is required.
<source lang="ASM">
mov eax, cr4
or eax, 0x00000010
mov cr4, eax
</source>
=== 64-bit Paging ===
Enabling paging in long mode requires a few additional steps. Since it is not possible to enter long mode without paging with PAE active, the order in which the bits are enabled is important. Firstly, paging must not be active (i.e. CR0.PG must be cleared). Then, CR4.PAE (bit 5) and EFER.LME (bit 8 of MSR 0xC0000080) are set. If 57-bit virtual addresses are to be enabled, then CR4.LA57 (bit 12) is set. Finally, CR0.PG is set to enable paging.
<source lang="ASM">
  ; Skip these 3 lines if paging is already disabled
  mov ebx, cr0
  and ebx, ~(1 << 31)
  mov cr0, ebx

  ; Enable PAE
  mov edx, cr4
  or  edx, (1 << 5)
  mov cr4, edx

  ; Set LME (long mode enable)
  mov ecx, 0xC0000080
  rdmsr                 ; read EFER into EDX:EAX
  or  eax, (1 << 8)
  wrmsr                 ; write the updated EFER back

  ; Replace 'pml4_table' with the appropriate physical address (and flags, if applicable)
  mov eax, pml4_table
  mov cr3, eax

  ; Enable paging (and protected mode, if it isn't already active)
  ; EBX must still hold the CR0 value read above
  or ebx, (1 << 31) | (1 << 0)
  mov cr0, ebx

  ; Now reload the segment registers (CS, DS, SS, etc.) with the appropriate segment selectors...
  mov ax, DATA_SEL
  mov ds, ax
  mov es, ax
  mov ss, ax

  ; Reload CS with a 64-bit code selector by performing a long jmp
  jmp CODE_SEL:reloadCS

[BITS 64]
reloadCS:
  hlt   ; Done. Replace these lines with your own code
  jmp reloadCS
</source>

Once paging has been enabled, you cannot switch from 4-level paging to 5-level paging (and vice-versa) directly. The same is true for switching to legacy 32-bit paging. You must first disable paging by clearing CR0.PG before making changes. Failure to do so will result in a [[Exceptions#General_Protection_Fault|general protection fault]].

== Physical Address Extension ==
All Intel processors since the Pentium Pro (with the exception of the Pentium M at 400 MHz) and all AMD processors since the Athlon series implement the [[PAE|Physical Address Extension]] (PAE). This feature allows you to access up to 4 PiB (2<sup>52</sup> bytes) of RAM. You can check for this feature using [[CPUID|CPUID]]. Once checked, you can activate this feature by setting bit 5 in CR4.

For legacy 32-bit PAE, the CR3 register points to a page directory pointer table (PDPT) of 4 64-bit entries, each one pointing to a page directory made of 4096 bytes (like in normal paging), divided into 512 64-bit entries, each pointing to a 4096-byte page table, divided into 512 64-bit page entries. Keep in mind that virtual addresses are still limited to 4 GiB (2<sup>32</sup> bytes).

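With 4 PDPT entries and 512 entries in each lower table, a 32-bit virtual address under legacy PAE splits into 2 + 9 + 9 + 12 bits. A short sketch of that split, using hypothetical function names:

```c
#include <stdint.h>

/* Hypothetical helpers: table indices for legacy 32-bit PAE with 4 KiB pages. */
static unsigned pae_pdpt_index(uint32_t va) { return va >> 30; }            /* bits 30-31 */
static unsigned pae_pd_index(uint32_t va)   { return (va >> 21) & 0x1FF; }  /* bits 21-29 */
static unsigned pae_pt_index(uint32_t va)   { return (va >> 12) & 0x1FF; }  /* bits 12-20 */
static unsigned pae_offset(uint32_t va)     { return va & 0xFFF; }          /* bits 0-11  */
```

For example, 0xC0100ABC selects PDPT entry 3, page directory entry 0, page table entry 256, and offset 0xABC.
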
For 4-level and 5-level PAE, as used in compatibility mode and [[x86-64|long mode]], the CR3 register points to the top-level page map table: the PML4 table or the PML5 table, respectively. Each of the page map tables (PML5 table, PML4 table, page directory pointer table, page directory, page table) contains 512 64-bit entries.

If paging is enabled, then PAE must also be enabled before entering long mode. Attempting to enter long mode with CR0.PG set and CR4.PAE cleared will trigger a general protection fault.

== Usage ==
Due to the simplicity of its design, paging has many uses.

=== Virtual Address Spaces ===
In a paged system, each process may execute in its own area of memory, without any chance of affecting any other process's memory, or the kernel's. Two or more processes may opt to share memory by mapping the same physical page(s) to addresses in their own address spaces. The virtual addresses of each mapping do not need to be the same. Consequently, a virtual address in one address space won't point to the same data in other address spaces, in general.

[[Image:Virtual memory.png|frame|none|Paging illustrated: two processes with different views of the same physical memory]]

=== Virtual Memory ===
Because paging allows for the dynamic handling of unallocated page tables, an OS can swap entire pages that are not in current use to the hard drive, where they can wait until they are needed. In the meantime, the physical memory that they were using can be used elsewhere. In this way, the OS can manipulate the system so that programs appear to have more RAM than there actually is.

''More...''

== Manipulation ==
The CR3 value, that is, the value containing the address of the page directory, is a physical address. Once the computer is in paging mode and only recognizes those virtual addresses mapped into the paging tables, how can the tables be edited and dynamically changed?

Many prefer to map the last PDE to itself. The page directory will then look like a page table to the system. Getting the physical address of any virtual address in the range 0x00000000-0xFFFFF000 is then just a matter of:
<source lang="C">
void *get_physaddr(void *virtualaddr) {
    unsigned long pdindex = (unsigned long)virtualaddr >> 22;
    unsigned long ptindex = ((unsigned long)virtualaddr >> 12) & 0x03FF;

    unsigned long *pd = (unsigned long *)0xFFFFF000;
    // Here you need to check whether the PD entry is present.

    unsigned long *pt = ((unsigned long *)0xFFC00000) + (0x400 * pdindex);
    // Here you need to check whether the PT entry is present.

    return (void *)((pt[ptindex] & ~0xFFF) + ((unsigned long)virtualaddr & 0xFFF));
}
</source>
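
The constants 0xFFFFF000 and 0xFFC00000 used above follow from the self-mapping: with the last PDE pointing at the page directory itself, every page table appears in the top 4 MiB of the address space. The sketch below only computes where the PDE and PTE for a given virtual address become visible; the macro and function names are hypothetical.

```c
#include <stdint.h>

#define RECURSIVE_PT_BASE 0xFFC00000u  /* all page tables appear here       */
#define RECURSIVE_PD_BASE 0xFFFFF000u  /* the page directory itself appears here */

/* Virtual address at which the PDE covering 'va' can be read or written. */
static uint32_t pde_vaddr(uint32_t va) {
    return RECURSIVE_PD_BASE + (va >> 22) * 4u;
}

/* Virtual address at which the PTE covering 'va' can be read or written. */
static uint32_t pte_vaddr(uint32_t va) {
    return RECURSIVE_PT_BASE + (va >> 12) * 4u;
}
```

Note that the "PTE" for an address inside the page directory window (0xFFFFF000) is the last PDE itself, which is exactly what the self-mapping implies.
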

Mapping a virtual address to a physical address can be done as follows:
<source lang="C">
void map_page(void *physaddr, void *virtualaddr, unsigned int flags) {
    // Make sure that both addresses are page-aligned.

    unsigned long pdindex = (unsigned long)virtualaddr >> 22;
    unsigned long ptindex = ((unsigned long)virtualaddr >> 12) & 0x03FF;

    unsigned long *pd = (unsigned long *)0xFFFFF000;
    // Here you need to check whether the PD entry is present.
    // When it is not present, you need to create a new empty PT and
    // adjust the PDE accordingly.

    unsigned long *pt = ((unsigned long *)0xFFC00000) + (0x400 * pdindex);
    // Here you need to check whether the PT entry is present.
    // When it is, then there is already a mapping present. What do you do now?

    pt[ptindex] = ((unsigned long)physaddr) | (flags & 0xFFF) | 0x01; // Present

    // Now you need to flush the entry in the TLB
    // or you might not notice the change.
}
</source>

Unmapping an entry is essentially the same as above, but instead of assigning <code>pt[ptindex]</code> a value, you set it to 0x00000000 (i.e. not present). When the entire page table is empty, you may want to remove it and mark the page directory entry 'not present'. Of course, you don't need the 'flags' or 'physaddr' for unmapping.

== Page Faults ==
A [[Exceptions#Page_Fault|page fault]] exception is caused when a process is seeking to access an area of virtual memory that is not mapped to any physical memory, when a write is attempted on a read-only page, when accessing a PTE or PDE with a reserved bit set, or when permissions are inadequate. A [[Exceptions#Page_Fault|page fault]] can either be pure, which occurs when the faulting process has permission to access the page, or invalid, which is due to a protection violation. Pure [[Exceptions#Page_Fault|page faults]] aren't errors, but are resolved through the page fault handler by performing the appropriate map operation and/or page swap.

=== Handling ===
The CPU pushes an error code on the stack before firing a [[Exceptions#Page_Fault|page fault exception]]. The error code must be analyzed by the exception handler to determine how to handle the exception. The following bits are the only ones used; all others are reserved.
* Bit 0 (P) is the Present flag. When set, the fault was caused by a protection violation; when clear, it was caused by a non-present page.
* Bit 1 (R/W) is the Read/Write flag. When set, the faulting access was a write; when clear, it was a read.
* Bit 2 (U/S) is the User/Supervisor flag. When set, the fault occurred while the processor was in user mode.
* Bit 3 (RSVD) indicates whether a reserved bit was set in some page-structure entry.
* Bit 4 (I/D) is the Instruction/Data flag (1=instruction fetch, 0=data access).
* Bit 5 (PK) indicates a protection-key violation.
* Bit 6 (SS) indicates a shadow-stack access fault.
* Bit 15 (SGX) indicates an [https://en.wikipedia.org/wiki/Software_Guard_Extensions SGX violation].

The combination of these flags specifies the details of the page fault and indicates what action to take:
{| class="wikitable" border="1"
|-
! US !! RW !! P !! Description
|-
| 0 || 0 || 0 || Supervisory process tried to read a non-present page entry
|-
| 0 || 0 || 1 || Supervisory process tried to read a page and caused a protection fault
|-
| 0 || 1 || 0 || Supervisory process tried to write to a non-present page entry
|-
| 0 || 1 || 1 || Supervisory process tried to write a page and caused a protection fault
|-
| 1 || 0 || 0 || User process tried to read a non-present page entry
|-
| 1 || 0 || 1 || User process tried to read a page and caused a protection fault
|-
| 1 || 1 || 0 || User process tried to write to a non-present page entry
|-
| 1 || 1 || 1 || User process tried to write a page and caused a protection fault
|}
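
A handler might begin by unpacking the error code into named fields. The following is a minimal sketch covering bits 0 through 4 of the list above; the <code>pf_info</code> type and <code>decode_pf</code> function are invented for this example.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical decoded view of the page fault error code (bits 0-4 only). */
typedef struct {
    bool present;   /* bit 0: 1 = protection violation, 0 = non-present page */
    bool write;     /* bit 1: 1 = write access, 0 = read access              */
    bool user;      /* bit 2: 1 = fault happened in user mode                */
    bool reserved;  /* bit 3: a reserved bit was set in a paging structure   */
    bool ifetch;    /* bit 4: 1 = instruction fetch, 0 = data access         */
} pf_info;

static pf_info decode_pf(uint32_t err) {
    pf_info info;
    info.present  = (err & (1u << 0)) != 0;
    info.write    = (err & (1u << 1)) != 0;
    info.user     = (err & (1u << 2)) != 0;
    info.reserved = (err & (1u << 3)) != 0;
    info.ifetch   = (err & (1u << 4)) != 0;
    return info;
}
```

An error code of 0x7, for instance, corresponds to the last row of the table: a user process tried to write a page and caused a protection fault.
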

When the CPU fires a page-not-present exception, the CR2 register is populated with the linear address that caused the exception. The upper 10 bits specify the page directory entry (PDE) and the middle 10 bits specify the page table entry (PTE). First check the PDE and see if its present bit is set. If not, set up a page table, point the PDE to the base address of the page table, set the present bit and iretd. If the PDE is present, then the present bit of the PTE will be cleared. You'll need to map some physical memory to the page table, set the present bit and then iretd to continue processing.

== INVLPG ==
INVLPG is an instruction available since the i486 that invalidates a single page in the TLB. Intel notes that this instruction may be implemented differently on future processors, but that this alternate behavior must be explicitly enabled. INVLPG modifies no flags.

NASM example:
<source lang="ASM">
invlpg [eax]   ; eax is assumed to hold the linear address of the page to invalidate
</source>

Inline assembly for GCC (from Linux kernel source):
<source lang="C">
static inline void __native_flush_tlb_single(unsigned long addr) {
   asm volatile("invlpg (%0)" ::"r" (addr) : "memory");
}
</source>

This only invalidates the page on the current processor. If you're using SMP, you'll need to send an IPI to the other processors so that they can also invalidate the page (this is called a TLB shootdown; it's very slow), making sure to avoid any nasty race conditions. You may only want to do this when removing a mapping. For a newly added mapping, the page fault handler on another processor can look through the page directory and simply invalidate the page if the mapping turns out to be present, again avoiding race conditions.

When you modify an entry in the page directory, rather than just a page table, you'll need to invalidate each page in the table. Alternatively, you could reload CR3, which invalidates the whole TLB apart from global pages, but this may be slower. (TODO time this)

== Paging Tricks ==
The processor always fires a page fault exception when the present bit is cleared in the PDE or PTE, regardless of the address. This means the contents of the PTE or PDE can be used to indicate the location of a page saved on mass storage and to quickly load it. When a page gets swapped to disk, use these entries to record the location in the paging file from which it can be quickly loaded, then set the present bit to 0. Similarly, blocks from disk can be mapped to memory this way. When a process accesses the memory-mapped region, a page fault occurs. The fault handler reads the appropriate tables, loads the disk block(s) into a page, and maps it. The process can then read and write memory as if it were accessing the device directly. The contents of the page would then be written back to disk to save the changes.

For memory efficiency, two or more processes can share pages as read-only. If one process were to write to its page, then a page fault would occur and the system could duplicate the page and then mark the copy as read-write. This is known as copy-on-write (COW). Copy-on-write allows the system to delay memory allocation until a process actually requires it, preventing unnecessary copying.

(追記) The Page Attribute Table determines caching attributes on a page granularity. This is similar to [[MTRR]]s, but those apply to physical addresses and are more limited. (追記ここまで)
(追記) The PAT is set via the IA32_PAT_MSR [[MSR]] (0x277). It has 8 entries, taking the low order 3 bits of each byte, in standard little endian order. So the high byte is PAT7, low byte is PAT0. (追記ここまで)
(追記) The following are the different caching types. (追記ここまで)
(追記) {| class="wikitable" border="1" (追記ここまで)
(追記) ! Description (追記ここまで)
(追記) | UC — Uncacheable (追記ここまで)
(追記) | All accesses are uncacheable. Write combining is not allowed. Speculative accesses are not allowed. (追記ここまで)
(追記) | WC — Write-Combining (追記ここまで)
(追記) | All accesses are uncacheable. Write combining is allowed. Speculative reads are allowed. (追記ここまで)
(追記) | WT — Writethrough (追記ここまで)
(追記) | Reads allocate cache lines on a cache miss. Cache lines are not allocated on a write miss. (追記ここまで)
(追記) Write hits update the cache and main memory. (追記ここまで)
(追記) | WP — Write-Protect (追記ここまで)
(追記) | Reads allocate cache lines on a cache miss. All writes update main memory. (追記ここまで)
(追記) Cache lines are not allocated on a write miss. Write hits invalidate the cache (追記ここまで)
(追記) line and update main memory. (追記ここまで)
(追記) | WB — Writeback (追記ここまで)
(追記) | Reads allocate cache lines on a cache miss, and can allocate to either the shared, (追記ここまで)
(追記) exclusive, or modified state. Writes allocate to the modified state on a cache miss. (追記ここまで)
(追記) | UC- — Uncached (追記ここまで)
(追記) | Same as uncacheable, ''except'' that this can be overriden by Write-Combining MTRRs. (追記ここまで)
(追記) The PAT has a reset value of 0x0007040600070406. This ensures compatibility with non-PAT usage. This corresponds to the following: (追記ここまで)
(追記) {| class="wikitable" border="1" (追記ここまで)
(追記) The PAT is indexed by the three page table bits: (追記ここまで)
(追記) {| class="wikitable" border="1" (追記ここまで)
(追記) The PAT bit is reserved when there isn't a PAT, and the default value of the MSR ensures backwards comaptibility with the PCD and PWT bit. (追記ここまで)
(追記) You will need to modify the PAT if you want Write-Combining cache, which is very useful for framebuffers. (追記ここまで)
*[[Identity Paging]]
*[[Page Frame Allocation]]
*[[Setting Up Paging]]
*[[Page Tables]]
*[[Memory Management]]
*[[Memory Management Unit]]
*[https://forum.osdev.org/viewtopic.php?p=282061 Page Coloring]
=== External Links ===
*[https://forum.osdev.org/viewtopic.php?f=1&t=18222 INVLPG thread]
*[http://www.dumaisnet.ca/index.php?article=ff3b7adb128cb438ac1e306b3fbe37e7 Process Context ID]
*[https://en.wikipedia.org/wiki/Intel_5-level_paging 5-Level Paging]
[[Category:Memory management]]
[[Category:Paging]]
[[Category:Virtual Memory]]
[[Category:Security]]
[[de:Paging]]
Latest revision as of 17:23, 21 October 2025
The factual accuracy of this article or section is disputed.
Please see the relevant discussion on the talk page.
Paging is a system which allows each process to see a full virtual address space, without actually requiring the full amount of physical memory to be available or present. 32-bit x86 processors support 32-bit virtual addresses and 4-GiB virtual address spaces, and current 64-bit processors support 48-bit virtual addressing and 256-TiB virtual address spaces. Intel has released documentation for an extension to 57-bit virtual addressing and 128-PiB virtual address spaces. Currently, implementations of x86-64 have a physical address space limit of somewhere between 4 GiB and 256 TiB (the architectural limit is 4 PiB).
In addition to this, paging introduces the benefit of page-level protection. In this system, user processes can only see and modify data which is paged in on their own address space, providing hardware-based isolation. System pages are also protected from user processes. On the x86-64 architecture, page-level protection now completely supersedes Segmentation as the memory protection mechanism. On the IA-32 architecture, both paging and segmentation exist, but segmentation is now considered 'legacy'.
Once an Operating System has paging, it can also make use of other benefits and workarounds, such as linear framebuffer simulation for memory-mapped IO and paging out to disk, where disk storage space is used to free up physical RAM.
32-bit Paging (Protected Mode)
MMU
Paging is achieved through the use of the Memory Management Unit (MMU). On the x86, the MMU maps memory through a series of tables, two to be exact. They are the page directory (PD) and the page table (PT).
Both tables contain 1024 4-byte entries, making them 4 KiB each. In the page directory, each entry points to a page table. In the page table, each entry points to a 4 KiB physical page frame. Additionally, each entry has bits controlling access protection and caching features of the structure to which it points. The entire system consisting of a page directory and page tables represents a linear 4-GiB virtual memory map.
Translation of a virtual address into a physical address first involves dividing the virtual address into three parts: the most significant 10 bits (bits 22-31) specify the index of the page directory entry, the next 10 bits (bits 12-21) specify the index of the page table entry, and the least significant 12 bits (bits 0-11) specify the page offset. The MMU then walks through the paging structures, starting with the page directory, and uses the page directory entry to locate the page table. The page table entry is used to locate the base address of the physical page frame, and the page offset is added to the physical base address to produce the physical address. If translation fails for some reason (an entry is marked as not present, for example), then the processor issues a page fault.
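As a rough illustration of the split described above, the following C sketch extracts the three fields and performs the final step of the walk. The names va_parts, split_va and phys_from_pte are made up for this example, and the sketch assumes plain 32-bit paging with 4 KiB pages.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper type: the three fields of a 32-bit linear address. */
struct va_parts {
    uint32_t pd_index;  /* bits 22-31: index into the page directory */
    uint32_t pt_index;  /* bits 12-21: index into the page table */
    uint32_t offset;    /* bits 0-11: byte offset inside the 4 KiB page */
};

static struct va_parts split_va(uint32_t va)
{
    struct va_parts p;
    p.pd_index = va >> 22;
    p.pt_index = (va >> 12) & 0x3FFu;
    p.offset   = va & 0xFFFu;
    return p;
}

/* Last step of the walk: frame base taken from the PTE plus the page offset.
 * The caller is assumed to have checked that the entry is present (bit 0). */
static uint32_t phys_from_pte(uint32_t pte, uint32_t va)
{
    assert(pte & 1u);
    return (pte & 0xFFFFF000u) | (va & 0xFFFu);
}
```

For instance, the address 0xC0100ABC selects page directory entry 768 and page table entry 256, with an offset of 0xABC into the page.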
Page Directory
The topmost paging structure is the page directory. It is essentially an array of page directory entries that take the following form.
When PS=0, the page table address field holds the physical address of the page table that manages the four megabytes at that point. It is very important that this address be 4-KiB aligned, because the low 12 bits of the 32-bit entry are occupied by the access bits and other flags. Similarly, when PS=1, the address must be 4-MiB aligned.
- PAT, or 'Page Attribute Table'. If PAT is supported, then PAT along with PCD and PWT shall indicate the memory caching type. Otherwise, it is reserved and must be set to 0.
- G, or 'Global' tells the processor not to invalidate the TLB entry corresponding to the page upon a MOV to CR3 instruction. Bit 7 (PGE) in CR4 must be set to enable global pages.
- PS, or 'Page Size' stores the page size for that specific entry. If the bit is set, then the PDE maps to a page that is 4 MiB in size. Otherwise, it maps to a 4 KiB page table. Please note that 4-MiB pages require PSE to be enabled.
- D, or 'Dirty' is used to determine whether a page has been written to.
- A, or 'Accessed' is used to discover whether a PDE or PTE was read during virtual address translation. If it was, the bit is set; otherwise, it is not. Note that this bit will not be cleared by the CPU, so that burden falls on the OS (if it needs this bit at all).
- PCD, is the 'Cache Disable' bit. If the bit is set, the page will not be cached. Otherwise, it will be.
- PWT, the 'Write-Through' bit, controls the write caching policy of the page. If the bit is set, write-through caching is enabled. If not, then write-back is enabled instead.
- U/S, the 'User/Supervisor' bit, controls access to the page based on privilege level. If the bit is set, then the page may be accessed by all; if the bit is not set, however, only the supervisor can access it. For a page directory entry, the user bit controls access to all the pages referenced by the page directory entry. Therefore if you wish to make a page a user page, you must set the user bit in the relevant page directory entry as well as the page table entry.
- R/W, the 'Read/Write' permissions flag. If the bit is set, the page is read/write. Otherwise when it is not set, the page is read-only. The WP bit in CR0 determines if this is only applied to userland, always giving the kernel write access (the default) or both userland and the kernel (see Intel Manuals 3A 2-20). The R/W bit of the parent tables is also checked: if any are 0, the page is treated as read-only.
- P, or 'Present'. If the bit is set, the page is actually in physical memory at the moment. For example, when a page is swapped out, it is not in physical memory and therefore not 'Present'. If a page is accessed but not present, a page fault will occur, and the OS should handle it. (See below.)
The remaining bits 9 through 11 (if PS=0, also bits 6 & 8) are not used by the processor, and are free for the OS to store some of its own accounting information. In addition, when P is not set, the processor ignores the rest of the entry and you can use all remaining 31 bits for extra information, like recording where the page has ended up in swap space. When changing the accessed or dirty bits from 1 to 0 while an entry is marked as present, it's recommended to invalidate the associated page. Otherwise, the processor may not set those bits upon subsequent read/writes due to TLB caching.
Setting the PS bit makes the page directory entry point directly to a 4-MiB page. There is no paging table involved in the address translation.
Note: With 4-MiB pages, whether or not bits 20 through 13 are reserved depends on PSE being enabled and how many PSE bits are supported by the processor (PSE, PSE-36, PSE-40). CPUID should be used to determine this. Thus, the physical address must also be 4-MiB-aligned. Physical addresses above 4 GiB can only be mapped using 4 MiB PDEs.
Page Table
Each page table also contains 1024 entries. These are called page table entries, and are very similar to page directory entries.
The first item is, once again, a 4-KiB aligned physical address. This time, however, the address is not that of a page table, but of a 4 KiB block of physical memory that is mapped to the location selected by the page table and page directory indices. Note that in a PTE the PAT bit is bit 7, whereas in a 4 MiB PDE it is bit 12.
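The flag bits described for directory and table entries can be written down as constants. The sketch below is only an illustration: the PG_* names and both helpers are invented here, and the bit positions are the standard ones from the Intel manuals for 32-bit paging. The second helper mirrors the rule that the U/S and R/W bits of the parent PDE are checked as well as those of the PTE.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names for the low flag bits of a 32-bit PDE or PTE. */
enum {
    PG_PRESENT  = 1 << 0,
    PG_WRITABLE = 1 << 1,  /* R/W */
    PG_USER     = 1 << 2,  /* U/S */
    PG_PWT      = 1 << 3,
    PG_PCD      = 1 << 4,
    PG_ACCESSED = 1 << 5,
    PG_DIRTY    = 1 << 6,
    PG_PS       = 1 << 7,  /* PDE: 4 MiB page; in a PTE this bit is PAT */
    PG_GLOBAL   = 1 << 8
};

/* Build a present entry from a 4 KiB aligned physical address and flags. */
static uint32_t make_entry(uint32_t phys, uint32_t flags)
{
    assert((phys & 0xFFFu) == 0);  /* low 12 bits are reserved for flags */
    return phys | (flags & 0xFFFu) | PG_PRESENT;
}

/* A user-mode write succeeds only if P, U/S and R/W are set at both levels. */
static int user_can_write(uint32_t pde, uint32_t pte)
{
    uint32_t both = pde & pte;
    return (both & PG_PRESENT) && (both & PG_USER) && (both & PG_WRITABLE);
}
```

With this layout, make_entry(0x00123000, PG_USER) yields 0x00123005, and a page stays read-only for user code as long as either level has R/W clear.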
Example
Say the kernel is loaded at 0x100000 but needs to be remapped to 0xC0000000. After loading, the kernel initiates paging and sets up the appropriate tables (see Higher Half Kernel). After identity paging the first megabyte, it needs to create a second table (i.e. at entry #768 in the page directory) that maps the virtual address 0xC0000000 to the physical address 0x100000. The code may look like this:
    mov eax, 0x0
    mov ebx, 0x100000
.fill_table:
    mov ecx, ebx
    or ecx, 3
    mov [table_768+eax*4], ecx
    add ebx, 4096
    inc eax
    cmp eax, 1024
    jne .fill_table
64-Bit Paging
Page map table entry structure (non-page-sized)
Paging in long mode is similar to 32-bit paging, except that Physical Address Extension (PAE) is required. Registers CR2 and CR3 are extended to 64 bits. In addition to the three levels of page maps used by PAE (page directory pointer table, page directory, and page table), a fourth page-map table is used: the level-4 page map table (PML4). This allows a processor to map 48-bit virtual addresses to 52-bit physical addresses. If level-5 page maps are supported and enabled, then a fifth page-map table, the level-5 page map table (PML5), allows the processor to map 57-bit virtual addresses to 52-bit physical addresses. Both the PML4 and PML5 contain 512 64-bit entries, each of which may point to a lower-level page map table. Note that with each additional level of paging, address translation becomes slower, especially in the case of TLB misses.
Virtual addresses in 64-bit mode must be canonical, that is, the upper bits of the address must either be all 0s or all 1s. For systems supporting 48-bit virtual address spaces, bits 48-63 must all equal bit 47, and for systems supporting 57-bit virtual addresses, bits 57-63 must all equal bit 56. Although 32-bit code running in long mode (compatibility mode) is still limited to 32-bit virtual addresses, those addresses can still map to 52-bit physical addresses.
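The canonical-address rule can be checked with a few shifts. The following is a minimal sketch; the function name is_canonical is made up here, and va_bits is assumed to be 48 or 57 depending on the paging mode in use.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Returns true if bits (va_bits - 1) through 63 of va are all 0 or all 1. */
static bool is_canonical(uint64_t va, unsigned va_bits)
{
    assert(va_bits >= 1 && va_bits <= 64);
    uint64_t upper    = va >> (va_bits - 1);          /* the bits that must agree */
    uint64_t all_ones = UINT64_MAX >> (va_bits - 1);  /* same width, all set */
    return upper == 0 || upper == all_ones;
}
```

For example, 0x0000800000000000 is not canonical with 48-bit addressing (bit 47 is set but bits 48-63 are clear), yet it is canonical with 57-bit addressing.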
Page Map Table Entries
Page map table entry structure (page-sized)
New bits have been added to page map table entries for long-mode paging:
- XD, or 'Execute Disable'. If the NXE bit (bit 11) is set in the EFER register, then instructions are not allowed to be executed at addresses within the page whenever XD is set. If EFER.NXE bit is 0, then the XD bit is reserved and should be set to 0.
- PK, or 'Protection Key'. The protection key is a 4-bit value associated with each virtual address that is used to control user-mode and supervisor-mode memory accesses. If the PKE bit (bit 22) in CR4 is set, then the PKRU register is used for determining access rights for user-mode based on the protection key. If the PKS bit (bit 24) is set in CR4, then the PKRS register is used for determining access rights for supervisor-mode based on the protection key. A protection key allows the system to enable/disable access rights for multiple page entries across different address spaces at once.
M signifies the physical address width supported by a processor using PAE. Currently, up to 52 bits are supported, but the actual supported width may be less.
Bits marked as reserved must all be set to 0; otherwise, a page fault will occur with the reserved-bit (RSVD) flag set in the error code.
Support for 1 GiB pages, (NX) execute disable, (PKS/PKU) protection keys for supervisor-mode and user-mode pages, shadow stack pages, (M) physical address width, virtual address width, (PAT) page attribute table, (PCID) process context identifiers, and (LA57) 5-level paging can be determined with the CPUID instruction (EAX=0x01; EAX=0x07 with ECX=0x00; EAX=0x80000001; EAX=0x80000008).
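As an illustration of how some of these CPUID results might be interpreted, the sketch below decodes register values that are assumed to have been read already (executing CPUID itself is left out). The helper names are invented for this example; the bit positions used are EAX[7:0] and EAX[15:8] of leaf 0x80000008 for the physical and virtual address widths, and EDX bits 20 and 26 of leaf 0x80000001 for NX and 1 GiB pages.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct addr_widths {
    unsigned phys_bits;  /* M, the supported physical address width */
    unsigned virt_bits;  /* supported virtual (linear) address width */
};

/* eax: the value returned in EAX by CPUID leaf 0x80000008. */
static struct addr_widths decode_addr_widths(uint32_t eax)
{
    struct addr_widths w;
    w.phys_bits = eax & 0xFFu;
    w.virt_bits = (eax >> 8) & 0xFFu;
    assert(w.phys_bits <= 52);  /* architectural maximum mentioned in the text */
    return w;
}

/* edx: the value returned in EDX by CPUID leaf 0x80000001. */
static bool has_nx(uint32_t edx)         { return (edx >> 20) & 1u; }
static bool has_1gib_pages(uint32_t edx) { return (edx >> 26) & 1u; }
```

A value of 0x00003027 in EAX, for instance, decodes to 39 physical and 48 virtual address bits.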
Process Context Identifiers
If process context ids (PCID) are supported, then bits 0-11 of CR3 specify the process context id. Otherwise, bit 3 is PWT for PML4, and bit 4 is PCD for PML4. PCIDs are used to control TLB caching across multiple address spaces. The INVPCID instruction uses PCIDs to allow more control over page invalidation.
Enabling
32-bit Paging
Enabling paging is actually very simple. All that is needed is to load CR3 with the address of the page directory and to set the paging (PG) and protection (PE) bits of CR0.
    mov eax, page_directory
    mov cr3, eax
    mov eax, cr0
    or eax, 0x80000001
    mov cr0, eax
Note: setting the paging flag when the protection flag is clear causes a general protection exception. Also, once paging has been enabled, any attempt to enable long mode by setting LME (bit 8) of the EFER register will trigger a general protection fault. CR0.PG must first be cleared before EFER.LME can be set.
If you want to set pages as read-only for both userspace and supervisor, replace 0x80000001 above with 0x80010001, which also sets the WP bit.
To enable PSE (4 MiB pages) the following code is required.
    mov eax, cr4
    or eax, 0x00000010
    mov cr4, eax
64-bit Paging
Enabling paging in long mode requires a few additional steps. Since it is not possible to enter long mode without paging with PAE active, the order in which the bits are enabled is important. Firstly, paging must not be active (i.e. CR0.PG must be cleared). Then, CR4.PAE (bit 5) and EFER.LME (bit 8 of MSR 0xC0000080) are set. If 57-bit virtual addresses are to be enabled, then CR4.LA57 (bit 12) is set. Finally, CR0.PG is set to enable paging.
    ; Skip these 3 lines if paging is already disabled
    mov ebx, cr0
    and ebx, ~(1 << 31)
    mov cr0, ebx

    ; Enable PAE
    mov edx, cr4
    or edx, (1 << 5)
    mov cr4, edx

    ; Set LME (long mode enable)
    mov ecx, 0xC0000080
    rdmsr
    or eax, (1 << 8)
    wrmsr

    ; Replace 'pml4_table' with the appropriate physical address (and flags, if applicable)
    mov eax, pml4_table
    mov cr3, eax

    ; Enable paging (and protected mode, if it isn't already active)
    mov ebx, cr0        ; re-read CR0 so EBX is valid even if the first step was skipped
    or ebx, (1 << 31) | (1 << 0)
    mov cr0, ebx

    ; Now reload the segment registers (CS, DS, SS, etc.) with the appropriate segment selectors...
    mov ax, DATA_SEL
    mov ds, ax
    mov es, ax
    mov fs, ax
    mov gs, ax

    ; Reload CS with a 64-bit code selector by performing a long jmp
    jmp CODE_SEL:reloadCS

[BITS 64]
reloadCS:
    hlt                 ; Done. Replace these lines with your own code
    jmp reloadCS
Once paging has been enabled, you cannot switch from 4-level paging to 5-level paging (and vice-versa) directly. The same is true for switching to legacy 32-bit paging. You must first disable paging by clearing CR0.PG before making changes. Failure to do so will result in a general protection fault.
Physical Address Extension
All Intel processors since the Pentium Pro (with the exception of the Pentium M at 400 MHz) and all AMD processors since the Athlon series implement the Physical Address Extension (PAE). This feature allows you to access up to 4 PiB (2^52 bytes) of RAM. You can check for this feature using CPUID. Once checked, you can activate this feature by setting bit 5 in CR4.
For legacy 32-bit PAE, the CR3 register points to a page directory pointer table (PDPT) of 4 64-bit entries, each one pointing to a page directory made of 4096 bytes (like in normal paging), divided into 512 64-bit entries, each pointing to a 4096-byte page table, divided into 512 64-bit page entries. Keep in mind that virtual addresses are still limited to 4 GiB (2^32 bytes).
For 4-level and 5-level PAE, as used in compatibility mode and long mode, the CR3 register points to the top-level page map table: the PML4 table or the PML5 table, respectively. Each of the page map tables (PML5 table, PML4 table, page directory pointer table, page directory, page table) contains 512 64-bit entries.
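Because every PAE-style table holds 512 entries, each level consumes 9 bits of the virtual address (the 4-entry PDPT of legacy PAE consumes only 2). The sketch below shows the resulting index extraction for 4-level paging and for legacy 32-bit PAE, assuming 4 KiB pages; all type and function names are invented for the example.

```c
#include <assert.h>
#include <stdint.h>

struct va4_parts { unsigned pml4, pdpt, pd, pt, offset; };
struct pae_parts { unsigned pdpt, pd, pt, offset; };

/* 4-level paging: 9 + 9 + 9 + 9 index bits, then a 12-bit page offset. */
static struct va4_parts split_va4(uint64_t va)
{
    struct va4_parts p;
    p.pml4   = (va >> 39) & 0x1FF;
    p.pdpt   = (va >> 30) & 0x1FF;
    p.pd     = (va >> 21) & 0x1FF;
    p.pt     = (va >> 12) & 0x1FF;
    p.offset = va & 0xFFF;
    return p;
}

/* Legacy 32-bit PAE: 2 + 9 + 9 index bits, then a 12-bit page offset. */
static struct pae_parts split_va_pae(uint32_t va)
{
    struct pae_parts p;
    p.pdpt   = va >> 30;
    p.pd     = (va >> 21) & 0x1FF;
    p.pt     = (va >> 12) & 0x1FF;
    p.offset = va & 0xFFF;
    return p;
}

/* Usage example: a higher-half address in a 48-bit address space. */
static void split_va4_example(void)
{
    struct va4_parts p = split_va4(0xFFFF800000201ABCull);
    assert(p.pml4 == 256 && p.pdpt == 0 && p.pd == 1 && p.pt == 1);
    assert(p.offset == 0xABC);
}
```

With 5-level paging the same pattern extends upward by one more 9-bit field, taken from bits 48-56.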
If paging is enabled then PAE must also be enabled before entering long mode. Attempting to enter long mode with CR0.PG set and CR4.PAE cleared will trigger a general protection fault.
Usage
Due to the simplicity in the design of paging, it has many uses.
Virtual Address Spaces
In a paged system, each process may execute in its own area of memory, without any chance of affecting any other process's memory, or the kernel's. Two or more processes may opt to share memory by mapping the same physical page(s) to addresses in their own address spaces. The virtual addresses of these mappings do not need to be the same. Consequently, a virtual address in one address space won't, in general, point to the same data in other address spaces.
Paging illustrated: two processes with different views of the same physical memory
Virtual Memory
Because paging allows for the dynamic handling of unallocated page tables, an OS can swap entire pages that are not currently in use to the hard drive, where they can wait until they are needed. In the meantime, the physical memory that they were using can be used elsewhere. In this way, the OS can make programs appear to have more RAM than is actually installed.
Manipulation
The CR3 value, that is, the value containing the address of the page directory, is a physical address. Once paging is enabled, however, the processor only recognizes virtual addresses that are mapped in the paging tables, so how can the tables themselves be edited and dynamically changed?
Many prefer to map the last PDE to itself. The page directory will then look like a page table to the system, so that the page tables appear at virtual addresses starting at 0xFFC00000 and the page directory itself appears at 0xFFFFF000. Getting the physical address of any virtual address in the range 0x00000000-0xFFFFF000 is then just a matter of:
void *get_physaddr(void *virtualaddr) {
    unsigned long pdindex = (unsigned long)virtualaddr >> 22;
    unsigned long ptindex = (unsigned long)virtualaddr >> 12 & 0x03FF;
    unsigned long *pd = (unsigned long *)0xFFFFF000;
    // Here you need to check whether the PD entry is present.
    unsigned long *pt = ((unsigned long *)0xFFC00000) + (0x400 * pdindex);
    // Here you need to check whether the PT entry is present.
    return (void *)((pt[ptindex] & ~0xFFF) + ((unsigned long)virtualaddr & 0xFFF));
}
Mapping a virtual address to a physical address can be done as follows:
void map_page(void *physaddr, void *virtualaddr, unsigned int flags) {
    // Make sure that both addresses are page-aligned.
    unsigned long pdindex = (unsigned long)virtualaddr >> 22;
    unsigned long ptindex = (unsigned long)virtualaddr >> 12 & 0x03FF;
    unsigned long *pd = (unsigned long *)0xFFFFF000;
    // Here you need to check whether the PD entry is present.
    // When it is not present, you need to create a new empty PT and
    // adjust the PDE accordingly.
    unsigned long *pt = ((unsigned long *)0xFFC00000) + (0x400 * pdindex);
    // Here you need to check whether the PT entry is present.
    // When it is, then there is already a mapping present. What do you do now?
    pt[ptindex] = ((unsigned long)physaddr) | (flags & 0xFFF) | 0x01; // Present
    // Now you need to flush the entry in the TLB
    // or you might not notice the change.
}
Unmapping an entry is essentially the same as above, but instead of assigning the pt[ptindex] a value, you set it to 0x00000000 (i.e. not present). When the entire page table is empty, you may want to remove it and mark the page directory entry 'not present'. Of course you don't need the 'flags' or 'physaddr' for unmapping.
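With the last PDE mapped to itself, the virtual address at which any PDE or PTE can be edited follows directly from the indices. The sketch below only computes those addresses (it does not dereference them, so it is not tied to a running kernel); the macro and function names are made up, and the constants are the ones used in the functions above.

```c
#include <assert.h>
#include <stdint.h>

#define RECURSIVE_PT_BASE 0xFFC00000u  /* all page tables appear from here */
#define RECURSIVE_PD_BASE 0xFFFFF000u  /* the page directory itself */

/* Virtual address of the 4-byte PDE that covers va. */
static uint32_t pde_vaddr(uint32_t va)
{
    return RECURSIVE_PD_BASE + (va >> 22) * 4u;
}

/* Virtual address of the 4-byte PTE that maps va. */
static uint32_t pte_vaddr(uint32_t va)
{
    return RECURSIVE_PT_BASE + (va >> 12) * 4u;
}

/* Sanity check: the page table that maps the page-table window itself
 * is the page directory, which is what the self-reference implies. */
static void recursive_map_example(void)
{
    assert(pte_vaddr(RECURSIVE_PT_BASE) == RECURSIVE_PD_BASE);
    assert(pde_vaddr(0xC0100ABCu) == 0xFFFFFC00u);
}
```

Since (va >> 12) equals pdindex * 1024 + ptindex, pte_vaddr produces the same location as indexing pt[ptindex] in the functions above.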
Page Faults
A page fault exception is caused when a process tries to access an area of virtual memory that is not mapped to any physical memory, when a write is attempted on a read-only page, when a PTE or PDE with a reserved bit set is used for translation, or when permissions are inadequate. A page fault can either be pure, which occurs when the faulting process has permission to access the page, or invalid, which is due to a protection violation. Pure page faults aren't errors, but are resolved through the page fault handler by performing the appropriate map operation and/or page swap.
Handling
The CPU pushes an error code on the stack before firing a page fault exception. The error code must be analyzed by the exception handler to determine how to handle the exception. The following bits are the only ones used, all others are reserved.
Bit 0 (P) is the Present flag (0 = the fault was caused by a non-present page, 1 = by a protection violation).
Bit 1 (R/W) is the Read/Write flag (0 = read access, 1 = write access).
Bit 2 (U/S) is the User/Supervisor flag (0 = supervisor-mode access, 1 = user-mode access).
Bit 3 (RSVD) indicates whether a reserved bit was set in some page-structure entry.
Bit 4 (I/D) is the Instruction/Data flag (1 = instruction fetch, 0 = data access).
Bit 5 (PK) indicates a protection-key violation.
Bit 6 (SS) indicates a shadow-stack access fault.
Bit 15 (SGX) indicates an SGX violation.
The combination of these flags specifies the details of the page fault and indicates what action to take:
US RW P - Description
0 0 0 - Supervisory process tried to read a non-present page entry
0 0 1 - Supervisory process tried to read a page and caused a protection fault
0 1 0 - Supervisory process tried to write to a non-present page entry
0 1 1 - Supervisory process tried to write a page and caused a protection fault
1 0 0 - User process tried to read a non-present page entry
1 0 1 - User process tried to read a page and caused a protection fault
1 1 0 - User process tried to write to a non-present page entry
1 1 1 - User process tried to write a page and caused a protection fault
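A handler typically starts by decoding the error code along the lines of the table above. The following is a small sketch of such a decoder; the PF_* names and describe_fault are invented for this example, and a real handler would act on the bits rather than format a string.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical names for the page fault error code bits. */
enum {
    PF_PRESENT = 1 << 0,
    PF_WRITE   = 1 << 1,
    PF_USER    = 1 << 2,
    PF_RSVD    = 1 << 3,
    PF_INSTR   = 1 << 4
};

/* Writes a one-line description of err into buf; returns snprintf's result. */
static int describe_fault(uint32_t err, char *buf, size_t len)
{
    assert(buf != NULL && len > 0);
    const char *who  = (err & PF_USER) ? "User" : "Supervisor";
    const char *what = (err & PF_INSTR) ? "execute"
                     : (err & PF_WRITE) ? "write" : "read";
    const char *why  = (err & PF_RSVD)    ? "reserved bit set"
                     : (err & PF_PRESENT) ? "protection violation"
                                          : "page not present";
    return snprintf(buf, len, "%s %s: %s", who, what, why);
}
```

An error code of 6 (U/S and R/W set, P clear) is therefore reported as a user write to a non-present page, matching the row "1 1 0" of the table.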
When the CPU raises a page fault, the CR2 register is populated with the linear address that caused the exception. With 32-bit paging, the upper 10 bits specify the page directory entry (PDE) and the middle 10 bits specify the page table entry (PTE). For a page-not-present fault, first check whether the present bit of the PDE is set. If it is not, set up a page table, point the PDE to the base address of that page table, set the present bit and iretd. If the PDE is present, then the present bit of the PTE must be clear: map some physical memory into the page table, set the present bit, and then iretd to continue processing.
INVLPG
INVLPG is an instruction available since the i486 that invalidates a single page in the TLB. Intel notes that this instruction may be implemented differently on future processors, but that this alternate behavior must be explicitly enabled. INVLPG modifies no flags.
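In NASM syntax the instruction takes a memory operand that names any address inside the page to invalidate, for example invlpg [eax] with the virtual address in EAX.
Inline assembly for GCC (from Linux kernel source):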
static inline void __native_flush_tlb_single(unsigned long addr) {
    asm volatile("invlpg (%0)" ::"r" (addr) : "memory");
}
This only invalidates the page on the current processor. If you're using SMP, you'll need to send an IPI to the other processors so that they can also invalidate the page (this is called a TLB shootdown; it's very slow), making sure to avoid any nasty race conditions. You may want to do this only when removing a mapping. When adding a mapping, you can skip the shootdown and instead have the page fault handler on each processor look through the page directory and simply invalidate the page if the mapping turns out to exist, again avoiding race conditions.
When you modify an entry in the page directory, rather than just a page table, you'll need to invalidate each page in the table. Alternatively, you could reload CR3, which invalidates all non-global TLB entries, but this may be slower. (TODO: time this)
Paging Tricks
The processor always fires a page fault exception when the present bit is cleared in the PDE or PTE, regardless of the rest of the entry. This means the contents of the PTE or PDE can be used to indicate the location of a page saved on mass storage so that it can be loaded quickly. When a page gets swapped to disk, use these entries to record the location in the paging file from which it can be loaded, then set the present bit to 0. Similarly, blocks from disk can be mapped to memory this way. When a process accesses the memory-mapped region, a page fault occurs. The fault handler reads the appropriate tables, loads the disk block(s) into a page, and maps it. The process can then read/write to memory as if it were accessing the device directly. The contents of the page would then be written back to disk to save the changes.
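One possible way to keep a swap location in a non-present entry is sketched below. The layout is purely an assumption for illustration: bit 0 stays clear, the remaining 31 bits hold a swap slot number, and slot 0 is reserved so that an all-zero entry still means "not mapped at all". The function names are made up.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Encode a swap slot (1 .. 2^31 - 1) into a non-present 32-bit PTE. */
static uint32_t swap_pte_encode(uint32_t slot)
{
    assert(slot != 0 && slot < (1u << 31));
    return slot << 1;  /* bit 0 (Present) is left clear */
}

/* True if the entry is non-present but carries a swap slot. */
static bool pte_is_swapped(uint32_t pte)
{
    return (pte & 1u) == 0 && pte != 0;
}

/* Recover the swap slot from an entry produced by swap_pte_encode. */
static uint32_t swap_pte_decode(uint32_t pte)
{
    assert(pte_is_swapped(pte));
    return pte >> 1;
}
```

The page fault handler would then call pte_is_swapped on the faulting entry and, if it holds, read the page back from the decoded slot before marking the entry present again.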
For memory efficiency, two or more processes can share pages as read-only. If one process were to write to its page, then a page fault would occur and the system could duplicate the page and then mark it as read-write. This is known as copy-on-write (COW). Copy-on-write allows the system to delay memory allocation until a process actually requires it, preventing unnecessary copying.
PAT
The Page Attribute Table determines caching attributes on a page granularity. This is similar to MTRRs, but those apply to physical addresses and are more limited.
The PAT is set via the IA32_PAT MSR (0x277). It has 8 entries, each taking the low-order 3 bits of one byte, in standard little-endian order. So the high byte is PAT7 and the low byte is PAT0.
The following are the different caching types.
{| class="wikitable" border="1"
! Number
! Name
! Description
|-
| 0
| UC (Uncacheable)
| All accesses are uncacheable. Write combining is not allowed. Speculative accesses are not allowed.
|-
| 1
| WC (Write-Combining)
| All accesses are uncacheable. Write combining is allowed. Speculative reads are allowed.
|-
| 4
| WT (Writethrough)
| Reads allocate cache lines on a cache miss. Cache lines are not allocated on a write miss. Write hits update the cache and main memory.
|-
| 5
| WP (Write-Protect)
| Reads allocate cache lines on a cache miss. All writes update main memory. Cache lines are not allocated on a write miss. Write hits invalidate the cache line and update main memory.
|-
| 6
| WB (Writeback)
| Reads allocate cache lines on a cache miss, and can allocate to either the shared, exclusive, or modified state. Writes allocate to the modified state on a cache miss.
|-
| 7
| UC- (Uncached)
| Same as uncacheable, ''except'' that this can be overridden by Write-Combining MTRRs.
|}
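The PAT has a reset value of 0x0007040600070406. This ensures compatibility with non-PAT usage. Reading the bytes from high to low, this corresponds to the following:
{| class="wikitable" border="1"
! PAT7 !! PAT6 !! PAT5 !! PAT4 !! PAT3 !! PAT2 !! PAT1 !! PAT0
|-
| UC || UC- || WT || WB || UC || UC- || WT || WB
|}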
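The PAT is indexed by three page table bits, which together form a 3-bit entry number:
{| class="wikitable" border="1"
! Bit 2 !! Bit 1 !! Bit 0
|-
| PAT || PCD || PWT
|}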
The PAT bit is reserved when there isn't a PAT, and the default value of the MSR ensures backwards compatibility with the PCD and PWT bits.
You will need to modify the PAT if you want Write-Combining cache, which is very useful for framebuffers.
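As a sketch of what such a modification could look like, the code below computes a new value for the PAT MSR with one entry changed to Write-Combining, and shows how the PAT, PCD and PWT bits select an entry. Only the value is computed; writing it to MSR 0x277 with wrmsr must be done in ring 0 and is not shown. The names are invented for this example, and using entry 4 for WC is merely one possible choice.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Memory type encodings from the table above. */
enum pat_type {
    PAT_UC = 0, PAT_WC = 1, PAT_WT = 4,
    PAT_WP = 5, PAT_WB = 6, PAT_UC_MINUS = 7
};

#define PAT_RESET_VALUE 0x0007040600070406ull

/* Return pat with entry 'index' (0-7) replaced by memory type t. */
static uint64_t pat_set(uint64_t pat, unsigned index, enum pat_type t)
{
    assert(index < 8);
    pat &= ~(0xFFull << (index * 8));   /* clear the whole byte */
    pat |= (uint64_t)t << (index * 8);  /* type goes in the low 3 bits */
    return pat;
}

/* Entry number selected by a page's PAT, PCD and PWT bits. */
static unsigned pat_index(bool pat, bool pcd, bool pwt)
{
    return ((unsigned)pat << 2) | ((unsigned)pcd << 1) | (unsigned)pwt;
}
```

After pat_set(PAT_RESET_VALUE, 4, PAT_WC), a page with PAT=1, PCD=0 and PWT=0 selects entry 4 and becomes write-combining, while pages that use only PCD and PWT keep their reset-time behaviour.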
See Also
Articles
External Links