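This article analyzes the code of linux-5.19.13 on the aarch64 (ARM64) architecture.
The page-table analysis assumes the following configuration:
(1) four levels of page-table translation, i.e. CONFIG_ARM64_PGTABLE_LEVELS=4;
(2) a 48-bit virtual address width, i.e. CONFIG_ARM64_VA_BITS=48;
(3) a 48-bit physical address width, i.e. CONFIG_ARM64_PA_BITS=48.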
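To make the assumed configuration concrete, here is a minimal C sketch of how a 48-bit virtual address splits into four 9-bit table indices when pages are 4 KB. The shift values correspond to what PAGE_SHIFT, PMD_SHIFT, PUD_SHIFT and PGDIR_SHIFT evaluate to under this configuration; the helper va_index() and the sample address are illustrative assumptions, not kernel code.

```c
#include <stdint.h>

/* Assumed configuration: 4 KB pages, 4 translation levels, 48-bit VA. */
#define PAGE_SHIFT   12
#define PMD_SHIFT    21   /* PAGE_SHIFT + 9 */
#define PUD_SHIFT    30   /* PMD_SHIFT + 9 */
#define PGDIR_SHIFT  39   /* PUD_SHIFT + 9 */
#define PTRS_PER_TBL 512  /* 4096-byte table / 8-byte descriptor */

/*
 * Hypothetical helper: index into the table whose entries each
 * cover 2^shift bytes of virtual address space.
 */
static inline unsigned int va_index(uint64_t va, unsigned int shift)
{
    return (unsigned int)((va >> shift) & (PTRS_PER_TBL - 1));
}
```

For the sample address 0xffff800008000000, va_index() yields 256 at the PGD level, 0 at the PUD level, 64 at the PMD level and 0 at the PTE level.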
1. Entry analysis
1.1 Linker script arch/arm64/kernel/vmlinux.lds.S
Only the definitions related to memory initialization are listed here; everything else is elided with "......".
......
OUTPUT_ARCH(aarch64) 'sets the output machine architecture to aarch64'
ENTRY(_text) 'sets the entry address; _text is implemented in arch/arm64/kernel/head.S'
......
SECTIONS
{
......
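'As of kernel 5.8 TEXT_OFFSET no longer serves any purpose and is redefined as 0x0'
. = KIMAGE_VADDR; 'virtual start address of the kernel image (before 5.8 this was KIMAGE_VADDR + TEXT_OFFSET)'
.head.text : { 'text section of the early assembly code'
_text = .; 'entry address'
HEAD_TEXT 'defined in include/asm-generic/vmlinux.lds.h as #define HEAD_TEXT KEEP(*(.head.text))'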
}
.text : ALIGN(SEGMENT_ALIGN) { /* Real text segment */
_stext = .; /* Text and read-only data */ 'start of the text section'
......
}
......
. = ALIGN(SEGMENT_ALIGN);
_etext = .; /* End of text section */ 'end of the text section'
/* everything from this point to __init_begin will be marked RO NX */
RO_DATA(PAGE_SIZE) 'read-only data section'
......
idmap_pg_dir = .; 'address of the top-level page table of the identity mapping'
. += IDMAP_DIR_SIZE;
idmap_pg_end = .;
#ifdef CONFIG_UNMAP_KERNEL_AT_EL0
tramp_pg_dir = .; 'trampoline page table, introduced to mitigate the Meltdown vulnerability'
. += PAGE_SIZE;
#endif
reserved_pg_dir = .;
. += PAGE_SIZE;
swapper_pg_dir = .;
. += PAGE_SIZE;
. = ALIGN(SEGMENT_ALIGN);
__init_begin = .; 'start of the init section'
__inittext_begin = .;
......
. = ALIGN(SEGMENT_ALIGN);
__initdata_end = .;
__init_end = .; 'end of the init section'
_data = .;
_sdata = .; 'start of the data section'
RW_DATA(L1_CACHE_BYTES, PAGE_SIZE, THREAD_ALIGN)
_edata = .; 'end of the data section'
BSS_SECTION(SBSS_ALIGN, 0, 0) 'BSS section'
. = ALIGN(PAGE_SIZE);
init_pg_dir = .;
. += INIT_DIR_SIZE;
init_pg_end = .;
......
}
1.2 Entry point
#arch/arm64/kernel/head.S
/*
* Kernel startup entry point.
* ---------------------------
*
* The requirements are:
* MMU = off, D-cache = off, I-cache = on or off,
* x0 = physical address to the FDT blob.
*
* This code is mostly position independent so you call this at
* __pa(PAGE_OFFSET).
*
* Note that the callee-saved registers are used for storing variables
* that are useful before the MMU is enabled. The allocations are described
* in the entry routines.
*/
__HEAD 'defined in include/linux/init.h as #define __HEAD .section ".head.text","ax"; placed immediately after _text'
/*
* DO NOT MODIFY. Image header expected by Linux boot-loaders.
*/
efi_signature_nop // special NOP to identity as PE/COFF executable
b primary_entry // branch to kernel start, magic 'the boot assembly code analyzed in detail below'
......
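1.3 Calling convention for booting AArch64 Linux
Between power-on and reaching the kernel entry point "_text", the system is brought up by a bootloader or BIOS. The boot program initializes memory, sets up the device tree, decompresses the kernel, jumps to the kernel, and so on. A set of standard conventions applies before the jump; see Documentation/translations/zh_CN/arm64/booting.txt. Only the state that must hold just before jumping into the kernel is listed here:
- Quiesce all DMA-capable devices so that memory is not corrupted by bogus network packets or disk data. This can save you many hours of debugging.
- Primary CPU general-purpose registers: x0 = physical address of the device tree blob (dtb) in system RAM.
x1 = 0 (reserved for future use)
x2 = 0 (reserved for future use)
x3 = 0 (reserved for future use)
- CPU mode: all forms of interrupts must be masked in PSTATE.DAIF (Debug, SError, IRQ and FIQ).
The CPU must be in EL2 (recommended, gives access to the virtualization extensions) or in non-secure EL1. 'the bootloader performs this switch'
- Caches, MMU: the MMU must be off. 'MMU off; the instruction cache may normally be on; the data cache must be off'
The instruction cache may be on or off. The memory range holding the loaded kernel image must be cleaned to the Point of Coherency (PoC). In the presence of a system cache or other coherent masters with caches enabled, this typically requires cache maintenance by virtual address rather than set/way operations. System caches that respect the architected cache maintenance by VA operations must be configured and may be enabled. System caches that do not respect the architected cache maintenance by VA operations (not recommended) must be configured and disabled.
*Translator's note: for the PoC and cache-related material, see the ARMv8 Architecture Reference Manual, ARM DDI 0487A.
- Architected timers: CNTFRQ must be programmed with the timer frequency, and CNTVOFF must be programmed with a value that is consistent across all CPUs. If the kernel is entered at EL1, EL1PCTEN (bit 0) in CNTHCTL_EL2 must be set.
- Coherency: all CPUs to be booted by the kernel must be part of the same coherency domain on entry to the kernel. This may require implementation-defined initialization to enable the receiving of maintenance operations on each CPU.
- System registers: all writable architected system registers at the exception level where the kernel image is entered must be initialized by software at a higher exception level, to prevent execution in an UNKNOWN state.
For systems with a GICv3 interrupt controller that is used in v3 mode:
- If EL3 is present: ICC_SRE_EL3.Enable (bit 3) must be initialized to 0b1.
ICC_SRE_EL3.SRE (bit 0) must be initialized to 0b1.
- If the kernel is entered at EL1: ICC_SRE_EL2.Enable (bit 3) must be initialized to 0b1.
ICC_SRE_EL2.SRE (bit 0) must be initialized to 0b1.
- The device tree (DT) or ACPI tables must describe a GICv3 interrupt controller.
For systems with a GICv3 interrupt controller that is used in compatibility (v2) mode:
- If EL3 is present: ICC_SRE_EL3.SRE (bit 0) must be initialized to 0b0.
- If the kernel is entered at EL1: ICC_SRE_EL2.SRE (bit 0) must be initialized to 0b0.
- The device tree (DT) or ACPI tables must describe a GICv2 interrupt controller.
A key question arises here: why may the instruction cache be on when jumping to the kernel, while the data cache must be off?
(1) With the data cache on, the CPU's data accesses would look in the data cache first. That cache may still hold data cached by the bootloader, which can be wrong from the kernel's point of view. The data cache must therefore be off.
(2) The bootloader's instructions and the kernel's instructions do not conflict: once the bootloader has finished it never runs again, and execution continues directly with the kernel's instructions. The instruction cache therefore does not have to be turned off.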
2. Analysis of the boot assembly entry primary_entry
/*
* The following callee saved general purpose registers are used on the
* primary lowlevel boot path:
*
* Register Scope Purpose
* x21 primary_entry() .. start_kernel() FDT pointer passed at boot in x0
* x23 primary_entry() .. start_kernel() physical misalignment/KASLR offset
* x28 __create_page_tables() callee preserved temp register
* x19/x20 __primary_switch() callee preserved temp registers
* x24 __primary_switch() .. relocate_kernel() current RELR displacement
*/
SYM_CODE_START(primary_entry)
bl preserve_boot_args
bl init_kernel_el // w0=cpu_boot_mode
adrp x23, __PHYS_OFFSET --- 'load the page address of __PHYS_OFFSET into x23'
and x23, x23, MIN_KIMG_ALIGN - 1 // KASLR offset, defaults to 0
bl set_cpu_boot_mode_flag
bl __create_page_tables
/*
* The following calls CPU setup code, see arch/arm64/mm/proc.S for
* details.
* On return, the CPU will be ready for the MMU to be turned on and
* the TCR will have been set.
*/
bl __cpu_setup // initialise processor
b __primary_switch
SYM_CODE_END(primary_entry)
2.1 preserve_boot_args
Purpose: save the x0 .. x3 values passed in by the bootloader into the boot_args array.
/*
* Preserve the arguments passed by the bootloader in x0 .. x3
*/
SYM_CODE_START_LOCAL(preserve_boot_args)
mov x21, x0 // x21=FDT (x0 holds the device tree address, which is saved in x21)
adr_l x0, boot_args // record the contents of (x0 = address of the boot_args array)
stp x21, x1, [x0] // x0 .. x3 at kernel entry
stp x2, x3, [x0, #16] 'store the arguments x0 .. x3 into the boot_args array'
dmb sy // needed before dc ivac with (memory barrier; sy = full-system scope)
// MMU off
add x1, x0, #0x20 // 4 x 8 bytes
b dcache_inval_poc // tail call (invalidates the cache lines covering boot_args)
SYM_CODE_END(preserve_boot_args)
2.2 init_kernel_el
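Determines whether the CPU was booted at EL2 or at non-secure EL1 and performs the configuration for that level (in ARMv8, EL2 is the hypervisor level and EL1 is the normal kernel level), then returns the boot mode in w0 (BOOT_CPU_MODE_EL1 or BOOT_CPU_MODE_EL2). Typically the system powers up at EL3, U-Boot puts the processor in EL2, and the kernel normally drops to EL1 in init_kernel_el.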
/*
* Starting from EL2 or EL1, configure the CPU to execute at the highest
* reachable EL supported by the kernel in a chosen default state. If dropping
* from EL2 to EL1, configure EL2 before configuring EL1.
*
* Since we cannot always rely on ERET synchronizing writes to sysregs (e.g. if
* SCTLR_ELx.EOS is clear), we place an ISB prior to ERET.
*
* Returns either BOOT_CPU_MODE_EL1 or BOOT_CPU_MODE_EL2 in w0 if
* booted in EL1 or EL2 respectively.
*/
SYM_FUNC_START(init_kernel_el)
mrs x0, CurrentEL 'read the current exception level'
cmp x0, #CurrentEL_EL2
b.eq init_el2 'if the current exception level is EL2, branch to init_el2'
SYM_INNER_LABEL(init_el1, SYM_L_LOCAL)
mov_q x0, INIT_SCTLR_EL1_MMU_OFF
msr sctlr_el1, x0
isb 'needed because the system control register was just written'
mov_q x0, INIT_PSTATE_EL1
msr spsr_el1, x0
msr elr_el1, lr
mov w0, #BOOT_CPU_MODE_EL1
eret 'return from exception'
SYM_INNER_LABEL(init_el2, SYM_L_LOCAL) --- 'switch from EL2 to EL1'
......
msr elr_el1, x0
eret
1:
......
mov w0, #BOOT_CPU_MODE_EL2
eret
__cpu_stick_to_vhe:
mov x0, #HVC_VHE_RESTART
hvc #0
mov x0, #BOOT_CPU_MODE_EL2
ret
SYM_FUNC_END(init_kernel_el)
2.3 set_cpu_boot_mode_flag
Sets the __boot_cpu_mode flag according to the CPU boot mode passed in w0.
/*
* Sets the __boot_cpu_mode flag depending on the CPU boot mode passed
* in w0. See arch/arm64/include/asm/virt.h for more info.
*/
SYM_FUNC_START_LOCAL(set_cpu_boot_mode_flag)
adr_l x1, __boot_cpu_mode // x1 = address of __boot_cpu_mode[]
cmp w0, #BOOT_CPU_MODE_EL2 // w0 = exception level the CPU was booted in
b.ne 1f // not booted in EL2: branch to label 1
add x1, x1, #4 // booted in EL2: point x1 at __boot_cpu_mode[1]
1: str w0, [x1] // Save CPU boot mode at the address in x1 (__boot_cpu_mode[0] when booted in EL1)
dmb sy // make sure the str has completed
dc ivac, x1 // Invalidate potentially stale cache line
ret
SYM_FUNC_END(set_cpu_boot_mode_flag)
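The slot-selection logic of set_cpu_boot_mode_flag can be modelled in a few lines of C. This is only a sketch: the array boot_cpu_mode[] stands in for the kernel's __boot_cpu_mode, its initial values ({EL2, EL1}) and the two query helpers are assumptions modelled on arch/arm64/include/asm/virt.h, and the barrier and cache invalidation are omitted because they have no meaning in a user-space model.

```c
#include <stdint.h>

#define BOOT_CPU_MODE_EL1 0xe11u
#define BOOT_CPU_MODE_EL2 0xe12u

/* Stand-in for __boot_cpu_mode[]; the initial values are an assumption. */
static uint32_t boot_cpu_mode[2] = { BOOT_CPU_MODE_EL2, BOOT_CPU_MODE_EL1 };

/* Model of set_cpu_boot_mode_flag: EL1 boots write slot 0, EL2 boots slot 1. */
static inline void set_boot_mode_flag(uint32_t mode)
{
    uint32_t *slot = &boot_cpu_mode[0];

    if (mode == BOOT_CPU_MODE_EL2)
        slot = &boot_cpu_mode[1];   /* add x1, x1, #4 */
    *slot = mode;                   /* str w0, [x1]   */
}

/* Hypothetical query: every CPU seen so far was booted in EL2. */
static inline int hyp_mode_available(void)
{
    return boot_cpu_mode[0] == BOOT_CPU_MODE_EL2 &&
           boot_cpu_mode[1] == BOOT_CPU_MODE_EL2;
}

/* Hypothetical query: CPUs were booted in different exception levels. */
static inline int hyp_mode_mismatched(void)
{
    return boot_cpu_mode[0] != boot_cpu_mode[1];
}
```

With these initial values, the two slots only agree on EL2 if every CPU booted in EL2, and a single EL1 boot after an EL2 boot leaves the slots different, which is what makes the two-slot layout useful for detecting mismatches.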
2.4 __create_page_tables
/*
* Setup the initial page tables. We only setup the barest amount which is
* required to get the kernel running. The following sections are required:
* - identity mapping to enable the MMU (low address, TTBR0) (1) identity mapping
* - first few MB of the kernel linear mapping to jump to once the MMU has
* been enabled (2) kernel image mapping
*/
SYM_FUNC_START_LOCAL(__create_page_tables)
...
SYM_FUNC_END(__create_page_tables)
2.4.1 Saving the LR value
mov x28, lr // save LR in x28
2.4.2 Invalidating and clearing the initial page tables
/*
* Invalidate the init page tables to avoid potential dirty cache lines
* being evicted. Other page tables are allocated in rodata as part of
* the kernel image, and thus are clean to the PoC per the boot
* protocol.
*/
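adrp x0, init_pg_dir // x0 = physical address of init_pg_dir
adrp x1, init_pg_end // x1 = physical address of init_pg_end
bl dcache_inval_poc // invalidate the cache lines covering the init_pg_dir page tables (arguments in x0 and x1)
/*
* Clear the init page tables. // zero the page-table contents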
*/
adrp x0, init_pg_dir
adrp x1, init_pg_end
sub x1, x1, x0
1: stp xzr, xzr, [x0], #16 // xzr is the zero register
stp xzr, xzr, [x0], #16
stp xzr, xzr, [x0], #16
stp xzr, xzr, [x0], #16
subs x1, x1, #64
b.ne 1b
(1) init_pg_dir and init_pg_end are defined in the linker script arch/arm64/kernel/vmlinux.lds.S:
#arch/arm64/kernel/vmlinux.lds.S
. = ALIGN(PAGE_SIZE);
init_pg_dir = .;
. += INIT_DIR_SIZE;
init_pg_end = .;
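(2) The adrp instruction
Purpose: a page-granular, long-range address-forming instruction; the P stands for page.
How it works: the 21-bit offset (immhi:immlo) is sign-extended and shifted left by 12 bits, the low 12 bits of the PC are cleared, and the two values are added. The result is written to Xd and is the base address of the 4KB-aligned region that contains the label (the label's address is guaranteed to lie inside that 4KB region, which is what "Page" in the mnemonic refers to). The reachable range is +/- 4GB (a 33-bit signed byte offset).
Put simply, ADRP computes PC-page + imm and yields only the base of the 4KB page that contains the label; the label's offset within that page (its low 12 bits) has to be added by a separate instruction when the exact address is needed.
ADRP Xd, label
Here Xd is the destination register and label is an address expression. Unlike A32 instructions, ADRP takes no condition code.
(3) adrp is used to obtain the addresses of init_pg_dir and init_pg_end, with a 4KB page size. Both symbols are page-aligned in the linker script, so the page base is the exact address. The MMU is still off at this point of the boot (the PC holds a physical address), so the addresses obtained are physical addresses as well.
(4) adrp computes the target from an offset relative to the current PC, independent of the absolute physical address at which the image was loaded, so the code is position independent.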
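The address arithmetic described for adrp can be written out as a small C model. This is a sketch of the computation only, not of instruction encoding: adrp_result() takes the already sign-extended 21-bit page offset, and adrp_imm_for() is a hypothetical helper showing the value an assembler would have to encode for a given PC and target.

```c
#include <stdint.h>

/*
 * Model of ADRP: clear the low 12 bits of the PC, then add the
 * sign-extended 21-bit immediate shifted left by 12.
 * imm21 is assumed to be in the range [-2^20, 2^20 - 1].
 */
static inline uint64_t adrp_result(uint64_t pc, int32_t imm21)
{
    uint64_t pc_page = pc & ~0xfffULL;

    return pc_page + (uint64_t)((int64_t)imm21 * 4096);
}

/* Hypothetical helper: page distance between the PC and the target. */
static inline int32_t adrp_imm_for(uint64_t pc, uint64_t target)
{
    return (int32_t)((int64_t)(target >> 12) - (int64_t)(pc >> 12));
}
```

Because only the page distance between the PC and the target is encoded, the result tracks wherever the image happens to be loaded; the low 12 bits of the target never enter the calculation.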
2.4.3 Loading SWAPPER_MM_MMUFLAGS into x7
mov_q x7, SWAPPER_MM_MMUFLAGS
The SWAPPER_MM_MMUFLAGS macro describes the attributes of the section mapping; it is defined in the header arch/arm64/include/asm/kernel-pgtable.h:
/*
* Initial memory map attributes.
*/
#define SWAPPER_PTE_FLAGS (PTE_TYPE_PAGE | PTE_AF | PTE_SHARED | PTE_UXN)
#define SWAPPER_PMD_FLAGS (PMD_TYPE_SECT | PMD_SECT_AF | PMD_SECT_S | PMD_SECT_UXN)
#if ARM64_KERNEL_USES_PMD_MAPS
#define SWAPPER_MM_MMUFLAGS (PMD_ATTRINDX(MT_NORMAL) | SWAPPER_PMD_FLAGS) // section mapping; the case used here
#else
#define SWAPPER_MM_MMUFLAGS (PTE_ATTRINDX(MT_NORMAL) | SWAPPER_PTE_FLAGS) // page mapping
#endif
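As a rough illustration of what ends up in x7, the sketch below assembles a section (block) descriptor's attribute bits from the ARMv8 descriptor layout. The DESC_* names and both helpers are made up for this example; the MAIR index that MT_NORMAL maps to is passed in as a parameter rather than assumed.

```c
#include <stdint.h>

/* ARMv8 stage-1 block descriptor fields (names are illustrative). */
#define DESC_TYPE_BLOCK  (1ULL << 0)           /* bits[1:0] = 0b01: valid block entry */
#define DESC_ATTRINDX(i) ((uint64_t)(i) << 2)  /* AttrIndx[2:0], index into MAIR_EL1 */
#define DESC_SH_INNER    (3ULL << 8)           /* SH[1:0] = 0b11: inner shareable */
#define DESC_AF          (1ULL << 10)          /* access flag */
#define DESC_UXN         (1ULL << 54)          /* unprivileged execute-never */

/* Counterpart of PMD_ATTRINDX(MT_NORMAL) | SWAPPER_PMD_FLAGS. */
static inline uint64_t swapper_block_flags(unsigned int mair_index)
{
    return DESC_ATTRINDX(mair_index) | DESC_TYPE_BLOCK |
           DESC_AF | DESC_SH_INNER | DESC_UXN;
}

/* Hypothetical helper: full 2 MB block descriptor for physical address pa. */
static inline uint64_t make_block_desc(uint64_t pa, unsigned int mair_index)
{
    uint64_t block_base = pa & ~((1ULL << 21) - 1);

    return block_base | swapper_block_flags(mair_index);
}
```

With MAIR index 0 the flags evaluate to 0x0040000000000701; populate_entries later ORs such a value into each 2 MB-aligned physical address to form the final entries.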
2.4.4 Creating the identity mapping
/*
* Create the identity mapping.
*/
adrp x0, idmap_pg_dir ---(1)
adrp x3, __idmap_text_start // __pa(__idmap_text_start) ---(2)
#ifdef CONFIG_ARM64_VA_BITS_52 ---(3)
mrs_s x6, SYS_ID_AA64MMFR2_EL1
and x6, x6, #(0xf << ID_AA64MMFR2_LVA_SHIFT)
mov x5, #52
cbnz x6, 1f
#endif
mov x5, #VA_BITS_MIN ---(4)
1:
adr_l x6, vabits_actual ---(5)
str x5, [x6]
dmb sy // memory barrier
dc ivac, x6 // Invalidate potentially stale cache line (the line holding vabits_actual is invalidated, not cleaned)
/*
* VA_BITS may be too small to allow for an ID mapping to be created
* that covers system RAM if that is located sufficiently high in the
* physical address space. So for the ID map, use an extended virtual
* range in that case, and configure an additional translation level
* if needed.
*
* Calculate the maximum allowed value for TCR_EL1.T0SZ so that the
* entire ID map region can be mapped. As T0SZ == (64 - #bits used),
* this number conveniently equals the number of leading zeroes in
* the physical address of __idmap_text_end.
*/
adrp x5, __idmap_text_end ---(6)
clz x5, x5 // count leading zeros: number of 0 bits above the most significant 1 ---(6)
cmp x5, TCR_T0SZ(VA_BITS_MIN) // default T0SZ small enough? ---(6)
b.ge 1f // .. then skip VA range extension
adr_l x6, idmap_t0sz
str x5, [x6]
dmb sy
dc ivac, x6 // Invalidate potentially stale cache line
#if (VA_BITS < 48)
#define EXTRA_SHIFT (PGDIR_SHIFT + PAGE_SHIFT - 3)
#define EXTRA_PTRS (1 << (PHYS_MASK_SHIFT - EXTRA_SHIFT))
/*
* If VA_BITS < 48, we have to configure an additional table level.
* First, we have to verify our assumption that the current value of
* VA_BITS was chosen such that all translation levels are fully
* utilised, and that lowering T0SZ will always result in an additional
* translation level to be configured.
*/
#if VA_BITS != EXTRA_SHIFT
#error "Mismatch between VA_BITS and page size/number of translation levels"
#endif
mov x4, EXTRA_PTRS
create_table_entry x0, x3, EXTRA_SHIFT, x4, x5, x6
#else
/*
* If VA_BITS == 48, we don't have to configure an additional
* translation level, but the top-level table has more entries.
*/
mov x4, #1 << (PHYS_MASK_SHIFT - PGDIR_SHIFT)
str_l x4, idmap_ptrs_per_pgd, x5
#endif
1:
ldr_l x4, idmap_ptrs_per_pgd ---(7)
adr_l x6, __idmap_text_end // __pa(__idmap_text_end) ---(8)
map_memory x0, x1, x3, x6, x7, x3, x4, x10, x11, x12, x13, x14 ---(9)
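- (1) Load the physical address of idmap_pg_dir into x0. idmap_pg_dir is the start address of the top-level page table of the identity mapping and is defined in the linker script vmlinux.lds.S: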
idmap_pg_dir = .;
. += IDMAP_DIR_SIZE;
idmap_pg_end = .;
The space reserved for idmap_pg_dir is IDMAP_DIR_SIZE, which is defined in the header arch/arm64/include/asm/kernel-pgtable.h and usually amounts to three contiguous 4K pages. The calculation is as follows:
<1>#arch/arm64/include/asm/pgtable-hwdef.h
/*
* Number of page-table levels required to address 'va_bits' wide
* address, without section mapping. We resolve the top (va_bits - PAGE_SHIFT)
* bits with (PAGE_SHIFT - 3) bits at each page table level. Hence:
*
* levels = DIV_ROUND_UP((va_bits - PAGE_SHIFT), (PAGE_SHIFT - 3))
*
* where DIV_ROUND_UP(n, d) => (((n) + (d) - 1) / (d))
*
* We cannot include linux/kernel.h which defines DIV_ROUND_UP here
* due to build issues. So we open code DIV_ROUND_UP here:
*
* ((((va_bits) - PAGE_SHIFT) + (PAGE_SHIFT - 3) - 1) / (PAGE_SHIFT - 3))
*
* which gets simplified as :
*/
#define ARM64_HW_PGTABLE_LEVELS(va_bits) (((va_bits) - 4) / (PAGE_SHIFT - 3))
...
/*
* Highest possible physical address supported.
*/
#define PHYS_MASK_SHIFT (CONFIG_ARM64_PA_BITS) //48
<2>#arch/arm64/include/asm/kernel-pgtable.h
#if ARM64_KERNEL_USES_PMD_MAPS // section mappings normally take this branch
#define SWAPPER_PGTABLE_LEVELS (CONFIG_PGTABLE_LEVELS - 1)
#define IDMAP_PGTABLE_LEVELS (ARM64_HW_PGTABLE_LEVELS(PHYS_MASK_SHIFT) - 1) // ((48-12)+(12-3)-1) / (12-3) = (36+9-1)/9 = 44/9 = 4 (integer division), then 4 - 1 = 3
#else
#define SWAPPER_PGTABLE_LEVELS (CONFIG_PGTABLE_LEVELS)
#define IDMAP_PGTABLE_LEVELS (ARM64_HW_PGTABLE_LEVELS(PHYS_MASK_SHIFT)) // 4
#endif
...
#define IDMAP_DIR_SIZE (IDMAP_PGTABLE_LEVELS * PAGE_SIZE)
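CONFIG_ARM64_PA_BITS is configured as 48 here. IDMAP_DIR_SIZE answers the question of how many pages are needed to hold the tables when section mapping is used. The key macro is ARM64_HW_PGTABLE_LEVELS, which computes the number of translation levels (one page of tables per level) from the configured physical address width. Note the formula given in its comment:
((((va_bits) - PAGE_SHIFT) + (PAGE_SHIFT - 3) - 1) / (PAGE_SHIFT - 3))
With a width of 48 and PAGE_SHIFT = 12 the formula gives ((48-12)+(12-3)-1) / (12-3) = (36+9-1)/9 = 44/9 = 4 in integer division. Section mapping drops the last (PTE) level, so IDMAP_PGTABLE_LEVELS is 4 - 1 = 3 and IDMAP_DIR_SIZE is three pages: three tables (PGD/PUD/PMD) are reserved at consecutive addresses, each occupying one page.
Note that only 2MB section mappings are created here; for the identity mapping, which only has to cover the range from __idmap_text_start to __idmap_text_end, a 2M section mapping is sufficient.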
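The level calculation is easy to check in plain C. The sketch below reproduces both the open-coded DIV_ROUND_UP form from the comment and the simplified macro; the function names and idmap_dir_size() are illustrative, with the page shift passed in explicitly instead of coming from the kernel configuration.

```c
#include <stdint.h>

/* Open-coded DIV_ROUND_UP form quoted in the kernel comment. */
static inline int levels_round_up(int va_bits, int page_shift)
{
    return ((va_bits - page_shift) + (page_shift - 3) - 1) / (page_shift - 3);
}

/* Simplified form, as in ARM64_HW_PGTABLE_LEVELS(). */
static inline int hw_pgtable_levels(int va_bits, int page_shift)
{
    return (va_bits - 4) / (page_shift - 3);
}

/*
 * Hypothetical counterpart of IDMAP_DIR_SIZE: one page per level,
 * minus one level when 2 MB section (PMD) mappings are used.
 */
static inline uint64_t idmap_dir_size(int pa_bits, int page_shift, int pmd_maps)
{
    int levels = hw_pgtable_levels(pa_bits, page_shift) - (pmd_maps ? 1 : 0);

    return (uint64_t)levels << page_shift;
}
```

For 48 bits and 4 KB pages both forms return 4, and idmap_dir_size(48, 12, 1) is 12288 bytes, i.e. the three pages reserved for the PGD, PUD and PMD tables.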
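- (2) Load the physical address of __idmap_text_start into x3. The __idmap_text_start label is defined in arch/arm64/kernel/vmlinux.lds.S and is the start address of the region to be identity-mapped (physical address == virtual address):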
#define IDMAP_TEXT \
. = ALIGN(SZ_4K); \
__idmap_text_start = .; \
*(.idmap.text) \
__idmap_text_end = .;
.text : ALIGN(SEGMENT_ALIGN) { /* Real text segment */
_stext = .; /* Text and read-only data */
IRQENTRY_TEXT
SOFTIRQENTRY_TEXT
ENTRY_TEXT
TEXT_TEXT
SCHED_TEXT
CPUIDLE_TEXT
LOCK_TEXT
KPROBES_TEXT
HYPERVISOR_TEXT
IDMAP_TEXT '.idmap.text section'
*(.gnu.warning)
. = ALIGN(16);
*(.got) /* Global offset table */
}
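Besides turning on the MMU at boot, the kernel has many other situations that need the identity mapping; the functions involved are all placed in the .idmap.text section with a .section directive: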
# arch/arm64/kernel/head.S
/*
* end early head section, begin head code that is also used for
* hotplug and needs to have the same protections as the text region
*/
.section ".idmap.text","awx"
The functions located in the .idmap.text section can also be seen in System.map:
ffffffc00952f000 T __idmap_text_start //
ffffffc00952f000 T init_kernel_el
ffffffc00952f010 t init_el1
ffffffc00952f038 t init_el2
ffffffc00952f270 t __cpu_stick_to_vhe
ffffffc00952f280 t set_cpu_boot_mode_flag
ffffffc00952f2a8 T secondary_holding_pen
ffffffc00952f2d0 t pen
ffffffc00952f2e4 T secondary_entry
ffffffc00952f2f4 t secondary_startup
ffffffc00952f314 t __secondary_switched
ffffffc00952f3b8 t __secondary_too_slow
ffffffc00952f3c8 T __enable_mmu // worth close attention
ffffffc00952f42c T __cpu_secondary_check52bitva
ffffffc00952f434 t __no_granule_support
ffffffc00952f45c t __relocate_kernel
ffffffc00952f4a8 t __primary_switch // worth close attention
ffffffc00952f530 t enter_vhe
ffffffc00952f568 T cpu_resume
ffffffc00952f590 T cpu_soft_restart
ffffffc00952f5c4 T cpu_do_resume
ffffffc00952f66c T idmap_cpu_replace_ttbr1
ffffffc00952f6a4 t __idmap_kpti_flag
ffffffc00952f6a8 T idmap_kpti_install_ng_mappings
ffffffc00952f6e8 t do_pgd
ffffffc00952f700 t next_pgd
ffffffc00952f710 t skip_pgd
ffffffc00952f750 t walk_puds
ffffffc00952f758 t next_pud
ffffffc00952f75c t walk_pmds
ffffffc00952f764 t do_pmd
ffffffc00952f77c t next_pmd
ffffffc00952f78c t skip_pmd
ffffffc00952f79c t walk_ptes
ffffffc00952f7a4 t do_pte
ffffffc00952f7c8 t skip_pte
ffffffc00952f7d8 t __idmap_kpti_secondary
ffffffc00952f820 T __cpu_setup
ffffffc00952f974 T __idmap_text_end //
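- (3) The virtual address width is assumed to be 48 (CONFIG_ARM64_VA_BITS_48 is set), so the CONFIG_ARM64_VA_BITS_52 block is compiled out;
- (4) Load the virtual address width VA_BITS_MIN (48) into x5;
- (5) Store the value in x5 (48) into the global variable vabits_actual;
- (6) Load the physical address of __idmap_text_end into x5 and count the zero bits above its most significant 1, then check whether the address exceeds the range that VA_BITS_MIN can express. TCR_T0SZ(VA_BITS_MIN) is the T0SZ value (64 - 48 = 16) that sizes the region translated through TTBR0, which is where the page table created here will be installed; a leading-zero count of at least 16 means the address fits in 48 bits and the range extension is skipped;
- (7) Load the number of entries in the PGD table (2^9) from idmap_ptrs_per_pgd into x4;
- (8) Load the physical address of __idmap_text_end into x6;
- (9) Invoke the map_memory macro to create the page tables for the identity mapping: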
map_memory x0, x1, x3, x6, x7, x3, x4, x10, x11, x12, x13, x14
(1) x0 --- idmap_pg_dir
(2) x1 --- not meaningful on entry; map_memory recomputes it from tbl
(3) x3 --- __idmap_text_start
(4) x6 --- __idmap_text_end
(5) x7 --- SWAPPER_MM_MMUFLAGS
(6) x3 --- __idmap_text_start
(7) x4 --- idmap_ptrs_per_pgd
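The leading-zero test in step (6) can be sketched in C as follows. The clz64() helper is a portable stand-in for the clz instruction, TCR_T0SZ() is simplified to 64 - bits (the field shift is assumed to be 0), and idmap_needs_extension() is a hypothetical name for the comparison that decides whether the b.ge branch is taken.

```c
#include <stdint.h>

#define VA_BITS_MIN    48
#define TCR_T0SZ(bits) (64 - (bits))   /* T0SZ == 64 - number of VA bits used */

/* Portable count-leading-zeros for a 64-bit value (64 for an input of 0). */
static inline int clz64(uint64_t x)
{
    int n = 0;

    if (x == 0)
        return 64;
    while (!(x & (1ULL << 63))) {
        x <<= 1;
        n++;
    }
    return n;
}

/*
 * Returns 1 when the default T0SZ is too large to cover the physical
 * address of __idmap_text_end, i.e. when the b.ge branch is NOT taken
 * and the ID map needs an extended virtual range.
 */
static inline int idmap_needs_extension(uint64_t idmap_text_end_pa)
{
    return clz64(idmap_text_end_pa) < TCR_T0SZ(VA_BITS_MIN);
}
```

An image loaded around 0x40a00000 has 33 leading zeros, well above 16, so no extension is needed; only a physical address with bit 48 or higher set would trigger it.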
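2.4.5 Walking through the map_memory macro
The map_memory macro takes 12 parameters, all of which are described in the comment in the code below. The important ones are:
- tbl : start address of the page table (pgd)
- rtbl : start address of the next-level page table (typically tbl + PAGE_SIZE)
- vstart: start of the virtual address range to map
- vend : end of the virtual address range to map
- flags : attributes of the last-level entries
- phys : start of the physical address range to map
- pgds : number of pgd entries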
/*
* Map memory for specified virtual address range. Each level of page table needed supports
* multiple entries. If a level requires n entries the next page table level is assumed to be
* formed from n pages.
*
* tbl: location of page table (the pgd)
* rtbl: address to be used for first level page table entry (typically tbl + PAGE_SIZE)
* vstart: virtual address of start of range
* vend: virtual address of end of range - we map [vstart, vend - 1]
* flags: flags to use to map last level entries
* phys: physical address corresponding to vstart - physical memory is contiguous
* pgds: the number of pgd entries
*
* Temporaries: istart, iend, tmp, count, sv - these need to be different registers
* Preserves: vstart, flags
* Corrupts: tbl, rtbl, vend, istart, iend, tmp, count, sv
*/
.macro map_memory, tbl, rtbl, vstart, vend, flags, phys, pgds, istart, iend, tmp, count, sv
sub \vend, \vend, #1
add \rtbl, \tbl, #PAGE_SIZE ---(1)
mov \sv, \rtbl
mov \count, #0
compute_indices \vstart, \vend, #PGDIR_SHIFT, \pgds, \istart, \iend, \count ---(2)
populate_entries \tbl, \rtbl, \istart, \iend, #PMD_TYPE_TABLE, #PAGE_SIZE, \tmp ---(3)
mov \tbl, \sv
mov \sv, \rtbl
#if SWAPPER_PGTABLE_LEVELS > 3 // false here: SWAPPER_PGTABLE_LEVELS is 3
compute_indices \vstart, \vend, #PUD_SHIFT, #PTRS_PER_PUD, \istart, \iend, \count
populate_entries \tbl, \rtbl, \istart, \iend, #PMD_TYPE_TABLE, #PAGE_SIZE, \tmp
mov \tbl, \sv
mov \sv, \rtbl
#endif
#if SWAPPER_PGTABLE_LEVELS > 2 ---(4)
compute_indices \vstart, \vend, #SWAPPER_TABLE_SHIFT, #PTRS_PER_PMD, \istart, \iend, \count
populate_entries \tbl, \rtbl, \istart, \iend, #PMD_TYPE_TABLE, #PAGE_SIZE, \tmp
mov \tbl, \sv
#endif
compute_indices \vstart, \vend, #SWAPPER_BLOCK_SHIFT, #PTRS_PER_PTE, \istart, \iend, \count ---(5)
bic \count, \phys, #SWAPPER_BLOCK_SIZE - 1
populate_entries \tbl, \count, \istart, \iend, \flags, #SWAPPER_BLOCK_SIZE, \tmp
.endm
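- (1) Compute the PUD base address; rtbl is the address of the next-level table: PUD = PGD + PAGE_SIZE.
- (2) The compute_indices macro computes the table index at each level from the virtual address: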
/*
* Compute indices of table entries from virtual address range. If multiple entries
* were needed in the previous page table level then the next page table level is assumed
* to be composed of multiple pages. (This effectively scales the end index).
*
* vstart: virtual address of start of range
* vend: virtual address of end of range - we map [vstart, vend]
* shift: shift used to transform virtual address into index
* ptrs: number of entries in page table
* istart: index in table corresponding to vstart
* iend: index in table corresponding to vend
* count: On entry: how many extra entries were required in previous level, scales
* our end index.
* On exit: returns how many extra entries required for next page table level
*
* Preserves: vstart, vend, shift, ptrs
* Returns: istart, iend, count
*/
.macro compute_indices, vstart, vend, shift, ptrs, istart, iend, count
lsr \iend, \vend, \shift
mov \istart, \ptrs
sub \istart, \istart, #1
and \iend, \iend, \istart // iend = (vend >> shift) & (ptrs - 1)
mov \istart, \ptrs
mul \istart, \istart, \count
add \iend, \iend, \istart // iend += count * ptrs
// our entries span multiple tables
lsr \istart, \vstart, \shift
mov \count, \ptrs
sub \count, \count, #1
and \istart, \istart, \count
sub \count, \iend, \istart
.endm
- (3) The populate_entries macro fills in the page-table entries for the computed index range:
/*
* Macro to populate page table entries, these entries can be pointers to the next level
* or last level entries pointing to physical memory.
*
* tbl: page table address
* rtbl: pointer to page table or physical memory
* index: start index to write
* eindex: end index to write - [index, eindex] written to
* flags: flags for pagetable entry to or in
* inc: increment to rtbl between each entry
* tmp1: temporary variable
*
* Preserves: tbl, eindex, flags, inc
* Corrupts: index, tmp1
* Returns: rtbl
*/
.macro populate_entries, tbl, rtbl, index, eindex, flags, inc, tmp1
.Lpe\@: phys_to_pte \tmp1, \rtbl
orr \tmp1, \tmp1, \flags // tmp1 = table entry
str \tmp1, [\tbl, \index, lsl #3]
add \rtbl, \rtbl, \inc // rtbl = pa next level
add \index, \index, #1
cmp \index, \eindex
b.ls .Lpe\@
.endm
- (4) Set the PUD entries;
- (5) Set the PMD entries (with section mapping this is the last level; there is no PTE level).
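The two helper macros translate almost line for line into C, which makes their behavior easier to experiment with. The sketch below is a user-space model under simplifying assumptions: tables are plain uint64_t arrays, phys_to_pte is treated as the identity (no 52-bit PA handling), and struct idx is an invented return type bundling the three output registers.

```c
#include <stdint.h>

/* Bundles the three outputs of compute_indices (illustrative type). */
struct idx {
    uint64_t istart;  /* index of the entry covering vstart */
    uint64_t iend;    /* index of the entry covering vend */
    uint64_t count;   /* extra entries needed at the next level */
};

/* Model of compute_indices; vend is inclusive, as after "sub vend, vend, #1". */
static inline struct idx compute_indices(uint64_t vstart, uint64_t vend,
                                         unsigned int shift, uint64_t ptrs,
                                         uint64_t count)
{
    struct idx r;

    r.iend = ((vend >> shift) & (ptrs - 1)) + count * ptrs;
    r.istart = (vstart >> shift) & (ptrs - 1);
    r.count = r.iend - r.istart;
    return r;
}

/*
 * Model of populate_entries: writes tbl[index..eindex] inclusive and
 * returns the advanced rtbl, mirroring the "Returns: rtbl" contract.
 */
static inline uint64_t populate_entries(uint64_t *tbl, uint64_t rtbl,
                                        uint64_t index, uint64_t eindex,
                                        uint64_t flags, uint64_t inc)
{
    do {
        tbl[index] = rtbl | flags;   /* orr tmp1, tmp1, flags; str */
        rtbl += inc;                 /* next table or next block */
        index++;
    } while (index <= eindex);       /* cmp index, eindex; b.ls */
    return rtbl;
}
```

For a one-page range starting at 0x40a00000, the PMD-level call (shift 21, 512 entries) gives istart = iend = 5 and count = 0, so exactly one 2 MB block entry is written at index 5.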
2.4.6 Creating the kernel image mapping
/*
* Map the kernel image (starting with PHYS_OFFSET).
*/
adrp x0, init_pg_dir ---(1)
mov_q x5, KIMAGE_VADDR // compile time __va(_text) ---(2)
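add x5, x5, x23 // add KASLR displacement (x23 = __PHYS_OFFSET & (MIN_KIMG_ALIGN - 1))
mov x4, PTRS_PER_PGD
adrp x6, _end // runtime __pa(_end) ---(3) physical end address of the kernel image
adrp x3, _text // runtime __pa(_text) ---(4) physical start address of the kernel image
sub x6, x6, x3 // _end - _text (size of the kernel image)
add x6, x6, x5 // runtime __va(_end) ---(5) virtual end address of the kernel image
map_memory x0, x1, x5, x6, x7, x3, x4, x10, x11, x12, x13, x14 ---(6)
- (1) Load the physical address of init_pg_dir into x0. init_pg_dir is the start address of the page tables used for the kernel image mapping (separate from those of the identity mapping) and is defined in the linker script vmlinux.lds.S: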
BSS_SECTION(SBSS_ALIGN, 0, 0)
. = ALIGN(PAGE_SIZE);
init_pg_dir = .;
. += INIT_DIR_SIZE;
init_pg_end = .;
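- (2) Load KIMAGE_VADDR, the virtual start address of the kernel image, into x5; note that the mov_q instruction is used here. The linker script vmlinux.lds.S places the image at KIMAGE_VADDR:
SECTIONS
{
......
'As of kernel 5.8 TEXT_OFFSET no longer serves any purpose and is redefined as 0x0'
. = KIMAGE_VADDR; 'virtual start address of the kernel image (before 5.8 this was KIMAGE_VADDR + TEXT_OFFSET)'
.head.text : { 'text section of the early assembly code'
_text = .; 'entry address'
HEAD_TEXT 'defined in include/asm-generic/vmlinux.lds.h as #define HEAD_TEXT KEEP(*(.head.text))'
}
- (3) Load the physical end address of the kernel image into x6;
- (4) Load the physical start address of the kernel image into x3;
- (5) Convert to the virtual end address of the kernel image (image size plus virtual start address) and leave it in x6;
- (6) Invoke the map_memory macro to create the page tables for the kernel image mapping.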
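The register arithmetic in steps (2) to (5) boils down to two small formulas, sketched here in C. The function names and the sample addresses used with them are assumptions chosen for illustration; MIN_KIMG_ALIGN is passed in as a parameter (2 MB in the example) rather than taken from kernel headers.

```c
#include <stdint.h>

/*
 * Counterpart of "and x23, x23, MIN_KIMG_ALIGN - 1": the part of the
 * physical load address that is not aligned to min_align.
 */
static inline uint64_t phys_misalignment(uint64_t load_pa, uint64_t min_align)
{
    return load_pa & (min_align - 1);
}

/*
 * Counterpart of x5/x6 before map_memory:
 *   x5 = KIMAGE_VADDR + x23              (virtual start)
 *   x6 = (__pa(_end) - __pa(_text)) + x5 (virtual end)
 */
static inline uint64_t kimage_va_end(uint64_t kimage_vaddr, uint64_t offset,
                                     uint64_t pa_text, uint64_t pa_end)
{
    uint64_t va_start = kimage_vaddr + offset;

    return (pa_end - pa_text) + va_start;
}
```

For example, an image of 0x1800000 bytes loaded at physical 0x40200000 and linked at 0xffff800008000000 with a zero offset ends at virtual address 0xffff800009800000; map_memory then maps [x5, x6 - 1] onto physical memory starting at x3.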