The previous part built up the memory-management framework and finished setting up the memory-management node (pg_data_t); what comes next is the construction of the zones and of page management. The code of interest is reached through a hook invoked from setup_arch(): x86_init.paging.pagetable_init(). As analysed earlier, inside the x86_init structure this hook is wired to the native_pagetable_init() function.
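As a quick reminder of how such a hook is dispatched, here is a minimal userspace sketch of the function-pointer-table pattern that x86_init uses; the structure below is deliberately simplified and is not the kernel's actual struct x86_init_ops definition:

/*
 * Simplified sketch of the x86_init hook pattern (not the kernel's real
 * struct x86_init_ops): setup_arch() only calls through a function
 * pointer, and the default assignment points it at native_pagetable_init().
 */
#include <stdio.h>

struct paging_ops {
    void (*pagetable_init)(void);
};

struct init_ops {
    struct paging_ops paging;
};

static void native_pagetable_init(void)
{
    printf("native_pagetable_init: build zone/page management\n");
}

/* default hook table, analogous in spirit to x86_init in arch/x86/kernel/x86_init.c */
static struct init_ops x86_init = {
    .paging = {
        .pagetable_init = native_pagetable_init,
    },
};

int main(void)
{
    /* what setup_arch() effectively does */
    x86_init.paging.pagetable_init();
    return 0;
}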
native_pagetable_init():
【file:/arch/x86/mm/init_32.c】
void __init native_pagetable_init(void)
{
unsigned long pfn, va;
pgd_t *pgd, *base = swapper_pg_dir;
pud_t *pud;
pmd_t *pmd;
pte_t *pte;
/*
* Remove any mappings which extend past the end of physical
* memory from the boot time page table.
* In virtual address space, we should have at least two pages
* from VMALLOC_END to pkmap or fixmap according to VMALLOC_END
* definition. And max_low_pfn is set to VMALLOC_END physical
* address. If initial memory mapping is doing right job, we
* should have pte used near max_low_pfn or one pmd is not present.
*/
for (pfn = max_low_pfn; pfn < 1<<(32-PAGE_SHIFT); pfn++) {
va = PAGE_OFFSET + (pfn<<PAGE_SHIFT);
pgd = base + pgd_index(va);
if (!pgd_present(*pgd))
break;
pud = pud_offset(pgd, va);
pmd = pmd_offset(pud, va);
if (!pmd_present(*pmd))
break;
/* should not be large page here */
if (pmd_large(*pmd)) {
pr_warn("try to clear pte for ram above max_low_pfn: pfn: %lx pmd: %p pmd phys: %lx, but pmd is big page and is not using pte !\n",
pfn, pmd, __pa(pmd));
BUG_ON(1);
}
pte = pte_offset_kernel(pmd, va);
if (!pte_present(*pte))
break;
printk(KERN_DEBUG "clearing pte for ram above max_low_pfn: pfn: %lx pmd: %p pmd phys: %lx pte: %p pte phys: %lx\n",
pfn, pmd, __pa(pmd), pte, __pa(pte));
pte_clear(NULL, va, pte);
}
paravirt_alloc_pmd(&init_mm, __pa(base) >> PAGE_SHIFT);
paging_init();
}
The for loop in this function checks whether the boot-time page tables still map physical memory beyond max_low_pfn, that is, beyond the end of the direct-mapping space; any such leftover mappings are cleared with pte_clear().
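The loop's address arithmetic is worth spelling out. The sketch below (plain userspace C, with an assumed max_low_pfn rather than a value from a real machine) shows how a page frame number is turned into its direct-mapping virtual address and where the 4 GB upper bound of the loop comes from:

/*
 * Standalone sketch (not kernel code): the arithmetic used by the loop in
 * native_pagetable_init() to turn a page frame number into the virtual
 * address of its direct mapping. The max_low_pfn value is an assumed
 * example only.
 */
#include <stdio.h>

#define PAGE_SHIFT   12
#define PAGE_OFFSET  0xC0000000UL   /* default 3G/1G split on 32-bit x86 */

int main(void)
{
    unsigned long max_low_pfn = 0x373feUL;                  /* assumed: ~884 MB of lowmem */
    unsigned long last_pfn    = 1UL << (32 - PAGE_SHIFT);   /* first pfn beyond 4 GB */
    unsigned long va          = PAGE_OFFSET + (max_low_pfn << PAGE_SHIFT);

    printf("first pfn checked : 0x%lx\n", max_low_pfn);
    printf("its direct-map va : 0x%lx\n", va);
    printf("loop stops at pfn : 0x%lx (4 GB boundary)\n", last_pfn);
    return 0;
}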
The paravirt_alloc_pmd() call that follows belongs to the paravirtualization support, which replaces a variety of x86 low-level operations through hook functions; on bare metal it is effectively a no-op.
Further down comes paging_init():
【file:/arch/x86/mm/init_32.c】
/*
* paging_init() sets up the page tables - note that the first 8MB are
* already mapped by head.S.
*
* This routines also unmaps the page at virtual kernel address 0, so
* that we can trap those pesky NULL-reference errors in the kernel.
*/
void __init paging_init(void)
{
pagetable_init();
__flush_tlb_all();
kmap_init();
/*
* NOTE: at this point the bootmem allocator is fully available.
*/
olpc_dt_build_devicetree();
sparse_memory_present_with_active_regions(MAX_NUMNODES);
sparse_init();
zone_sizes_init();
}
paging_init() is essentially a sequence of function calls, so let's analyse them one by one, starting with pagetable_init():
【file:/arch/x86/mm/init_32.c】
static void __init pagetable_init(void)
{
pgd_t *pgd_base = swapper_pg_dir;
permanent_kmaps_init(pgd_base);
}
pagetable_init() merely sets up the permanent kmap mappings on top of swapper_pg_dir via permanent_kmaps_init(). Back in paging_init(), the next call is kmap_init(), which relies on kmap_get_fixmap_pte():
【file:/arch/x86/mm/init_32.c】
static inline pte_t *kmap_get_fixmap_pte(unsigned long vaddr)
{
return pte_offset_kernel(pmd_offset(pud_offset(pgd_offset_k(vaddr),
vaddr), vaddr), vaddr);
}
It is easy to see that kmap_init() mainly fetches the page-table entry at the start of the temporary-mapping (fixmap) area, stores it in the global kmap_pte, and sets the page-protection variable kmap_prot to PAGE_KERNEL.
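kmap_init() itself is not quoted above; for reference, its body in /arch/x86/mm/init_32.c of this kernel generation is roughly the following (reproduced from memory, so treat it as an approximation rather than an exact quote):

static void __init kmap_init(void)
{
	unsigned long kmap_vstart;

	/*
	 * Cache the first kmap pte: the fixmap slot used for temporary
	 * (atomic) kernel mappings.
	 */
	kmap_vstart = __fix_to_virt(FIX_KMAP_BEGIN);
	kmap_pte = kmap_get_fixmap_pte(kmap_vstart);

	kmap_prot = PAGE_KERNEL;
}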
In paging_init(), since CONFIG_OLPC is not enabled in my configuration, olpc_dt_build_devicetree() is an empty function and will not be analysed here. Likewise, sparse_memory_present_with_active_regions() and sparse_init(), mentioned earlier, are left aside for now.
Finally, let's look at zone_sizes_init():
【file:/arch/x86/mm/init.c】
void __init zone_sizes_init(void)
{
unsigned long max_zone_pfns[MAX_NR_ZONES];
memset(max_zone_pfns, 0, sizeof(max_zone_pfns));
#ifdef CONFIG_ZONE_DMA
max_zone_pfns[ZONE_DMA] = MAX_DMA_PFN;
#endif
#ifdef CONFIG_ZONE_DMA32
max_zone_pfns[ZONE_DMA32] = MAX_DMA32_PFN;
#endif
max_zone_pfns[ZONE_NORMAL] = max_low_pfn;
#ifdef CONFIG_HIGHMEM
max_zone_pfns[ZONE_HIGHMEM] = max_pfn;
#endif
free_area_init_nodes(max_zone_pfns);
}
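To make the numbers concrete, here is a small standalone sketch of what max_zone_pfns typically ends up holding on a 32-bit machine; the 2 GB of RAM and the ~884 MB lowmem boundary are assumed example values, not figures taken from the text:

/*
 * Standalone illustration (assumed values, not a real dump): what
 * max_zone_pfns ends up holding on a 32-bit box with 2 GB of RAM, the
 * usual 16 MB DMA zone and ~884 MB of lowmem.
 */
#include <stdio.h>

#define PAGE_SHIFT 12

int main(void)
{
    unsigned long MAX_DMA_PFN  = (16UL << 20) >> PAGE_SHIFT;    /* 16 MB -> 0x1000  */
    unsigned long max_low_pfn  = 0x373feUL;                     /* assumed lowmem end */
    unsigned long max_pfn      = (2048UL << 20) >> PAGE_SHIFT;  /* 2 GB  -> 0x80000 */

    printf("ZONE_DMA     up to pfn 0x%lx (%lu MB)\n", MAX_DMA_PFN, MAX_DMA_PFN >> 8);
    printf("ZONE_NORMAL  up to pfn 0x%lx (%lu MB)\n", max_low_pfn, max_low_pfn >> 8);
    printf("ZONE_HIGHMEM up to pfn 0x%lx (%lu MB)\n", max_pfn,     max_pfn     >> 8);
    return 0;
}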
zone_sizes_init() fills max_zone_pfns with the maximum page frame number of each zone and passes the array to free_area_init_nodes(). The implementation of free_area_init_nodes():
【file:/mm/page_alloc.c】
/**
* free_area_init_nodes - Initialise all pg_data_t and zone data
* @max_zone_pfn: an array of max PFNs for each zone
*
* This will call free_area_init_node() for each active node in the system.
* Using the page ranges provided by add_active_range(), the size of each
* zone in each node and their holes is calculated. If the maximum PFN
* between two adjacent zones match, it is assumed that the zone is empty.
* For example, if arch_max_dma_pfn == arch_max_dma32_pfn, it is assumed
* that arch_max_dma32_pfn has no pages. It is also assumed that a zone
* starts where the previous one ended. For example, ZONE_DMA32 starts
* at arch_max_dma_pfn.
*/
void __init free_area_init_nodes(unsigned long *max_zone_pfn)
{
unsigned long start_pfn, end_pfn;
int i, nid;
/* Record where the zone boundaries are */
memset(arch_zone_lowest_possible_pfn, 0,
sizeof(arch_zone_lowest_possible_pfn));
memset(arch_zone_highest_possible_pfn, 0,
sizeof(arch_zone_highest_possible_pfn));
arch_zone_lowest_possible_pfn[0] = find_min_pfn_with_active_regions();
arch_zone_highest_possible_pfn[0] = max_zone_pfn[0];
for (i = 1; i < MAX_NR_ZONES; i++) {
if (i == ZONE_MOVABLE)
continue;
arch_zone_lowest_possible_pfn[i] =
arch_zone_highest_possible_pfn[i-1];
arch_zone_highest_possible_pfn[i] =
max(max_zone_pfn[i], arch_zone_lowest_possible_pfn[i]);
}
arch_zone_lowest_possible_pfn[ZONE_MOVABLE] = 0;
arch_zone_highest_possible_pfn[ZONE_MOVABLE] = 0;
/* Find the PFNs that ZONE_MOVABLE begins at in each node */
memset(zone_movable_pfn, 0, sizeof(zone_movable_pfn));
find_zone_movable_pfns_for_nodes();
/* Print out the zone ranges */
printk("Zone ranges:\n");
for (i = 0; i < MAX_NR_ZONES; i++) {
if (i == ZONE_MOVABLE)
continue;
printk(KERN_CONT " %-8s ", zone_names[i]);
if (arch_zone_lowest_possible_pfn[i] ==
arch_zone_highest_possible_pfn[i])
printk(KERN_CONT "empty\n");
else
printk(KERN_CONT "[mem %0#10lx-%0#10lx]\n",
arch_zone_lowest_possible_pfn[i] << PAGE_SHIFT,
(arch_zone_highest_possible_pfn[i]
<< PAGE_SHIFT) - 1);
}
/* Print out the PFNs ZONE_MOVABLE begins at in each node */
printk("Movable zone start for each node\n");
for (i = 0; i < MAX_NUMNODES; i++) {
if (zone_movable_pfn[i])
printk(" Node %d: %#010lx\n", i,
zone_movable_pfn[i] << PAGE_SHIFT);
}
/* Print out the early node map */
printk("Early memory node ranges\n");
for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, &nid)
printk(" node %3d: [mem %#010lx-%#010lx]\n", nid,
start_pfn << PAGE_SHIFT, (end_pfn << PAGE_SHIFT) - 1);
/* Initialise every node */
mminit_verify_pageflags_layout();
setup_nr_node_ids();
for_each_online_node(nid) {
pg_data_t *pgdat = NODE_DATA(nid);
free_area_init_node(nid, NULL,
find_min_pfn_for_node(nid), NULL);
/* Any memory on that node */
if (pgdat->node_present_pages)
node_set_state(nid, N_MEMORY);
check_for_memory(pgdat, nid);
}
}
In this function, arch_zone_lowest_possible_pfn stores the lowest usable page frame number of each zone, while arch_zone_highest_possible_pfn stores the highest. find_min_pfn_with_active_regions() supplies the lowest page frame number of the first zone, and the highest boundaries are then filled in by the for loop that follows:
for (i = 1; i < MAX_NR_ZONES; i++) {
if (i == ZONE_MOVABLE)
continue;
arch_zone_lowest_possible_pfn[i] =
arch_zone_highest_possible_pfn[i-1];
arch_zone_highest_possible_pfn[i] =
max(max_zone_pfn[i], arch_zone_lowest_possible_pfn[i]);
}
Besides fixing the highest page frame number of each zone, this loop also fixes the lowest one: each zone starts where the previous one ends, so the loop effectively pins down the upper and lower boundary of every zone, as the short simulation below shows.
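The following standalone snippet replays that boundary loop with the assumed pfn limits from the earlier zone_sizes_init() example (non-NUMA 32-bit layout: DMA / Normal / HighMem, ZONE_MOVABLE omitted), just to show how the ranges chain together:

/*
 * Minimal simulation of the boundary loop above, using assumed
 * max_zone_pfn values; not the kernel's code.
 */
#include <stdio.h>

enum { ZONE_DMA, ZONE_NORMAL, ZONE_HIGHMEM, NR_ZONES };

int main(void)
{
    const char   *name[NR_ZONES]         = { "DMA", "Normal", "HighMem" };
    unsigned long max_zone_pfn[NR_ZONES] = { 0x1000, 0x373fe, 0x80000 };
    unsigned long lo[NR_ZONES], hi[NR_ZONES];
    int i;

    lo[0] = 1;                 /* assumed find_min_pfn_with_active_regions() result */
    hi[0] = max_zone_pfn[0];
    for (i = 1; i < NR_ZONES; i++) {
        lo[i] = hi[i - 1];     /* each zone starts where the previous one ended */
        hi[i] = max_zone_pfn[i] > lo[i] ? max_zone_pfn[i] : lo[i];
    }

    for (i = 0; i < NR_ZONES; i++)
        printf("%-8s [pfn 0x%06lx - 0x%06lx)\n", name[i], lo[i], hi[i]);
    return 0;
}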
There is also a global array, zone_movable_pfn, which records the page frame number at which the Movable zone of each node begins; the function that computes it is find_zone_movable_pfns_for_nodes(). Its implementation:
【file:/mm/page_alloc.c】
/*
* Find the PFN the Movable zone begins in each node. Kernel memory
* is spread evenly between nodes as long as the nodes have enough
* memory. When they don't, some nodes will have more kernelcore than
* others
*/
static void __init find_zone_movable_pfns_for_nodes(void)
{
int i, nid;
unsigned long usable_startpfn;
unsigned long kernelcore_node, kernelcore_remaining;
/* save the state before borrow the nodemask */
nodemask_t saved_node_state = node_states[N_MEMORY];
unsigned long totalpages = early_calculate_totalpages();
int usable_nodes = nodes_weight(node_states[N_MEMORY]);
struct memblock_type *type = &memblock.memory;
/* Need to find movable_zone earlier when movable_node is specified. */
find_usable_zone_for_movable();
/*
* If movable_node is specified, ignore kernelcore and movablecore
* options.
*/
if (movable_node_is_enabled()) {
for (i = 0; i < type->cnt; i++) {
if (!memblock_is_hotpluggable(&type->regions[i]))
continue;
nid = type->regions[i].nid;
usable_startpfn = PFN_DOWN(type->regions[i].base);
zone_movable_pfn[nid] = zone_movable_pfn[nid] ?
min(usable_startpfn, zone_movable_pfn[nid]) :
usable_startpfn;
}
goto out2;
}
/*
* If movablecore=nn[KMG] was specified, calculate what size of
* kernelcore that corresponds so that memory usable for
* any allocation type is evenly spread. If both kernelcore
* and movablecore are specified, then the value of kernelcore
* will be used for required_kernelcore if it's greater than
* what movablecore would have allowed.
*/
if (required_movablecore) {
unsigned long corepages;
/*
* Round-up so that ZONE_MOVABLE is at least as large as what
* was requested by the user
*/
required_movablecore =
roundup(required_movablecore, MAX_ORDER_NR_PAGES);
corepages = totalpages - required_movablecore;
required_kernelcore = max(required_kernelcore, corepages);
}
/* If kernelcore was not specified, there is no ZONE_MOVABLE */
if (!required_kernelcore)
goto out;
/* usable_startpfn is the lowest possible pfn ZONE_MOVABLE can be at */
usable_startpfn = arch_zone_lowest_possible_pfn[movable_zone];
restart:
/* Spread kernelcore memory as evenly as possible throughout nodes */
kernelcore_node = required_kernelcore / usable_nodes;
for_each_node_state(nid, N_MEMORY) {
unsigned long start_pfn, end_pfn;
/*
* Recalculate kernelcore_node if the division per node
* now exceeds what is necessary to satisfy the requested
* amount of memory for the kernel
*/
if (required_kernelcore < kernelcore_node)
kernelcore_node = required_kernelcore / usable_nodes;
/*
* As the map is walked, we track how much memory is usable
* by the kernel using kernelcore_remaining. When it is
* 0, the rest of the node is usable by ZONE_MOVABLE
*/
kernelcore_remaining = kernelcore_node;
/* Go through each range of PFNs within this node */
for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn, NULL) {
unsigned long size_pages;
start_pfn = max(start_pfn, zone_movable_pfn[nid]);
if (start_pfn >= end_pfn)
continue;
/* Account for what is only usable for kernelcore */
if (start_pfn < usable_startpfn) {
unsigned long kernel_pages;
kernel_pages = min(end_pfn, usable_startpfn)
- start_pfn;
kernelcore_remaining -= min(kernel_pages,
kernelcore_remaining);
required_kernelcore -= min(kernel_pages,
required_kernelcore);
/* Continue if range is now fully accounted */
if (end_pfn <= usable_startpfn) {
/*
* Push zone_movable_pfn to the end so
* that if we have to rebalance
* kernelcore across nodes, we will
* not double account here
*/
zone_movable_pfn[nid] = end_pfn;
continue;
}
start_pfn = usable_startpfn;
}
/*
* The usable PFN range for ZONE_MOVABLE is from
* start_pfn->end_pfn. Calculate size_pages as the
* number of pages used as kernelcore
*/
size_pages = end_pfn - start_pfn;
if (size_pages > kernelcore_remaining)
size_pages = kernelcore_remaining;
zone_movable_pfn[nid] = start_pfn + size_pages;
/*
* Some kernelcore has been met, update counts and
* break if the kernelcore for this node has been
* satisfied
*/
required_kernelcore -= min(required_kernelcore,
size_pages);
kernelcore_remaining -= size_pages;
if (!kernelcore_remaining)
break;
}
}
/*
* If there is still required_kernelcore, we do another pass with one
* less node in the count. This will push zone_movable_pfn[nid] further
* along on the nodes that still have memory until kernelcore is
* satisfied
*/
usable_nodes--;
if (usable_nodes && required_kernelcore > usable_nodes)
goto restart;
out2:
/* Align start of ZONE_MOVABLE on all nids to MAX_ORDER_NR_PAGES */
for (nid = 0; nid < MAX_NUMNODES; nid++)
zone_movable_pfn[nid] =
roundup(zone_movable_pfn[nid], MAX_ORDER_NR_PAGES);
out:
/* restore the node_state */
node_states[N_MEMORY] = saved_node_state;
}
In this function, early_calculate_totalpages() counts the total number of pages in the system, and nodes_weight() returns the number of nodes currently present; its argument node_states[N_MEMORY] is defined in page_alloc.c:
nodemask_t node_states[NR_NODE_STATES] __read_mostly = {
[N_POSSIBLE] = NODE_MASK_ALL,
[N_ONLINE] = { { [0] = 1UL } },
#ifndef CONFIG_NUMA
[N_NORMAL_MEMORY] = { { [0] = 1UL } },
#ifdef CONFIG_HIGHMEM
[N_HIGH_MEMORY] = { { [0] = 1UL } },
#endif
#ifdef CONFIG_MOVABLE_NODE
[N_MEMORY] = { { [0] = 1UL } },
#endif
[N_CPU] = { { [0] = 1UL } },
#endif /* NUMA */
};
EXPORT_SYMBOL(node_states);
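nodes_weight() boils down to counting the bits set in the node mask; with the non-NUMA initialiser above only node 0 is marked as having memory, so the count is 1. A trivial userspace stand-in (not the kernel's bitmap implementation):

/*
 * Conceptual stand-in for nodes_weight(): count the bits set in a node
 * mask. On a non-NUMA build only bit 0 (node 0) is set, so the result
 * here is 1.
 */
#include <stdio.h>

int main(void)
{
    unsigned long mask = 1UL << 0;   /* node_states[N_MEMORY]: only node 0 has memory */
    int weight = 0;

    while (mask) {
        weight += mask & 1;
        mask >>= 1;
    }
    printf("usable_nodes = %d\n", weight);
    return 0;
}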
Moving on to find_usable_zone_for_movable():
【file:/mm/page_alloc.c】
/*
* This finds a zone that can be used for ZONE_MOVABLE pages. The
* assumption is made that zones within a node are ordered in monotonic
* increasing memory addresses so that the "highest" populated zone is used
*/
static void __init find_usable_zone_for_movable(void)
{
int zone_index;
for (zone_index = MAX_NR_ZONES - 1; zone_index >= 0; zone_index--) {
if (zone_index == ZONE_MOVABLE)
continue;
if (arch_zone_highest_possible_pfn[zone_index] >
arch_zone_lowest_possible_pfn[zone_index])
break;
}
VM_BUG_ON(zone_index == -1);
movable_zone = zone_index;
}
It looks for a zone that can supply pages to ZONE_MOVABLE: a zone below ZONE_MOVABLE whose page count is non-zero. Normally the highest populated zone is chosen (ZONE_HIGHMEM on a typical 32-bit machine with highmem), and its index is recorded in the global variable movable_zone.
Next comes the if branch:
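A tiny re-enactment of that top-down scan, reusing the assumed zone boundaries from the earlier example, confirms the result:

/*
 * Toy version of the scan: walk zones from the top down and stop at the
 * first one whose pfn range is non-empty (ZONE_MOVABLE omitted). The
 * boundary values reuse the assumed 2 GB example from earlier.
 */
#include <stdio.h>

enum { ZONE_DMA, ZONE_NORMAL, ZONE_HIGHMEM, NR_ZONES };

int main(void)
{
    const char   *name[NR_ZONES] = { "DMA", "NORMAL", "HIGHMEM" };
    unsigned long lo[NR_ZONES]   = { 0x1,    0x1000,  0x373fe };
    unsigned long hi[NR_ZONES]   = { 0x1000, 0x373fe, 0x80000 };
    int zone_index;

    for (zone_index = NR_ZONES - 1; zone_index >= 0; zone_index--)
        if (hi[zone_index] > lo[zone_index])
            break;

    printf("movable_zone = ZONE_%s (index %d)\n", name[zone_index], zone_index);
    return 0;
}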
if (movable_node_is_enabled()) {
for (i = 0; i < type->cnt; i++) {
if (!memblock_is_hotpluggable(&type->regions[i]))
continue;
nid = type->regions[i].nid;
usable_startpfn = PFN_DOWN(type->regions[i].base);
zone_movable_pfn[nid] = zone_movable_pfn[nid] ?
min(usable_startpfn, zone_movable_pfn[nid]) :
usable_startpfn;
}
goto out2;
}
This branch handles the case where movable_node has been specified: the kernelcore and movablecore options are ignored, and for every hot-pluggable memblock region the starting page frame number usable_startpfn is taken and used to set (or lower) zone_movable_pfn for that region's node.
Further down, the next if branch:
if (!required_kernelcore)
goto out;
If kernelcore is still unset at this point, there is effectively no Movable zone.
Finally, the code under the restart label spreads the kernelcore memory evenly across the nodes. The local variable kernelcore_node holds the number of pages each node should contribute on average, and usable_startpfn is the lowest page frame number at which the Movable zone may start. The code walks the nodes flagged as usable in node_states[N_MEMORY] and, within each node, walks its memory ranges, handing each node its even share. If the requested amount cannot be satisfied in one pass, the check:
if (usable_nodes && required_kernelcore > usable_nodes)
goto restart;
redistributes the remainder evenly once more with one fewer node, always giving priority to satisfying the kernelcore setting, until the loop condition no longer holds. A toy illustration of the even-spread idea follows below.
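The snippet below is a heavily simplified walk-through of that even spread (one contiguous pfn range per node, everything assumed to lie above usable_startpfn, no restart pass); the kernelcore=512M figure and the node layout are made up purely for illustration:

/*
 * Toy walk-through of the even-spread idea behind
 * find_zone_movable_pfns_for_nodes(); assumed numbers only.
 */
#include <stdio.h>

int main(void)
{
    /* kernelcore=512M on a 4 KB-page system -> 131072 pages to reserve */
    unsigned long required_kernelcore = 131072;
    /* two nodes, each covering one pfn range: [start, end) */
    unsigned long start[2] = { 0x00000, 0x80000 };
    unsigned long end[2]   = { 0x80000, 0x100000 };
    unsigned long zone_movable_pfn[2];
    int usable_nodes = 2, nid;

    unsigned long kernelcore_node = required_kernelcore / usable_nodes;

    for (nid = 0; nid < 2; nid++) {
        /* the first kernelcore_node pages of the node stay kernel-usable; */
        /* ZONE_MOVABLE begins right after them                            */
        unsigned long size_pages = end[nid] - start[nid];
        if (size_pages > kernelcore_node)
            size_pages = kernelcore_node;
        zone_movable_pfn[nid] = start[nid] + size_pages;
        printf("node %d: ZONE_MOVABLE starts at pfn 0x%lx\n",
               nid, zone_movable_pfn[nid]);
    }
    return 0;
}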
The code under the out2 label aligns each node's Movable-zone start up to MAX_ORDER_NR_PAGES (for example, with the default MAX_ORDER of 11 that is 1024 pages, i.e. 4 MB, so a start pfn of 0x900f0 would be rounded up to 0x90400).
The final out label merely restores node_states[].
Although find_zone_movable_pfns_for_nodes() has been analysed at some length, on my test environment both required_movablecore and required_kernelcore are 0, so the analysis stops here.
Now back to free_area_init_nodes(). Following find_zone_movable_pfns_for_nodes() is a stretch of log printing that reports the zone range information; on a running system this output can be inspected with the dmesg command.
Further down, mminit_verify_pageflags_layout() is used only for memory-initialisation debugging; since CONFIG_DEBUG_MEMORY_INIT is not enabled here, it is an empty function. setup_nr_node_ids() sets the total number of memory nodes; it too compiles to nothing when the maximum node count MAX_NUMNODES does not exceed 1.
At the end of free_area_init_nodes() there is still a loop that initialises each online node; that is left for later analysis.