alloc_page分配記憶體空間--Linux記憶體管理(十七)

1 前景回顧在內核初始化完成之後, 記憶體管理的責任就由伙伴系統來承擔. 伙伴系統基於一種相對簡單然而令人吃驚的強大演算法. Linux內核使用二進位伙伴演算法來管理和分配物理記憶體頁面, 該演算法由Knowlton設計, 後來Knuth又進行了更深刻的描述. 伙伴系統是一個結合了2的方冪個分配器和空閑緩衝 ...

1 前景回顧

在內核初始化完成之後, 記憶體管理的責任就由伙伴系統來承擔. 伙伴系統基於一種相對簡單然而令人吃驚的強大演算法.

Linux內核使用二進位伙伴演算法來管理和分配物理記憶體頁面, 該演算法由Knowlton設計, 後來Knuth又進行了更深刻的描述.

伙伴系統是一個結合了2的方冪個分配器和空閑緩衝區合併計技術的記憶體分配方案, 其基本思想很簡單. 記憶體被分成含有很多頁面的大塊, 每一塊都是2個頁面大小的方冪. 如果找不到想要的塊, 一個大塊會被分成兩部分, 這兩部分彼此就成為伙伴. 其中一半被用來分配, 而另一半則空閑. 這些塊在以後分配的過程中會繼續被二分直至產生一個所需大小的塊. 當一個塊被最終釋放時, 其伙伴將被檢測出來, 如果伙伴也空閑則合併兩者.

內核如何記住哪些記憶體塊是空閑的
分配空閑頁面的方法
影響分配器行為的眾多標識位
記憶體碎片的問題和分配器如何處理碎片

2 記憶體分配API

2.1 記憶體分配器API

就伙伴系統的介面而言, NUMA或UMA體繫結構是沒有差別的, 二者的調用語法都是相同的.

所有函數的一個共同點是 : 只能分配2的整數冪個頁.

因此，介面中不像C標準庫的malloc函數或bootmem和memblock分配器那樣指定了所需記憶體大小作為參數. 相反, 必須指定的是分配階, 伙伴系統將在記憶體中分配2^0 rder頁內核中細粒度的分配只能藉助於slab分配器(或者slub、slob分配器), 後者基於伙伴系統

記憶體分配函數	功能	定義
alloc_pages(mask, order)	分配2^0 rder 頁並返回一個struct page的實例，表示分配的記憶體塊的起始頁	NUMA-include/linux/gfp.h, line 466 UMA-include/linux/gfp.h?v=4.7, line 476
alloc_page(mask)	是前者在order = 0情況下的簡化形式，只分配一頁	include/linux/gfp.h?v=4.7, line 483
get_zeroed_page(mask)	分配一頁並返回一個page實例，頁對應的記憶體填充0（所有其他函數，分配之後頁的內容是未定義的）	mm/page_alloc.c?v=4.7, line 3900
__get_free_pages(mask, order) __get_free_page(mask)	工作方式與上述函數相同，但返回分配記憶體塊的虛擬地址，而不是page實例
get_dma_pages(gfp_mask, order)	用來獲得適用於DMA的頁.	include/linux/gfp.h?v=4.7, line 503

在空閑記憶體無法滿足請求以至於分配失敗的情況下，所有上述函數都返回空指針(比如alloc_pages和alloc_page)或者0(比如get_zeroed_page、__get_free_pages和__get_free_page).

因此內核在各次分配之後都必須檢查返回的結果. 這種慣例與設計得很好的用戶層應用程式沒什麼不同, 但在內核中忽略檢查會導致嚴重得多的故障

內核除了伙伴系統函數之外, 還提供了其他記憶體管理函數. 它們以伙伴系統為基礎, 但並不屬於伙伴分配器自身. 這些函數包括vmalloc和vmalloc_32, 使用頁表將不連續的記憶體映射到內核地址空間中, 使之看上去是連續的.

還有一組kmalloc類型的函數, 用於分配小於一整頁的記憶體區. 其實現將在以後分別討論。

2.2 記憶體分配API統一到alloc_pages介面

通過使用標誌、記憶體域修飾符和各個分配函數，內核提供了一種非常靈活的記憶體分配體系.儘管如此, 所有介面函數都可以追溯到一個簡單的基本函數(alloc_pages_node)

分配單頁的函數alloc_page和__get_free_page, 還有__get_dma_pages是藉助於巨集定義的.

//  http://lxr.free-electrons.com/source/include/linux/gfp.h?v=4.7#L483
#define alloc_page(gfp_mask) alloc_pages(gfp_mask, 0)

//  http://lxr.free-electrons.com/source/include/linux/gfp.h?v=4.7#L500
#define __get_free_page(gfp_mask) \
    __get_free_pages((gfp_mask), 0)`

//  http://lxr.free-electrons.com/source/include/linux/gfp.h?v=4.7#L503
#define __get_dma_pages(gfp_mask, order) \
    __get_free_pages((gfp_mask) | GFP_DMA, (order))

get_zeroed_page的實現也沒什麼困難, 對__get_free_pages使用__GFP_ZERO標誌，即可分配填充位元組0的頁. 再返回與頁關聯的記憶體區地址即可.

//  http://lxr.free-electrons.com/source/mm/page_alloc.c?v=4.7#L3900
unsigned long get_zeroed_page(gfp_t gfp_mask)
{
        return __get_free_pages(gfp_mask | __GFP_ZERO, 0);
}
EXPORT_SYMBOL(get_zeroed_page);

__get_free_pages調用alloc_pages完成記憶體分配, 而alloc_pages又藉助於alloc_pages_node

__get_free_pages函數的定義在mm/page_alloc.c?v=4.7, line 3883

//  http://lxr.free-electrons.com/source/mm/page_alloc.c?v=4.7#L3883
unsigned long __get_free_pages(gfp_t gfp_mask, unsigned int order)
{
    struct page *page;

    /*
     * __get_free_pages() returns a 32-bit address, which cannot represent
     * a highmem page
     */
    VM_BUG_ON((gfp_mask & __GFP_HIGHMEM) != 0);

    page = alloc_pages(gfp_mask, order);
    if (!page)
        return 0;
    return (unsigned long) page_address(page);
}
EXPORT_SYMBOL(__get_free_pages);

在這種情況下，使用了一個普通函數而不是巨集，因為alloc_pages返回的page實例需要使用輔助

函數page_address轉換為記憶體地址. 在這裡，只要知道該函數可根據page實例計算相關頁的線性記憶體地址即可. 對高端記憶體頁這是有問題的

這樣, 就完成了所有分配記憶體的API函數到公共的基礎函數alloc_pages的統一

另外所有體繫結構都必須實現的標準函數clear_page, 可幫助alloc_pages對頁填充位元組0, 實現如下表所示

x86	arm
arch/x86/include/asm/page_32.h?v=4.7, line 24	arch/arm/include/asm/page.h?v=4.7#L14 arch/arm/include/asm/page-nommu.h

2.2 alloc_pages函數分配頁

既然所有的記憶體分配API函數都可以追溯掉alloc_page函數, 從某種意義上說，該函數是伙伴系統主要實現的”發射台”.

alloc_pages函數的定義是依賴於NUMA或者UMA架構的, 定義如下

#ifdef CONFIG_NUMA

//  http://lxr.free-electrons.com/source/include/linux/gfp.h?v=4.7#L465
static inline struct page *
alloc_pages(gfp_t gfp_mask, unsigned int order)
{
        return alloc_pages_current(gfp_mask, order);
}

#else

//  http://lxr.free-electrons.com/source/include/linux/gfp.h?v=4.7#L476
#define alloc_pages(gfp_mask, order) \
                alloc_pages_node(numa_node_id(), gfp_mask, order)
#endif

UMA結構下的alloc_pages是通過alloc_pages_node函數實現的, 下麵我們看看alloc_pages_node函數的定義, 在include/linux/gfp.h?v=4.7, line 448

//  http://lxr.free-electrons.com/source/include/linux/gfp.h?v=4.7#L448
/*
 * Allocate pages, preferring the node given as nid. When nid == NUMA_NO_NODE,
 * prefer the current CPU's closest node. Otherwise node must be valid and
 * online.
 */
static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
                        unsigned int order)
{
    if (nid == NUMA_NO_NODE)
        nid = numa_mem_id();

    return __alloc_pages_node(nid, gfp_mask, order);
}

它只是執行了一個簡單的檢查, 如果指定負的結點ID(不存在, 即NUMA_NO_NODE = -1), 內核自動地使用當前執行CPU對應的結點nid = numa_mem_id();, 然後調用__alloc_pages_node函數進行了記憶體分配

__alloc_pages_node函數定義在include/linux/gfp.h?v=4.7, line 435), 如下所示

// http://lxr.free-electrons.com/source/include/linux/gfp.h?v=4.7#L435
/*
 * Allocate pages, preferring the node given as nid. The node must be valid and
 * online. For more general interface, see alloc_pages_node().
 */
static inline struct page *
__alloc_pages_node(int nid, gfp_t gfp_mask, unsigned int order)
{
    VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
    VM_WARN_ON(!node_online(nid));

    return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
}

內核假定傳遞給改alloc_pages_node函數的結點nid是被激活, 即online的.但是為了安全它還是檢查並警告記憶體結點不存在的情況. 接下來的工作委托給__alloc_pages, 只需傳遞一組適當的參數, 其中包括節點nid的備用記憶體域列表zonelist.

現在__alloc_pages函數沒什麼特別的, 它直接將自己的所有信息傳遞給__alloc_pages_nodemask來完成記憶體的分配

//  http://lxr.free-electrons.com/source/include/linux/gfp.h?v=4.7#L428
static inline struct page *
__alloc_pages(gfp_t gfp_mask, unsigned int order,
        struct zonelist *zonelist)
{
    return __alloc_pages_nodemask(gfp_mask, order, zonelist, NULL);
}

2.3 伙伴系統的心臟__alloc_pages_nodemask
內核源代碼將__alloc_pages_nodemask稱之為”伙伴系統的心臟”(`the ‘heart’ of the zoned buddy allocator“), 因為它處理的是實質性的記憶體分配.

由於”心臟”的重要性, 我將在下文詳細介紹該函數.

__alloc_pages_nodemask函數定義在include/linux/gfp.h?v=4.7#L428

3 選擇頁

我們先把註意力轉向頁面選擇是如何工作的。

3.1 記憶體水印標誌

還記得之前講過的記憶體水印麽

enum zone_watermarks {
        WMARK_MIN,
        WMARK_LOW,
        WMARK_HIGH,
        NR_WMARK
};

#define min_wmark_pages(z) (z->watermark[WMARK_MIN])
#define low_wmark_pages(z) (z->watermark[WMARK_LOW])
#define high_wmark_pages(z) (z->watermark[WMARK_HIGH])

內核需要定義一些函數使用的標誌，用於控制到達各個水印指定的臨界狀態時的行為, 這些標誌用巨集來定義, 定義在mm/internal.h?v=4.7, line 453

3.2 zone_watermark_ok函數檢查標誌

設置的標誌在zone_watermark_ok函數中檢查, 該函數根據設置的標誌判斷是否能從給定的記憶體域分配記憶體. 該函數定義在mm/page_alloc.c?v=4.7, line 2820

//  http://lxr.free-electrons.com/source/mm/page_alloc.c?v=4.7#L2820
bool zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
              int classzone_idx, unsigned int alloc_flags)
{
    return __zone_watermark_ok(z, order, mark, classzone_idx, alloc_flags,
                    zone_page_state(z, NR_FREE_PAGES));
}

而__zone_watermark_ok函數則完成了檢查的工作, 該函數定義在mm/page_alloc.c?v=4.7, line 2752

// http://lxr.free-electrons.com/source/mm/page_alloc.c?v=4.7#L2752
/*
 * Return true if free base pages are above 'mark'. For high-order checks it
 * will return true of the order-0 watermark is reached and there is at least
 * one free page of a suitable size. Checking now avoids taking the zone lock
 * to check in the allocation paths if no pages are free.
 */
bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
             int classzone_idx, unsigned int alloc_flags,
             long free_pages)
{
    long min = mark;
    int o;
    const bool alloc_harder = (alloc_flags & ALLOC_HARDER);

    /* free_pages may go negative - that's OK
     * free_pages可能變為負值, 沒有關係 */
    free_pages -= (1 << order) - 1;

    if (alloc_flags & ALLOC_HIGH)
        min -= min / 2;

    /*
     * If the caller does not have rights to ALLOC_HARDER then subtract
     * the high-atomic reserves. This will over-estimate the size of the
     * atomic reserve but it avoids a search.
     */
    if (likely(!alloc_harder))
        free_pages -= z->nr_reserved_highatomic;
    else
        min -= min / 4;

#ifdef CONFIG_CMA
    /* If allocation can't use CMA areas don't use free CMA pages */
    if (!(alloc_flags & ALLOC_CMA))
        free_pages -= zone_page_state(z, NR_FREE_CMA_PAGES);
#endif

    /*
     * Check watermarks for an order-0 allocation request. If these
     * are not met, then a high-order request also cannot go ahead
     * even if a suitable page happened to be free.
     */
    if (free_pages <= min + z->lowmem_reserve[classzone_idx])
        return false;

    /* If this is an order-0 request then the watermark is fine */
    if (!order)
        return true;

    /* For a high-order request, check at least one suitable page is free 
     * 在下一階，當前階的頁是不可用的  */
    for (o = order; o < MAX_ORDER; o++) {
        struct free_area *area = &z->free_area[o];
        int mt;

        if (!area->nr_free)
            continue;

        if (alloc_harder)
            return true;

        /* 所需高階空閑頁的數目相對較少 */
        for (mt = 0; mt < MIGRATE_PCPTYPES; mt++) {
            if (!list_empty(&area->free_list[mt]))
                return true;
        }

#ifdef CONFIG_CMA
        if ((alloc_flags & ALLOC_CMA) &&
            !list_empty(&area->free_list[MIGRATE_CMA])) {
            return true;
        }
#endif
    }
    return false;
}

我們知道zone_per_state用來訪問每個記憶體域的統計量. 在上述代碼中, 得到的是空閑頁的數目.

free_pages -= zone_page_state(z, NR_FREE_CMA_PAGES);

在解釋了ALLOC_HIGH和ALLOC_HARDER標誌之後(將最小值標記降低到當前值的一半或四分之一，使得分配過程努力或更加努力),

if (alloc_flags & ALLOC_HIGH)
    min -= min / 2;

if (likely(!alloc_harder))
    free_pages -= z->nr_reserved_highatomic;
else
    min -= min / 4;

該函數會檢查空閑頁的數目free_pages是否小於最小值與lowmem_reserve中指定的緊急分配值min之和.

/* For a high-order request, check at least one suitable page is free */
for (o = order; o < MAX_ORDER; o++) {
    struct free_area *area = &z->free_area[o];
    int mt;

    if (!area->nr_free)
        continue;

    if (alloc_harder)
        return true;

    for (mt = 0; mt < MIGRATE_PCPTYPES; mt++) {
        if (!list_empty(&area->free_list[mt]))
            return true;
    }

#ifdef CONFIG_CMA
    if ((alloc_flags & ALLOC_CMA) &&
        !list_empty(&area->free_list[MIGRATE_CMA])) {
        return true;
    }
#endif
}

如果內核遍歷所有的低端記憶體域之後，發現記憶體不足, 則不進行記憶體分配.

3.3 get_page_from_freelist函數

get_page_from_freelist是伙伴系統使用的另一個重要的輔助函數. 它通過標誌集和分配階來判斷是否能進行分配。如果可以，則發起實際的分配操作. 該函數定義在mm/page_alloc.c?v=4.7, line 2905

這個函數的參數很有意思, 之前的時候這個函數的參數只能用複雜來形容

static struct page *  
get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order,  
        struct zonelist *zonelist, int high_zoneidx, int alloc_flags,  
        struct zone *preferred_zone, int migratetype)

但是這仍然不夠, 隨著內核的不段改進, 所支持的特性也越多, 分配記憶體時需要參照的標識也越來越多, 那難道看著這個函數的參數不斷膨脹麽, 這個不是內核黑客們所能容忍的, 於是大家想出了一個解決方案, 把那些相關聯的參數封裝成一個結構

static struct page *
get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags, const struct alloc_context *ac)

這個封裝好的結構就是struct alloc_context, 定義在mm/internal.h?v=4.7, line 103

/*
 * Structure for holding the mostly immutable allocation parameters passed
 * between functions involved in allocations, including the alloc_pages*
 * family of functions.
 *
 * nodemask, migratetype and high_zoneidx are initialized only once in
 * __alloc_pages_nodemask() and then never change.
 *
 * zonelist, preferred_zone and classzone_idx are set first in
 * __alloc_pages_nodemask() for the fast path, and might be later changed
 * in __alloc_pages_slowpath(). All other functions pass the whole strucure
 * by a const pointer.
 */
struct alloc_context {
        struct zonelist *zonelist;
        nodemask_t *nodemask;
        struct zoneref *preferred_zoneref;
        int migratetype;
        enum zone_type high_zoneidx;
        bool spread_dirty_pages;
};

欄位	描述
zonelist	當perferred_zone上沒有合適的頁可以分配時，就要按zonelist中的順序掃描該zonelist中備用zone列表，一個個的試用
nodemask	表示節點的mask，就是是否能在該節點上分配記憶體，這是個bit位數組
preferred_zone	表示從high_zoneidx後找到的合適的zone，一般會從該zone分配；分配失敗的話，就會在zonelist再找一個preferred_zone = 合適的zone
migratetype	遷移類型，在zone->free_area.free_list[XXX] 作為分配下標使用，這個是用來反碎片化的，修改了以前的free_area結構體，在該結構體中再添加了一個數組，該數組以遷移類型為下標，每個數組元素都掛了對應遷移類型的頁鏈表
high_zoneidx	是表示該分配時，所能分配的最高zone，一般從high–>normal–>dma 記憶體越來越昂貴，所以一般從high到dma分配依次分配
spread_dirty_pages

zonelist是指向備用列表的指針. 在預期記憶體域沒有空閑空間的情況下, 該列表確定了掃描系統其他記憶體域(和結點)的順序.

隨後的for迴圈所作的基本上與直覺一致, 遍歷備用列表的所有記憶體域，用最簡單的方式查找一個適當的空閑記憶體塊

首先，解釋ALLOC_*標誌(__cpuset_zone_allowed_softwall是另一個輔助函數, 用於檢查給定記憶體域是否屬於該進程允許運行的CPU).
zone_watermark_ok接下來檢查所遍歷到的記憶體域是否有足夠的空閑頁，並試圖分配一個連續記憶體塊。如果兩個條件之一不能滿足，即或者沒有足夠的空閑頁，或者沒有連續記憶體塊可滿足分配請求，則迴圈進行到備用列表中的下一個記憶體域，作同樣的檢查. 直到找到一個合適的頁面, 在進行try_this_node進行記憶體分配
如果記憶體域適用於當前的分配請求, 那麼buffered_rmqueue試圖從中分配所需數目的頁

/*
 * get_page_from_freelist goes through the zonelist trying to allocate
 * a page.
 */
static struct page *
get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags, const struct alloc_context *ac)
{
    struct zoneref *z = ac->preferred_zoneref;
    struct zone *zone;
    bool fair_skipped = false;
    bool apply_fair = (alloc_flags & ALLOC_FAIR);

zonelist_scan:
    /*
     * Scan zonelist, looking for a zone with enough free.
     * See also __cpuset_node_allowed() comment in kernel/cpuset.c.
     */
    for_next_zone_zonelist_nodemask(zone, z, ac->zonelist, ac->high_zoneidx,
                                ac->nodemask) {
        struct page *page;
        unsigned long mark;

        if (cpusets_enabled() &&
            (alloc_flags & ALLOC_CPUSET) &&
            !__cpuset_zone_allowed(zone, gfp_mask))
                continue;
        /*
         * Distribute pages in proportion to the individual
         * zone size to ensure fair page aging.  The zone a
         * page was allocated in should have no effect on the
         * time the page has in memory before being reclaimed.
         */
        if (apply_fair) {
            if (test_bit(ZONE_FAIR_DEPLETED, &zone->flags)) {
                fair_skipped = true;
                continue;
            }
            if (!zone_local(ac->preferred_zoneref->zone, zone)) {
                if (fair_skipped)
                    goto reset_fair;
                apply_fair = false;
            }
        }
        /*
         * When allocating a page cache page for writing, we
         * want to get it from a zone that is within its dirty
         * limit, such that no single zone holds more than its
         * proportional share of globally allowed dirty pages.
         * The dirty limits take into account the zone's
         * lowmem reserves and high watermark so that kswapd
         * should be able to balance it without having to
         * write pages from its LRU list.
         *
         * This may look like it could increase pressure on
         * lower zones by failing allocations in higher zones
         * before they are full.  But the pages that do spill
         * over are limited as the lower zones are protected
         * by this very same mechanism.  It should not become
         * a practical burden to them.
         *
         * XXX: For now, allow allocations to potentially
         * exceed the per-zone dirty limit in the slowpath
         * (spread_dirty_pages unset) before going into reclaim,
         * which is important when on a NUMA setup the allowed
         * zones are together not big enough to reach the
         * global limit.  The proper fix for these situations
         * will require awareness of zones in the
         * dirty-throttling and the flusher threads.
         */
        if (ac->spread_dirty_pages && !zone_dirty_ok(zone))
            continue;

        mark = zone->watermark[alloc_flags & ALLOC_WMARK_MASK];
        if (!zone_watermark_fast(zone, order, mark,
                       ac_classzone_idx(ac), alloc_flags)) {
            int ret;

            /* Checked here to keep the fast path fast */
            BUILD_BUG_ON(ALLOC_NO_WATERMARKS < NR_WMARK);
            if (alloc_flags & ALLOC_NO_WATERMARKS)
                goto try_this_zone;

            if (zone_reclaim_mode == 0 ||
                !zone_allows_reclaim(ac->preferred_zoneref->zone, zone))
                continue;

            ret = zone_reclaim(zone, gfp_mask, order);
            switch (ret) {
            case ZONE_RECLAIM_NOSCAN:
                /* did not scan */
                continue;
            case ZONE_RECLAIM_FULL:
                /* scanned but unreclaimable */
                continue;
            default:
                /* did we reclaim enough */
                if (zone_watermark_ok(zone, order, mark,
                        ac_classzone_idx(ac), alloc_flags))
                    goto try_this_zone;

                continue;
            }
        }

try_this_zone:
        page = buffered_rmqueue(ac->preferred_zoneref->zone, zone, order,
                gfp_mask, alloc_flags, ac->migratetype);
        if (page) {
            prep_new_page(page, order, gfp_mask, alloc_flags);

            /*
             * If this is a high-order atomic allocation then check
             * if the pageblock should be reserved for the future
             */
            if (unlikely(order && (alloc_flags & ALLOC_HARDER)))
                reserve_highatomic_pageblock(page, zone, order);

            return page;
        }
    }

    /*
     * The first pass makes sure allocations are spread fairly within the
     * local node.  However, the local node might have free pages left
     * after the fairness batches are exhausted, and remote zones haven't
     * even been considered yet.  Try once more without fairness, and
     * include remote zones now, before entering the slowpath and waking
     * kswapd: prefer spilling to a remote zone over swapping locally.
     */
    if (fair_skipped) {
reset_fair:
        apply_fair = false;
        fair_skipped = false;
        reset_alloc_batches(ac->preferred_zoneref->zone);
        z = ac->preferred_zoneref;
        goto zonelist_scan;
    }

    return NULL;
}

4 分配控制

如前所述, __alloc_pages_nodemask是伙伴系統的心臟. 我們已經處理了所有的準備工作並描述了所有可能的標誌, 現在我們把註意力轉向相對複雜的部分 : 函數__alloc_pages_nodemask的實現, 這也是內核中比較冗長的部分之一. 特別是在可用記憶體太少或逐漸用完時, 函數就會比較複雜. 如果可用記憶體足夠，則必要的工作會很快完成，就像下述代碼

4.1 函數源代碼註釋