Linux Kernel vs. Memory Fragmentation (Part I)

Author: Wenbo Zhang (Linux Kernel Engineer of the EE team at PingCAP)

Transcreator: Coco Yi; Editor: Tom Dewan

(External) memory fragmentation is a long-standing Linux kernel programming issue. As the system runs, it assigns various tasks to memory pages. Over time, memory gets fragmented, and eventually, a busy system that is up for a long time may have only a few contiguous physical pages.

Because the Linux kernel supports virtual memory management, physical memory fragmentation is often not an issue. With page tables, unless large pages are used, physically scattered memory is still contiguous in the virtual address space.

However, it becomes very difficult to allocate contiguous physical memory from the kernel linear mapping area. For example, it is challenging to allocate structure objects through the block allocator—a common and frequent operation in the kernel mode—or operate on a Direct Memory Access (DMA) buffer that does not support the scatter and gather modes. Such operations might cause frequent direct memory reclamation or compaction, resulting in large fluctuations in system performance, or allocation failure. In slow memory allocation paths, different operations are performed according to the page allocation flag.

If the kernel programming no longer relies on the high-order physical memory allocation in the linear address space, the memory fragmentation issue will be solved. However, for a huge project like the Linux kernel, it isn’t practical to make such changes.

Since Linux 2.x, the open source community has tried several methods to alleviate the memory fragmentation issue, including many effective, but unusual patches. Some merged patches have been controversial, such as the memory compaction mechanism. At the LSFMM 2014 conference, many developers complained that memory compaction was not very efficient and that bugs were not easy to reproduce. But the community did not abandon the feature and continued to optimize it in subsequent versions.

Mel Gorman is the most persistent contributor in this field. He has submitted two sets of important patches. The first set was merged in Linux 2.6.24 and iterated over 28 versions before the community accepted it. The second set was merged in Linux 5.0 and successfully reduced memory fragmentation events by 94% on one- or two-socket machines.

In this post, I’ll introduce some common extensions to the buddy allocator that helps prevent memory fragmentation in the Linux 3.10 kernel, the principle of memory compaction, how to view the fragmentation index, and how to quantify the latency overheads caused by memory compaction.

A brief history of defragmentation

Before I start, I want to recommend some good reads. The following articles show you all the efforts of improving high-level memory allocation during Linux kernel development.

Publish date	Articles on LWN.net
2004-09-08	Kswapd and high-order allocations
2004-05-10	Active memory defragmentation
2005-02-01	Yet another approach to memory fragmentation
2005-11-02	Fragmentation avoidance
2005-11-08	More on fragmentation avoidance
2006-11-28	Avoiding – and fixing – memory fragmentation
2010-01-06	Memory compaction
2014-03-26	Memory compaction issues
2015-07-14	Making kernel pages movable
2016-04-23	CMA and compaction
2016-05-10	Make direct compaction more deterministic
2017-03-21	Proactive compaction
2018-10-31	Fragmentation avoidance improvements
2020-04-21	Proactive compaction for the kernel

Now, let’s get started.

Linux buddy memory allocator

Linux uses the buddy algorithm as a page allocator, which is simple and efficient. Linux has made some extensions to the classic algorithm:

Partitions’ buddy allocator
Per-CPU pageset
Group by migration types

The Linux kernel uses node, zone, and page to describe physical memory. The partitions’ buddy allocator focuses on a certain zone on a certain node.

Before the 4.8 version, the Linux kernel implemented the page recycling strategy based on zone, because the early design was mainly for 32-bit processors, and there was a lot of high memory. However, the page aging speed of different zones on the same node was inconsistent, which caused many problems.

Over a long period, the community has added a lot of tricky patches, but the problem has remained. With more 64-bit processors and large memory models being used in recent years, Mel Groman migrated the page recycling strategy from zone to node and solved this problem. If you use Berkeley Packet Filter (BPF) authoring tools to observe recycling operations, you need to know this.

The per-CPU pageset optimizes single page allocation, which reduces lock contention between processors. It has nothing to do with defragmentation.

Grouping by migration types is the defragmentation method I’ll introduce in detail.

Group by migration types

First, you need to understand the memory address space layout. Each processor architecture has a definition. For example, the definition of x86_64 is in mm.txt.

Because the virtual address and physical address are not linearly mapped, accessing the virtual address space through the page table (such as the heap memory requirement of the user space) does not require contiguous physical memory. Take the Intel 5-level page table in the following figure as an example. The virtual address is divided from low to high:

Low: Page offset
Level 1: Direct page table index
Level 2: Page middle directory index
Level 3: Page upper directory index
Level 4: Page 4-level directory index
Level 5: Page global index

Intel 5-level paging – Wikipedia

The page frame number of the physical memory is stored in the direct page table entry, and you can find it through the direct page table index. The physical address is the combination of the found page frame number and the page offset.

Suppose you want to change the corresponding physical page in a direct page table entry. You only need to:

Allocate a new page.
Copy the data of the old page to the new one.
Modify the value of the direct page table entry to the new page frame number.

These operations do not change the original virtual address, and you can migrate such pages at will.

For the linear mapping area, the virtual address equals the physical address plus the constant. Modifying the physical address changes the virtual address, and accessing the original virtual address causes a bug. Therefore, it is not recommended to migrate these pages.

When the physical pages accessed through the page table and the pages accessed through linear mapping are mixed and managed together, memory fragmentation is prone to occur. Therefore, the kernel defines several migration types based on the mobility of the pages and groups the pages by the migration types for defragmentation.

Among the defined migration types, the three most frequently used are: MIGRATE_UNMOVABLE, MIGRATE_MOVABLE, and MIGRATE_RECLAIMABLE. Other migration types have special purposes, which I won’t describe here.

You can view the distribution of each migration type at each stage through /proc/pagetypeinfo:

When applying for a page, the page allocation flag you use determines the specific migration type from which the page is allocated. For example, you can use __GFP_MOVABLE for user space memory, and __GFP_RECLAIMABLE for file pages.

When pages of a certain migration type are used up, the kernel steals physical pages from other migration types. To avoid fragmentation, the page stealing starts from the largest page block. The page block size is determined by pageblock_order.

The standby priorities of the above three migration types from high to low are:

MIGRATE_UNMOVABLE: MIGRATE_RECLAIMABLE, MIGRATE_MOVABLE MIGRATE_RECALIMABlE: MIGRATE_UNMOVABLE, MIGRATE_MOVABLE MIGRATE_MOVABLE: MIGRATE_RECLAIMABLE, MIGRATE_UNMOVABLE

The kernel introduces grouping by migration types for defragmentation. But frequent page stealing indicates that there are external memory fragmentation events, and they might cause trouble in the future.

Analyze external memory fragmentation events

My previous article Why We Disable Linux’s THP Feature for Databases mentioned that you can use ftrace events provided by the kernel to analyze external memory fragmentation events. The procedure is as follows:

Enable the ftrace events:

echo 1> /sys/kernel/debug/tracing/events/kmem/mm_page_alloc_extfrag/enable

Start collecting the ftrace events:

cat /sys/kernel/debug/tracing/trace_pipe> ~/extfrag.log

Tap Ctrl-C to stop collecting. A event contains many fields:

To analyze the number of external memory fragmentation events, focus on the events with fallback_order <; pageblock order. In the x86_64 environment, pageblock order is 9.

Clean up the events:

echo 0> /sys/kernel/debug/tracing/events/kmem/mm_page_alloc_extfrag/enable

You can see that grouping by migration types only delays memory fragmentation, but does not fundamentally solve it.

As the memory fragmentation increases and it does not have enough contiguous physical memory, performance degrades. So, it’s not enough to apply this feature alone.

In my next article, I’ll introduce more methods that the kernel uses to regulate memory fragmentation.

To be continued…

Try TiDB Cloud Free Now!

Get the power of a cloud-native, distributed SQL database built for real-time
analytics in a fully managed service.

Sign Me Up

Book a Demo

Linux

Engineering

Have questions? Let us know how we can help.

TiDB Dedicated

A fully-managed cloud DBaaS for predictable workloads

TiDB Serverless

A fully-managed cloud DBaaS for auto-scaling workloads

Try for Free Learn More