Background
A customer noticed an issue: his system had more than 100GB of free memory (truly free, not buffer/cache), yet at some point it started reclaiming pages from buffer/cache and swapping pages out.
At the moment the reclaim started, the system log contained messages like the following.
May 2 10:03:56 rhel68-kmalloc kernel: Pid: 22319, comm: insmod Not tainted 2.6.32-642.el6.x86_64 #1
May 2 10:03:56 rhel68-kmalloc kernel: Call Trace:
May 2 10:03:56 rhel68-kmalloc kernel: [<ffffffff8113e77c>] ? __alloc_pages_nodemask+0x7dc/0x950
May 2 10:03:56 rhel68-kmalloc kernel: [<ffffffff8117f132>] ? kmem_getpages+0x62/0x170
May 2 10:03:56 rhel68-kmalloc kernel: [<ffffffff8117fd4a>] ? fallback_alloc+0x1ba/0x270
May 2 10:03:56 rhel68-kmalloc kernel: [<ffffffff8117f79f>] ? cache_grow+0x2cf/0x320
May 2 10:03:56 rhel68-kmalloc kernel: [<ffffffff8117fac9>] ? ____cache_alloc_node+0x99/0x160
The message "page allocation failure. order:10, mode:0xd0" indicates that the kernel requested an order-10 block, that is 2^10 pages or 4MB of physically contiguous memory, and the allocation failed.
This could be what triggered the buffer/cache reclaim and the swapping. The system might treat such a 'page allocation failure' as a low-memory condition and trigger the PFRA (page frame reclaiming algorithm), which might in turn swap out pages and reclaim buffer/cache.
Can we reproduce this situation?
Linux's buddy system
In RHEL5/6/7 on x86_64, the page size is 4KB (4096 bytes).
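The system's /proc/buddyinfo shows how fragmented the free memory is. For each zone, the columns give the number of free blocks of order 0 through order 10, that is, blocks of 4KB up to 4096KB.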
Node 0, zone DMA 2 2 1 1 1 0 1 0 1 1 3
Node 0, zone DMA32 97 54 32 81 51 16 23 23 11 10 403
For example, in Node 0's DMA32 zone, the free memory is split into 97*4KB + 54*8KB + 32*16KB + 81*32KB + 51*64KB + 16*128KB + 23*256KB + 23*512KB + 11*1024KB + 10*2048KB + 403*4096KB.
Write a kernel module to request contiguous physical memory
In Linux, a user-space process cannot request physically contiguous memory directly, except through HugePages, if I recall correctly.
To request physically contiguous memory, we can call the kmalloc() function in a kernel module.
Here's the source code, adapted from a hello world module. When the module is loaded, it requests 30 blocks of 4096KB of physically contiguous memory. The size passed to kmalloc() should be less than or equal to 4096KB, as the largest contiguous block provided by the system is 4096KB.
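#include <linux/errno.h>
#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/slab.h>

MODULE_AUTHOR("siwu_from_CEE");
MODULE_DESCRIPTION("Request for high order contiguous memory");
MODULE_LICENSE("GPL");

#define NR_PAGES 30                  /* number of blocks to request */
#define PER_PAGE_SIZE (4096 * 1024)  /* 4096KB per block, i.e. order 10 */

/* Parameters left over from the hello world template. */
static int irq;
module_param(irq, int, 0644);
static int sample;
module_param_named(test, sample, int, 0644);
static int arr_data[10];
static int arr_cnt;
module_param_array(arr_data, int, &arr_cnt, 0644);

/* One pointer per requested block; stays NULL where kmalloc() failed. */
static void *stuff[NR_PAGES];

static int __init hello_init(void)
{
	int i;
	int allocated = 0;

	/* Each call asks the allocator for one physically contiguous block. */
	for (i = 0; i < NR_PAGES; i++) {
		stuff[i] = kmalloc(PER_PAGE_SIZE, GFP_KERNEL);
		if (stuff[i])
			allocated++;
	}
	if (!allocated) {
		printk(KERN_ERR "Failed to allocate memory\n");
		return -ENOMEM;
	}
	printk(KERN_INFO "%d bytes of memory allocated %d of %d times\n",
	       PER_PAGE_SIZE, allocated, NR_PAGES);
	return 0;
}

static void __exit hello_exit(void)
{
	int i;

	/* kfree(NULL) is a no-op, so failed slots need no special case. */
	for (i = 0; i < NR_PAGES; i++)
		kfree(stuff[i]);
	printk(KERN_INFO "Bye. Bye..%d\n", irq);
}

module_init(hello_init);
module_exit(hello_exit);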
Compile the kernel module
First, install the proper kernel headers. On RHEL/CentOS they come from the kernel-headers and kernel-devel packages, whose versions must match the currently running kernel (`uname -r`).
Second, create a Makefile in the same directory as big_mem.c.
Last, run make to compile the module.
'2.6.32-642.el6.x86_64' is the version of my running kernel.
Insert the module
After running the 'make' command, the module file 'big_mem.ko' is generated in the same directory.
Let's insert it and observe the memory usage.
Node 0, zone DMA 2 2 1 1 1 0 1 0 1 1 3
Node 0, zone DMA32 2 7 11 34 24 4 22 27 12 11 365
total used free shared buffers cached
Mem: 1877 345 1531 0 59 149
-/+ buffers/cache: 136 1740
Swap: 1023 0 1023
[root@rhel68-kmalloc kmalloc-2]# insmod big_mem.ko
[root@rhel68-kmalloc ~]# cat /proc/buddyinfo ; free -m
Node 0, zone DMA 2 2 1 1 1 0 1 0 1 1 3
Node 0, zone DMA32 2 7 11 34 24 4 22 27 12 11 335
total used free shared buffers cached
Mem: 1877 465 1411 0 59 149
-/+ buffers/cache: 256 1620
Swap: 1023 0 1023
After the insert, the number of free order-10 blocks in the DMA32 zone dropped from 365 to 335, which is exactly the 30 blocks the module requested, and used memory grew from 345MB to 465MB, that is 30 * 4MB = 120MB.
Reproduce the issue
To reproduce the customer's issue, I used the following steps.
1. Generate some read/write traffic, for example by copying some large files. Buffer/cache usage grows, and the number of free high-order contiguous blocks decreases.
2. Insert the module, which requests a large amount of physically contiguous memory.
Node 0, zone DMA 2 2 1 2 1 1 1 2 1 1 0
Node 0, zone DMA32 331 548 624 373 225 47 24 11 12 0 1
total used free shared buffers cached
Mem: 996 917 78 0 84 318
-/+ buffers/cache: 514 481
Swap: 1023 1 1022
[root@rhel68-kmalloc kmalloc-1]# insmod big_mem.ko
[root@rhel68-kmalloc ~]# cat /proc/buddyinfo ; free -m
Node 0, zone DMA 17 11 12 13 11 7 3 2 1 1 2
Node 0, zone DMA32 4343 2720 1731 1019 627 324 189 113 54 13 3
total used free shared buffers cached
Mem: 996 608 387 0 84 16
-/+ buffers/cache: 507 488
Swap: 1023 37 986
Before the insert, the DMA32 zone had only one free order-10 block and no free order-9 block. After the insert, cached memory dropped from 318MB to 16MB and swap usage rose from 1MB to 37MB: the kernel reclaimed page cache and swapped out pages while trying to satisfy the order-10 requests.
3. Check the system log, which now shows "page allocation failure".
May 2 10:03:56 rhel68-kmalloc kernel: Pid: 22319, comm: insmod Not tainted 2.6.32-642.el6.x86_64 #1
May 2 10:03:56 rhel68-kmalloc kernel: Call Trace:
May 2 10:03:56 rhel68-kmalloc kernel: [<ffffffff8113e77c>] ? __alloc_pages_nodemask+0x7dc/0x950
May 2 10:03:56 rhel68-kmalloc kernel: [<ffffffff8117f132>] ? kmem_getpages+0x62/0x170
May 2 10:03:56 rhel68-kmalloc kernel: [<ffffffff8117fd4a>] ? fallback_alloc+0x1ba/0x270
May 2 10:03:56 rhel68-kmalloc kernel: [<ffffffff8117f79f>] ? cache_grow+0x2cf/0x320
May 2 10:03:56 rhel68-kmalloc kernel: [<ffffffff8117fac9>] ? ____cache_alloc_node+0x99/0x160
May 2 10:03:56 rhel68-kmalloc kernel: [<ffffffff811803c7>] ? kmem_cache_alloc_trace+0x137/0x1c0
May 2 10:03:56 rhel68-kmalloc kernel: [<ffffffffa02a0076>] ? hello_init+0x26/0x84 [big_mem]
May 2 10:03:56 rhel68-kmalloc kernel: [<ffffffffa02a0050>] ? hello_init+0x0/0x84 [big_mem]
May 2 10:03:56 rhel68-kmalloc kernel: [<ffffffff810020d0>] ? do_one_initcall+0xc0/0x280
May 2 10:03:56 rhel68-kmalloc kernel: [<ffffffff810c85f1>] ? sys_init_module+0xe1/0x250
May 2 10:03:56 rhel68-kmalloc kernel: [<ffffffff8100b0d2>] ? system_call_fastpath+0x16/0x1b
Reference
1. The Linux Kernel Module Programming Guide
2. 8.1. The Real Story of kmalloc