How to request contiguous physical memory in Linux?

Background

A customer reported an issue: the system had more than 100GB of free memory (truly free, not buffer/cache), yet at some point it started reclaiming pages from buffer/cache and swapping pages out.

At the moment the reclaim started, the system log showed messages like the following.

May  2 10:03:56 rhel68-kmalloc kernel: insmod: page allocation failure. order:10, mode:0xd0
May  2 10:03:56 rhel68-kmalloc kernel: Pid: 22319, comm: insmod Not tainted 2.6.32-642.el6.x86_64 #1
May  2 10:03:56 rhel68-kmalloc kernel: Call Trace:
May  2 10:03:56 rhel68-kmalloc kernel: [<ffffffff8113e77c>] ? __alloc_pages_nodemask+0x7dc/0x950
May  2 10:03:56 rhel68-kmalloc kernel: [<ffffffff8117f132>] ? kmem_getpages+0x62/0x170
May  2 10:03:56 rhel68-kmalloc kernel: [<ffffffff8117fd4a>] ? fallback_alloc+0x1ba/0x270
May  2 10:03:56 rhel68-kmalloc kernel: [<ffffffff8117f79f>] ? cache_grow+0x2cf/0x320
May  2 10:03:56 rhel68-kmalloc kernel: [<ffffffff8117fac9>] ? ____cache_alloc_node+0x99/0x160

The message "page allocation failure. order:10, mode:0xd0" indicates that the kernel requested an order-10 block, that is 2^10 = 1024 physically contiguous 4KB pages (4MB), and the allocation failed. On this kernel, mode 0xd0 corresponds to GFP_KERNEL.
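
The order-to-size arithmetic can be sketched in a few lines of user-space C. This is only an illustration: order_to_bytes() is a hypothetical helper name, not a kernel function, and the page size is passed in by the caller (4096 on the system discussed here).

```c
#include <stddef.h>

/* Size in bytes of one buddy block of the given order.
 * A block of order N consists of 2^N physically contiguous pages.
 * order_to_bytes() is a hypothetical helper, not a kernel API. */
static size_t order_to_bytes(unsigned int order, size_t page_size)
{
    return page_size << order;
}
```

With a 4096-byte page, order_to_bytes(10, 4096) evaluates to 4194304 bytes, which is the 4MB mentioned above.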

This could be what triggered the buffer/cache reclaim and the swapping. When such a high-order allocation cannot be satisfied, the system might treat the 'page allocation failure' as being low on memory and run the page frame reclaim algorithm (PFRA), which might reclaim buffer/cache and swap pages out.

Can we reproduce this situation?

Linux's buddy system

On RHEL 5/6/7 (x86_64), the page size is 4KB.

# getconf PAGESIZE
4096

/proc/buddyinfo shows how fragmented the free memory is.

# cat /proc/buddyinfo
Node 0, zone      DMA      2      2      1      1      1      0      1      0      1      1      3
Node 0, zone    DMA32     97     54     32     81     51     16     23     23     11     10    403

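Each column is the number of free blocks of one order, from order 0 (4KB) on the left to order 10 (4096KB) on the right. For example, the free memory in Node 0's DMA32 zone is made up of 97*4KB + 54*8KB + 32*16KB + 81*32KB + 51*64KB + 16*128KB + 23*256KB + 23*512KB + 11*1024KB + 10*2048KB + 403*4096KB.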
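
To make the sum above concrete, here is a small user-space sketch that parses one buddyinfo line and adds up the free memory. The function names parse_buddyinfo_line() and free_kb() are my own, chosen for illustration, and the code assumes the 11-column layout (orders 0 to 10) shown above.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_ORDERS 11   /* orders 0..10, as in the output above (assumption) */

/* Parse one line of /proc/buddyinfo. Fills counts[] with the number of free
 * blocks per order and returns how many columns were read, or -1 on error. */
static int parse_buddyinfo_line(const char *line, unsigned long counts[MAX_ORDERS])
{
    const char *p = strstr(line, "zone");
    char name[32];
    int consumed = 0;
    int n = 0;

    /* Skip the "zone <name>" part; %n records how many characters were used. */
    if (!p || sscanf(p, "zone %31s%n", name, &consumed) != 1)
        return -1;
    p += consumed;

    while (n < MAX_ORDERS) {
        char *end;
        unsigned long v = strtoul(p, &end, 10);
        if (end == p)           /* no more numbers on the line */
            break;
        counts[n++] = v;
        p = end;
    }
    return n;
}

/* Total free memory in KB: counts[order] blocks of (page_kb << order) KB each. */
static unsigned long free_kb(const unsigned long counts[], int n, unsigned long page_kb)
{
    unsigned long total = 0;
    for (int order = 0; order < n; order++)
        total += counts[order] * (page_kb << order);
    return total;
}
```

Fed the DMA32 line above with page_kb = 4, free_kb() returns 1709332 KB, roughly 1669MB, almost all of it in order-10 blocks.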

Write a kernel module to request contiguous physical memory

In Linux, a user-space process cannot directly request physically contiguous memory, except through huge pages, if I recall correctly.

To request physically contiguous memory, we can call kmalloc() in a kernel module.

Here's the source code, adapted from a hello world module. When the module is loaded, it requests 30 blocks of 4096KB, each of them physically contiguous. The size passed to kmalloc() must be less than or equal to 4096KB, because the largest contiguous block the buddy system provides here is 4096KB (order 10).

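// Filename: big_mem.c
#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/slab.h>

MODULE_AUTHOR("siwu_from_CEE");
MODULE_DESCRIPTION("Request high-order contiguous memory");
MODULE_LICENSE("GPL");

#define NR_PAGES 30                   // number of blocks to request
#define PER_PAGE_SIZE (4096 * 1024)   // size of each block: 4096KB (order 10)

// Parameters left over from the hello world template.
// Only irq is used, in the exit message.
int irq;
module_param(irq, int, 0644);
int sample;
module_param_named(test, sample, int, 0644);

int arr_data[10];
int arr_cnt;
module_param_array(arr_data, int, &arr_cnt, 0644);

// One pointer per requested block; NULL if that allocation failed.
static void *stuff[NR_PAGES];

static int __init hello_init(void)
{
  int i;
  int allocated = 0;

  // Each kmalloc() call asks for one physically contiguous 4096KB block.
  for (i = 0; i < NR_PAGES; i++) {
    stuff[i] = kmalloc(PER_PAGE_SIZE, GFP_KERNEL);
    if (stuff[i])
      allocated++;
  }

  // Refuse to load only if nothing could be allocated at all.
  if (!allocated) {
    printk(KERN_ERR "Failed to allocate memory\n");
    return -ENOMEM;
  }

  printk(KERN_INFO "%d bytes of memory allocated %d of %d times\n",
         PER_PAGE_SIZE, allocated, NR_PAGES);
  return 0;
}

static void __exit hello_exit(void)
{
  int i;

  // kfree(NULL) is a no-op, so failed slots need no special handling.
  for (i = 0; i < NR_PAGES; i++)
    kfree(stuff[i]);

  printk(KERN_INFO "Bye. Bye..%d\n", irq);
}
module_init(hello_init);
module_exit(hello_exit);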
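
The 4096KB limit mentioned before the listing follows from the order arithmetic: a request is rounded up to the next power-of-two number of pages, and nothing larger than order 10 is available. Below is a user-space sketch of that rounding. It assumes a 4096-byte page and a maximum order of 10, as on the system in this article; order_for_size() is a hypothetical name and only mirrors the idea, it is not the kernel's own code.

```c
#include <stddef.h>

#define PAGE_SIZE_BYTES 4096UL   /* assumption: 4KB pages */
#define MAX_BLOCK_ORDER 10       /* assumption: largest buddy block is order 10 */

/* Smallest order whose block (PAGE_SIZE_BYTES << order) can hold `size` bytes.
 * Returns -1 if size is 0 or if the request is larger than an order-10 block. */
static int order_for_size(size_t size)
{
    int order = 0;
    size_t block = PAGE_SIZE_BYTES;

    if (size == 0)
        return -1;

    while (block < size) {
        if (order == MAX_BLOCK_ORDER)
            return -1;          /* would need more than 4096KB contiguous */
        block <<= 1;
        order++;
    }
    return order;
}
```

With these assumptions, a 4096KB request maps to order 10, which is exactly the order reported in the "page allocation failure" message, and a request one byte larger cannot be satisfied by a single block.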

Compile the kernel module

First, we need to install the proper kernel headers. On RHEL/CentOS, we can do so by executing:

# yum install kernel-headers kernel-devel

Make sure the versions of kernel-headers and kernel-devel match the currently running kernel (`uname -r`).

Second, create a Makefile in the same directory as big_mem.c.

obj-m += big_mem.o

Last, execute the command below to compile the module.

# make -C /lib/modules/2.6.32-642.el6.x86_64/build/ M=`pwd` modules

'2.6.32-642.el6.x86_64' is the version of my running kernel; use the output of `uname -r` on your own system.

Insert the module

After running the above 'make' command, the module file 'big_mem.ko' is generated in the same directory.

Let's insert it and observe the memory usage.

[root@rhel68-kmalloc ~]# cat /proc/buddyinfo ; free -m
Node 0, zone      DMA      2      2      1      1      1      0      1      0      1      1      3
Node 0, zone    DMA32      2      7     11     34     24      4     22     27     12     11    365
             total       used       free     shared    buffers     cached
Mem:          1877        345       1531          0         59        149
-/+ buffers/cache:        136       1740
Swap:         1023          0       1023

[root@rhel68-kmalloc kmalloc-2]# insmod big_mem.ko

[root@rhel68-kmalloc ~]# cat /proc/buddyinfo ; free -m
Node 0, zone      DMA      2      2      1      1      1      0      1      0      1      1      3
Node 0, zone    DMA32      2      7     11     34     24      4     22     27     12     11    335
             total       used       free     shared    buffers     cached
Mem:          1877        465       1411          0         59        149
-/+ buffers/cache:        256       1620
Swap:         1023          0       1023
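
The two snapshots can be compared column by column to see where the memory went. The following user-space sketch does that; consumed_kb() is a hypothetical helper, and it assumes 11 orders per zone as in the output above.

```c
#include <stddef.h>

#define NUM_ORDERS 11   /* orders 0..10, as in the buddyinfo output (assumption) */

/* KB of free memory that disappeared between two buddyinfo snapshots of the
 * same zone. A positive result means memory was consumed. */
static long consumed_kb(const long before[NUM_ORDERS],
                        const long after[NUM_ORDERS], long page_kb)
{
    long kb = 0;
    for (int order = 0; order < NUM_ORDERS; order++)
        kb += (before[order] - after[order]) * (page_kb << order);
    return kb;
}
```

For the DMA32 rows above, only the order-10 column changes, from 365 to 335. That is 30 blocks of 4096KB, or 122880KB (120MB), which matches the drop in free memory from 1531MB to 1411MB reported by free -m.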

Reproduce the issue

To reproduce the customer's issue, I used the following steps.

1. Generate some read/write traffic (for example, copy some large files), so that buffer/cache grows and the number of free high-order contiguous blocks drops.

2. Insert the module, which requests a large amount of contiguous physical memory.

[root@rhel68-kmalloc ~]# cat /proc/buddyinfo ; free -m
Node 0, zone      DMA      2      2      1      2      1      1      1      2      1      1      0
Node 0, zone    DMA32    331    548    624    373    225     47     24     11     12      0      1
             total       used       free     shared    buffers     cached
Mem:           996        917         78          0         84        318
-/+ buffers/cache:        514        481
Swap:         1023          1       1022

[root@rhel68-kmalloc kmalloc-1]# insmod big_mem.ko

[root@rhel68-kmalloc ~]# cat /proc/buddyinfo ; free -m
Node 0, zone      DMA     17     11     12     13     11      7      3      2      1      1      2
Node 0, zone    DMA32   4343   2720   1731   1019    627    324    189    113     54     13      3
             total       used       free     shared    buffers     cached
Mem:           996        608        387          0         84         16
-/+ buffers/cache:        507        488
Swap:         1023         37        986

3. Check the system log, where we can see "page allocation failure".

May  2 10:03:56 rhel68-kmalloc kernel: insmod: page allocation failure. order:10, mode:0xd0
May  2 10:03:56 rhel68-kmalloc kernel: Pid: 22319, comm: insmod Not tainted 2.6.32-642.el6.x86_64 #1
May  2 10:03:56 rhel68-kmalloc kernel: Call Trace:
May  2 10:03:56 rhel68-kmalloc kernel: [<ffffffff8113e77c>] ? __alloc_pages_nodemask+0x7dc/0x950
May  2 10:03:56 rhel68-kmalloc kernel: [<ffffffff8117f132>] ? kmem_getpages+0x62/0x170
May  2 10:03:56 rhel68-kmalloc kernel: [<ffffffff8117fd4a>] ? fallback_alloc+0x1ba/0x270
May  2 10:03:56 rhel68-kmalloc kernel: [<ffffffff8117f79f>] ? cache_grow+0x2cf/0x320
May  2 10:03:56 rhel68-kmalloc kernel: [<ffffffff8117fac9>] ? ____cache_alloc_node+0x99/0x160
May  2 10:03:56 rhel68-kmalloc kernel: [<ffffffff811803c7>] ? kmem_cache_alloc_trace+0x137/0x1c0
May  2 10:03:56 rhel68-kmalloc kernel: [<ffffffffa02a0076>] ? hello_init+0x26/0x84 [big_mem]
May  2 10:03:56 rhel68-kmalloc kernel: [<ffffffffa02a0050>] ? hello_init+0x0/0x84 [big_mem]
May  2 10:03:56 rhel68-kmalloc kernel: [<ffffffff810020d0>] ? do_one_initcall+0xc0/0x280
May  2 10:03:56 rhel68-kmalloc kernel: [<ffffffff810c85f1>] ? sys_init_module+0xe1/0x250
May  2 10:03:56 rhel68-kmalloc kernel: [<ffffffff8100b0d2>] ? system_call_fastpath+0x16/0x1b
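
The buddyinfo snapshot taken before this second insmod already hints at the outcome. The sketch below counts how many blocks of a given order a zone could hand out from its free lists, splitting larger blocks where needed. It is a simplification under my own assumptions: blocks_available() is a hypothetical helper, and it ignores watermarks, reserves and anything reclaim or compaction could free up.

```c
#include <stddef.h>

/* Number of order-`order` blocks obtainable from the free lists in counts[]
 * (n entries, index = order). A free block of a higher order o can be split
 * into 2^(o - order) blocks of the requested order. */
static unsigned long blocks_available(const unsigned long counts[], int n, int order)
{
    unsigned long total = 0;
    for (int o = order; o < n; o++)
        total += counts[o] << (o - order);
    return total;
}
```

For the DMA32 row taken before the insmod in step 2, this gives a single order-10 block, while the module asks for 30. In the first experiment the same zone had 365. That shortfall is consistent with what free -m shows afterwards: cached memory drops from 318MB to 16MB and swap usage rises from 1MB to 37MB while the kernel tries to free up contiguous memory.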
