Out of Memory Killer
When memory runs critically low, the Linux kernel can kill processes in order to keep the machine running. This mechanism is called the OOM killer (Out-Of-Memory killer) and is invoked when all other attempts to reclaim RAM have failed.
The OOM killer is triggered when the swap is completely full and the OS can no longer allocate memory to running processes.
To determine which process to kill, the OOM killer applies a ranking heuristic. Each process receives a score, which can be shifted with an oom_score_adj value between -1000 (never killed) and +1000 (killed first).
The kill is abrupt (SIGKILL): the sacrificed process gets no chance to write a final message to its own log.
In Indexima, this happens when the configuration parameters violate the 20% memory rule, which keeps 20% of the RAM outside the Indexima Java heap.
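The scores described above can be inspected on a running Linux node through /proc. The following is a minimal sketch: the function name oom_scores and its optional second argument (an alternative proc root, useful only for testing) are illustrative and not part of Indexima.

```shell
# Print the kernel's current OOM score and the configured adjustment for a PID.
# $1: process ID
# $2: proc root (hypothetical hook for testing; defaults to /proc)
oom_scores() {
    pid="$1"
    root="${2:-/proc}"
    printf 'oom_score=%s oom_score_adj=%s\n' \
        "$(cat "$root/$pid/oom_score")" \
        "$(cat "$root/$pid/oom_score_adj")"
}

# Example: inspect the current shell
# oom_scores $$
```

A process with a high oom_score is a likely candidate when the OOM killer runs.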
Standalone Indexima Cluster
In a standalone configuration, where nodes are dedicated to Indexima, the Linux free command lets you estimate the maximum memory to allocate to the Indexima Java heap:
The free command
# free -h
              total        used        free      shared  buff/cache   available
Mem:            31G        139M         30G        241M        1.2G         30G
Swap:          1.0G          0B        1.0G
In this example, the total RAM is 31 GB.
The GALACTICA_MEM parameter must be set to at most this amount minus 20%, which is about 25 GB here.
This parameter is defined in the conf/galactica.conf file.
# Java heap size for INDEXIMA data engine
export GALACTICA_MEM=25000m
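The 20% rule can be computed rather than estimated by hand. The sketch below assumes the total RAM is read from the MemTotal field of /proc/meminfo (in kB), which corresponds to the "total" column of free; the function name galactica_mem_mb is hypothetical.

```shell
# Compute a candidate GALACTICA_MEM value in MB:
# total RAM (in kB, as reported by /proc/meminfo) minus 20%.
galactica_mem_mb() {
    total_kb="$1"
    echo $(( total_kb * 80 / 100 / 1024 ))
}

# Example on a Linux node:
# total_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
# echo "export GALACTICA_MEM=$(galactica_mem_mb "$total_kb")m"
```

The result is an upper bound; a lower value may be appropriate if other services share the node.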
Hadoop / Yarn configuration
The yarn.memory parameter defines the maximum RAM that can be allocated to an Indexima Hadoop container.
This parameter is located in conf/galactica.conf and its default value is 1024.
The yarn.memory value must be at least the Java heap size plus 20%:
yarn.memory = GALACTICA_MEM + 20%
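As a worked example of this formula, the sketch below derives the smallest suitable yarn.memory from a heap size. It assumes both values are expressed in MB (the page does not state the unit of yarn.memory), and the function name yarn_memory_mb is hypothetical.

```shell
# Smallest yarn.memory that leaves 20% headroom above the Java heap.
# $1: Java heap size in MB (the numeric part of GALACTICA_MEM)
# The result is rounded up to the next whole MB.
yarn_memory_mb() {
    heap_mb="$1"
    echo $(( (heap_mb * 120 + 99) / 100 ))
}

# Example: a 25000 MB heap
# yarn_memory_mb 25000
```

For the 25000m heap used earlier, this gives 30000.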
Snippet of /var/log/messages on CentOS 7 when the OOM killer was invoked.
Search for the oom-killer pattern:
15:15:05 ip-38-75 kernel: [1380440.636612] java invoked oom-killer: gfp_mask=0x14201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD), nodemask=(null), order=0, oom_score_adj=0
15:15:05 ip-38-75 kernel: [1380440.646149] java cpuset=/ mems_allowed=0
15:15:05 ip-38-75 kernel: [1380440.649393] CPU: 0 PID: 6273 Comm: java Not tainted 4.14.62-65.117.amzn1.x86_64 #1
15:15:05 ip-38-75 kernel: [1380440.655182] Hardware name: Xen HVM domU, BIOS 4.2.amazon 08/24/2006
15:15:05 ip-38-75 kernel: [1380440.660166] Call Trace:
15:15:05 ip-38-75 kernel: [1380440.662481] dump_stack+0x5c/0x82
15:15:05 ip-38-75 kernel: [1380440.665407] dump_header+0x94/0x21c
15:15:05 ip-38-75 kernel: [1380440.675890] oom_kill_process+0x213/0x410
15:15:05 ip-38-75 kernel: [1380440.679277] out_of_memory+0x296/0x4c0
15:15:05 ip-38-75 kernel: [1380440.690514] filemap_fault+0x1e3/0x5f0
15:15:05 ip-38-75 kernel: [1380440.703103] __do_fault+0x20/0x60
15:15:05 ip-38-75 kernel: [1380440.706465] __handle_mm_fault+0xcd2/0x13f0
15:15:05 ip-38-75 kernel: [1380440.724966] RIP: 5ed07690:0x3e8
15:15:05 ip-38-75 kernel: [1380440.728230] RSP: 59f3c800:00007f4831bb56d0 EFLAGS: 7f4831bb5660
15:15:05 ip-38-75 kernel: [1380440.728271] Mem-Info:
15:15:05 ip-38-75 kernel: [1380440.736040] active_anon:4033458 inactive_anon:13 isolated_anon:0
15:15:05 ip-38-75 kernel: [1380440.736040] active_file:561 inactive_file:3996 isolated_file:382
15:15:05 ip-38-75 kernel: [1380440.736040] unevictable:0 dirty:5 writeback:0 unstable:0
15:15:05 ip-38-75 kernel: [1380440.736040] slab_reclaimable:5366 slab_unreclaimable:6806
15:15:05 ip-38-75 kernel: [1380440.736040] mapped:572 shmem:19 pagetables:9400 bounce:0
15:15:05 ip-38-75 kernel: [1380440.736040] free:32493 free_pcp:62 free_cma:0
15:15:05 ip-38-75 kernel: [1380440.802358] lowmem_reserve[]: 0 3720 16005 16005
15:15:05 ip-38-75 kernel: [1380440.832894] Node 0 Normal free:49936kB min:50188kB low:62764kB high:75340kB active_anon:12383276kB inactive_anon:52kB active_file:260kB inactive_file:12756kB unevictable:0kB writepending:12kB present:12845056kB managed:12579648kB mlocked:0kB kernel_stack:8048kB pagetables:30404kB bounce:0kB free_pcp:8kB local_pcp:8kB free_cma:0kB
15:15:05 ip-38-75 kernel: [1380440.872715] Node 0 DMA32: 296*4kB (UME) 215*8kB (UME) 212*16kB (UME) 70*32kB (UME) 50*64kB (UME) 27*128kB (UME) 12*256kB (UME) 4*512kB (UE) 2*1024kB (ME) 1*2048kB (M) 10*4096kB (UME) = 65368kB
15:15:05 ip-38-75 kernel: [1380440.900634] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
15:15:05 ip-38-75 kernel: [1380440.915330] 5183 total pagecache pages
15:15:05 ip-38-75 kernel: [1380440.919079] 0 pages in swap cache
15:15:05 ip-38-75 kernel: [1380440.927072] Free swap = 0kB
15:15:05 ip-38-75 kernel: [1380440.930156] Total swap = 0kB
15:15:05 ip-38-75 kernel: [1380440.933397] 4194205 pages RAM
15:15:05 ip-38-75 kernel: [1380440.936526] 0 pages HighMem/MovableOnly
15:15:05 ip-38-75 kernel: [1380440.940323] 87612 pages reserved
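In this example, no more RAM is available: free memory in the Normal zone (49936kB) is below the min watermark (50188kB), and there is no swap (Total swap = 0kB).
The kernel then lists all processes along with their oom_score_adj: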
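To locate such events in a log file, a simple grep is usually enough. This is a minimal sketch: the function name oom_invocations is hypothetical, and the default path is the /var/log/messages file mentioned on this page, which may differ on other distributions.

```shell
# List the lines where a process triggered the OOM killer.
# $1: log file to search (defaults to /var/log/messages)
oom_invocations() {
    grep 'invoked oom-killer' "${1:-/var/log/messages}"
}

# Example:
# oom_invocations /var/log/messages
```

Reading the system log typically requires root privileges.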
15:15:05 ip-38-75 kernel: [1380440.943466] [ pid ] uid tgid total_vm rss nr_ptes nr_pmds swapents oom_score_adj name
15:15:05 ip-38-75 kernel: [1380440.950603] [ 1797] 0 1797 2869 225 12 3 0 -1000 udevd
15:15:05 ip-38-75 kernel: [1380440.995381] [ 2568] 32 2568 8841 101 22 4 0 0 rpcbind
15:15:05 ip-38-75 kernel: [1380441.122893] [ 4802] 0 4802 1077373 41765 179 8 0 0 java
The process with the highest score is selected and immediately killed:
15:15:05 ip-38-75 kernel: [1380441.293974] Out of memory: Kill process 2965 (java) score 854 or sacrifice child
15:15:05 ip-38-75 kernel: [1380441.300828] Killed process 2965 (java) total-vm:16098684kB, anon-rss:14004696kB, file-rss:0kB, shmem-rss:0kB
15:15:05 ip-38-75 dhclient[4141]: bound to 172.31.38.75 -- renewal in 1461 seconds.
15:15:06 ip-38-75 kernel: [1380442.025206] oom_reaper: reaped process 2965 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
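To summarize which processes were sacrificed, the "Killed process" lines can be extracted from the log. The sketch below assumes the message format shown in the excerpt above (it may differ between kernel versions); the function name killed_processes is hypothetical.

```shell
# Print "PID name" for every process killed by the OOM killer.
# $1: log file to search (defaults to /var/log/messages)
killed_processes() {
    sed -n 's/.*Killed process \([0-9][0-9]*\) (\([^)]*\)).*/\1 \2/p' \
        "${1:-/var/log/messages}"
}

# Example:
# killed_processes /var/log/messages
```

If the killed process is the Indexima java process, review GALACTICA_MEM and yarn.memory against the 20% rule.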