Symptoms

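Nodes of our production service have recently been going down frequently. The service is Tomcat running in Docker. The symptoms we collected:

  • The Tomcat logs show no obvious errors, but every log stops at the same moment.
  • The Docker container is still alive, but the Java process is gone.
  • After the service is restarted, memory usage is high. The server has 16 GB of RAM, and the JVM runs with Xmx8g Xmn4g.
  • Disk I/O is high.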

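As ops, I set aside for now the question of whether the service itself has a memory leak or a similar bug, and started from the server side. My first suspicion was that resource usage had grown so high that the system killed the service. The system log (/var/log/messages) contained the following entries, which confirmed it: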

Nov 27 21:02:21 localhost kernel: java invoked oom-killer: gfp_mask=0x280da, order=0, oom_score_adj=0
Nov 27 21:02:21 localhost kernel: java cpuset=aa5075722ccf791119d698c46ae14d51d2d2d45d479eee43fda347b6df0f5773 mems_allowed=0-1
Nov 27 21:02:21 localhost kernel: CPU: 15 PID: 42005 Comm: java Tainted: G            -------------- T 3.10.0-229.el7.x86_64 #1
Nov 27 21:02:21 localhost kernel: Hardware name: Huawei RH1288A V2/BC11SRSK0, BIOS RMIBV512 08/27/2015
Nov 27 21:02:21 localhost kernel: ffff880287fee660 000000005270c3fe ffff880471377a58 ffffffff81604b0a
Nov 27 21:02:21 localhost kernel: ffff880471377ae8 ffffffff815ffaaf ffff8803428abf30 ffff8803428abf48
Nov 27 21:02:21 localhost kernel: 0000000000000206 ffff880287fee660 ffff880471377ad0 ffffffff81117aef
Nov 27 21:02:21 localhost kernel: Call Trace:
Nov 27 21:02:21 localhost kernel: [<ffffffff81604b0a>] dump_stack+0x19/0x1b
Nov 27 21:02:21 localhost kernel: [<ffffffff815ffaaf>] dump_header+0x8e/0x214
Nov 27 21:02:21 localhost kernel: [<ffffffff81117aef>] ? delayacct_end+0x8f/0xb0
Nov 27 21:02:21 localhost kernel: [<ffffffff8115a44e>] oom_kill_process+0x24e/0x3b0
Nov 27 21:02:21 localhost kernel: [<ffffffff81159fb6>] ? find_lock_task_mm+0x56/0xc0
Nov 27 21:02:21 localhost kernel: [<ffffffff8115ac76>] out_of_memory+0x4b6/0x4f0
Nov 27 21:02:21 localhost kernel: [<ffffffff81160e55>] __alloc_pages_nodemask+0xa95/0xb90
Nov 27 21:02:21 localhost kernel: [<ffffffff811a29da>] alloc_pages_vma+0x9a/0x140
Nov 27 21:02:21 localhost kernel: [<ffffffff81182fb7>] handle_mm_fault+0x9f7/0xd70
Nov 27 21:02:21 localhost kernel: [<ffffffff810a94f6>] ? try_to_wake_up+0x1b6/0x280
Nov 27 21:02:21 localhost kernel: [<ffffffff810a95e3>] ? wake_up_process+0x23/0x40
Nov 27 21:02:21 localhost kernel: [<ffffffff8160fe06>] __do_page_fault+0x156/0x540
Nov 27 21:02:21 localhost kernel: [<ffffffff812e3247>] ? call_rwsem_wake+0x17/0x30
Nov 27 21:02:21 localhost kernel: [<ffffffff8109c42d>] ? up_write+0x1d/0x20
Nov 27 21:02:21 localhost kernel: [<ffffffff810faec6>] ? __audit_syscall_exit+0x1f6/0x2a0
Nov 27 21:02:21 localhost kernel: [<ffffffff8161020a>] do_page_fault+0x1a/0x70
Nov 27 21:02:21 localhost kernel: [<ffffffff8160c408>] page_fault+0x28/0x30
Nov 27 21:02:21 localhost kernel: Mem-Info:
Nov 27 21:02:21 localhost kernel: Node 0 DMA per-cpu:
Nov 27 21:02:21 localhost kernel: CPU    0: hi:    0, btch:   1 usd:   0
Nov 27 21:02:21 localhost kernel: CPU    1: hi:    0, btch:   1 usd:   0
Nov 27 21:02:21 localhost kernel: CPU    2: hi:    0, btch:   1 usd:   0
Nov 27 21:02:21 localhost kernel: CPU    3: hi:    0, btch:   1 usd:   0
Nov 27 21:02:21 localhost kernel: CPU    4: hi:    0, btch:   1 usd:   0
Nov 27 21:02:21 localhost kernel: CPU    5: hi:    0, btch:   1 usd:   0
Nov 27 21:02:21 localhost kernel: CPU    6: hi:    0, btch:   1 usd:   0
Nov 27 21:02:21 localhost kernel: CPU    7: hi:    0, btch:   1 usd:   0
Nov 27 21:02:21 localhost kernel: CPU    8: hi:    0, btch:   1 usd:   0
Nov 27 21:02:21 localhost kernel: CPU    9: hi:    0, btch:   1 usd:   0
Nov 27 21:02:21 localhost kernel: CPU   10: hi:    0, btch:   1 usd:   0
Nov 27 21:02:21 localhost kernel: CPU   11: hi:    0, btch:   1 usd:   0
Nov 27 21:02:21 localhost kernel: CPU   12: hi:    0, btch:   1 usd:   0
Nov 27 21:02:21 localhost kernel: CPU   13: hi:    0, btch:   1 usd:   0
Nov 27 21:02:21 localhost kernel: CPU   14: hi:    0, btch:   1 usd:   0
Nov 27 21:02:21 localhost kernel: CPU   15: hi:    0, btch:   1 usd:   0
Nov 27 21:02:21 localhost kernel: CPU   16: hi:    0, btch:   1 usd:   0
Nov 27 21:02:21 localhost kernel: CPU   17: hi:    0, btch:   1 usd:   0
Nov 27 21:02:21 localhost kernel: CPU   18: hi:    0, btch:   1 usd:   0
Nov 27 21:02:21 localhost kernel: CPU   19: hi:    0, btch:   1 usd:   0
Nov 27 21:02:21 localhost kernel: CPU   20: hi:    0, btch:   1 usd:   0
Nov 27 21:02:21 localhost kernel: CPU   21: hi:    0, btch:   1 usd:   0
Nov 27 21:02:21 localhost kernel: CPU   22: hi:    0, btch:   1 usd:   0
Nov 27 21:02:21 localhost kernel: CPU   23: hi:    0, btch:   1 usd:   0
Nov 27 21:02:21 localhost kernel: Node 0 DMA32 per-cpu:
Nov 27 21:02:21 localhost kernel: CPU    0: hi:  186, btch:  31 usd:   0
Nov 27 21:02:21 localhost kernel: CPU    1: hi:  186, btch:  31 usd:   0
Nov 27 21:02:21 localhost kernel: CPU    2: hi:  186, btch:  31 usd:   0
Nov 27 21:02:21 localhost kernel: CPU    3: hi:  186, btch:  31 usd:   0
Nov 27 21:02:21 localhost kernel: CPU    4: hi:  186, btch:  31 usd:   0
Nov 27 21:02:21 localhost kernel: CPU    5: hi:  186, btch:  31 usd:   2
Nov 27 21:02:21 localhost kernel: CPU    6: hi:  186, btch:  31 usd:   0
Nov 27 21:02:21 localhost kernel: CPU    7: hi:  186, btch:  31 usd:   0
Nov 27 21:02:21 localhost kernel: CPU    8: hi:  186, btch:  31 usd:   3
Nov 27 21:02:21 localhost kernel: CPU    9: hi:  186, btch:  31 usd:   1
Nov 27 21:02:21 localhost kernel: CPU   10: hi:  186, btch:  31 usd:   0
Nov 27 21:02:21 localhost kernel: CPU   11: hi:  186, btch:  31 usd:   0
Nov 27 21:02:21 localhost kernel: CPU   12: hi:  186, btch:  31 usd:   0
Nov 27 21:02:21 localhost kernel: CPU   13: hi:  186, btch:  31 usd:   0
Nov 27 21:02:21 localhost kernel: CPU   14: hi:  186, btch:  31 usd:   0
Nov 27 21:02:21 localhost kernel: CPU   15: hi:  186, btch:  31 usd:   0
Nov 27 21:02:21 localhost kernel: CPU   16: hi:  186, btch:  31 usd:   0
Nov 27 21:02:21 localhost kernel: CPU   17: hi:  186, btch:  31 usd:   0
Nov 27 21:02:21 localhost kernel: CPU   18: hi:  186, btch:  31 usd:   0
Nov 27 21:02:21 localhost kernel: CPU   19: hi:  186, btch:  31 usd:   0
Nov 27 21:02:21 localhost kernel: CPU   20: hi:  186, btch:  31 usd:   0
Nov 27 21:02:21 localhost kernel: CPU   21: hi:  186, btch:  31 usd:  12
Nov 27 21:02:21 localhost kernel: CPU   22: hi:  186, btch:  31 usd:   0
Nov 27 21:02:21 localhost kernel: CPU   23: hi:  186, btch:  31 usd:  15
Nov 27 21:02:21 localhost kernel: Node 0 Normal per-cpu:
Nov 27 21:02:21 localhost kernel: CPU    0: hi:  186, btch:  31 usd:   0
Nov 27 21:02:21 localhost kernel: CPU    1: hi:  186, btch:  31 usd:   0
Nov 27 21:02:21 localhost kernel: CPU    2: hi:  186, btch:  31 usd:   0
Nov 27 21:02:21 localhost kernel: CPU    3: hi:  186, btch:  31 usd:   0
Nov 27 21:02:21 localhost kernel: CPU    4: hi:  186, btch:  31 usd:   0
Nov 27 21:02:21 localhost kernel: CPU    5: hi:  186, btch:  31 usd:   0
Nov 27 21:02:21 localhost kernel: CPU    6: hi:  186, btch:  31 usd:   0
Nov 27 21:02:21 localhost kernel: CPU    7: hi:  186, btch:  31 usd:   0
Nov 27 21:02:21 localhost kernel: CPU    8: hi:  186, btch:  31 usd:   0
Nov 27 21:02:21 localhost kernel: CPU    9: hi:  186, btch:  31 usd:   0
Nov 27 21:02:21 localhost kernel: CPU   10: hi:  186, btch:  31 usd:   0
Nov 27 21:02:21 localhost kernel: CPU   11: hi:  186, btch:  31 usd:   0
Nov 27 21:02:21 localhost kernel: CPU   12: hi:  186, btch:  31 usd:   0
Nov 27 21:02:21 localhost kernel: CPU   13: hi:  186, btch:  31 usd:   0
Nov 27 21:02:21 localhost kernel: CPU   14: hi:  186, btch:  31 usd:   0
Nov 27 21:02:21 localhost kernel: CPU   15: hi:  186, btch:  31 usd:   0
Nov 27 21:02:21 localhost kernel: CPU   16: hi:  186, btch:  31 usd:   0
Nov 27 21:02:21 localhost kernel: CPU   17: hi:  186, btch:  31 usd:   0
Nov 27 21:02:21 localhost kernel: CPU   18: hi:  186, btch:  31 usd:   0
Nov 27 21:02:21 localhost kernel: CPU   19: hi:  186, btch:  31 usd:   0
Nov 27 21:02:21 localhost kernel: CPU   20: hi:  186, btch:  31 usd:   0
Nov 27 21:02:21 localhost kernel: CPU   21: hi:  186, btch:  31 usd:   0
Nov 27 21:02:21 localhost kernel: CPU   22: hi:  186, btch:  31 usd:   0
Nov 27 21:02:21 localhost kernel: CPU   23: hi:  186, btch:  31 usd:   0
Nov 27 21:02:21 localhost kernel: Node 1 Normal per-cpu:
Nov 27 21:02:21 localhost kernel: CPU    0: hi:  186, btch:  31 usd:  46
Nov 27 21:02:21 localhost kernel: CPU    1: hi:  186, btch:  31 usd:   2
Nov 27 21:02:21 localhost kernel: CPU    2: hi:  186, btch:  31 usd:  42
Nov 27 21:02:21 localhost kernel: CPU    3: hi:  186, btch:  31 usd:   2
Nov 27 21:02:21 localhost kernel: CPU    4: hi:  186, btch:  31 usd:  31
Nov 27 21:02:21 localhost kernel: CPU    5: hi:  186, btch:  31 usd:   0
Nov 27 21:02:21 localhost kernel: CPU    6: hi:  186, btch:  31 usd:  16
Nov 27 21:02:21 localhost kernel: CPU    7: hi:  186, btch:  31 usd: 164
Nov 27 21:02:21 localhost kernel: CPU    8: hi:  186, btch:  31 usd:  19
Nov 27 21:02:21 localhost kernel: CPU    9: hi:  186, btch:  31 usd:  18
Nov 27 21:02:21 localhost kernel: CPU   10: hi:  186, btch:  31 usd:   8
Nov 27 21:02:21 localhost kernel: CPU   11: hi:  186, btch:  31 usd:  20
Nov 27 21:02:21 localhost kernel: CPU   12: hi:  186, btch:  31 usd:  31
Nov 27 21:02:21 localhost kernel: CPU   13: hi:  186, btch:  31 usd:  27
Nov 27 21:02:21 localhost kernel: CPU   14: hi:  186, btch:  31 usd:   0
Nov 27 21:02:21 localhost kernel: CPU   15: hi:  186, btch:  31 usd:   4
Nov 27 21:02:21 localhost kernel: CPU   16: hi:  186, btch:  31 usd:   0
Nov 27 21:02:21 localhost kernel: CPU   17: hi:  186, btch:  31 usd:  30
Nov 27 21:02:21 localhost kernel: CPU   18: hi:  186, btch:  31 usd:  30
Nov 27 21:02:21 localhost kernel: CPU   19: hi:  186, btch:  31 usd:   0
Nov 27 21:02:21 localhost kernel: CPU   20: hi:  186, btch:  31 usd:   3
Nov 27 21:02:21 localhost kernel: CPU   21: hi:  186, btch:  31 usd:  24
Nov 27 21:02:21 localhost kernel: CPU   22: hi:  186, btch:  31 usd:  12
Nov 27 21:02:21 localhost kernel: CPU   23: hi:  186, btch:  31 usd:  18
Nov 27 21:02:21 localhost kernel: active_anon:2724567 inactive_anon:177005 isolated_anon:0
Nov 27 21:02:21 localhost kernel: Node 0 DMA free:15900kB min:88kB low:108kB high:132kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15984kB managed:15900kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
Nov 27 21:02:21 localhost kernel: lowmem_reserve[]: 0 1764 7766 7766
Nov 27 21:02:21 localhost kernel: Node 0 DMA32 free:34556kB min:10040kB low:12548kB high:15060kB active_anon:1612600kB inactive_anon:65356kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2052096kB managed:1808016kB mlocked:0kB dirty:0kB writeback:0kB mapped:2344kB shmem:72780kB slab_reclaimable:39568kB slab_unreclaimable:23940kB kernel_stack:8096kB pagetables:5680kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:184 all_unreclaimable? yes
Nov 27 21:02:21 localhost kernel: lowmem_reserve[]: 0 0 6002 6002
Nov 27 21:02:21 localhost kernel: Node 0 Normal free:33824kB min:34160kB low:42700kB high:51240kB active_anon:5423868kB inactive_anon:20808kB active_file:20kB inactive_file:12kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:6291456kB managed:6146484kB mlocked:0kB dirty:0kB writeback:0kB mapped:5364kB shmem:18288kB slab_reclaimable:108572kB slab_unreclaimable:85084kB kernel_stack:26192kB pagetables:14532kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:114 all_unreclaimable? yes
Nov 27 21:02:21 localhost kernel: lowmem_reserve[]: 0 0 0 0
Nov 27 21:02:21 localhost kernel: Node 1 Normal free:46612kB min:45816kB low:57268kB high:68724kB active_anon:3861800kB inactive_anon:621856kB active_file:340kB inactive_file:40kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:8388608kB managed:8243396kB mlocked:0kB dirty:0kB writeback:132kB mapped:3756kB shmem:993724kB slab_reclaimable:120776kB slab_unreclaimable:139408kB kernel_stack:26976kB pagetables:18120kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:2643 all_unreclaimable? yes
Nov 27 21:02:21 localhost kernel: lowmem_reserve[]: 0 0 0 0
Nov 27 21:02:21 localhost kernel: Node 0 DMA: 1*4kB (U) 1*8kB (U) 1*16kB (U) 0*32kB 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (R) 3*4096kB (M) = 15900kB
Nov 27 21:02:21 localhost kernel: Node 0 DMA32: 742*4kB (UEMR) 396*8kB (UEM) 344*16kB (UEM) 407*32kB (UEMR) 151*64kB (UEMR) 5*128kB (UMR) 0*256kB 1*512kB (R) 0*1024kB 0*2048kB 0*4096kB = 35480kB
Nov 27 21:02:21 localhost kernel: Node 0 Normal: 1015*4kB (UEMR) 599*8kB (UEM) 586*16kB (UEM) 376*32kB (UEM) 99*64kB (UEM) 1*128kB (M) 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 36724kB
Nov 27 21:02:21 localhost kernel: Node 1 Normal: 2725*4kB (UEMR) 1187*8kB (UEM) 460*16kB (UEM) 339*32kB (UEM) 106*64kB (UEM) 7*128kB (UM) 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 46284kB
Nov 27 21:02:21 localhost kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Nov 27 21:02:21 localhost kernel: Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Nov 27 21:02:21 localhost kernel: 272043 total pagecache pages
Nov 27 21:02:21 localhost kernel: 0 pages in swap cache
Nov 27 21:02:21 localhost kernel: Swap cache stats: add 17671445, delete 17671445, find 29162461/30280216
Nov 27 21:02:21 localhost kernel: Free swap  = 0kB
Nov 27 21:02:21 localhost kernel: Total swap = 0kB
Nov 27 21:02:21 localhost kernel: 4187036 pages RAM
Nov 27 21:02:21 localhost kernel: 0 pages HighMem/MovableOnly
Nov 27 21:02:21 localhost kernel: 133587 pages reserved
Nov 27 21:02:21 localhost kernel: [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
Nov 27 21:02:21 localhost kernel: [ 1183]     0  1183    14046     2536      29        0             0 systemd-journal
Nov 27 21:02:21 localhost kernel: [ 1194]     0  1194    28172       64      21        0             0 lvmetad
Nov 27 21:02:21 localhost kernel: [ 1249]     0  1249    10682      189      22        0         -1000 systemd-udevd
Nov 27 21:02:21 localhost kernel: [ 2094]     0  2094    29177      123      25        0         -1000 auditd
Nov 27 21:02:21 localhost kernel: [ 2121]   997  2121     1084       30       7        0             0 lsmd
Nov 27 21:02:21 localhost kernel: [ 2124]     0  2124    53614      445      59        0             0 abrtd
Nov 27 21:02:21 localhost kernel: [ 2126]     0  2126    32512      129      21        0             0 smartd
Nov 27 21:02:21 localhost kernel: [ 2127]     0  2127   185139     2193     132        0             0 rsyslogd
Nov 27 21:02:21 localhost kernel: [ 2129]     0  2129    52996      340      56        0             0 abrt-watch-log
Nov 27 21:02:21 localhost kernel: [ 2130]     0  2130     4817       90      14        0             0 irqbalance
Nov 27 21:02:21 localhost kernel: [ 2131]     0  2131     1093       40       8        0             0 rngd
Nov 27 21:02:21 localhost kernel: [ 2132]     0  2132     8737      163      21        0             0 systemd-logind
Nov 27 21:02:21 localhost kernel: [ 2133]    81  2133     7224      180      20        0          -900 dbus-daemon
Nov 27 21:02:21 localhost kernel: [ 2534]     0  2534     6483       51      16        0             0 atd
Nov 27 21:02:21 localhost kernel: [ 2536]     0  2536     6794       59      18        0             0 xinetd
Nov 27 21:02:21 localhost kernel: [ 3458]     0  3458     9840      126      22        0         -1000 sshd
Nov 27 21:02:21 localhost kernel: [18475]     0 18475    27501       32      10        0             0 agetty
Nov 27 21:02:21 localhost kernel: [226241]   999 226241   129017     2131      48        0             0 polkitd
Nov 27 21:02:21 localhost kernel: [226841]    38 226841     8371      130      20        0             0 ntpd
Nov 27 21:02:21 localhost kernel: [17993]     0 17993     2838       50      11        0             0 init.sh
Nov 27 21:02:21 localhost kernel: [18033]     0 18033   271311     4160      94        0             0 python2.6
Nov 27 21:02:21 localhost kernel: [18034]     0 18034    11302     6761      28        0             0 python2.6
Nov 27 21:02:21 localhost kernel: [18042]     0 18042     1038       23       8        0             0 tail
Nov 27 21:02:21 localhost kernel: [133984]     0 133984  1556394     7033     190        0             0 scribed
Nov 27 21:02:21 localhost kernel: [41077]    99 41077     3880       48      12        0             0 dnsmasq
Nov 27 21:02:21 localhost kernel: [136214]     0 136214    27931      275      53        0             0 lldpd
Nov 27 21:02:21 localhost kernel: [136217]   990 136217    27931      260      52        0             0 lldpd
Nov 27 21:02:21 localhost kernel: [179346] 60422 179346    81770     2557      38        0             0 wagent
Nov 27 21:02:21 localhost kernel: [184527]     0 184527    52460     2085      57        0             0 python
Nov 27 21:02:21 localhost kernel: [184648]     0 184648    49745     1409      53        0             0 python
Nov 27 21:02:21 localhost kernel: [184670]     0 184670    47515     1281      49        0             0 python
Nov 27 21:02:21 localhost kernel: [184696]     0 184696    47514     1278      51        0             0 python
Nov 27 21:02:21 localhost kernel: [184718]     0 184718    47514     1283      48        0             0 python
Nov 27 21:02:21 localhost kernel: [184740]     0 184740    47514     1314      50        0             0 python
Nov 27 21:02:21 localhost kernel: [184762]     0 184762    47514     1302      51        0             0 python
Nov 27 21:02:21 localhost kernel: [56906]     0 56906   505000     6523     101        0             0 xxxAgent
Nov 27 21:02:21 localhost kernel: [38493]     0 38493   347897     6702      90        0          -500 dockerd
Nov 27 21:02:21 localhost kernel: [38776]     0 38776    31576      157      18        0             0 crond
Nov 27 21:02:21 localhost kernel: [39783]     0 39783   118293      662      24        0          -500 docker-containe
Nov 27 21:02:21 localhost kernel: [39805]     0 39805     3761       55      12        0             0 docker.sh
Nov 27 21:02:21 localhost kernel: [39855]     0 39855  6841020  2555504    7788        0             0 java
Nov 27 21:02:21 localhost kernel: [48691]     0 48691    48225     1480      52        0             0 python
Nov 27 21:02:21 localhost kernel: [90430]     0 90430   312213     1081      61        0          -500 docker-containe
Nov 27 21:02:21 localhost kernel: [68926]     0 68926     5624       62      16        0             0 xxagent
Nov 27 21:02:21 localhost kernel: [68932]     0 68932    12834     7251      30        0             0 xxagent
Nov 27 21:02:21 localhost kernel: [70100]     0 70100    42611      220      39        0             0 crond
Nov 27 21:02:21 localhost kernel: [70103]     0 70103    42611      220      39        0             0 crond
Nov 27 21:02:21 localhost kernel: [70176]     0 70176    28279       43      15        0             0 sh
Nov 27 21:02:21 localhost kernel: [70177]     0 70177    28279       42      13        0             0 sh
Nov 27 21:02:21 localhost kernel: [70182]     0 70182    28279       64      13        0             0 bash
Nov 27 21:02:21 localhost kernel: [70186]     0 70186    28279       43      12        0             0 bash
Nov 27 21:02:21 localhost kernel: [70205]     0 70205    32413       93      21        0             0 perl
Nov 27 21:02:21 localhost kernel: [70357]     0 70357    28279       63      11        0             0 bash
Nov 27 21:02:21 localhost kernel: [70718]     0 70718     1143       81       7        0             0 gzip
Nov 27 21:02:21 localhost kernel: Out of memory: Kill process 39855 (java) score 632 or sacrifice child
Nov 27 21:02:21 localhost kernel: Killed process 39855 (java) total-vm:27364080kB, anon-rss:10222016kB, file-rss:0kB

This log shows that the Java process was killed by the kernel. Let's look at what happened in detail.

Analysis

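Linux has a self-protection mechanism called the OOM killer. When physical memory (and swap, if any) is exhausted and a process requests more, the kernel picks the process whose termination would free the most memory, typically the one using the most, and kills it so that the system can keep running.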
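To make the selection rule concrete, here is a simplified sketch of how the score could be approximated from the columns of the process table in the log. It assumes the basic heuristic (resident pages plus page-table pages plus swap entries, scaled to 0..1000 of usable memory, with oom_score_adj added in thousandths) and leaves out finer adjustments that some kernels apply. The function name `oom_score` and the use of "pages RAM" minus "pages reserved" as the usable total are assumptions made for illustration.

```python
def oom_score(rss_pages, nr_ptes, swapents, oom_score_adj, total_pages):
    """Approximate OOM score on a 0..1000 scale (simplified sketch)."""
    if oom_score_adj == -1000:
        return 0  # -1000 exempts the process from being selected
    points = rss_pages + nr_ptes + swapents
    # oom_score_adj is expressed in thousandths of the usable memory
    points += oom_score_adj * total_pages // 1000
    points = max(points, 0)
    return points * 1000 // total_pages


# Values taken from the log above
total_pages = 4187036 - 133587  # "pages RAM" minus "pages reserved" (assumption)
java = oom_score(rss_pages=2555504, nr_ptes=7788, swapents=0,
                 oom_score_adj=0, total_pages=total_pages)
print(java)  # 632, the same value as "score 632" in the log
```

Under these assumptions the java row reproduces the score the kernel logged, while a row with oom_score_adj of -1000, such as sshd, can never be selected.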

The log above ends with these lines:

Nov 27 21:02:21 localhost kernel: Out of memory: Kill process 39855 (java) score 632 or sacrifice child
Nov 27 21:02:21 localhost kernel: Killed process 39855 (java) total-vm:27364080kB, anon-rss:10222016kB, file-rss:0kB

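These lines show that java process 39855 was killed by the kernel because the system ran out of memory. The second line records its memory usage at that moment: total-vm, the size of its virtual address space, was 27364080 kB (about 26 GiB); anon-rss, its resident anonymous memory (pages with no backing file, such as heap and stacks, each page being 4 kB), was 10222016 kB (about 9.75 GiB); and file-rss, its resident file-backed memory (pages of mapped files), was 0.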
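The fields of that record can be pulled out and converted with a few lines of Python. This is a minimal sketch that assumes the exact line format shown in this log (other kernel versions may print the record differently); `parse_killed_line` and `kb_to_gib` are hypothetical helper names.

```python
import re

# Matches "Killed process <pid> (<name>) total-vm:<n>kB, anon-rss:<n>kB, file-rss:<n>kB"
KILLED_RE = re.compile(
    r"Killed process (\d+) \(([^)]+)\) "
    r"total-vm:(\d+)kB, anon-rss:(\d+)kB, file-rss:(\d+)kB"
)


def parse_killed_line(line):
    """Return the fields of a 'Killed process' record, or None if the line does not match."""
    m = KILLED_RE.search(line)
    if m is None:
        return None
    pid, name, total_vm, anon_rss, file_rss = m.groups()
    return {
        "pid": int(pid),
        "name": name,
        "total_vm_kb": int(total_vm),
        "anon_rss_kb": int(anon_rss),
        "file_rss_kb": int(file_rss),
    }


def kb_to_gib(kb):
    """Convert kB (1024 bytes, as the kernel reports them) to GiB."""
    return kb / (1024 * 1024)


line = ("Nov 27 21:02:21 localhost kernel: Killed process 39855 (java) "
        "total-vm:27364080kB, anon-rss:10222016kB, file-rss:0kB")
rec = parse_killed_line(line)
print(f"{rec['name']} pid={rec['pid']} "
      f"total-vm={kb_to_gib(rec['total_vm_kb']):.2f} GiB "
      f"anon-rss={kb_to_gib(rec['anon_rss_kb']):.2f} GiB")
```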

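Now look at the per-process table above those lines. The rss column is in 4 kB pages: the java process has rss 2555504 × 4 kB = 10222016 kB, roughly 10 GB, by far the largest. The next five largest processes each have a little over 6,000 pages, roughly 5 × 6000 × 4 kB = 120 MB in total, and everything else is smaller still. The table therefore accounts for a little over 10 GB of the machine's 16 GB; the rest is not attributed to any process in this table (for example shmem, about 1 GB across the zones in the Mem-Info dump, and kernel slab). What matters is that free memory in the Normal zones sits at about the min watermark and there is no swap (Total swap = 0kB), so there is nothing left to allocate from.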
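The same reading of the table can be automated. The sketch below parses rows of the process table and ranks them by resident memory. It assumes the column layout of this particular log and a 4 kB page size, and it would mis-parse process names that contain spaces; `parse_rows` is a hypothetical helper, and the sample contains only a few rows copied from the log.

```python
import re

# One row of the oom-killer process table:
# [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
ROW_RE = re.compile(
    r"\[\s*(\d+)\]\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(-?\d+)\s+(\S+)\s*$"
)
PAGE_KB = 4  # page size assumed to be 4 kB, as on x86_64


def parse_rows(lines):
    """Return pid, name and rss (in kB) for every process-table row found."""
    rows = []
    for line in lines:
        m = ROW_RE.search(line)
        if m:
            rows.append({
                "pid": int(m.group(1)),
                "rss_kb": int(m.group(5)) * PAGE_KB,
                "name": m.group(9),
            })
    return rows


sample = [
    "Nov 27 21:02:21 localhost kernel: [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name",
    "Nov 27 21:02:21 localhost kernel: [ 3458]     0  3458     9840      126      22        0         -1000 sshd",
    "Nov 27 21:02:21 localhost kernel: [133984]     0 133984  1556394     7033     190        0             0 scribed",
    "Nov 27 21:02:21 localhost kernel: [39855]     0 39855  6841020  2555504    7788        0             0 java",
    "Nov 27 21:02:21 localhost kernel: [68932]     0 68932    12834     7251      30        0             0 xxagent",
]

rows = sorted(parse_rows(sample), key=lambda r: r["rss_kb"], reverse=True)
for r in rows:
    print(f"{r['name']:<10} pid={r['pid']:<7} rss={r['rss_kb']} kB")
```

Feeding the full table through `parse_rows` and summing `rss_kb` gives the per-process total that the paragraph above compares against the 16 GB of RAM.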

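With no free memory left, the OOM killer was triggered, and because the java process used the most memory, it was the one killed. The Java process exited abnormally, but the Docker container did not exit (presumably because the container's main process is the docker.sh wrapper seen in the process table, not java itself), so the container stayed up while no longer serving requests.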

Solutions

Option 1: lower the JVM's maximum memory, or add physical memory to the machine

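If it can be done without hurting the service, lower the JVM's configured maximum memory to reduce overall usage. Memory usage may still be high, but it will come down overall, and the OOM killer becomes less likely to fire.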
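When picking a new heap size, keep in mind that a JVM's resident memory is the heap plus non-heap memory (metaspace, thread stacks, direct buffers, GC structures), so the process can exceed Xmx. The sketch below uses the anon-rss from the log to estimate a lower bound on the non-heap part and a rough peak for a smaller heap. The function names and the 6 GiB candidate are hypothetical, and the estimate assumes non-heap usage stays about the same.

```python
KB_PER_GIB = 1024 * 1024


def non_heap_lower_bound_kb(rss_kb, xmx_gib):
    """Memory held outside the Java heap is at least RSS minus the heap ceiling."""
    return max(rss_kb - xmx_gib * KB_PER_GIB, 0)


def worst_case_rss_kb(xmx_gib, non_heap_kb):
    """Rough peak RSS if the heap grows to Xmx and non-heap usage stays the same."""
    return xmx_gib * KB_PER_GIB + non_heap_kb


anon_rss_kb = 10222016  # anon-rss of the killed java process, from the log
non_heap = non_heap_lower_bound_kb(anon_rss_kb, xmx_gib=8)

# 6 GiB is a hypothetical candidate for the new Xmx, not a recommendation
candidate = worst_case_rss_kb(xmx_gib=6, non_heap_kb=non_heap)

print(f"non-heap >= {non_heap / KB_PER_GIB:.2f} GiB")
print(f"estimated peak with Xmx6g: {candidate / KB_PER_GIB:.2f} GiB")
```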

Option 2: adjust the process's OOM score, or keep the OOM killer from firing

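When choosing a victim, the kernel computes a badness score for each process, exposed in /proc/[pid]/oom_score; the higher the score, the more likely the process is to be chosen. The score can be biased per process through /proc/[pid]/oom_adj, which takes a value in [-17, 15], where -17 exempts the process from being killed. On current kernels oom_adj is deprecated in favor of /proc/[pid]/oom_score_adj, with range [-1000, 1000] and -1000 as the exempting value; this is the oom_score_adj column in the process table above. We can change this value so that our process is not killed (as root):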

# echo -17 > /proc/$(pidof -s java)/oom_adj
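Before and after changing the adjustment, it helps to check what the kernel currently reports for the process. A minimal sketch, assuming a Linux /proc filesystem; `read_oom_values` and its `proc_root` parameter are hypothetical names introduced here so the function can also be pointed at a test directory.

```python
from pathlib import Path


def read_oom_values(pid="self", proc_root="/proc"):
    """Return (oom_score, oom_score_adj) for a process, read from procfs."""
    base = Path(proc_root) / str(pid)
    score = int((base / "oom_score").read_text().strip())
    adj = int((base / "oom_score_adj").read_text().strip())
    return score, adj


# Only meaningful on Linux; skipped where procfs is not available
if Path("/proc/self/oom_score").exists():
    score, adj = read_oom_values()
    print(f"oom_score={score} oom_score_adj={adj}")
```

Passing the PID of the java process instead of the default `"self"` shows whether the adjustment took effect.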

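Alternatively, the kernel can be switched to strict overcommit accounting. This does not literally turn the OOM killer off; instead, allocations beyond the commit limit are refused up front, so the kernel rarely reaches the point where it has to kill something: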

# sysctl -w vm.overcommit_memory=2

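This option is risky: a process may hold a large amount of memory that is never released, which can hang the system so that nothing else can run. It is not recommended in production.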
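With vm.overcommit_memory=2, the commit limit is swap plus vm.overcommit_ratio percent of RAM (excluding huge pages). The sketch below applies that formula to the figures in the log. It assumes the ratio is at its default of 50, which the log does not confirm, and it uses "pages RAM" as an approximation of total memory; `commit_limit_kb` is a hypothetical helper.

```python
def commit_limit_kb(ram_kb, swap_kb, overcommit_ratio=50, hugetlb_kb=0):
    """Commit limit in strict overcommit mode: swap + ratio% of (RAM - huge pages)."""
    return (ram_kb - hugetlb_kb) * overcommit_ratio // 100 + swap_kb


ram_kb = 4187036 * 4  # "pages RAM" from the log, 4 kB per page
limit = commit_limit_kb(ram_kb, swap_kb=0)  # the log shows Total swap = 0kB

print(f"commit limit: {limit} kB ({limit / (1024 * 1024):.2f} GiB)")
```

Under these assumptions the limit comes out at about 8 GiB for the whole machine, so a JVM with Xmx8g would likely run into allocation failures unless vm.overcommit_ratio is raised or swap is added.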

In short, the key is to estimate the service's peak memory consumption and size it against the machine's physical memory.

References