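This post is a follow-up to the previous one, "Oracle shutdown immediate encounters ORA-24324 ORA-24323 ORA-01089" (the database was only restarted after it had hung). It is a case we ran into at one of our overseas factories. I split the material into two posts so that each one stays on a single topic. On to the subject:
The server runs Oracle Database 10g Release 10.2.0.4.0 on Red Hat Enterprise Linux Server release 5.7, inside a virtual machine. At the time of the incident the alert log was filling up with "found dead shared server" messages, and clients could no longer connect to the database.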
found dead shared server 'S016', pid = (35, 23)
found dead shared server 'S023', pid = (42, 1)
Fri Aug 5 10:28:48 2016
found dead shared server 'S013', pid = (32, 110)
found dead shared server 'S021', pid = (40, 1)
Fri Aug 5 10:33:53 2016
found dead shared server 'S012', pid = (31, 132)
found dead shared server 'S023', pid = (38, 3)
Fri Aug 5 10:38:55 2016
found dead shared server 'S013', pid = (32, 111)
found dead shared server 'S022', pid = (42, 3)
Fri Aug 5 10:40:53 2016
found dead shared server 'S020', pid = (39, 4)
found dead shared server 'S021', pid = (40, 3)
failed to start shared server, oer=0
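Checking the system showed that memory had been exhausted and oom_kill had been triggered. The OOM killer (out-of-memory killer) is the mechanism Linux uses when memory runs out. When free memory gets too low, the OOM killer walks the entire process list, ranks the processes by their memory usage and their oom score, picks a process with a high score, and sends it a kill signal.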
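To see which processes the OOM killer would currently rank highest, the per-process score can be read from `/proc/<pid>/oom_score`. The following is a minimal sketch, assuming Python 3 is available on the host and that the kernel exposes `oom_score` and `status` under `/proc`. The function name `top_oom_candidates` and its parameters are hypothetical and were not part of the original troubleshooting.

```python
import os

def top_oom_candidates(proc_root="/proc", limit=10):
    """Return up to `limit` (oom_score, pid, name) tuples, highest score first."""
    candidates = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():              # only numeric directories are processes
            continue
        base = os.path.join(proc_root, entry)
        try:
            with open(os.path.join(base, "oom_score")) as f:
                score = int(f.read().strip())
            with open(os.path.join(base, "status")) as f:
                # First line of status looks like "Name:\toracle"
                name = f.readline().split(":", 1)[1].strip()
        except (OSError, ValueError, IndexError):
            continue                         # process exited or file not readable
        candidates.append((score, int(entry), name))
    candidates.sort(reverse=True)            # highest oom_score first
    return candidates[:limit]
```

Calling `top_oom_candidates(limit=5)` on the database host and printing the tuples gives a rough idea of which processes are most likely to be chosen next. The absolute values depend on the kernel version, so only the relative order is meaningful.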
# grep -i kill /var/log/messages | more
Aug 5 10:12:10 xxxxx kernel: oracle invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0
Aug 5 10:12:10 xxxxx kernel: [<ffffffff810d9ae6>] oom_kill_process+0x85/0x25b
Aug 5 10:12:11 xxxxx kernel: Out of memory: kill process 21687 (oracle) score 2296119 or a child
Aug 5 10:12:11 xxxxx kernel: Killed process 21687 (oracle)
Aug 5 10:12:11 xxxxx kernel: oracle invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0
Aug 5 10:12:11 xxxxx kernel: [<ffffffff810d9ae6>] oom_kill_process+0x85/0x25b
Aug 5 10:12:11 xxxxx kernel: Out of memory: kill process 21668 (oracle) score 2144517 or a child
Aug 5 10:12:11 xxxxx kernel: Killed process 21668 (oracle)
Aug 5 10:23:09 xxxxx kernel: oracle invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0
Aug 5 10:23:09 xxxxx kernel: [<ffffffff810d9ae6>] oom_kill_process+0x85/0x25b
Aug 5 10:23:09 xxxxx kernel: Out of memory: kill process 21756 (oracle) score 2144517 or a child
Aug 5 10:23:09 xxxxx kernel: Killed process 21756 (oracle)
Aug 5 10:23:09 xxxxx kernel: oracle invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0
Aug 5 10:23:09 xxxxx kernel: [<ffffffff810d9ae6>] oom_kill_process+0x85/0x25b
Aug 5 10:23:09 xxxxx kernel: Out of memory: kill process 21732 (oracle) score 2138384 or a child
Aug 5 10:23:09 xxxxx kernel: Killed process 21732 (oracle)
Aug 5 10:28:08 xxxxx kernel: oracle invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0
Aug 5 10:28:08 xxxxx kernel: [<ffffffff810d9ae6>] oom_kill_process+0x85/0x25b
Aug 5 10:28:09 xxxxx kernel: Out of memory: kill process 21752 (oracle) score 2144521 or a child
Aug 5 10:28:09 xxxxx kernel: Killed process 21752 (oracle)
Aug 5 10:28:09 xxxxx kernel: oracle invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0
Aug 5 10:28:09 xxxxx kernel: [<ffffffff810d9ae6>] oom_kill_process+0x85/0x25b
Aug 5 10:28:09 xxxxx kernel: Out of memory: kill process 21722 (oracle) score 2138377 or a child
Aug 5 10:28:09 xxxxx kernel: Killed process 21722 (oracle)
Aug 5 10:32:24 xxxxx kernel: oracle invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0
Aug 5 10:32:24 xxxxx kernel: [<ffffffff810d9ae6>] oom_kill_process+0x85/0x25b
Aug 5 10:32:24 xxxxx kernel: Out of memory: kill process 21718 (oracle) score 2135307 or a child
Aug 5 10:32:24 xxxxx kernel: Killed process 21718 (oracle)
Aug 5 10:32:24 xxxxx kernel: gdm-rh-security invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0
Aug 5 10:32:24 xxxxx kernel: [<ffffffff810d9ae6>] oom_kill_process+0x85/0x25b
Aug 5 10:32:24 xxxxx kernel: Out of memory: kill process 22053 (oracle) score 2135300 or a child
Aug 5 10:32:24 xxxxx kernel: Killed process 22053 (oracle)
Aug 5 10:37:54 xxxxx kernel: beremote invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0
Aug 5 10:37:54 xxxxx kernel: [<ffffffff810d9ae6>] oom_kill_process+0x85/0x25b
Aug 5 10:37:54 xxxxx kernel: Out of memory: kill process 22238 (oracle) score 2134274 or a child
Aug 5 10:37:54 xxxxx kernel: Killed process 22238 (oracle)
Aug 5 10:37:54 xxxxx kernel: oracle invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0
Aug 5 10:37:54 xxxxx kernel: [<ffffffff810d9ae6>] oom_kill_process+0x85/0x25b
Aug 5 10:37:54 xxxxx kernel: Out of memory: kill process 22128 (oracle) score 2133001 or a child
--More--
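The output above shows a large number of Oracle processes being killed, which is what caused Oracle to log errors such as "found dead shared server 'S016', pid = (35, 23)". The official document Found Dead Shared Server Messages Reported In Alert.Log (Doc ID 760872.1) describes the behaviour as follows (the document is fairly old, but the principle still applies to this environment):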
SYMPTOMS
The following is being reported in the alert.log file
Mon Dec 22 16:48:31 2008
found dead shared server 'S004', pid = (13, 1)
found dead shared server 'S001', pid = (10, 1)
No further errors accompany those messages in the alert.log file.
Listener.log shows no errors.
CAUSE
When a session is killed, in some circumstances, the shared server could die as well leading to those messages being reported in the alert.log file. This has been verified by << Bug 4270723 >> FOUND DEAD SHARED SERVER
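In other words, when a session is killed, the shared server process can in some circumstances die with it, which puts the "found dead shared server" message into the alert log. In this case the operating system killed a large number of session processes, so the shared server processes died as well. That is why the Oracle database became unreachable and appeared to hang without warning.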
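To line the alert-log messages up against the OOM kills, it helps to group the "found dead shared server" entries under the timestamp line that precedes them. The sketch below assumes the 10g alert-log layout shown earlier (a timestamp line such as `Fri Aug 5 10:28:48 2016` followed by message lines) and an English locale for the day and month names. The function name `dead_servers_by_timestamp` is hypothetical.

```python
import re
from datetime import datetime

DEAD_SERVER = re.compile(r"found dead shared server '(S\d+)'")

def dead_servers_by_timestamp(lines):
    """Map each alert-log timestamp to the shared servers reported dead after it.

    Messages seen before the first timestamp line are stored under the key None.
    """
    current = None
    result = {}
    for raw in lines:
        line = raw.strip()
        try:
            # Alert-log timestamp lines look like "Fri Aug 5 10:28:48 2016"
            current = datetime.strptime(line, "%a %b %d %H:%M:%S %Y")
            continue
        except ValueError:
            pass                              # not a timestamp line
        match = DEAD_SERVER.search(line)
        if match:
            result.setdefault(current, []).append(match.group(1))
    return result
```

Passing the open alert log file (for example `dead_servers_by_timestamp(open(path))`, where `path` is wherever the alert log lives on the host) returns a dictionary whose keys can be compared with the kill times reported in /var/log/messages.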
$ grep "Out of memory" messages
Aug 5 10:12:11 xxxxx kernel: Out of memory: kill process 21687 (oracle) score 2296119 or a child
Aug 5 10:12:11 xxxxx kernel: Out of memory: kill process 21668 (oracle) score 2144517 or a child
Aug 5 10:23:09 xxxxx kernel: Out of memory: kill process 21756 (oracle) score 2144517 or a child
Aug 5 10:23:09 xxxxx kernel: Out of memory: kill process 21732 (oracle) score 2138384 or a child
Aug 5 10:28:09 xxxxx kernel: Out of memory: kill process 21752 (oracle) score 2144521 or a child
Aug 5 10:28:09 xxxxx kernel: Out of memory: kill process 21722 (oracle) score 2138377 or a child
Aug 5 10:32:24 xxxxx kernel: Out of memory: kill process 21718 (oracle) score 2135307 or a child
Aug 5 10:32:24 xxxxx kernel: Out of memory: kill process 22053 (oracle) score 2135300 or a child
Aug 5 10:37:54 xxxxx kernel: Out of memory: kill process 22238 (oracle) score 2134274 or a child
Aug 5 10:37:54 xxxxx kernel: Out of memory: kill process 22128 (oracle) score 2133001 or a child
Aug 5 10:38:46 xxxxx kernel: Out of memory: kill process 22227 (oracle) score 2132996 or a child
Aug 5 10:39:14 xxxxx kernel: Out of memory: kill process 22229 (oracle) score 2134280 or a child
Aug 5 10:40:57 xxxxx kernel: Out of memory: kill process 22286 (oracle) score 2135299 or a child
Aug 5 10:41:24 xxxxx kernel: Out of memory: kill process 22245 (oracle) score 2135302 or a child
Aug 5 10:41:25 xxxxx kernel: Out of memory: kill process 22485 (oracle) score 2133009 or a child
Aug 5 10:41:56 xxxxx kernel: Out of memory: kill process 21779 (oracle) score 2132880 or a child
Aug 5 10:42:08 xxxxx kernel: Out of memory: kill process 22068 (oracle) score 2132873 or a child
Aug 5 10:42:26 xxxxx kernel: Out of memory: kill process 22249 (oracle) score 2132873 or a child
Aug 5 10:42:26 xxxxx kernel: Out of memory: kill process 22278 (oracle) score 2132873 or a child
Aug 5 10:42:31 xxxxx kernel: Out of memory: kill process 21662 (oracle) score 2132872 or a child
Aug 5 10:42:47 xxxxx kernel: Out of memory: kill process 22045 (oracle) score 2132872 or a child
Aug 5 10:42:57 xxxxx kernel: Out of memory: kill process 22314 (oracle) score 2132872 or a child
Aug 5 10:43:35 xxxxx kernel: Out of memory: kill process 22336 (oracle) score 2132872 or a child
Aug 5 10:43:35 xxxxx kernel: Out of memory: kill process 22435 (oracle) score 2132870 or a child
Aug 5 10:43:55 xxxxx kernel: Out of memory: kill process 21666 (oracle) score 2132869 or a child
Aug 5 10:44:02 xxxxx kernel: Out of memory: kill process 22263 (oracle) score 2132869 or a child
Aug 5 10:44:19 xxxxx kernel: Out of memory: kill process 22405 (oracle) score 2132866 or a child
Aug 5 10:44:20 xxxxx kernel: Out of memory: kill process 22438 (oracle) score 2132866 or a child
Aug 5 10:44:20 xxxxx kernel: Out of memory: kill process 22453 (oracle) score 2132865 or a child
Aug 5 10:44:23 xxxxx kernel: Out of memory: kill process 22466 (oracle) score 2132737 or a child
Aug 5 10:44:26 xxxxx kernel: Out of memory: kill process 22499 (oracle) score 2132607 or a child
Aug 5 10:44:27 xxxxx kernel: Out of memory: kill process 21716 (oracle) score 1078417 or a child
Aug 5 10:44:27 xxxxx kernel: Out of memory: kill process 21670 (oracle) score 1066455 or a child
Aug 5 10:48:02 xxxxx kernel: Out of memory: kill process 22829 (oracle) score 2134273 or a child
Aug 5 10:49:47 xxxxx kernel: Out of memory: kill process 22900 (oracle) score 2133007 or a child
Aug 5 10:50:36 xxxxx kernel: Out of memory: kill process 22842 (oracle) score 2133095 or a child
Aug 5 10:51:25 xxxxx kernel: Out of memory: kill process 22990 (oracle) score 2134285 or a child
Aug 5 10:51:25 xxxxx kernel: Out of memory: kill process 23054 (oracle) score 2132994 or a child
Aug 5 10:51:49 xxxxx kernel: Out of memory: kill process 22933 (oracle) score 2134277 or a child
Aug 5 10:51:49 xxxxx kernel: Out of memory: kill process 23103 (oracle) score 2132996 or a child
Aug 5 10:52:52 xxxxx kernel: Out of memory: kill process 23211 (oracle) score 2134267 or a child
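With this many entries, a per-minute count makes the bursts easier to see than the raw grep output. The following sketch assumes the syslog line format shown above ("Out of memory: kill process PID (name) score N or a child"). Other kernel versions word this message differently, so the regular expression would need adjusting. The names `OOM_LINE` and `summarize_oom_kills` are hypothetical.

```python
import re
from collections import Counter

# Matches lines such as:
# Aug 5 10:12:11 host kernel: Out of memory: kill process 21687 (oracle) score 2296119 or a child
OOM_LINE = re.compile(
    r"^(?P<month>\w{3})\s+(?P<day>\d+)\s+(?P<minute>\d{2}:\d{2}):\d{2}\s+\S+\s+kernel:\s+"
    r"Out of memory: kill process (?P<pid>\d+) \((?P<name>[^)]+)\) score (?P<score>\d+)"
)

def summarize_oom_kills(lines):
    """Count OOM kill decisions per minute and per process name."""
    per_minute = Counter()
    per_name = Counter()
    for raw in lines:
        match = OOM_LINE.match(raw.strip())
        if not match:
            continue                          # skip unrelated syslog lines
        key = f"{match.group('month')} {match.group('day')} {match.group('minute')}"
        per_minute[key] += 1
        per_name[match.group("name")] += 1
    return per_minute, per_name
```

Feeding the function the lines of /var/log/messages and printing `per_minute.most_common()` shows when the kills were concentrated, and `per_name` confirms whether the victims were almost exclusively oracle processes, as they are in the excerpt above.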
The official document Oracle VM Server hangs after Invoking the OOM Killer and having hundreds of kpartx processes spawned and "state S non-preferred supports toluSnA" reported on the FC LUNs (Doc ID 2123877.1) describes the following case:
APPLIES TO:
Oracle VM - Version 3.3.2 and later
Linux x86-64
SYMPTOMS
On Oracle VM Server, during normal server operation, the server suddenly hangs, with the following kind of messages being logged about invoking the Out Of Memory Killer :