閱識風雲是華為雲信息大咖,擅長將複雜信息多元化呈現,其出品的一張圖(雲圖說)、深入淺出的博文(雲小課)或短視頻(雲視廳)總有一款能讓您快速上手華為雲。更多精彩內容請單擊此處。 摘要:當HDFS集群出現DataNode節點間磁碟利用率不平衡時,會導致MapReduce應用程式無法很好地利用本地計算的優 ...
[20230214]訪問asm相關視圖緩慢的分析2.txt
--//前段時間做awr信息刪除時看到的情況,當時主要精力放在整理awr信息上,對於遇到的問題放在一邊,等到我想分析時,
--//crs crash,所以這篇文章許多內容來自事後的分析.
1.環境:
[email protected]:9014/ywdb> @ ver1
PORT_STRING VERSION BANNER
------------------------------ -------------- --------------------------------------------------------------------------------
x86_64/Linux 2.4.xx 11.2.0.4.0 Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
2.問題提出:
[email protected]:9014/ywdb> @ ashtop sql_id 1=1 &day
Total Distinct Distinct
Seconds AAS %This SQL_ID FIRST_SEEN LAST_SEEN Execs Seen Tstamps
--------- ------- ------- ------------- ------------------- ------------------- ---------- --------
279736 3.2 87% | f632bnsg7382q 2023-02-09 10:55:09 2023-02-10 08:40:00 192 1711
10363 .1 3% | 0sd3pynbkmuhv 2023-02-09 08:42:32 2023-02-10 08:34:01 284 7816
6740 .1 2% | 7crn7t9sw2u19 2023-02-10 08:11:45 2023-02-10 08:40:00 4 1685
5248 .1 2% | 2023-02-09 08:40:14 2023-02-10 08:39:52 2 4836
1685 .0 1% | 0ngfv7gr2cj5v 2023-02-10 08:11:45 2023-02-10 08:40:00 1 1685
1685 .0 1% | 21uwuy3vy423u 2023-02-10 08:11:45 2023-02-10 08:40:00 1 1685
1685 .0 1% | 6yw5r0xnwnc37 2023-02-10 08:11:45 2023-02-10 08:40:00 1 1685
1685 .0 1% | fr9dygfy3az4d 2023-02-10 08:11:45 2023-02-10 08:40:00 1 1685
1685 .0 1% | gvm0actu0bm6d 2023-02-10 08:11:45 2023-02-10 08:40:00 1 1685
517 .0 0% | 327ms8chj9552 2023-02-09 08:40:36 2023-02-10 08:39:42 517 514
...
30 rows selected.
--//主要查看sql_id=f632bnsg7382q的問題.
--//另外 7crn7t9sw2u19 0ngfv7gr2cj5v 21uwuy3vy423u 6yw5r0xnwnc37 fr9dygfy3az4d 也很奇怪僅僅執行1次,需要1685秒.
[email protected]:9014/ywdb> @ sql_id f632bnsg7382q
--SQL_ID = f632bnsg7382q
SELECT 'ASMGROUP' ROWINFO
, t.GROUP_NUMBER
, t.NAME
, t.ALLOCATION_UNIT_SIZE
, ROUND(t.TOTAL_MB/1024, 2)
, ROUND(t.FREE_MB/1024, 2), round(decode(TOTAL_MB, 0, 0,(t.TOTAL_MB-t.FREE_MB)/t.TOTAL_MB*100), 2) usedrate, t.STATE, t.TYPE, t.OFFLINE_DISKS, t.VOTING_FILES
FROM v$asm_diskgroup t;
--//原始代碼一行,自己做了格式化處理.
--//查看後面7crn7t9sw2u19 0ngfv7gr2cj5v 21uwuy3vy423u 6yw5r0xnwnc37 fr9dygfy3az4d都是訪問ask相關視圖語句.猜測asm實例存在
--//問題.我當時2/9號手工執行執行sql_id=f632bnsg7382q語句出現掛起!!
--//其他訪問asm視圖也存在相似問題,例如:
[email protected]:9014/ywdb> @ sql_id 21uwuy3vy423u
--SQL_ID = 21uwuy3vy423u
select path from v$asm_disk;
[email protected]:9014/ywdb> @ sqlhh f632bnsg7382q 100
BEGIN_INTERVAL_TIME INST_ID SQL_ID PLAN_HASH_VALUE EXECUTIONS ELA_MS_PER_EXEC CPU_MS_PER_EXEC ROWS_PER_EXEC LIOS_PER_EXEC BLKRD_PER_EXEC IOW_MS_PER_EXEC AVG_IOW_MS CLW_MS_PER_EXEC APW_MS_PER_EXEC CCW_MS_PER_EXEC
------------------- ---------- ------------- --------------- ---------- --------------- --------------- ------------- ------------- -------------- --------------- ----------- --------------- --------------- ---------------
2023-02-06 05:00:24 2 f632bnsg7382q 3236948515 12 131 8 2.0 0 0 0 0.0 0 0 0
2023-02-06 07:00:03 2 f632bnsg7382q 3236948515 12 167 9 2.0 0 0 0 0.0 0 0 0
2023-02-07 02:00:07 2 f632bnsg7382q 3236948515 12 344 9 2.0 0 0 0 0.0 0 0 0
2023-02-08 06:00:47 2 f632bnsg7382q 3236948515 12 119 9 2.0 0 0 0 0.0 0 0 0
2023-02-09 02:00:10 2 f632bnsg7382q 3236948515 12 168 9 2.0 0 0 0 0.0 0 0 0
--//查詢sqlhh實際上訪問tpt/awr/awr_sqlstats_per_exec.sql腳本查詢awr報表100天的相關信息(當然預設沒有這麼長時間),可以發現
--//實際上執行很快ELA_MS_PER_EXEC最大344,單位毫秒.為什麼呢?理論講這樣的情況根本不可能記錄在awr相關表中或者講當時實例真的
--//不忙!!
--//查看awr_sqlstats_per_exec.sql實際上訪問dba_hist_snapshot,dba_hist_sqlstat視圖.可以跟前面看到的ashtop信息完全對不上.
--//而且每小時執行12次,猜測是某個監測軟體執行的,每5分鐘執行1次.再仔細看發現有記錄的是inst_id=2的信息,我當時的猜測在實例1
--//上的執行掛起了,這樣才有可能解析我當前看到的情況.
--//另外我同事看了awr報表,在實例1的awr報表上確實看不到sql_id=f632bnsg7382q的相關信息.我自己也看了,確實沒有.為什麼不理解!!
--//實例2執行,很快完成:
[email protected]:9015/ywdb/ywdb2> select count(*) from v$asm_diskgroup t;
COUNT(*)
----------
2
--//實例1執行,hang:
[email protected]:9014/ywdb/ywdb1> select count(*) from v$asm_diskgroup t;
--//掛起!!
3.繼續分析:
[email protected]:9014/ywdb/ywdb1> @ ashtop event,inst_id sql_id='f632bnsg7382q' &day
Total Distinct Distinct
Seconds AAS %This EVENT INST_ID FIRST_SEEN LAST_SEEN Execs Seen Tstamps
--------- ------- ------- ------------------------------------------ ---------- ------------------- ------------------- ---------- --------
279544 3.2 100% | 1 2023-02-10 08:26:01 2023-02-10 08:54:14 166 1684
20 .0 0% | 2 2023-02-09 11:05:17 2023-02-10 08:44:14 20 20
7 .0 0% | ASM file metadata operation 2 2023-02-09 10:55:09 2023-02-10 07:33:02 7 7
--//出現ASM file metadata operation等待,根本不是怎麼主要等待事件.但是可以發現實例1 在執行時CPU占用100%,可以確定該類似語
--//句在實例1執行時掛起.
--//這個時候我突然以前測試過連接資料庫的進程在訪問asm相關視圖時會啟動一個訪問asm實例的進程,然後將相關信息返回給資料庫進
--//程.也就是講問題出在訪問asm的連接上.等到我想看asm實例時,crs down掉.無法繼續分析...
4.事後分析:
--//資料庫已經重啟!只能查詢dba_hist_ACTIVE_SESS_HISTORY視圖.
SELECT min(sql_exec_id),max(sql_exec_id)
FROM dba_hist_ACTIVE_SESS_HISTORY
where sql_id = 'f632bnsg7382q' and instance_number=1
MIN(SQL_EXEC_ID) MAX(SQL_EXEC_ID)
---------------- ----------------
16777216 16777381
--//一共產生166,16777381-16777216+1 = 166.
SELECT sql_exec_id,min(sample_time),client_id
FROM dba_hist_ACTIVE_SESS_HISTORY
where sql_id = 'f632bnsg7382q' and instance_number=1 and sql_exec_id between 16777216 and 16777381
group by sql_exec_id,client_id
order by sql_exec_id
SQL_EXEC_ID MIN(SAMPLE_TIME) CLIENT_ID
----------- ----------------------- ---------------
16777216 2023-02-03 09:02:59.411 192.168.101.217
16777217 2023-02-03 09:02:59.411 192.168.101.217
...
16777370 2023-02-03 09:02:59.411 192.168.101.217
16777371 2023-02-03 09:02:59.411 192.168.101.227
16777372 2023-02-03 09:02:59.411 192.168.101.227
16777373 2023-02-03 09:02:59.411 192.168.101.227
16777374 2023-02-03 09:02:59.411 192.168.101.227
16777375 2023-02-03 09:02:59.411 192.168.101.227
16777376 2023-02-03 09:02:59.411 192.168.101.227
16777377 2023-02-03 09:02:59.411 192.168.101.227
16777378 2023-02-03 09:02:59.411 192.168.101.227
16777379 2023-02-03 09:02:59.411 192.168.101.227
16777380 2023-02-03 09:02:59.411 192.168.90.217
16777381 2023-02-09 09:20:24.302 192.168.98.XX
166 rows selected.
--//有點奇怪的是2023-02-03 09:02:59.411時間點有165個會話同時執行.IP來自192.168.101.217,192.168.101.227,192.168.90.217
--//主要來自192.168.101.217.
--//註:最後1個IP=192.168.98.XX是我發現問題手工執行的.
--//問了一下3個IP來自同一個主機,做了負載均衡.
--//我開始以為是我做awr信息刪除導致的問題(當時在做這方面操作),再仔細分析不可能!!因為不大可能一個IP同一個時間點執行相同的
--//sql語句,也就是因為執行時掛起,應該是慢慢累積的,再早的awr信息已經刪除,已經無法確定出現問題的最早時間點.
--//有165個session一直在運行這條語句在實例1,從時間點看比2023-02-03 09:02:59.411還有早已經出現問題,為什麼以後就不再有相同
--//sql語句繼續執行呢?難道因為負載問題移到實例2執行了嗎?
--//找到一個鏈接:https://www.cnblogs.com/jyzhao/p/8379720.html,感覺我遇到的問題跟他很相似.
--//我估計在查詢asm視圖時出現掛起,出現阻塞現象.
--//實際上在看鏈接https://www.cnblogs.com/jyzhao/p/8379720.html時,我執行如下:
$ crsctl check cluster
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
$ crsctl check cluster -all
^C
--//等很久沒有出現結果,按ctrl+c中斷!!估計實際上已經出現問題了.我還沒來得及看鏈接的具體內容,crs down了.
--//也許問題還會再次出現,只能等待下次出現再做分析了!!
--//發現自己不瞭解dba_hist_active_sess_history視圖,sql_exec_start欄位可以確定執行的開始時間.
SELECT DISTINCT sql_exec_id, sql_exec_start
FROM dba_hist_ACTIVE_SESS_HISTORY
WHERE sql_id = 'f632bnsg7382q'
AND instance_number = 1
AND sql_exec_id BETWEEN 16777216 AND 16777381
ORDER BY sql_exec_start;
SQL_EXEC_ID SQL_EXEC_START
----------- -------------------
16777216 2022-10-08 18:37:03
16777217 2022-10-08 18:47:04
16777218 2022-10-08 18:57:03
16777219 2022-10-08 19:07:04
16777220 2022-10-08 19:17:06
16777221 2022-10-08 19:27:06
16777222 2022-10-08 19:37:08
16777223 2022-10-08 19:47:09
16777224 2022-10-08 19:57:09
16777225 2022-10-08 20:07:10
16777226 2022-10-08 20:17:12
16777227 2022-10-08 20:27:12
16777228 2022-10-08 20:37:13
16777229 2022-10-08 20:47:15
16777230 2022-10-08 20:57:16
16777231 2022-10-08 21:07:18
16777232 2022-10-08 21:17:20
16777233 2022-10-08 21:27:21
16777234 2022-10-08 21:37:22
16777235 2022-10-08 21:47:23
16777236 2022-10-08 21:57:24
16777237 2022-10-08 22:07:26
16777238 2022-10-08 22:17:28
16777239 2022-10-08 22:27:30
16777240 2022-10-08 22:37:30
16777241 2022-10-08 22:47:31
16777242 2022-10-08 22:57:34
16777243 2022-10-08 23:07:35
16777244 2022-10-08 23:17:37
16777245 2022-10-08 23:27:39
16777246 2022-10-08 23:37:40
16777247 2022-10-08 23:47:42
16777248 2022-10-08 23:57:51
16777249 2022-10-09 00:07:46
16777250 2022-10-09 00:17:47
16777251 2022-10-09 00:27:50
16777252 2022-10-09 00:37:51
16777253 2022-10-09 00:47:53
16777254 2022-10-09 00:57:56
16777255 2022-10-09 01:07:59
16777256 2022-10-09 01:18:00
16777257 2022-10-09 01:28:02
16777258 2022-10-09 01:38:04
16777259 2022-10-09 01:48:06
16777260 2022-10-09 01:58:08
16777261 2022-10-09 02:18:12
16777262 2022-10-09 02:28:14
16777263 2022-10-09 02:38:17
--// 時間不再連續...
16777264 2022-10-09 08:08:27
16777265 2022-10-09 11:28:53
16777266 2022-10-09 11:48:53
16777267 2022-10-09 15:40:13
16777268 2022-10-09 15:50:13
16777269 2022-10-09 16:10:12
16777270 2022-10-09 16:20:13
16777271 2022-10-09 16:30:17
16777272 2022-10-09 16:40:17
16777273 2022-10-09 16:50:22
16777274 2022-10-09 17:10:24
16777275 2022-10-09 17:20:26
16777276 2022-10-09 17:30:31
16777277 2022-10-09 17:50:33
16777278 2022-10-09 18:00:36
16777279 2022-10-09 18:20:50
16777280 2022-10-09 18:30:41
16777281 2022-10-09 18:40:44
16777282 2022-10-09 18:50:45
16777283 2022-10-09 19:00:48
16777284 2022-10-09 19:10:52
16777285 2022-10-09 19:20:54
16777286 2022-10-09 19:31:09
16777287 2022-10-09 19:40:59
16777288 2022-10-09 19:51:01
16777289 2022-10-09 20:01:07
16777290 2022-10-09 20:11:07
16777291 2022-10-09 20:21:10
16777292 2022-10-09 20:31:13
16777293 2022-10-09 20:41:15
16777294 2022-10-09 20:51:30
16777295 2022-10-09 21:01:22
16777296 2022-10-09 21:11:25
16777297 2022-10-09 21:21:40
16777298 2022-10-09 21:31:32
16777299 2022-10-09 21:41:32
16777300 2022-10-09 21:51:35
16777301 2022-10-09 22:01:53
16777302 2022-10-09 22:11:42
16777303 2022-10-09 22:21:58
16777304 2022-10-09 22:32:31
16777305 2022-10-09 22:43:09
16777306 2022-10-09 22:53:40
16777307 2022-10-09 23:04:12
16777308 2022-10-09 23:14:43
16777309 2022-10-09 23:25:05
16777310 2022-10-09 23:35:27
16777311 2022-10-09 23:46:07
16777312 2022-10-09 23:59:12
16777313 2022-10-10 00:09:27
16777314 2022-10-10 00:19:28
16777315 2022-10-10 00:29:16
16777316 2022-10-10 00:39:32
16777317 2022-10-10 00:49:33
16777318 2022-10-10 00:59:21
16777319 2022-10-10 01:09:41
16777320 2022-10-10 01:19:43
16777321 2022-10-10 01:29:31
16777322 2022-10-10 01:39:45
16777323 2022-10-10 01:49:49
16777324 2022-10-10 01:59:52
16777325 2022-10-10 02:09:53
16777326 2022-10-10 02:19:54
16777327 2022-10-10 02:29:56
16777328 2022-10-10 02:39:58
16777329 2022-10-10 02:50:01
16777330 2022-10-10 03:00:01
16777331 2022-10-10 03:10:03
16777332 2022-10-10 03:20:07
16777333 2022-10-10 03:30:09
16777334 2022-10-10 03:40:12
16777335 2022-10-10 03:50:13
16777336 2022-10-10 04:00:16
16777337 2022-10-10 04:10:18
16777338 2022-10-10 04:20:19
16777339 2022-10-10 04:30:21
16777340 2022-10-10 04:40:22
16777341 2022-10-10 04:50:25
16777342 2022-10-10 05:00:32
16777343 2022-10-10 05:10:33
16777344 2022-10-10 05:20:33
16777345 2022-10-10 05:30:34
16777346 2022-10-10 05:40:39
16777347 2022-10-10 05:50:41
16777348 2022-10-10 06:00:44
16777349 2022-10-10 06:10:45
16777350 2022-10-10 06:20:49
16777351 2022-10-10 06:30:52
16777352 2022-10-10 06:40:53
16777353 2022-10-10 06:50:55
16777354 2022-10-10 07:01:00
16777355 2022-10-10 07:11:02
16777356 2022-10-10 07:21:05
16777357 2022-10-10 07:31:08
16777358 2022-10-10 07:41:12
16777359 2022-10-10 07:51:15
16777360 2022-10-10 08:01:07
16777361 2022-10-10 08:11:11
16777362 2022-10-10 08:21:30
16777363 2022-10-10 08:31:33
16777364 2022-10-10 08:41:38
16777365 2022-10-10 08:51:52
16777366 2022-10-10 09:02:31
16777367 2022-10-10 09:13:51
16777368 2022-10-10 09:30:25
16777369 2022-10-10 09:59:14
16777370 2022-10-10 11:34:56
16777371 2022-10-17 09:06:15
16777372 2022-10-17 09:13:06
16777373 2022-10-17 09:23:08
16777374 2022-10-17 09:23:09
16777375 2022-10-17 09:26:17
16777376 2022-10-17 09:32:48
16777377 2022-10-17 09:43:08
16777378 2022-10-17 09:53:08
16777379 2022-10-17 10:03:09
16777380 2023-01-30 16:39:02
16777381 2023-02-09 09:20:13
166 rows selected.
--//10分鐘1次.1小時6次.而前面sqlhh看到每小時12次.難道正常情況下另外一個實例每小時也執行6次.
--//出現負載均衡問題,導致後面的許多執行發生在實例2嗎?
--//最早出現的時間點是2022-10-08 18:37:03.到現在才發現,不知道如何管理資料庫的.
--//轉抄鏈接的內容,也許以後會用上:
有了上面的信息,再次查詢MOS就可以匹配到 - Diskgroup Mount Hangs with RBAL Waiting on 'GPnP Get Item' and 'enq: DD -
contention' (文檔 ID 1375505.1) - "crsctl check cluster -all" command gives CRS-4404, CRS-4405 errors (文檔 ID
1392934.1)
參照MOS解決方案:
Use the following command as root on Compute Node 3 to identify the current gpnpd.bin process:
ps -ef | grep gpnpd.bin
This will give you an output similar to:
ps -ef | grep pnp
oracle 31422 1 0 2011 ? 00:05:40 /u01/app/11.2.0.2/grid/bin/gpnpd.bin
Please note the process ID number of the current gpnpd.bin process. In this example, = 31422
Use the following command on Compute Node 3 to stop the current gpnpd.bin process:
kill -HUP
for example, kill -HUP 31422
--//補充.
$ kill -l
1) SIGHUP 2) SIGINT 3) SIGQUIT 4) SIGILL 5) SIGTRAP
6) SIGABRT 7) SIGBUS 8) SIGFPE 9) SIGKILL 10) SIGUSR1
11) SIGSEGV 12) SIGUSR2 13) SIGPIPE 14) SIGALRM 15) SIGTERM
16) SIGSTKFLT 17) SIGCHLD 18) SIGCONT 19) SIGSTOP 20) SIGTSTP
21) SIGTTIN 22) SIGTTOU 23) SIGURG 24) SIGXCPU 25) SIGXFSZ
26) SIGVTALRM 27) SIGPROF 28) SIGWINCH 29) SIGIO 30) SIGPWR
31) SIGSYS 34) SIGRTMIN 35) SIGRTMIN+1 36) SIGRTMIN+2 37) SIGRTMIN+3
38) SIGRTMIN+4 39) SIGRTMIN+5 40) SIGRTMIN+6 41) SIGRTMIN+7 42) SIGRTMIN+8
43) SIGRTMIN+9 44) SIGRTMIN+10 45) SIGRTMIN+11 46) SIGRTMIN+12 47) SIGRTMIN+13
48) SIGRTMIN+14 49) SIGRTMIN+15 50) SIGRTMAX-14 51) SIGRTMAX-13 52) SIGRTMAX-12
53) SIGRTMAX-11 54) SIGRTMAX-10 55) SIGRTMAX-9 56) SIGRTMAX-8 57) SIGRTMAX-7
58) SIGRTMAX-6 59) SIGRTMAX-5 60) SIGRTMAX-4 61) SIGRTMAX-3 62) SIGRTMAX-2
63) SIGRTMAX-1 64) SIGRTMAX
--//註:HUP 相當於1.這裡實際的執行效果相當於-9參數.
After a few moments, gpnpd.bin should be automatically restarted by GI, run the command from step # 1 to verify that a
new gpnpd.bin process exists