hanganalyze3分析总结(测)

原创: 2016-06-07 15:58:49 983浏览

hanganalyze 3 分析总结一种是会话级别的： ALTER SESSION SET EVENTS immediate trace name HANGANALYZE level level; 一种是实例级别: ORADEBUG hanganalyze level 一种是集群范围的： ORADEBUG setmypid ORADEBUG setinst all ORADEBUG -g def hanganaly

hanganalyze 3 分析总结
一种是会话级别的：
ALTER SESSION SET EVENTS 'immediate trace name HANGANALYZE level ';

一种是实例级别:
ORADEBUG hanganalyze

一种是集群范围的：
ORADEBUG setmypid
ORADEBUG setinst all
ORADEBUG -g def hanganalyze

先解释下各个level的含义：

1-2:只有hanganalyze输出，不dump任何进程
3：Level2+Dump出在IN_HANG状态的进程【最简单，最容易判断问题】
4：Level3+Dump出在等待链里面的blockers(状态为LEAF/LEAF_NW/IGN_DMP)
5：Level4+Dump出所有在等待链中的进程（状态为NLEAF）

hanganalyze报告会分作许多片断，会话片断信息总是由一个header详尽描述被提取的的会话信息。Oracle8i和9i的信息略有不同：

sid是 Session ID
sess_srno是serial#
proc_ptr是Process Pointer
ospid 是OS Process ID
cnode是Node Id，Oracle9i才用
Nodenum是hanganalyze
自己为了记录这些会话而定制的编号，从0开始排起。
State 是node的状态
Adjlist是临近的node(通常代表一个blocker node)
Predecessor是Predecessor node ,通常代表一个 waiter node

接着解释一下比较重要的一些node state:
IN_HANG：这表示该node处于死锁状态，通常还有其他node（blocker）也处于该状态
LEAF/LEAF_NW：该node通常是blocker。通过条目的”predecessor”列可以判断这个node是否是blocker。
LEAF说明该NODE没有等待其他资源，而
LEAF_NW则可能是没有等待其他资源或者是在使用CPU.
如下的实例说明了node16阻塞了node19的资源：

nodenum]/cnode/sid/sess_srno/session/
ospid/state/start/finish/[adjlist]/predecessor
[16]/0/17/154/0x24617be0/26800/LEAF/29/30//19
[19]/0/20/13/0x24619830/26791/NLEAF/33/34/[16]/186

NLEAF：通常可以看作这些会话是被阻塞的资源。发生这种情况一般说明数据库发生性能问题而不是数据库hang
IGN/IGN_DMP：这类会话通常被认为是空闲会话，除非其adjlist列里存在node。
如果是非空闲会话则说明其adjlist里的node正在等待其他node释放资源。
SINGLE_NODE/SINGLE_NODE_NW：近似于空闲会话

实战hanganalyze:

测试案例

C:\Users\Administrator>SQLPLUS / AS SYSDBA
SQL*Plus: Release 10.2.0.4.0 - Production on Sat May 28 15:30:03 2011
Copyright (c) 1982, 2007, Oracle. All Rights Reserved.
数据库已经hang 了登录失败下面来做个 systemdump 吧看看是什么原因

C:\Users\Administrator>sqlplus -prelim '/ as sysdba'
SQL*Plus: Release 10.2.0.4.0 - Production on Sat May 28 15:31:13 2011
Copyright (c) 1982, 2007, Oracle. All Rights Reserved.
Enter password:
SQL> oradebug setmypid
Statement processed.
SQL> oradebug unlimit;
Statement processed.
SQL> oradebug dump systemstate 10
Statement processed.
SQL> oradebug tracefile_name
c:\oracle\product\10.2.0\admin\systex\udump\systex_ora_4636.trc
SQL> oradebug hanganalyze 3

Hang Analysis in c:\oracle\product\10.2.0\admin\systex\udump\systex_ora_6824.trc

先来看看 hanganalyze 的trace
Chain 1 : :
//// 看来是 139 阻塞了一下的会话了.
--
Other chains found:

Chain 2 : :

Chain 3 : :

Chain 4 : :

Chain 5 : :

Chain 6 : :

Chain 7 : :

Chain 8 : :

OK ,我们在 systemdump 里面搜索 2932

O/S info: user: SYSTEM, term: WILLSON-LI, ospid: 2932
OSD pid info: Windows threadid: 2932, image: ORACLE.EXE (SHAD)
Dump of memory from 0x000007FF18249EC0 to 0x000007FF1824A0C8
7FF18249EC0 00000006 00000000 17615728 000007FF [........(Wa.....]
7FF18249ED0 00000010 000313A7 183ADF60 000007FF [........`.:.....]
7FF18249EE0 00000003 000313A7 1798B0E8 000007FF [................]
7FF18249EF0 0000000B 000313A7 1836A1C0 000007FF [..........6.....]
7FF18249F00 00000004 0003129B 16835088 000007FF [.........P......]
7FF18249F10 00000007 000313A7 16836020 000007FF [........ `......]
7FF18249F20 00000007 000313A7 00000000 00000000 [................]
7FF18249F30 00000000 00000000 00000000 00000000 [................]
Repeat 24 times
7FF1824A0C0 00000000 00000000 [........]
(FOB) flags=2 fib=000007FF169BB290 incno=0 pending i/o cnt=0
fname=C:\ORACLE\PRODUCT\10.2.0\ORADATA\SYSTEX\CONTROL04.CTL
fno=0 lblksz=16384 fsiz=430
(FOB) flags=2 fib=000007FF169BB630 incno=0 pending i/o cnt=0
fname=C:\ORACLE\PRODUCT\10.2.0\ORADATA\SYSTEX\SYSTEM01.DBF
fno=1 lblksz=8192 fsiz=64000
----------------------------------------
SO: 000007FF1836A1C0, type: 4, owner: 000007FF18263170, flag: INIT/-/-/0x00
(session) sid: 139 trans: 000007FF16ADE418, creator: 000007FF18263170, flag: (100041) USR/- BSY/-/-/-/-/-
DID: 0001-0010-00000029, short-term DID: 0000-0000-00000000
txn branch: 0000000000000000
oct: 0, prv: 0, sql: 0000000000000000, psql: 000007FF17EABB40, user: 0/SYS
service name: SYS$USERS
O/S info: user: WILLSON-LI\Administrator, term: WILLSON-LI, ospid: 5588:5372, machine: WORKGROUP\WILLSON-LI
program: sqlplus.exe
application name: sqlplus.exe, hash value=254292535
waiting for 'latch: shared pool' blocking sess=0x0000000000000000 seq=273 wait_time=0 seconds since wait started=129
address=c972f38, number=d6, tries=0 ////--- 在等待 latch c972f38
Dumping Session Wait History
for 'SQL*Net message from client' count=1 wait_time=61829702
driver id=42455100, #bytes=1, =0
for 'SQL*Net message to client' count=1 wait_time=3
driver id=42455100, #bytes=1, =0
for 'ksdxexeotherwait' count=1 wait_time=2
=0, =0, =0
for 'ksdxexeotherwait' count=1 wait_time=2
省略... 一直在等待 ksdxexeotherwait
---我们再搜索 c972f38
SO: 000007FF18263170, type: 2, owner: 0000000000000000, flag: INIT/-/-/0x00
(process) Oracle pid=16, calls cur/top: 000007FF183ADF60/000007FF183ADF60, flag: (0) -
int error: 0, call error: 0, sess error: 0, txn error 0
(post info) last post received: 0 0 0
last post received-location: No post
last process to post me: none
last post sent: 0 0 0
last post sent-location: No post
last process posted by me: none
(latch info) wait_event=0 bits=20
Location from where call was made: kghalo:
waiting for c972f38 Child shared pool level=7 child#=1
Location from where latch is held: kghupr1: Chunk Header
Context saved from call: 343209960
state=busy, wlstate=free
waiters [orapid (seconds since: put on list, posted, alive check)]:
11 (283, 1306567910, 283)
15 (268, 1306567910, 268)
10 (268, 1306567910, 268)
16 (129, 1306567910, 129)
23 (120, 1306567910, 120)
24 (108, 1306567910, 108)
waiter count=6
gotten 586960 times wait, failed first 238 sleeps 8
gotten 0 times nowait, failed: 0
on wait list for c972f38----------------------------------------------
holding (efd=7) 15e479a0 Child library cache level=5 child#=2
Location from where latch is held: kglobpn: child:: latch
Context saved from call: 1
state=busy, wlstate=free

可以看出来 SO ： 000007FF18263170 holding 了 Child library cache 在等待 Child shared pool

SQL> oradebug poke 0xc972f38 4 0
BEFORE: [00C972F38, 00C972F3C) = 00000001
AFTER: [00C972F38, 00C972F3C) = 00000000

其他：

1、10G以前，使用DBX,GDB之类的工具：
找到一个前台进程的PID（ps -ef|grep LOCAL)，然后用gdb $ORACLE_HOME/bin/oracle进行跟踪
oracle@gurufl02:~> ps -ef|grep LOCAL
oracle 26358 26357 0 09:52 ? 00:00:00 oracleora9 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle 27848 27810 0 10:41 pts/3 00:00:00 grep LOCAL
oracle@gurufl02:~> gdb $ORACLE_HOME/bin/oracle 26358

然后执行：print ksudss(10)
( gdb) print ksudss(10)
[Switching to Thread 4154533552 (LWP 26358)]
warning: Unable to restore previously selected frame.
$1 = 181773020
Current language: auto; currently asm
这个时候在UDUMP或者BDUMP下会生成一个TRACE文件。里面包含SYSTEM STATE DUMP
除了通过print ksudss(10)进行systemstate dump，还可以进行下面的dump
　　print ksdhng(3,1,0) 相当于oradebug hanganalyze 3
　　print ksudps(10) 相当于oradebug dump processstate 10
　　print curdmp() 相当于oradebug call curdmp（也就是oradebug dump cursordump)
　　print ksdtrc(4) 相当于oradebug dump events 4（这里参数表示level,1–session,2–process,4–system)
　　print ksdsel(10046,12) –相当于为attach的进程设置10046事件level 12
　　print skdxipc() –相当于oradebug ipc
　　print skdxprst() –相当于oradebug procstat
　　当然如果能用oradebug，就应该使用oradebug，毕竟方便得多，也更安全。

2、10G下sqlplus -prelim
如果10G，可以使用sqlplus -prelim选项强制登录
export ORACLE_SID=ora9
sqlplus -prelim '/ as sysdba'
oradebug setmypid
oradebug unlimit;
oradebug dump systemstate 10

3、如果上述都不行，可以使用pstack把堆栈信息取出来，另外也可以使用truss, tusc，strace 之类的工具看看到底在等什么
数据库HANG住是计较头痛的事情，如何找到HANG住的原因，是DBA必须面临的课题