To find information about the different components in an Exadata system:
$ cd /opt/oracle.Support/onecommand
$ cat dbm.dat
ILOM
There are two options: a graphical web interface and a command line.
You can use a web browser to access the ILOM web interface:
https://11.209......./ipages/ologing.asp
and log in with the root user and password.
From ILOM you can (a command-line sketch follows the list):
- Identify hardware errors and faults
- Remotely control the power of the node
- View the graphical and non-graphical console of the host
- View the current status of sensors and indicators of the system
- Identify the hardware configuration of the system
- Receive alerts that are generated about system events
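Most of these checks can also be done from the ILOM command line. A minimal sketch, assuming you can ssh to the ILOM as root (the hostname is a placeholder):
$ ssh root@<ilom-hostname>
-> show faulty        (list open problems / hardware faults)
-> start /SP/console  (attach to the host console)
-> stop /SYS          (power the host off)
-> start /SYS         (power the host on)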
To check Clusterware resource status
./crsctl stat res -t
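A related check, offered as a hedged example, is to verify that the Clusterware stack itself is up on all nodes:
./crsctl check cluster -all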
OS log
/var/log/messages
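To scan the OS log on every node at once, a sketch using dcli (the search pattern is only an illustration; all_group is the group file used elsewhere in this document):
dcli -l root -g ./all_group "grep -i error /var/log/messages | tail -5"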
To check the syslog configuration
/etc/syslog.conf
Starting CellCLI
cellcli [port_number] [-n] [-m] [-xml] [-v | -vv | -vvv] [-x] [-e command]
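A few common examples, offered as a sketch (these are standard CellCLI commands run on a storage cell):
cellcli -e list cell detail
cellcli -e list physicaldisk
cellcli -e list alerthistory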
Examine TOP output for kswapd
/opt/oracle.Exawatcher/osw/archive/oswtop
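To search the archived top output for kswapd, a hedged sketch (file names and compression vary by ExaWatcher/OSWatcher version); sustained kswapd CPU time is a sign of memory pressure:
cd /opt/oracle.Exawatcher/osw/archive/oswtop
grep kswapd *.dat 2>/dev/null
bzgrep kswapd *.bz2 2>/dev/null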
Memory Utilization
cat /proc/meminfo | egrep '^MemTotal:|^MemFree:|^Cached:'
MemTotal: 1540864 kB
MemFree: 71520 kB
Cached: 979324 kB
To check HugePages
note: Compute and Cell nodes should also be checked to ensure huge pages are configured.
# grep ^Huge /proc/meminfo
HugePages_Total: 22960
HugePages_Free: 2056
HugePages_Rsvd: 2016
HugePages_Surp: 0
Hugepagesize: 2048 kB
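In this example roughly (22960 - 2056) x 2 MB, about 41 GB, of huge pages are in use. To run the same check on every node at once, a sketch using dcli with the group file used elsewhere in this document:
dcli -l root -g ./all_group "grep ^Huge /proc/meminfo"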
VMSTAT
On a Compute node, go to /opt/oracle.Exawatcher/osw/archive/oswvmstat.
Zero swapping is needed to achieve stable and good system performance.
note: Example of swapping: on a healthy system the swpd column would contain only 0’s.
vmstat 60
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
1 0 652 116216 132176 1473568 0 0 81 177 462 928 20 6 69 5
0 0 652 115492 132176 1473572 0 0 1 13 488 1184 18 7 57 17
0 0 652 114624 132176 1473572 0 0 0 0 363 752 12 4 84 0
ILOM Integrated Lights Out Manager
is a dedicated service processor that is used to manage and monitor servers. Each Cell server, Compute node, and InfiniBand switch will have a dedicated ILOM. There are several places to view errors and messages with ILOM. The first is with the web management console. From within the web console select “Open Problems.”
From the host, using ipmitool, to check the last 10 events:
ipmitool sel list 10
Network Status
srvctl status vip -n node1
To check whether any interface is down
dcli -l root -g ./all_group "ifconfig -a | grep DOWN"
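An alternative sketch using the ip utility, whose output reports an explicit link state (state strings can vary by driver and OS release):
dcli -l root -g ./all_group "ip link show | grep 'state DOWN'"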
Disk Status
# dcli -g all_group -l root /opt/MegaRAID/MegaCli/MegaCli64 AdpAllInfo -aALL | grep "Device Present" -A 8
slcb01db07: Device Present
slcb01db07: ================
slcb01db07: Virtual Drives
slcb01db07: Degraded
slcb01db07: Offline
slcb01db07: Physical Devices : 5
slcb01db07: Disks : 4
slcb01db07: Critical Disks : 0
slcb01db07: Failed Disks
--
slcb01db08: Device Present
slcb01db08: ================
slcb01db08: Virtual Drives
slcb01db08: Degraded
slcb01db08: Offline
slcb01db08: Physical Devices : 5
slcb01db08: Disks : 4
slcb01db08: Critical Disks : 0
slcb01db08: Failed Disks : 0
--
slcb01cel12: Device Present
slcb01cel12: ================
slcb01cel12: Virtual Drives
slcb01cel12: Degraded
slcb01cel12: Offline
slcb01cel12: Physical Devices : 14
slcb01cel12: Disks : 12
slcb01cel12: Critical Disks : 0
slcb01cel12: Failed Disks
--
slcb01cel13: Device Present
slcb01cel13: ================
slcb01cel13: Virtual Drives
slcb01cel13: Degraded
slcb01cel13: Offline
slcb01cel13: Physical Devices : 14
slcb01cel13: Disks : 12
slcb01cel13: Critical Disks : 0
slcb01cel13: Failed Disks : 0
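On the storage cells, disk health can also be cross-checked from CellCLI, for example (a sketch; the cell_group file is an assumption):
dcli -l root -g ./cell_group "cellcli -e list physicaldisk attributes name,status"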
CheckHWnFWProfile
is a program that validates whether hardware and firmware on the Compute nodes and Storage Nodes are all supported configurations. This only takes a few seconds to run and can help identify issues such as unsupported disks as demonstrated below. Note that Exachk will also execute this command to check for issues.
dcli -l root -g ./all_group "/opt/oracle.SupportTools/CheckHWnFWProfile"
Service
To check the SCAN listener status:
lsnrctl status LISTENER_SCAN2
Database Free Buffer Waits
A very important metric to monitor is the “free buffer wait” wait event time. Free buffer waits indicate that a database
process was not able to find a free buffer into which to perform a read operation. This occurs when the DBWR
process can’t write blocks to storage fast enough. “Free buffer waits” are an indication that the write rate of the I/O
system is maxed out or is close to being maxed out. If this statistic appears in the top 5 wait events, then proactive
action should be taken to reduce the write rate or increase the I/O capacity of storage.
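A quick way to see how much time has been spent on this event since instance startup, offered as a sketch, is to query v$system_event (time_waited is in centiseconds):
SQL> select event, total_waits, time_waited from v$system_event where event = 'free buffer waits';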
Have Changes Occurred in the Environment?
Change Management
- Recent Oracle patching (Operating System, Database, Cell server, Clusterware, etc.)
- Newly deployed applications
- Code changes to existing applications
- Other changes in usage (e.g. new users added)
- Oracle configuration changes
- Operating system configuration changes
- Migration to a new platform
- Expansion of the environment
- Addition of other InfiniBand devices to the fabric
- Changes in resource management plan
Compare configuration files
$ strings spfileemrep.ora > spfileemrep.ora.txt
$ strings spfileemrep.ora_072513_0100 > spfileemrep.ora_072513_0100.txt
$ diff spfileemrep.ora.txt spfileemrep.ora_072513_0100.txt
Checking changes to the kernel tunable parameters
dcli -l root -g ./dbs_group "sysctl -a > /tmp/sysctl.current;diff /root/<baseline kernel configuration file> /tmp/sysctl.current"
note: It is normal for some parameters to change dynamically. So the above output should be carefully analyzed to determine if the delta from the diff output is relevant to the issues being experienced.
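The baseline itself has to be captured once while the system is healthy, for example (the baseline path is an assumption):
dcli -l root -g ./dbs_group "sysctl -a > /root/sysctl.baseline"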
AWR Data
You can report the differences between two time periods with awrddrpt.sql, located in
/u01/app/oracle/product/12.2.0.1/db_1/rdbms/admin
and compare metrics such as (an example invocation follows the list):
- number of users
- number of transactions
- redo rate
- physical reads per transaction
- physical writes per transaction
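The script can be run from SQL*Plus, for instance (a sketch; it prompts for the two snapshot ranges and a report name):
$ sqlplus / as sysdba
SQL> @?/rdbms/admin/awrddrpt.sql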
TOP command
top - 20:44:36 up 2:28, 1 user, load average: 0.02, 0.04, 0.05
Tasks: 176 total, 2 running, 174 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.0 us, 0.3 sy, 0.0 ni, 99.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 1540864 total, 74352 free, 407024 used, 1059488 buff/cache
KiB Swap: 5300220 total, 5300192 free, 28 used. 864920 avail Mem
Check if Compute node is CPU bound
Evaluate load average per core = # of runnable processes per core
The 3 load-average values are the 1-minute, 5-minute, and 15-minute averages.
- Question: Is load average of 80 high?
- Answer: It depends.
Compute load/core = 283 / 12 ~= 23 runnable processes per core
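A quick way to compute this on a node, as a sketch using /proc/loadavg (the first field is the 1-minute load average):
nproc
cat /proc/loadavg
awk -v cores=$(nproc) '{printf "runnable per core: %.1f\n", $1 / cores}' /proc/loadavg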
Note that Compute nodes that are CPU bound will incorrectly show high I/O wait times, because the process that issues an I/O will not be rescheduled immediately when the I/O completes; CPU scheduling time is therefore measured as part of I/O wait time. As a result, I/O response times measured at the database level are not accurate when the CPU is maxed out, so it is important to rule out CPU contention as documented above.
I/O Performance
Check if cells are I/O bound