unixadmin.free.fr just another IBM blog and technotes backup

7mar/12

Server running AIX with Oracle RAC reboots itself

SOURCE: Technote T1011228

Problem (Abstract)

Server running AIX with Oracle RAC reboots itself with no warning.

Symptom

AIX server shuts down and/or reboots.

A REBOOT_ID is logged in /var/adm/ras/errlog indicating "SYSTEM SHUTDOWN BY USER" although no shutdown or reboot command was issued by any user.

Example error message:

LABEL: REBOOT_ID
IDENTIFIER: 2BFA76F6

Date/Time: Wed Dec 3 08:19:09 2008
Sequence Number: 1447
Machine Id: 0000ABCD1234
Node Id: nodeA
Class: S
Type: TEMP
Resource Name: SYSPROC

Description
SYSTEM SHUTDOWN BY USER

Probable Causes
SYSTEM SHUTDOWN

Detail Data
USER ID
0
0=SOFT IPL 1=HALT 2=TIME REBOOT
0
TIME TO REBOOT (FOR TIMED REBOOT ONLY)
0
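
To look for entries like the one above, the AIX errpt command can be filtered by error identifier. The sketch below is only an illustration: list_reboot_ids is a hypothetical helper name, and it assumes the summary format printed by errpt, where the identifier is the first column. The errpt call itself is guarded so the snippet does nothing on a system without it.

```shell
# Hypothetical helper: read `errpt` summary output on stdin and print
# only the lines whose first column is the REBOOT_ID identifier.
list_reboot_ids() {
    awk '$1 == "2BFA76F6"'
}

# errpt exists on AIX only, so guard the call.
if command -v errpt >/dev/null 2>&1; then
    # One summary line per logged REBOOT_ID entry.
    errpt | list_reboot_ids
    # For the full detail of each entry, run instead:
    #   errpt -a -j 2BFA76F6
fi
```

If several REBOOT_ID entries appear at times when nobody issued a shutdown, the oprocd configuration described below is worth checking.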

Cause

Oracle Real Application Clusters (RAC) is known to reboot the operating system with no warning, depending on the configuration of the oprocd daemon.

Environment

AIX with Oracle RAC

Diagnosing the problem

Oracle Real Application Clusters (RAC) typically runs a process called oprocd.

The idea behind oprocd is straightforward: its goal is to provide I/O fencing. oprocd works by setting a timer and then sleeping. If, when it wakes up and is scheduled onto a CPU again, it sees that more time has passed than the acceptable margin, it decides to reboot the node.
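
The wake-up check can be pictured with a small shell function. This is only a sketch of the decision described above, not Oracle's implementation; check_wakeup and its three arguments (sleep interval, margin and measured elapsed time, all in milliseconds) are hypothetical names introduced for illustration.

```shell
# Illustrative only: not how oprocd is actually implemented.
# check_wakeup INTERVAL_MS MARGIN_MS ELAPSED_MS
# Prints "reboot" when the elapsed time exceeds interval + margin,
# otherwise prints "ok".
check_wakeup() {
    if [ "$3" -gt $(( $1 + $2 )) ]; then
        echo "reboot"
    else
        echo "ok"
    fi
}

check_wakeup 1000 500 1200   # woke up 200 ms late: within the margin
check_wakeup 1000 500 1800   # woke up 800 ms late: margin exceeded
```

A node that is paging heavily or starved of CPU can delay the wake-up long enough to land in the second case even though nothing is wrong with the cluster itself.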

You can check for the oprocd process with the ps command:

# ps -ef | grep oprocd
root 221672 1 0 08:27:44 - 0:00 /u01/crs/oracle/product/10.2.0/crs_1/bin/oprocd run -t 1000 -m 500 -f

These options tell oprocd to wake up every 1000 ms (-t 1000) and to allow a margin of error of up to 500 ms on the wake-up time (-m 500). In other words, if oprocd wakes up more than 1.5 seconds after going to sleep, it forces a reboot.
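
As a rough aid, the threshold can be read straight off the oprocd command line. The function below is a hypothetical helper, not part of Oracle Clusterware; it assumes the -t and -m values appear as plain integers in milliseconds, as in the ps output shown in this note.

```shell
# Hypothetical helper: take an oprocd command line and print the
# interval, the margin and their sum (the reboot threshold) in ms.
oprocd_threshold() {
    line=$1
    # Pull out the integers that follow -t and -m.
    t=$(printf '%s\n' "$line" | sed -n 's/.* -t \([0-9][0-9]*\).*/\1/p')
    m=$(printf '%s\n' "$line" | sed -n 's/.* -m \([0-9][0-9]*\).*/\1/p')
    # Fail if either option is missing.
    [ -n "$t" ] && [ -n "$m" ] || return 1
    echo "interval=${t}ms margin=${m}ms threshold=$((t + m))ms"
}

oprocd_threshold "/u01/crs/oracle/product/10.2.0/crs_1/bin/oprocd run -t 1000 -m 500 -f"
# prints: interval=1000ms margin=500ms threshold=1500ms
```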

Resolving the problem

The timeout and margin times are computed from the diagwait and reboot time settings. Changing them by editing the init.cssd file is not recommended; use the 'crsctl set css diagwait' command instead.

There is a formula involved in the calculation of the times. For example, if the reboot time is 3 and you submit a diagwait setting of 13 you will get -t 1000 -m 10000.
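
The technote does not spell the formula out. One relationship that is consistent with the example is margin = (diagwait - reboot time) x 1000 ms, with both settings in seconds. The sketch below encodes that relationship as an assumption only; oprocd_margin_ms is a hypothetical helper name, and the actual calculation should be confirmed with Oracle Support.

```shell
# Assumption, not confirmed by the technote:
#   margin (ms) = (diagwait - reboottime) * 1000, both given in seconds.
# oprocd_margin_ms DIAGWAIT_S REBOOTTIME_S
oprocd_margin_ms() {
    echo $(( ($1 - $2) * 1000 ))
}

oprocd_margin_ms 13 3
# prints: 10000
```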

# crsctl set css diagwait 13 -force

# ps -ef | grep oprocd
root 221672 1 0 08:27:44 - 0:00 /u01/crs/oracle/product/10.2.0/crs_1/bin/oprocd run -t 1000 -m 10000 -f

You can see that the margin has changed to 10000 ms, that is, 10 seconds instead of the default 0.5 seconds. This 20-fold increase gives oprocd more time to determine whether the node needs to be rebooted.

IBM recommends the customer contact Oracle Support before modifying this value.

IBM and Oracle came to the agreement that a diagwait value of 13 is a suitable value if the best practices are used...

IBM recommends that customers follow best practices and, if possible, update to AIX 6.1 or AIX 7.1 with current Technology Levels, which include the new non-pageable kernel, as the preferred corrective action.

The Oracle master document can be found here...
