Oracle DBA Tips Corner
Recover Corrupt/Missing OCR with No Backup - (Oracle 10g)
by Jeff Hunter, Sr. Database Administrator
Overview
It happens. Not very often, but it can happen. You are faced with a
corrupt or missing Oracle Cluster Registry (OCR) and have no backup
to recover from. So, how can
something like this occur? We know that the CRSD process is responsible
for creating backup copies of the OCR every 4 hours from the master
node in the CRS_home/cdata directory. These backups are meant
to be used to recover the OCR from a lost or corrupt OCR file using the
ocrconfig -restore
command, so how is it possible to be in a situation where the OCR
needs to be recovered and you have no viable backup?
Well, consider a scenario where you add a node to the cluster and,
before the next automatic backup is taken (that is, within 4 hours),
you find the OCR has been corrupted. You may have forgotten to create a
logical export
of the OCR before adding the new node or worse yet, the logical
export you took is also corrupt. In either case, you are left with a
corrupt OCR and no recent backup. Talk about a bad day!
Another possible scenario could be a shell script that wrongly deletes
all available backups. Talk about an even worse day.
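One way to reduce the chance of ending up here is to take a logical export immediately before any change to the cluster. The following is only a minimal sketch and not part of the original procedure: the function names and the /u01/backup/ocr directory are assumptions made for illustration, ORA_CRS_HOME is assumed to point at the Clusterware home, and the ocrconfig calls need to be run as root.

```shell
#!/bin/sh
# Hypothetical helper: build a timestamped file name for an OCR export.
# $1 = target directory (an assumption), $2 = timestamp string.
ocr_export_name() {
    printf '%s/ocr_export_%s.dmp\n' "$1" "$2"
}

# Take a logical export of the OCR. Run as root before a risky change
# such as adding a node. The default directory is illustrative only.
export_ocr() {
    target="$(ocr_export_name "${1:-/u01/backup/ocr}" "$(date +%Y%m%d_%H%M%S)")"
    "$ORA_CRS_HOME/bin/ocrconfig" -showbackup || return 1        # list the automatic backups
    "$ORA_CRS_HOME/bin/ocrconfig" -export "$target" || return 1  # write the logical export
    echo "OCR exported to $target"
}
```

Calling `export_ocr /u01/backup/ocr` as root would list the automatic backups and then write a new export file; keeping the export on storage that is separate from the OCR itself is the safer choice.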
In the event the OCR is corrupt on one node and all options to
recover it have failed, one safe way to re-create the OCR
(and consequently the voting disk) is to reinstall the Oracle Clusterware
software. In order to accomplish this, a complete outage is required for
the entire cluster throughout the duration of the re-install.
The Oracle Clusterware software will need to be fully
removed, the OCR and voting disks reformatted, all virtual IP addresses (VIPs)
de-installed, and a complete reinstall of the Oracle Clusterware
software will need to be performed. It should also be noted that any
patches that were applied to the original clusterware install will need
to be re-applied. As you can see, having a backup of the OCR and voting
disk can dramatically simplify the recovery of your system!
A second and much more efficient method used to re-create the
OCR (and consequently the voting disk as well) is to
re-run the root.sh script
from the primary node in the cluster. This is described
in Doc ID: 399482.1 on the
My Oracle Support
web site. In my opinion, this method is quicker and much less intrusive
than reinstalling Oracle Clusterware. Using root.sh to re-create
the OCR/Voting Disk is the focus of this article.
It is worth mentioning that only one of the two methods mentioned above
needs to be performed in order to recover from a lost or corrupt OCR.
In addition to recovering the OCR, either method could also be used
to restore the SCLS directories from an accidental delete.
These are internal only directories which are created by root.sh
and on the Linux platform are located at /etc/oracle/scls_scr.
If the SCLS directories are accidentally removed then they can
only be created using the same methods used to re-create the OCR
which is the focus of this article.
There are two other critical files in Oracle Clusterware that,
if accidentally deleted, are a bit easier to recover from:
- Voting disk: If there are multiple voting disks and one was
accidentally deleted, check whether a backup of that voting
disk exists. If there is no backup, a new voting disk can be
added using the crsctl add css votedisk command.
- Socket files: If the Oracle Clusterware socket files (typically
found under /tmp/.oracle or /var/tmp/.oracle) are accidentally
deleted, stop Oracle Clusterware on that node and restart it.
This will re-create the socket files. If the socket files for
cssd are deleted, the Oracle Clusterware stack may not come
down, in which case the node has to be rebooted.
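Since a voting disk backup makes the first case trivial, it can be worth scripting one. The sketch below is an illustration under stated assumptions: it assumes `crsctl query css votedisk` prints one numbered line per voting disk with the path in the third column (as in the example configuration later in this article), and the function names and backup directory are hypothetical. On Oracle 10g, voting disks are copied with dd.

```shell
#!/bin/sh
# Extract the voting disk paths from "crsctl query css votedisk" output
# read on stdin. Only lines whose first field looks like "0." are used.
votedisk_paths() {
    awk '$1 ~ /^[0-9]+\.$/ { print $3 }'
}

# Copy every voting disk into a backup directory with dd.
# $1 = backup directory (hypothetical location, must already exist).
backup_votedisks() {
    "$ORA_CRS_HOME/bin/crsctl" query css votedisk | votedisk_paths |
    while read -r disk; do
        dd if="$disk" of="$1/$(basename "$disk").bak" bs=4k || exit 1
    done
}
```

For example, `backup_votedisks /u01/backup/votedisk` would leave one `.bak` copy per voting disk in that directory.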
The example configuration used in this article consists of a two-node RAC with a clustered database named racdb.idevelopment.info running Oracle RAC 10g Release 2 on the Linux x86 platform. The two node names are racnode1 and racnode2, each hosting a single Oracle instance named racdb1 and racdb2 respectively. For a detailed guide on building the example clustered database environment, please see:
Building an Inexpensive Oracle RAC 10g Release 2 on Linux - (CentOS 5.3 / iSCSI)
The example Oracle Clusterware environment is configured with three mirrored voting disks and two mirrored OCR files all of which are located on an OCFS2 clustered file system. Note that the voting disk is owned by the oracle user in the oinstall group with 0644 permissions while the OCR file is owned by root in the oinstall group with 0640 permissions:
    [oracle@racnode1 ~]$ ls -l /u02/oradata/racdb
    total 39840
    -rw-r--r-- 1 oracle oinstall  10240000 Oct  9 19:33 CSSFile
    -rw-r--r-- 1 oracle oinstall  10240000 Oct  9 19:36 CSSFile_mirror1
    -rw-r--r-- 1 oracle oinstall  10240000 Oct  9 19:38 CSSFile_mirror2
    drwxr-xr-x 2 oracle oinstall      3896 Aug 26 23:45 dbs
    -rw-r----- 1 root   oinstall 268644352 Oct  9 19:27 OCRFile
    -rw-r----- 1 root   oinstall 268644352 Oct  9 19:28 OCRFile_mirror

Check Current OCR File

    [oracle@racnode1 ~]$ ocrcheck
    Status of Oracle Cluster Registry is as follows :
        Version                  : 2
        Total space (kbytes)     : 262120
        Used space (kbytes)      : 4676
        Available space (kbytes) : 257444
        ID                       : 1513888898
        Device/File Name         : /u02/oradata/racdb/OCRFile
                                   Device/File integrity check succeeded
        Device/File Name         : /u02/oradata/racdb/OCRFile_mirror
                                   Device/File integrity check succeeded

        Cluster registry integrity check succeeded

Check Current Voting Disk

    [oracle@racnode1 ~]$ crsctl query css votedisk
     0.     0    /u02/oradata/racdb/CSSFile
     1.     0    /u02/oradata/racdb/CSSFile_mirror1
     2.     0    /u02/oradata/racdb/CSSFile_mirror2

    located 3 votedisk(s).

Network Settings
Oracle RAC Node 1 - (racnode1)

| Device | IP Address | Subnet | Gateway | Purpose |
|---|---|---|---|---|
| eth0 | 192.168.1.151 | 255.255.255.0 | 192.168.1.1 | Connects racnode1 to the public network |
| eth1 | 192.168.2.151 | 255.255.255.0 | | Connects racnode1 to iSCSI shared storage (Openfiler). |
| eth2 | 192.168.3.151 | 255.255.255.0 | | Connects racnode1 (interconnect) to racnode2 (racnode2-priv) |

/etc/hosts

    127.0.0.1 localhost.localdomain localhost

    # Public Network - (eth0)
    192.168.1.151 racnode1
    192.168.1.152 racnode2

    # Network Storage - (eth1)
    192.168.2.151 racnode1-san
    192.168.2.152 racnode2-san

    # Private Interconnect - (eth2)
    192.168.3.151 racnode1-priv
    192.168.3.152 racnode2-priv

    # Public Virtual IP (VIP) addresses - (eth0:1)
    192.168.1.251 racnode1-vip
    192.168.1.252 racnode2-vip

    # Private Storage Network for Openfiler - (eth1)
    192.168.1.195 openfiler1
    192.168.2.195 openfiler1-priv

Oracle RAC Node 2 - (racnode2)

| Device | IP Address | Subnet | Gateway | Purpose |
|---|---|---|---|---|
| eth0 | 192.168.1.152 | 255.255.255.0 | 192.168.1.1 | Connects racnode2 to the public network |
| eth1 | 192.168.2.152 | 255.255.255.0 | | Connects racnode2 to iSCSI shared storage (Openfiler). |
| eth2 | 192.168.3.152 | 255.255.255.0 | | Connects racnode2 (interconnect) to racnode1 (racnode1-priv) |

/etc/hosts

    127.0.0.1 localhost.localdomain localhost

    # Public Network - (eth0)
    192.168.1.151 racnode1
    192.168.1.152 racnode2

    # Network Storage - (eth1)
    192.168.2.151 racnode1-san
    192.168.2.152 racnode2-san

    # Private Interconnect - (eth2)
    192.168.3.151 racnode1-priv
    192.168.3.152 racnode2-priv

    # Public Virtual IP (VIP) addresses - (eth0:1)
    192.168.1.251 racnode1-vip
    192.168.1.252 racnode2-vip

    # Private Storage Network for Openfiler - (eth1)
    192.168.1.195 openfiler1
    192.168.2.195 openfiler1-priv
To describe the steps required in recovering the OCR, it is assumed the current OCR has been accidentally deleted and no viable backups are available. It is also assumed the CRS stack was up and running on both nodes in the cluster at the time the OCR files were removed:
    [root@racnode1 ~]# rm /u02/oradata/racdb/OCRFile
    [root@racnode1 ~]# rm /u02/oradata/racdb/OCRFile_mirror

    [root@racnode1 ~]# ps -ef | grep d.bin | grep -v grep
    root       548 27171  0 Oct09 ?  00:06:17 /u01/app/crs/bin/crsd.bin reboot
    oracle     575   566  0 Oct09 ?  00:00:10 /u01/app/crs/bin/evmd.bin
    root      1118   660  0 Oct09 ?  00:00:00 /u01/app/crs/bin/oprocd.bin run -t 1000 -m 500 -f
    oracle    1277   749  0 Oct09 ?  00:03:31 /u01/app/crs/bin/ocssd.bin

    [root@racnode2 ~]# ps -ef | grep d.bin | grep -v grep
    oracle     674   673  0 Oct09 ?  00:00:10 /u01/app/crs/bin/evmd.bin
    root       815 27760  0 Oct09 ?  00:06:12 /u01/app/crs/bin/crsd.bin reboot
    root      1201   827  0 Oct09 ?  00:00:00 /u01/app/crs/bin/oprocd.bin run -t 1000 -m 500 -f
    oracle    1442   891  0 Oct09 ?  00:03:43 /u01/app/crs/bin/ocssd.bin
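The ps check above is handy at several points in the procedure, so it can be wrapped in a small filter. This is only a sketch: the function name is hypothetical, and it assumes the four daemon binaries seen in the listing (crsd.bin, evmd.bin, ocssd.bin, oprocd.bin) are the ones of interest.

```shell
#!/bin/sh
# Read "ps -ef" style output on stdin and print the name of each
# Clusterware daemon binary found, one per line, sorted.
crs_daemons() {
    awk '{
        for (i = 1; i <= NF; i++) {
            if ($i ~ /\/(crsd|evmd|ocssd|oprocd)\.bin$/) {
                n = split($i, part, "/")   # keep only the file name
                print part[n]
                break
            }
        }
    }' | sort
}
```

Running `ps -ef | crs_daemons` on a healthy node should list all four daemons; an empty result means the Clusterware stack is not running on that node.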
- Shut Down Oracle Clusterware on All Nodes.
Although all OCR files have been lost or corrupted, the Oracle Clusterware daemons as well as the clustered database remain running. In this scenario, Oracle Clusterware and all managed resources need to be shut down in order to start the OCR recovery. Attempting to stop CRS using crsctl stop crs will fail given it cannot write to the now lost/corrupt OCR file:
    [root@racnode1 ~]# crsctl stop crs
    OCR initialization failed accessing OCR device: PROC-26: Error while accessing the physical storage Operating System error [No such file or directory] [2]

With the environment in this unstable state, shut down all database instances on all nodes in the cluster and then reboot each node:

    [oracle@racnode1 ~]$ sqlplus / as sysdba
    SQL> shutdown immediate
    [root@racnode1 ~]# reboot
    ------------------------------------------------
    [oracle@racnode2 ~]$ sqlplus / as sysdba
    SQL> shutdown immediate
    [root@racnode2 ~]# reboot

When the Oracle RAC nodes come back up, note that Oracle Clusterware will fail to start as a result of the lost/corrupt OCR file:

    [root@racnode1 ~]# crs_stat -t
    CRS-0184: Cannot communicate with the CRS daemon.

    [root@racnode2 ~]# crs_stat -t
    CRS-0184: Cannot communicate with the CRS daemon.

- Execute rootdelete.sh from All Nodes.
The rootdelete.sh script can be found at $ORA_CRS_HOME/install/rootdelete.sh on all nodes in the cluster:
    [root@racnode1 ~]# $ORA_CRS_HOME/install/rootdelete.sh
    Shutting down Oracle Cluster Ready Services (CRS):
    OCR initialization failed accessing OCR device: PROC-26: Error while accessing the physical storage Operating System error [No such file or directory] [2]
    Shutdown has begun. The daemons should exit soon.
    Checking to see if Oracle CRS stack is down...
    Oracle CRS stack is not running.
    Oracle CRS stack is down now.
    Removing script for Oracle Cluster Ready services
    Updating ocr file for downgrade
    Cleaning up SCR settings in '/etc/oracle/scls_scr'

    [root@racnode2 ~]# $ORA_CRS_HOME/install/rootdelete.sh
    Shutting down Oracle Cluster Ready Services (CRS):
    OCR initialization failed accessing OCR device: PROC-26: Error while accessing the physical storage Operating System error [No such file or directory] [2]
    Shutdown has begun. The daemons should exit soon.
    Checking to see if Oracle CRS stack is down...
    Oracle CRS stack is not running.
    Oracle CRS stack is down now.
    Removing script for Oracle Cluster Ready services
    Updating ocr file for downgrade
    Cleaning up SCR settings in '/etc/oracle/scls_scr'

The "OCR initialization failed accessing OCR device" and PROC-26 errors can be safely ignored given the OCR is not available. The most important action is that the SCR entries are cleaned up.
Keep in mind that if you have more than two nodes in your cluster, you need to run rootdelete.sh on all other nodes as well.
- Run rootdeinstall.sh from the Primary Node.
The primary node is the node from which the Oracle Clusterware installation was performed (typically node1). For the purpose of this example, I originally installed Oracle Clusterware from the machine racnode1, which is therefore the primary node.
The rootdeinstall.sh script will clear out any old data from a raw storage device in preparation for the new OCR. If the OCR is on a clustered file system, new OCR file(s) will be created with null data.
    [root@racnode1 ~]# $ORA_CRS_HOME/install/rootdeinstall.sh
    Removing contents from OCR mirror device
    2560+0 records in
    2560+0 records out
    10485760 bytes (10 MB) copied, 0.0513806 seconds, 204 MB/s
    Removing contents from OCR device
    2560+0 records in
    2560+0 records out
    10485760 bytes (10 MB) copied, 0.0443477 seconds, 236 MB/s

- Run root.sh from the Primary Node. (same node as above)
Among several other tasks, this script will create the OCR and voting disk(s).
    [root@racnode1 ~]# $ORA_CRS_HOME/root.sh
    Checking to see if Oracle CRS stack is already configured

    Setting the permissions on OCR backup directory
    Setting up NS directories
    Oracle Cluster Registry configuration upgraded successfully
    Successfully accumulated necessary OCR keys.
    Using ports: CSS=49895 CRS=49896 EVMC=49898 and EVMR=49897.
    node:
    node 1: racnode1 racnode1-priv racnode1
    node 2: racnode2 racnode2-priv racnode2
    Creating OCR keys for user 'root', privgrp 'root'..
    Operation successful.
    Now formatting voting device: /u02/oradata/racdb/CSSFile
    Now formatting voting device: /u02/oradata/racdb/CSSFile_mirror1
    Now formatting voting device: /u02/oradata/racdb/CSSFile_mirror2
    Format of 3 voting devices complete.
    Startup will be queued to init within 30 seconds.
    Adding daemons to inittab
    Expecting the CRS daemons to be up within 600 seconds.
    CSS is active on these nodes.
            racnode1
    CSS is inactive on these nodes.
            racnode2
    Local node checking complete.
    Run root.sh on remaining nodes to start CRS daemons.

- Run root.sh from All Remaining Nodes.

    [root@racnode2 ~]# $ORA_CRS_HOME/root.sh
    Checking to see if Oracle CRS stack is already configured

    Setting the permissions on OCR backup directory
    Setting up NS directories
    Oracle Cluster Registry configuration upgraded successfully
    clscfg: EXISTING configuration version 3 detected.
    clscfg: version 3 is 10G Release 2.
    Successfully accumulated necessary OCR keys.
    Using ports: CSS=49895 CRS=49896 EVMC=49898 and EVMR=49897.
    node:
    node 1: racnode1 racnode1-priv racnode1
    node 2: racnode2 racnode2-priv racnode2
    clscfg: Arguments check out successfully.

    NO KEYS WERE WRITTEN. Supply -force parameter to override.
    -force is destructive and will destroy any previous cluster configuration.
    Oracle Cluster Registry for cluster has already been initialized
    Startup will be queued to init within 30 seconds.
    Adding daemons to inittab
    Expecting the CRS daemons to be up within 600 seconds.
    CSS is active on these nodes.
            racnode1
            racnode2
    CSS is active on all nodes.
    Waiting for the Oracle CRSD and EVMD to start
    Oracle CRS stack installed and running under init(1M)
    Running vipca(silent) for configuring nodeapps

    Creating VIP application resource on (2) nodes...
    Creating GSD application resource on (2) nodes...
    Creating ONS application resource on (2) nodes...
    Starting VIP application resource on (2) nodes...
    Starting GSD application resource on (2) nodes...
    Starting ONS application resource on (2) nodes...

    Done.

Oracle 10.2.0.1 users should note that running root.sh on the last node will fail. Most notable is the silent mode VIPCA configuration, which fails because of BUG 4437727 in 10.2.0.1. Refer to my article Building an Inexpensive Oracle RAC 10g Release 2 on Linux - (CentOS 5.3 / iSCSI) to work around these errors.
The Oracle Clusterware and Oracle RAC software in my configuration were patched with 10.2.0.4 and therefore did not receive any errors during the running of root.sh on the last node.
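Because the order of the scripts matters (rootdelete.sh everywhere, then rootdeinstall.sh and root.sh on the primary node, then root.sh on the rest), it can help to print the plan for a given cluster before touching anything. The helper below is a sketch for illustration only: the function name is hypothetical, and it deliberately prints the commands rather than running them.

```shell
#!/bin/sh
# Print the script sequence used in this article, in execution order.
# $1 = Clusterware home, $2 = primary node, remaining args = other nodes.
ocr_recreate_plan() {
    crs_home="$1"
    primary="$2"
    shift 2
    # rootdelete.sh runs on every node, primary included.
    for node in "$primary" "$@"; do
        echo "$node: $crs_home/install/rootdelete.sh"
    done
    # rootdeinstall.sh and the first root.sh run on the primary node only.
    echo "$primary: $crs_home/install/rootdeinstall.sh"
    echo "$primary: $crs_home/root.sh"
    # root.sh then runs on each remaining node, one at a time.
    for node in "$@"; do
        echo "$node: $crs_home/root.sh"
    done
}
```

For the two-node example in this article, `ocr_recreate_plan /u01/app/crs racnode1 racnode2` prints five lines, each naming the node and the script to run there as root.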
- Configure Server-Side ONS using racgons.
CRS_home/bin/racgons add_config hostname1:port hostname2:port
    [root@racnode1 ~]# $ORA_CRS_HOME/bin/racgons add_config racnode1:6200 racnode2:6200

    [root@racnode1 ~]# $ORA_CRS_HOME/bin/onsctl ping
    Number of onsconfiguration retrieved, numcfg = 2
    onscfg[0]
       {node = racnode1, port = 6200}
    Adding remote host racnode1:6200
    onscfg[1]
       {node = racnode2, port = 6200}
    Adding remote host racnode2:6200
    ons is running ...

- Configure Network Interfaces for Clusterware.
Log in as the owner of the Oracle Clusterware software which is typically the oracle user account and configure all network interfaces. The first step is to identify the current interfaces and IP addresses using oifcfg iflist. As discussed in the network settings section, eth0/192.168.1.0 is my public interface/network, eth1/192.168.2.0 is my iSCSI storage network and not used specifically for Oracle Clusterware, and eth2/192.168.3.0 is the cluster_interconnect interface/network.
    [oracle@racnode1 ~]$ $ORA_CRS_HOME/bin/oifcfg iflist
    eth0  192.168.1.0     <-- public interface
    eth1  192.168.2.0     <-- not used
    eth2  192.168.3.0     <-- cluster interconnect

    [oracle@racnode1 ~]$ $ORA_CRS_HOME/bin/oifcfg setif -global eth0/192.168.1.0:public
    [oracle@racnode1 ~]$ $ORA_CRS_HOME/bin/oifcfg setif -global eth2/192.168.3.0:cluster_interconnect

    [oracle@racnode1 ~]$ $ORA_CRS_HOME/bin/oifcfg getif
    eth0  192.168.1.0  global  public
    eth2  192.168.3.0  global  cluster_interconnect

- Add TNS Listener using NETCA.
As the Oracle Clusterware software owner (typically oracle), add a cluster TNS listener configuration to the OCR using netca. This may give errors if listener.ora already contains the entries. If this is the case, move listener.ora to /tmp from $ORACLE_HOME/network/admin (or from the $TNS_ADMIN directory if the TNS_ADMIN environment variable is defined) and then run netca. Add all the listeners that were added during the original Oracle Clusterware software installation.
    [oracle@racnode1 ~]$ export DISPLAY=<X-Windows Terminal>:0

    [oracle@racnode1 ~]$ mv $TNS_ADMIN/listener.ora /tmp/listener.ora.original
    [oracle@racnode2 ~]$ mv $TNS_ADMIN/listener.ora /tmp/listener.ora.original

    [oracle@racnode1 ~]$ netca &

- Add all Resources Back to OCR using srvctl.
As a final step, log in as the Oracle Clusterware software owner (typically oracle) and add all resources back to the OCR using the srvctl command.
Please ensure that these commands are not run as the root user account.
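If the srvctl commands are collected into a script, a guard at the top can enforce this rule. The following is a minimal sketch, assuming a hypothetical function name; the UID can be passed as an argument purely so the check can be exercised without switching users.

```shell
#!/bin/sh
# Refuse to continue when the user ID is 0 (root).
# $1 = UID to check (optional, defaults to the current user's UID).
require_non_root() {
    if [ "${1:-$(id -u)}" -eq 0 ]; then
        echo "srvctl must be run as the Clusterware software owner, not root" >&2
        return 1
    fi
}
```

A script would then start with `require_non_root || exit 1` before issuing any srvctl add command.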
Add ASM INSTANCE(S) to OCR:

    srvctl add asm -n <node_name> -i <asm_instance_name> -o <oracle_home>

    [oracle@racnode1 ~]$ $ORA_CRS_HOME/bin/srvctl add asm -i +ASM1 -n racnode1 -o /u01/app/oracle/product/10.2.0/db_1
    [oracle@racnode1 ~]$ $ORA_CRS_HOME/bin/srvctl add asm -i +ASM2 -n racnode2 -o /u01/app/oracle/product/10.2.0/db_1

Add DATABASE to OCR:

    srvctl add database -d <db_unique_name> -o <oracle_home>

    [oracle@racnode1 ~]$ $ORA_CRS_HOME/bin/srvctl add database -d racdb -o /u01/app/oracle/product/10.2.0/db_1

Add INSTANCE(S) to OCR:

    srvctl add instance -d <db_unique_name> -i <instance_name> -n <node_name>

    [oracle@racnode1 ~]$ $ORA_CRS_HOME/bin/srvctl add instance -d racdb -i racdb1 -n racnode1
    [oracle@racnode1 ~]$ $ORA_CRS_HOME/bin/srvctl add instance -d racdb -i racdb2 -n racnode2

Add SERVICE(S) to OCR:

    srvctl add service -d <db_unique_name> -s <service_name> -r <preferred_list> -P <TAF_policy>

where TAF_policy is set to NONE, BASIC, or PRECONNECT

    [oracle@racnode1 ~]$ $ORA_CRS_HOME/bin/srvctl add service -d racdb -s racdb_srvc -r racdb1,racdb2 -P BASIC

After completing the steps above, the OCR should have been successfully recreated. Bring up all of the resources that were added to the OCR and run cluvfy to verify the cluster configuration.
    [oracle@racnode1 ~]$ $ORA_CRS_HOME/bin/crs_stat -t
    Name           Type           Target    State     Host
    ------------------------------------------------------------
    ora.racdb.db   application    OFFLINE   OFFLINE
    ora....b1.inst application    OFFLINE   OFFLINE
    ora....b2.inst application    OFFLINE   OFFLINE
    ora....srvc.cs application    OFFLINE   OFFLINE
    ora....db1.srv application    OFFLINE   OFFLINE
    ora....db2.srv application    OFFLINE   OFFLINE
    ora....SM1.asm application    OFFLINE   OFFLINE
    ora....E1.lsnr application    ONLINE    ONLINE    racnode1
    ora....de1.gsd application    ONLINE    ONLINE    racnode1
    ora....de1.ons application    ONLINE    ONLINE    racnode1
    ora....de1.vip application    ONLINE    ONLINE    racnode1
    ora....SM2.asm application    OFFLINE   OFFLINE
    ora....E2.lsnr application    ONLINE    ONLINE    racnode2
    ora....de2.gsd application    ONLINE    ONLINE    racnode2
    ora....de2.ons application    ONLINE    ONLINE    racnode2
    ora....de2.vip application    ONLINE    ONLINE    racnode2

    [oracle@racnode1 ~]$ srvctl start asm -n racnode1
    [oracle@racnode1 ~]$ srvctl start asm -n racnode2
    [oracle@racnode1 ~]$ srvctl start database -d racdb
    [oracle@racnode1 ~]$ srvctl start service -d racdb

    [oracle@racnode1 ~]$ cluvfy stage -post crsinst -n racnode1,racnode2

    Performing post-checks for cluster services setup

    Checking node reachability...
    Node reachability check passed from node "racnode1".

    Checking user equivalence...
    User equivalence check passed for user "oracle".

    Checking Cluster manager integrity...

    Checking CSS daemon...
    Daemon status check passed for "CSS daemon".

    Cluster manager integrity check passed.

    Checking cluster integrity...

    Cluster integrity check passed

    Checking OCR integrity...

    Checking the absence of a non-clustered configuration...
    All nodes free of non-clustered, local-only configurations.

    Uniqueness check for OCR device passed.

    Checking the version of OCR...
    OCR of correct Version "2" exists.

    Checking data integrity of OCR...
    Data integrity check for OCR passed.

    OCR integrity check passed.

    Checking CRS integrity...

    Checking daemon liveness...
    Liveness check passed for "CRS daemon".

    Checking daemon liveness...
    Liveness check passed for "CSS daemon".

    Checking daemon liveness...
    Liveness check passed for "EVM daemon".

    Checking CRS health...
    CRS health check passed.

    CRS integrity check passed.

    Checking node application existence...

    Checking existence of VIP node application (required)
    Check passed.

    Checking existence of ONS node application (optional)
    Check passed.

    Checking existence of GSD node application (optional)
    Check passed.

    Post-check for cluster services setup was successful.
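Once the resources have been started, re-running crs_stat -t and scanning for anything that is still down is a useful last check. The filter below is a sketch under the assumption that crs_stat -t keeps the five-column layout shown in this article (Name, Type, Target, State, Host); the function name is hypothetical.

```shell
#!/bin/sh
# Read "crs_stat -t" output on stdin and print the name of every
# resource whose State column is not ONLINE. The header and the
# separator line are skipped because their second field is not
# "application".
offline_resources() {
    awk '$2 == "application" && $4 != "ONLINE" { print $1 }'
}
```

For example, `$ORA_CRS_HOME/bin/crs_stat -t | offline_resources` prints nothing when every registered resource is online.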
Jeffrey Hunter is an Oracle Certified Professional, Java Development Certified Professional, Author, and an Oracle ACE. Jeff currently works as a Senior Database Administrator for The DBA Zone, Inc. located in Pittsburgh, Pennsylvania. His work includes advanced performance tuning, Java and PL/SQL programming, capacity planning, database security, and physical / logical database design in a UNIX, Linux, and Windows server environment. Jeff's other interests include mathematical encryption theory, programming language processors (compilers and interpreters) in Java and C, LDAP, writing web-based database administration tools, and of course Linux. He has been a Sr. Database Administrator and Software Engineer for over 16 years and maintains his own website at: http://www.iDevelopment.info. Jeff graduated from Stanislaus State University in Turlock, California, with a Bachelor's degree in Computer Science.
All articles, scripts and material located at the Internet address of http://www.idevelopment.info is the copyright of Jeffrey M. Hunter
and is protected under copyright laws of the United States. This document may not be hosted on any other site without my express,
prior, written permission. Application to host any of the material elsewhere can be made by contacting me at jhunter@idevelopment.info.
I have made every effort and taken great care in making sure that the material included on my web site is technically accurate,
but I disclaim any and all responsibility for any loss, damage or destruction of data or any other property which may arise from
relying on it. I will in no case be liable for any monetary damages arising from such loss, damage or destruction.