Differences in df and du on Oracle Cluster File System (OCFS2) and Orphan Files
by Jeff Hunter, Sr. Database Administrator
Recently, it was noticed that the df and du commands were displaying different results on two OCFS2 file systems from all clustered database nodes in an Oracle RAC 10g configuration.
| Mount Point | LUN | Purpose | OCFS2 Kernel Driver | OCFS2 Tools | OCFS2 Console |
|---|---|---|---|---|---|
| /u02 | /dev/iscsi/thingdbcrsvol1/part1 | Oracle CRS Components | 1.4.2-1.el5 | 1.4.2-1.el5 | 1.4.2-1.el5 |
| /u03 | /dev/iscsi/thingdbfravol1/part1 | Flash Recovery Area | 1.4.2-1.el5 | 1.4.2-1.el5 | 1.4.2-1.el5 |
For example, on the /u03 cluster file system, df reported 326912416 KB in use while du accounted for only 78463904 KB. That is a difference of 248448512 KB (326912416 - 78463904), or roughly 237 GB!
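A comparison like the one described above can be sketched in a few lines of shell. This is only an illustration, not the commands used in the original investigation: the function names are made up for this example, and it assumes both tools report in 1 KB blocks (forced here with -k).

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not from the original article.

# Used space, in 1 KB blocks, that df reports for the file system holding $1.
df_used_kb() {
    df -kP "$1" | awk 'NR == 2 { print $3 }'
}

# Space, in 1 KB blocks, reachable through the directory tree under $1
# (-x keeps du on the same file system).
du_used_kb() {
    du -skx "$1" 2>/dev/null | awk '{ print $1 }'
}

# Print the gap between a df figure ($1) and a du figure ($2), in KB and GiB.
report_gap() {
    df_kb=$1 du_kb=$2
    gap_kb=$((df_kb - du_kb))
    awk -v kb="$gap_kb" 'BEGIN { printf "%d KB (%.1f GiB)\n", kb, kb / 1048576 }'
}
```

On a live system this could be called as `report_gap "$(df_used_kb /u03)" "$(du_used_kb /u03)"`. With the figures quoted in this article, `report_gap 326912416 78463904` prints `248448512 KB (236.9 GiB)`.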
The problem appeared to be isolated to the /u03 cluster file system which is used exclusively for the global Flash Recovery Area (FRA). The files stored in the FRA include RMAN backups, archived redo logs, and flashback database logs:
/u03/flash_recovery_area/thingdb/archivelog
/u03/flash_recovery_area/thingdb/autobackup
/u03/flash_recovery_area/thingdb/backupset
/u03/flash_recovery_area/thingdb/flashback
These directories are all managed by the Oracle RDBMS software; no files in the FRA (or /u03 in general) are manually added or removed.
Researching this problem yielded the following My Oracle Support notes:
Discusses either orphaned files or an invalid cluster size as the cause of the problem.
Contains tests used to determine that the current difference between the df and du commands was indeed caused by orphaned files and not by an invalid cluster size.
Discusses a bug within OCFS2 that leaves some deleted files in the orphan directory (the //orphan_dir name space in OCFS2).
The solution as described in Note ID 806554.1 is to upgrade OCFS2 to version 1.4.4 or higher. At the time of this writing, the latest version of OCFS2 was 1.4.7-1.
After several days of research, it was determined that orphan files on the OCFS2 cluster file system were responsible for the significant difference between the output of df and du. OCFS2 was apparently leaving some files in the orphan directory (the //orphan_dir name space in OCFS2) after they were deleted.
Note ID 468923.1 on the My Oracle Support website explains that despite deleting files and/or directories on an OCFS2 cluster file system, it is possible that orphaned files may exist. These are files that have been deleted, but are being accessed by running processes on one or more OCFS2 cluster nodes.
When an object (file and/or directory) is deleted, the file system unlinks the object entry from the existing directory and links it as an entry against that cluster node's orphan directory (the //orphan_dir name space in OCFS2). When the object is eventually no longer used across the cluster, the file system frees its inode, including all disk space associated with it.
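The underlying idea, that a deleted file keeps its space for as long as something still holds it open, is general POSIX behavior and can be seen on any local file system. The sketch below is only an analogy for what the orphan directory tracks across a cluster; it does not touch OCFS2, and the function name is invented for this example.

```shell
#!/bin/sh
# Sketch only: demonstrates delete-while-open on an ordinary temporary file.

read_after_unlink() {
    tmpfile=$(mktemp) || return 1
    printf 'still here\n' > "$tmpfile"
    exec 3< "$tmpfile"    # hold the file open on descriptor 3
    rm -f "$tmpfile"      # remove the name; the inode survives while fd 3 is open
    if [ -e "$tmpfile" ]; then name_state="present"; else name_state="gone"; fi
    content=$(cat <&3)    # the data is still readable through the descriptor
    exec 3<&-             # closing the last descriptor lets the space be freed
    printf '%s:%s\n' "$name_state" "$content"
}
```

Calling `read_after_unlink` prints `gone:still here`: the directory entry has disappeared, which is all du can see, while the blocks remain allocated, which is what df counts.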
So it appears that in order to completely avoid orphaned files being created, the application/database should guarantee that files are not held open by other processes when they are deleted. Again, the problem seemed to be isolated to the /u03 file system, which was used exclusively for the global Flash Recovery Area and managed solely by the Oracle RDBMS software. No files on this file system were being manually added or deleted.
The orphaned files in /u03 added up significantly given they were RMAN backups, archived logs, flashback logs, etc. The initial plan was to manually remove all orphan files using /sbin/fsck.ocfs2 -fy /dev/iscsi/thingdbfravol1/part1. Since one of the prerequisites is to unmount the file system(s), a production outage had to be scheduled.
It was later found that the orphan files were the result of the OCFS2 bug described in Note ID 806554.1, so the permanent fix was to upgrade OCFS2 to version 1.4.4 or higher.
This article discusses some of the troubleshooting steps that were involved in resolving OCFS2 not removing orphan files.
During the initial troubleshooting phase, all nodes in the cluster were rebooted to see what effect a restart would have.
When the nodes came back online, the problem appeared to be resolved: df and du were calculating the same disk space usage. The victory was short-lived, however. In less than 24 hours, df and du were once again reporting stark differences on the /u03 cluster file system.
Testing resumed with the assumption that the cluster file system(s) were having an issue removing orphaned files. Note ID 468923.1 on the My Oracle Support website includes directions to identify which node, application, or user (the holders) are associated with //orphan_dir name space entries, if any. To determine whether the orphaned files were truly being held by a process on one of the RAC nodes, a find command against /proc was run from all OCFS2 clustered nodes.
The lsof command can be used to produce the same results.
Although the find /proc and lsof commands both produced output, the files listed didn't appear to have anything to do with the files on either of the clustered file systems (/u02 or /u03). It was now evident that while excessive orphan files did indeed exist on the /u03 cluster file system, no processes were identified as holding a lock on them.
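A check of this kind can be sketched as follows. This is not the command from the My Oracle Support note; it is a minimal illustration that relies on Linux marking the target of an unlinked open file with a trailing " (deleted)" in /proc/<pid>/fd. The function names are hypothetical, and it needs to run as root to see every process.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not from the original article.

# Print "<fd link> -> <target>" for every open file descriptor visible in /proc.
list_open_fd_targets() {
    for link in /proc/[0-9]*/fd/*; do
        target=$(readlink "$link" 2>/dev/null) || continue
        printf '%s -> %s\n' "$link" "$target"
    done
}

# Read such lines on stdin and keep only deleted files under mount point $1.
filter_deleted_under() {
    grep -F -- " -> $1/" | grep ' (deleted)$'
}
```

Running `list_open_fd_targets | filter_deleted_under /u03` on each node would list any process still holding a deleted file on /u03. No output means no local process is keeping those files alive, which matches what was observed here.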
The next round of troubleshooting involved querying the //orphan_dir name space in OCFS2 using the debugfs.ocfs2 command. The query takes two values:
OCFS2 slot number: Four-digit OCFS2 slot number that specifies which node to check using debugfs.ocfs2. In this cluster, node 1 (thing1) is slot number 0000, node 2 (thing2) is slot number 0001, node 3 (thing3) is slot number 0002, and node n is slot number LPAD(n-1, 4, '0').
Disk device: Name of the disk device. For example: /dev/iscsi/thingdbfravol1/part1.
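The query described above can be sketched as a pair of shell functions. The function names are invented for this example. The debugfs.ocfs2 invocation uses its -R option to run a single request, and the slot calculation follows the node-to-slot convention used in this article, which is an assumption about how this particular cluster is laid out.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not from the original article.

# Convert a cluster node number (1, 2, 3, ...) to a four-digit slot number,
# i.e. LPAD(n-1, 4, '0').
slot_for_node() {
    printf '%04d\n' "$(( $1 - 1 ))"
}

# List the orphan directory for node number $1 on disk device $2.
# Must be run as root on a node that can see the device.
list_orphans() {
    debugfs.ocfs2 -R "ls -l //orphan_dir:$(slot_for_node "$1")" "$2"
}
```

For the first node in this cluster, that would be `list_orphans 1 /dev/iscsi/thingdbfravol1/part1`, which queries //orphan_dir:0000.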
For example, when the //orphan_dir name space on /u03 was queried for each slot, the output showed that cluster node 1 (thing1) had entries (files) associated with the //orphan_dir name space, meaning they are orphan files. The output from debugfs.ocfs2 lists the file entries using a hexadecimal value (e.g. 83865735=00000000000bd021), which indicates the inode number. Cluster node 2 (thing2) had no entries in the //orphan_dir name space, indicating there were no orphan files on that node. This makes sense since the nightly RMAN process is only run from the first node (thing1).
By the time it was identified that orphan files were not being removed from the OCFS2 cluster file system, immediate action was required to clear out the entries found in the //orphan_dir name space and allow the file system to free its inodes and the disk space associated with them. The orphan files were both large and numerous given they were RMAN backups and archived logs.
To manually clear all orphaned files, schedule an outage on all nodes in the cluster. Bring down the clustered database and all Oracle RAC services, unmount the OCFS2 file system(s) that contain the orphan files to be removed, and run the fsck.ocfs2 command from all nodes in the cluster. For the /u03 file system, the command is /sbin/fsck.ocfs2 -fy /dev/iscsi/thingdbfravol1/part1.
After removing all orphan files, mount the OCFS2 cluster file systems and restart all Oracle RAC services.
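The file system portion of that procedure can be written down as a dry-run helper that only prints the commands, so they can be reviewed before the outage window. This is a sketch: the function name is hypothetical, it assumes the mount point has an /etc/fstab entry so that `mount <mount point>` works, and it leaves out stopping and starting the database and Oracle RAC services.

```shell
#!/bin/sh
# Sketch only: prints the steps for one OCFS2 volume; it does not run them.

print_orphan_cleanup_steps() {
    mount_point=$1 device=$2
    echo "umount $mount_point"            # the volume must be unmounted first
    echo "/sbin/fsck.ocfs2 -fy $device"   # forced check, answering yes to all repairs
    echo "mount $mount_point"             # assumes an /etc/fstab entry exists
}
```

Calling `print_orphan_cleanup_steps /u03 /dev/iscsi/thingdbfravol1/part1` prints the three commands in order for the Flash Recovery Area volume.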
The permanent solution in this case was to upgrade OCFS2 to version 1.4.4 or higher according to Note ID 806554.1 from the My Oracle Support website. At the time of this writing, the latest version of OCFS2 was 1.4.7-1.
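Whether a given node already meets the "1.4.4 or higher" requirement can be checked with a small version comparison. The sketch below is an illustration only: the function name is made up, and it compares just the dotted version, ignoring any release suffix such as -1.el5.

```shell
#!/bin/sh
# Sketch only: helper name is hypothetical, not from the original article.

# version_at_least CURRENT MINIMUM
# Succeeds when CURRENT >= MINIMUM, comparing dot-separated numeric fields
# from left to right. Anything after the first "-" in CURRENT is ignored.
version_at_least() {
    awk -v cur="${1%%-*}" -v min="$2" 'BEGIN {
        n = split(cur, c, "."); m = split(min, r, ".")
        len = (n > m) ? n : m
        for (i = 1; i <= len; i++) {
            a = c[i] + 0; b = r[i] + 0
            if (a > b) exit 0
            if (a < b) exit 1
        }
        exit 0
    }'
}
```

With the versions mentioned in this article, `version_at_least 1.4.2-1.el5 1.4.4` fails and `version_at_least 1.4.7-1 1.4.4` succeeds. On an RPM-based install, the current value could be obtained with something like `rpm -q --qf '%{VERSION}\n' ocfs2-tools`, assuming the tools package is installed under that name.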
Upgrading from OCFS2 version 1.4.2-1 to 1.4.7-1 does not require any on-disk format change. At a minimum, it is a simple kernel driver update which means the upgrade could be performed in a rolling manner. While this would avoid a cluster-wide outage, a full outage was still scheduled since it was unknown if cleaning out the orphan files using fsck.ocfs2 could be performed on a disk device that still had other cluster instances mounting it.
The following paper provides step-by-step instructions to upgrade an installation of Oracle Cluster File System 2 (OCFS2) 1.4 on the Linux platform.
To ensure orphan files are not being held on an OCFS2 cluster file system, the following script should be scheduled to run nightly through cron, as the root user account, from all nodes in the cluster.
The purpose of this script is to identify orphan files in an OCFS2 cluster file system and warn the DBA about them. The script queries the //orphan_dir name space using the debugfs.ocfs2 command from a clustered node to determine the number of orphaned files on that node.
ocfs2_check_orphaned_files.ksh
The ocfs2_check_orphaned_files.ksh script takes three parameters:
Disk device: Name of the disk device. For example: /dev/iscsi/thingdbfravol1/part1.
Slot number: Four-digit OCFS2 slot number that specifies which node to check using debugfs.ocfs2. In this cluster, node 1 is slot number 0000, node 2 is slot number 0001, node 3 is slot number 0002, and node n is slot number LPAD(n-1, 4, '0').
Maximum orphan files: Maximum number of orphaned files that can exist in the provided OCFS2 file system before this script issues a warning email.
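The logic behind such a check can be sketched as below. This is not the original ocfs2_check_orphaned_files.ksh; it is a minimal illustration with hypothetical function names. It assumes that each line of the debugfs.ocfs2 listing ends with the entry name, and it prints a warning instead of sending an email.

```shell
#!/bin/sh
# Sketch only: not the original ocfs2_check_orphaned_files.ksh.

# Count directory entries in a debugfs.ocfs2 "ls -l" listing read on stdin,
# skipping "." and "..". ASSUMPTION: the entry name is the last field.
count_orphans() {
    awk 'NF > 1 && $NF != "." && $NF != ".." { n++ } END { print n + 0 }'
}

# Print a warning and return 1 when count $1 exceeds the allowed maximum $2.
check_threshold() {
    count=$1 max=$2
    if [ "$count" -gt "$max" ]; then
        echo "WARNING: $count orphan files found (maximum allowed: $max)"
        return 1
    fi
    echo "OK: $count orphan files found (maximum allowed: $max)"
}

# Wire the pieces together: disk device, four-digit slot number, maximum count.
check_orphaned_files() {
    device=$1 slot=$2 max=$3
    count=$(debugfs.ocfs2 -R "ls -l //orphan_dir:$slot" "$device" | count_orphans)
    check_threshold "$count" "$max"
}
```

A call such as `check_orphaned_files /dev/iscsi/thingdbfravol1/part1 0000 5` mirrors the three parameters described above. The non-zero return status on a warning makes it easy to hook in whatever notification mechanism is already in place.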
For example, the script is scheduled nightly from two nodes in an Oracle RAC cluster, thing1 and thing2. Node 1 (thing1) passes in slot number 0000 while node 2 (thing2) passes in slot number 0001. The third parameter (5) specifies that at most five orphan files can exist in the provided OCFS2 cluster file system before the script issues a warning email.
Jeffrey Hunter is an Oracle Certified Professional, Java Development Certified Professional, Author, and an Oracle ACE. Jeff currently works as a Senior Database Administrator for The DBA Zone, Inc. located in Pittsburgh, Pennsylvania. His work includes advanced performance tuning, Java and PL/SQL programming, developing high availability solutions, capacity planning, database security, and physical / logical database design in a UNIX, Linux, and Windows server environment. Jeff's other interests include mathematical encryption theory, programming language processors (compilers and interpreters) in Java and C, LDAP, writing web-based database administration tools, and of course Linux. He has been a Sr. Database Administrator and Software Engineer for over 18 years and maintains his own website at: http://www.iDevelopment.info. Jeff graduated from Stanislaus State University in Turlock, California, with a Bachelor's degree in Computer Science.
Copyright (c) 1998-2012 Jeffrey M. Hunter. All rights reserved.
All articles, scripts and material located at the Internet address of http://www.idevelopment.info is the copyright of Jeffrey M. Hunter and is protected under copyright laws of the United States. This document may not be hosted on any other site without my express, prior, written permission. Application to host any of the material elsewhere can be made by contacting me at jhunter@idevelopment.info.
I have made every effort and taken great care in making sure that the material included on my web site is technically accurate, but I disclaim any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on it. I will in no case be liable for any monetary damages arising from such loss, damage or destruction.