Recovering from multiple Defunct Dead Disks (DDDs) in a ServeRAID environment

Note: If multiple drives are marked DDD in a cluster or fault-tolerant environment, begin troubleshooting with the cluster or fault-tolerant setup before continuing below. This is particularly indicated if all drives in an external enclosure are DDD.

The ServeRAID controller is designed to tolerate a single Defunct Dead Disk (DDD) failure when configured with redundant RAID levels. There is no guarantee that data will survive any multiple-disk failure intact. The following instructions provide the highest probability of a successful recovery.

When multiple drives are marked DDD in a RAID-1 or RAID-5 array, data is lost. Data recovery may be attempted by bringing all but the first drive that was marked DDD back to the online state.

  1. Initiate a rebuild to the first drive that was marked DDD. This drive has inconsistent data and must be rebuilt.
  2. Once data recovery has been attempted, identify a cause for the DDD drives, such as a bad cable, backplane, etc.
  3. If no cause can be found, physically replace the DDD drives one at a time so that data can be rebuilt after each replacement.
  4. Physically replace all single DDD drives.

Resource requirements

  • IBM ServeRAID Support CD
  • ServeRAID command line diskette (available on the IBM ServeRAID Support CD or can be downloaded)
  • ServeRAID-3x or ServeRAID-4x controller(s)
  • These procedures depend on certain logging functions enabled in the BIOS/firmware of the ServeRAID controller, first implemented in version 4.0 of the IBM ServeRAID Support CD. The ServeRAID controller must have been flashed with a 4.x version of the IBM ServeRAID Support CD or diskette(s) prior to the failure.
  • Click here for capturing configuration and event logs using DUMPLOG and click here for clearing event logs using CLEARLOG.BAT appropriate for your operating system.

Notes:

(1) The IBM ServeRAID Support CD software is backward compatible. If you are prompted to upgrade BIOS/firmware, select Cancel. Upgrading BIOS/firmware levels while your system has multiple DDD drives in a ServeRAID environment is not recommended unless otherwise directed by IBM support.
(2) Diskette images of the ServeRAID downloadable diskettes can be found on the IBM ServeRAID Support CD in the IMAGES directory. You can build a diskette of the ServeRAID Command Line utilities from this image. For more information, see the README.TXT file in the IMAGES directory on the IBM ServeRAID Support CD.

Recovery steps for ServeRAID systems with multiple disk failures

  1. Capture the ServeRAID logs using DUMPLOG.BAT. There are two methods of capturing these logs, depending on the situation. These logs should be sent to your IBM Support Center as necessary for root cause analysis. This is the best evidence to determine what caused the failure.

    Capturing ServeRAID logs method 1
    Use Method 1 if the operating system logical drive is off-line. Copy the DUMPLOG.BAT and CLEARLOG.BAT files to the root of the ServeRAID Command Line diskette or the A:\ directory. Boot to the ServeRAID Command line diskette and run the DUMPLOG command using the following syntax:

    DUMPLOG <FILENAME.TXT> <Controller#>
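    For example, assuming the logs are being captured from controller 1 into a file named DUMPLOG1.TXT (both the file name and the controller number are illustrative), run the following from the A:\ prompt:

    DUMPLOG DUMPLOG1.TXT 1

    Repeat the command with a different file name for each additional controller in the system.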
    Capturing ServeRAID logs method 2
    Use Method 2 when a data logical drive is off-line and the operating system is online. Copy or extract the DUMPLOG utility appropriate for your operating system to a local directory or folder.
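    For example, from a command prompt in the directory or folder containing the utility, the following captures the logs for controller 1 to a file named SERVLOG1.TXT (both names are illustrative):

    DUMPLOG SERVLOG1.TXT 1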

  2. Use ServeRAID Manager to determine the first DDD under the following two conditions:
    Operating system is accessible
    1. Open ServeRAID Manager and note the DDD drives
    2. In ServeRAID Manager, highlight the system hostname with DDDs
    3. Right-click the system hostname and choose Save printable configuration and event logs. These logs are saved into the installation directory of ServeRAID Manager, usually "Program Files\RAIDMAN." The log files are saved as RAIDx.LOG, where x is the controller number.
    4. Open the correct RAIDx.LOG text file in any standard text editor for the controller with DDDs. Go to the last page of the RAIDx.LOG file and you will see a list called ServeRAID defunct drive event log for controller x. This portion of the log lists, by date and time stamp, all the DDDs in the order in which they went off-line, with the most recent failure shown at the bottom of the list. Determine which hard disk drive failed first. The first DDD will be at the top of the list.

      Important: Since the "ServeRAID defunct drive event log" is not cleared, there may be entries from earlier incidents of defunct drives that do not pertain to the problem currently being worked. Review the list of DDDs from the bottom to the top and identify the point where the first drive associated with this incident is logged. The date and time stamps are usually the strongest indicators of where this point is.

      There is no guarantee that the "ServeRAID defunct drive event log" will list the drives in the exact order the disks failed under certain circumstances. One example is when an array is set up across multiple ServeRAID channels. In this configuration, the ServeRAID controller issues parallel I/Os to disk devices on each channel. In the event of a catastrophic failure, disks could also be marked defunct in parallel. This can reduce the reliability of the date and time stamps, as the ServeRAID controller writes events from multiple channels operating in parallel to a single log.

    5. Detach or remove the first DDD from the backplane or cable (after the system is powered off). This hard disk drive will need to be replaced.
    6. Exit ServeRAID Manager and power off the system
    Operating system is not accessible
    1. Reboot your system with the IBM ServeRAID Support CD in the CD-ROM drive
    2. Highlight Local in ServeRAID Manager, right click and select Save printable configuration and event logs. You will be prompted for a blank diskette to be inserted into Drive A:.
    3. Insert a diskette and ServeRAID Manager will save a RAIDx.LOG file to the diskette. The log files are saved as RAIDx.LOG where x is the controller number.
    4. Take the diskette from Drive A: to another working system and open the RAIDx.LOG text file using any standard text editor. Go to the last page of the RAIDx.LOG file and you will see a list called ServeRAID defunct drive event log for controller x. This portion of the log lists, by date and time stamp, all the DDDs in the order in which they went off-line, with the most recent failure shown at the bottom of the list. Determine which hard disk drive failed first. The first DDD will be at the top of the list.

      Important: Since the "ServeRAID defunct drive event log" is not cleared, there may be entries from earlier incidents of defunct drives that do not pertain to the problem currently being worked. Review the list of DDDs from the bottom to the top and identify the point where the first drive associated with this incident is logged. The date and time stamps are usually the strongest indicators of where this point is.

      There is no guarantee that the "ServeRAID defunct drive event log" will list the drives in the exact order the disks failed under certain circumstances. One example is when an array is set up across multiple ServeRAID channels. In this configuration, the ServeRAID controller issues parallel I/Os to disk devices on each channel. In the event of a catastrophic failure, disks could also be marked defunct in parallel. This can reduce the reliability of the date and time stamps, as the ServeRAID controller writes events from multiple channels operating in parallel to a single log.

    5. Detach or remove the first DDD from the backplane or cable (after the system is powered off). This hard disk drive will need to be replaced.
    6. Exit ServeRAID Manager and power off the system

  3. While the system is powered off, reseat the PCI ServeRAID controller(s). Reseat the SCSI cable(s) and the disks against the backplanes. Reseat the power cables to the backplane and SCSI backplane repeater options if they are present. As you are reseating the components, visually inspect each piece for bent pins, nicks, crimps, pinches or other signs of damage. Take extra time to ensure that each component snaps or clicks into place properly.
  4. Reboot the system with the IBM ServeRAID Support CD-ROM in the CD-ROM drive
  5. Using ServeRAID Manager, undefine all hot spare (HSP) drives to avoid an accidental rebuild from starting
  6. Using ServeRAID Manager, set each DDD in the failed array to an "Online" state (except the first drive marked DDD), as listed in the "ServeRAID defunct drive event log." The failed logical drives should change to a critical state. If there are problems bringing drives online, or if a drive initially goes online then fails to a Defunct state soon after, see the Hardware Troubleshooting section below before proceeding. The logical drives must be in a critical state before proceeding.
  7. Access the critical logical drive(s)
    1. If you are still in ServeRAID Manager, exit and restart the system
    2. If the operating system logical drive is now critical, attempt to boot into the installed operating system. (If you are prompted to perform any CHKDSK activities or file system integrity tests, choose to skip these tests)
    3. If the data is on the critical logical drive, boot into the operating system and attempt to access the data. (If you are prompted to perform any CHKDSK activities or file system integrity tests, choose to skip these tests)
    4. If the system boots into the operating system, run the appropriate command to do a READ-ONLY file system integrity check on each critical logical drive. If the file system checker determines the file system does not have any data corruption, the data should be good.

      For Windows NT or Windows 2000 systems, open a command prompt and run CHKDSK (read-only mode) with no repair parameters against each critical logical drive. If CHKDSK does not report data corruption, the data should be intact.
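      For example, if D: is a critical logical drive (the drive letter is illustrative), the following command runs in read-only mode because no repair switches are given:

      CHKDSK D:

      Do not add the /F or /R switches at this stage; both switches write changes to the disk, which should be avoided until the data has been backed up.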

    5. Locate the IPSSEND.EXE executable on the IBM ServeRAID Support CD or the command line diskette. Run IPSSEND GETBST to determine if the bad stripe table has incremented on any logical drive. If the Bad Stripe Table has incremented to one or more, the array should be removed and recreated.

      You can attempt to back up the data; however, you may encounter "file corrupted" messages for any files that had data on the stripe that was lost. This data is unfortunately lost forever from the current logical drive.
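      For example, assuming the DDD drives were attached to controller 1 (the controller number is illustrative), run the following from the directory containing IPSSEND.EXE:

      IPSSEND GETBST 1

      The command reports the bad stripe count for each logical drive on that controller; a count of zero on every logical drive indicates no stripes were lost.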

    6. Plan to restore or rebuild the missing data on each critical logical drive if any of the following problems persist:
      1. The critical logical drive remains inaccessible
      2. Data corruption is found on the critical logical drives
      3. The system continuously fails to boot properly to the operating system
      4. Partition information on the critical logical drives is unknown

  8. If the data is good, initiate a backup of the data.
  9. After the backup completes, replace the remaining physical DDD. An auto-rebuild should initiate when the DDD is replaced.
  10. Redefine hot spares as necessary.
  11. Capture another set of ServeRAID logs using DUMPLOG.
  12. Clear the ServeRAID logs using the following CLEARLOG.BAT command

    CLEARLOG <Controller#>
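    For example, to clear the event logs on controller 1 (the controller number is illustrative):

    CLEARLOG 1

    Repeat the command for each additional controller.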

    Click here for clearing event logs using CLEARLOG.BAT appropriate for your operating system.

  13. If a case was opened with your IBM Support Center for this problem, complete steps 14 and 15; otherwise, you have completed the recovery process.
  14. Plan to capture the ServeRAID logs again using DUMPLOG within 2-3 business days of normal activity (after the ServeRAID logs were cleared in step #12) to confirm the ServeRAID subsystem is fixed. These logs should be emailed to your IBM support center to ensure the ServeRAID controller and SCSI bus activities are operating within normal parameters. Click here for capturing configuration and event logs using DUMPLOG
  15. If additional issues are observed, an action plan will be provided with corrective actions and steps 12 and 13 should be repeated until the system is running optimally.
Hardware Troubleshooting

If you continue to experience problems, such as drives marked DDD again or a hard disk drive that will not change to an online state, review the following guidelines to assist in identifying the configuration or hardware problem.

The most common cause of multiple disk failures is poor signal quality across the SCSI bus. Poor signal quality results in SCSI protocol overhead as the bus tries to recover from these problems. As the system becomes busier and demand for data increases, the corrective actions of the SCSI protocol increase and the SCSI bus comes closer to saturation. This overhead will eventually limit normal device communication bandwidth and, if left unchecked, one or more SCSI devices may not be able to respond to the ServeRAID controller in a timely manner, resulting in the ServeRAID controller marking the hard disk drive DDD. These types of signal problems can be caused by improper installation of the ServeRAID controller in a PCI slot, poor cable connections, poor seating of hot-swap drives against the SCSI backplane, improper installation or seating of backplane repeaters, and improper SCSI bus termination.

There are many possible reasons for multiple disk failures; however, you should be able to isolate most hardware problems using the following isolation techniques:

  • Check error codes within the ServeRAID Manager when a device fails to respond to a command. Research these codes in the publicly available Hardware Maintenance Manuals.
  • In non-hot-swap systems, make sure the disk devices are attached to the cable starting from the connector closest to the SCSI terminator, working forward to the connectors closest to the controller. Also examine each SCSI device for the proper jumper settings.
  • While the server is powered off, reseat the ServeRAID controller in its PCI slot and all cables and disk devices on the SCSI bus.
  • Examine the cable(s) for bent or missing pins, nicks, crimps, pinches or over stretching.
  • Temporarily attach the disks to an integrated Adaptec controller (or PCI version as available) and boot into the Adaptec BIOS using CTRL-A. As the Adaptec BIOS POSTs, you should see all the expected devices and the negotiated data rates. You can review this information and determine if this is what you should expect to see.

Once you proceed into the BIOS, choose the SCSI Scan option which will list all the devices attached to the controller. Highlight and select one of the disks and initiate a "Media Test" (this is NOT destructive to data). This will test the device and the entire SCSI bus. If you see errors on the Adaptec controller, try to determine if it is the device or the cable by initiating a Media Test on other disks. Test both Online and Defunct disks, to determine if the test results are consistent with the drive states on the ServeRAID controller. You can also move Hot Swap disks to a different position on the backplane and re-test to see if the results change.

If the problem persists, swap out the SCSI cable and retry the Media Tests on the same disks. If the disks test okay, the previous cable is bad. This is a valuable technique to use as you isolate a failing component in the SCSI path.

Important: The Adaptec controller can produce results that vary from the ServeRAID controller because of the way Adaptec controllers negotiate data rates with LVD/SE SCSI devices. If an Adaptec controller detects errors operating at LVD speeds, it can downgrade the data rates to single-ended speeds and continue to operate with no reported errors. The ServeRAID controller will not necessarily change data rates under the same conditions.

Use the System Diagnostics in F2 Setup to test the ServeRAID subsystem. If the test fails, disconnect the drives from the ServeRAID controller and reset the controller to factory-default settings. Retry the diagnostic tests. If diagnostics pass, attach the disks to the ServeRAID controller one channel at a time and retry the tests to isolate the channel of the failing component. If the controller continues to fail diagnostic tests (when using the latest available diagnostics for the server), call your IBM support center for further assistance.

Disconnect or detach the first drive in the array to be marked defunct from the cable or backplane. Restore default settings to the ServeRAID controller. Attempt to import the RAID configuration from the disk drives. Depending on how the failure occurred, this technique may have mixed results. There is a reasonable chance that all the drives will return to an online state, except the first disk which is disconnected.

Move the ServeRAID controller into a different PCI slot or PCI bus and re-test.

When attaching an LVD ServeRAID controller to legacy storage enclosures, set the data rate for the channel to the appropriate data rate supported by the enclosure.

Mixing LVD SCSI devices with Single-Ended SCSI devices on the same SCSI bus will cause all devices on the channel to switch to Single-Ended mode and data rates.

Open a case with your IBM Support Center and submit all the sets of ServeRAID logs captured on the system for interpretation to isolate a failing component.