Chapter 9

Troubleshooting and Data Recovery

This chapter describes how to identify ZFS failure modes and how to recover from them. Steps for preventing failures are covered as well.

The following sections are provided in this chapter.

9.1 ZFS Failure Modes
9.2 Checking Data Integrity
9.3 Identifying Problems

9.1 ZFS Failure Modes

As a combined file system and volume manager, ZFS can exhibit many different failure modes. Before going into detail about how to identify and repair specific problems, it is important to describe the failure modes and how they manifest themselves under normal operation. This chapter begins by outlining the various failure modes, then discusses how to identify them on a running system, and finally how to repair the problems. There are three basic types of errors. Note that a single pool can suffer from all three at once, so a complete repair procedure involves finding and correcting one error, then proceeding to the next.

9.1.1 Missing Devices

If a device is completely removed from the system, ZFS detects that it cannot be opened and places the device in the FAULTED state. Depending on the data replication level of the pool, this may or may not result in the entire pool becoming unavailable. If one disk out of a mirror or RAID-Z device is removed, the pool will continue to be accessible. If all components of a mirror are removed, more than one device in a RAID-Z device is removed, or a single-disk, top-level device is removed, the pool will become FAULTED, and no data will be accessible until the device is re-attached.

9.1.2 Damaged Devices

The term 'damaged' covers a wide variety of possible errors. Examples include transient I/O errors due to a bad disk or controller, on-disk data corruption due to cosmic rays, driver bugs that transfer data to or from the wrong location, or simply another user accidentally overwriting portions of the physical device. In some cases these errors are transient, such as a random I/O error while the controller was having problems. In other cases the damage is permanent, such as on-disk corruption. Even so, permanent damage does not necessarily mean that the error is likely to occur again. For example, if an administrator accidentally overwrote part of a disk, no hardware failure has occurred, and the device does not need to be replaced. Identifying exactly what went wrong with a device is not an easy task, and is covered in more detail in a later section.

9.1.3 Corrupted Data

Data corruption occurs when one or more device errors (missing or damaged devices) affect a top-level virtual device. For example, one half of a mirror can experience thousands of device errors without ever causing data corruption, but if an error is encountered on the other side of the mirror in the exact same location, the result is corrupted data. Data corruption is always permanent and requires special consideration during repair. Even if the underlying devices are repaired or replaced, the original data is lost forever, and most often must be restored from backups. Data errors are recorded as they are encountered and can be controlled through regular disk scrubbing, explained below. When a corrupted block is removed, the next scrubbing pass notices that the corruption is no longer present and removes any trace of the error from the system.

9.2 Checking Data Integrity

There is no fsck(1M) equivalent for ZFS. This utility has traditionally served two purposes, data repair and data validation, described in the following sections.

9.2.1 Data Repair

With traditional file systems, the way in which data is written is inherently vulnerable to unexpected failures that cause data inconsistencies. Because the file system is not transactional, unreferenced blocks, bad link counts, and other inconsistent data structures are possible. The addition of journaling does solve some of these problems, but can introduce additional problems when the log cannot be rolled. With ZFS, none of these problems exist. The only ways for inconsistent data to exist on disk are hardware failure (in which case the pool should have been replicated) or a bug in the ZFS software. Given that fsck(1M) is designed to repair known pathologies specific to individual file systems, it is not possible to write such a utility for a file system with no known pathologies. Future experience may prove that certain data corruption problems are common and simple enough that a repair utility can be developed, but these problems can always be avoided by using replicated pools.

If your pool is not replicated, there is always the chance that data corruption can render some or all of your data inaccessible.

9.2.2 Data Validation

The other purpose of fsck(1M) is to validate that there are no problems with the data on disk. Traditionally, this is done by unmounting the file system and running the fsck(1M) utility, possibly bringing the system down to single-user mode in the process. This results in downtime proportional to the size of the file system being checked. Instead of requiring an explicit utility to perform the necessary checking, ZFS provides a mechanism for regular checking of all data. This functionality, known as scrubbing, is commonly used in memory subsystems and other systems as a method of detecting and preventing errors before they result in hardware or software failure.

9.2.3 Controlling Data Scrubbing

Whenever ZFS encounters an error, either through scrubbing or when accessing a file on demand, the error is logged internally so that the administrator can get a quick overview of all known errors within the pool.

9.2.3.1 Explicit Scrubbing

The simplest way to check your data integrity is to initiate an explicit scrub of all data within the pool. This operation traverses all the data in the pool exactly once and verifies that all blocks can be read. It proceeds as fast as the devices allow, though the priority of any I/O remains below that of normal operations. Scrubbing may negatively impact performance, though the file system should remain usable and nearly as responsive while the scrub is happening. To kick off an explicit scrub, use the zpool scrub command:

# zpool scrub tank

The status of the current scrub can be seen in zpool status output:

# zpool status -v
  pool: tank
 state: ONLINE
 scrub: scrub completed with 0 errors on Tue Nov 15 14:31:51 2005
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c1t0d0  ONLINE       0     0     0
            c1t1d0  ONLINE       0     0     0

Note that there can only be one active scrubbing operation per pool.

Performing regular scrubbing also guarantees continuous I/O to all disks on the system. Regular scrubbing has the side effect of preventing power management from placing idle disks in low-power mode. If the system is generally performing I/O all the time, or if power consumption is not a concern, then this can safely be ignored.
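If regular scrubbing is desired, it can be scheduled with cron(1M). The following root crontab entry is only a sketch; the pool name tank and the weekly Sunday 2:00 a.m. window are assumptions, and the interval should be chosen with the power-management trade-off above in mind:

```
# Hypothetical root crontab entry: start an explicit scrub of the
# pool "tank" every Sunday at 2:00 a.m.
0 2 * * 0 /usr/sbin/zpool scrub tank
```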

For more information on interpreting zpool status output, see 4.6 Querying Pool Status.
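The scrub: line in the status output can also be extracted from a script. The following helper is not part of ZFS; it is a minimal sketch that reads zpool status output on standard input, so its logic can be exercised against the sample output above without a live pool:

```shell
#!/bin/sh
# scrub_state: print the contents of the "scrub:" line from
# `zpool status` output supplied on standard input.
scrub_state() {
    awk -F': ' '/^ *scrub:/ { print $2 }'
}

# Against a live pool this would be: zpool status tank | scrub_state
# Here the sample output from this chapter is used instead:
scrub_state <<'EOF'
  pool: tank
 state: ONLINE
 scrub: scrub completed with 0 errors on Tue Nov 15 14:31:51 2005
EOF
```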

9.2.3.2 Scrubbing and Resilvering

Replacing a device initiates a resilvering operation that moves data from the good copies to the new device. Resilvering is a form of disk scrubbing, and therefore only one such action can occur in the pool at a given time. If a scrub is in progress, a resilvering operation suspends the current scrub and restarts it after the resilvering is complete.
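As a sketch of this interaction, the commands below replace one half of a mirror, which starts a resilver. The pool name tank and the device names c1t0d0 and c2t0d0 are hypothetical, and the DRYRUN wrapper (which only echoes the commands) is not part of ZFS; on a real system, set DRYRUN=0:

```shell
#!/bin/sh
# Sketch: replace a mirror half and watch the resulting resilver.
# DRYRUN=1 (the default here) only echoes each command, so the
# sequence can be read and tested without a live pool.
DRYRUN=${DRYRUN:-1}
run() {
    if [ "$DRYRUN" = 1 ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

run zpool replace tank c1t0d0 c2t0d0   # begins resilvering onto c2t0d0
run zpool status tank                  # the scrub: line reports resilver progress
```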

9.3 Identifying Problems

All ZFS troubleshooting centers around the zpool status command. This command analyzes the various failures seen in the system, identifies the most severe problem, and presents the user with a suggested action and a link to a knowledge article for more information. Note that the command only identifies a single problem with the pool, though multiple problems can exist. For example, data corruption errors always imply that one of the devices has failed, but replacing the failed device will not fix the corruption problems.

This section describes how to interpret zpool status output in order to diagnose the type of failure, and directs the user to one of the following sections for repairing the problem. While most of the work is done automatically by the command, it is important to understand exactly which problems are being identified.

9.3.1 Determining if Problems Exist

The easiest way to determine if there are any known problems on the system is to use the zpool status -x command. This command only describes pools exhibiting problems. If there are no bad pools on the system, then the command displays a simple message:

# zpool status -x
all pools are healthy

Without the -x flag, the command displays complete status for all pools (or the requested pool if specified on the command line), even if the pools are otherwise healthy.
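The fixed 'all pools are healthy' message makes zpool status -x convenient for scripted monitoring. The following helper is not part of ZFS; it is a minimal sketch that reads status text on standard input so it can be tested without a pool, with the alerting action left as a placeholder:

```shell
#!/bin/sh
# pool_health: succeed only if the status text reports all pools healthy.
pool_health() {
    grep -q '^all pools are healthy$'
}

# Against a live system this would be: zpool status -x | pool_health
# The sample healthy output from this chapter is used here instead:
if echo 'all pools are healthy' | pool_health; then
    echo "pools healthy"
else
    echo "pool problem detected"   # e.g., notify the administrator here
fi
```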

For more information on command line options to the zpool status command, see 4.6 Querying Pool Status.
