17 High Availability

This chapter discusses the concept of database availability and introduces you to Oracle Database high availability products and features.

Note:

Availability is influenced by many choices you make other than your database software: hardware, application and operating system software, storage media, network reliability, and operational processes are all important.

This chapter includes these topics:

Introduction to High Availability
Causes Of Downtime
Protection Against Computer Failures
Protection Against Data Failures
Avoiding Downtime During Planned Maintenance
Maximum Availability Architecture (MAA) Best Practices

See Also:

These books provide complete information about best practices for deploying a highly available environment and describes the Oracle products and features that support high availability.

Introduction to High Availability

Enterprises have used their information technology (IT) infrastructure to provide competitive advantage, increase productivity, and empower users to make faster and more informed decisions. However, with these benefits has come an increasing dependence on that infrastructure. Revenue and customers can be lost, penalties can be owed, and bad press can have a lasting effect on customers and a company's reputation. Building a high availability IT infrastructure is critical to the success and well being of all enterprises in today's fast moving economy.

Trends in computing technology are also enabling a new IT architecture, referred to as Grid computing, to be deployed. The Grid computing architecture effectively pools large numbers of servers and storage into a flexible, on-demand computing resource for all enterprise computing needs. Technology innovations like low-cost blade servers, small and inexpensive multiprocessor servers, modular storage technologies, and open source operating systems (such as Linux) provide the raw materials for the Grid. By harnessing these technologies and leveraging the Grid technology available in the Oracle Database, enterprises can deliver an extremely high quality of service to users while vastly reducing expenditures on IT. The Oracle Database enables you to capture the cost advantages of Grid enterprise computing without sacrificing performance, scalability, security, manageability, functionality, or system availability.

This chapter examines the causes of downtime and looks at the technology available in the Oracle Database that avoids costly downtime and enables rapid recovery from failures.

Causes Of Downtime

One of the challenges when designing a highly available IT Grid infrastructure is examining and addressing all the possible causes of downtime. Figure 17-1 shows a diagram that classifies system downtime into two primary categories: unplanned and planned downtime. It is important to consider the causes of both unplanned and planned downtime when designing a fault tolerant and resilient IT infrastructure.

Figure 17-1 Causes of Downtime

Description of "Figure 17-1 Causes of Downtime"

Unplanned downtime results from computer failures or data failures. Planned downtime is primarily due to data changes or system changes that must be applied to the production system. The following sections examine each of these causes of downtime and describes the Oracle technology you can apply to avoid downtime.

Protection Against Computer Failures

A computer failure occurs when the computer system or database server unexpectedly fails and causes a service interruption. In most cases, computer failures are due to hardware breakdown. These types of failures are best remedied by taking advantage of cluster technology and fast database crash recovery. The recommended solutions include Enterprise grids with Oracle Real Application Clusters (Oracle RAC), fast start fault recovery, Oracle Data Guard, and Oracle Streams.

This section includes the following topics:

Overview of Enterprise Grids with Oracle Real Application Clusters and Oracle Clusterware
Fast Start Fault Recovery
Oracle Data Guard
Oracle Streams

Overview of Enterprise Grids with Oracle Real Application Clusters and Oracle Clusterware

With Oracle RAC, the enterprise can build database servers across multiple systems that are highly available and highly scalable. In an Oracle RAC environment, Oracle Database runs on two or more systems in a cluster while concurrently accessing a single shared database. This provides a single database system that spans multiple hardware systems yet appears to the application as a single unified database system. This extends the following availability and scalability benefits for all of your applications:

Flexibility and cost effectiveness in capacity planning, so that a system can scale to any desired capacity on demand and as business needs change.
Fault tolerance within the cluster, especially computer failures.

The following list describes the features of an Oracle RAC environment:

Enterprise Grids—Oracle RAC enables enterprise Grids. Enterprise Grids are built out of large configurations of standardized, commodity-priced components: processors, servers, network, and storage. Oracle RAC is the only technology that can harness these components into a useful processing system for the enterprise. Oracle RAC and the Grid dramatically reduce operational costs and provide flexibility so that systems become more adaptive, proactive, and agile. Dynamic provisioning of nodes, storage, CPUs, and memory allow service levels to be easily and efficiently maintained while lowering cost still further through improved use. In addition, Oracle RAC is completely transparent to the application accessing the Oracle RAC database, thereby allowing existing applications to be deployed on Oracle RAC without requiring any modifications.

Scalability—Oracle RAC gives you the flexibility to add nodes to the cluster as the demand for capacity increases, scaling the system up incrementally to save costs and eliminating the need to replace smaller single node systems with larger ones. Grid pools of standard low-cost computers and modular disk arrays make this solution even more powerful with Oracle Database. It makes the capacity upgrade process much easier and faster because you can incrementally add one or more nodes to the cluster, compared to replacing existing systems with new and larger nodes to upgrade systems. The Cache Fusion technology implemented in Oracle RAC and the InfiniBand support provided in Oracle Database enable you to scale the capacity almost linearly, without making any changes to your application.

Fault Tolerance—Another key advantage of the Oracle RAC cluster architecture is the inherent fault tolerance provided by multiple nodes. Because the physical nodes run independently, the failure of one or more nodes will not affect other nodes in the cluster. Failover can happen to any node on the Grid. In the extreme case, an Oracle RAC system will still provide database service even when all but one node is down. This architecture allows a group of nodes to be transparently put online or taken offline, for maintenance, while the rest of the cluster continues to provide database service. Oracle RAC provides built in integration with the Oracle Application Server for failing over connection pools. With this capability, an application is immediately notified of any failure rather than having to wait tens of minutes for a TCP timeout to occur. The application can immediately take the appropriate recovery action. And Grid load balancing redistributes load over time.

Oracle Clusterware—Oracle RAC also provides a complete set of clusterware to manage the cluster. Oracle Clusterware provides all of the features required to run the cluster, including node membership, messaging services, and locking. Because Oracle Clusterware is a fully integrated stack with common event and management APIs, it can be centrally managed from Oracle Enterprise Manager. There is no need to purchase additional software to support your cluster, which helps avoid the additional efforts required to integrate and test third-party clusterware. Oracle Clusterware also provides the same interface and operates the same way across all of the platforms on which Oracle Database is available. While Oracle continues to support third-party clusterware for use with Oracle RAC, there is no need or advantage to using third-party clusterware.

You can extend the high availability capabilities of the Oracle Clusterware framework to your applications. That is, you can use the same high availability mechanisms of Oracle Database and Oracle RAC to make your custom applications highly available. You can use Oracle Clusterware to monitor, relocate, and restart your applications, thus allowing you to integrate and coordinate failover of your applications with database failover.

Services—Oracle RAC supports an entity referred to as a service that you can define to group database workloads and route work to the optimal instances that are assigned to offer the service. Services represent classes of database users or applications. You define and apply business policies to these services to perform tasks such as to allocate nodes for times of peak processing or to automatically handle a server failure. Using services ensures the application of system resources where and when they are needed to achieve business goals.

Fast Start Fault Recovery

One of the most common causes of unplanned downtime is a system fault or crash. System faults are the result of hardware failures, power failures, and operating system or server crashes. The amount of disruption these failures cause depends on the number of affected users and how quickly service is restored. High availability systems are designed to quickly and automatically recover from failures, should they occur. Users of critical systems look to the IT organization for a commitment that recovery from a failure will be fast and take a predictable amount of time. Periods of downtime longer than this commitment can have a direct effect on operations and lead to lost revenue and productivity.

Oracle Database provides very fast recovery from system faults and crashes. However, equally important to being fast is being predictable. The fast start fault recovery technology included in Oracle Database automatically bounds database crash recovery time and is unique to Oracle Database. The database self tunes checkpoint processing to safeguard the desired recovery time objective. This makes recovery time fast and predictable and improves the ability to meet service-level objectives. The Oracle fast start fault recovery feature can reduce recovery time on a heavily loaded database from tens of minutes to a few seconds.

See Also:

Oracle Database Performance Tuning Guide for information on fast start fault recovery

Oracle Data Guard

Oracle Data Guard ensures high availability, data protection, and disaster recovery for enterprise data. Data Guard maintains standby databases as transactionally consistent copies of the primary (production) database. Then, if the primary database becomes unavailable because of a planned or an unplanned outage, Data Guard can switch any standby database to the primary role, minimizing the downtime associated with the outage. Automated failover using Data Guard fast-start failover and fast application notification with integrated Oracle clients provides a high level of data protection and data availability.

Oracle Streams

You can use Streams to configure flexible high availability environments. With Oracle Streams, you can create a local or remote copy of a production database. In the event of human error or a catastrophe, the copy can be used to resume processing.

Protection Against Data Failures

A data failure is the loss, damage, or corruption of critical enterprise data. The causes of data failure are more complex and subtle than computer failure and can be caused by a failure of the storage hardware, human error, corruption, or site failure.

Figure 17-2 focuses on the four types of data failures: storage failure, human error, corruption, and site failure.

Figure 17-2 Downtime Due to Data Failures

Description of "Figure 17-2 Downtime Due to Data Failures"

It is extremely important to design a solution to protect against and recover from data failures. A system or network fault may prevent users from accessing data, but data failures without proper backups or recovery technology can result in either the recovery operation taking many hours to perform, or in lost data.

Oracle Database provides many data protection capabilities. The motivation for many of these enhancements is the new economics around data protection and recovery. Over the last twenty years, disk capacity has grown by three orders of magnitude while the cost per megabyte has fallen dramatically. This is a trend that shows no sign of abating. This has made the cost of disk storage competitive with tape as a backup media. Plus, disk storage has additional benefits of being online and able to provide random access to the data.

These trends allowed Oracle to rethink and make its recovery strategy hierarchical to take advantage of these economic dynamics. By making additional disk storage available to Oracle Database, you can reduce backup and recovery time from hours to minutes. In essence, you can trade inexpensive disk storage for expensive downtime.

This section includes the following topics:

Protecting Against Storage Failures
Protecting Against Human Errors

Protecting Against Storage Failures

Provisioning storage for a single-database instance, much less for an entire enterprise, can be complex. Typically, the process includes the following steps:

Estimate the amount of space needed
Map out what you hope will be an optimal layout (where to put data files, archive files, and so on to avoid hot spots)
Create logical volumes
Create file systems
Define and set up how you will protect and mirror your data
Define and implement your backup and recovery plan for the data
Install the Oracle software
Create the database

Then, the hard work begins—looking for hot spots that negatively affect performance; moving datafiles around to reduce contention, and dreading the day when a disk crash occurs or when you run out of space and must add more disks and shift all the files around to rebalance across your updated storage configuration.

Fortunately, that scenario is changing dramatically with the Automatic Storage Management (ASM) feature of Oracle Database. ASM provides a vertically integrated file system and volume manager directly in the Oracle kernel, resulting in much less work to provision database storage.

ASM provides a higher level of availability, without the expense, installation, and maintenance of specialized storage products, and provides unique capabilities for database applications. ASM spreads its files across all available storage for optimal performance, and it can mirror as well, providing protection against data loss. ASM extends the concept of SAME (stripe and mirror everything) and adds more flexibility in that it can do mirroring at the database file level instead of having to mirror at the entire disk level.

Most importantly, ASM eliminates the complexity associated with managing data and disks, and it simplifies the processes of setting up mirroring, adding disks, and removing disks. Rather than managing hundreds, possibly thousands of files (as in a large data warehouse) database administrators using ASM create and administer a larger-grained object, the disk group, which identifies the set of disks to be managed as a logical unit. The automation of the file naming and placement of the underlying database files save the DBAs time and ensures that best practice standards are followed.

Optionally, you can use the ASM native mirroring mechanism to protect against storage failures. Mirroring is enabled by default and triple mirroring is also available. With ASM mirroring, you can provide an additional level of data protection with the use of failure groups. A failure group is a set of disks sharing a common resource (disk controller or an entire disk array) whose failure can be tolerated.

Once defined, an ASM failure group intelligently places redundant copies of the data in separate failure groups to ensure that the data will be available and transparently protected against the failure of any component in the storage subsystem. In addition, ASM supports the Hardware Assisted Resilient Data capability (discussed below in the Protecting Against Data Corruptions section) to further protect your data.

See Also:

Oracle Database 2 Day DBA

Protecting Against Human Errors

Most research performed on the causes of downtime identifies human error as the single largest cause of downtime. Human errors—such as the inadvertent deletion of important data or when an incorrect WHERE clause in an UPDATE statement updates many more rows than were intended—must be prevented wherever possible and must be undone when the precautions against them fail. Oracle Database provides easy to use yet powerful tools that help administrators quickly diagnose and recover from these errors. It also includes features that allow end users to recover from problems without administrator involvement, reducing the support burden on the administrators and speeding recovery of the lost and damaged data.

The following sections describe the Oracle features that protect against human errors:

Guarding Against Human Errors
Oracle Flashback Technology
LogMiner SQL-Based Log Analyzer
Protecting Against Data Corruptions
Protecting Against Site Failures

Guarding Against Human Errors

The best way to prevent errors is to restrict a user's access to data and services they truly need to conduct their business. Oracle Database provides a wide range of security tools to control user access to application data by authenticating users and then allowing administrators to grant users only those privileges required to perform their duties. In addition, the security model of Oracle Database helps you restrict data access at a row level, using the Virtual Private Database (VPD) feature. This further isolates users from data they do not need access to.

Oracle Flashback Technology

When authorized people make mistakes, you need the tools to correct these errors. Oracle Database provides a family of human error correction technology called Flashback. Flashback revolutionizes data recovery. In the past, it might take minutes to damage a database but hours to recover it. With Flashback technology, the time to correct errors equals the time it took to make the error. It is also easy to use a single, short command to recover the entire database instead of following some complex procedure. Flashback is unique to Oracle Database and provides:

A SQL interface to quickly analyze and repair human errors.
Fine-grained surgical analysis and repair for localized damage, such as when the wrong customer order is deleted.
Correction of more widespread damage yet does it quickly to avoid long downtime, such as when all of this month's customer orders have been deleted.
Recovery at all levels including the row, transaction, table, tablespace, and database wide.

Table 17-1 describes how Flashback technology corrects human errors.

Table 17-1 Protecting Against Human Errors with Oracle Flashback Technology

Feature	Description
Flashback Query	Oracle Flashback Query allows you to query any data at some point in time in the past. You can use Flashback Query to view and reconstruct lost data that may have been deleted or changed by accident. For example: SELECT * FROM employee AS OF TIMESTAMP TO_TIMESTAMP('19-APR-05 02:00:00 PM') WHERE … This statement displays rows from the employee table as of 2:00pm on the specified date. Developers can use this feature to build self-service error correction into their applications, empowering end users to undo and correct their errors without delay rather than burdening administrators to perform this task. Flashback Query is simple to manage, because the database automatically keeps the necessary information to reconstruct data for a configurable time into the past.
Flashback Versions Query	The Flashback Versions Query provides a way to view changes made to the database at the row level. It is an extension to SQL and allows you to retrieve all of the different versions of a row across a specified time interval. For example: SELECT * FROM employee VERSIONS BETWEEN TIMESTAMP TO_TIMESTAMP('19-APR-05 02:00:00 PM') AND TIMESTAMP TO_TIMESTAMP('19-APR-05 03:00:00 PM') WHERE … This statement displays each version of the row, with each row changed by a different transaction between 2:00 and 3:00 p.m. on 19 April. This helps you to pinpoint when and how data is changed and trace it back to the user, application, or transaction, and tracks down the source of a logical corruption in the database and correct it. Flashback Versions Query also helps application developers debug code.
Flashback Transaction	Oracle Flashback Transaction backs out a transaction and its dependent transactions. The `DBMS_FLASHBACK.TRANSACTION_BACKOUT()` procedure rolls back a transaction and its dependent transactions while the database remains online. This recovery operation uses undo data to create and execute the compensating transactions that return the affected data to its original state. You can query the `DBA_FLASHBACK_TRANSACTION_STATE` view to see the current state of a transaction with respect to whether the transaction has been backed out using dependency rules or forced out by either: Backing out nonconflicting rows Applying undo SQL Oracle Flashback Transaction increases availability during logical recovery by easily and quickly backing out a specific transaction or set of transactions and their dependent transactions, with one command while the database remains online.
Flashback Transaction Query	Flashback Transaction Query provides a way to view changes made to the database at the transaction level. It is an extension to SQL that allows you to see all changes made by a transaction. For example: SELECT * FROM FLASHBACK_TRANSACTION_QUERY WHERE XID = '000200030000002D'; This query shows all of the resultant changes made by this transaction. In addition, compensating SQL statements are returned and can be used to undo changes made to all rows by this transaction. Using a precision tool like this, the database administrator and application developer can precisely diagnose and correct logical problems in the database or application.
Flashback Database	To bring an Oracle database to a previous point in time, the traditional method is to do point-in-time recovery. However, point-in-time recovery can take hours, or even days, because it requires the whole database to be restored from backup and recovered to the point in time just before the error was introduced into the database. With the size of databases constantly growing, it will take hours or even days just to restore the whole database. Flashback Database is a strategy for doing point in time recovery. It quickly rewinds an Oracle database to a previous time to correct any problems caused by logical data corruption or user error. It uses flashback logs to capture old versions of changed blocks. This is similar to a continuous backup or storage snapshot. When you need to perform recovery, the Flashback logs are quickly replayed to restore the database to a point in time before the error and just the changed blocks are restored. Flashback Database is fast and reduces recovery time from hours to minutes. For example, issue the following command to recover a database to 2:05 p.m.: FLASHBACK DATABASE TO TIMESTAMP TO_TIMESTAMP('19-APR-05 02:05:00 PM');
Flashback Table	Flashback Table recovers a table or a set of tables to a specified point in time in the past. In many cases, Flashback Table alleviates the need to perform more complicated point-in-time recovery operations. For example: FLASHBACK TABLE orders, order_items TIMESTAMP TO_TIMESTAMP('07-APR-2005 02:33:00 PM'); This command rewinds any updates to the `orders` and `order_items` tables that were made between the current time and the specified timestamp in the past. Flashback Table performs this operation online and in place and it maintains any referential integrity constraints between the tables. Flashback Table is very similar to having a rewind or undo button for a table, or a set of related tables.
Flashback Drop	Dropping, or deleting, database objects by accident is a mistake people make, and probably always will make, thinking they were connected to the test database when they actually deleted that table. Users soon realize their mistake but by then it is too late and there is no way to easily recover the dropped tables and its indexes, constraints, and triggers. Objects once dropped are lost forever. For important tables or other objects (like indexes, partitions or clusters), you need to perform a point-of-time recovery, which may be very time consuming and lead to loss of recent transactions. Flashback Drop provides a safety net when dropping objects in Oracle Database. When a user drops a table, Oracle Database places it in a recycle bin. The recycle bin is a virtual container where all dropped objects reside. Objects remain in the recycle bin until you decide to permanently remove the objects or until the space pressure is placed on the tablespace containing the table. You can undrop the dropped table and its dependent objects from the Recycle Bin. For example, the table employee and all of its dependent objects would be undropped using the following command: FLASHBACK TABLE employee TO BEFORE DROP;
Flashback Restore Points	When an Oracle database point-in-time recovery operation is required, you must determine a time or System Change Number (SCN or transaction time) to which the data must be rolled back. Oracle Database uses restore points, which is a user-defined label that can be substituted for an SCN or clock time when used in conjunction with Flashback Database, Flashback Table, and Recovery Manager (RMAN). Restore points provides the ability to bookmark a known time when the database was in a good state, allowing quick and easy rewind of an inappropriate action done to the database. It also provides the ability to flashback through a previous database recovery and open resetlogs. Guaranteed restore points allow major database changes to be quickly undone (for example, database batch job, upgrade, or patch) by ensuring that the undo required to rewind the database is retained. Restore points provide an easy way to rewind back to a known time. In an Oracle Data Guard environment, this capability also allows a physical standby database that has been opened read-write to later flash back the changes and later convert the database back to a physical standby database that is synchronized with the production database. If a logical error is discovered after a switchover operation, the primary and standby databases can be flashed back to an SCN or a point in time prior to the switchover operation. In addition, to quickly synchronize the standby database with the production database, you can apply an incremental backup of the production database to the standby database instead of applying all the redo data generated since the two databases diverged. This can significantly reduce the time to resynchronization the two databases.
Block recovery using flashback logs	Block recovery can optionally retrieve a more recent copy of a data block from the flashback logs to reduce recovery time. Furthermore, a corrupted block encountered during instance recovery does not result in instance recovery failing. The block is automatically marked as corrupt and added to the RMAN corruption list in the `V$DATABASE_BLOCK_CORRUPTION` table. You can subsequently issue the RMAN `RECOVER BLOCK` command to fix the associated block.
Flashback Data Archive	A Flashback Data Archive provides the ability to track and store all transactional changes to a table over its lifetime. It is no longer necessary to build this intelligence into your application. A Flashback Data Archive is useful for compliance with record stage policies and audit reports. See or complete information.

See Also:

Chapter 13, "Data Concurrency and Consistency" for information on Oracle Flashback Query
Chapter 15, "Backup and Recovery" for information on Oracle Flashback Database and Oracle Flashback Table
Oracle Database Advanced Application Developer's Guide for more information on Flashback Transaction

LogMiner SQL-Based Log Analyzer

Oracle log files contain a wealth of useful information about the activities and history of an Oracle database. Log files contain all of the data needed to perform database recovery. They also record every change made to data and metadata in the database. LogMiner is a relational tool that allows redo log files to be read, analyzed, and interpreted using SQL. You can use analysis of the log files with LogMiner to track or audit changes to data, provide supplemental information for tuning and capacity planning, retrieve critical information for debugging complex applications, or recover deleted data.

See Also:

"Overview of LogMiner"

Protecting Against Data Corruptions

A corruption is created by a faulty component in the I/O stack. For example, the database issues I/O operations as the result of an update transaction. The database I/O is passed to:

The I/O code in the operating system
The file system
The volume manager
The device driver
The host bus adapter
The storage controller
The disk drive to which the I/O is finally written

Bugs or a hardware failure in any component in the I/O sequence could flip some bits in the data resulting in corrupt data being written to the database. This corruption could be due to database control information or user data either of which could be catastrophic to the functioning and availability of the database. Similarly, a disk failure could damage database files requiring backups be used to recover the database.

Protection against data corruptions include the following topics:

Oracle Hardware Assisted Resilient Data (HARD)
Lost Writes
Data Recovery Advisor
Flash Backup And Recovery

Oracle Hardware Assisted Resilient Data (HARD)

The Oracle Hardware Assisted Resilient Data (HARD) prevents data corruptions before they happen. Data corruptions while rare, can have a catastrophic effect on a database, and therefore a business. By implementing the Oracle data validation algorithms inside storage devices, Oracle prevents corrupted data from being written to the database files on persistent storage. This type of end-to-end high-level software to low-level hardware validation is a unique capability provided by Oracle and its storage partners.

Oracle validates and adds protection information to the database blocks and this protection information is further validated by the storage device. HARD detects corruptions from being introduced into the I/O path between the database and storage, and eliminates a large class of failures that the database industry has previously been powerless to prevent. RAID has gained a wide following in the storage industry by ensuring the physical protection of data, HARD takes data protection to the next level by going beyond protecting physical bits, to protecting business data.

See Also:

http://www.oracle.com/technology/deploy/availability/htdocs/HARD.html for more information about HARD

Lost Writes

A lost write is another form of data corruption, but it is much more evasive to detect and repair quickly. A data block lost write occurs when:

An I/O subsystem acknowledges the completion of the block write, while in fact the write did not occur in the persistent storage. On a subsequent block read, the I/O subsystem returns the stale version of the data block, which might be used to update other blocks of the database, thereby corrupting it.
The write I/O completed but it was written somewhere else, and a subsequent read operation returns the stale value.
A read I/O from one cluster node returns stale data after a write I/O on another node. For example, this could occur if an NFS caching policy is incompatible with Oracle RAC.

Data block corruption prevention and detection

Prior to Oracle Database 11g, block corruptions detected by RMAN were recorded in V$DATABASE_BLOCK_CORRUPTION. In Oracle Database 11g, several database components and utilities, including RMAN, can now detect a corrupt block and record it in that view. Oracle Database automatically updates this view when block corruptions are detected or repaired (for example, using block media recovery or data file recovery). The benefit is that the time it takes to discover block corruptions is shortened.

Data Recovery Advisor

Data Recovery Advisor automatically diagnoses persistent (on disk) data failures, presents appropriate repair options, and runs repair operations at your request. It includes the following functionality:

Failure diagnosis
Failure impact assessment
Repair generation
Repair feasibility checks
Repair automation
Validation of data consistency and database recoverability
Early detection of corruption
Integration of data validation and repair

Note that the initial release of Data Recovery Advisor does not support Oracle RAC. In addition, while you can use Data Recovery Advisor to manage a primary database in a Data Guard configuration, you cannot use Data Recovery Advisor to troubleshoot a physical standby database. Data Recovery Advisor only takes the presence of a standby database into account when recommending repair strategies if you are using Oracle Enterprise Manager 11g Grid Control.

Flash Backup And Recovery

There is no substitute for backups of enterprise data. Although rare, multiple failures can render even data mirrored in the storage subsystem unusable. Fortunately, Oracle provides online tools to properly backup all your data, to restore data from a previous backup, and to recover changes to that data up to the time just before the failure occurred.

Backing up a large database system is no simple task. A large database can be composed of hundreds of files spread over many different disks. Neglecting to backup a critical file can render the entire database backup unusable. Often these damaged files are not discovered until they are needed. Recovery Manager (RMAN) is a tool that manages the backup, restore, and recovery process for Oracle Database. It creates and maintains backup policies, and catalogs all backup and recovery activities. All data blocks can be analyzed for corruption during backup and restore to prevent propagation of corrupt data through to the backups. Most importantly, Recovery Manager ensures all necessary data files are backed up, and the database is recoverable.

Recovery Manager automatically keeps track of the files needed to restore the database within a user-specified window. It can automatically restart interrupted operations, handle corrupted log files, and restore an individual data block while the remainder of the database remains online.

RMAN radically enhances database backup and recovery. RMAN can automatically manage backing up and recover all of your data to the Flash Recovery Area. The Flash Recovery Area is a unified disk-based storage location for all recovery related files and activities in an Oracle database. Making backups to disk, instead of tape, enables faster backups. But more importantly, if database media recovery is required, then datafile backups are readily available, radically speeding database recovery time.

Recovery Manager manages the recovery files in the Flash Recovery Area. RMAN automatically creates all backups in the Flash Recovery Area and manages the space. The archiver writes redo data to the Flash Recovery Area and RMAN automatically deletes, or moves to tape, obsolete backups and archived redo logs that are no longer required. If you set the RETENTION POLICY to a recovery window of 7 days, then RMAN retains all backups required to recover the database 7 days back. If you require recovery to a time further than 7 days in the past, RMAN restores the data from tape. Oracle Enterprise Manager provides a complete interface to drive Flash Backup and Recovery, including implementing best practices.

Incremental backups have been part of RMAN since it was first released in the Oracle8 Database. Incremental backups record only the changed blocks since the previous backup. Oracle Database delivers the ability for faster incremental backups with the implementation of block change tracking. Oracle Database tracks the physical location of all database changes. RMAN automatically uses this change tracking information to determine which blocks need to be read during an incremental backup and directly accesses the block to back it up. The incremental backups can then be merged into a previously created image backup to minimize the time for recovery. A backup strategy based on incrementally updated backups keeps the time required for media recovery to a minimum. By making incremental backups with change tracking part of your backup strategy, you can: reduce the amount of time needed for daily backups, save network bandwidth when backing up over a network, recover unlogged changes to database, reduce the backup file storage, and reduce the time for database recovery.

Oracle Database backup and recovery also includes many other innovative capabilities

Compressed backups
Automated failover to a previous backup when restore discovers a missing or corrupt backup
Automated recovery through a previous point in time recovery, or recovery through resetlogs
Automated creation of new database and temporary files during recovery
Automated channel failover on backup or restore
Automated tablespace point-in-time recovery
Full database BEGIN BACKUP command for faster mirror split
Faster recovery due to improved recovery parallelism (2 to 4 times)
Tablespace Rename
Proxy (third party) backup for archived redo logs
Time window based throttling of backups
Cross Platform Transportable Tablespaces

See Also:

Chapter 15, "Backup and Recovery" for information on backup and recovery solutions
Oracle Database Backup and Recovery User's Guide for information on RMAN and backup and recovery solutions

Protecting Against Site Failures

Data protection features provide protection from catastrophic events that cripple processing at a site for an extended period of time. Examples include file corruptions, natural disasters, power and communication outages, and even terrorism. Oracle Database offers a variety of data protection solutions that provide the ability to create and maintain a local or remote copy of a production database. In the event of a corruption or disaster, users of the data can continue to function by accessing the remote database.

The simplest form of data protection is off-site storage of database backups. In the event a data center is unable to resume services in a reasonable amount of time, the backups can be restored on a system at another site, and users can connect to the backup system. Unfortunately, restoring backups on another system will be time consuming, and the backup may not be completely up to date. To more quickly recover and maintain continuous database service even in the event of a disaster, Oracle provides Data Guard, which is described in the following topics:

Oracle Data Guard and standby databases
Zero Data Loss Redo Transport
Real-Time Apply and Flashback Database
Data Guard Broker
Fast-Start Failover

Oracle Data Guard and standby databases

Data Guard should be the foundation of any Oracle Database disaster recovery plan. Data Guard provides the ability to set up and maintain a standby copy of your production database. You can locate the standby database a half a world away from the production database or in the same data center. Data Guard includes enhancements to automate complex tasks and provide significant monitoring, alerting and control mechanisms. It enables your database to survive a data center disaster. Data Guard also works transparently across Grid clusters and you can add the servers dynamically to the standby database in the event a failover is required.

Oracle Data Guard supports physical standby databases that use Redo Apply technology, snapshot standby databases, and logical standby databases that use SQL Apply technology:

Physical standby databases and Redo Apply

Data Guard in Redo Apply mode maintains a copy of a production database, called a physical standby database, and keeps it synchronized with the production database. The redo data from the primary database are shipped to the standby and physically applied via media recovery. The standby database is physically identical to the primary (although it may lag the primary). Additionally, you can open the standby database in read-only mode while redo is being applied so the database can also be used to off load reporting work from the production database. Backup processing may also be offloaded from the production database as backups created at the standby database can be used to perform recovery of the production database.

Physical standby databases are good for providing protection from disasters and data errors. In the event of an error or disaster, the physical standby can be opened, and be used to provide data services to applications and end users. Because the efficient media recovery mechanism is used to apply changes to the standby database, it is supported with every application, and can easily and efficiently keep up with even the largest transaction workloads.
Snapshot standby databases

A snapshot standby database is an updatable standby database that you create from a physical standby database. A snapshot standby database receives and archives redo data received from the primary database, but the snapshot standby database does not apply redo data from the primary database while the standby is open read-write. Because of this, the snapshot standby typically diverges from the primary database over time. Moreover, local updates to the snapshot standby database cause additional divergence.

Redo data is not applied until you convert the snapshot standby database back into a physical standby database, and after all local updates to the snapshot standby database are discarded. With a single command, you can revert a snapshot standby back to a physical standby database, at which time the changes made to the snapshot standby state are discarded, and Redo Apply automatically resynchronizes the physical standby database with the primary database using the redo that was archived.
Logical standby databases and SQL Apply

Data Guard in SQL Apply mode takes redo data received from the primary database, transforms the redo into SQL transactions, and applies them to an open standby database. Although a logical standby database can be physically different from the primary database, it is logically the same as the primary and it can take over processing if the primary database is destroyed. Because transactions are applied using SQL to an open database, the standby can be used concurrently for other tasks, and can have a different physical structure than the production database. For example, the logical standby can be used for decision support, and be optimized for reporting by using additional indexes and materialized views that do not exist on the primary database.

SQL Apply is most importantly a data protection feature because it compares before-change values in the log files to the before-change values in the logical standby database providing a check against logical corruption. A logical standby database can therefore offer protection from the widest possible range of corruptions.

Because logical standby databases are open for read/write I/O during recovery, you can query the standby database while SQL Apply applies changes in the redo logs.

Zero Data Loss Redo Transport

Both physical and logical standby databases use the same transport services. Data Guard offers customers the choice of synchronous and asynchronous transport methods. Data Guard synchronous transport services provide zero data loss protection insuring that if a disaster should strike the primary database, the redo data necessary to preserve all previously committed transactions is available at the standby site.

You can also choose to transmit redo data asynchronously to the standby site. This minimizes any potential data loss while providing optimal performance over large distances, and provides protection from network failures.

Real-Time Apply and Flashback Database

With real-time apply, Data Guard apply services can apply redo data on the standby database as soon as it is received from the primary database, without waiting for the current log file to be archived at the standby database. This enables standby databases to be closely synchronized with the primary database, enabling up to date and real-time reporting. This also enables faster switchover and failover times, which in turn reduces planned and unplanned downtime for the business. You may also choose to use Flashback Database on both the primary and standby database to quickly revert the databases to an earlier point in time to back out user errors. Alternatively, if you decide to failover to a standby database, but those user errors were already applied to the standby database (because real-time apply was enabled), you can flash back the standby database to a safe point in time. The use of these two features eliminate the tradeoff sometimes necessary to ensure the standby database stays current and delays applying redo data as way of preventing human errors on the production database from propagating to the standby database.

Data Guard Broker

The primary and standby databases, as well as their various interactions, may be managed by using SQL*Plus. For easier manageability, Data Guard also offers the Data Guard Broker distributed management framework, which automates and centralizes the creation, maintenance, and monitoring of a Data Guard configuration. You can use either Oracle Enterprise Manager or the Broker's own specialized command-line interface (DGMGRL) to take advantage of the Broker's management capabilities. From the easy-to-use Oracle Enterprise Manager GUI, you can initiate failover processing with a single mouse click from the primary to either type of standby database. The Broker and Oracle Enterprise Manager make it easy to manage and operate the standby database. By facilitating activities such as failover and switchover, the possibility of errors is greatly reduced.

Fast-Start Failover

Fast-start failover enables the creation of a fault-tolerant standby database environment by providing the ability to totally automate the failover of database processing from the production to standby database, without any human intervention. Fast-start failover automatically, quickly, and reliably fails over to a designated, synchronized standby database in the event of loss of the primary database, without requiring administrators to perform complex manual steps to invoke and implement the failover operation. This greatly reduces the length of an outage. You set up fast-start failover using either Oracle Enterprise Manager or the Data Guard Broker. The Observer, which monitors the Data Guard environment, automatically triggers and completes the failover when required. After a fast-start failover occurs, the old primary database, upon reconnection to the configuration, is automatically reinstated as a new standby database by the Broker. This enables the Data Guard configuration to restore disaster protection in the configuration easily and quickly, improving the robustness of the Data Guard configuration. With these capabilities, Data Guard not only helps maintain transparent business continuity, but also reduces the management costs for the disaster-recovery configuration.

See Also:

Avoiding Downtime During Planned Maintenance

Planned downtime can be just as disruptive to operations, especially in global enterprises that support users in multiple time zones. In this case it is important to design a system to minimize planned interruptions. Planned downtime includes routine operations, periodic maintenance, and new deployments.

Routine operations are frequent maintenance tasks that include backups, performance management, user and security management, and batch operations. Periodic maintenance, such as installing a patch or reconfiguring the system, is occasionally necessary to update the database, application, operating system, middleware, or network. New deployments may be necessitated by major upgrades to the hardware, operating system, database, application, middleware, or network. It is important to consider not only the time to perform the upgrade, but also the effect the changes may have on the overall application.

The Internet has made it easy to share data globally, but brings new challenges and requirements for data availability. As global users access data 24 hours per day, maintenance windows have all but evaporated. Planned downtime is becoming as disruptive as unplanned downtime. There are no longer any windows of time during which users are not affected. When the volume of data stored in a database becomes very large, maintenance operations can be quite time consuming. It is important that these operations be performed without affecting the users of the data.

This section includes the following topics:

Avoiding Downtime for Data Changes
Avoiding Downtime for System Changes

Avoiding Downtime for Data Changes

Figure 17-3 focuses on the three types of downtime due to data changes: online schema and data reorganization, partitioned tables and indexes, and dynamic resource provisioning.

Figure 17-3 Downtime Due to Data Changes

Description of "Figure 17-3 Downtime Due to Data Changes"

This section includes the following topics:

Online Schema and Data Reorganization
Partitioned Tables and Indexes
Dynamic Resource Provisioning

Online Schema and Data Reorganization

Oracle Database allows you to perform most maintenance operations without disrupting database operations or data availability to users. Indexes can be added, rebuilt, or defragmented while the database is online and while end users are reading or updating data. Similarly, you can relocate or defragment tables online. Tables can be redefined, changing table types, adding, dropping or renaming columns, and changing storage parameters without interrupting end users who are viewing or updating the underlying data. Oracle Database capabilities include:

Support for easy cloning of indexes, grants, constraints, and other characteristics of the table
Conversion from the long to LOB datatype online
Allowing unique indexes instead of requiring a primary key
Ability to the modify the bodies of PL/SQL stored procedures and views that are referenced through synonyms, without recompiling other PL/SQL packages that reference them
Ability to perform an online segment shrink to compact the space used in a segment online and in place, allowing space management to be performed without impacting system or data availability.

Partitioned Tables and Indexes

As databases grow larger, they may become extremely cumbersome to manage. The ability to partition database tables and indexes allows administrators to divide large tables up into smaller, more manageable pieces. While most operations and schema changes can be made online, partitioning allows maintenance tasks to be performed one partition at a time. This allows the bulk of the data to be unaffected during maintenance. In addition, partitions enable the use of parallel execution to perform most operations much faster.

Another benefit of partitions is fault containment. A failure, such as a media failure or corruption, is contained to partitions resident on the failed disk. Only that partition is affected and needs to be recovered. This not only reduces the time to recover, but allows the other unaffected partitions to remain online while the failed partition is recovered.

Often, not all data in a large table has the same access characteristics. Pending orders may be accessed more frequently than closed orders, or analysis of last quarter's sales may be more common than analysis of sales from a quarter three years ago. Partitioning allows for intelligent storage management of data. Frequently accessed data can be stored on the fastest disks, and heavily accessed data can be striped across many drives.

Dynamic Resource Provisioning

Oracle Database continues to broaden support for dynamic reconfiguration enabling it to adapt to changes in demand and hardware with no disruption of service. Oracle Database dynamically accommodates changes to hardware configurations such as:

Add and remove of processors from an SMP server
Add and remove nodes in an Oracle RAC cluster
Dynamically grow and shrink its shared memory allocation and automatically tune memory online
Add and remove database disks online without disturbing database activities
Automatically rebalance IO load across the database storage
Move datafiles online

These capabilities eliminate the impact of system changes and provide true capacity on demand provisioning that is a fundamental requirement of enterprise Grid computing.

Avoiding Downtime for System Changes

Planned system changes occur when you perform routine and periodic maintenance operations and new deployments. Planned system changes include any scheduled changes to the operating environment that occur outside of the organizational data structure in the database.The service level impact of planned system changes varies significantly depending on:

The nature and scope of the planned outage
The testing and validation efforts made prior to implementing the change
The technologies and features in place to minimize the impact

This section includes the following topics:

Rolling Patch Updates
Rolling Release Upgrade

Rolling Patch Updates

Oracle Database supports the application of patches to the nodes of an Oracle RAC system in a rolling fashion. To perform a rolling upgrade involves performing the following high-level steps:

Copy the patch software to the primary upgrade node
Shut down Oracle RAC instances on the upgrade nodes
Stop all Oracle processes on the upgrade nodes
Start OUI and complete the upgrade process on the upgrade nodes

An Oracle RAC system runs with all nodes actively processing transactions on the behalf of database clients. Step 1 of the patch application procedure is to quiesce the first instance to which the patch is to be applied. In step 2, you use an Oracle patch tool (Opatch) to apply the patch to the quiesced instance (for example, the Oracle Home for instance 1 is updated). In step 3, you reactivate the patched instance which then rejoins the cluster. The Oracle RAC system is now running with one instance at a higher maintenance level than the other nodes in the cluster.

An Oracle RAC system can run in this mixed mode for an arbitrary period to ensure the patch corrects the original problem, and has not introduced some other problem. This procedure is then repeated for the remaining nodes in the cluster. When all nodes in the cluster have been patched, the rolling patch update is complete and all nodes are running the same version of the Oracle Database software. In addition, Opatch has the ability to rollback the application of a patch. If you observe some aberrant behavior on the updated instance, you can either uninstall the offending patch or roll it back, without forcing a clusterwide outage. The rollback procedure is the same as the patch apply procedure, but in this case the Opatch utility removes a previously applied patch.

Rolling Release Upgrade

Oracle Database supports the installation of database software upgrades, and the application of patchsets, in a rolling fashion—with near zero database downtime—by using Data Guard SQL Apply and logical standby databases.

Oracle Clusterware also supports upgrades in a rolling fashion for software upgrades and application of patchsets. Additionally, once on Oracle Database 11g, ASM supports rolling release upgrades.

Physical standby database users can elect to add a logical standby database to the configuration for the duration of a rolling upgrade, or they can use a transient logical standby database to avoid creating a second standby database. A transient logical standby is a new feature for Data Guard in Oracle Database 11g that allows you to temporarily convert a physical standby database to a logical standby database for the purpose of executing a rolling database upgrade. Then, once the upgrade is complete, you can convert it back into a physical standby database. Using a transient logical standby database avoids the need to create a separate logical standby database to perform upgrades.

See Also:

Oracle Data Guard Concepts and Administration for step-by-step instructions about performing a rolling upgrade with a transient logical standby database

By supporting rolling upgrades and rolling patch updates, Oracle has eliminated a large portion of the maintenance windows database administrators reserve for administrative tasks, and enables 24x7 operation of their enterprise.

Maximum Availability Architecture (MAA) Best Practices

Operational best practices are the key to a successful implementation of IT infrastructure. Technology alone is not enough. Oracle's Maximum Availability Architecture (MAA) is a fully integrated and proven blueprint for building highly available systems. The MAA blueprint details the combined use of key Oracle Database features for high availability including Oracle RAC and Oracle Clusterware, Data Guard, Recovery Manager, Flashback Technologies, Streams, and Enterprise Manager.

Conceptualized with the philosophy that designing an HA system involving Oracle technologies should not be complex, and should not involve guesswork, the MAA design and configuration recommendations have been extensively reviewed and tested to ensure optimum system availability and reliability. MAA also addresses the configuration and integration of other critical components of highly available systems including servers, storage, networking, and the application server. Enterprises that base their system architecture on MAA can more quickly and efficiently achieve business requirements for high availability and data protection.

You can find more information on MAA at http://otn.oracle.com/deploy/availability/htdocs/maa.htm