Oracle® Database High Availability Best Practices
10g Release 2 (10.2)

Part Number B25159-01

2 Configuring for High-Availability

This chapter describes Oracle configuration best practices for Oracle Database and related components.

This chapter contains these topics:

2.1 Configuring Storage

This section describes best practices for configuring a fault-tolerant storage subsystem that protects data while providing manageability and performance. These practices apply to all Oracle Database high-availability architectures described in Oracle Database High Availability Overview.

This section contains these topics:

2.1.1 Evaluate Database Performance Requirements and Storage Performance Capabilities

Characterize your database performance requirements under different application workloads. Capture statistics for each target workload by taking statistical snapshots at the beginning and end of the workload period. Example target workloads include:

  • Average load

  • Peak load

  • Batch processing

The necessary statistics can be derived from Automatic Workload Repository (AWR) reports or gathered from the GV$SYSSTAT view. Along with understanding the database performance requirements, the performance capabilities of a storage array must be evaluated.
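
For example, a minimal sketch of taking beginning and end AWR snapshots around a target workload, generating the report, and then examining a few I/O-related statistics in GV$SYSSTAT (the statistic names shown are illustrative of the values typically examined):

EXECUTE DBMS_WORKLOAD_REPOSITORY.CREATE_SNAPSHOT();
REM run the target workload (for example, the peak load test), then:
EXECUTE DBMS_WORKLOAD_REPOSITORY.CREATE_SNAPSHOT();
REM generate the AWR report for the two snapshots:
@?/rdbms/admin/awrrpt.sql

SELECT INST_ID, NAME, VALUE
FROM GV$SYSSTAT
WHERE NAME IN ('physical reads', 'physical writes', 'redo size');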

Low-cost storage arrays, low-cost storage networks, and Oracle Database 10g can in combination create a low-cost storage grid with excellent performance and availability. Low-cost storage is most successfully deployed for databases with certain types of performance and availability requirements. Compared to traditional high-end storage arrays, low-cost storage arrays have excellent data throughput and superior price for each gigabyte. However, low-cost storage arrays do not have better I/O rates for OLTP type applications than traditional storage, although the cost for each I/O per second is comparable. The Oracle Resilient Low-Cost Storage Initiative is designed to help customers reduce IT spending and promote use of low-cost storage arrays in both departmental and enterprise environments.

The Oracle Database flash recovery area is an ideal candidate for low-cost storage. Because the flash recovery area contains recovery related files that are typically accessed with sequential 1MB streams, the performance characteristics of low-cost storage are well suited for the flash recovery area. The flash recovery area can be configured to use low-cost storage while the database area remains on traditional storage.

2.1.2 Use Automatic Storage Management (ASM) to Manage Database Files

ASM is a vertical integration of both the file system and the volume manager built specifically for Oracle database files. It extends the concept of stripe and mirror everything (SAME) to optimize performance, while removing the need for manual I/O tuning (distributing the datafile layout to avoid hot spots). ASM helps manage a dynamic database environment by letting you grow the database size without shutting down the database to adjust the storage allocation. ASM also enables low-cost modular storage to deliver higher performance and greater availability by supporting mirroring as well as striping.

ASM should be used to manage all database files. However, ASM can be phased into your environment initially supporting only the flash recovery area. This approach is particularly well suited for introducing low-cost storage into an existing environment where traditional storage configurations currently exist.

To improve manageability, ASMLib should be used on platforms where it is available. ASMLib is a support library for ASM. ASMLib eliminates problems that arise when disk device name mappings change across system reboots. Although ASMLib is not required to run ASM, it simplifies the management of disk device names, makes the discovery process simpler, and removes the problem of disks that are added on one node not being visible to the other nodes in the cluster.

2.1.3 Use a Simple Disk and Disk Group Configuration

When using ASM for database storage, you should create two disk groups: one disk group for the database area and another disk group for the flash recovery area:

  • The database area contains active database files, such as datafiles, control files, online redo log files, Data Guard Broker metadata files, and change tracking files used for RMAN incremental backups. For example:

    CREATE DISKGROUP DATA DISK    
    '/devices/lun01','/devices/lun02','/devices/lun03','/devices/lun04';
    
    
  • The flash recovery area contains recovery-related files, such as a copy of the current control file, a member of each online redo log file group, archived redo log files, RMAN backup sets, and flashback log files. For example:

    CREATE DISKGROUP RECO DISK    
    '/devices/lun05','/devices/lun06','/devices/lun07','/devices/lun08',
    '/devices/lun09','/devices/lun10','/devices/lun11','/devices/lun12';
    
    

To simplify file management, use Oracle managed files to control file naming. Enable Oracle managed files by setting the initialization parameters DB_CREATE_FILE_DEST, DB_RECOVERY_FILE_DEST, and DB_RECOVERY_FILE_DEST_SIZE:

DB_CREATE_FILE_DEST=+DATA
DB_RECOVERY_FILE_DEST=+RECO
DB_RECOVERY_FILE_DEST_SIZE=500G

Note:

Using a flash recovery area by setting DB_RECOVERY_FILE_DEST requires that you also set DB_RECOVERY_FILE_DEST_SIZE to bound the amount of disk space used by the flash recovery area. DB_RECOVERY_FILE_DEST and DB_RECOVERY_FILE_DEST_SIZE are dynamic parameters that allow you to change the destination and size of the flash recovery area.
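
Because these are dynamic parameters, the flash recovery area can be resized or relocated online. The following sketch (assuming the +RECO disk group shown above) sets the size before the destination, which is required when enabling the flash recovery area for the first time:

ALTER SYSTEM SET DB_RECOVERY_FILE_DEST_SIZE=500G SCOPE=BOTH;
ALTER SYSTEM SET DB_RECOVERY_FILE_DEST='+RECO' SCOPE=BOTH;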

You have two options when partitioning disks for ASM use:

  • Allocate entire disks to the database area and flash recovery area disk groups

  • Partition each disk into two partitions, one for the database area and another for the flash recovery area

Figure 2-1 Allocating Entire Disks

Description of Figure 2-1 follows
Description of "Figure 2-1 Allocating Entire Disks"

Figure 2-1 illustrates allocating entire disks. The advantages of this option are:

  • It is easier to manage the disk partitions at the operating system level, because each disk is partitioned as just one large partition.

  • ASM rebalance operations following a disk failure complete more quickly, because there is only one disk group to rebalance.

The disadvantage of this option is:

  • Less I/O bandwidth, because each disk group is spread over only a subset of the available disks.

Figure 2-2 Partitioning Each Disk

Description of Figure 2-2 follows
Description of "Figure 2-2 Partitioning Each Disk"

The second option is illustrated in Figure 2-2. It requires partitioning each disk into two partitions: a smaller partition on the faster outer portion of each drive for the database area, and a larger partition on the slower inner portion of each drive for the flash recovery area. The ratio for the size of the inner and outer partitions depends on the estimated size of the database area and the flash recovery area.

The advantage of this approach is:

  • Higher I/O bandwidth available, because both disk groups are spread over all available spindles. This advantage is considerable for the database area disk group for I/O intensive applications.

The disadvantages are:

  • A double disk failure may result in the loss of both disk groups, requiring the use of a standby database or tape backups for recovery.

  • An ASM rebalance operation following a disk failure takes longer, because both disk groups are affected.

  • Higher initial administrative efforts are required to partition each disk properly.

2.1.4 Use Disk Multipathing Software to Protect from Path Failure

Disk multipathing software aggregates multiple independent I/O paths into a single logical path. The path abstraction provides I/O load balancing across host bus adapters (HBA) and nondisruptive failovers when there is a failure in the I/O path. Disk multipathing software should be used in conjunction with ASM.

When specifying disk names during disk group creation in ASM, the logical device representing the single logical path should be used. For example, when using Device Mapper on Linux 2.6, a logical device path of /dev/dm-0 may be the aggregation of physical disks /dev/sdc and /dev/sdh. Within ASM, the asm_diskstring parameter should contain /dev/dm-* to discover the logical device /dev/dm-0, and that logical device should be used during disk group creation:

asm_diskstring='/dev/dm-*'

CREATE DISKGROUP DATA DISK
'/dev/dm-0','/dev/dm-1','/dev/dm-2','/dev/dm-3';

2.1.5 Use Redundancy to Protect from Disk Failure

When setting up redundancy to protect from hardware failures, there are two options to consider:

  • Storage array based RAID

  • ASM redundancy

Oracle recommends that you configure redundancy in the storage array by enabling RAID protection, such as RAID1 (mirroring) or RAID5 (striping plus parity). For example, to create an ASM disk group where redundancy is provided by the storage array, first create the RAID-protected logical unit numbers (LUNs) in the storage array, and then create the ASM disk group using the EXTERNAL REDUNDANCY clause:

CREATE DISKGROUP DATA EXTERNAL REDUNDANCY DISK
    '/devices/lun1','/devices/lun2','/devices/lun3','/devices/lun4';

If the storage array does not offer the desired level of redundancy, or if there is a need to configure redundancy across multiple storage arrays, then use ASM redundancy. ASM provides redundancy with the use of failure groups, which are defined during disk group creation. ASM redundancy can be either normal redundancy, where files are two-way mirrored, or high redundancy, where files are three-way mirrored. After a disk group is created, its redundancy level cannot be changed.

Failure group definition is specific to each storage setup, but these guidelines should be followed:

  • If every disk is available through every I/O path, as would be the case if using disk multipathing software, then leave each disk in its own failure group. This is the default ASM behavior if creating a disk group without explicitly defining failure groups.

    CREATE DISKGROUP DATA NORMAL REDUNDANCY DISK
        '/devices/diska1','/devices/diska2','/devices/diska3','/devices/diska4',
        '/devices/diskb1','/devices/diskb2','/devices/diskb3','/devices/diskb4';
    
    
  • If every disk is not available through every I/O path, then define failure groups to protect against the piece of hardware that you are concerned about failing. Here are three examples:

    • For an array with two controllers where each controller sees only half of the drives, create a disk group with two failure groups, one for each controller, to protect against controller failure:

      CREATE DISKGROUP DATA NORMAL REDUNDANCY 
        FAILGROUP controller1 DISK
         '/devices/diska1','/devices/diska2','/devices/diska3','/devices/diska4'
        FAILGROUP controller2 DISK
         '/devices/diskb1','/devices/diskb2','/devices/diskb3','/devices/diskb4';
      
      
    • For an array with two controllers where every disk is seen through both controllers, create a disk group with each disk in its own failure group:

      CREATE DISKGROUP DATA NORMAL REDUNDANCY 
        DISK
         '/devices/diska1','/devices/diska2','/devices/diska3','/devices/diska4',
         '/devices/diskb1','/devices/diskb2','/devices/diskb3','/devices/diskb4';
      
      
    • For a storage network with multiple storage arrays where you want to mirror across the arrays, create a disk group with two failure groups, one for each array, to protect against array failure:

      CREATE DISKGROUP DATA NORMAL REDUNDANCY 
        FAILGROUP array1 DISK
         '/devices/diska1','/devices/diska2','/devices/diska3','/devices/diska4'
        FAILGROUP array2 DISK
         '/devices/diskb1','/devices/diskb2','/devices/diskb3','/devices/diskb4';
      
      

When determining the proper size of a disk group that is protected with ASM redundancy, enough free space must exist in the disk group so that when a disk fails ASM can automatically reconstruct the contents of the failed drive to other drives in the disk group while the database remains online. The amount of space required to ensure ASM can restore redundancy following disk failure is in the column REQUIRED_MIRROR_FREE_MB in the V$ASM_DISKGROUP view. The amount of free space that can be safely used in a disk group, taking mirroring into account, and yet be able to restore redundancy after a disk failure is in the column USABLE_FILE_MB in the V$ASM_DISKGROUP view. USABLE_FILE_MB should always be greater than zero. If USABLE_FILE_MB falls below zero, then more disks should be added to the disk group.
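
For example, the following query reports the space figures for each disk group; USABLE_FILE_MB should remain greater than zero:

SELECT NAME, TOTAL_MB, FREE_MB, REQUIRED_MIRROR_FREE_MB, USABLE_FILE_MB
FROM V$ASM_DISKGROUP;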

2.1.6 Consider HARD-Compliant Storage

Consider HARD-compliant storage for the greatest protection against data corruption. Data corruption is very rare, but it can have a catastrophic effect on a business when it occurs.

The goal of the Hardware Assisted Resilient Data (HARD) initiative is to eliminate a class of failures that the computer industry has so far been powerless to prevent. RAID has gained a wide following in the storage industry by ensuring the physical protection of data. HARD takes data protection to the next level by going beyond protecting physical data to protecting business data.

The HARD initiative is designed to prevent data corruption before it happens. Under the HARD initiative, Oracle partners with storage vendors to implement Oracle data validation and checking algorithms inside storage devices. This makes it possible to prevent corrupted data from being written to permanent storage.

The classes of data corruption that Oracle addresses with HARD include:

  • Writes that physically and logically corrupt Oracle blocks

  • Writes of database blocks to incorrect locations

  • Writes of partial or incomplete blocks

  • Writes by other applications to Oracle data blocks

End-to-end block validation is the technology employed by the operating system or storage subsystem to validate the Oracle Database data block contents. By validating Oracle Database data in the storage devices, data corruption is detected and eliminated before it can be written to permanent storage. This goes beyond the current Oracle Database block validation features that do not detect a stray, lost, or corrupted write until the next physical read.

Storage vendors who are partners with Oracle are given the opportunity to implement validation checks based on a specification. A particular vendor's implementation may offer features specific to its storage technology. Oracle maintains a Web site that shows a comparison of each vendor's solution by product and Oracle version.

Note:

When using ASM to manage database storage on HARD-compliant storage, configure the ASM disk groups with external redundancy. Additionally, HARD protections should be disabled during rebalance operations, such as adding a new disk, to avoid the risk of HARD inadvertently flagging the movement of data as a bad write.

See Also:

http://www.oracle.com/technology/deploy/availability/htdocs/HARD.html for the most recent information on the HARD initiative

2.2 Configuring Oracle Database 10g

The best practices discussed in this section apply to Oracle Database 10g database architectures in general, including all architectures described in Oracle Database High Availability Overview.

These recommendations are identical for both the primary and standby databases when Oracle Data Guard is used. It is necessary to adopt these practices to reduce or avoid outages, reduce risk of corruption, and improve recovery performance.

This section contains the following types of best practices for configuring the database in general:

2.2.1 Requirements for High Availability

This section describes the following minimum requirements for configuring Oracle Database for high availability:

2.2.1.1 Enable ARCHIVELOG Mode

ARCHIVELOG mode enables online database backup and is necessary to recover the database to a point in time later than what has already been restored. Architectures such as Oracle Data Guard and Flashback Database require that the production database run in ARCHIVELOG mode.
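
For example, a minimal sketch of enabling ARCHIVELOG mode on a database currently running in NOARCHIVELOG mode (the database must be cleanly shut down and mounted first):

SHUTDOWN IMMEDIATE
STARTUP MOUNT
ALTER DATABASE ARCHIVELOG;
ALTER DATABASE OPEN;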

See Also:

Oracle Database Administrator's Guide for more information about using automatic archiving

2.2.1.2 Enable Block Checksums

By default, Oracle always validates the data blocks that it reads from disk. Enabling data and log block checksums by setting DB_BLOCK_CHECKSUM to TYPICAL enables Oracle to detect other types of corruption caused by underlying disks, storage systems, or I/O systems. Before a data block is written to disk, a checksum is computed and stored in the block. When the block is subsequently read from disk, the checksum is computed again and compared with the stored checksum. Any difference is treated as a media error, and an ORA-1578 error is signaled. Block checksums are always maintained for the SYSTEM tablespace. If DB_BLOCK_CHECKSUM is set to FULL, then in-memory corruption is also detected before being written to disk.

In addition to enabling data block checksums, Oracle also calculates a checksum for every redo log block before writing it to the current log. Redo record corruption is found as soon as the log is archived. Without this option, corruption in a redo log can go unnoticed until the log is applied to a standby database, or until a backup is restored and rolled forward through the log containing the corrupt log block.

RMAN also calculates checksums when taking backups to ensure that all blocks being backed up are validated.

Ordinarily the overhead for TYPICAL is one to two percent and for FULL is four to five percent. The default setting, TYPICAL, provides critical detection of corruption at very low cost and remains a requirement for high availability. Measure the performance impact with your workload on a test system and ensure that the performance impact is acceptable before moving from TYPICAL to FULL on an active database.
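
DB_BLOCK_CHECKSUM is a dynamic parameter, so the setting can be changed without restarting the database. For example:

ALTER SYSTEM SET DB_BLOCK_CHECKSUM=TYPICAL SCOPE=BOTH;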

2.2.2 Recommendations for High Availability and Fast Recoverability

This section describes Oracle Database best practices for reducing recovery time and increasing database availability and redundancy:

2.2.2.1 Configure the Size of Redo Log Files and Groups Appropriately

Use Oracle log multiplexing to create multiple redo log members in each redo group, one in the data area and one in the flash recovery area. This protects against a failure involving the redo log, such as a disk or I/O failure for one of the members, or a user error that accidentally removes a member through an operating system command. If at least one redo log member is available, then the instance can continue to function.

All online redo log files should be the same size and configured to switch approximately once an hour during normal activity. They should not switch more frequently than every 20 minutes during peak activity.

There should be a minimum of four online log groups to prevent LGWR from waiting for a group to be available following a log switch. A group might be unavailable because a checkpoint has not yet completed or because the group has not yet been archived.
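
For example, a sketch of adding one multiplexed online redo log group when Oracle managed files and the +DATA and +RECO disk groups are in use (the group number and size are illustrative):

ALTER DATABASE ADD LOGFILE GROUP 5 ('+DATA','+RECO') SIZE 1024M;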

2.2.2.2 Use a Flash Recovery Area

The flash recovery area is Oracle managed disk space that provides a centralized disk location for backup and recovery files. The flash recovery area is defined by setting the following database initialization parameters:

  • DB_RECOVERY_FILE_DEST

    This parameter specifies the default location for the flash recovery area.

  • DB_RECOVERY_FILE_DEST_SIZE

    This parameter specifies (in bytes) the hard limit on the total space to be used by target database recovery files created in the recovery area location.

The flash recovery area should be the primary location for recovery. When the flash recovery area is properly sized, files needed for repair will be readily available. The minimum recommended disk limit is the combined size of the database, incremental backups, all archived redo logs that have not been copied to tape, and flashback logs.
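
Space usage in the flash recovery area can be monitored with the V$RECOVERY_FILE_DEST and V$FLASH_RECOVERY_AREA_USAGE views. For example:

SELECT NAME, SPACE_LIMIT, SPACE_USED, SPACE_RECLAIMABLE, NUMBER_OF_FILES
FROM V$RECOVERY_FILE_DEST;

SELECT FILE_TYPE, PERCENT_SPACE_USED, PERCENT_SPACE_RECLAIMABLE
FROM V$FLASH_RECOVERY_AREA_USAGE;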

See Also:

Oracle Database Backup and Recovery Basics for detailed information about sizing the flash recovery area and setting the retention period

2.2.2.3 Enable Flashback Database

Flashback Database enables you to rewind the database to a previous point in time without restoring backup copies of the datafiles. Flashback Database is a revolutionary recovery feature that operates on only the changed data. Flashback Database makes the time to correct an error proportional to the time to cause and detect the error, without recovery time being a function of the database size. You can flash back a database from both RMAN and SQL*Plus with a single command instead of using a complex procedure.

During normal runtime, Flashback Database buffers and writes before images of data blocks into the flashback logs, which reside in the flash recovery area. Ensure there is sufficient I/O bandwidth available to the flash recovery area to maintain flashback write throughput. If flashback writes are slow, as evidenced by the flashback free buffer waits wait event, then database throughput is affected. The amount of disk writes caused by Flashback Database differs depending on the workload and application profile. For a typical OLTP workload that is using a flash recovery area with sufficient disk spindles and I/O throughput, the overhead incurred by Flashback Database is less than two percent.

Flashback Database can flash back a primary or standby database to a point in time prior to a role transition. In addition, a Flashback Database can be performed to a point in time prior to a resetlogs operation, which allows administrators more flexibility to detect and correct human errors. Flashback Database is required when using fast-start failover so that Data Guard Broker can automatically reinstate the primary database following an automatic failover.

If you have a standby database, then set DB_FLASHBACK_RETENTION_TARGET to the same value for both primary and standby databases.
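
For example, a minimal sketch of enabling Flashback Database with a 24-hour retention target (flashback logging must be enabled while the database is mounted, not open):

STARTUP MOUNT
ALTER SYSTEM SET DB_FLASHBACK_RETENTION_TARGET=1440;
ALTER DATABASE FLASHBACK ON;
ALTER DATABASE OPEN;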

See Also:

Oracle Database Backup and Recovery Basics for more information on restore points and Flashback Database

2.2.2.4 Use Fast-Start Fault Recovery to Control Instance Recovery Time

The fast-start fault recovery feature reduces the time required to recover from a crash. It also makes the recovery bounded and predictable by limiting the number of dirty buffers and the number of redo records generated between the most recent redo record and the last checkpoint. With this feature, the FAST_START_MTTR_TARGET initialization parameter simplifies the configuration of recovery time from instance or system failure. This parameter specifies a target for the expected recovery time objective (RTO), which is the time (in seconds) that it should take to start up the instance and perform cache recovery. Once this parameter is set, the database manages incremental checkpoint writes in an attempt to meet that target. If you have chosen a practical value for this parameter, then you can expect your database to recover, on average, in approximately the number of seconds you have chosen.
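
For example, a sketch of setting a 60-second cache recovery target and then comparing the target with the current estimate:

ALTER SYSTEM SET FAST_START_MTTR_TARGET=60 SCOPE=BOTH;

SELECT TARGET_MTTR, ESTIMATED_MTTR FROM V$INSTANCE_RECOVERY;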

See Also:

Oracle Database Backup and Recovery Advanced User's Guide for more information on fast-start fault recovery

2.2.2.5 Enable Database Block Checking

Enable database block checking by setting DB_BLOCK_CHECKING to LOW, MEDIUM, or FULL. The block checking performed for each value is as follows:

  • LOW

    Block checking is performed after any in-memory block change.

  • MEDIUM

    All in-memory block change checking is performed, as well as semantic block checking for all non-index-organized table blocks.

  • FULL

    Block checking is performed for all in-memory block changes, as well as semantic block checking for all non-index-organized table blocks and index blocks.

When one of these three forms of block checking is enabled, Oracle Database verifies that the block is self-consistent whenever the corresponding types of block change occur. If the block is inconsistent, then it is marked corrupt, an ORA-1578 error is returned, and a trace file is created containing details of the problem. Without block checking enabled, corruption can go undetected until the block is accessed again. Block checking for the SYSTEM tablespace is always enabled, no matter what setting is chosen for DB_BLOCK_CHECKING.

Block checking can often prevent memory and data corruption. Turning on this feature typically causes an additional one percent to ten percent overhead, depending on the setting and the workload. Measure the performance impact on a test system using your workload and ensure that it is acceptable before introducing this feature on an active database.
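
DB_BLOCK_CHECKING is a dynamic parameter, so block checking can be enabled without restarting the database. For example:

ALTER SYSTEM SET DB_BLOCK_CHECKING=MEDIUM SCOPE=BOTH;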

To check for block corruption on disk that DB_BLOCK_CHECKING could not prevent, use one of the following (a sketch follows the list):

  • RMAN BACKUP command with the VALIDATE option

  • DBVERIFY utility

  • ANALYZE TABLE tablename VALIDATE STRUCTURE CASCADE SQL statement
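
For example, a sketch of validating the entire database with RMAN; any corrupt blocks found are recorded in the V$DATABASE_BLOCK_CORRUPTION view:

RMAN> BACKUP VALIDATE DATABASE ARCHIVELOG ALL;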

2.2.2.6 Set DISK_ASYNCH_IO

Set DISK_ASYNCH_IO=TRUE to enable asynchronous disk I/O for optimal I/O performance.

2.2.2.7 Set LOG_BUFFER to At Least 8 MB

For large production databases, set the LOG_BUFFER initialization parameter to a minimum of 8 MB. This setting ensures the database allocates maximum memory (typically 16 MB) for writing Flashback Database logs.

2.2.2.8 Use Automatic Shared Memory Management

Memory management has improved significantly with the advent of Automatic Shared Memory Management (ASMM). By setting the SGA_TARGET parameter to a nonzero value, the shared pool, large pool, Java pool, Streams pool, and buffer cache can automatically and dynamically resize, as needed. See the Oracle Database Administrator's Guide for more information.
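
For example, a minimal sketch of enabling Automatic Shared Memory Management with a 2 GB SGA target (the value is illustrative and must not exceed SGA_MAX_SIZE):

ALTER SYSTEM SET SGA_TARGET=2G SCOPE=BOTH;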

2.2.2.9 Increase PARALLEL_EXECUTION_MESSAGE_SIZE

Increase the PARALLEL_EXECUTION_MESSAGE_SIZE initialization parameter from its default value of 2048 to 4096. This change speeds up parallel operations, including instance recovery.

2.2.2.10 Tune PARALLEL_MIN_SERVERS

Set PARALLEL_MIN_SERVERS so that the required number of parallel recovery processes are pre-spawned for fast recovery from an instance or node crash. This works with FAST_START_MTTR_TARGET to bound recovery time.

PARALLEL_MIN_SERVERS = CPU_COUNT + average number of parallel query processes in use for GV$ queries and parallel execution

2.2.2.11 Disable Parallel Recovery

When the value of RECOVERY_ESTIMATED_IOS in the V$INSTANCE_RECOVERY view is small (for example, < 5000), then the overhead of parallel recovery may outweigh any benefit. This will typically occur with a very aggressive setting of FAST_START_MTTR_TARGET. In this case, set RECOVERY_PARALLELISM to 1 to disable parallel recovery.
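
For example, check the estimate and, if it is small, disable parallel recovery (RECOVERY_PARALLELISM is a static parameter, so the change takes effect at the next instance startup):

SELECT RECOVERY_ESTIMATED_IOS FROM V$INSTANCE_RECOVERY;

ALTER SYSTEM SET RECOVERY_PARALLELISM=1 SCOPE=SPFILE;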

2.2.3 Recommendations to Improve Manageability

This section describes best practices for improving Oracle Database manageability:

2.2.3.1 Use Automatic Performance Tuning Features

Effective data collection and analysis is essential for identifying and correcting performance problems. Oracle provides a number of tools that allow a performance engineer to gather information regarding database performance.

The Oracle Database automatic performance tuning features include:

  • Automatic Workload Repository (AWR)

  • Automatic Database Diagnostic Monitor (ADDM)

  • SQL Tuning Advisor

  • SQL Access Advisor

When using AWR, consider the following best practices:

  • Set the AWR automatic snapshot interval to 10-20 minutes to capture performance peaks during stress testing or to diagnose performance issues (see the example following this list).

  • Under usual workloads, a 60-minute interval is sufficient.
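
For example, a sketch of changing the snapshot interval to 20 minutes while keeping a 7-day retention (both values are in minutes):

EXECUTE DBMS_WORKLOAD_REPOSITORY.MODIFY_SNAPSHOT_SETTINGS(interval => 20, retention => 10080);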

2.2.3.2 Use a Server Parameter File

The server parameter file (SPFILE) enables a single, central parameter file to hold all database initialization parameters associated with all instances of a database. This provides a simple, persistent, and robust environment for managing database parameters. An SPFILE is required when using Data Guard Broker.
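
For example, a minimal sketch of creating an SPFILE from an existing text parameter file (the PFILE path is illustrative); the SPFILE is used automatically at the next instance startup:

CREATE SPFILE FROM PFILE='/u01/app/oracle/admin/prod/pfile/init.ora';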

2.2.3.3 Use Automatic Undo Management

With automatic undo management, the Oracle Database server effectively and efficiently manages undo space, leading to lower administrative complexity and cost. When Oracle Database internally manages undo segments, undo block and consistent read contention are eliminated because the size and number of undo segments are automatically adjusted to meet the current workload requirement.

To use automatic undo management, set the following initialization parameters:

  • UNDO_MANAGEMENT

    This parameter should be set to AUTO.

  • UNDO_RETENTION

    This parameter specifies the desired time in seconds to retain undo data. It must be the same on all instances.

  • UNDO_TABLESPACE

    This parameter should specify a unique undo tablespace for each instance.

Advanced object recovery features, such as Flashback Query, Flashback Version Query, Flashback Transaction Query, and Flashback Table, require automatic undo management. By default, Oracle Database automatically tunes undo retention by collecting database use statistics and estimating undo capacity needs. You can affect this automatic tuning by setting the UNDO_RETENTION initialization parameter. It is only necessary to set this initialization parameter in the following cases:

  • The undo tablespace has the AUTOEXTEND option enabled.

  • You want to have undo retention for LOBs.

  • You want a retention guarantee.

Note:

By default, undo data can be overwritten by ongoing transactions, even if the UNDO_RETENTION setting specifies that the undo data should be maintained. To guarantee that unexpired undo data is not overwritten, retention guarantee must be enabled for the undo tablespace.
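
For example, a sketch of enabling the retention guarantee on an undo tablespace (the tablespace name is illustrative):

ALTER TABLESPACE undotbs1 RETENTION GUARANTEE;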

With the retention guarantee option, unexpired undo data is never overwritten, even if this causes DML activity to fail for lack of undo space (DDL statements are still allowed). If the tablespace is configured with less space than the transaction throughput requires, then the following four things occur in this sequence:

  1. If you have an autoextensible file, then it will automatically grow to accommodate the retained undo data.

  2. A warning alert is issued at 85 percent full.

  3. A critical alert is issued at 97 percent full.

  4. Transactions receive an out-of-space error.

See Also:

Oracle Database Administrator's Guide for more information about the UNDO_RETENTION setting and the size of the undo tablespace

2.2.3.4 Use Locally Managed Tablespaces

Locally managed tablespaces perform better than dictionary-managed tablespaces, are easier to manage, and eliminate space fragmentation concerns. Locally managed tablespaces use bitmaps stored in the datafile headers and, unlike dictionary managed tablespaces, do not contend for centrally managed resources for space allocations and de-allocations.

See Also:

Oracle Database Administrator's Guide for more information on locally managed tablespaces

2.2.3.5 Use Automatic Segment Space Management

Automatic segment space management simplifies space administration tasks, thus reducing the chance of human error. An added benefit is the elimination of performance tuning related to space management. It facilitates management of free space within objects such as tables or indexes, improves space utilization, and provides significantly better performance and scalability with simplified administration. The automatic segment space management feature is enabled by default for all new tablespaces created using default attributes.

See Also:

Oracle Database Administrator's Guide for more information on segment space management

2.2.3.6 Use Temporary Tablespaces and Specify a Default Temporary Tablespace

Temporary tablespaces improve the concurrency of multiple sort operations, reduce sort operation overhead, and avoid data dictionary space management operations altogether. This is a more efficient way of handling temporary segments, from the perspective of both system resource usage and database performance.

A default temporary tablespace should be specified for the entire database to prevent accidental exclusion of the temporary tablespace clause. This can be done at database creation time by using the DEFAULT TEMPORARY TABLESPACE clause of the CREATE DATABASE statement, or after database creation by the ALTER DATABASE statement. Using the default temporary tablespace ensures that all disk sorting occurs in a temporary tablespace and that other tablespaces are not mistakenly used for sorting.
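
For example, a sketch of creating a temporary tablespace and making it the database default (the name and size are illustrative; the file specification can be omitted when Oracle managed files are enabled):

CREATE TEMPORARY TABLESPACE temp TEMPFILE SIZE 4G;
ALTER DATABASE DEFAULT TEMPORARY TABLESPACE temp;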

See Also:

Oracle Database Administrator's Guide for more information on managing tablespaces

2.2.3.7 Use Resumable Space Allocation

Resumable space allocation provides a way to suspend and later resume database operations if there are space allocation failures. The affected operation is suspended instead of the database returning an error. No processes need to be restarted. When the space problem is resolved, the suspended operation is automatically resumed.

To use resumable space allocation, set the RESUMABLE_TIMEOUT initialization parameter to the number of seconds that an operation can remain suspended before returning an error. You must also issue the ALTER SESSION ENABLE RESUMABLE statement at the session level.
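
For example, a sketch that allows failing operations to remain suspended for up to one hour system-wide and enables resumable allocation for the current session:

ALTER SYSTEM SET RESUMABLE_TIMEOUT=3600 SCOPE=BOTH;
ALTER SESSION ENABLE RESUMABLE;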

See Also:

Oracle Database Administrator's Guide for more information on managing resumable space allocation

2.2.3.8 Use Database Resource Manager

The Database Resource Manager gives database administrators more control over resource management decisions, so that resource allocation can be aligned with the business objectives of an enterprise. The Database Resource Manager provides the ability to prioritize work within the Oracle Database server. Availability of the database encompasses both its functionality and performance. If the database is available but users are not getting the level of performance they need, then availability and service level objectives are not being met. Application performance, to a large extent, is affected by how resources are distributed among the applications that access the database. The main goal of the Database Resource Manager is to give the Oracle Database server more control over resource management decisions, thus circumventing problems resulting from inefficient operating system management and operating system resource managers. The Database Resource Manager is enabled by default.

See Also:

Oracle Database Administrator's Guide for more information on Database Resource Manager

2.3 Configuring Oracle Database 10g with RAC

The best practices discussed in this section apply to Oracle Database 10g with RAC. These best practices build on the Oracle Database 10g configuration best practices described in Section 2.2, "Configuring Oracle Database 10g". These best practices are identical for the primary and standby databases if they are used with Data Guard in Oracle Database 10g with RAC and Data Guard - MAA. Some of these best practices might reduce performance levels, but are necessary to reduce or avoid outages. The minimal performance impact is outweighed by the reduced risk of corruption or the performance improvement for recovery.

This section includes the following topics:

2.3.1 Connect to Database using Services and Virtual Internet Protocol (VIP) Address

With Oracle Database 10g, application workloads can be defined as services so that they can be individually managed and controlled. DBAs control which processing resources are allocated to each service during both normal operations and in response to failures. Performance metrics are tracked by service, and thresholds can be set so that alerts are generated automatically when those thresholds are crossed. CPU resource allocations and resource consumption controls are managed for services using Resource Manager. Oracle tools and facilities such as Job Scheduler, Parallel Query, and Oracle Streams Advanced Queuing also use services to manage their workloads.

With Oracle Database 10g, rules can be defined to automatically allocate processing resources to services. Oracle RAC 10g instances can be allocated to process individual services or multiple services as needed. These allocation rules can be modified dynamically to meet changing business needs. For example, these rules could be modified at the end of a quarter to ensure that there are enough processing resources to complete critical financial functions on time. Rules can also be defined so that when instances running critical services fail, the workload will be automatically shifted to instances running less critical workloads. Services can be created and administered with Enterprise Manager, Database Configuration Assistant (DBCA), and the DBMS_SERVICE PL/SQL package.

Application connections to the database should be made through a Virtual Internet Protocol (VIP) address to a service defined as part of the workload management facility. This combination provides the greatest degree of availability and manageability.

A VIP address is an alternate public address that client connections use instead of the standard public IP address. If a node fails, then the node's VIP address fails over to another node on which the VIP address can accept connections. Clients that attempt to connect to the VIP address receive a rapid connection refused error instead of waiting for TCP connect timeout messages, thereby reducing the time wasted during the initial connection attempt to a failed node. VIP addresses are configured using the Virtual Internet Protocol Configuration Assistant (VIPCA).

2.3.2 Use Oracle Clusterware to Manage the Cluster and Application Availability

Oracle Clusterware is the only clusterware needed for most platforms on which RAC operates. You can also use clusterware from other vendors if the clusterware is certified for RAC. However, adding unnecessary layers of software for functionality that is already provided by Oracle Clusterware adds complexity and cost and can reduce system availability, especially for planned maintenance.

Oracle Clusterware includes a high-availability framework that provides an infrastructure to manage any application. Oracle Clusterware ensures the applications it manages start when the system starts. Oracle Clusterware also monitors the applications to make sure they are always available. For example, if a process fails, then Oracle Clusterware attempts to restart the process based on scripts that you customize. If a node in the cluster fails, then you can program processes that normally run on the failed node to restart on another node. The monitoring frequency, starting, and stopping of the applications and the application dependencies are configurable.

See Also:

Oracle Database Oracle Clusterware and Oracle Real Application Clusters Administration and Deployment Guide for more information on managing application availability with Oracle Clusterware

2.3.3 Use Client-Side and Server-Side Load Balancing

Client-side load balancing evenly spreads new connection requests across all listeners. It is defined in your client connection definition by setting the parameter LOAD_BALANCE to ON. (The default is ON for description lists). When this parameter is set to ON, Oracle Database randomly selects an address in the address list and connects to that node's listener. This provides a balancing of client connections across the available listeners in the cluster. When the listener receives the connection request, it connects the user to an instance that it knows provides the requested service. To see what services a listener supports, run the LSNRCTL services command.
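
For example, a sketch of a client connect descriptor with client-side load balancing enabled (the host names, port, and service name are illustrative):

SALES =
  (DESCRIPTION =
    (ADDRESS_LIST =
      (LOAD_BALANCE = ON)
      (ADDRESS = (PROTOCOL = TCP)(HOST = node1-vip)(PORT = 1521))
      (ADDRESS = (PROTOCOL = TCP)(HOST = node2-vip)(PORT = 1521)))
    (CONNECT_DATA = (SERVICE_NAME = sales)))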

Server-side load balancing uses the current workload being run on the available instances for the database service requested during a connection request and directs the connection request to the least loaded instance. Server-side connection load balancing requires each instance to register with all available listeners, which is accomplished by setting LOCAL_LISTENER and REMOTE_LISTENER parameters for each instance. This is done by default when creating a database with DBCA. Connection load balancing can be further enhanced by using the load balancing advisory and defining the connection load balancing goal for each service by setting the GOAL and CLB_GOAL attributes with the DBMS_SERVICE PL/SQL package.
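
For example, a sketch of setting the load balancing advisory goal and the connection load balancing goal for a service (the service name is illustrative):

BEGIN
  DBMS_SERVICE.MODIFY_SERVICE(
    service_name => 'sales',
    goal         => DBMS_SERVICE.GOAL_THROUGHPUT,
    clb_goal     => DBMS_SERVICE.CLB_GOAL_SHORT);
END;
/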

2.3.4 Mirror Oracle Cluster Registry (OCR) and Configure Multiple Voting Disks

The OCR maintains cluster configuration information as well as configuration information about any cluster database within the cluster. The OCR also manages information about processes that Oracle Clusterware controls. Protect the OCR from disk failure by using the ability of Oracle Clusterware to mirror the OCR. If you have external redundancy, create the OCR on the external redundant storage. If you do not have external redundancy, create a minimum of two OCRs across two different controllers.

RAC uses the voting disk to determine which instances are members of a cluster. The voting disk must reside on a shared disk. For high availability, Oracle recommends that you have multiple voting disks. Oracle Clusterware enables multiple voting disks, but you must have an odd number of voting disks, such as three, five, and so on. If you define a single voting disk, then you should use external mirroring to provide redundancy.

2.3.5 Regularly Back Up OCR to Tape or Offsite

The OCR contains cluster and database configuration information for RAC and Cluster Ready Services (CRS), such as the cluster database node list, CRS application resource profiles, and Event Manager (EVM) authorizations. Oracle Clusterware automatically creates OCR backups every four hours. Oracle always retains the last three backup copies of OCR. The CRSD process that creates the backups also creates and retains an OCR backup for each full day and at the end of each week. The backup files created by Oracle Clusterware should be backed up as part of the operating system backup using Oracle Secure Backup, standard operating-system tools, or third-party tools.

Note:

The default location for generating backups on UNIX-based systems is CRS_HOME/cdata/cluster_name where cluster_name is the name of your cluster. The Windows-based default location for generating backups uses the same path structure.

In addition to using the automatically created OCR backup files, you should also export the OCR contents before and after making significant configuration changes, such as adding or deleting nodes from your environment, modifying Oracle Clusterware resources, or creating a database. Do this by using the ocrconfig -export command, which exports the OCR contents to a file. The export files created by ocrconfig should be backed up as part of the operating system backup using Oracle Secure Backup, standard operating-system tools, or third-party tools.
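
For example, a sketch of listing the automatic OCR backups and exporting the OCR contents before a configuration change (the export file name is illustrative):

prompt> ocrconfig -showbackup
prompt> ocrconfig -export /backup/ocr/ocr_before_node_add.exp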

2.3.6 Verify That CRS and RAC Use Same Interconnect Network

For the most efficient network detection and failover, CRS and RAC should use the same interconnect subnet so that they share the same view of connections and accessibility. To verify the interconnect subnet used by RAC, run the Oracle ORADEBUG utility on one of the instances:

SQL> ORADEBUG SETMYPID
Statement processed.
SQL> ORADEBUG IPC
Information written to trace file.
SQL> ORADEBUG tracefile_name
/u01/app/oracle/admin/prod/udump/prod1_ora_24409.trc

In the trace file, examine the SSKGXPT section to determine the subnet used by RAC. In this example, the subnet in use is 192.168.0.3 and the protocol used is UDP:

SSKGXPT 0xd7be26c flags         info for network 0
        socket no 7      IP 192.168.0.3  UDP 14727

To verify the interconnect subnet used by CRS, examine the value of the keyname SYSTEM.css.node_numbers.node<n>.privatename in OCR:

prompt> ocrdump -stdout -keyname SYSTEM.css.node_numbers
 
[SYSTEM.css.node_numbers.node1.privatename]
ORATEXT : halinux03ic0
 
…
 
[SYSTEM.css.node_numbers.node2.privatename]
ORATEXT : halinux04ic0

The hostnames (halinux03ic0 and halinux04ic0 in this example) should resolve to addresses on the same subnet reported in the ORADEBUG trace file (192.168.0.3 in this example). Use operating system tools to verify. For example, on Linux:

prompt> getent hosts halinux03ic0
192.168.0.3     halinux03ic0.us.oracle.com halinux03ic0

2.3.7 Configure All Databases for Maximum Instances in the Cluster

During initial setup of a RAC database, the online redo log threads and undo tablespaces for any additional instances in the cluster should be created. If the database might be an Oracle Data Guard standby database at some point, then also create the standby redo logs for each thread at this time.
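
For example, a sketch of preparing a two-node RAC database for a third instance by adding a redo thread and an undo tablespace (group numbers, sizes, and names are illustrative; Oracle managed files are assumed):

ALTER DATABASE ADD LOGFILE THREAD 3
  GROUP 7 ('+DATA','+RECO') SIZE 1024M,
  GROUP 8 ('+DATA','+RECO') SIZE 1024M;
ALTER DATABASE ENABLE PUBLIC THREAD 3;
CREATE UNDO TABLESPACE undotbs3 DATAFILE SIZE 4G;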

See Also:

Oracle Database Oracle Clusterware and Oracle Real Application Clusters Administration and Deployment Guide for an overview of storage in Oracle Real Application Clusters

2.4 Configuring Oracle Database 10g with Data Guard

The best practices discussed in this section apply to Oracle Database 10g with Data Guard. These best practices build on the ones described in Section 2.2, "Configuring Oracle Database 10g". The proper configuration of Oracle Data Guard Redo Apply and SQL Apply is essential to ensuring that all standby databases work properly and perform their roles within service levels after switchovers and failovers. Most Data Guard configuration settings can be made using Oracle Enterprise Manager. For more advanced, less frequently used Data Guard configuration parameters, the Data Guard Broker command-line interface or SQL*Plus can be used.

Data Guard enables you to use either a physical standby database (Redo Apply) or a logical standby database (SQL Apply), or both, depending on the business requirements. A physical standby database provides a physically identical copy of the primary database, with on-disk database structures that are identical to the primary database on a block-for-block basis. The database schema, including indexes, is the same. A physical standby database is kept synchronized with the primary database by applying the redo data received from the primary database through media recovery.

A logical standby database contains the same logical information as the production database, although the physical organization and structure of the data can be different. It is kept synchronized with the primary database by transforming the data in the redo log files received from the primary database into SQL statements and then executing the SQL statements on the standby database. A logical standby database can be used for other business purposes in addition to disaster recovery requirements.

This section contains configuration best practices for the following aspects of Data Guard:

2.4.1 Physical or Logical Standby

This section contains information that can help you choose between physical standby and logical standby databases.

This section contains these topics:

2.4.1.1 Benefits of a Physical Standby Database

A physical standby database provides the following benefits:

  • Disaster recovery and high availability

    A physical standby database enables a robust and efficient disaster recovery and high-availability solution. Easy-to-manage switchover and failover capabilities allow easy role reversals between primary and physical standby databases, minimizing the downtime of the primary database for planned and unplanned outages.

  • Data protection

    Using a physical standby database, Data Guard can ensure no data loss with certain configurations, even in the face of unforeseen disasters. A physical standby database supports all datatypes, and all DDL and DML operations that the primary database can support. It also provides a safeguard against data corruption and user errors. Storage-level physical corruption on the primary database does not propagate to the standby database. Similarly, logical corruption or user errors that cause the primary database to be permanently damaged can be resolved. Finally, the redo data is validated when it is applied to the standby database.

  • Reduction in primary database workload

    Oracle Recovery Manager (RMAN) can use physical standby databases to off-load backups from the primary database, saving valuable CPU and I/O cycles. The physical standby database can also be opened in read-only mode for reporting and queries.

  • Performance

    The Redo Apply technology used by the physical standby database applies changes using low-level recovery mechanisms, which bypass all SQL level code layers; therefore, it is the most efficient mechanism for applying high volumes of redo data.

  • Read-write testing and reporting database

    Using Flashback Database and a physical standby database, you can configure a temporary clone database for testing and reporting. The temporary clone can later be resynchronized with the primary database.

2.4.1.2 Benefits of a Logical Standby Database

A logical standby database provides similar disaster recovery, high availability, and data protection benefits as a physical standby database. It also provides the following specialized benefits:

  • Efficient use of standby hardware resources

    A logical standby database can be used for other business purposes in addition to disaster recovery requirements. It can host additional database schemas beyond the ones that are protected in a Data Guard configuration, and users can perform normal DDL or DML operations on those schemas any time. Because the logical standby tables that are protected by Data Guard can be stored in a different physical layout than on the primary database, additional indexes and materialized views can be created to improve query performance and suit specific business requirements.

  • Reduction in primary database workload

    A logical standby database can remain open at the same time its tables are updated from the primary database, and those tables are simultaneously available for read access. This makes a logical standby database an excellent choice to do queries, summations, and reporting activities, thereby off-loading the primary database from those tasks and saving valuable CPU and I/O cycles.

  • Database rolling upgrade

    A logical standby database can be upgraded to the next version and subsequently become the new primary database after a Data Guard switchover. This rolling upgrade procedure can dramatically reduce the planned downtime of a database upgrade.

2.4.1.3 Determining Which Standby Type Is Best for Your Application

Determining which standby type to implement can be accomplished by examining several key areas. Because logical standby does not support all datatypes, you must first determine if your application uses any unsupported datatypes. To determine if your application uses unsupported datatypes, run the following queries on the primary database:

  • To list unsupported tables, issue the following query:

    SET PAGES 200 LINES 132 
    COL OWNER FORMAT A8 
    COL DATA_TYPE FORMAT A15
    COL TABLE_NAME FORMAT A32
    COL COLUMN_NAME FORMAT A25
    COL ATTRIBUTES FORMAT A15
    SELECT OWNER, TABLE_NAME, REASON
    FROM DBA_STREAMS_UNSUPPORTED
    WHERE OWNER NOT IN (SELECT OWNER FROM DBA_LOGSTDBY_SKIP 
        WHERE STATEMENT_OPT='INTERNAL SCHEMA')
    ORDER BY OWNER
    
    
  • To list unsupported tables with column and data type information, issue the following query:

    COL OWNER FORMAT A9
    COL DATA_TYPE FORMAT A35
    COL TABLE_NAME FORMAT A35
    SELECT OWNER, TABLE_NAME, COLUMN_NAME, DATA_TYPE
    FROM DBA_TAB_COLS
    WHERE OWNER NOT IN (SELECT OWNER FROM DBA_LOGSTDBY_SKIP
        WHERE STATEMENT_OPT='INTERNAL SCHEMA')
    AND DATA_TYPE NOT IN ('BINARY_DOUBLE', 'BINARY_FLOAT',
        'INTERVAL YEAR TO MONTH', 'INTERVAL DAY TO SECOND', 'BLOB', 'CLOB',
        'CHAR', 'DATE', 'LONG', 'LONG RAW', 'NCHAR', 'NCLOB', 'NUMBER',
        'NVARCHAR2', 'RAW', 'TIMESTAMP', 'TIMESTAMP(6)',
        'TIMESTAMP(6) WITH TIME ZONE', 'TIMESTAMP(9)',
        'TIMESTAMP WITH LOCAL TIMEZONE', 'TIMESTAMP WITH TIMEZONE',
        'VARCHAR', 'VARCHAR2')
    ORDER BY 1,2
    
    

If either query returns rows with essential application tables, then use a physical standby database or investigate changing the primary database to use only supported datatypes. If the query does not return any rows with essential application tables, then you can use either a physical or a logical standby database.

Next, consider the need for the standby database to be accessible while changes are being applied. If you require that the standby is opened read/write with read-only access to the data being maintained and your application does not make use of unsupported datatypes, then logical standby is your best choice. If access to the standby while changes are being applied is not required or you have datatypes that are not supported by logical standby, then you should implement a physical standby.

If a logical standby database is still a viable choice, then you need to evaluate if it can handle your peak workloads. Because a logical standby database applies changes with SQL instead of a low level recovery mechanism, you need to assess database performance carefully. See "Oracle Database 10g Release 2 Best Practices: Data Guard SQL Apply" at http://www.oracle.com/technology/deploy/availability/htdocs/maa.htm

2.4.2 Data Protection Mode

In some situations, a business cannot afford to lose data at any cost. In other situations, the availability of the database might be more important than protecting data. Some applications require maximum database performance and can tolerate a potential loss of data if a disaster occurs.

Based on your business requirements, choose one of the following protection modes:

  • Maximum protection mode guarantees that no data loss will occur if the primary database fails. To ensure that data loss cannot occur, the primary database shuts down if a fault prevents it from writing the redo stream to at least one remote standby redo log.

  • Maximum availability mode provides the highest level of data protection that is possible without compromising the availability of the primary database.

  • Maximum performance mode (the default mode) provides the highest level of data protection that is possible without affecting the performance of the primary database. This is accomplished by allowing a transaction to commit as soon as the redo data needed to recover that transaction is written to the local online redo log.

    The redo data stream of the primary database is also written to at least one standby database, but that redo stream is written asynchronously with respect to the commitment of the transactions that create the redo data. When network links with sufficient bandwidth are used, this mode provides a level of data protection that approaches that of maximum availability mode, with minimal effect on primary database performance.
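
The protection mode is set on the primary database with the ALTER DATABASE statement. For example, a sketch of raising the mode to maximum availability after a synchronous (LGWR SYNC) redo transport destination has been configured:

ALTER DATABASE SET STANDBY DATABASE TO MAXIMIZE AVAILABILITY;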

To determine the correct data protection mode for your application, ask the questions in Table 2-1.

Table 2-1 Determining the Appropriate Protection Mode

Question Recommendations

Is data loss acceptable if the primary site fails?

Yes: Use any protection mode.

No: Use maximum protection or maximum availability modes.

How much data loss is tolerated if a site is lost?

None: Use maximum protection or maximum availability modes.

Some: Use maximum performance mode with LGWR ASYNC

Is potential data loss between the production and the standby databases tolerated when a standby host or network connection is temporarily unavailable?

Yes: Use maximum performance or maximum availability modes.

No: Use maximum protection mode, or use maximum availability with multiple standby databases.

How far away should the disaster-recovery site be from the primary site?

The distance between sites and the network infrastructure between the sites determine the network latency and bandwidth, and therefore the protection mode that can be used. In general, latency increases and bandwidth reduces with distance.

For a low-latency, high bandwidth network, use maximum protection or maximum availability. In this case, the performance impact is minimal, and you can achieve zero data loss.

For a high-latency network, use maximum performance mode with the ASYNC transport. In this case, the performance impact on the primary is minimal, and you can limit data loss to seconds in most cases. Maximum availability mode and maximum protection modes with the SYNC transport can still be used, but you need to assess if the additional COMMIT latency will exceed your application performance requirements. In some cases, the response time or throughput overhead is zero or within acceptable requirements. Large batch applications or a message queuing applications are good examples where maximum availability with SYNC is still applicable across a high-latency network

What is the current or proposed network bandwidth and latency between sites?

Bandwidth must be greater than the maximum redo generation rate. A guideline for two-way communication is for bandwidth to be 50 percent of the stated network capacity, but you also need to consider the network usage of other applications.

Using maximum performance mode with asynchronous redo transport or the archiver mitigates the effect on performance.


2.4.3 Number of Standby Databases

When running in maximum protection mode, consider using multiple standby databases. In maximum protection mode, when the standby host or a network connection is temporarily unavailable, the primary database continues to retry connecting to the standby database for the number of seconds specified by the NET_TIMEOUT attribute of the LOG_ARCHIVE_DEST_n initialization parameter. The primary database preserves zero data loss during this time period. When this period expires, the primary database proceeds with subsequent transactions. By configuring multiple standby databases, primary database transactions are not interrupted, assuming that the primary database can communicate with at least one standby database that can satisfy the protection mode requirements.

In many cases, logical standby databases can be used for reporting as well as for data protection and recovery. However, if the logical standby database schema requires additional indexes or changes to optimize reporting functions, then it is recommended that you also have a separate physical standby database to maintain a consistent copy of the primary database.

When you use multiple standby databases, consider hosting each one in a different geographic location so that a network outage or natural disaster does not affect multiple standby databases. For example, host one standby database local to the primary database and another standby database at a remote location.

2.4.4 General Configuration Best Practices for Data Guard

This section discusses the following configuration best practices for Data Guard:

2.4.4.1 Enable Flashback Database for Easy Reinstantiation After Failover

Enable Flashback Database on both the primary and standby database so that the old primary database can be easily reinstated as a new standby database following a failover. If there is a failure during the switchover process, then it can easily be reversed when Flashback Database is enabled.

See Also:

Section 2.2.2.3, "Enable Flashback Database" for more information about Flashback Database and for information about enabling Flashback Database

2.4.4.2 Use FORCE LOGGING Mode

When the production database is in FORCE LOGGING mode, all database data changes are logged. FORCE LOGGING mode ensures that the standby database remains consistent with the production database. If this is not possible because you require the load performance of NOLOGGING operations, then you must ensure that the corresponding physical standby datafiles are subsequently synchronized. The physical standby datafiles can be synchronized either by applying an incremental backup created from the primary database or by replacing the affected standby datafiles with a backup of the primary datafiles taken after the NOLOGGING operations. Before the file transfer, the physical standby database must stop recovery.

For logical standby databases, when SQL Apply encounters a redo record for an operation performed with the NOLOGGING clause, it skips over the record and continues applying changes from later records. Later, if an attempt is made to access one of the records that was updated with NOLOGGING in effect, the following error is returned: ORA-01403 no data found. To recover after the NOLOGGING clause is specified for a logical standby database, re-create one or more tables from the primary database, as described in Oracle Data Guard Concepts and Administration in Section 9.4.6 "Adding or Re-Creating Tables On a Logical Standby Database."

You can enable force logging immediately by issuing an ALTER DATABASE FORCE LOGGING statement. If you specify FORCE LOGGING, then Oracle waits for all ongoing unlogged operations to finish.
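
For example, a minimal sketch of enabling force logging and confirming that it is in effect:

SQL> ALTER DATABASE FORCE LOGGING;
SQL> SELECT FORCE_LOGGING FROM V$DATABASE;   -- returns YES once force logging is enabled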

2.4.4.3 Use Data Guard Broker

Use Data Guard Broker to create, manage, and monitor a Data Guard configuration. The benefits of using Data Guard Broker include:

  • Integration with RAC

    Data Guard Broker is integrated with CRS so that database role changes occur smoothly and seamlessly. This is especially apparent in the case of a planned role switchover (for example, when a physical standby database is directed to take over the primary role while the former primary database assumes the role of standby). Data Guard Broker and CRS work together to temporarily suspend service availability on the primary database and accomplish the actual role change for both databases, during which CRS works with Data Guard Broker to restart the instances as necessary. Service availability is then resumed on the new primary database. Data Guard Broker manages the underlying Data Guard configuration and its database roles, while CRS manages service availability that depends upon those roles. Applications that rely on CRS for managing service availability will see only a temporary suspension of service as the role change occurs in the Data Guard configuration.

  • Automated creation of a Data Guard configuration

    Oracle Enterprise Manager provides a wizard that automates the complex tasks involved in creating a Data Guard Broker configuration, including:

    • Adding an existing standby database, or a new standby database created from existing backups taken through Enterprise Manager

    • Configuring the standby control file, server parameter file, and datafiles

    • Initializing communication with the standby databases

    • Creating standby redo log files

    • Enabling Flashback Database if you plan to use fast-start failover

    Although the Data Guard command-line interface (DGMGRL) cannot automatically create a new standby database, you can use DGMGRL commands to configure and monitor an existing standby database, including those created using Enterprise Manager.

  • Simplified switchover and failover operations

    Data Guard Broker simplifies switchovers and failovers by allowing you to invoke them using a single key click in Oracle Enterprise Manager or a single command at the DGMGRL command-line interface (referred to in this documentation as manual failover). For lights-out administration, you can enable fast-start failover to allow Data Guard Broker to determine if a failover is necessary and to initiate the failover to a pre-specified target standby database automatically, with no need for DBA intervention and with no loss of data.

    Fast-start failover allows you to increase availability with less need for manual intervention, thereby reducing management costs. Manual failover gives you control over exactly when a failover occurs and to which target standby database. Regardless of the method you choose, Data Guard Broker coordinates the role transition on all databases in the configuration.

  • Built-in monitoring and alert and control mechanisms

    Data Guard Broker provides built-in validation that monitors the health of all databases in the configuration. From any system in the configuration connected to any database, you can capture diagnostic information and detect obvious and subtle problems quickly with centralized monitoring, testing, and performance tools. Both Enterprise Manager and DGMGRL retrieve a complete configuration view of the progress of redo transport services on the primary database and the progress of Redo Apply or SQL Apply on the standby database.

    The ability to monitor local and remote databases and respond to events is significantly enhanced by the Data Guard Broker health check mechanism and by its tight integration with the Oracle Enterprise Manager event management system.

2.4.4.4 Use a Simple, Robust Archiving Strategy and Configuration

This archiving strategy is based on the following assumptions:

  • Each database uses a flash recovery area.

  • The production instances archive remotely to only one apply instance.

Table 2-2 describes the recommendations for a robust archiving strategy when managing a Data Guard configuration through SQL*Plus. All of the following items are handled automatically when the Data Guard Broker is managing a configuration.

Table 2-2 Archiving Recommendations

Recommendation Description

Archiving must be started on the primary database

Maintaining a standby database requires archiving to be enabled and started on the primary database.

SQL> SHUTDOWN IMMEDIATE
SQL> STARTUP MOUNT;
SQL> ALTER DATABASE ARCHIVELOG;
SQL> ALTER DATABASE OPEN;

Remote archiving must be enabled.

REMOTE_ARCHIVE_ENABLE=TRUE

Use a consistent log format (LOG_ARCHIVE_FORMAT).

LOG_ARCHIVE_FORMAT should include the thread, sequence, and resetlogs ID attributes and should be consistent across all instances. The %S directive pads the sequence number with leading zeros.

If the flash recovery area is used, then this format is ignored.

For example: LOG_ARCHIVE_FORMAT=arch_%t_%S_%r.arc

Local archiving is done first by the archiver process (ARCH).

The default setting for LOG_ARCHIVE_LOCAL_FIRST is TRUE, which means that there is a dedicated archiver process archiving redo data to the local destination. Using the flash recovery area implicitly uses LOG_ARCHIVE_DEST_10 for local archiving.

Remote archiving should be done to only one standby instance and node for each standby RAC database.

All production instances archive to one standby destination, using the same net service name. Oracle Net Services connect-time failover is used if you want to automatically switch to the secondary standby host when the primary standby instance has an outage.

If the archives are accessible from all nodes because ASM or some other shared file system is being used for the flash recovery area, then remote archiving can be spread across the different nodes of a standby RAC database.

The standby archive destination should use the flash recovery area.

For simplicity, the standby archive destination (STANDBY_ARCHIVE_DEST) should use the flash recovery area, which is the same as the directory for the local archiving. Because standby redo logs (SRLs) are present, the standby ARCH process writes to the local archive destination.

For a logical standby database, STANDBY_ARCHIVE_DEST cannot use the flash recovery area. Set STANDBY_ARCHIVE_DEST to an explicit archive directory.

Specify role-based destinations with the VALID_FOR attribute

The VALID_FOR attribute enables you to configure destination attributes for both the primary and the standby database roles in one server parameter file (SPFILE), so that the Data Guard configuration operates properly after a role transition. This simplifies switchovers and failovers by removing the need to enable and disable the role-specific parameter files after a role transition.

See Also: Appendix A, "Database SPFILE and Oracle Net Configuration File Samples"


The following example illustrates the recommended initialization parameters for a primary database communicating to a physical standby database. There are two instances, SALES1 and SALES2, running in maximum protection mode.

*.DB_RECOVERY_FILE_DEST=+RECO
*.LOG_ARCHIVE_DEST_1='SERVICE=SALES_stby LGWR SYNC AFFIRM NET_TIMEOUT=30
    REOPEN=300 VALID_FOR=(ONLINE_LOGFILES,ALL_ROLES) DB_UNIQUE_NAME=SALES_stby'
*.LOG_ARCHIVE_DEST_STATE_1='ENABLE'

The flash recovery area must be accessible to any node within the cluster and use a shared file system technology such as automatic storage management (ASM), a cluster file system, a global file system, or high-availability network file system (HA NFS). You can also mount the file system manually to any node within the cluster very quickly. This is necessary for recovery because all archived redo log files must be accessible on all nodes.

On the standby database nodes, recovery from a different node is required when the node currently running standby apply fails and cannot be restarted. In that case, any of the existing standby instances residing on a different node can initiate managed recovery. In the worst case, when the standby archived redo log files are inaccessible, the new managed recovery process (MRP) or logical standby process (LSP) on the different node uses the FAL server to fetch the archived redo log files directly from the production nodes.

When configuring hardware vendor shared file system technology, verify the performance and availability implications. Investigate the following issues before adopting this strategy:

  • Is the shared file system accessible by any node regardless of the number of node failures?

  • What is the performance impact when implementing a shared file system?

  • Is there any effect on the interconnect traffic?

2.4.4.5 Use Standby Redo Logs and Configure Size Appropriately

Standby redo logs (SRLs) should be used on both sites for improved availability and performance. Use the following formula to determine the number of SRLs:

# of SRLs = sum of all production online log groups for each thread + number of threads

For example, if a primary database has two instances (threads) and each thread has four online log groups, then there should be ten SRLs. Having one more standby log group for each thread than the number of the online redo log groups for the production database reduces the likelihood that the LGWR for the production instance is blocked because an SRL cannot be allocated on the standby.
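
For example, the standby redo logs for that two-thread configuration could be created roughly as follows. The group numbers and the 1 GB size are illustrative, and Oracle-managed files (for example, in ASM) are assumed so that file names can be omitted:

SQL> ALTER DATABASE ADD STANDBY LOGFILE THREAD 1 GROUP 9 SIZE 1G;
SQL> ALTER DATABASE ADD STANDBY LOGFILE THREAD 1 GROUP 10 SIZE 1G;
-- ... continue through group 13 for thread 1 ...
SQL> ALTER DATABASE ADD STANDBY LOGFILE THREAD 2 GROUP 14 SIZE 1G;
-- ... continue through group 18 for thread 2, for ten SRLs in total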

The following are additional guidelines for creating SRLs:

  • Create the same number of SRLs for both production and standby databases

  • All online redo logs and SRLs for both production and standby databases should be the same size

  • SRLs should exist on both production and standby databases

  • SRLs should be created in the Data Area protected through ASM or external redundancy

  • In a RAC environment, SRLs must be on a shared disk

  • In a RAC environment, assign the SRL to a thread when the SRL is created

2.4.4.6 Parameter Configuration Example

Appendix A, "Database SPFILE and Oracle Net Configuration File Samples" provides detailed examples of parameter settings, including SPFILE samples and Oracle Net configuration files.

2.4.5 Redo Transport Services Best Practices

This section discusses best practices for planning and implementing redo transport services for Data Guard.

This section contains these topics:

2.4.5.1 Conduct Performance Assessment with Proposed Network Configuration

Oracle recommends that you conduct a performance assessment with your proposed network configuration and current (or anticipated) peak redo rate. The network impact between the production and standby databases, and the impact on the primary database throughput, needs to be understood. Because the network between the production and standby databases is essential for the two databases to remain synchronized, the infrastructure must have the following characteristics:

  • Sufficient bandwidth to accommodate the maximum redo generation rate

  • Minimal latency to reduce the performance impact on the production database

  • Multiple network paths for network redundancy

The required bandwidth of a dedicated network connection is determined by the maximum redo rate of the production database and the actual network efficiency. Depending on the data protection mode, there are other recommended practices and performance considerations. Maximum protection mode and maximum availability mode require the LGWR SYNC transport. Maximum performance mode uses the ASYNC transport option or the archiver (ARCHn).

When you compare maximum protection mode or maximum availability mode (with LGWR SYNC operations) with maximum performance mode (with LGWR ASYNC operations), measure whether performance or throughput will be degraded due to the incurred latency. You should also check whether the new throughput and response time are within your application performance requirements. Distance and network configuration directly influence latency, while high latency might slow down your potential transaction throughput and increase response time. The network configuration, number of repeaters, the overhead of protocol conversions, and the number of routers also affect the overall network latency and transaction response time.

2.4.5.2 Best Practices for Primary Database Throughput

When sending redo to a standby database using the LGWR SYNC attributes, a transaction on the primary database will not return commit complete to the foreground until the redo associated with that transaction has been written both locally and remotely. The commit time when using LGWR SYNC is directly affected by the network latency and bandwidth, as well as the I/O capacity on the standby database. The total commit time comprises the primary database's local write (log file parallel write) and the following factors that are captured through the LNS wait event on SENDREQ: network time + standby write (RFS write obtained from the V$SYSTEM_EVENT view on the standby database) + network acknowledgement. However, how much the primary database is affected depends on the application profile. In general, batch updates with infrequent commits and message queuing applications may not observe any noticeable difference.

When sending redo to a standby database using the LGWR ASYNC attributes, the effect on primary database throughput is minimal due to the true asynchronous behavior of ASYNC redo transport. Furthermore, there is little effect on the primary database throughput (redo bytes per second) as network latency increases. With the ASYNC attribute, the log writer process writes to the local online redo log file, while the LGWR network server (LNSn) processes (one for each destination) read redo from the online redo log and asynchronously transmit the redo to remote destinations. The LGWR process continues processing the requests without waiting for the LNS network I/O to complete. If redo transport services transmit redo data to multiple remote destinations, then the LNSn processes initiate the network I/O to all of the destinations in parallel.

Figure 2-3 shows the LNSn process collecting redo data from the online redo log files and transmitting it over Oracle Net to the RFS process on the standby database.

Figure 2-3 LGWR ASYNC Archival with Network Server (LNSn) Processes


For remote destinations that are serviced by the ARCH process, you can configure the remote transfer to use multiple streams for more efficient use of network resources. Configuration of multiple streams is performed by setting the MAX_CONNECTIONS attribute in the LOG_ARCHIVE_DEST_n initialization parameter to a value greater than 1. The value determines the maximum number of network connections that will be used to perform the remote archival. The maximum value is five streams for each remote destination.

Because the network connections used in the multiple streams are performed by the ARCH process, care must be taken when setting the LOG_ARCHIVE_MAX_PROCESSES initialization parameter. The value of both LOG_ARCHIVE_MAX_PROCESSES and PARALLEL_MAX_SERVERS initialization parameters must be at least one greater than the total number specified for MAX_CONNECTIONS for all remote destinations.
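
For example, the following sketch shows an archiver-serviced destination configured with four streams; the SALES_dr service name and destination number are hypothetical. Because the MAX_CONNECTIONS values for all remote destinations total four, LOG_ARCHIVE_MAX_PROCESSES and PARALLEL_MAX_SERVERS are set to at least five (use larger values if parallel query activity requires them):

*.LOG_ARCHIVE_DEST_2='SERVICE=SALES_dr ARCH MAX_CONNECTIONS=4
    VALID_FOR=(ONLINE_LOGFILES,PRIMARY_ROLE) DB_UNIQUE_NAME=SALES_dr'
*.LOG_ARCHIVE_MAX_PROCESSES=5
*.PARALLEL_MAX_SERVERS=5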

2.4.5.3 Best Practices for Network Configuration and Highest Network Redo Rates

The following sections include best practices for network configuration and for achieving the highest network redo rates:

2.4.5.3.1 Properly Configure TCP Send / Receive Buffer Sizes

Set the TCP send and receive socket buffer size to twice the bandwidth delay product (BDP) of the network link between the primary and standby systems. BDP is the product of the network bandwidth and latency. Socket buffer sizes should be set using the Oracle Net parameters RECV_BUF_SIZE and SEND_BUF_SIZE, so that the socket buffer size setting affects only Oracle TCP connections. The operating system may impose limits on the socket buffer size that must be adjusted so Oracle can use larger values. For example, on Linux, the parameters net.core.rmem_max and net.core.wmem_max limit the socket buffer size and must be set larger than RECV_BUF_SIZE and SEND_BUF_SIZE.

For example, if the bandwidth is 622 Mbits per second and the latency is 30 ms, then set RECV_BUF_SIZE and SEND_BUF_SIZE as follows:

RECV_BUF_SIZE = SEND_BUF_SIZE = 2 x (622,000,000 / 8) x 0.030 = 4,665,000 bytes
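
A minimal sketch of the corresponding Oracle Net settings, using the value computed above and applied on both the primary and standby systems:

# sqlnet.ora
RECV_BUF_SIZE=4665000
SEND_BUF_SIZE=4665000

# On Linux, the kernel limits must also allow these values, for example:
# net.core.rmem_max = 4665000
# net.core.wmem_max = 4665000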

See Also:

Primary Site and Network Configuration Best Practices at http://www.oracle.com/technology/deploy/availability/htdocs/maa.htm
2.4.5.3.2 Increase SDU Size

With Oracle Net Services, it is possible to control data transfer by adjusting the size of the Oracle Net setting for the session data unit (SDU). Setting SDU to its maximum value of 32767 can improve performance. SDU designates the size of the Oracle Net buffer used to collect data before it is delivered to the TCP network layer for transmission across the network. Oracle internal testing of Oracle Data Guard has demonstrated that the maximum setting of 32767 performs best. The gain in performance is a result of the reduced number of system calls required to pass the data from Oracle Net buffers to the operating system TCP network layer. SDU can be set on a per connection basis with the SDU parameter in the local naming configuration file (tnsnames.ora) and the listener configuration file (listener.ora), or SDU can be set for all Oracle Net connections with the profile parameter DEFAULT_SDU_SIZE in the sqlnet.ora file.
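
For example, SDU can be set globally in sqlnet.ora or for a single connect descriptor; the net service name, host, and port shown here are illustrative:

# sqlnet.ora
DEFAULT_SDU_SIZE=32767

# tnsnames.ora
SALES_stby=
  (DESCRIPTION=
    (SDU=32767)
    (ADDRESS=(PROTOCOL=TCP)(HOST=standby_host)(PORT=1521))
    (CONNECT_DATA=(SERVICE_NAME=SALES_stby)))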

2.4.5.3.3 Ensure TCP.NODELAY is YES

To preempt delays in buffer flushing in the TCP protocol stack, disable the TCP Nagle algorithm by setting TCP.NODELAY to YES in the SQLNET.ORA file on both the primary and standby systems.
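
A minimal sqlnet.ora sketch of this setting:

# sqlnet.ora on both the primary and standby systems
TCP.NODELAY=YES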

See Also:

Oracle Database Net Services Reference for more information about the TCP.NODELAY parameter
2.4.5.3.4 Increase PARALLEL_MAX_SERVERS

Increase the PARALLEL_MAX_SERVERS initialization parameter to accommodate the MAX_CONNECTIONS attribute setting for the LOG_ARCHIVE_DEST_n initialization parameter. The LOG_ARCHIVE_MAX_PROCESSES and PARALLEL_MAX_SERVERS initialization parameters are related to the MAX_CONNECTIONS attribute and affect the actual number of ARCn processes used by an instance. For example, if the total number of connections specified by the MAX_CONNECTIONS attributes on all destinations exceeds the value of LOG_ARCHIVE_MAX_PROCESSES, then Data Guard will use as many ARCn processes as possible, but the number of connections may be fewer than what is specified by MAX_CONNECTIONS.

2.4.5.3.5 Account for Additional Disk I/O with LGWR ASYNC

Allow for sufficient I/O bandwidth to account for the additional read I/O operations caused by enabling LGWR ASYNC redo transmission. When the LGWR and ASYNC attributes are specified on the LOG_ARCHIVE_DEST_n initialization parameter, the log writer process writes to the local online redo log file with no additional writes to an ASYNC buffer, as there were in Oracle Database 10g Release 1 (10.1). In Oracle Database 10g Release 2 (10.2), after LGWR has completed writing to the online log, the LNSn process reads the change vectors from the online redo log and ships the redo to the standby database. With this new approach, the LGWR process writes are completely decoupled from LNSn network writes. Because the LNS process performs a disk read against the online redo log, which it did not do in previous releases, care should be taken that sufficient disk I/O bandwidth exists.

2.4.6 Log Apply Services Best Practices

This section discusses the best practices for Data Guard log apply services for both physical and logical standby databases.

This section contains these topics:

2.4.6.1 Redo Apply Best Practices for Physical Standby Databases

To use Oracle Data Guard Redo Apply with a physical standby database, or to use any media recovery operation effectively, you need to tune your database recovery by following these best practices:

  1. Maximize I/O rates on standby redo logs and archived redo logs.

    Measure read I/O rates on the standby redo logs and archived redo log directories. Concurrent writing of shipped redo on a standby database might reduce the redo read rate due to I/O saturation. The overall recovery rate will always be bounded by the rate at which redo can be read, so ensure that the redo read rate surpasses your required recovery rate.

  2. Assess recovery rate.

    To assess recovery rates over time, use the following query to obtain a history of recovery progress:

    SELECT * FROM V$RECOVERY_PROGRESS;
    
    

    If your ACTIVE APPLY RATE is greater than the maximum redo generation rate at the primary database, or twice the average generation rate at the primary database, then no tuning is required; otherwise, follow the tuning tips below. The redo generation rate for the primary database can be monitored from Grid Control or extracted from AWR reports under the REDO SIZE statistic. If CHECKPOINT TIME PER LOG is greater than ten seconds, then investigate tuning I/O and checkpoints.

  3. Use defaults for DB_BLOCK_CHECKING and DB_BLOCK_CHECKSUM

    The default setting for DB_BLOCK_CHECKING is OFF. Setting DB_BLOCK_CHECKING to FULL may reduce the recovery rate. Block checking is always recommended on the primary database and might be enabled on the standby database if the recovery rate meets expectations.

    The default setting for DB_BLOCK_CHECKSUM is TYPICAL. Block checksum should always be enabled for both primary and standby databases. It catches most block corruption while incurring negligible overhead.

  4. Set PARALLEL_EXECUTION_MESSAGE_SIZE = 4096

    Increasing this parameter to 4096 may improve recovery by as much as twenty percent over the default setting of 2152. The message size parameter is used by parallel query operations, so there must be sufficient shared pool to support this increase.

  5. Set DB_CACHE_SIZE to a value greater than that for the primary database. Set DB_KEEP_CACHE_SIZE and DB_RECYCLE_CACHE_SIZE to 0.

    Having a large database cache size can improve media recovery performance by reducing the amount of physical data block reads. Because media recovery does not require DB_KEEP_CACHE_SIZE and DB_RECYCLE_CACHE_SIZE or require a large SHARED_POOL_SIZE, the memory can be reallocated to the DB_CACHE_SIZE.

    Prior to converting the standby database into a primary database, reset these parameters to the primary database settings.

  6. Assess database waits

    You can determine the top system and session wait events by querying the standby database's V$SYSTEM_EVENT, V$SESSION_WAIT, and V$EVENT_HISTOGRAM views and looking for the largest TIME_WAITED value (a sample query appears after the last step in this list). You may have to capture multiple snapshots of the query results and manually extract the difference to accurately assess a certain time period. Unfortunately, there is no equivalent AWR report for physical standby databases.

    If recovery is applying a lot of redo efficiently, the system will be I/O bound and the I/O wait should be reasonable for your system. The following table shows the top recovery-related waits that you may observe and the tuning tips appropriate for each wait type. Apply the tuning tips only if the recovery events are in the top ten waits.

    Wait Name Description Tuning Tips
    log file sequential read Coordinator (recovery session or MRP process) wait for log file read I/O Tune log read I/O
    px deq: par recov reply Coordinator synchronous wait for slave (wait for checkpoints) Tune datafile write I/O, increase DBWR processes, increase primary/standby redo log size
    px deq credit: send blkd Coordinator streaming wait for slave (wait for apply) Increase PARALLEL_EXECUTION_MESSAGE_SIZE to 8192
    free buffer waits Foreground waiting for an available free buffer in the buffer cache Increase DB_CACHE_SIZE and remove any KEEP or RECYCLE pool settings
    recovery read Wait for data block read Tune datafile read I/O
    direct path read Coordinator wait for file header read at log boundary checkpoint Tune datafile read I/O
    direct path write Coordinator wait for file header write at log boundary checkpoint Tune datafile write I/O
    checkpoint completed (serial recovery only) Wait for checkpoint completed Tune datafile write I/O; increase the number of DBWR processes
    db file parallel read (serial recovery only) Wait for data block read Tune datafile read I/O

  7. Tune I/O operations.

    DBWR must write out modified blocks from the buffer cache to the data files. Always use native asynchronous I/O by setting DISK_ASYNCH_IO to TRUE (default). In the rare case that asynchronous I/O is not available, use DBWR_IO_SLAVES to improve the effective data block write rate with synchronous I/O.

    Ensure that you have sufficient I/O bandwidth and that I/O response time is reasonable for your system either by doing some base I/O tests, comparing the I/O statistics with those for the primary database, or by looking at some historical I/O metrics. Be aware that I/O response time may vary when many applications share the same storage infrastructure such as with a Storage Area Network (SAN) or Network Attached Storage (NAS).

  8. Assess system resources.

    Use system commands such as UNIX sar and vmstat or system monitoring tools to assess system resources. Alternatively, refer to Enterprise Manager, AWR reports or performance views such as V$SYSTEM_EVENT, V$ASM_DISK and V$OSSTAT.

    1. If there are I/O bottlenecks or excessive wait I/O operations, then investigate operational or application changes that increased the I/O volume. If the high waits are due to insufficient I/O bandwidth, then add more disks to the relevant ASM disk group. Verify that this is not a bus or controller bottleneck or any other I/O bottleneck. The read I/O from the standby redo log should be greater than the expected recovery rate.

    2. Check for excessive swapping or memory paging.

    3. Check to ensure the recovery coordinator or MRP is not CPU bound during recovery.

  9. Increase log group size for the primary and standby databases.

    Increase the online redo log size for the primary database and the standby redo log size for the standby database to a minimum of 1 GB. Oracle Database does a full checkpoint and updates all the file headers (in an optimized manner) at each log file boundary during media recovery. To reduce the frequency of full database checkpoints and file header updates, increase the log group size so that log switches occur at intervals of at least 15 minutes.

    If real-time apply is being used and redo is being sent synchronously or asynchronously by way of the LGWR process, then there is no additional data loss risk with this change. If the archiver is sending the redo, or the primary database falls back to ARCH mode due to heavy load, then you must balance faster recovery rates against a higher risk of data loss.

    To ensure that the crash recovery time for the primary database is minimized even with very large redo group sizes, set FAST_START_MTTR_TARGET to a nonzero value to enable fast-start fault recovery. If it is currently not set, then set it to 3600. This initialization parameter is relevant only for the primary database.

  10. Assess different degrees of recovery parallelism

    Parallel recovery is enabled by default for media and crash recovery with the default degree of parallelism set to the number of CPUs available. In most cases this is the optimal setting. However, in some circumstances faster recovery may be obtained by using a degree of parallelism that is different (higher or lower) than the default. To override the default setting, explicitly specify it as follows:

    RECOVER MANAGED STANDBY DATABASE PARALLEL <#>;
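
The following query, referenced from step 6 in this list, is a minimal sketch for listing the largest system-level waits on the standby database:

SELECT EVENT, TOTAL_WAITS, TIME_WAITED
  FROM V$SYSTEM_EVENT
 ORDER BY TIME_WAITED DESC;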
    

See Also:

Oracle Database 10g Release 2 Best Practices: Data Guard Redo Apply and Media Recovery at http://www.oracle.com/technology/deploy/availability/htdocs/maa.htm

2.4.6.2 SQL Apply Best Practices for Logical Standby Databases

This section discusses recommendations for Data Guard SQL Apply and logical standby databases.

This section contains these topics:

2.4.6.2.1 Set the MAX_SERVERS Initialization Parameter

If the logical standby database is being used as a reporting or decision-support system, then increase the value of the MAX_SERVERS initialization parameter to reserve parallel query slaves for such operations. Because the SQL Apply process by default uses all of the parallel query slaves, set the MAX_SERVERS parameter as shown in the following equation to enable a specific number of parallel query slaves to be reserved:

MAX_SERVERS = <current MAX_SERVERS setting> + <PQ Slaves needed for reporting and decision-support operations>

Oracle recommends that you initially set the MAX_SERVERS parameter to either 9 or 3 + (3 x CPU), whichever value is larger.
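
On a logical standby database, SQL Apply parameters such as MAX_SERVERS are typically adjusted with the DBMS_LOGSTDBY.APPLY_SET procedure; the following is a sketch only, with an illustrative value of 36, and assumes SQL Apply is stopped before the change:

SQL> ALTER DATABASE STOP LOGICAL STANDBY APPLY;
SQL> EXECUTE DBMS_LOGSTDBY.APPLY_SET('MAX_SERVERS', 36);
SQL> ALTER DATABASE START LOGICAL STANDBY APPLY IMMEDIATE;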

2.4.6.2.2 Increase the PARALLEL_MAX_SERVERS Initialization Parameter

If the current settings have not been adjusted for MAX_SERVERS processes, then increase the value of the PARALLEL_MAX_SERVERS initialization parameter by the value of the MAX_SERVERS parameter on both the primary and standby database instances:

PARALLEL_MAX_SERVERS = current value + MAX_SERVERS value

2.4.6.2.3 Set the PRESERVE_COMMIT_ORDER Initialization Parameter

For a reporting or decision-support system, PRESERVE_COMMIT_ORDER should be set to TRUE except when the standby database has fallen behind the primary database. Temporarily set PRESERVE_COMMIT_ORDER to FALSE while SQL Apply is getting the standby database caught up to the primary database, but reset the parameter to TRUE after the gap has been resolved.
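
A sketch of temporarily relaxing commit ordering while SQL Apply catches up, again using DBMS_LOGSTDBY.APPLY_SET with SQL Apply stopped around the change:

SQL> ALTER DATABASE STOP LOGICAL STANDBY APPLY;
SQL> EXECUTE DBMS_LOGSTDBY.APPLY_SET('PRESERVE_COMMIT_ORDER', 'FALSE');
SQL> ALTER DATABASE START LOGICAL STANDBY APPLY IMMEDIATE;

After the gap has been resolved, reset the parameter to 'TRUE' in the same way.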

2.4.6.2.4 Skip SQL Apply for Unnecessary Objects

Database objects that do not require replication to the standby database should be skipped by using the DBMS_LOGSTDBY.SKIP procedure. Skipping such objects reduces the processing of SQL Apply.
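
For example, DML against a scratch schema could be skipped as follows; the schema and object names are purely illustrative, and SQL Apply is assumed to be stopped before the skip rule is registered:

SQL> ALTER DATABASE STOP LOGICAL STANDBY APPLY;
SQL> EXECUTE DBMS_LOGSTDBY.SKIP(stmt => 'DML', schema_name => 'SCRATCH', object_name => 'TEMP%');
SQL> ALTER DATABASE START LOGICAL STANDBY APPLY IMMEDIATE;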

2.4.6.2.5 Set the LOG_AUTO_DELETE SQL Apply Parameter

Set the LOG_AUTO_DELETE SQL Apply parameter by running the DBMS_LOGSTDBY.APPLY_SET procedure. The LOG_AUTO_DELETE parameter controls whether SQL Apply automatically deletes archived redo log files sent from the primary database once they have been applied on the logical standby database. Set this parameter to TRUE to enable automatic deletion of archived redo log files. Set it to FALSE to disable automatic deletion. The default value is TRUE.

If the LOG_AUTO_DELETE parameter is set to FALSE, then you can use the DBMS_LOGSTDBY.PURGE_SESSION procedure to delete archived redo log files manually.
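
A minimal sketch of disabling automatic deletion and later purging logs manually (SQL Apply may need to be stopped before apply parameters can be changed):

SQL> EXECUTE DBMS_LOGSTDBY.APPLY_SET('LOG_AUTO_DELETE', 'FALSE');
SQL> EXECUTE DBMS_LOGSTDBY.PURGE_SESSION;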

See Also:

Oracle Database PL/SQL Packages and Types Reference and Oracle Data Guard Concepts and Administration for more information on the DBMS_LOGSTDBY PL/SQL package

2.4.7 Role Transition Best Practices

With proper planning and execution, Data Guard role transitions can effectively minimize downtime and ensure that the database environment is restored with minimal impact on the business. Whether using physical standby or logical standby databases, MAA testing has determined that switchover and failover times with Oracle Data Guard 10g release 2 have been reduced to seconds. This section describes best practices for both switchover and failover.

This section contains these topics:

2.4.7.1 Role Transition During Switchover

A database switchover performed by Oracle Data Guard is a planned transition that includes a series of steps to switch roles between a standby database and a production database. Following a successful switchover operation, the standby database assumes the production role and the production database becomes a standby database. In a RAC environment, a switchover requires that only one instance be active for each database, production and standby.

At times the term switchback is also used within the scope of database role management. A switchback operation is a subsequent switchover operation to return the roles to their original state.

Data Guard enables you to change these roles dynamically by using Oracle Enterprise Manager, the Data Guard broker command-line interface (DGMGRL), or SQL*Plus statements.

See Also:

Oracle Data Guard Broker for information about using Enterprise Manager or the Data Guard broker command-line to perform database switchover
2.4.7.1.1 Switchover Best Practices

To optimize switchover processing, use the following best practices and see the Oracle Database 10g Release 2 Best Practices: Data Guard Switchover and Failover white paper available at

http://www.oracle.com/technology/deploy/availability/pdf/MAA_WP_10gR2_SwitchoverFailoverBestPractices.pdf:

  • For logical standby databases:

    • See Oracle Database 10g Release 2 Best Practices: Data Guard SQL Apply at http://www.oracle.com/technology/deploy/availability/htdocs/maa.htm to obtain an optimal SQL Apply rate.

    • Verify the LogMiner Multiversioned Data Dictionary was received by the primary database by querying the SWITCHOVER_STATUS column of the V$DATABASE fixed view on the primary database. When the query returns the TO LOGICAL STANDBY value, you can proceed with the switchover. See the discussion about "Switchovers Involving a Logical Standby Database" in Oracle Data Guard Concepts and Administration

  • For physical standby databases:

  • Use real-time apply so that redo data is applied to the standby database as soon as it is received.

    To enable real-time apply for a physical standby database, use the following SQL statement:

    ALTER DATABASE RECOVER MANAGED STANDBY DATABASE DISCONNECT 
    USING CURRENT LOGFILE;
    
    

    To enable real-time apply for a logical standby database, use the following SQL statement:

    ALTER DATABASE START LOGICAL STANDBY APPLY IMMEDIATE;
    
    
  • Enable Flashback Database so that if a failure occurs during the switchover, the process can be easily reversed.

2.4.7.2 Role Transition During Failover

Failover is the operation of taking the production database offline at one site and bringing one of the standby databases online as the new production database. A failover operation can be invoked when a catastrophic failure occurs on the production database, and there is no possibility of recovering the production database in a timely manner.

With Data Guard the process of failover can be completely automated using fast-start failover, or it can be user driven. Fast-start failover eliminates the uncertainty inherent in a process that requires manual intervention. It automatically executes a zero data loss failover within seconds of an outage being detected.

Oracle recommends that you use fast-start failover. The initial MAA tests running Oracle Database 10g Release 2 (10.2) show that failovers performed using the Data Guard Broker and fast-start failover offer a significant improvement in availability.

See Also:

Oracle Database 10g Release 2 Best Practices: Data Guard Fast-Start Failover at http://www.oracle.com/technology/deploy/availability/htdocs/maa.htm for a comprehensive review of Oracle failover best practices

Manual failover allows for a failover process where decisions are user driven. Manual failover can be accomplished by using Oracle Enterprise Manager, the Data Guard broker command-line interface (DGMGRL), or SQL*Plus statements.

See Also:

Oracle Data Guard Broker for information about using Enterprise Manager or the Data Guard broker command-line for database failover

This section contains these topics:

2.4.7.2.1 Comparing Fast-Start Failover and Manual Failover

Fast-start failover can be used only in a Data Guard Broker configuration. It can be configured only through DGMGRL or the Data Guard management pages in Oracle Enterprise Manager. Fast-start failover also requires that the redo transport services be configured with LGWR SYNC, and the Data Guard configuration should be in maximum availability mode to achieve the stated guarantee of zero data loss failover. In addition, both the primary and standby databases must have Flashback Database enabled. When enabled, fast-start failover monitors the Data Guard configuration and initiates a failover to the specified target standby database automatically, with no need for DBA intervention and with no loss of data.

The following conditions will trigger a fast-start failover:

  • Database instance failure (or last instance failure in a RAC configuration)

  • Shutdown abort (or shutdown abort of the last instance in a RAC configuration)

  • Datafiles taken offline due to I/O errors

  • Both the observer (fast-start failover monitoring process) and the standby database lose their network connection to the primary database, and the standby database confirms that it is in a synchronized state.

Following a fast-start failover, the old primary database is automatically reconfigured as a new standby database upon reconnection to the configuration. This enables Data Guard to restore disaster protection in the configuration quickly and easily, returning the database to a protected state as soon as possible.

A Data Guard manual failover is a series of steps to convert a standby database into a production database. The standby database essentially assumes the role of production database. A Data Guard failover is accompanied by an application failover to fail over the users to the new primary database. After the failover, the former production database must be re-created as a new standby database to restore resiliency. The standby database can be quickly re-created by using Flashback Database. See Section 4.3.2, "Restoring a Standby Database After a Failover".

See Also:

Oracle Data Guard Broker for more information on failover operations
2.4.7.2.2 Fast-Start Failover Best Practices

The following are fast-start failover best practices:

  • Use real-time apply so that redo data is applied to the standby database as soon as it is received.

  • Enable Flashback Database to protect from logical failures.

  • Run the fast-start failover observer process on a host that is not located in the same data center as the primary or standby database.

    Ideally, the observer should be run on a system that is equally distant from the primary and standby databases. It should connect to the primary and standby databases using the same network as any end-user client. If the designated observer fails, Enterprise Manager can detect the failure and can be configured to restart the observer automatically. If the observer cannot run at a third site, then it should be installed on the same network as the application.

  • If the standby database has been opened in read-only mode, then restart the database before starting Redo Apply.

  • Consider configuring multiple standby databases to maintain data protection following a failover.

  • Set the value of the FastStartFailoverThreshold property according to your configuration characteristics, as described in Table 2-3.

    Table 2-3 Minimum Recommended Settings for FastStartFailoverThreshold

    Configuration Minimum Recommended Setting

    Single instance primary, low latency, and a reliable network

    15 seconds

    Single instance primary and a high latency network over WAN

    30 seconds

    RAC primary

    Reconfiguration time + 30 seconds (see Footnote 1)


    Footnote 1: For configurations running Oracle Database software prior to Release 10.2.0.3, use a minimum of RAC miscount + reconfiguration time + 30 seconds.

For any of the settings shown in Table 2-3, perform testing to ensure that the fast-start failover threshold is not so aggressive that it induces false failovers, or so high that it does not meet your failover requirements.
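
For example, the FastStartFailoverThreshold property can be adjusted from the DGMGRL command-line interface roughly as follows; the 30-second value is illustrative:

DGMGRL> EDIT CONFIGURATION SET PROPERTY FastStartFailoverThreshold = 30;
DGMGRL> SHOW CONFIGURATION VERBOSE;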

2.4.7.2.3 Manual Failover Best Practices

A manual failover, which is user-driven, should be used only in case of an emergency and should be initiated due to an unplanned outage such as:

  • Site disaster that results in the primary database becoming unavailable

  • User errors that cannot be repaired in a timely fashion

  • Data failures, including widespread corruption, that impact the production application

A failover requires that the initial production database be reinstated as a standby database to restore fault tolerance to your environment. The standby database can be quickly reinstated by using Flashback Database. See Section 4.3.2, "Restoring a Standby Database After a Failover".

To optimize failover processing and to maintain high availability, use the following best practices:

  • Enable Flashback Database to reinstate databases after the failover operation has completed.

  • Use real-time apply in conjunction with Flashback Database to apply redo data to the standby database as soon as it is received, and to quickly rewind the database should user error or logical corruption be detected.

  • For logical standby databases, see Oracle Database 10g Release 2 Best Practices: Data Guard SQL Apply at http://www.oracle.com/technology/deploy/availability/htdocs/maa.htm to obtain an optimal SQL Apply rate.

  • For physical standby databases:

  • When transitioning from read-only mode to Redo Apply (recovery) mode, restart the database.

2.4.8 Maintaining a Physical Standby Database as a Clone

Data Guard, Restore Points, and Recovery Manager can be combined to enable a Redo Apply (physical) standby database to be used temporarily as a read/write clone database for testing, reporting, development, or any other purpose for which your organization uses clone technology.

The following are recommendations to consider when using a standby database as a clone:

  • Evaluate the performance of this approach

    When evaluating using this method for maintaining a clone database, measure the time needed to refresh the clone database, given the degree that the clone database will diverge from the primary database. A clone database that is used heavily and diverges significantly from the primary database (more than 15% of the total data blocks are changed between databases) might be synchronized more quickly by simply running RMAN DUPLICATE to copy the entire database from a disk backup.

    Also, ensure there are sufficient I/O resources available. The process of flashing back the changes made to the clone database and synchronizing it with the primary by applying an incremental backup is I/O intensive.

  • Use a separate standby database for disaster protection.

    While the standby database is activated as the clone, it is not receiving redo data from the primary database and cannot provide disaster protection. To provide disaster protection while a physical standby database is activated, there should be multiple standby databases running in the Data Guard configuration so that the primary database remains protected while one of the standby databases is activated. Using a second standby database allows the Data Guard configuration to remain in maximum protection or maximum availability mode while the clone database is activated.

  • Enable block change tracking on the primary database.

    The primary database INCREMENTAL FROM SCN backup will be created more quickly with block change tracking enabled. This backup is used to apply all changes that occurred on the primary database to the clone database while it was diverged. See Oracle Database Backup and Recovery Basics for information about enabling block change tracking.

  • Use RMAN parallelism for creating and applying the incremental backup.

    When creating and applying the primary database INCREMENTAL FROM SCN backup, use RMAN parallelism to reduce the create and apply time. See Oracle Database Backup and Recovery Advanced User's Guide for more information about RMAN parallelism.

  • Create the primary incremental backup and the flashback clone simultaneously.

    To reduce the overall synchronization time, create the incremental backup from the primary database at the same time the clone database is being flashed back to the initial restore point.

To use a standby database as a clone, locate a physical standby database that is not being used for disaster protection. The following sections list the general phases and steps; you can find detailed information about using a standby database as a clone in Oracle Data Guard Concepts and Administration.

Phase 1: Activate the physical standby database as a clone

  1. Prepare the physical standby database to be activated.

  2. Prepare the primary database to have the physical standby be diverged.

  3. Activate the physical standby database (diverge it from the primary database).

  4. Use the activated clone database for testing.

Phase 2: Resynchronize the clone to become a physical standby database

  1. Flash back the activated database so that it becomes a physical standby database again.

  2. Catch up the standby database to the primary database.

2.4.9 Recommendations on Protecting Data Outside of the Database

In a high-availability environment, nondatabase files must be protected along with database files. Oracle Secure Backup provides data protection for heterogeneous UNIX, Linux, Windows, and Network Attached Storage environments. Additionally, for disaster recovery purposes, some third-party tools enable remote synchronization between a set of local and remote files. For example, you can use tools such as rsync, csync2, and DRBD for remote synchronization. These tools are available for download on the internet. The following are recommendations regarding these tools:

  • For software updates, use rsync to synchronize the standby system with the changes made to software on the primary system.

  • For configuration files, use rsync daily or after a change, or use csync2.

  • For important log files, trace files, or debugging files, use rsync daily or hourly, or use DRBD to synchronize the entire file system.

  • For transaction logs or metadata files that must be synchronized with the database, use rsync or csync2 frequently, or use a block synchronization tool such as DRBD, a third-party mirroring utility, or a remote synchronization tool.

2.4.10 Assessing Data Guard Performance

To accurately assess the primary database performance after adding Data Guard standby databases, obtain a history of statistics from the V$SYSMETRIC_SUMMARY view or Automatic Workload Repository (AWR) snapshots before and after deploying Data Guard with the same application profile and load.

Application profile can be quickly assessed by comparing the following statistics:

  • Physical reads per transaction

  • Physical writes per transaction

  • CPU usage per transaction

  • Redo generated per transaction

Application performance can be quickly assessed by comparing the following statistics:

  • Redo generated per second or redo rate

  • User commits per second or transactions per second

  • Database time per second

  • Response time per transaction

  • SQL service response time
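
The following query is a sketch of how these metrics can be pulled from the V$SYSMETRIC_SUMMARY view for a before-and-after comparison; the exact metric names can vary slightly by release:

SELECT METRIC_NAME, AVERAGE, MAXVAL
  FROM V$SYSMETRIC_SUMMARY
 WHERE METRIC_NAME IN ('Redo Generated Per Sec',
                       'User Transaction Per Sec',
                       'Database Time Per Sec',
                       'Response Time Per Txn',
                       'SQL Service Response Time');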

If the application profile has changed between the two scenarios, then this is not a fair comparison. Repeat the test or tune the database or system with the general principles outlined in the Oracle Database Performance Tuning Guide.

If the application profile is similar and you observe an application performance impact on the primary database because of a decrease in throughput or an increase in response time, then assess these common problem areas:

  • CPU utilization

    If you are experiencing high load (excessive CPU utilization of over 90%, paging and swapping), then you need to tune the system before proceeding with Data Guard. Use the V$OSSTAT or V$SYSMETRIC_HISTORY view to monitor system usage statistics from the operating system.

  • Higher I/O waits

    If you are experiencing higher I/O waits from LGWR and DBWR, then the slower I/O will impact throughput and response time. This can be observed by looking at historical data of the following wait events:

    • Log file parallel writes

    • Log file sequential reads

    • Log file parallel reads

    • Data file parallel writes

    • Data file sequential reads

With LGWR SYNC transport, commits take more time because the database must guarantee that the redo is available on the standby database before the foreground process receives an acknowledgement from LGWR that the commit has completed. An LGWR commit includes the following wait events:

  • Log file parallel write (local write for LGWR)

  • LGWR wait on SENDREQ

    This wait event includes:

    • Time to put the packet into the network

    • Time to send the packet to the standby database

    • RFS write or standby write to the standby redo log, which includes the RFS I/O wait event plus additional overhead for checksums

    • Time to send a network acknowledgement back to the primary database (for example, single trip latency time)

Higher LGWR (log writer) commit times can cause longer response time and lower throughput especially for small time-sensitive transactions. However, you may obtain sufficient gains by tuning the LGWR local write (log file parallel write wait event) or the different components that make up the LGWR wait on SENDREQ wait.

The disk write I/O (log file parallel write or the RFS I/O) can be tuned by adding more spindles or increasing the I/O bandwidth. The network time can be reduced by tuning the Oracle Net send and receive buffer sizes, setting SDU=32K, increasing the network bandwidth if there is saturation, and possibly finding a closer site to reduce the network latency.

With LNSn ASYNC transport and ARCH transport, LGWR never waits for LNSn or ARCH before writing its commit record to the current log file; however, both the LNSn and archiver processes read the online redo logs, causing more I/O contention and possibly longer wait times for LGWR writes (log file parallel writes). If sufficient I/O bandwidth and spindles are not allocated, then you will see higher log file parallel write and log file sequential read waits, which may affect throughput and response time. In most cases, adding sufficient spindles reduces the I/O latency.

Note:

To enable most of the new statistical gathering and advisors, ensure the STATISTICS_LEVEL initialization parameter is set to TYPICAL (recommended) or ALL.

2.5 Configuring Backup and Recovery

While it is prudent that every database has a good backup, consider your Recovery Time Objective (RTO) and Recovery Point Objective (RPO) when designing a backup and recovery strategy. While many recoveries involve restoring a backup, Oracle provides other database features such as Data Guard and Flashback Technology to minimize the recovery time from a database outage.

This section discusses the best practices for maintaining a good database backup, as well as other backup options and strategies made possible by the available Oracle database features.

This section contains the following topics:

2.5.1 Use Oracle Database Features and Products

Oracle has multiple database features and products to facilitate Backup and Recovery operations, including Recovery Manager (RMAN), Oracle Secure Backup, the flash recovery area, Flashback Database and restore points.

This section contains these topics:

2.5.1.1 Use Recovery Manager to Back Up Database Files

Recovery Manager (RMAN) is Oracle's utility to back up and recover the Oracle Database. Because of its tight integration with the database, RMAN automatically determines which files need to be backed up. More importantly, RMAN knows which files need to be restored for media recovery operations. RMAN uses server sessions to perform backup and recovery operations and stores metadata about backups in a repository. RMAN offers many advantages over typical user-managed backup methods, including:

  • Online database backups without placing tablespaces in backup mode

  • Incremental backups

  • Data block integrity checks during backup and restore operations

  • Test backups and restores without actually performing the operation

RMAN automates backup and recovery. User-managed methods require you to locate backups for each datafile, copy them to the correct place using operating system commands, and choose which logs to apply. RMAN manages these tasks automatically.

There are also capabilities of Oracle backup and recovery that are only available when using RMAN, such as automated tablespace point-in-time recovery and block-media recovery.
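
For example, a minimal RMAN session that backs up the database together with its archived redo log files and then lists the resulting backups might look like the following; the connection details are illustrative:

RMAN> CONNECT TARGET /
RMAN> BACKUP DATABASE PLUS ARCHIVELOG;
RMAN> LIST BACKUP SUMMARY;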

2.5.1.2 Use Oracle Secure Backup

Oracle Secure Backup provides data protection for heterogeneous UNIX, Linux, Windows and Network Attached Storage (NAS) environments. Oracle Secure Backup provides tape data protection for the entire Oracle Environment:

  • Oracle Database through integration with RMAN

  • Seamless support of Oracle Real Application Clusters (RAC)

  • File system data protection of distributed servers including:

    • Oracle Application Servers

    • Oracle Collaboration Suites

    • Oracle home and binaries

The combination of RMAN and Oracle Secure Backup provides an end-to-end tape backup solution, eliminating the need for third-party backup software.

See Also:

The Oracle Secure Backup web site at http://www.oracle.com/database/secure-backup.html

2.5.1.3 Use Restore Points

Oracle restore points can be used to protect against logical failures at risky junctures during database maintenance. Creating a normal restore point assigns a restore point name to a specific point in time or SCN. The restore point name can then be used with Flashback Table and Flashback Database operations. Restore points can also be guaranteed, which ensures that a Flashback Database operation will succeed in rewinding the database back to the restore point.

Guaranteed restore points are recommended for database-wide maintenance such as database or application upgrades, or running batch processes. Guaranteed restore points enable Flashback Database and retain all flashback logs necessary to ensure the database can be flashed back to the restore point. Once maintenance activities are complete and the results are verified, delete any guaranteed restore points that are no longer needed.
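For example, a guaranteed restore point might be created immediately before an application upgrade and dropped after the results are verified; the restore point name below is illustrative, and the flashback operation requires the database to be mounted:

    -- Before maintenance
    CREATE RESTORE POINT before_app_upgrade GUARANTEE FLASHBACK DATABASE;

    -- If the maintenance fails, rewind the database
    SHUTDOWN IMMEDIATE
    STARTUP MOUNT
    FLASHBACK DATABASE TO RESTORE POINT before_app_upgrade;
    ALTER DATABASE OPEN RESETLOGS;

    -- After the results are verified
    DROP RESTORE POINT before_app_upgrade;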

See Also:

Oracle Database Backup and Recovery Basics for more information on Flashback Database

2.5.2 Configuration and Administration

This section contains these topics:

2.5.2.1 Understand When to Use Backups

Using backups to resolve an unscheduled outage of a production database may not allow you to meet your Recovery Time Objective (RTO) or SLA. For example, some outages are handled best by using Flashback Database or the standby database. However, some situations require using database backups, including the following:

Initial Data Guard Environment Setup

During initial setup of a standby database, a backup of the production database is required at the secondary site to create the initial standby database.

Recovering from Data Failures Using File or Block Media Recovery

When a block corruption, media failure, or other data failure occurs in an environment that does not include Data Guard, the only method of recovery is using an existing backup.

Double Failure Resolution

A double failure scenario affects the availability of both the production and standby databases. An example of a double failure scenario is a site outage at the secondary site, which eliminates fault tolerance, followed by a media failure on the production database. The only resolution of this situation is to re-create the production database from an available backup and then re-create the standby database.

Some multiple failures, or more appropriately disasters (such as a primary site outage followed by a secondary site outage), might require the use of backups that exist only in an offsite location. Therefore, developing and following a process to deliver and maintain backup tapes at an offsite location is necessary to restore service in the direst of circumstances.

2.5.2.2 Determine a Backup Frequency

It is important to determine a backup frequency policy and to perform regular backups. A backup retention policy helps ensure that needed data is not destroyed too soon.

Factors Determining Backup Frequency

Frequent backups are essential for any recovery scheme. Base the frequency of backups on your estimated recovery time objective for outages that cannot be resolved by Data Guard or Flashback technology. Repair time will be dictated by restore time plus recovery time. The frequency of the backup and the location of the backup will impact both of these factors. The other factor that influences how frequently a datafile is backed up is the rate or frequency of database changes such as:

  • Addition and deletion of tables

  • Insertions and deletions of rows in existing tables

  • Updates to data in tables

To simplify database backup and recovery, the Oracle suggested backup strategy implements the flash recovery area while using incremental backups and updated incremental backup features. For more information about the Oracle suggested strategy, see the section titled "Using the Oracle-Suggested Backup Strategy" in Oracle Database 2 Day DBA.
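A minimal sketch of the incrementally updated backup technique that underlies this strategy follows; the tag name is arbitrary, and a configured flash recovery area is assumed:

    RUN {
      # Roll the image copy of the database forward using the previous level 1 backup
      RECOVER COPY OF DATABASE WITH TAG 'oracle_suggested';
      # Take a new level 1 incremental backup for use in the next run
      BACKUP INCREMENTAL LEVEL 1 FOR RECOVER OF COPY WITH TAG 'oracle_suggested' DATABASE;
    }

Running this block on a regular schedule maintains an image copy of the database that lags production by at most one backup interval.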

Establishing a Backup Retention Policy

A backup retention policy is a rule set regarding which backups must be retained (on disk or other backup media) to meet recovery and other requirements. It may be safe to delete a specific backup because it is old enough to be superseded by more recent backups or because it has been stored on tape. You may also have to retain a specific backup on disk for other reasons such as archival requirements. A backup that is no longer needed to satisfy the backup retention policy is said to be obsolete.

Backup retention policy can be based on redundancy or a recovery window. In a redundancy-based retention policy, you specify a number n such that you always keep at least n distinct backups of each file in your database. In a recovery window-based retention policy, you specify a time interval in the past (for example, one week or one month) and keep all backups required to let you perform point-in-time recovery to any point during that window.
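Either policy can be configured persistently in RMAN; the values shown are illustrative, and only one policy can be in effect at a time:

    # Redundancy-based: always keep at least two backups of each file
    CONFIGURE RETENTION POLICY TO REDUNDANCY 2;

    # Recovery window-based: keep backups needed for point-in-time recovery within the last week
    CONFIGURE RETENTION POLICY TO RECOVERY WINDOW OF 7 DAYS;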

Keeping Long-Term Backups

Some businesses require the ability to maintain long-term backups that may be needed years into the future. By using RMAN with the KEEP option, it is possible to retain backups that are exempt from the retention policy and never expire, providing the capability to restore and recover the database to any desired point in time. It is important that a recovery catalog be used for the RMAN repository so that backup metadata is not lost due to lack of space, which may occur when using the target database control file for the RMAN repository.
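A minimal sketch of such a backup; KEEP FOREVER requires that RMAN be connected to a recovery catalog:

    RMAN> BACKUP DATABASE KEEP FOREVER;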

2.5.2.3 Use an RMAN Recovery Catalog

RMAN automatically manages the backup metadata in the control file of the database that is being backed up. To protect and keep backup metadata for long periods of time, the RMAN repository is created in a separate database. This repository is usually referred to as a recovery catalog. The advantages of using a recovery catalog include:

  • Stores backup information long-term

  • Stores metadata for multiple databases

  • Supports restoring an available backup onto another system

Another reason to use a recovery catalog is the limited maximum size of the target database control file. If the control file is too small to hold additional backup metadata, then existing backup information is overwritten, making it difficult to restore and recover using those backups.
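A minimal sketch of creating a recovery catalog and registering a database in it; the catalog schema, password, and service name are placeholders:

    $ rman CATALOG rman/rman_pwd@catdb
    RMAN> CREATE CATALOG;

    $ rman TARGET / CATALOG rman/rman_pwd@catdb
    RMAN> REGISTER DATABASE;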

See Also:

Oracle Database Backup and Recovery Advanced User's Guide for more information on RMAN repository

2.5.2.4 Enable Block Change Tracking for Incremental Backups

Oracle Database 10g includes a change tracking feature for incremental backups, which improves incremental backup performance by recording changed blocks in each datafile in a change tracking file. If change tracking is enabled, then RMAN uses the change tracking file to identify changed blocks for incremental backup. This avoids the need to scan every block in the datafile, reducing the number of disk reads during backup.
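For example, change tracking can be enabled with a statement such as the following; the file name is illustrative, and the USING FILE clause can be omitted if DB_CREATE_FILE_DEST is set:

    ALTER DATABASE ENABLE BLOCK CHANGE TRACKING
      USING FILE '/u01/oradata/PROD/change_tracking.f';

    -- Confirm that change tracking is active
    SELECT status, filename FROM V$BLOCK_CHANGE_TRACKING;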

See Also:

Oracle Database Backup and Recovery Basics for more information on block change tracking

2.5.2.5 Enable Autobackup for Control File and Server Parameter File

RMAN can be configured to automatically back up the control file and server parameter file (SPFILE) whenever the database structure metadata in the control file changes or when a backup record is added. The autobackup enables RMAN to recover the database even if the current control file, catalog, and SPFILE are lost. The RMAN autobackup feature is enabled with the CONFIGURE CONTROLFILE AUTOBACKUP ON statement.
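A minimal sketch of enabling and verifying the setting in an RMAN session:

    RMAN> CONFIGURE CONTROLFILE AUTOBACKUP ON;
    RMAN> SHOW CONTROLFILE AUTOBACKUP;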

See Also:

Oracle Database Backup and Recovery Basics for more information on autobackup

2.5.3 Backup to Disk

This section contains these topics:

2.5.3.1 Determine Disk Backup Methods

When selecting a backup mechanism, the following priorities will drive your backup strategy:

  • Overall backup time

  • Impact on primary

  • Space used by backup

  • Recovery time

Table 2-4 compares different backup alternatives against the different priorities you might have. The table should guide you to choose the best backup approach for your specific business requirements. You might want to minimize backup space while sacrificing recovery time. Alternatively, a higher priority might be on recovery and backup times while space is not an issue.

Table 2-4 Comparison of Backup Options

Backup Option                                                   Overall Backup Time   Impact on Primary   Space Used by Backup   Recovery Time

Full data file copy                                             Fast                  High                Large                  Fastest

Full backup sets                                                Faster                High                Smaller                Slowest

Incremental backup with roll forward immediately                Faster                Low                 Large                  Fastest

Incremental backup with roll forward deferred until recovery   Fastest               Low                 Smallest               Fast

Best Practices on Optimizing Recovery Times

If restore time is the primary concern, then either a database copy or an incremental backup with roll forward immediately should be performed. These two are the only options that provide an immediately usable backup of the database, which then needs only to be recovered to the time of the failure using archive logs created since the last backup was performed.

Best Practices on Minimizing Space Usage

If space usage is the primary concern, then an incremental backup with a deferred roll forward should be performed. If a cumulative level 1 incremental backup is performed, then it stores only those blocks that have been changed since the last level 0 backup. With a cumulative incremental backup, the last level 1 backup need only be applied to the level 0 backup. With a differential incremental backup, all level 1 backups have to be applied to the level 0 backup. A cumulative incremental backup will initially consume more space in the flash recovery area than a differential incremental backup. But over time the cumulative incremental backup will consume less space.

Best Practices on Minimizing System Resource Consumption (I/O and CPU)

If system resource consumption is the primary concern, then an incremental backup with a block change-tracking file will consume the least amount of resources on the database. This is true when the amount of data changed for each backup window is below 20% of the total database size. When the amount of data changed for each backup window exceeds 20%, then performing incremental backups will still reduce the amount of disk space required to hold the backup, but may not reduce the backup time.

Example

For many applications, only a small percentage of the entire database is changed each day, even when the transaction rate is very high. Applications frequently modify the same set of blocks repeatedly, so the total set of changed blocks remains small.

For example, a database contains about 600 GB of user data, not including temp files and redo logs. Every 24 hours, approximately 2.5% of the database is changed, which is approximately 15 GB of data. The first level 0 backup takes about 180 minutes, a subsequent level 1 backup takes 20 minutes, and the merge of the backups takes 45 minutes. In this example, we observed the following results:

  • Level 0 backup takes 180 minutes, including READs from the data area and WRITEs to the flash recovery area

  • Level 1 backup takes 20 minutes, including READs from the data area and WRITEs to the flash recovery area

  • Rolling forward and merging the backups takes only 45 minutes and involves only READs and WRITEs in the flash recovery area, which offloads possible contention from the data area if the two use separate storage.

  • The net savings are:

    • 115 minutes or 64% time savings to create a complete backup

    • Reduced I/O on the database during backups

For bigger databases, we observed even larger gains.

2.5.3.2 Create Backups in NOCATALOG Mode and RESYNC CATALOG Afterwards

When creating backups to disk or tape, use the target database control file as the RMAN repository, so that backup success does not depend on the availability of the database holding the RMAN repository. This is accomplished by running RMAN with the NOCATALOG option. After the backup is complete, the new backup information stored in the target database control file can be resynchronized with the recovery catalog using the RESYNC CATALOG command.
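A minimal sketch of this flow; the catalog connect string is a placeholder:

    # Back up using only the target control file as the repository
    $ rman TARGET / NOCATALOG
    RMAN> BACKUP DATABASE PLUS ARCHIVELOG;

    # Later, when the catalog database is reachable, resynchronize it
    $ rman TARGET / CATALOG rman/rman_pwd@catdb
    RMAN> RESYNC CATALOG;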

See Also:

Oracle Database Backup and Recovery Reference for more information on RESYNC CATALOG

2.5.3.3 Create Database Backups on Disk in the Flash Recovery Area

Using automatic disk-based backup and recovery, you can create a flash recovery area which automates management of backup-related files:

  1. Choose a location on disk.

    This location is specified by DB_RECOVERY_FILE_DEST.

  2. Choose an upper bound for storage space.

    This upper bound is specified by DB_RECOVERY_FILE_DEST_SIZE.

  3. Set a retention policy that governs how long backup files are needed for recovery.

Oracle Database 10g then manages the storage used for backup, archived redo logs, and other recovery-related files for your database within this space. Files no longer needed are eligible for deletion when RMAN must reclaim space for new files.
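A minimal sketch of the corresponding settings; the size, disk group name, and retention window are illustrative, and the size must be set before the location:

    -- In SQL*Plus (SPFILE assumed)
    ALTER SYSTEM SET DB_RECOVERY_FILE_DEST_SIZE = 200G SCOPE = BOTH;
    ALTER SYSTEM SET DB_RECOVERY_FILE_DEST = '+RECO' SCOPE = BOTH;

    # In RMAN
    CONFIGURE RETENTION POLICY TO RECOVERY WINDOW OF 7 DAYS;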

2.5.3.4 In a Data Guard Environment, Back Up to Flash Recovery Area on All Sites

Take backups at the primary and secondary sites. The advantages of this practice include:

  • Significantly reduces RTO in certain double outage scenarios

  • Avoids introducing new backup procedures upon a switchover or failover

  • RMAN file and block media recovery is a recovery option for data failure outages at both primary and secondary sites

Consider a scenario in which backups are done only at the secondary site. Suppose there is a site outage at the secondary site where the estimated time to recover is three days. The primary site is completely vulnerable to an outage that is typically resolved by a failover, or any outage that could be resolved by having a local backup (such as a data failure outage resolved by block media recovery).

In this scenario, a production database outage can be resolved only by physically shipping the off-site tape backups that were taken at the standby site. If primary site backups were available, then restoring locally would be an available option in place of the failover that cannot be done. Data might be lost, but having primary site backups significantly shortens the RTO.

Primary site disk backups are also necessary to ensure a reasonable RTO when using RMAN file or block media recovery. Without a local on-disk backup, a backup taken at the standby site must be restored to the primary site, significantly lengthening the RTO for this type of outage.

2.5.4 Backup to Tape

This section contains these topics:

2.5.4.1 Create Tape Backups from the Flash Recovery Area

Use the RMAN BACKUP RECOVERY FILE DESTINATION command to move disk backups created in the flash recovery area to tape. With a single command, all files not previously backed up to tape are backed up, which prevents you from backing up files more than once, wasting tape, or having to track which files have not yet been backed up. Tape backups are used to handle certain outage scenarios and for offsite and long-term storage.
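A sketch of this operation using an explicitly allocated tape channel; the BACKUP RECOVERY AREA form shown here is one way to express the command, and a configured media management layer is assumed:

    RUN {
      ALLOCATE CHANNEL tape1 DEVICE TYPE sbt;
      BACKUP RECOVERY AREA;
    }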

2.5.4.2 Maintain Offsite Backups

Regardless of the architecture deployed, including the existence of a standby database, it is still important to have offsite backups to meet business requirements, to protect against disasters, and to comply with legal and regulatory requirements such as those of the Securities and Exchange Commission (SEC) and the Health Insurance Portability and Accountability Act (HIPAA).

2.5.5 Backup and Recovery Maintenance

This section contains these topics:

2.5.5.1 Regularly Check Database Files for Corruption

Use the RMAN BACKUP VALIDATE command to regularly check database files for block corruption that has not yet been reported by a user session or by normal backup operations. RMAN scans the specified files and verifies their contents, checking for physical and logical errors, but does not actually perform the backup or recovery operation. Oracle records the address of the corrupt block and the type of corruption in the control file. Access these records through the V$DATABASE_BLOCK_CORRUPTION view, which can be used by RMAN block media recovery.

If block change tracking is enabled, then do not use the INCREMENTAL LEVEL option with BACKUP VALIDATE; otherwise, unchanged blocks are skipped and not all data blocks are read and verified.

To detect all types of corruption that RMAN can detect, specify the CHECK LOGICAL option. Do not specify the MAXCORRUPT or NOCHECKSUM options of the BACKUP VALIDATE command.
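For example:

    RMAN> BACKUP VALIDATE CHECK LOGICAL DATABASE ARCHIVELOG ALL;

    -- Then, in SQL*Plus, review any corruption recorded in the control file
    SQL> SELECT * FROM V$DATABASE_BLOCK_CORRUPTION;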

2.5.5.2 Periodically Test Recovery Procedures

Complete, successful, and tested backups are fundamental to the success of any recovery. Create test plans for different outage types. Start with the most common outage types and progress to the least probable.

Monitor the backup procedure for errors, and validate backups by testing your recovery procedures periodically. Also, validate the ability to back up and restore by using the RMAN BACKUP VALIDATE and RESTORE ... VALIDATE commands.
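A minimal sketch of a restore validation, which reads the backups needed for a restore without writing any database files:

    RMAN> RESTORE DATABASE VALIDATE;
    RMAN> RESTORE ARCHIVELOG ALL VALIDATE;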

2.5.5.3 Regularly Back Up the Recovery Catalog Database

Include the recovery catalog database in your backup and recovery strategy. If you do not back up the recovery catalog and a disk failure occurs that destroys the recovery catalog database, then you may lose the metadata in the catalog. Without the recovery catalog contents, recovery of your other databases is likely to be more difficult.

2.6 Configuring Fast Application Failover

To benefit fully from fast instance and database failover and switchover with Real Application Clusters (RAC) and Data Guard, you should configure fast application failover. When a database service becomes unavailable, fast application failover enables clients (mid-tier applications or any program that connects directly to a database) to fail over quickly and seamlessly to an available database service.

Because client failover features have evolved through several Oracle Database releases, the time required for clients to respond to various outages will vary by release. The time required for failover in certain cases is a direct function of TCP/IP network timeouts.

Table 2-5 shows typical wait times when using client failover features.

Table 2-5 Typical Wait Times for Client Failover

Oracle Database Version   Client Type   Site Failure           RAC Node Failure   Non-RAC Instance Failure          RAC Instance Failure

8.0, 8i, 9i               All           TCP timeout            TCP timeout        Seconds to minutes (Footnote 1)   Seconds

10g Release 1 (10.1)      JDBC          TCP timeout            Seconds            Seconds to minutes (Footnote 1)   Seconds

10g Release 1 (10.1)      OCI           TCP timeout            TCP timeout        Seconds to minutes (Footnote 1)   Seconds

10g Release 2 (10.2)      JDBC          Seconds                Seconds            Seconds                           Seconds

10g Release 2 (10.2)      OCI           Seconds (Footnote 2)   Seconds            Seconds                           Seconds

Footnote 1 The wait times required in non-RAC instance failures are determined by how much time is needed to activate the standby database as the new primary database and for the client to establish a new connection.

Footnote 2 Excluding ODP.NET clients, which suffer an outage equal to the TCP timeout.

With Oracle Database 10g Release 2 (10.2), delays caused by TCP/IP network timeouts are overcome for both JDBC clients and OCI clients by using fast application failover. To use fast application failover for site failovers, a trigger is written that is invoked by the DB_ROLE_CHANGE system event. This trigger can also manage post-failover tasks.
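A minimal sketch of such a trigger follows; the trigger name and service name are illustrative, and the sketch assumes the application service should run only when the database is in the primary role:

    CREATE OR REPLACE TRIGGER manage_service_after_role_change
    AFTER DB_ROLE_CHANGE ON DATABASE
    DECLARE
      v_role VARCHAR2(30);
    BEGIN
      SELECT database_role INTO v_role FROM v$database;
      IF v_role = 'PRIMARY' THEN
        -- Start the application service so clients are redirected to the new primary
        DBMS_SERVICE.START_SERVICE('sales_rw');
      END IF;
    END;
    /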

A detailed description of how to achieve seamless client failover is provided in Oracle Database 10g Release 2 Best Practices: Client Failover for Highly Available Oracle Databases at http://www.oracle.com/technology/deploy/availability/pdf/MAA_WP_10gR2_ClientFailoverBestPractices.pdf.

This section contains these topics:

2.6.1 Configuring Clients for Failover

For JDBC client failover best practices, follow these steps:

  1. Enable fast connection failover for JDBC clients by setting the DataSource property FastConnectionFailoverEnabled to TRUE.

  2. Configure JDBC clients to use a connect descriptor that includes an address list containing the VIP address for each node in the cluster and that connects to an existing service (a sample descriptor follows this list).

  3. Configure a remote Oracle Notification Service (ONS) subscription on the JDBC client so that an ONS daemon is not required on the client.
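A sketch of a connect descriptor that satisfies step 2; the alias, host names, port, and service name are placeholders:

    SALES =
      (DESCRIPTION =
        (ADDRESS_LIST =
          (LOAD_BALANCE = ON)
          (ADDRESS = (PROTOCOL = TCP)(HOST = node1-vip)(PORT = 1521))
          (ADDRESS = (PROTOCOL = TCP)(HOST = node2-vip)(PORT = 1521)))
        (CONNECT_DATA =
          (SERVICE_NAME = sales)))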

For OCI client failover best practices, follow these steps:

  1. Enable fast application notification (FAN) notifications for OCI clients by initializing the environment with the OCI_EVENTS parameter.

  2. Link the OCI client applications with the thread library.

  3. Set the AQ_HA_NOTIFICATIONS parameter to TRUE and configure the transparent application failover (TAF) failover attributes for services.
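A hedged sketch of step 3 using the DBMS_SERVICE package; the service name and the failover attribute values are illustrative:

    BEGIN
      DBMS_SERVICE.MODIFY_SERVICE(
        service_name        => 'sales',
        aq_ha_notifications => TRUE,
        failover_method     => DBMS_SERVICE.FAILOVER_METHOD_BASIC,
        failover_type       => DBMS_SERVICE.FAILOVER_TYPE_SELECT,
        failover_retries    => 180,
        failover_delay      => 5);
    END;
    /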

2.6.2 Client Failover in a RAC Database

For client failover best practices in a RAC database:

  1. Use Oracle Enterprise Manager to create new services.

  2. Add all hosts in the cluster to the RAC ONS configuration.

2.6.3 Failover from a RAC Primary Database to a Standby Database

For client failover best practices when failing over from a RAC primary database to a standby database, follow these steps:

  1. Create the necessary support files for Oracle Net alias modifications.

  2. Configure the FAN ONS publisher program so that JDBC clients are notified of the primary site failure and instructed to reconnect to the new primary database.

  3. Create a trigger on the DB_ROLE_CHANGE system event to perform the proper steps so that JDBC and OCI clients connect to the new primary database following a role change after a Data Guard failover or switchover.