Simple Oracle, MS SQL, Postgres, AWS and more: Parallel Rollback may hang database, Parallel query servers get 100% cpu

Problem Description
——————-

Sometimes Parallel Rollback of Large Transaction may become very slow. After killing a large running transaction (either by killing the shadow process or aborting the database) then database seems to hang, or smon and parallel queryservers taking all the available cpu. After killing shadow process or aborting the database the v$transaction entry is lost, so you cannot estimate by
examining v$transaction.used_ublk how the rollback procedure proceeds.

In parallel rollback checking v$lock you see SMON having one TX lock and x (where x is the init.ora -> parallel_max_servers value) PS locks (parallel slave synchronization lock). Also each Parallel Query process has also 1 PS lock. And a lot of redo log activity is happening.

In order to identify if this is your case check the following views:
V$fast_start_servers -> provides information about all the recovery slaves performing parallel transaction recovery. This contains one row for each parallel query slave.
Column STATE shows the state of the server being IDLE or RECOVERING.
Column UNDOBLOCKSDONE shows the percentage of the assigned work done so far.

V$fast_start_transactions -> contains one row for each one of the transactions that Oracle is recovering in Parallel.

Important columns here are:
USN -> the undo segment number of the transaction (join with v$rollstat.usn and v$rollname.usn to find the rollback segment used, v$rollstat.xacts would show a value larger than 0).
UNDOBLOCKSDONE -> the number of undo blocks completed in this transaction
UNDOBLOCKSTOTAL -> the total number of undo blocks that need recovery
CPUTIME -> the time for which recovery has progressed in seconds

You may notice that UNDOBLOCKSDONE is not increasing or increases very slowly.

Solution Description
——————–

If you have fall in the above case you can use

alter system set fast_start_parallel_rollback = false;

in order to disable parallel rollback.

If it hangs, shutdown database and define it inside init.ora file:

fast_start_parallel_rollback = false

FAST_START_PARALLEL_ROLLBACK shows the maximum number of processes which may exist for performing parallel rollback. If the value is false, parallel rollback is disabled. If the value is LOW (default), 2 * CPU_COUNT number of processes may be used. If the value is HIGH, at most 4 * CPU_COUNT number of rollback servers are used for parallel rollback.

Explanation
———–

In fast-start parallel rollback, the background process Smon acts as a coordinator and rolls back a set of transactions in parallel using multiple server processes.

Fast start parallel rollback is mainly useful when a system has transactions that run a long time before comitting, especially parallel Inserts, Updates, Deletes operations. When Smon discovers that the amount of recovery work is above a certain threshold, it automatically begins parallel rollback by dispersing the work among several parallel processes.

One special form of fast-start parallel rollback is intra-transaction recovery. Intra-transaction parallelism kicks in only when more than 4 slaves (parallel_max_servers) are available.
There are cases where parallel transaction recovery is not as fast as serial transaction recovery, because the pq slaves are interfering with each other. It looks like the changes made by this transaction cannot be recovered in parallel without causing a performance problem. The parallel rollback slave processes are most likely contending for the same resource, which results in
even worse rollback performance compared to a serial rollback. This depends mainly on the type of changes that need to be made during rollback and usually may happen when rolling back INDEX Updates in parallel. One way to check this is by dumping some blocks of the online redo logs or archive logs produced during the parallel rollback by command:

   alter database dump logfile ‘log1SID.dbf’
   rba min . 1
   rba max . 100 -> for 100 blocks of redo log file with sequence

If you check the Redo Records you may notice that they are doing actions with Opcodes 10.x (Index actions) and 5.x (rollback actions).

Simple Oracle, MS SQL, Postgres, AWS and more

Thursday, December 24, 2009

Parallel Rollback may hang database, Parallel query servers get 100% cpu

No comments: