Saturday, September 5, 2009

RMAN-06094 While recovering through new datafile

Summary
The scenario here is that you have a backup of a database from a week ago. It's an Oracle 10.2 DB running with an ASM. You wish to restore this database to different machine from its original location (though this is not really the important issue). Armed with a week's worth of archivelog files, you wish to restore from your old backup and then recover the database through all the log files you have available.


Unfortunately, during the week you had to add some datafiles to the database. You then issue the correct restore commands (to get the DB image unpacked from its backupsets), and then you issue your recover until command. Unfortunately you are greeted with the good old RMAN-06094 message.

RMAN-03002: failure of recover command at 09/05/2009 08:28:22

RMAN-06094: datafile 177 must be restored

It complains about a datafile that needs to be restored. But, you know that this file is one of a few which were created during the week. They are not in the backupset, but, Oracle is insisting that this file should be restored before it can continue with the DB recovery.

This blog is an account of my experience of such an issue and shows the complete case history from restoring the controlfile through to recovering the DB up to the point I needed.

I hope it helps you if you are faced with a similar issue.



RMAN> restore controlfile from '/mnt/adaptBU/adaptBU/29AUG09/c-3688516520-20090904-01';
Starting restore at 05-SEP-2009 05:04:54
using target database control file instead of recovery catalog
allocated channel: ORA_DISK_1
channel ORA_DISK_1: sid=321 devtype=DISK

channel ORA_DISK_1: restoring control file
channel ORA_DISK_1: restore complete, elapsed time: 00:00:08
output filename=+DG_DATA/ad11prd/controlfile/current1
output filename=+DG_DATA/ad11prd/controlfile/current2
Finished restore at 05-SEP-2009 05:05:06

-- No RMAN output shown for this set of commands.
RMAN> alter database mount;
RMAN>crosscheck backupset;
RMAN>delete expired backupset;
RMAN>list backup summary;

RMAN>catalog start with '/mnt/adaptBU/adaptBU/29AUG09/';

RMAN> list backup summary device type disk;



List of Backups
===============
Key TY LV S Device Type Completion Time #Pieces #Copies Compressed Tag
------- -- -- - ----------- -------------------- ------- ------- ---------- ---
9867 B F A DISK 29-AUG-2009 17:43:38 1 1 YES HRMAXPD225_AD11PRD1_HOT_290809
9868 B F A DISK 29-AUG-2009 17:44:01 1 1 YES HRMAXPD225_AD11PRD1_HOT_290809
9869 B F A DISK 29-AUG-2009 17:46:45 1 1 YES HRMAXPD225_AD11PRD1_HOT_290809
9870 B F A DISK 29-AUG-2009 17:46:51 1 1 YES HRMAXPD225_AD11PRD1_HOT_290809
9872 B A A DISK 29-AUG-2009 17:47:13 1 1 YES TAG20090829T174710
9873 B A A DISK 29-AUG-2009 17:47:15 1 1 YES TAG20090829T174710
9874 B A A DISK 29-AUG-2009 17:47:17 1 1 YES TAG20090829T174710
9875 B A A DISK 29-AUG-2009 17:47:18 1 1 YES TAG20090829T174710
9876 B F A DISK 29-AUG-2009 17:47:19 1 1 NO TAG20090829T174719
10117 B A A DISK 04-SEP-2009 18:05:55 1 1 YES HRMAXPD225_AD11PRD1_HOT_040909
10118 B A A DISK 04-SEP-2009 18:07:33 2 1 YES HRMAXPD225_AD11PRD1_HOT_040909
10119 B A A DISK 04-SEP-2009 18:09:25 2 1 YES HRMAXPD225_AD11PRD1_HOT_040909
10120 B A A DISK 04-SEP-2009 18:09:56 2 1 YES HRMAXPD225_AD11PRD1_HOT_040909
10121 B A A DISK 04-SEP-2009 18:13:08 2 1 YES HRMAXPD225_AD11PRD1_HOT_040909
10122 B A A DISK 04-SEP-2009 18:15:17 2 1 YES HRMAXPD225_AD11PRD1_HOT_040909
10123 B A A DISK 04-SEP-2009 18:16:54 2 1 YES HRMAXPD225_AD11PRD1_HOT_040909
10124 B A A DISK 04-SEP-2009 18:19:59 1 1 YES HRMAXPD225_AD11PRD1_HOT_040909
10125 B A A DISK 04-SEP-2009 18:21:06 2 1 YES HRMAXPD225_AD11PRD1_HOT_040909
10126 B A A DISK 04-SEP-2009 18:22:19 2 1 YES HRMAXPD225_AD11PRD1_HOT_040909
10127 B A A DISK 04-SEP-2009 18:23:48 1 1 YES HRMAXPD225_AD11PRD1_HOT_040909
10128 B A A DISK 04-SEP-2009 18:28:08 1 1 YES HRMAXPD225_AD11PRD1_HOT_040909
10129 B A A DISK 04-SEP-2009 18:28:17 2 1 YES HRMAXPD225_AD11PRD1_HOT_040909
10130 B A A DISK 04-SEP-2009 18:28:35 1 1 YES HRMAXPD225_AD11PRD1_HOT_040909
10131 B A A DISK 04-SEP-2009 18:29:17 1 1 YES HRMAXPD225_AD11PRD1_HOT_040909
10132 B A A DISK 04-SEP-2009 18:30:02 1 1 YES HRMAXPD225_AD11PRD1_HOT_040909
10134 B F A DISK 04-SEP-2009 18:30:14 1 1 YES TAG20090904T183013
10135 B F A DISK 04-SEP-2009 18:30:15 1 1 NO TAG20090904T183015


-- Off we go, we've cataloged our backup set (which we brought over from the production system),
-- now lets get the Datafiles out of the backupsets...

RMAN> restore database until time "to_date('29-AUG-2009 18:00:00','DD-MON-YYYY HH24:MI:SS')";


-- We are about to offload a whole load of archive log files into our Flash Recovery Area.
-- BUMP UP THE db_recovery_file_dest_size to hold all uncompressed archivelogs...
RMAN> alter system set db_recovery_file_dest_size=100G;

RMAN> recover database until time "to_date('01-SEP-2009 18:00:00','DD-MON-YYYY HH24:MI:SS')";

--Just a check of the ASM +DG_FLASH/AD11PRD/ARCHIVELOG area when the restoration moved from the unpack stage to the apply stage.

ARCHIVELOG UNPROT COARSE SEP 05 07:00:00 Y 512 81349 41650688 41943040 thread_3_seq_1661.3975.696757821

ARCHIVELOG UNPROT COARSE SEP 05 07:00:00 Y 512 190888 97734656 99614720 thread_3_seq_1662.3937.696757699

ARCHIVELOG UNPROT COARSE SEP 05 07:00:00 Y 512 190888 97734656 99614720 thread_3_seq_1663.3938.696757701

ARCHIVELOG UNPROT COARSE SEP 05 07:00:00 Y 512 190888 97734656 99614720 thread_3_seq_1664.3939.696757703

ARCHIVELOG UNPROT COARSE SEP 05 07:00:00 Y 512 180813 92576256 94371840 thread_3_seq_1665.3957.696757769

ARCHIVELOG UNPROT COARSE SEP 05 07:00:00 Y 512 180662 92498944 94371840 thread_3_seq_1666.3958.696757769

ARCHIVELOG UNPROT COARSE SEP 05 07:00:00 Y 512 50744 25980928 26214400 thread_3_seq_1667.3976.696757829

ARCHIVELOG UNPROT COARSE SEP 05 07:00:00 Y 512 2 1024 1048576 thread_3_seq_1668.3980.696757833

ARCHIVELOG UNPROT COARSE SEP 05 07:00:00 Y 512 8933 4573696 5242880 thread_3_seq_1669.3979.696757833

ARCHIVELOG UNPROT COARSE SEP 05 07:00:00 Y 512 190917 97749504 99614720 thread_3_seq_1670.3916.696757627

.....

ASMCMD> Sat Sep 5 08:19:02 2009

-- Now at 08:27am




RMAN> recover database until time "to_date('02-SEP-2009 18:00:00','DD-MON-YYYY HH24:MI:SS')";

-- Immediately, this came back with the dreaded error as seen earlier...

Starting recover at 05-SEP-2009 08:28:09
using channel ORA_DISK_1
using channel ORA_DISK_2
using channel ORA_DISK_3
using channel ORA_DISK_4
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of recover command at 09/05/2009 08:28:22
RMAN-06094: datafile 177 must be restored

--At this time, from SQL*Plus...
-- v$datafile where file# > 176
177 1357906524 +DG_DATA/ad11prd/datafile/webint_data.481.696515055
178 1390246309 +DG_DATA/ad11prd/datafile/adapt_data.482.696614533
179 1390251108 +DG_DATA/ad11prd/datafile/adapt_data.483.696614539
180 1390253412 +DG_DATA/ad11prd/datafile/adapt_data.484.696614545
181 1390257349 +DG_DATA/ad11prd/datafile/adapt_data.485.696614551
182 1390261826 +DG_DATA/ad11prd/datafile/adapt_data.486.696614557


--And also for webint's datafiles.... (file 177 is named as a webint file (tablespace name)).
-- v$datafile name like '%webint%'
+DG_DATA/ad11prd/datafile/webint_data.537.696750009
+DG_DATA/ad11prd/datafile/webint_data.825.696750057
+DG_DATA/ad11prd/datafile/webint_data.471.696751555
+DG_DATA/ad11prd/datafile/webint_data.831.696753039
+DG_DATA/ad11prd/datafile/webint_data.454.696753017
+DG_DATA/ad11prd/datafile/webint_data.481.696515055

-- Now from checking the alert logs, I determined that the file 177 was added on 02-Sep-2009 at 12:24 (ish)...
-- So I wanted to see if I could recover to a time just before this event. - I could...
RMAN> recover database until time "to_date('02-SEP-2009 12:00:00','DD-MON-YYYY HH24:MI:SS')";

Done at: 09:08

-- Move up to the minute before the file addition...
RMAN> recover database until time "to_date('02-SEP-2009 12:24:00','DD-MON-YYYY HH24:MI:SS')";
Complete: at 09:19

-- Now try to hop over the minute and boom...
RMAN> recover database until time "to_date('02-SEP-2009 12:25:00','DD-MON-YYYY HH24:MI:SS')";

Same error about file# 177 as before.

-- Now issue that command that you know couldn't actually succeed as you know that the file is not available in any backupsets...
RMAN> restore datafile 177 until time "to_date('02-SEP-2009 12:25:00','DD-MON-YYYY HH24:MI:SS')";

--RMAN responds with...
Starting restore at 05-SEP-2009 09:28:53
using channel ORA_DISK_1
using channel ORA_DISK_2
using channel ORA_DISK_3
using channel ORA_DISK_4

creating datafile fno=177 name=+DG_DATA/ad11prd/datafile/webint_data.481.696515055
restore not done; all files readonly, offline, or already restored
Finished restore at 05-SEP-2009 09:29:03

-- This line says that Oracle MADE the file at this time. Before, it had not been available, but, now it's been created - this is a good sign, even though the message

-- carries a bit of a negative message (the bit about the restore not being done) - never mind.

RMAN> recover database until time "to_date('02-SEP-2009 12:25:00','DD-MON-YYYY HH24:MI:SS')";

-- Now when the same command (as was executed earlier) is re-submitted, Oracle prgresses to recover into the newly created file #177.

Starting recover at 05-SEP-2009 09:29:20
using channel ORA_DISK_1
using channel ORA_DISK_2
using channel ORA_DISK_3
using channel ORA_DISK_4

starting media recovery

archive log thread 1 sequence 12817 is already on disk as file +DG_FLASH/ad11prd/archivelog/2009_09_05/thread_1_seq_12817.2964.696762513

archive log thread 2 sequence 9024 is already on disk as file +DG_FLASH/ad11prd/archivelog/2009_09_05/thread_2_seq_9024.3346.696762813

archive log thread 3 sequence 1772 is already on disk as file +DG_FLASH/ad11prd/archivelog/2009_09_05/thread_3_seq_1772.3343.696762749

archive log filename=+DG_FLASH/ad11prd/archivelog/2009_09_05/thread_1_seq_12817.2964.696762513 thread=1 sequence=12817
archive log filename=+DG_FLASH/ad11prd/archivelog/2009_09_05/thread_2_seq_9024.3346.696762813 thread=2 sequence=9024
archive log filename=+DG_FLASH/ad11prd/archivelog/2009_09_05/thread_3_seq_1772.3343.696762749 thread=3 sequence=1772
media recovery complete, elapsed time: 00:00:02
Finished recover at 05-SEP-2009 09:29:43

-- Wow, now for the first time it's actually hopped over the missing file...
-- Time to push on with the recovery...
RMAN> recover database until time "to_date('02-SEP-2009 18:00:00','DD-MON-YYYY HH24:MI:SS')";

-- Now it progresses...
-- Let's move it to the next day...
09:59




RMAN> recover database until time "to_date('03-SEP-2009 18:00:00','DD-MON-YYYY HH24:MI:SS')";

Starting recover at 05-SEP-2009 10:02:15
using channel ORA_DISK_1
using channel ORA_DISK_2
using channel ORA_DISK_3
using channel ORA_DISK_4
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of recover command at 09/05/2009 10:02:29
RMAN-06094: datafile 178 must be restored
-- Again we receive the old 06094 message for the next datafile that was added.... (But at least we know what to do this time)...

RMAN> restore datafile 178 until time "to_date('03-SEP-2009 18:00:00','DD-MON-YYYY HH24:MI:SS')";

Starting restore at 05-SEP-2009 10:03:01
using channel ORA_DISK_1
using channel ORA_DISK_2
using channel ORA_DISK_3
using channel ORA_DISK_4

creating datafile fno=178 name=+DG_DATA/ad11prd/datafile/adapt_data.482.696614533
restore not done; all files readonly, offline, or already restored
Finished restore at 05-SEP-2009 10:03:11

-- Good, a negative looking message but at least it says that it made the dreaded missing file....

RMAN> recover database until time "to_date('03-SEP-2009 18:00:00','DD-MON-YYYY HH24:MI:SS')";

-- Re-run our next target attempt...

Starting recover at 05-SEP-2009 10:03:19
using channel ORA_DISK_1
using channel ORA_DISK_2
using channel ORA_DISK_3
using channel ORA_DISK_4
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of recover command at 09/05/2009 10:03:33
RMAN-06094: datafile 179 must be restored

-- Now we get an error for the next file... (so deal with this too)..
RMAN> restore datafile 179 until time "to_date('03-SEP-2009 18:00:00','DD-MON-YYYY HH24:MI:SS')";

Starting restore at 05-SEP-2009 10:04:48
using channel ORA_DISK_1
using channel ORA_DISK_2
using channel ORA_DISK_3
using channel ORA_DISK_4

creating datafile fno=179 name=+DG_DATA/ad11prd/datafile/adapt_data.483.696614539
restore not done; all files readonly, offline, or already restored
Finished restore at 05-SEP-2009 10:04:58

-- And so on through the remaining files that need to be created...
RMAN> restore datafile 180 until time "to_date('03-SEP-2009 18:00:00','DD-MON-YYYY HH24:MI:SS')";
RMAN> restore datafile 181 until time "to_date('03-SEP-2009 18:00:00','DD-MON-YYYY HH24:MI:SS')";
RMAN> restore datafile 182 until time "to_date('03-SEP-2009 18:00:00','DD-MON-YYYY HH24:MI:SS')";



-- Now, finally, we are ready to have an uninterrupted run to our next recovery target.
RMAN> recover database until time "to_date('03-SEP-2009 18:00:00','DD-MON-YYYY HH24:MI:SS')";

--10:54 complete.

-- Finally, issue the recovery command that will get us all the way to the endpoint.
-- Obviously, this could have been issued much earlier if we had been better prepared for our troubles.

RMAN> recover database until time "to_date('04-SEP-2009 17:58:00','DD-MON-YYYY HH24:MI:SS')";

-- Eventually, the recovery managed to get to the endpoint withoug further issues.



Tuesday, August 4, 2009

Monotonicity in Contiguous Perfstat Snapshots

Oracle statspack is a well respected performance monitoring, or should I say, performance capturing and reporting tool. It is usually a backdrop reporting system that is set up with little effort to start capturing system performance. Many different things are captured, but, the data I'm particularly interested in for this blog, is the sql summary data.

The default user, into which the statspack objects are stored is 'perfstat'. There are a number of different tables in this schema, all of which relate to the saving of data thay you can normally see just floating around in the v$ views. One of the most interesting sets of data captured is that which is stored in the stats$sql_summary table. This table is the storage location for data normally visiable in the v$sqlarea table and it's contents are used to for the basis of the SQL performance section of the statspack reports.

However, there are a few anomolies that you must understand in order to make a proper assessment of your system's performance using the statspack tools. Basically, they appear to suffer from what I would describe as non-monotonic data across contiguous snapshot ids. Now, what does all that mean? I heard you mumble. Right, so consider a single SQL statement, that has a fairly high execution count. - i.e. it is called a lot of times. Let's suppose that it is called so many times that it is always in the SQL pool.

For example.
SELECT 'x' FROM DUAL;

Now lets say that after taking a good look through your stats$sql_summary table, that you find the hash value of this little statement. If you then select out from the stats$sql_summary table all occurences of this little statement, then you may be in for a surprise. Ordered over time, you would expect such a statement to show ever increasing values: for example, two important columns in the stats$sql_summary table are: executions and cpu_usage. But, the interesting thing is that these metrics are not always always incrementing (monotonic).

Now, if you have been around for a while, you will probably start to say something like, "ah, that'll be because the statement ages out of the pool. When it is then re-discovered, it is given a clean set of statistics". If you said, this then you are correct. That does, indeed, happen with statements. However, I have been working on a case in which a statement like the one in the example shows a daily incrementing 'staircase' pattern, and then glitches downwards to around 1/2 its maximum value, then it starts to increment once again. It then increments for a few more days before then suffering another huge drop in its value. (The glitch affects both executions and cpu_usage parameters. I haven't studied other metrics, but I suspect that it is also true of these too.)

So what is the big deal? Does it matter that this happens? Well yes it does. When you run a statspack report you choose two snap_ids that will be used to compute a 'what has happened between these two snapshots' report. I've discovered that these snapshots do NOT take into account all the intermediate data captured. So, for example, suppose that the above statement was called 100,000 times an hour, over a day, the number of executions that this statement reports being called, rises from say 0 to 2.4 million. After a day and one hour, say, it it's execution value drops from 2.4 million down to 1.2 million, and then one hour later it drops again to 600,000 and then again to 300,000. Now, you run a statspack report to cover the whole period, and instead of being told that the statement has executed 2.7 million times, you are told, quite incorrectly, that it was only executed 300,000 times. This figure being the difference between its starting value and its ending value, now 300,000. But, statspack has missed the fact that in the meantime, this statement achieved 2.4 millions peak and then got glitched downwards.

If the statement ages out over time, then you have an even bigger issue to contend with. For example, suppose that a really big query that only runs once per day at lunchtime and consumes one CPU hour. And that:

1. this statement is not in the midnight snapshot (having been aged out through the day) and
2. this statement remains in the pool only to make it into the 13:00 snapshot. (by 14:00 it is aged out and disappears).

Now, if you run a statspack report for the whole day: midnight to midnight, you will not be shown the SQL statement that was running from 12:00 to 13:00. Furthermore, suppose that the statement starts at 12:01 and completes at 13:01. Then it would only appear in a single snapshot - assuming that the statement ages out by 14:00. Under these conditions you will not be shown this statement if you run a report between 12:00 and 13:00 as the statement did not have a value at 12:00 to form the starting point.