Applying logs from thread 2 much slower than thread 1 - Data Guard

Hi,
I have a RAC Primary with node 1 and node 2 with a non-RAC dataguard setup.
I do see the archive logs getting shipped fine from both the thread 1 and 2 from primary.
But at the DG database i could see the thread 1 logs getting applied much more than thread 2.
Thread 2 applying logs have been really slow and lags behind compare to thread 1.
What can i do to get the thread 2 also apply logs more often like the thread 1.
I have standby redo logs equally for both thread 1 and thread.
Can someone help me understand and fix this issue.
Thanks in advance. 

Start by gathering information (and sharing it with us).
1. What version and operating system for all servers?
2. What is in the alert log?
3. What form of Data Guard?
4. What mode of Data Guard?
5. Are the log files being shipped in the correct order?
6. Are they arriving in the correct order?
7. What is different between the two servers?
8. Are ...
Wait a second ... are you shipping logs from both nodes? 

There is no independent application of logs from thread 1 or thread 2. They are thread-merged and applied in SCN order, so you will see the apply touch thread 2 less often if that node is less busy than node 1.
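If you want to see this on the standby, a minimal sketch of a query you could run there (assuming a recent v$archived_log history; the one-day window is arbitrary):

select thread#,
       max(sequence#) as last_received,
       max(case when applied = 'YES' then sequence# end) as last_applied
from v$archived_log
where first_time > sysdate - 1
group by thread#
order by thread#;

If last_applied for thread 2 trails its last_received only because thread 2 generates less redo, there is nothing to fix.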
Larry 

Thanks for your responses.
This is a physical standby database.
We have logs getting shipped from both node 1 and node 2.
The alert log doesn't show any error explaining why it is slow. I do see the logs getting shipped, and at the DG end media recovery is happening on the shipped logs, except that thread 2 is getting applied less often.
The logs are getting shipped in the correct order with respect to each thread.
There is no difference between these two servers; they are exact replicas used for the RAC (node 1 and node 2).
But this morning I saw another error showing a gap:
FAL[client]: Failed to request gap sequence
GAP - thread 1 sequence 22690-22739
I resolved the gap issue by restoring the logs that had been deleted after backup.
When I investigated, I found that primary node 1 backed up the logs and then, after a couple more archive log backup jobs, went ahead and deleted the archive logs even though they were still needed.
But the following is the error I saw just before the needed logs (thread 1, sequences 22690-22739) got deleted:
RMAN-08118: WARNING: could not delete the following archived redo log
archive log filename=+DATA/afcp/archivelog/2010_04_12/thread_1_seq_22689.1771.716095593 thread=1 sequence=22689
error from target database:
ORA-15028: ASM file '+DATA/afcp/archivelog/2010_04_12/thread_1_seq_22689.1771.716095593' not dropped; currently being accessed
I have usually only seen the warning saying it could not delete the log, but this is the first time I noticed the ORA-15028 for sequence 22689.
The logs after exactly this sequence started getting deleted even though they were still needed.
Can someone help me understand this and how to resolve this problem, so that the gaps can be avoided?
Thanks. 

Sorry, I missed this. I see that you posted it as a question by itself, which I did answer, so for continuity's sake here is the answer.
Unless you set an RMAN archive log deletion policy that takes the standby into account, archive logs will be deleted based solely on your normal retention policy, i.e. BACKED UP n TIMES. And since you are on 10g, you have to be using the Fast Recovery Area for such a policy to work. See the paper at http://www.oracle.com/technology/deploy/availability/pdf/RMAN_DataGuard_10g_wp.pdf
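For example, a minimal sketch of the standby-aware policy (10g syntax; it only governs archived logs kept in the recovery area):

RMAN> CONFIGURE ARCHIVELOG DELETION POLICY TO APPLIED ON STANDBY;

With that in place, archived logs in the recovery area only become eligible for automatic deletion once they have been applied on the standby.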
Larry

Related

How to resolve the GAP in Data Guard

Hi,
We have a DG environment (physical standby, 1.5 TB, non-RAC) on 11.2.0.2 and HP-UX. Yesterday I noticed that a few archive logs went missing, and because of this redo apply is on hold. I tried to verify the archive logs on the PRIMARY DATABASE and could not find them; these archive logs are not available in the backup location either. Now, what are the best options to resolve this issue? I can think of re-creating the standby, but I feel this is NOT the best option. I have heard something about restoring from an RMAN backup, etc. Can anyone shed light on this? Thanks in advance.
Regards,
Vijay.
Roll forward your standby database using an RMAN incremental. The steps are in MOS Note 836986.1.
HTH,
Brian
Yes, as BPeaslandDBA already mentioned, you do not need to rebuild the whole standby database; you can perform an incremental roll forward to resolve the gaps. You can also refer to this URL: http://www.oracle-ckpt.com/rman-incremental-backups-to-roll-forward-a-physical-standby-database-2/
HTH
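For anyone landing here later, a rough sketch of the roll-forward procedure from that note (the SCN, paths and tag below are placeholders; follow MOS Note 836986.1 for the full steps, including refreshing the standby controlfile if required):

-- On the standby: stop redo apply and note the current SCN
SQL> alter database recover managed standby database cancel;
SQL> select current_scn from v$database;

-- On the primary: take an incremental backup starting at that SCN
RMAN> backup incremental from scn 1234567 database format '/tmp/fwd_%U' tag 'STBY_ROLLFWD';

-- Copy the backup pieces to the standby host, then on the standby:
RMAN> catalog start with '/tmp/fwd_';
RMAN> recover database noredo;

-- Restart redo apply
SQL> alter database recover managed standby database disconnect from session;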
Hi, Thanks a lot for your reply. I followed the incremental backup method and the problem got resolved.
Awesome! Good to know your standby is back to where it should be.
Cheers,
Brian
Hi,
A quick question: do you have FAL_CLIENT and FAL_SERVER already set up correctly? If yes, then Data Guard should automatically be able to resolve the gaps.
Second thing: you should have a monitoring job that sends you Data Guard status alerts about gaps and/or not-applied archived logs, rather than noticing it some day and then trying to resolve it.
Salman
> If yes, then dataguard should automatically be able to resolve the Gaps.

The key word here is "should". The standby will only try to retrieve the missing log just a few times. Then it will give up and wait for manual intervention.

> you should have a monitoring job to send you the dataguard status alerts about gaps and/or not-applied archived logs rather than noticing it some day and then trying to resolve it.

Agreed. I've found that even with EM12c, the canned alerts don't work correctly to let you know when the standby has diverged too far from the primary. The canned alerts rely on MRP running. If there is a log gap that can't automatically be resolved, MRP will go down, and so will the ability to get an alert. So on my standby, I have this script scheduled via cron:

#!/bin/bash
#
# report_apply_lag.sh
# by Brian Peasland
# 17 October 2012
#
export ORACLE_HOME=/u01/app/oracle/product/11.2.0.3
export LD_LIBRARY_PATH=$ORACLE_HOME/lib
export PATH=$ORACLE_HOME/bin:/usr/kerberos/bin:/usr/local/bin:/bin:/usr/bin
export SCRIPT_HOME=/home/oracle/scripts
export DBA_EMAIL=mydba@acme.com

rm $SCRIPT_HOME/report_lag.out

for SID in standbydb1 standbydb2
do
   export ORACLE_SID=$SID
   $ORACLE_HOME/bin/sqlplus "/ as sysdba" @$SCRIPT_HOME/report_apply_lag.sql
   cat $SCRIPT_HOME/lag.out >> $SCRIPT_HOME/report_lag.out
done

cat $SCRIPT_HOME/report_lag.out | /bin/mailx -s "Apply Lag Report" $DBA_EMAIL
rm $SCRIPT_HOME/lag.out

And the contents of report_apply_lag.sql is:

column apply_lag format a20
spool /home/oracle/scripts/lag.out
select i.instance_name, d.value as apply_lag,
       to_char(sysdate,'YYYY-MM-DD HH24:MI:SS') as curr_time
from v$instance i, v$dataguard_stats d
where d.name='apply lag';
spool off
exit

This just emails me a report to my Inbox daily. I keep 7 days worth of archived redo logs on disk. Because I have a four hour apply delay and because I ensure that the online redo logs are switched at least once per hour, my apply lag will be anywhere between 4 and 5 hours. If my report has a time outside of this range, then I know I have an issue to manually correct. I should catch this in my 7 day window so I can manually copy the archived redo log on the primary to the standby and register it there.

HTH,
Brian
Hi, 
> If yes, then dataguard should automatically be able to resolve the Gaps.
 
The key word here is "should". The standby will only try to retrieve the missing log just a few times. Then it will give up and wait for manual intervention.
Well, there are a lot of possibilities (including bugs and firewall issues) which may hinder resolving a gap, but in normal circumstances it should resolve the gap. You said "The standby will only try to retrieve the missing log just a few times"; can you please let us know how many times it would try? Any note/document which mentions this? Please read MOS document 1537316.1.
Salman
I don't have any source to cite, just my personal experience. I've seen it, on more than one occasion, where FAL quits requesting the archive log to fill in the gap after a period of time.
Cheers,
Brian

[DG Physical] ORA-00368: checksum error in redo log block

Hi all,
I'm building a DR solution with 1 primary and 2 DR sites (physical standbys).
All DBs use Oracle 10.2.0.3.0 on Solaris 64-bit.
The first one ran fine for some days (6), then I installed the 2nd. After restoring the DB (DUPLICATE TARGET DATABASE FOR STANDBY) and getting ready to apply redo, the DB fetched the missing archive log gaps and I got the following error:
==================
Media Recovery Log /global/u04/recovery/billhcm/archive/2_32544_653998293.dbf
Errors with log /global/u04/recovery/billhcm/archive/2_32544_653998293.dbf
MRP0: Detected read corruption! Retry recovery once log is re-fetched...
Wed Jan 27 21:46:25 2010
Errors in file /u01/oracle/admin/billhcm/bdump/billhcm1_mrp0_12606.trc:
ORA-00368: checksum error in redo log block
ORA-00353: log corruption near block 1175553 change 8236247256146 time 01/27/2010 18:33:51
ORA-00334: archived log: '/global/u04/recovery/billhcm/archive/1_47258_653998293.dbf'
Managed Standby Recovery not using Real Time Apply
Recovery interrupted!
Recovered data files to a consistent state at change 8236247255373
===================
I think RFS may have fetched the file incorrectly, so I used FTP to get this file and continued the apply, and it passed. Comparing the RFS-fetched file with the FTP copy showed they were different. At that point I concluded something was wrong with RFS, because the content of the archive log was not right. (I used BACKUP VALIDATE ARCHIVELOG SEQUENCE BETWEEN N1 AND N2 THREAD X to check all the archive logs RFS fetched; there was corruption in every file.)
I restored the DR DB again and applied an incremental backup from the primary, and now it runs well. I don't know what is happening, as I followed the same procedure for all DR DBs.
Last night I had to stop and restart DR site 1. Today I checked and it got the same error as the 2nd site, with corrupted redo. I tried deleting the archive logs and letting RFS re-fetch them, but the files were corrupt too.
If this continues to happen with the 2nd site again, that will be a big problem.
DR site 1 and the primary are linked by a gigabit switch, site 2 by a 155 Mbps connection (sufficient for my DB load of about 1.5 MB/s average apply rate).
I searched Oracle Support (Metalink) but had no luck; there is one case but it mentions max_connections > 1 (mine is the default of 1).
Can someone show me how to troubleshoot/debug/trace this problem?
That would be a great help!
Thank you very much. 
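For reference, a minimal sketch of the RMAN validation the poster mentions (the thread and sequence numbers are placeholders; any corruption found is reported in the RMAN output and trace files rather than failing silently):

RMAN> BACKUP VALIDATE ARCHIVELOG SEQUENCE BETWEEN 32500 AND 32560 THREAD 2;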
This (Replication) is the wrong forum for your posting.
Please post to the "Database - General" forum at
General Database Discussions
But, first, log an SR with Oracle Support.
Hemant K Chitale

archive log location as single point of failure in RAC env

Hi,
Currently, for our Oracle 10g RAC databases on Solaris, we use a shared cluster filesystem to keep the archive logs for the 2-node RAC database.
If the archive log destination becomes full for some reason, then the whole database hangs until we clean up the logs.
If we keep two separate archive log destinations for the two instances and one of the archive destinations becomes full, will the whole database be affected, or only that particular instance?
Is anyone using this type of configuration, and are there any issues with setting a separate shared archive location for each instance?
node1 - /arch_logs1
node2 - /arch_logs2
If we were to maintain each archivelog thread for a RAC database in a different destination and only one destination were to fill up, would it result in a database-wide event in which all instances are frozen until the condition is remediated, or would only the instance whose archivelog destination becomes full be affected?
thanks . 
Hi,
why not try to correct the root cause (the accumulation of archived redo logs) rather than trying to work around it?
Normally, if you set db_recovery_file_dest_size to about 90% of the space available for the archived redo log destination (defined in db_recovery_file_dest) and configure RMAN so that either a recovery window or a certain number of backups is required for the archived redo logs, the database will free up the space when needed.
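As an illustration only (the size, path and retention window below are made-up values for a two-node RAC; adjust to your environment), the pieces involved would look something like:

SQL> alter system set db_recovery_file_dest_size = 180G scope=both sid='*';
SQL> alter system set db_recovery_file_dest = '/arch_logs' scope=both sid='*';

RMAN> CONFIGURE RETENTION POLICY TO RECOVERY WINDOW OF 7 DAYS;
RMAN> BACKUP ARCHIVELOG ALL NOT BACKED UP 2 TIMES;

The idea is that once the archived logs have been backed up according to your policy, the database can age them out of the recovery area automatically under space pressure instead of letting the archiver hang.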
Regards
Sebastian 
BTW, forgot to mention:
Even though the archiver only gets stuck on one instance, the real problem is that Oracle cannot commit the latest changes on that instance (since it cannot write to an online redo log, because none is free).
And if a session stuck on that COMMIT holds blocks which are needed on the other instance, then the other instance will start to "wait" as well (though maybe a little later).
Regards
Sebastian 
thanks for the reply. This is helpful.
We will go with adding more space to the archive log destination and regularly backing up the archive logs to address this issue.

ORA-16724/ORA-16783: cannot resolve gap

Primary Database: 2 node rac cluster, 11.2.0.2 GI and RDBMS
Standby Database: single instance 11.2.0.2
OS: AIX 7.1
Hello, I'm in the middle of setting up a Data Guard system, and am seeing an ORA-16783 and an ORA-16724. The system is saying it cannot resolve a gap. I ran this query to find the archive sequences that were needed: select * from v$archive_gap. Two archive sequences were missing. I restored both of them to the primary system. One of the sequences was automatically recovered by the Data Guard system; the other was not. I manually copied the archive log that was not automatically recovered to the standby system. When I run select * from v$archive_gap now, no rows are returned. All archive log files that are currently in the primary system's archive directory have been copied to the standby system.
Please let me know if anyone has an idea where the problem could be.
Thanks 
It would be very helpful if we could see the actual error messages rather than your interpretation of them.
Personally I'd burn it to the ground and do it all again properly. Following the written docs for a Data Guard physical standby you would not have had a gap requiring manual intervention. 
Hello;
Those are tough errors. I have yet to see anybody fix them and repair the standby so it's good. In a nutshell, the base for the Standby is bad. At the point where the Standby would start applying the logs, it cannot, because it is missing archive logs from before that point.
You can spend many hours trying to add those archive logs so your apply will work. Most likely you will chew up a lot of time for nothing. Kind of like throwing good money after bad.
Your best option is to cut your losses and redo the Standby. Nobody in your shoes wants to hear that, but there it is.
Best Regards
mseberg 
damorgan wrote:
It would be very helpful if we could see the actual error messages rather than your interpretation of them.
The actual error messages are in the subject line of my post.
damorgan wrote:
Following the written docs for a Data Guard physical standby you would not have had a gap requiring manual intervention.
Link to the Oracle whitepaper I used for this installation: http://www.oracle.com/us/solutions/sap/wp-ora4sap-dataguard11g-303811.pdf
I encountered this error because the whitepaper specifies to run a cold backup and copy it to the standby site. The source system is a multiple TB production system. A long outage for a cold backup/system copy is not an option. I had to use hot incremental backups to get the copy as close to the source as possible, and let DG sync the delta. 
Hi mseberg,
To your point:
mseberg wrote:
Your best option is to cut your losses and redo the Standby. Nobody in your shoes wants to hear that, but there it is.
Unfortunately, this was a production system and I had a limited amount of time to initiate the data guard system. The system copy process takes a long time because the source is so large, and I wouldn't have been able to fit it into my window.
As I only had one option for getting this system up in time, I did try to resolve this gap manually. I found something very interesting during this process. I ran this query on the standby site multiple times: 'select * from v$archive_gap;', and it returned no rows. But, the archive logs kept piling up and were not being applied. I eventually found in the standby DB alert log that the database was waiting for a previous archive sequence that did not exist in my archive mount point. V$archive_gap did not show that this sequence was required, but the alert log said otherwise. On the primary, I pulled this sequence up from a tape backup, and it applied to the standby and resolved the gap. I guess the moral of the story for anyone else who encounters the ORA-16724/ORA-16783 errors... v$archive_gap may not show a complete picture of all missing archive logs. Check the alert log and verify that the archive sequence required by the standby really exists in your archive mount point.
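A hedged sketch of one way to see what managed recovery is actually waiting on, run on the standby (MRP0 only appears while managed recovery is started):

select process, status, thread#, sequence#, block#
from v$managed_standby
where process like 'MRP%';

If STATUS shows WAIT_FOR_GAP or WAIT_FOR_LOG, the THREAD# and SEQUENCE# columns identify the log the standby is stuck on, even when v$archive_gap returns no rows.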
Mseberg, thanks also for your reference to using incremental backups to "catch up" the standby db with the primary db in a previous post of mine. I used this concept in preparing my standby copy. I took a hot lev 0 backup that ran through the night, and then took several hot lev 1 backups throughout the day prior to the conversion window. When it came time to start the system copy, I used the 'dorecover' option in my rman duplicate command (duplicate target database for standby nofilenamecheck dorecover;). Your idea prompted me to use the incremental backups, and this ended up saving a lot of time during this process. 
Hello again;
Active duplication might be an option. The main concern is the load that will be placed on the network and source host during this process.
In theory you could set the "RATE Channel Parameter" in RMAN to control this:
http://docs.oracle.com/cd/E11882_01/backup.112/e10642/rcmtunin.htm#BABDCEHG
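A minimal sketch of what that could look like (the 10M rate and channel names are arbitrary; the same RATE clause can be used on channels allocated for duplication):

RMAN> run {
  allocate channel d1 device type disk rate 10M;
  allocate channel d2 device type disk rate 10M;
  backup incremental level 0 database;
}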
Best Regards
mseberg 
Nice, I didn't realize you could throttle the channel rate, that could be very useful. We've also seen situations in the past where a full lev 0 backup causes a performance degradation while the system is under heavy load. This parameter could help us to minimize the impact in this situation. Thanks for the tip.

Standby DB is 30 minutes behind the Primary DB

Hello,
Our primary/standby DB monitoring check is reporting that the standby database is 30 minutes behind. Please help me solve this issue urgently.
Bundle of thanks in advance.
DB=11.2.0.3 on RHL
Best Regards 
Where is the issue? Are the logs transported but not applied or not transported?
Does the delay ever change or is it always exactly 30 minutes?
Is there an apply delay configured?
What is the nature of the network?
How many km between sites?
How much redo is being generated?
What is the bandwidth?
You've given us nothing to work with to help you. Sort of like saying "my watch is 30 minutes behind ... tell me why?" 
You have given very little information about your problem.
If archive logs were not shipped because of a network glitch and the archive gap is huge, then you should roll the standby forward using RMAN incremental backups.
If a delay is configured, then there is no issue at all.
Please provide complete information so that anyone can help you. 
Keep monitoring the delay. Is it fluctuating ? Are your redo volumes fluctuating by the hour ? Is the network bandwidth sufficient for the redo volume ?
Hemant K Chitale 
Before giving a solution to your problem:
- You need to clearly state what the problem is.
- What errors are you getting in the alert log on the standby?
- Also give us the inputs that damorgan/Hemant asked for.
Hello,
Apologies ...
There was no error in the standby DB alert log, but ORA-16810 was noticed in the standby section of Grid Control.
The standby DBs' '30 minutes behind the primary' message appeared on both the test and prod DBs.
The problem was solved somehow, but honestly speaking I am still confused and not sure how it got solved. I tried different things in the test environment but used only one command in prod, and that worked.
In Test environment:
- In Grid Control (11g), I reset DG and then ran Verify Configuration from the Data Guard option. The DG status was not marked with a green tick mark and 'Normal'.
- Verify Configuration reported ORA-16810 (and it is still there even though the problem is resolved).
- When the reset didn't solve it, I changed the DG mode from Maximum Protection to Maximum Performance to try to recover things manually.
- The current log was 1870 and the last applied was 1860 (usually it is 1 behind or the same).
- Checked LIST ARCHIVELOG ALL with RMAN target and LIST BACKUP OF ARCHIVELOG ALL with the RMAN catalog; the archive log files were there, if I remember correctly.
Note: the command which I think did the magic was RESTORE ARCHIVELOG SEQUENCE 1871, run against the primary with the RMAN catalog. It seems the standby knew that someone had restored its missing archive log on the primary. Is that a great feature of 11gR2?
I would be thankful if you could explain how, from the standby, I can find the current log and the last log applied via a command, and especially, maybe with RMAN, whether any archive logs are missing, and if, for example, a couple of them are missing, how to resolve that.
Thanks alot.
Best Regards,
The alert log on the standby will tell you which logs have been applied.
Are you using real-time apply or applying from archived logs only? With real-time apply you don't have to wait for the log to be archived before it is applied.
http://docs.oracle.com/cd/B28359_01/server.111/b28294/log_apply.htm
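If you are not already on real-time apply, a minimal sketch of switching to it on the standby (this assumes standby redo logs are in place):

SQL> alter database recover managed standby database cancel;
SQL> alter database recover managed standby database using current logfile disconnect from session;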
You also have the v$archived_log view, which has an APPLIED column that tells you which logs have been applied:

select SEQUENCE#, APPLIED, COMPLETION_TIME from v$archived_log;

When a redo log on the primary is archived it should automatically be shipped to your standby. How did the logs go missing? If the logs have been removed from the standby before they are applied, then the standby will try to fetch the logs again from the primary (FAL_SERVER and FAL_CLIENT parameters). If they have been "removed" from the primary, then you will need to restore the archive logs from your backups. You can check whether there is a gap by running the following on your standby:

select * from v$archive_gap;

Are you removing your logs manually or using RMAN to remove them? You should be using RMAN to remove them, with the following RMAN configuration:

configure archivelog deletion policy to applied on standby;

Check http://docs.oracle.com/cd/E11882_01/server.112/e25608/rman.htm#BAJBGEIF.
Are you really using maximum protection? Do you have multiple standbys? In maximum protection, if the redo for a transaction can't be written to the standby, then the primary will be shut down.
http://docs.oracle.com/cd/B28359_01/server.111/b28294/protection.htm
