Backup Admin: March 2014

Friday, March 28, 2014

[90:801] Active Removable Storage Manager (RSM) service found on local system.

[Normal] From: BMA@mediasrv01.in.com "MSL2024_D02" Time: 3/23/2014 2:57:16 PM

STARTING Media Agent "MSL2024_D02"

[Warning] From: BMA@mediasrv01.in.com "MSL2024_D02" Time: 3/23/2014 2:57:16 PM

[90:801] Active Removable Storage Manager (RSM) service found on local system.

[Warning] From: UMA@mediasrv01.in.com "MSL2024" Time: 3/23/2014 2:57:18 PM

[90:59] Changer0:0:0:0

Cannot open exchanger control device ([2] The system cannot find the file specified. )

[Warning] From: UMA@mediasrv01.in.com "MSL2024" Time: 3/23/2014 2:57:18 PM

The device "MSL2024" could not be opened ("Device could not be accessed")

[Normal] From: UMA@mediasrv01.in.com "MSL2024" Time: 3/23/2014 2:57:18 PM

Starting the device path discovery process.

[Warning] From: UMA@mediasrv01.in.com "MSL2024" Time: 3/23/2014 2:57:19 PM

[90:59] Changer0:0:0:0

Cannot open exchanger control device ([2] The system cannot find the file specified. )

[Normal] From: BMA@mediasrv01.in.com "MSL2024" Time: 3/23/2014 2:57:21 PM

ABORTED Media Agent "MSL2024_D02"

Solution:

The Removable Storage Manager (RSM) service on MS Windows 2003 can cause device issues in a backup environment, as in the session above where the exchanger control device could not be opened.

It is recommended to stop and disable the RSM service on all MS Windows 2003 systems.

[Screenshot: RSM service in Started status]

Thursday, March 27, 2014

[61:2015] Timeout waiting for the devices to get free.

[Critical] From: BSM@cellsrv01.in.com "cellsrv01_IDB" Time: 02/27/14 02:00:31

[61:2015] Timeout waiting for the devices to get free.

The session will terminate.

>> Sharing the same device between different backups is very common in a backup environment. This error message is caused by device contention.

>> The backup device selected in the backup specification was not available when the backup needed it. It was in use by another process or by another backup/restore/copy session.

>> Check the device status using the lock name (see the post "To check/free locked devices in HP DP" for the commands). Wait for the device to become free.

>> The backup stays queued for the global timeout (in seconds) and fails if no device is freed and allocated to the backup session within that time.

>> The queuing time is reported in the backup statistics at the end of the backup session, as shown below:

Backup Statistics:

Session Queuing Time (hours) 0.00

-------------------------------------------

Completed Disk Agents ........ 5

Failed Disk Agents ........... 0

Aborted Disk Agents .......... 0

-------------------------------------------

Disk Agents Total ........... 5

=====================================

Completed Media Agents ....... 1

Failed Media Agents .......... 0

Aborted Media Agents ......... 0

-------------------------------------------

Media Agents Total .......... 1

===========================================

Mbytes Total ................. 17985 MB

Used Media Total ............. 1

Disk Agent Errors Total ...... 0
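If you collect session reports as text, the queuing time can be pulled out automatically to spot sessions that waited long for a device. The snippet below is only a minimal sketch: the function name and the regular expression are my own, and it assumes the line looks exactly like the "Session Queuing Time (hours)" line in the statistics above.

```python
import re
from typing import Optional

# Matches the queuing-time line as it appears in the statistics above.
QUEUING_RE = re.compile(r"Session Queuing Time \(hours\)\s+(\d+(?:\.\d+)?)")


def queuing_time_hours(session_report: str) -> Optional[float]:
    """Return the session queuing time in hours, or None if the line is missing."""
    match = QUEUING_RE.search(session_report)
    return float(match.group(1)) if match else None


# Sample text copied from the statistics shown in this post.
sample_report = """Backup Statistics:

Session Queuing Time (hours) 0.00

Completed Disk Agents ........ 5
"""

print(queuing_time_hours(sample_report))
```

A value close to the global timeout suggests the session spent most of its life waiting for a device.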

Tuesday, March 25, 2014

[90:63] Cannot load exchanger medium (Medium error.)

[Normal] From: BMA@winsrv01.in.com "MSL_D01" Time: 3/24/2014 10:53:43 PM

By: UMA@winsrv01.in.com@Changer0:0:0:0

Loading medium from slot 9 to device Tape0:0:0:0

[Major] From: BMA@winsrv01.in.com "MSL_D01" Time: 3/24/2014 11:03:16 PM

[90:63] By: UMA@winsrv01.in.com@Changer0:0:0:0

Cannot load exchanger medium (Medium error.)

[Normal] From: BMA@winsrv01.in.com "MSL_D01" Time: 3/24/2014 11:03:16 PM

ABORTED Media Agent "MSL_D01"

[Normal] From: BMA@winsrv01.in.com "MSL_D02" Time: 3/24/2014 11:03:21 PM

STARTING Media Agent "MSL_D02"

[Normal] From: BMA@winsrv01.in.com "MSL_D02" Time: 3/24/2014 11:03:26 PM

By: UMA@winsrv01.in.com@Changer0:0:0:0

Loading medium from slot 9 to device Tape1:0:0:0

[Major] From: BMA@winsrv01.in.com "MSL_D02" Time: 3/24/2014 11:13:10 PM

[90:63] By: UMA@winsrv01.in.com@Changer0:0:0:0

Cannot load exchanger medium (Medium error.)

[Normal] From: BMA@winsrv01.in.com "MSL_D02" Time: 3/24/2014 11:13:10 PM

ABORTED Media Agent "MSL_D02"

Solution:

The exchanger fails whenever it handles the medium from one particular slot (here, slot 9), on both drives. The medium residing in slot 9 has errors.

Go to the 'Devices & Media' context -> Click Library -> Click Slot -> go to Slot 9.

--- Check the status of the medium in slot 9.

--- In this case it was a poor medium. Isolate the medium by moving it to a separate media pool and unload it from the library. Wait for the medium to expire, then reformat it. If its status still shows poor after formatting, it is not advisable to use it any more.

--- Clean the affected drives with a good cleaning tape and run a test backup. The drives should then be good for backups.

Tuesday, March 18, 2014

IDB on Exclusive mode

Cannot open Internal Database in exclusive mode

Problem: Cannot open Internal Database in exclusive mode
Cannot back up the Internal Database because another database check is in progress

Solution 1: Bring down the omni services with the commands below

/etc/init.d/omni stop (on Unix)
<omni dir>/bin> omnisv -stop
Check whether any process is hung:
ps -ef | grep omni
If there is a hung process, kill it:

kill -9 <process ID>

If there is no hung process, bring the omni services back up:

/etc/init.d/omni start
Check whether all the services are up and running.
If they are:

go to /opt/omni/sbin

omnidbutil -clear (this command clears the ghost sessions)
You will get the message "Done!"

To check whether the IDB check is still running, go to /opt/omni/sbin
and run the "omnidbcheck" command.

You should no longer get the message "Database check is in Process".

Now re-initiate the IDB backup. It should run successfully.

Solution 2: If the issue is still not resolved by Solution 1

Log on to the cell manager and go to the path below:
/var/opt/omni/tmp

You will see a file named "tmp_dbcheck.lk". Remove that file with the command

rm tmp_dbcheck.lk

Restart the DP services. The issue should be resolved after performing either of the solutions.
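The "check for hung processes" step in Solution 1 can be scripted. Here is a rough Python sketch that filters `ps -ef` output for leftover omni processes. It assumes the usual Linux column layout (UID PID PPID C STIME TTY TIME CMD); other Unix flavours may print STIME differently. The function name and the sample process list are made up for illustration.

```python
from typing import List, Tuple


def omni_processes(ps_output: str) -> List[Tuple[int, str]]:
    """Return (pid, command) pairs for processes whose command mentions 'omni'."""
    found = []
    for line in ps_output.splitlines():
        # Columns: UID PID PPID C STIME TTY TIME CMD (CMD may contain spaces).
        fields = line.split(None, 7)
        if len(fields) < 8 or not fields[1].isdigit():
            continue  # header line or something we cannot parse
        command = fields[7]
        # Skip the grep process itself, as you would when reading the output by eye.
        if "omni" in command and not command.startswith("grep "):
            found.append((int(fields[1]), command))
    return found


# Made-up sample of "ps -ef" output taken after the services were stopped.
sample_ps = """UID        PID  PPID  C STIME TTY          TIME CMD
root      4021     1  0 02:00 ?        00:00:03 /opt/omni/lbin/bsm
root     17138  9230  0 08:19 pts/0    00:00:00 grep omni
"""

for pid, command in omni_processes(sample_ps):
    print(pid, command)
```

Anything this returns after the services are stopped is a candidate for `kill -9 <process ID>`. Review the list by hand before killing anything.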

BR0073E Setting of BRBACKUP lock failed

Problem:

BR0051I BRBACKUP 7.20 (25)
BR0055I Start of database backup: benini.qub 2014-03-18 08.00.26
BR0484I BRBACKUP log file: /oracle/SID/sapbackup/benini.qub
BR0071E BRBACKUP currently running or was killed
BR0072I Please delete file /oracle/SID/sapbackup/.lock.brb if BRBACKUP was killed
BR0073E Setting of BRBACKUP lock failed

BR0056I End of database backup: benini.qub 2014-03-18 08.00.26
BR0280I BRBACKUP time stamp: 2014-03-18 08.00.26
BR0054I BRBACKUP terminated with errors
[Major] From: OB2BAR_OMNISAP@orsapsrv1.in.com "OMNISAP" Time: 03/18/2014 08:00:26 AM
BRBACKUP /usr/sap/SID/SYS/exe/run/brbackup -t online_split -d util_file -c -p initSID.sap.bc -m all -q split -u / returned 3

[Normal] From: BSM@cellsrv01.in.com "orsapsrv1_SID01" Time: 3/18/2014 8:00:27 AM
OB2BAR application on "orsapsrv1.in.com" disconnected.

Solution:

>> Log in to the client with the SID of the database and check whether any brbackup process is running.

*****No brbackup process running*****

# ps -ef | grep brbackup

root 17138 9230 0 08:19 pts/0 00:00:00 grep brbackup

>> Check the monitor context for any DP backup that is still in progress. A backup can also have been initiated from the DB end. Wait until the backup completes and then check for the brbackup process again.

>> If there was an aborted/hung backup session, delete the .lock.brb file <path - /oracle/SID/sapbackup/.lock.brb>. Then start the backup spec, which should now complete.
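The important point is the order: the lock file should go only when no brbackup process is running. A small Python sketch of that rule follows. The function names are mine, the `ps -ef` text is passed in as a string (Linux column layout assumed), and the demo uses a temporary directory instead of the real /oracle/SID/sapbackup path.

```python
import os
import tempfile


def brbackup_running(ps_output: str) -> bool:
    """True if any process other than grep has 'brbackup' in its command."""
    for line in ps_output.splitlines():
        fields = line.split(None, 7)  # UID PID PPID C STIME TTY TIME CMD
        if len(fields) == 8 and "brbackup" in fields[7] and not fields[7].startswith("grep "):
            return True
    return False


def remove_stale_lock(lock_path: str, ps_output: str) -> bool:
    """Delete the lock file only when brbackup is not running. Return True if deleted."""
    if brbackup_running(ps_output) or not os.path.exists(lock_path):
        return False
    os.remove(lock_path)
    return True


# The ps output from this post: only the grep itself shows up.
idle_ps = "root 17138 9230 0 08:19 pts/0 00:00:00 grep brbackup"
# Made-up line showing a brbackup that is still running.
busy_ps = "root 20111 1 0 08:00 ? 00:00:05 /usr/sap/SID/SYS/exe/run/brbackup -t online_split"

with tempfile.TemporaryDirectory() as tmp:
    lock = os.path.join(tmp, ".lock.brb")
    open(lock, "w").close()
    kept = not remove_stale_lock(lock, busy_ps)  # backup running: lock stays
    deleted = remove_stale_lock(lock, idle_ps)   # nothing running: lock removed
    still_there = os.path.exists(lock)

print(kept, deleted, still_there)
```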

Sunday, March 16, 2014

ORA-19588: archived log RECID <> STAMP <> is no longer valid

Error:

RMAN-00571: ===========================================================

RMAN-00569: ======== ERROR MESSAGE STACK FOLLOWS ============

RMAN-00571: ===========================================================

RMAN-03009: failure of backup command on dev_0 channel at 03/08/2014 07:50:43

ORA-19588: archived log RECID 99528 STAMP 841638632 is no longer valid

Recovery Manager complete.

[Major] From: ob2rman@orasrv01.in.com "oradb01" Time: 03/08/14 07:50:56

External utility reported error.

RMAN PID=12391

[Major] From: ob2rman@orasrv01.in.com "oradb01" Time: 03/08/14 07:50:56

The database reported error while performing requested operation.

RMAN-00571: ===========================================================

RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============

RMAN-00571: ===========================================================

RMAN-03009: failure of backup command on dev_0 channel at 03/08/2014 07:50:43

ORA-19588: archived log RECID 99528 STAMP 841638632 is no longer valid

Recovery Manager complete.

Problem:

>> The error can be caused by invalid/stale archive log entries that were left uncleared even after a backup completed. One possible reason is two simultaneous backup sessions, where one session refers to an archive log that the other has already backed up, so it no longer needs to be backed up.

>> It also happens when RMAN finds an archive file in the archive location whose record is no longer available in the controlfile. The controlfile stores much less information than the recovery catalog.

Solution:

Execute the command "crosscheck archivelog all" at the RMAN prompt.

The output ends with a line like the one below:

Crosschecked 826 objects

Now you can re-trigger the backup spec, which should complete successfully.
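If you scan session logs with a script, the ORA-19588 line is easy to pick out and can serve as the cue to run the crosscheck before retrying. A minimal Python sketch, with a function name of my own and a pattern based only on the message text shown in this post:

```python
import re
from typing import Set, Tuple

# Pattern taken from the ORA-19588 line in the session output above.
STALE_LOG_RE = re.compile(
    r"ORA-19588: archived log RECID (\d+) STAMP (\d+) is no longer valid"
)


def stale_archivelogs(rman_output: str) -> Set[Tuple[int, int]]:
    """Return the distinct (RECID, STAMP) pairs reported as no longer valid."""
    return {(int(recid), int(stamp)) for recid, stamp in STALE_LOG_RE.findall(rman_output)}


# Lines copied from the error stack in this post.
sample_log = """RMAN-03009: failure of backup command on dev_0 channel at 03/08/2014 07:50:43

ORA-19588: archived log RECID 99528 STAMP 841638632 is no longer valid

Recovery Manager complete.
"""

print(stale_archivelogs(sample_log))
```

The error stack is printed twice in the session messages, which is why the result is a set.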

Batch script for "IDB Maintenance & Resolving the Velocis Error" for Windows Servers

Solution: Copy the script below, paste it into Notepad, save it as "IDB_velosis.bat" and double-click the file.

The Velocis error fix and the IDB maintenance are then done in just a click!

Note: Modify the script according to the drive/path where Data Protector is installed.

Copy the text between this line and "End of the script":

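rem IDB maintenance and Velocis error fix for a Windows cell manager.
rem Set DP_HOME to the folder where Data Protector is installed.
set "DP_HOME=D:\Program Files\OmniBack"

rem "and" instead of an ampersand: cmd treats an unescaped ampersand as a command separator.
echo # IDB Maintenance and Resolving the Velocis Error #

echo # Resolving the Velocis Error #

rem Stop the Data Protector services.
cd /d "%DP_HOME%\bin"

omnisv -status

omnisv -stop

rem Kill any Data Protector processes that are still running after the stop.
taskkill /IM vbda.exe /F /T
taskkill /IM bsm.exe /F /T
taskkill /IM dbsm.exe /F /T
taskkill /IM vrda.exe /F /T
taskkill /IM uma.exe /F /T
taskkill /IM crs.exe /F /T
taskkill /IM rds.exe /F /T
taskkill /IM mmd.exe /F /T

rem Delete the leftover pid, database check and context files from the tmp folder.
cd /d "%DP_HOME%\tmp"

del CRS.pid
del dbcheck.cdb
del dbcheck.mmdb
del lic.ctx
del mmd.ctx

rem Delete the .chg and .chk files from the syslog folder.
cd /d "%DP_HOME%\db40\logfiles\syslog"

del *.chg
del *.chk

rem Rename rdm.bil and rdm.chi in the catalog folder so they are out of the way.
cd /d "%DP_HOME%\db40\datafiles\catalog"

rename rdm.bil rdm.bil.old
rename rdm.chi rdm.chi.old

cd /d "%DP_HOME%\bin"

echo # Now bring up the database #

omnisv -start

omnidbutil -clear

omnidbutil -free_locked_devs

echo # IDB Maintenance #

omnidbutil -purge -messages 30 -force
omnidbutil -purge -sessions 30 -force
omnidbutil -purge -dcbf -force
omnidbutil -purge -filenames -force

exit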

End of the script

Try this in a test environment before running it in production. All the best :)

Saturday, March 15, 2014

[70:24] A system error occurred when starting the target script or an agent module.

Problem:

Backup of client machine (Win 2003) fails with the following errors.

[Critical] From: INET_thread_vbda.exe@winclient01.in.com "winclient01.in.com" Time: 03/04/14 08:46:39

[70:24] A system error occurred when starting the target script or an agent module.

The system error code reported is 2 and the message resolves to [2] The system cannot find the file specified.

Solution:

>> Uninstall and re-install the DP client.
>> Import the DP client into the cell using the GUI [or] add its entry to the cell_info file on the cell server.
>> Start the backup of the DP client, which should now be successful.

Friday, March 14, 2014

Zero KB Archive backup

Archive backups of all the databases on a specific Linux server were completing successfully, but with 0 bytes of data backed up.

[Normal] From: BSM@cellsrv01.in.com "dbclient01_DB01_AR" Time: 03/03/14 07:15:23

Backup session 2014/03/03-86 started.

[Normal] From: BSM@cellsrv01.in.com "dbclient01_DB01_AR" Time: 03/03/14 07:23:36

OB2BAR application on "dbclient01.in.com" successfully started.

[Normal] From: BSM@cellsrv01.in.com "dbclient01_DB01_AR" Time: 03/03/14 07:23:36

OB2BAR application on "dbclient01.in.com" disconnected.

[Normal] From: BSM@cellsrv01.in.com "dbclient01_DB01_AR" Time: 03/03/14 07:23:37

Backup Statistics:

Session Queuing Time (hours) 0.00

-------------------------------------------

Completed Disk Agents ........ 0

Failed Disk Agents ........... 0

Aborted Disk Agents .......... 0

-------------------------------------------

Disk Agents Total ........... 0

================================

Completed Media Agents ....... 0

Failed Media Agents .......... 0

Aborted Media Agents ......... 0

-------------------------------------------

Media Agents Total .......... 0

================================

Mbytes Total ................. 0 MB

Used Media Total ............. 0

Disk Agent Errors Total ...... 0

What was causing this?

Were there any transactions happening in these DBs at all?

Was there any patching on the server or DB that prevented the data transfer?

Why didn't DP report any error?

We had no idea until we checked the seaudit log. The inet service had been restarted from a local admin account, and from then on the DP processes lacked root access rights, which prevented them from reading the data.

05 Mar 2014 04:47:42 D SURROGATE useradm1 Read 69 2 USER.oradba /opt/omni/lbin/inet 198.10.10.2 root

Solution:

The inet daemon was restarted once again, this time as root.

Commands to restart the DP inet service on the Linux client:

# /etc/init.d/omni stop

# /etc/init.d/omni start

Thursday, March 13, 2014

Tip of the Day!

If you try to get the serial number of a device and find "N/A", the device is missing or malfunctioning. This also happens when the device is locked by BSM/RSM/CSM/UMA/LTT.

C:\>devbra -dev

Tape HP:Ultrium 1-SCSI Path: "Tape0:0:0:0" SN: "N/A"
Description: CLAIMED:HP LTO drive
Revision: E38W Device type: lto [13] Flags: 0x0011

To confirm the reason, run the command below to check for a lock:

>>omnimm -show_locked_devs | grep <lock_name>

If there's a lock, wait for the device to become free. Or you can forcibly free the device by executing the command below:

>>omnidbutil -free_locked_devs <lock_name> //this aborts any session that is using the device

Once the device is free, run the 'devbra -dev' command again and you should get the serial number. If not, reboot the media server or the library, depending on the error messages. If the problem still persists, replace the drive.
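With many drives it is easier to let a script list the ones that report "N/A". The Python sketch below parses text in the layout of the 'devbra -dev' output shown above. The function name and the regular expression are my own, and they assume every device line carries Path: "..." followed by SN: "...".

```python
import re
from typing import List

# Matches the Path/SN pair on a device line, as in the devbra output above.
DEVICE_RE = re.compile(r'Path:\s*"([^"]+)"\s+SN:\s*"([^"]*)"')


def devices_without_serial(devbra_output: str) -> List[str]:
    """Return the device paths whose serial number is reported as N/A."""
    return [
        path
        for path, serial in DEVICE_RE.findall(devbra_output)
        if serial.strip().upper() == "N/A"
    ]


# Output copied from this post.
sample_devbra = '''Tape HP:Ultrium 1-SCSI Path: "Tape0:0:0:0" SN: "N/A"
Description: CLAIMED:HP LTO drive
Revision: E38W Device type: lto [13] Flags: 0x0011
'''

print(devices_without_serial(sample_devbra))
```

Each path it returns is a drive to check for locks first, before suspecting the hardware.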

Friday, March 7, 2014

RMAN-03002: failure of shutdown command at 03/02/2014 09:31:22

DP sends the commands from the script (provided in the backup spec) sequentially to RMAN via the DP integration to start an offline backup of the Oracle database, as follows.

RMAN> CONNECT TARGET *

2> CONNECT CATALOG *

3> HOST 'exit';

4> run {

5> shutdown immediate;

6> startup mount;

7> allocate channel 'dev_0' type 'sbt_tape'

8> parms 'SBT_LIBRARY=/opt/omni/lib/libob2oracle8_64bit.so,ENV=(OB2BARTYPE=Oracle8,OB2APPNAME=oradb101,OB2BARLIST=Oracle_DB_Spec01)';

10> send device type 'sbt_tape' 'OB2BARHOSTNAME=Orasrv01';

11> backup incremental level 0

12> format 'Oracle_DB_Spec01<oradb101_%s:%t:%p>.dbf'

13> database;

14> backup

15> format 'Oracle_DB_Spec01<oradb101_%s:%t:%p>.dbf'

16> current controlfile;

17> alter database open;

18> }

19> EXIT

RMAN-06005: connected to target database: oradb101 (DBID=101)

RMAN-06008: connected to recovery catalog database

RMAN-00571: ==================================================

RMAN-00569: ========= ERROR MESSAGE STACK FOLLOWS =========

RMAN-00571: ==================================================

RMAN-03002: failure of shutdown command at 03/02/2014 09:31:22

ORA-01013: user requested cancel of current operation

Recovery Manager complete.

[Major] From: ob2rman@Orasrv01 "oradb101" Time: 03/02/14 09:31:22

External utility reported error.

Execution of the script starts and a connection is established between DP and the Oracle database via the DP integration. If the backup goes smoothly, the script brings the database back online and completes. If not, read on.

The backup has error messages:

RMAN-03002: failure of shutdown command at 03/02/2014 09:31:22

ORA-01013: user requested cancel of current operation

Here, the database is refusing the shutdown command issued by the backup script. An offline backup requires the database to be shut down and then brought up in mount state. Check whether the database is engaged in any long-running tasks.

Sure enough, when we checked, the database was busy with long-running Oracle jobs, which was the root cause of the shutdown command being cancelled. Try running the backup when the DB is free from jobs/tasks, during non-business hours with little or no traffic to the DB. After all, it's an offline backup.

To check/free locked devices in HP DP

HP DP Command to check/free a device

>> omnimm -show_locked_devs | grep <lock_name>
This command lists everything that is currently in use by the DP cell server: media, cartridges and devices.

Ex:

Type: Medium
Name/Id: c75f:0034:0000:XXXX //Media ID
Pid: 23362 //Process Id that utilizes the media
Host: cellsrv01.in.com //name of the cell manager (useful in MOM)
Label: BM1000L3 //Medium label

Type: Device
Name/Id: ESLE1_D02 //Drive name
Pid: 25754 //Process Id that utilizes the drive
Host: cellsrv01.in.com //name of the cell manager, (useful in MOM)

Type: Cartridge
Name/Id: ESLE1 //Name of the library
Pid: 25836 //Process Id that utilises the drive
Host: cellsrv01.in.com //name of the cell manager, (useful in MOM)
Location: 102

>> omnidbutil -show_locked_devs
This command lists all the devices, media, cartridges and slots that are in use by the Data Protector cell manager.

>> omnidbutil -free_locked_devs <device_name>

Ex:

>> omnidbutil -free_locked_devs ESLE1_D02
Confirm the command by hitting 'Y': Y

After this command has run, the drive is released and can be used for any other purpose.

P.S.: The omnidbutil and omnimm commands can be used interchangeably to identify locked devices. To free locked devices, use omnidbutil.
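When the list gets long, the output can be turned into records and filtered by type. This is a minimal Python sketch that assumes the "Key: value" layout of the example above, with every record starting at a "Type:" line. The `//` remarks in the example are my annotations and are assumed not to be part of the real output. Function names are hypothetical.

```python
from typing import Dict, List


def parse_locked(output: str) -> List[Dict[str, str]]:
    """Split 'Key: value' lines into one dict per record; 'Type' starts a new record."""
    records: List[Dict[str, str]] = []
    current: Dict[str, str] = {}
    for line in output.splitlines():
        key, sep, value = line.partition(":")  # split on the first colon only
        if not sep:
            continue  # blank line or a line without a key
        key, value = key.strip(), value.strip()
        if key == "Type" and current:
            records.append(current)
            current = {}
        current[key] = value
    if current:
        records.append(current)
    return records


def locked_devices(output: str) -> List[str]:
    """Return the names of the records of type Device."""
    return [r["Name/Id"] for r in parse_locked(output)
            if r.get("Type") == "Device" and "Name/Id" in r]


# Values taken from the example in this post, without the annotations.
sample_locks = """Type: Medium
Name/Id: c75f:0034:0000:XXXX
Pid: 23362
Host: cellsrv01.in.com
Label: BM1000L3

Type: Device
Name/Id: ESLE1_D02
Pid: 25754
Host: cellsrv01.in.com
"""

print(locked_devices(sample_locks))
```

Splitting on the first colon only keeps media IDs such as c75f:0034:0000:XXXX in one piece.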

Sunday, March 2, 2014

[61:4006] Could not connect to inet in order to start BMA@ "Device_name".

Backup Error

[Major] From: BSM@cellsrv01.in.com "winsrvspec01" Time: 2/28/2014 6:53:57 PM
[61:4006] Could not connect to inet in order to start BMA@Mediasrv02.in.com "MSL2024_D1".

[Critical] From: BMA@cellsrv01.in.com "MSL2024_D1" Time: 2/28/2014 6:54:55 PM
[90:1004] Device address not found.

Solution

>> The DP Inet service on the media server was not reachable from the cell server, so the Media Agent could not be started. When we checked, the media server had been removed from production and shut down.

>> In this case, the drive 'MSL2024_D1' should be removed/disabled in DP and the backup spec needs to be modified accordingly to avoid failures in future.

>> You can verify that the media server is reachable from the cell manager by pinging it, by connecting with telnet to the DP port 5555, or by using the omnitcpchk command.

For example:

[root@cellsrv01:/root]

# ping Mediasrv02.in.com

PING Mediasrv02.in.com: 64 byte packets

----Mediasrv02.in.com PING Statistics----

11 packets transmitted, 0 packets received, 100% packet loss

[root@cellsrv01:/root]

# /opt/omni/sbin/utilns/omnitcpchk -host Mediasrv02.in.com

Testing connection Mediasrv02.in.com <---> cellsrv01.in.com....

ERROR!

==============================

TcpCheck failed for host pair:

Mediasrv02.in.com <---> cellsrv01.in.com

[root@cellsrv01:/root]

# telnet Mediasrv02.in.com 5555

Trying...

telnet: Unable to connect to remote host: Connection timed out

>> If more than one device is selected in the backup specification, the backup completes using the next device in line.

>> Alternatively, the backup can run on the next zoned drive that has a correct SCSI path and is available for this backup session.
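The telnet test against port 5555 can also be done from a script, for example before a backup window. The Python sketch below only checks whether a TCP connection can be opened. It does not replace omnitcpchk, and the function name and the 5 second timeout are my own choices.

```python
import socket


def port_reachable(host: str, port: int = 5555, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port can be opened within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and hosts that do not resolve.
        return False


if __name__ == "__main__":
    # Host name taken from the example in this post.
    print(port_reachable("Mediasrv02.in.com"))
```

A False here means the same as the telnet timeout above: the inet port on the media server cannot be reached from the host running the check.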