I don’t trust disk arrays. Even if you have great RAID protection, replication and snapshots, there is still something called firmware. And it can have a bug. Usually it does (just read the release notes of the next version), though fortunately not always a serious one. But there can be a bug that corrupts parity, and that can be really bad.
You need to protect your disk-based backups. I will describe three different options using NetBackup:
- Multiple copies of backup job
- Storage lifecycle policies
- My own Perl script using bpduplicate
Exactly this happened in one of my backup environments. The whole RAID group was gone.
O.K. Fine. We have a backup. Let’s get the data back.
Ehm. This data WAS the backup!
A few tens of TBs of backups. A really bad situation.
The vendor recreated the RAID group, applied a firmware patch, and we prayed not to get any restore request for that data.
Fortunately there were only backups with a retention of about 1-2 weeks.
A few weeks later…
…BOOM. Again.
The same RAID group, now with the “parity error proof” patch.
Failure probability is something every data owner has to take into consideration, and the “insurance” is called a backup. But how do you protect the disk-based backup itself? Simply by duplicating the data somewhere else.
To tape. Or to another disk array, as that’s easier. I know, I know. If one array has a bug, the other one (from the same vendor) has it too. The probability of failure (p) will never be zero, but p² (the probability of two arrays failing at the same time) is much better.
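Just to put illustrative numbers on it: if a single array has, say, a p = 0.001 chance of such a failure in a given period, two independent arrays both failing in that period comes out at p² = 0.000001, a thousand times less likely. The catch is the word “independent”: a shared firmware bug weakens that assumption, which is why tape or a different vendor is even safer.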
Usually you have more than one backup server in a location. (You don’t? Really? What’s your SLA for backup/restore?) Then it’s easy to duplicate backups between two or more backup servers.
Let’s discuss some options offered by Symantec NetBackup. Assume two or three media servers: mediaA, mediaB, mediaC. They use dedup pools: either separate PureDisk servers, external dedup appliances, or (my preferred option) local MSDP pools.
Here are the options…
1. Multiple copies of backup job
If you have just two media servers in a location, the situation can be as follows:
Then, in every schedule of every policy, you have to explicitly configure multiple copies. Since only STUs located on the same media server are allowed as targets for multiple copies, you have to use a configuration like the one in the picture above. If the “If this copy fails” option is set to “continue”, a backup can be reported as successful even though you cannot be sure whether it has one copy or two. If you select “fail all copies” instead, you can be sure that every valid backup has two copies, but you lose resiliency against a media server or storage pool outage.
To load-balance jobs across media servers, you have to do it yourself by selecting a media server for every combination of policy and schedule.
What about the case of three media servers?
Pretty interesting, isn’t it? But resource balancing is again up to you.
Summary:
- easy to configure
- you don’t know whether you have one or two copies (with “continue” for “If this copy fails”; see the check sketch below)
- no resiliency against an unavailable storage unit (with “fail all copies”)
- “manual” resource balancing
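If you do go with “continue”, you can at least audit the result afterwards and re-duplicate whatever is missing. Here is a minimal sketch of such a check; the admincmd path and the “Backup ID” / “Number of Copies” labels of bpimagelist -L are assumptions to verify against your NetBackup version:

```perl
#!/usr/bin/perl
# Minimal sketch (not a tested tool): report images from the last day that
# still have only one copy. The admincmd path and the "Backup ID:" /
# "Number of Copies:" labels of `bpimagelist -L` are assumptions to verify.
use strict;
use warnings;

my $bpimagelist = '/usr/openv/netbackup/bin/admincmd/bpimagelist';

open my $fh, '-|', "$bpimagelist -L -hoursago 24"
    or die "cannot run bpimagelist: $!";

my $backup_id;
while (my $line = <$fh>) {
    if ($line =~ /^Backup ID:\s+(\S+)/) {
        $backup_id = $1;
    }
    elsif ($line =~ /^Number of Copies:\s+(\d+)/) {
        print "$backup_id has only $1 copy\n"
            if defined $backup_id && $1 < 2;
    }
}
close $fh;
```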
2. Storage lifecycle policies
O.K. What about using SLPs? They are designed for duplication, so let’s use them. Fine. Assume the following simple configuration.
Let’s configure the SLP as follows:
1. Backup to STUG
2. Duplicate to STUG
What will happen? Some images will end up with both copies on the same storage pool. Huh? Yes. You never know which STU will be selected for the duplication; it can be the same one that was selected for the initial backup. This is not the protection I wanted.
You can play with a more complicated configuration.
Then you can have SLPs as follows (a CLI sketch for creating them follows the list):
SLP1
1. Backup to STUG-1
2. Duplicate to STU-C
SLP2
1. Backup to STUG-2
2. Duplicate to STU-A
SLP3
1. Backup to STUG-3
2. Duplicate to STU-B
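If you prefer to script the SLP creation instead of clicking it together in the GUI, something like the sketch below should do it via nbstl. I’m writing the options (-add, -uf with 0 = backup and 1 = duplication, -residence with one destination per operation) from memory, so treat them as assumptions and check nbstl in the Commands Reference for your NetBackup version; the sketch therefore only prints the commands instead of running them.

```perl
#!/usr/bin/perl
# Hypothetical sketch: create SLP1-SLP3 from the CLI. The nbstl option syntax
# below is an assumption (NetBackup 7.x style); verify it before replacing
# the print with system(@cmd).
use strict;
use warnings;

my $nbstl = '/usr/openv/netbackup/bin/admincmd/nbstl';
my %slp = (
    SLP1 => [ 'STUG-1', 'STU-C' ],   # [ backup destination, duplication destination ]
    SLP2 => [ 'STUG-2', 'STU-A' ],
    SLP3 => [ 'STUG-3', 'STU-B' ],
);

for my $name (sort keys %slp) {
    my ($backup_dst, $dup_dst) = @{ $slp{$name} };
    my @cmd = ($nbstl, $name, '-add',
               '-uf', '0,1',                            # operations: backup, duplication
               '-residence', "$backup_dst,$dup_dst");   # destination per operation
    print "would run: @cmd\n";                          # dry run on purpose
}
```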
Summary:
- still easy to configure
- resiliency advantage of STUGs
- duplication still requires “manual” resource balancing
+/- a duplication can fail if the destination STU is down: an acceptable risk, since the duplication will run again once the STU failure is resolved
3. My own script
As none of these approaches satisfied me, I decided to go my own way. What do I expect from a system creating the second, “paranoid” copy? Here is a short list:
- create the second copy within a reasonable time (1 day), using as many duplication attempts as necessary (in case some duplications fail)
- never duplicate to the same media server
- do not run this “paranoid duplication” for SLP-managed backups (why waste my storage space when the customer already demands explicit offsiting?)
- exclude particular images (a client producing backup images, about 10000 daily, faster than bpduplicate can copy them)
The result is my own Perl script, started by a policy schedule with a frequency of 1 hour. You can play with the backup window to avoid running duplications during the nightly backups. You need a database-type policy if you want the script listed in the backup selection to be executed rather than backed up; I’m using an Informix policy, as we don’t run any Informix. The functionality of the script is as follows:
- get a list of all unexpired images that have only one copy
- skip images created by an SLP
- skip images matching a regular expression (“bad” clients with a huge number of images)
- note which media server owns each image
- create a list of images per media server
- cut each list into smaller ones to limit the size of each duplication job (a maximum of X images or no more than Y GB; X and Y are configurable)
- write all these prepared lists to “Bidfiles”
- select the destination STU according to the image’s current owner (media server)
- run bpduplicate commands with a “Bidfile” as a parameter, all in parallel, and wait until all children complete
The last step ensures that no further processing starts until all duplications have finished: the NBU scheduler will not start a new job for the same policy and schedule, so no image handled by an already running bpduplicate will be processed again (as it may not be duplicated yet). A simplified sketch of the whole flow follows.
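To give an idea of the flow, here is a heavily simplified sketch. It is not my production script: the 24-hour window, the static media-server-to-STU map, the exclude pattern and the bpimagelist -L labels (including “SLP Name” and the Host line used for the owning media server) are assumptions you would have to adapt and verify in your own environment.

```perl
#!/usr/bin/perl
# A minimal sketch of the "paranoid duplication" flow described above -- not the
# production script. Assumptions to verify: the standard admincmd path, that
# `bpimagelist -L` prints the labelled lines matched below (Backup ID, Client,
# Kilobytes, Number of Copies, SLP Name, a Host line for the owning media
# server), and the static media-server -> destination-STU map.
use strict;
use warnings;
use File::Temp qw(tempfile);

my $admincmd    = '/usr/openv/netbackup/bin/admincmd';
my $max_images  = 200;             # X: max images per duplication job
my $max_kbytes  = 500 * 1024**2;   # Y: max ~500 GB per duplication job
my $exclude_re  = qr/^badclient\d+$/;   # hypothetical "bad" clients flooding the catalog
my %dstunit_for = (                     # always duplicate away from the owning media server
    mediaA => 'STU-B',
    mediaB => 'STU-C',
    mediaC => 'STU-A',
);

# 1. collect recent images and group the single-copy ones by owning media server
my (%img, %per_server);
sub flush_image {
    return unless $img{id};
    unless ( ($img{copies} // 2) > 1              # already has a second copy
          || $img{slp}                            # SLP-managed: offsiting handled there
          || ($img{client} // '') =~ $exclude_re ) {
        push @{ $per_server{ $img{server} // 'unknown' } }, { %img };
    }
    %img = ();
}

# the 24-hour window is just an example; widen it to cover all unexpired images
open my $list, '-|', "$admincmd/bpimagelist -L -hoursago 24"
    or die "bpimagelist failed: $!";
while (my $line = <$list>) {
    if    ($line =~ /^Backup ID:\s+(\S+)/)        { flush_image(); $img{id} = $1 }
    elsif ($line =~ /^Client:\s+(\S+)/)           { $img{client} = $1 }
    elsif ($line =~ /^Kilobytes:\s+(\d+)/)        { $img{kb}     = $1 }
    elsif ($line =~ /^SLP Name:\s+(\S+)/)         { $img{slp}    = $1 }   # label assumed
    elsif ($line =~ /^Number of Copies:\s+(\d+)/) { $img{copies} = $1 }
    elsif ($line =~ /Host:\s+(\S+)/)              { $img{server} //= $1 } # owning media server, label assumed
}
close $list;
flush_image();

# 2. chunk each per-server list and start one bpduplicate per chunk, in parallel
my @pids;
for my $server (keys %per_server) {
    my $dst = $dstunit_for{$server} or next;   # no mapping: leave those images alone
    my @todo  = @{ $per_server{$server} };
    my @chunk;
    my $kb = 0;
    while (@todo) {
        my $image = shift @todo;
        push @chunk, $image;
        $kb += $image->{kb} // 0;
        next unless @chunk >= $max_images || $kb >= $max_kbytes || !@todo;

        my ($fh, $bidfile) = tempfile('paranoid_dup_XXXX', TMPDIR => 1);
        print {$fh} "$_->{id}\n" for @chunk;    # the "Bidfile": one backup ID per line
        close $fh;

        my $pid = fork();
        die "fork failed: $!" unless defined $pid;
        if ($pid == 0) {                        # child: one duplication job
            exec "$admincmd/bpduplicate", '-Bidfile', $bidfile, '-dstunit', $dst;
            die "exec bpduplicate failed: $!";
        }
        push @pids, $pid;
        @chunk = ();
        $kb    = 0;
    }
}

# 3. block until every duplication finishes, so the next scheduled run
#    never re-processes images that are still being duplicated
waitpid($_, 0) for @pids;
```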
If you are interested in this solution, contact me.