Netbackup Best Practices For Ridiculously Busy Environments (But Not E…
While waiting for another EMC World session to start (this one is at “Guru” level, let’s see) I thought I might share some of my experience regarding running Netbackup on very large setups – nothing like learning by pain.
Don’t get me wrong – NBU has its marketshare for a reason. However, I want to make sure I dispel everyone’s deluded romantic notions about NBU being the be-all, end-all backup tool. It can work well, but only if you truly know its idiosyncrasies.
I can’t say I was tending the busiest NBU systems but, at one point, just one of my environments was doing about 15,000 backup jobs a day. Which is way too much – we fixed that pronto…
I won’t go too deep into each point. If anyone cares then post a comment and I will expand on it.
If you have a small shop running NBU on a single server, much of this is not for you – but there may still be a nugget or two in there… However, if you don’t at least use barcodes, I will go after you. Use tar or Windows backup, or even a rusty abacus, go to your corner and be quiet.
Have a dedicated master server – if there are many jobs, the last thing you want is your master also being busy doing backups and vaults. It’s the half-witted brains of the operation, don’t stress it.
Go way beyond the tuning recommendations in the manual – if you know what you’re doing. For example, I have some voodoo tunings for Solaris (up to 9) that make a huge difference. Prepare for comments from Veritas (Symantec, whatever) support… “no sir it’s not like in the book sir, we can’t guarantee it will work sir…” Whatever, I’ve gotten such ridiculously bad advice from their support that I still cringe (and sometimes pee a little) every time I get a flashback, not to mention the endless dreams and the screaming that wake me up at night.
Separate HBA ports for disk and tape. No exceptions. I don’t care what vendors say.
Separate TAN (Tape Area Network), if you can swing it.
Separate backup LAN. And/or Ethernet port bonding/trunking/teaming (whatever nomenclature appears in your systems). 4 gig ports per media server. 10G if you have the dough. 4 10G ports teamed and I will do the Wayne’s World “we’re not worthy” bit in front of you. Offer ends Dec 2007.
Experiment with TOE cards, such as the Alacritech ones. You will get closer to full gig, though they’re expensive. Bonding is way cheaper and effective if you have many clients.
Try to use port bonding that works at the switch level, too – 802.3ad is the standard, Cisco’s EtherChannel is Cisco’s. The software on the server and the setting on the switch have to jibe. Half-assed intermediate approaches are just that.
Don’t use weak switches at the core. I’m tired of seeing people with Cisco 4506 switches (6509 wannabe) and 8:1 oversubscribed 48-port cards. YOU WILL HAVE PROBLEMS! Do your homework: find out whether or not the switch is oversubscribed, find out the total backplane throughput, figure out the blade throughput, and don’t plug everything into the same port octet if you’re going to be oversubscribed – i.e. a 4-port team going to the octet that shares 1Gbit in a 4506 will not give you 4Gbits, it will give you, at best, a severely limited 150Mbits per port, tops, with problems. Did you know that if one of the 8 ports starts out before the rest and continues pumping, the rest will NOT make the first port reduce its speed but will instead trickle along at 10Mbits sometimes? Even after the initial transfer that was fast is finished and there’s nothing else going on? As Rutger Hauer said in Blade Runner, “I have… seen things you people wouldn’t believe”. Figure THAT one out when you’re having throughput problems.
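To put numbers on the oversubscription point, here is a back-of-the-envelope sketch in Python. The function names are mine, and the model is deliberately naive: it only computes the theoretical ceiling for a team whose ports all land in one shared port group. Real-world results, as described above, can be a lot worse than this ceiling.

```python
def oversubscription_ratio(ports_in_group, port_speed_mbit, group_uplink_mbit):
    """Nominal port bandwidth in a group divided by what the group can really pass."""
    return ports_in_group * port_speed_mbit / group_uplink_mbit


def team_ceiling_mbit(team_ports, port_speed_mbit, group_uplink_mbit):
    """Best-case aggregate throughput for a team that sits entirely in one port group.

    The team can never exceed the bandwidth shared by the group,
    no matter how many ports are bonded together.
    """
    nominal = team_ports * port_speed_mbit
    return min(nominal, group_uplink_mbit)


if __name__ == "__main__":
    # Figures from the text: 8 gigabit ports sharing 1 Gbit, 4-port team in that octet.
    ratio = oversubscription_ratio(8, 1000, 1000)
    ceiling = team_ceiling_mbit(4, 1000, 1000)
    print(f"oversubscription: {ratio:.0f}:1")
    print(f"4-port team ceiling: {ceiling} Mbit/s (nominal 4000 Mbit/s)")
```

Spreading the same four team ports across four different octets changes the `group_uplink_mbit` that applies to each port, which is the whole point of not plugging everything into the same octet.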
Use jumbo frames if you can. Bigger is better in this case. Do your homework, there are caveats.
Use the right block size for your tape devices. Windows users, beware. Patches are necessary. SP1 broke block sizes over 64K on 2003 Server.
Don’t go nuts with SSO! Among the myriad things Veritas doesn’t tell you unless you know the right people is that at around 250 instances of devices you will have weird device problems (25 tape drives shared among 10 media servers would make 250 instances). The safe number is closer to 150. Ignore this at your peril. If you use VTL just make more virtual drives.
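The instance count is just drives multiplied by media servers, so it is easy to check before you zone anything. A minimal sketch, using the two thresholds quoted above (about 250 for trouble, about 150 as the safe number); the function names and status strings are made up for illustration and have nothing to do with any NBU command.

```python
SAFE_INSTANCES = 150      # "the safe number is closer to 150"
TROUBLE_INSTANCES = 250   # "at around 250 instances ... weird device problems"


def sso_instances(shared_drives, media_servers):
    """Each shared drive shows up once per media server that can see it."""
    return shared_drives * media_servers


def assess_sso(shared_drives, media_servers):
    """Return the instance count and a rough verdict based on the thresholds above."""
    count = sso_instances(shared_drives, media_servers)
    if count >= TROUBLE_INSTANCES:
        verdict = "trouble"
    elif count > SAFE_INSTANCES:
        verdict = "risky"
    else:
        verdict = "ok"
    return count, verdict


if __name__ == "__main__":
    print(assess_sso(25, 10))  # the example from the text
    print(assess_sso(15, 10))
```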
Use snapshots as much as possible.
If you have more than a couple of media servers, consider a VTL.
If you have DBAs that insist on flushing the redo logs to tape every few seconds, get a heavy-gauge jumper cable and a power supply that can put out, say, 20KV, a coat hanger, and wearing nothing but a stained leather apron go to work on them until they regain their senses (or not). Good times.
If the DBAs can’t be persuaded even after their various body parts have been charred by high voltage, try to send the smaller backups to disk. Do NOT send frequent backups to tape. If a job is going to take less than 10min send it to disk.
As a corollary to #15 (the previous point), only use tape for large jobs that will truly stream your tape drives.
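The 10-minute rule is simple enough to write down. This is only a sketch of the decision, assuming you can estimate a job’s size and the throughput it will realistically get; both inputs and both function names are hypothetical, and nothing here talks to NBU.

```python
DISK_THRESHOLD_MINUTES = 10  # from the rule of thumb above


def estimated_minutes(size_mb, throughput_mb_s):
    """Rough job duration from size and sustained throughput."""
    return size_mb / throughput_mb_s / 60


def pick_destination(size_mb, throughput_mb_s):
    """Short jobs go to disk; only long, streaming jobs go to tape."""
    if estimated_minutes(size_mb, throughput_mb_s) < DISK_THRESHOLD_MINUTES:
        return "disk"
    return "tape"


if __name__ == "__main__":
    print(pick_destination(20_000, 80))     # a 20 GB redo log dump
    print(pick_destination(2_000_000, 80))  # a 2 TB full backup
```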
Know what your boxes can push. Most servers, even very large ones, will be hard-pressed to push 2 LTO3 drives, let alone LTO4. FYI, I’ve gotten LTO3 to go as fast as 130MB/s, consistent. Do the math. Beat the score! I cheated, BTW.
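“Do the math” looks roughly like this. The 130MB/s per drive is the figure quoted above; the 125MB/s per gigabit port is the raw line rate (1000 Mbit divided by 8), so real ports deliver less and the port count below is a floor, not a promise. The function names are mine.

```python
import math

GIGE_LINE_RATE_MB_S = 125.0  # raw 1 Gbit/s expressed in MB/s; usable rate is lower


def required_feed_mb_s(drives, mb_s_per_drive):
    """Sustained data rate the server must deliver to keep all drives streaming."""
    return drives * mb_s_per_drive


def gige_ports_needed(feed_mb_s, usable_mb_s_per_port=GIGE_LINE_RATE_MB_S):
    """Minimum number of gigabit ports needed to carry that feed."""
    return math.ceil(feed_mb_s / usable_mb_s_per_port)


if __name__ == "__main__":
    feed = required_feed_mb_s(2, 130)
    print(f"2 LTO3 drives at 130 MB/s need {feed} MB/s")
    print(f"that is at least {gige_ports_needed(feed)} gigabit ports, at line rate")
    print(f"and about {feed * 3600 / 1000:.0f} GB read from disk per hour")
```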
Know what expansion slots to use – not all are equal, even if they look the same.
Don’t push too much backup traffic over switch ISLs. Preferably don’t push any.
Be super-careful with command-line manipulation of the NBU DB. Perfectly valid commands will not function as you might think due to silly heuristics (or lack thereof). Stay tuned, there will be a large post outing NBU in the future. The amount of dirt I have is beyond staggering. Maybe I shouldn’t have said that, I might have to look out for contract killers or Veritas people offering payola, not sure which is preferable. I’m 5 feet tall, with a goatee, skinny and blond, by the way. You can’t miss me. I also have a distinct limp.
Beware of multiplexing. Too much and restores take forever. Too little and you can’t stream your devices. Disk is your friend. Anything beyond 4-way multiplexing on tape is not.
Do not send tapes offsite only once a week. You are asking for pervy uncle Murphy to pay you a visit, and he is a known repeat sex offender. He won’t discriminate, either.
If you use tapes, have 2 copies of everything.
Replicate to far away sites if at all possible. Tape should be a last resort.
Use VMware if at all possible. Along with #12 (snapshots) and #24 (replication to far away sites), this helps quick recovery.
Do at least 2-3 different backups of the NBU catalog. In really busy systems it’s impossible to do it after each session – there’s just no quiet time. Just have a copy on disk and 2 on tape (you can do the ones on tape inline, will create 2 at the same time, it works), then send the ones on tape to 2 different offsite locations. Have NBU email you the tape(s) barcodes it used for the catalog if you’re doing a non-standard catalog backup. Send an additional email to an externally accessible address. You’re not paranoid if they’re really out to get you!
Can you even read from disk as fast as you can write to your backup medium? Benchmark.
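If you have nothing better at hand, a crude sequential-read test is a few lines of Python. This is a sketch, not a proper benchmark: the function name is mine, and unless the file is much larger than RAM (or the cache is cold) the OS page cache will flatter the result. Point it at a big file on the volume you actually back up.

```python
import os
import tempfile
import time


def read_throughput_mb_s(path, block_size=1024 * 1024):
    """Read a file sequentially and return the observed rate in MiB/s."""
    total = 0
    start = time.perf_counter()
    with open(path, "rb", buffering=0) as f:
        while True:
            chunk = f.read(block_size)
            if not chunk:
                break
            total += len(chunk)
    elapsed = time.perf_counter() - start
    if elapsed <= 0:
        return float("inf")
    return total / (1024 * 1024) / elapsed


if __name__ == "__main__":
    # Demo on a small temporary file; far too small (and too cached) to mean anything.
    fd, demo_path = tempfile.mkstemp()
    try:
        os.write(fd, b"\0" * (16 * 1024 * 1024))
        os.close(fd)
        print(f"{read_throughput_mb_s(demo_path):.1f} MiB/s")
    finally:
        os.remove(demo_path)
```

Compare the number you get against what your tape drives need to stream; if the disk side loses, faster drives will not help.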
What’s your current network throughput if you max out all the media servers? Benchmark.
Don’t use your production systems as media servers. You are inviting uncle Murphy again and he’s feeling randy.
Use storage unit groups. Why on earth would you not?
Cluster the master.
Do NOT put media traffic through firewalls, it’s too much. ACLs on switches can work just fine.
Do NOT put a dedicated media server for a subset of your boxes that are secured from the main network. If they lose access to that media server, backups fail. Anyway, you’ll have to allow a few ports for the master to communicate with the media server, so you might as well let media server traffic through. If it seems that #32 and #33 (this point and the one before it) are slightly self-contradictory, give yourself a cigar.
Simplify your life. Elaborate and numerous policies are more ways to invite uncle Murphy.
That’s all I have for now. Is there more? Tons, but I need to go to the bathroom.