Today I will describe a real situation from one huge IT environment.
Imagine that you spend a million dollars from your budget. You build a high-capacity NAS in your primary data center. You build a second one in your backup data center.
All NFS and SMB shares are replicated from the primary site to the backup one.
Besides the replication, you have snapshots configured as well.
But do you have DR?
The architecture of this particular NAS is pretty nifty. Data are spread across several nodes. By default, data survive not only the failure of two disks in one node but even the complete failure of a whole node. (If your level of paranoia is higher, you can increase the level of protection as well.)
Well, data are protected on several levels:
- data are spread over cluster nodes to survive HW failure
- snapshots as protection against data corruption by a user or an application
- and, of course, data replication for the case when the primary array (or even the whole primary site) is destroyed.
In this last case, the backup site will start to provide the NFS and SMB shares in read-write mode. The customers’ data outage will take a few minutes at most.
After the disaster is resolved, the freshest version of the data will be replicated back to the primary site, and the NFS and SMB services will be switched over to the primary site again.
The ideal design. The invested money is optimally utilized.
No NFS/SMB service failover is configured!
Then why are the data replicated?
I just don’t get it. I tried to find out why high availability of the shares is not configured.
The answer really surprised me:
He: It wasn’t intended to be highly available.
Me: O.K. But we can configure it very easily.
He: It wasn’t intended to be highly available.
Me: We can use Microsoft DFS for high availability of SMB shares.
He: We are not going to use DFS anymore. It’s used for merging several smaller SMB shares into one big share.
Me: Yes, but also for failover of SMB paths. And we can use DNS aliases as well.
He: It wasn’t intended to be…
I hear the well known “All your base are belong to us!” over there.
I did get an explanation of why the data are replicated. The replication serves as an ordinary backup, and the recovery procedure is as follows:
- Recover the primary site (if required)
- Repair or replace the primary NAS
- Replicate all data from the secondary NAS back to the primary one
- Configure shares
As you have to transfer a few petabytes of data, it could take a few days or even weeks.
Me: But the customers will be without data!
He: They know about it and agreed.
Brilliant. Great design.
I call it a waste of money and a gamble with the customers’ trust.
And just a small configuration change can turn the mediocre solution into the ideal one.
Let me show you how the data structure is configured. Let’s call the primary NAS nas1 and the secondary one nas2.
The snapshots are configured on the levels “production”, “development” and “test”.
Also, “production”, “development” and “test” have their own replication streams and schedules.
The same directory structure exists on the secondary side (nas2), but the data there are read-only and the shares are not configured.
What are the possibilities? Let’s ask the vendor.
Me: How do we achieve high availability on these NASes?
Vendor: It’s simple. First, you need DR management software. Second, you need a dedicated management server where the software will run.
Of course, that server has to be highly available as well. And don’t forget about implementation work, training and so on.
It wasn’t the answer I expected.
In the end, it is just a few commands that handle your replication and switch between read-only and read-write mode.
There has to be a simpler way.
Just automate a few steps the NAS admin already knows.
Let’s go into some details.
Even though I mentioned Microsoft Distributed File System (DFS) above, in the end I wouldn’t recommend it:
- it doesn’t know that the data on nas2 are read-only
- the implementation for NFS shares is non-trivial
- there is a recommended limit of at most 500 links on an Active Directory server
It seems that DFS is meant for simple tasks.
The better solution is to use DNS.
You have A records in DNS for nas1 and nas2. Then you create a DNS alias (CNAME) nas pointing to nas1.
The DNS TTL for the alias should be relatively short (5-10 minutes), if not zero. The disaster recovery procedure can then be as follows:
- disconnect replication pairs
  nas1:/prod --> nas2:/prod
  nas1:/dev --> nas2:/dev
  nas1:/test --> nas2:/test
- switch paths from read-only to read-write mode on nas2
- change the DNS alias nas from nas1 to nas2
After the DNS cache expires (that’s the reason for the short DNS TTL of this alias), all requests for nas will point to nas2 and the data are available again.
You will probably lose open SMB sessions and NFS mounts, but that is still better than a few days without data at all.
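The order of the steps above matters, so here is a minimal Python sketch of the procedure as an ordered plan. It does not talk to any real array or DNS server: the step names (`break_replication`, `set_read_write`, `update_cname`) are hypothetical placeholders, and each would map to a vendor-specific command or API call in your environment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReplicationPair:
    source: str  # e.g. "nas1:/prod"
    target: str  # e.g. "nas2:/prod"


def failover_plan(pairs, alias="nas", new_target="nas2"):
    """Return the DR steps in the order they should run.

    The step names are placeholders, not real NAS or DNS commands.
    """
    steps = []
    # 1. Break replication first, so nothing overwrites the target paths.
    for pair in pairs:
        steps.append(("break_replication", f"{pair.source} --> {pair.target}"))
    # 2. Only then switch the target paths to read-write.
    for pair in pairs:
        steps.append(("set_read_write", pair.target))
    # 3. Repoint the alias last, so clients only ever reach writable data.
    steps.append(("update_cname", f"{alias} -> {new_target}"))
    return steps


pairs = [
    ReplicationPair("nas1:/prod", "nas2:/prod"),
    ReplicationPair("nas1:/dev", "nas2:/dev"),
    ReplicationPair("nas1:/test", "nas2:/test"),
]

for action, subject in failover_plan(pairs):
    print(f"{action:18} {subject}")
```

Keeping the plan as plain data makes it easy to review in change management before anything is executed.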
The nas1 recovery, or even the recovery of the whole primary site, can take days or even weeks, but the customers have their data available.
Yes, a long nas1 recovery increases the probability of a nas2 outage in the meantime. If you are extremely paranoid, or if breaching the SLA costs you too much, just build nas3 in some other location as well.
Even a long replication of the data back from nas2 to nas1 (you will probably have to transfer all data) doesn’t prevent you from fulfilling your data availability SLA.
A lot of customers require, in the case of HA systems, regular “Disaster Recovery” (DR) tests. Sometimes once a year, sometimes on a monthly basis.
The configuration described above doesn’t allow you to run a DR test just for a particular customer.
First, the DNS alias is for the whole NAS. Second, we are replicating the whole folders “prod”, “dev”, “test”.
The change that allows a separate DR at the customer level is not complicated, but it requires some testing.
The simpler part is the DNS configuration. Every customer, every business unit of a customer, even every share can have a dedicated DNS alias pointing either to nas1 or nas2.
But you need the same level of granularity in the replication jobs.
I recommend keeping the replication granularity at a reasonable level.
A few hundred replication jobs should not be an issue for a modern NAS, but a few thousand could already be a problem.
(Just ask your NAS vendor for recommended limits.)
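To make the per-customer aliases concrete, here is a small Python sketch that renders zone-file style CNAME lines and flips a single customer to nas2 for a DR test. The alias names (`cust1-nas` and so on) and the 300-second TTL are assumptions chosen for illustration, not values from the environment described here.

```python
def cname_records(aliases, ttl=300):
    """Render zone-file style lines: name, TTL, class, type, target."""
    return [
        f"{name} {ttl} IN CNAME {target}"
        for name, target in sorted(aliases.items())
    ]


# Hypothetical per-customer aliases; all point to the primary NAS by default.
aliases = {"cust1-nas": "nas1", "cust2-nas": "nas1", "cust3-nas": "nas1"}

# DR test for cust2 only: flip just that one alias, leave the others alone.
dr_test = {**aliases, "cust2-nas": "nas2"}

for line in cname_records(dr_test):
    print(line)
```

The DNS flip only helps if the replication job for that customer can be broken independently, which is why the replication granularity has to match the alias granularity.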
The replication reconfiguration is the trickier part.
Let’s consider the directory structure in the following picture.
You cannot simply configure a new replication job at the customer level while the old replication job is running. Two replication jobs would be writing data to the folder “cust1”, and that is unacceptable.
One way is to completely break the replication at the prod --> prod level and start hundreds of new replications at the customer level.
It’s the simplest way, but it takes pretty long. You have to transfer all data from scratch.
(Transferring 1 PB over a 20 Gbps line takes at least 111 hours even at full line rate; in practice, expect something close to 150 hours.)
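The arithmetic behind that estimate is easy to check. The sketch below assumes decimal units (1 PB = 10^15 bytes, 1 Gbps = 10^9 bits per second); the `efficiency` parameter is my own assumption to account for protocol overhead and a link that is never fully yours, and 0.75 is an illustrative value, not a measured one.

```python
def transfer_hours(size_bytes, link_gbps, efficiency=1.0):
    """Hours needed to move size_bytes over a link of link_gbps.

    efficiency is the assumed fraction of the line rate that is usable.
    """
    bits = size_bytes * 8
    usable_bits_per_second = link_gbps * 1e9 * efficiency
    return bits / usable_bits_per_second / 3600


PB = 10**15

print(round(transfer_hours(PB, 20), 1))        # 111.1 at raw line rate
print(round(transfer_hours(PB, 20, 0.75), 1))  # 148.1 at an assumed 75 % efficiency
```

Either way, you are looking at several days during which the data would not be protected by replication.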
If you are managing DR, the SLA is pretty strict and you cannot just decrease the data protection level for a few days.
So I recommend a way which has zero impact on the SLA but requires some free space on nas2.
The replication reconfiguration will be done as described in the following picture.
Both replications (old and new ones) can run in parallel without any data conflicts.
You can add the new replication jobs in steps/waves (it could be an advantage for your change management as well):
nas1:/prod/cust1 --> nas2:/dr/prod/cust1
nas1:/prod/cust2 --> nas2:/dr/prod/cust2
nas1:/prod/cust3 --> nas2:/dr/prod/cust3
...
and when all new replications are in sync, you just break the old replication
nas1:/prod --> nas2:/prod
Don’t forget to change the filesystem paths for all shares on nas2.
Now you can completely delete the filesystem path /prod on nas2.
Because the customers are accessing data on nas1, the share path reconfiguration has zero impact on your SLA.
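The migration described above can be sketched in Python as follows. This is only planning logic under stated assumptions: the function names are hypothetical, the `/dr` staging root follows the example paths above, and the sync state would in reality come from your NAS, not from a hand-written dictionary.

```python
def new_pairs(customers, env="prod", staging_root="/dr"):
    """Build per-customer replication pairs that land under a separate root,
    so they never write into the paths used by the old replication."""
    return [
        (f"nas1:/{env}/{cust}", f"nas2:{staging_root}/{env}/{cust}")
        for cust in customers
    ]


def waves(items, size):
    """Split the jobs into waves of at most `size` jobs each."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def can_break_old(sync_state):
    """The old replication may be broken only when at least one new job
    exists and every new job reports that it is in sync."""
    return bool(sync_state) and all(sync_state.values())


customers = ["cust1", "cust2", "cust3"]
for number, wave in enumerate(waves(new_pairs(customers), 2), start=1):
    for source, target in wave:
        print(f"wave {number}: {source} --> {target}")

# Hypothetical sync status; cust3 is still catching up.
state = {"cust1": True, "cust2": True, "cust3": False}
print("break nas1:/prod --> nas2:/prod now?", can_break_old(state))
```

The guard in `can_break_old` is the important part: the old replication stays in place until every new job has caught up, so the data are never left unprotected.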
It will surely pay off.
No? Because you need double the capacity available on nas2?
Well, not quite.
The files in nas2:/prod/cust1 and nas2:/dr/prod/cust1 are identical.
And what will a deduplicating storage do with identical data?
Yes, you are right. Storage will keep just one copy.
(Ehm, I expect you have deduplication-capable storage!)
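Here is a toy Python illustration of why the second copy costs almost nothing on such storage. It is a sketch of content-addressed deduplication in general, not of how any particular array implements it; real arrays typically deduplicate at block level, and whether data written by two different replication jobs are deduplicated against each other is something to confirm with your vendor. The class name and file names are made up.

```python
import hashlib


class DedupStore:
    """Toy store that keeps each distinct content only once."""

    def __init__(self):
        self.blobs = {}  # content hash -> data, stored once
        self.paths = {}  # path -> content hash

    def write(self, path, data):
        digest = hashlib.sha256(data).hexdigest()
        self.blobs.setdefault(digest, data)  # identical content is not stored again
        self.paths[path] = digest

    def logical_bytes(self):
        """Size as seen by users: every path counts in full."""
        return sum(len(self.blobs[d]) for d in self.paths.values())

    def physical_bytes(self):
        """Size actually consumed: every distinct content counts once."""
        return sum(len(blob) for blob in self.blobs.values())


store = DedupStore()
payload = b"customer data" * 1000
store.write("nas2:/prod/cust1/file.bin", payload)     # old replication target
store.write("nas2:/dr/prod/cust1/file.bin", payload)  # new replication target

print(store.logical_bytes(), store.physical_bytes())  # 26000 13000
```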
Your NAS is DR ready now.
It’s completely up to you whether, when a failover is needed, your admin clicks through everything in the storage management GUI (do you have a 12-hour RTO?) or you script all the steps (and the RTO will be 10-15 minutes).
Or… Or contact me!