If you don’t know, here is a short summary:
During PostgreSQL replication troubleshooting, an admin deleted part of the database by mistake. Every real admin has made such a mistake at least once in their life.
“Let him who didn’t screw up some important system be the first to throw the stone.”
I don’t want to pick on GitLab itself; they are doing a good job for a lot of developers around the world. I will just use GitLab as an example of what happens when you lack paranoia while protecting important data.
And because life is not easy, the sh*t hit the fan in this case.
A company loses data; it happens. That’s why we have data protection. If the data are important, there is multi-level protection. We are saved. Or are we?
You have snapshots configured, but only once per day. That’s enough for common data but not for business-critical data. With copy-on-write technology, more frequent snapshots don’t occupy more space.
A paranoid person like me would keep snapshots every 2-3 hours for the past 24 hours and, probably, every 15 minutes for the past 2 hours.
As I mentioned, more frequent copy-on-write snapshots don’t occupy more space on the storage. At most, they increase the storage performance requirements, but in most implementations, where the storage (or volume manager, or filesystem) just keeps a table of pointers to previous versions of data blocks, the performance overhead is negligible.
Each of the more frequent snapshots also holds a smaller amount of “protected data,” so an individual snapshot can be expired even faster.
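That schedule can be sketched with plain cron, assuming a ZFS pool; the dataset name `tank/pgdata` and the snapshot naming are hypothetical, and the pruning job that destroys expired snapshots is not shown:

```shell
# Hypothetical crontab fragment: frequent snapshots of a ZFS dataset.
# (Percent signs must be escaped in crontab entries.)

# Every 15 minutes -- to be pruned after ~2 hours by a cleanup job:
*/15 * * * *  zfs snapshot tank/pgdata@frequent-$(date +\%Y\%m\%d-\%H\%M)

# Every 2 hours -- to be kept for ~24 hours:
0 */2 * * *   zfs snapshot tank/pgdata@hourly-$(date +\%Y\%m\%d-\%H\%M)
```

Because ZFS snapshots are copy-on-write, each entry only records pointers to existing blocks, so the extra frequency costs almost nothing until data actually change.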
And when you are using snapshots as part of your data protection strategy, be sure you have configured them everywhere they are required.
You have a traditional backup once per day. That’s enough in most cases, provided the critical data are also protected by the snapshots mentioned above.
(I know, I know. Protecting databases is more complicated.)
The trouble starts when you have no idea where the backups are stored or how to restore them. And when you finally find the backups, you discover that the backup files are just a few bytes in size.
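That “few bytes” failure mode is cheap to catch automatically. A minimal sketch: treat a backup as suspect unless the dump file exists and exceeds a sane minimum size. The path and threshold below are hypothetical; tune them to your data set:

```shell
#!/bin/sh
# check_backup FILE MIN_BYTES
# Returns non-zero if FILE is missing or smaller than MIN_BYTES,
# catching the classic "backup job ran, wrote an empty file" case.
check_backup() {
    file=$1
    min=$2
    [ -f "$file" ] || { echo "ALERT: $file missing" >&2; return 1; }
    size=$(wc -c < "$file" | tr -d ' ')
    [ "$size" -ge "$min" ] || { echo "ALERT: $file only $size bytes" >&2; return 1; }
    echo "OK: $file is $size bytes"
}

# Example invocation (hypothetical path, 1 MiB minimum):
#   check_backup /var/backups/db/latest.dump 1048576
```

A check like this belongs right after the backup job, wired to alerting; a size check is no substitute for a test restore, but it catches the most embarrassing failures within minutes instead of months.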
There is no time to search for backup locations in scripts and configuration files during an emergency. But you don’t need to remember them if you are using proper backup software instead of your own backup script.
It’s up to the backup software to know where the backup images are stored. Even if you have multiple copies of the backups in different locations, the backup software will choose the most suitable copy for the restore.
But there is still the possibility that the backup images are corrupted and unusable for a restore. Maybe you configured the software the wrong way, or there is a bug in the backup software itself.
There is no automatic protection against any of these cases. (Never ever believe your backup software vendor that there are no bugs inside.)
It’s up to you, especially for business-critical data, to run test restores at regular intervals. Not only will you find out whether your backups are technically correct, but you will also verify your backup/restore procedures.
And your team will be less stressed if they can practice the restore as “playing on the playground” rather than “saving the company at 4 a.m.”
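A restore drill can be as simple as a script the team runs on a scratch server. A sketch assuming PostgreSQL and a pg_dump custom-format archive; the paths, database name, and the verified table are hypothetical, and it requires a running PostgreSQL instance on the scratch host (never point it at production):

```shell
#!/bin/sh
# Periodic restore drill (a sketch): restore the latest dump into a
# throwaway database and verify content, not just the exit code.
set -e

SCRATCH_DB=restore_drill                 # hypothetical scratch database
DUMP=/var/backups/db/latest.dump         # hypothetical dump location

dropdb --if-exists "$SCRATCH_DB"
createdb "$SCRATCH_DB"
pg_restore --dbname="$SCRATCH_DB" "$DUMP"

# Checking a key table for a plausible row count catches "technically
# restorable but empty" backups ("users" is a hypothetical table).
ROWS=$(psql -d "$SCRATCH_DB" -tAc "SELECT count(*) FROM users;")
[ "$ROWS" -gt 0 ]
echo "restore drill OK: users has $ROWS rows"
```

Run it from cron on the playground host and alert on a non-zero exit; that turns the 4 a.m. heroics into a routine that has been rehearsed many times.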
Replication is often used as a kind of high availability for data. When you lose the disk array in the primary location, you still have the data available in another location. And with the right clustering solution (configured and functional), the outage of your application can be relatively painless.
But don’t forget that replication IS NOT a backup method.
Replication protects data only in particular cases. It is mandatory if you need protection against a data center outage; it’s the ideal solution there, because you have the data “somewhere else.” But it will not protect you against data corruption or accidental deletion.
An application bug? A virus? Ransomware? An admin mistake? The corrupted data are replicated to the remote location as well.
You could decrease the replication frequency and use the remote copy as a backup.
Pretty weird kind of backup.
But replication is a very poor backup method. The purpose of replication is to offer the lowest possible RPO (recovery point objective) in the case of a disaster.
The more frequent the replication, the better the RPO.
Synchronous replication is the ideal state.
But in the case of data corruption or deletion, the remote data are overwritten sooner than you can use them for recovery.
I see customers replicating once daily, every 6 hours, and so on, considering such replicas to be backups. What if the data corruption occurs just minutes or seconds before the scheduled replication job? You are in trouble.
So you must combine replication with snapshots on the remote site, or with a traditional backup.
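One way to get both properties is to keep history on the replica side: take a local snapshot before every incoming sync, so the remote copy holds previous versions as well as the latest one. A sketch assuming ZFS on the remote site and an rsync-based transfer; the dataset name, host, and paths are hypothetical:

```shell
#!/bin/sh
# On the REPLICA: snapshot the local copy before each sync. Even if
# corruption or a deletion is replicated in, the pre-sync snapshots
# still hold intact versions. (Assumes ZFS; "tank/replica" and the
# primary's path are hypothetical.)
set -e

STAMP=$(date +%Y%m%d-%H%M)
zfs snapshot "tank/replica@pre-sync-$STAMP"

# Pull the latest data from the primary (hypothetical host and path).
rsync -a --delete primary:/var/lib/data/ /tank/replica/
```

The same idea applies to storage-array replication: most arrays can snapshot the target volume before each replication cycle.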
Then the data are “somewhere else” AND you are able to return to a previous version.
If you consider it in your infrastructure design.
When it comes to data protection, paranoia is more important than the cost of the protection. No, that formulation is not quite right.
How expensive should the data protection be?
The cost of the protection should reflect the value of the data.
If you own a luxury car, you don’t rely on the factory-default immobilizer alone. You pay for satellite tracking and 24x7 monitoring. And in that case, the loss is a loss only for you.
If you lose data, it’s not a disaster just for you. You endanger the business of your company and the business of your customers.
One of the most difficult questions I’m asking my customers is “What’s the value of your data?”
Surprisingly, almost nobody is able to answer, despite the fact that it’s about the company’s survival: direct costs such as penalties, reputation loss, possible court costs,…
The cost of the technical solution for data protection will always be lower.
Remember! You are not paranoid enough.