Not surprisingly, here at Rewind we have a lot of data to protect (over 2 petabytes’ worth). One of the databases we use is Elasticsearch (ES, or OpenSearch as it is now called in AWS). Put simply, ES is a document database that returns search results lightning fast. Speed is critical when customers are looking for a specific file or item they need to recover with Rewind. Every second of downtime counts, so our search results need to be fast, accurate and reliable.
Another consideration was disaster recovery. As part of our System and Organization Controls 2 (SOC2) certification process, we needed to ensure that we had a working disaster recovery plan to restore service in the unlikely event that an entire AWS region went down.
“An entire AWS region?? That will never happen!” (Except when it does.)
Anything is possible and things do go wrong, so to meet our SOC2 requirements we needed a working solution. Specifically, we needed a way to safely, efficiently, and cost-effectively replicate our customers’ data to an alternate AWS region. The answer was to do what Rewind does so well: make a backup!
Let’s take a look at how Elasticsearch works, how we’ve used it to securely back up data, and what our current disaster recovery process looks like.
First we need a quick vocabulary lesson. In ES, backups are called snapshots. Snapshots are stored in a snapshot repository. There are several types of snapshot repository, including one backed by AWS S3. Since S3 can replicate its contents to a bucket in another region, it was a perfect fit for this particular problem.
AWS ES comes with an automated snapshot repository pre-enabled for you. The repository is configured to take hourly snapshots, and you cannot change anything about it. This was a problem for us because we wanted a daily snapshot sent to a repository backed by one of our own S3 buckets, configured to replicate its contents to another region.
List of automated snapshots: GET _cat/snapshots/cs-automated-enc?v&s=id
Our only choice was to create and manage our own snapshot repository and snapshots.
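To make "our own snapshot repository" a little more concrete, here is a minimal sketch of the request body used to register an S3-backed repository with `PUT _snapshot/<repository-name>` on an AWS-managed domain. The bucket name, region and account ID are hypothetical placeholders, not values from our setup, and on AWS the request itself must also be signed with SigV4, which is left out here.

```python
import json


def s3_repository_body(bucket: str, region: str, role_arn: str) -> dict:
    """Build the JSON body for PUT _snapshot/<repository-name>."""
    return {
        "type": "s3",
        "settings": {
            "bucket": bucket,      # bucket that will hold the snapshot files
            "region": region,      # region the bucket lives in
            "role_arn": role_arn,  # role the ES service assumes to reach S3
        },
    }


# All three values below are hypothetical placeholders.
body = s3_repository_body(
    bucket="example-es-snapshots",
    region="us-west-2",
    role_arn="arn:aws:iam::123456789012:role/S3SnapshotsIAMRole",
)
print(json.dumps(body, indent=2))
```

Curator's helper for creating repositories ends up sending a body of roughly this shape; the sketch only shows what has to be in it.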
Maintaining our own snapshot repository wasn’t ideal and sounded like a lot of unnecessary work. We didn’t want to reinvent the wheel, so we looked for an existing tool that would do the heavy lifting for us.
Snapshot Lifecycle Management (SLM)
The first tool we tried was Elastic’s Snapshot Lifecycle Management (SLM), a feature described as follows:
The easiest way to back up a cluster regularly. An SLM policy automatically creates snapshots on a preset schedule. The policy can also delete snapshots based on the retention rules you define.
You can even use your own snapshot repository. However, as soon as we tried to set this up in our domains, it failed. We quickly learned that AWS ES is a modified version of Elastic.co’s ES and that SLM was not supported in AWS ES.
The next tool we examined was Elasticsearch Curator. It is open source and maintained by Elastic.co itself.
Curator is simply a Python tool to help you manage your indexes and snapshots. It even has helper methods for creating custom snapshot repositories, which was an added bonus.
We chose to run Curator as a Lambda function driven by a scheduled EventBridge rule, all packaged in AWS SAM.
This is what the final solution looks like:
ES snapshot lambda function
The Lambda function uses the Curator tool and is responsible for snapshot and repository management. Here is a diagram of the logic:
As you can see above, it’s a very simple solution. But to make it work, we needed a few things:
- IAM roles to grant permissions
- An S3 bucket replicated to another region
- An Elasticsearch domain with indexes
The S3SnapshotsIAMRole grants Curator the permissions required to create the snapshot repository and manage the actual snapshots themselves.
The EsSnapshotIAMRole grants Lambda the permissions that Curator needs to interact with the Elasticsearch domain.
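The original role definitions are not reproduced here. As a rough illustration only, the sketch below builds policy documents following the general AWS pattern for manual snapshots: a role the ES service assumes to read and write the bucket, and a policy that lets the Lambda role pass that role and call the domain's HTTP API. The function names, bucket name and ARNs are hypothetical, and the exact action lists are an assumption rather than a copy of our template.

```python
import json


def snapshot_role_trust_policy() -> dict:
    """Trust policy letting the ES service assume the snapshot role."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "es.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }],
    }


def snapshot_role_policy(bucket: str) -> dict:
    """S3 permissions needed to store and prune snapshot files."""
    bucket_arn = f"arn:aws:s3:::{bucket}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Allow",
             "Action": ["s3:ListBucket"],
             "Resource": [bucket_arn]},
            {"Effect": "Allow",
             "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
             "Resource": [f"{bucket_arn}/*"]},
        ],
    }


def lambda_role_policy(snapshot_role_arn: str, domain_arn: str) -> dict:
    """Permissions the Lambda role needs to drive snapshots on the domain."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            # Needed to hand the snapshot role to ES when registering the repo.
            {"Effect": "Allow",
             "Action": "iam:PassRole",
             "Resource": snapshot_role_arn},
            # HTTP verbs used to register the repo and manage snapshots.
            {"Effect": "Allow",
             "Action": ["es:ESHttpGet", "es:ESHttpPut",
                        "es:ESHttpPost", "es:ESHttpDelete"],
             "Resource": f"{domain_arn}/*"},
        ],
    }


# Hypothetical names, for illustration only.
s3_policy = snapshot_role_policy("example-es-snapshots")
lambda_policy = lambda_role_policy(
    "arn:aws:iam::123456789012:role/S3SnapshotsIAMRole",
    "arn:aws:es:us-west-2:123456789012:domain/example-domain",
)
print(json.dumps(s3_policy, indent=2))
```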
Replicated S3 buckets
The team had previously used Terraform to set up S3 buckets with cross-region replication for other services. (More information here)
With everything in place, the CloudFormation stack ran fine in early production testing and we were done…or were we?
Backup and Restore-a-thon I
Part of the SOC2 certification requires you to validate your production database backups for all critical services. Because we like to have some fun, we decided to host a quarterly “Backup and Restore-a-thon”. We would assume that the original region was gone, restore each database from our cross-region replica, and validate the content.
You might think “Oh my god, that’s a lot of unnecessary work!” and you would be half right. It’s a lot of work, but it’s absolutely necessary! At every restore-a-thon, we’ve uncovered at least one issue: a service without backups enabled, or a team not knowing how to restore or how to access the restored backup. Not to mention the hands-on training and experience that team members gain from actually doing a restore without the high pressure of a real outage. Like a fire drill, our quarterly restore-a-thons keep our team prepared and ready for any emergency.
The first ES restore-a-thon took place months after the feature had been completed and deployed to production, so by then many snapshots had been taken and many old ones deleted. We had configured the tool to keep 5 days’ worth of snapshots and delete everything else.
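The retention rule is simple enough to sketch in plain Python. This is not Curator's code, just an illustration of the "keep 5 days, delete the rest" decision under the assumption that each snapshot has a name and a creation time; the snapshot names and dates are made up.

```python
from datetime import datetime, timedelta, timezone


def snapshots_to_delete(snapshots: dict, now: datetime, keep_days: int = 5) -> list:
    """Return the names of snapshots created more than keep_days before now."""
    cutoff = now - timedelta(days=keep_days)
    return sorted(name for name, created in snapshots.items() if created < cutoff)


# Hypothetical data: one snapshot per day at 02:00 UTC, March 1 to March 9.
now = datetime(2022, 3, 10, tzinfo=timezone.utc)
snapshots = {
    f"snapshot-2022-03-{day:02d}": datetime(2022, 3, day, 2, 0, tzinfo=timezone.utc)
    for day in range(1, 10)
}

expired = snapshots_to_delete(snapshots, now)
print(expired)  # the four oldest snapshots, March 1 through March 4
```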
All attempts to restore a replicated snapshot from our repository failed with an unknown error, and there wasn’t much else to go on.
Snapshots in ES are incremental, meaning the higher the frequency of snapshots, the faster they complete and the smaller they are. The initial snapshot for our largest domain took over 1.5 hours and all subsequent daily snapshots took minutes!
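A toy model helps show why incremental snapshots shrink so quickly. The sketch below is a deliberate simplification, not how ES is implemented: it treats an index as a set of immutable segment files and assumes each snapshot uploads only the files the repository does not already hold. The segment names are invented.

```python
def incremental_uploads(daily_segments: list) -> list:
    """For each snapshot, list the segment files that still need uploading."""
    repository = set()  # files already stored by earlier snapshots
    uploads = []
    for segments in daily_segments:
        new_files = set(segments) - repository
        uploads.append(sorted(new_files))
        repository |= new_files
    return uploads


# Hypothetical index contents on three consecutive days.
days = [
    {"seg_a", "seg_b", "seg_c"},           # day 1: everything is new
    {"seg_a", "seg_b", "seg_c", "seg_d"},  # day 2: one new segment
    {"seg_b", "seg_c", "seg_d", "seg_e"},  # day 3: one new, one dropped
]

uploads = incremental_uploads(days)
print(uploads)  # [['seg_a', 'seg_b', 'seg_c'], ['seg_d'], ['seg_e']]
```

The flip side of this sharing is that later snapshots depend on files written by earlier ones, so losing a single object in the repository can affect more than one snapshot.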
This observation prompted us to try to protect the initial snapshot and prevent it from being deleted by using a name suffix (-initial) for the very first snapshot created after repository creation. This initial snapshot name is then excluded from the snapshot deletion process by Curator using a regex filter.
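As an illustration of the naming trick, the sketch below appends the suffix to the first snapshot in an empty repository and then excludes any matching name from the deletion candidates. In Curator this exclusion would be expressed with its regex pattern filter and `exclude` set to true; the plain `re` version here, including the `snapshot-YYYY-MM-DD` naming scheme, is an assumption made for the example.

```python
import re
from datetime import date

INITIAL_SUFFIX = "-initial"
# Matches any snapshot name that ends with the protected suffix.
PROTECTED = re.compile(r".*-initial$")


def snapshot_name(day: date, repository_is_empty: bool) -> str:
    """Name a snapshot by date, tagging the very first one in a repository."""
    base = f"snapshot-{day:%Y-%m-%d}"
    return base + INITIAL_SUFFIX if repository_is_empty else base


def deletable(names: list) -> list:
    """Drop protected snapshots from a list of deletion candidates."""
    return [name for name in names if not PROTECTED.match(name)]


first = snapshot_name(date(2022, 3, 1), repository_is_empty=True)
second = snapshot_name(date(2022, 3, 2), repository_is_empty=False)
candidates = deletable([first, second])
print(first)       # snapshot-2022-03-01-initial
print(candidates)  # ['snapshot-2022-03-02']
```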
We wiped the S3 buckets, snapshots and repositories and started over. After waiting a few weeks for snapshots to accumulate, the restore failed again with the same cryptic error. However, this time we noticed that the initial snapshot (the one we had protected) was also missing!
Since we had no cycles left to spend on the problem, we had to park it and work on the other cool and awesome stuff we have going on here at Rewind.
Backup and Restore-a-thon II
Before we knew it, the next quarter was upon us and it was time for another backup and restore-a-thon. We recognized that this was still a gap in our disaster recovery plan: we needed to be able to successfully restore the ES data in another region.
We decided to add extra logging to the Lambda function and review the execution logs daily. Days 1 through 6 worked fine: the restore worked, we could list all the snapshots, and the initial one was still there. On the 7th day, something strange happened: the call to list the available snapshots returned a “not found” error, for the initial snapshot only. What external force was deleting our snapshots?
We decided to take a closer look at the contents of the S3 bucket and saw that the objects were all named with UUIDs (universally unique identifiers), with a few objects that correlated back to snapshots, except for the initial snapshot, which was missing.
We noticed the “Show versions” toggle in the console and thought it odd that the bucket had versioning enabled. We switched the toggle on and immediately saw delete markers everywhere, including one on the initial snapshot that corrupted the entire snapshot set.
We very quickly discovered that the S3 bucket we were using had a lifecycle rule that deleted all objects older than 7 days.
The lifecycle rule exists so that unmanaged objects in the buckets are automatically deleted to keep costs low and the bucket tidy.
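A check like the following would have caught the problem much earlier. It is a minimal sketch that inspects a lifecycle configuration in the shape returned by S3's GetBucketLifecycleConfiguration call and lists the enabled expiration rules whose scope overlaps a given key prefix. The rule ID and the `es-snapshots/` prefix are hypothetical, and tag-based filters are ignored for brevity.

```python
def rule_prefix(rule: dict) -> str:
    """Return the key prefix a rule is scoped to ('' means the whole bucket)."""
    rule_filter = rule.get("Filter", {})
    if "Prefix" in rule_filter:
        return rule_filter["Prefix"]
    if "And" in rule_filter:
        return rule_filter["And"].get("Prefix", "")
    return rule.get("Prefix", "")  # older configurations use a top-level prefix


def expiring_rules(lifecycle: dict, key_prefix: str) -> list:
    """List IDs of enabled expiration rules that can touch keys under key_prefix."""
    hits = []
    for rule in lifecycle.get("Rules", []):
        if rule.get("Status") != "Enabled" or "Expiration" not in rule:
            continue
        scope = rule_prefix(rule)
        # The rule overlaps if either prefix contains the other.
        if key_prefix.startswith(scope) or scope.startswith(key_prefix):
            hits.append(rule.get("ID", "<no id>"))
    return hits


# Hypothetical configuration resembling a bucket-wide 7-day purge.
lifecycle = {
    "Rules": [{
        "ID": "purge-everything-after-7-days",
        "Status": "Enabled",
        "Filter": {"Prefix": ""},
        "Expiration": {"Days": 7},
    }]
}

risky = expiring_rules(lifecycle, "es-snapshots/")
print(risky)  # ['purge-everything-after-7-days']
```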
We restored the deleted item and voila, the snapshot listing worked fine. Most importantly, the recovery was a success.
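On a versioned bucket, an expired object is hidden behind a delete marker rather than destroyed, which is why the restore was possible at all. The sketch below picks out the current delete markers from a response shaped like S3's ListObjectVersions output; the keys and version IDs are invented, and the actual removal call is only shown as a comment because it needs real credentials.

```python
def latest_delete_markers(versions_response: dict, key_prefix: str = "") -> list:
    """Return Key/VersionId pairs for delete markers that currently hide an object."""
    return [
        {"Key": marker["Key"], "VersionId": marker["VersionId"]}
        for marker in versions_response.get("DeleteMarkers", [])
        if marker.get("IsLatest") and marker["Key"].startswith(key_prefix)
    ]


# Hypothetical response fragment; real responses carry more fields.
response = {
    "DeleteMarkers": [
        {"Key": "es-snapshots/snap-1.dat", "VersionId": "v2", "IsLatest": True},
        {"Key": "es-snapshots/snap-2.dat", "VersionId": "v5", "IsLatest": False},
        {"Key": "other/file.txt", "VersionId": "v9", "IsLatest": True},
    ]
}

to_remove = latest_delete_markers(response, key_prefix="es-snapshots/")
print(to_remove)  # [{'Key': 'es-snapshots/snap-1.dat', 'VersionId': 'v2'}]

# Deleting the marker's own version makes the previous version current again.
# With boto3 that would look like:
#   s3.delete_object(Bucket=bucket, Key=item["Key"], VersionId=item["VersionId"])
```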
The home stretch
In our case, Curator needs to manage the snapshot lifecycle, so all we had to do was prevent the lifecycle rule from removing anything in our snapshot repositories by using a scoped path filter on the rule.
We created a specific S3 prefix called “/auto-purge” that the rule was scoped to. Anything older than 7 days in /auto-purge would be purged and everything else in the bucket would be left alone.
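For illustration, a scoped rule of this kind could look like the configuration below, built as a plain dictionary in the shape S3 accepts for a lifecycle configuration. Our buckets are managed in Terraform, so this Python version is only a sketch; the rule ID and the exact prefix string (`auto-purge/`, written without a leading slash) are assumptions.

```python
def auto_purge_lifecycle(prefix: str = "auto-purge/", days: int = 7) -> dict:
    """Build a lifecycle configuration that only expires objects under prefix."""
    return {
        "Rules": [{
            "ID": f"auto-purge-after-{days}-days",
            "Status": "Enabled",
            "Filter": {"Prefix": prefix},  # everything outside this prefix is untouched
            "Expiration": {"Days": days},
        }]
    }


config = auto_purge_lifecycle()
print(config["Rules"][0]["Filter"])  # {'Prefix': 'auto-purge/'}

# With boto3 this dictionary would be passed as:
#   s3.put_bucket_lifecycle_configuration(Bucket=bucket, LifecycleConfiguration=config)
```

Because the rule no longer applies to the whole bucket, the snapshot repository keys sit outside its scope and only Curator decides when a snapshot is removed.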
We cleaned everything up again, waited more than 7 days, ran the restore again with the replicated snapshots, and it finally worked flawlessly. Backup and restore-a-thon finally complete!
Developing a disaster recovery plan is a tough mental exercise. Implementing and testing every part of it is even more difficult, but it’s an essential business practice that will ensure your business can weather any storm. Sure, a house fire is an unlikely occurrence, but if it does happen, you’ll be glad you practiced what to do before the smoke starts to billow.
Ensuring business continuity in the event of a vendor failure for the critical parts of your infrastructure presents new challenges, but also offers amazing opportunities to explore solutions like the one presented here. Hopefully our little adventure will help you avoid the pitfalls we encountered while building your own Elasticsearch disaster recovery plan.
Note: This article was written and contributed by Mandeep Khinda, DevOps Specialist at Rewind.