Elasticsearch Cluster Backup

This folder contains a Terraform module that takes snapshots of an Elasticsearch cluster and backs them up to an S3 bucket. The module is a scheduled Lambda function that calls the Elasticsearch API to perform the snapshot and backup tasks described in the Elasticsearch snapshot and restore documentation.

Terminology

Snapshot: A snapshot captures the current state of the indices in an Elasticsearch cluster. Snapshots are what get stored in a backup repository.
Repository: A repository is an Elasticsearch abstraction over a storage medium such as a shared file system, an S3 bucket, or HDFS. It identifies where snapshot files are stored but does not itself contain any snapshots.
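To make the repository concept concrete, here is a minimal sketch of the request that registers an S3-backed repository through the Elasticsearch snapshot API (`PUT /_snapshot/<repository>` with `"type": "s3"`). It only builds the request; it does not send it. The function name and the example repository and bucket names are hypothetical, and this is not necessarily the exact request this module sends. Only the `bucket` setting is included because other S3 repository settings vary between Elasticsearch versions.

```python
import json
from typing import Tuple


def register_s3_repository_request(repository: str, bucket: str) -> Tuple[str, str, str]:
    """Build (method, path, JSON body) for registering an S3 snapshot repository.

    Hypothetical helper for illustration; it does not contact any cluster.
    """
    body = {
        # "s3" repositories need S3 repository support (the repository-s3
        # plugin/module) installed on every node of the cluster.
        "type": "s3",
        "settings": {"bucket": bucket},
    }
    return "PUT", f"/_snapshot/{repository}", json.dumps(body)


method, path, body = register_s3_repository_request("es-backups", "my-es-snapshots")
print(method, path, body)
```

Registering the repository only records where snapshots will go; no snapshot data is written until a snapshot is actually taken.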

Taking Backups

Cluster snapshots are incremental. The first snapshot is always a full copy of the cluster, and each subsequent snapshot stores only the data that changed since the previous snapshot. Snapshot data is kept as files (such as .dat files) in the storage medium the repository points to, in this case S3.
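The incremental behaviour can be illustrated with a toy model, assuming (as Elasticsearch does at the file level) that index segment files are immutable: a snapshot only uploads files the repository does not already hold. This is a simplified sketch, not Elasticsearch's implementation; it ignores checksum comparison and the cleanup of files that are no longer referenced, and the file names are made up.

```python
from typing import Set


def take_snapshot(shard_files: Set[str], repository: Set[str]) -> Set[str]:
    """Upload only files missing from the repository; return what was uploaded."""
    # Segment files never change once written, so a file uploaded by an
    # earlier snapshot can be reused instead of copied again.
    new_files = shard_files - repository
    repository |= new_files  # record the uploads in the shared repository set
    return new_files


repo: Set[str] = set()
first = take_snapshot({"_0.cfs", "_1.cfs"}, repo)            # empty repository: full copy
second = take_snapshot({"_0.cfs", "_1.cfs", "_2.cfs"}, repo)  # only the new segment
print(sorted(first), sorted(second))
```

Each snapshot still describes the complete state of the indices; it simply points at files that earlier snapshots already uploaded.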

CPU and Memory Usage

Snapshots are usually run on a single node, which automatically coordinates with the other nodes to ensure the data is complete. Backing up a cluster with a large volume of data will cause high CPU and memory usage on the node performing the backup. This module sends backup requests to the cluster through the load balancer, which routes each request to one of the nodes. During a backup, if the selected node is unable to handle incoming requests, the load balancer directs requests to the other nodes.
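As a sketch of what a backup request sent through the load balancer might look like, the snippet below builds a create-snapshot call (`PUT /_snapshot/<repository>/<snapshot>`) using the `wait_for_completion=false` query parameter, so the API returns once the snapshot has started rather than holding the connection for the whole backup. The timestamp-based naming scheme and the function name are hypothetical; this is not necessarily how the module names its snapshots.

```python
from datetime import datetime, timezone
from typing import Tuple


def create_snapshot_request(repository: str, now: datetime) -> Tuple[str, str]:
    """Build (method, path) for starting a snapshot; hypothetical helper."""
    # Snapshot names must be lowercase and unique within a repository,
    # so a UTC timestamp is a convenient (assumed) naming scheme.
    snapshot = "snapshot-" + now.astimezone(timezone.utc).strftime("%Y-%m-%d-%H-%M-%S")
    # Return immediately instead of blocking until the backup finishes.
    return "PUT", f"/_snapshot/{repository}/{snapshot}?wait_for_completion=false"


print(create_snapshot_request("es-backups", datetime.now(timezone.utc)))
```

Whichever node the load balancer picks receives this request and coordinates the snapshot with the rest of the cluster.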

Frequency of Backups

How often you take backups depends entirely on the size of your deployment and the importance of your data. Larger clusters with high-volume usage typically need to be backed up more frequently than low-volume clusters, because more data changes between snapshots. A nightly schedule is a safe starting point; adjust it over time based on the demands of your cluster.
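The backup frequency is set through the schedule_expression input, which takes an AWS schedule expression. The sketch below builds the two common forms: a `rate()` expression (AWS requires the singular unit for a value of 1 and the plural otherwise) and a nightly `cron()` expression, whose six fields are evaluated in UTC. The helper names are hypothetical.

```python
VALID_UNITS = ("minute", "hour", "day")


def rate_expression(value: int, unit: str) -> str:
    """Build a rate() schedule expression, e.g. rate(1 day) or rate(12 hours)."""
    if unit not in VALID_UNITS:
        raise ValueError(f"unit must be one of {VALID_UNITS}")
    if value < 1:
        raise ValueError("value must be a positive integer")
    # Singular unit for a value of 1, plural for anything larger.
    return f"rate({value} {unit if value == 1 else unit + 's'})"


def nightly_cron(hour_utc: int) -> str:
    """Build a cron() expression that fires once a day at hour_utc:00 UTC."""
    if not 0 <= hour_utc <= 23:
        raise ValueError("hour_utc must be between 0 and 23")
    # Fields: minutes, hours, day-of-month, month, day-of-week, year.
    return f"cron(0 {hour_utc} * * ? *)"


print(rate_expression(1, "day"), nightly_cron(20))
```

Starting from a nightly cron and switching to a shorter rate() interval later only requires changing the schedule_expression value.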

Backup Notification

The time it takes to back up a cluster depends on the volume of data. Because the backup module is implemented as a Lambda function, and Lambda functions have a maximum execution time (originally 5 minutes; AWS has since raised the limit to 15 minutes), a separate notification Lambda is kicked off to check on the backup. A CloudWatch metric is incremented whenever the notification Lambda confirms that a backup occurred, and an alarm connected to that metric notifies you if the metric is not updated.
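One way a notification function could confirm that a backup occurred is to read the snapshot's state from `GET /_snapshot/<repository>/<snapshot>`, whose response lists snapshots with a `state` field (for example IN_PROGRESS, SUCCESS, PARTIAL or FAILED). The sketch below assumes that approach; the source does not say how the module's notification Lambda actually checks, and the function name and sample response are hypothetical.

```python
import json


def backup_succeeded(response_body: str, snapshot_name: str) -> bool:
    """Return True if a get-snapshot response reports SUCCESS for snapshot_name."""
    data = json.loads(response_body)
    for snapshot in data.get("snapshots", []):
        if snapshot.get("snapshot") == snapshot_name:
            # Only a completed, successful snapshot should bump the metric;
            # IN_PROGRESS, PARTIAL and FAILED all count as "not yet".
            return snapshot.get("state") == "SUCCESS"
    return False  # snapshot not found in the response


sample = '{"snapshots": [{"snapshot": "snapshot-2024-01-02", "state": "SUCCESS"}]}'
if backup_succeeded(sample, "snapshot-2024-01-02"):
    print("increment CloudWatch metric")  # placeholder for the real metric call
```

If the metric is never incremented within the alarm period, the CloudWatch alarm fires and notifies the configured SNS topics.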

Restoring Backups

Restoring snapshots is handled by the elasticsearch-cluster-restore module.

Reference

Inputs

Required

alarm_period (number, required)

How often, in seconds, the backup Lambda function is expected to run. Factor in the amount of time it takes to back up your cluster.

alarm_sns_topic_arns (list(string), required)

The ARNs of the SNS topics to notify if the CloudWatch alarm goes off because the backup job failed.

bucket (string, required)

The S3 bucket that the specified repository will be associated with and where all snapshots will be stored.

cloudwatch_metric_name (string, required)

The name of the CloudWatch metric that the backup Lambda function increments every time the job completes successfully.

cloudwatch_metric_namespace (string, required)

The namespace of the CloudWatch metric that the backup Lambda function increments every time the job completes successfully.

elasticsearch_dns (string, required)

The DNS name of the load balancer in front of the Elasticsearch cluster.

name (string, required)

The name of the Lambda function. Used to namespace all resources created by this module.

region (string, required)

The AWS region (e.g. us-east-1) where the backup S3 bucket exists.

repository (string, required)

The name of the repository that will be associated with the created snapshots.

schedule_expression (string, required)

An expression that defines the schedule for this Lambda job. For example, cron(0 20 * * ? *) runs it every day at 20:00 UTC, and rate(5 minutes) runs it every five minutes.
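Because the metric is only incremented after a backup completes, alarm_period has to cover one full schedule interval plus the time a backup takes. The helper below is a hypothetical sketch of that arithmetic; it also rounds up to a multiple of 60 seconds, since CloudWatch alarm periods of a minute or longer must be multiples of 60. The expected backup duration is an estimate you supply.

```python
import math


def suggested_alarm_period(schedule_interval_seconds: int, expected_backup_seconds: int) -> int:
    """Suggest an alarm_period: one schedule interval plus backup time, rounded up to a minute."""
    raw = schedule_interval_seconds + expected_backup_seconds
    # CloudWatch alarm periods of one minute or more must be multiples of 60.
    return math.ceil(raw / 60) * 60


# Nightly schedule (86400 s) with backups estimated at about 30 minutes.
print(suggested_alarm_period(86400, 1800))
```

Leaving some slack on top of the estimated backup time reduces false alarms when a backup runs longer than usual.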

Optional

elasticsearch_port (number, optional)

The port on which API requests will be made to the Elasticsearch cluster.

Default: 9200

lambda_runtime (string, optional)

The runtime to use for the Lambda function. Should be a Node.js runtime.

Default: "nodejs14.x"

protocol (string, optional)

Specifies the protocol to use when making requests to the Elasticsearch cluster. Possible values are http and https.

Default: "http"

run_in_vpc (bool, optional)

Set to true to give your Lambda function access to resources within a VPC.

Default: false

subnet_ids (list(string), optional)

A list of subnet IDs the Lambda function should be able to access within your VPC. Only used if run_in_vpc is true.

Default: []

vpc_id (string, optional)

The ID of the VPC the Lambda function should be able to access. Only used if run_in_vpc is true.

Default: null
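The protocol, elasticsearch_dns and elasticsearch_port inputs together determine where API requests are sent. A minimal sketch of how they might combine into a base URL, assuming the usual scheme://host:port form; the helper name and the example DNS name are hypothetical.

```python
def elasticsearch_base_url(dns: str, port: int = 9200, protocol: str = "http") -> str:
    """Combine the DNS name, port and protocol inputs into a base URL."""
    protocol = protocol.lower()  # accept "HTTP"/"HTTPS" as well as lowercase
    if protocol not in ("http", "https"):
        raise ValueError("protocol must be http or https")
    return f"{protocol}://{dns}:{port}"


# With the defaults, only the load balancer's DNS name is needed.
print(elasticsearch_base_url("es-lb.internal.example.com"))
```

API paths such as /_snapshot/<repository> are then appended to this base URL.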

Outputs

lambda_arn

lambda_name
