Elasticsearch Cluster Backup
This folder contains a Terraform module that takes snapshots of an Elasticsearch cluster and backs them up to an S3 bucket. The module is a scheduled Lambda function that calls the Elasticsearch API to perform the snapshot and backup tasks documented in the Elasticsearch documentation.
Terminology
- Snapshot: A snapshot represents the current state of the indices in an Elasticsearch cluster. This is the information stored in a backup repository.
- Repository: A repository is an Elasticsearch abstraction over a storage medium like a Shared File System, S3 Bucket, HDFS etc. It's used to identify where snapshot files are stored and doesn't contain any snapshots itself.
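To make the repository concept concrete, here is a minimal sketch of the kind of request that registers an S3-backed repository through the Elasticsearch snapshot API. It only builds the request and does not send it. The function name, the example host, and the repository and bucket names are hypothetical, and the sketch assumes the cluster has S3 repository support installed. It is not the module's actual implementation.

```python
import json


def build_repository_request(base_url: str, repository: str, bucket: str) -> dict:
    """Describe the HTTP request that registers an S3-backed snapshot repository.

    Nothing is sent over the network; the caller decides how to issue the request.
    """
    return {
        "method": "PUT",
        "url": f"{base_url}/_snapshot/{repository}",
        "headers": {"Content-Type": "application/json"},
        # The repository only records where snapshot files live (the bucket);
        # it does not contain any snapshot data itself.
        "body": json.dumps({"type": "s3", "settings": {"bucket": bucket}}),
    }


# Hypothetical values for illustration only.
request = build_repository_request(
    "http://es.example.internal:9200", "backup-repo", "my-es-backups"
)
print(request["method"], request["url"])
print(request["body"])
```

In terms of this module's inputs, the base URL would be assembled from protocol, elasticsearch_dns and elasticsearch_port, while repository and bucket map to the inputs of the same names.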
Taking Backups
Cluster snapshots are incremental. The first snapshot is always a full dump of the cluster and subsequent ones are a delta between the current state of the cluster and the previous snapshot. Snapshots are typically contained in .dat files stored in the storage medium (in this case S3) the repository points to.
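As an illustration of taking a snapshot, the sketch below builds a create-snapshot request against the Elasticsearch snapshot API with a timestamped snapshot name. The helper names and the naming scheme are assumptions made for this example, not the names the module uses. Because snapshots are incremental, issuing this request repeatedly against the same repository only stores the data that changed since the previous snapshot.

```python
from datetime import datetime, timezone


def snapshot_name(now: datetime) -> str:
    """Derive a unique snapshot name from a timestamp.

    Elasticsearch requires snapshot names to be lowercase, so the
    format string uses only lowercase letters, digits and hyphens.
    """
    return now.strftime("snapshot-%Y-%m-%d-%H-%M-%S")


def build_snapshot_request(base_url: str, repository: str, now: datetime) -> dict:
    """Describe the HTTP request that starts a snapshot in the given repository."""
    name = snapshot_name(now)
    return {
        "method": "PUT",
        # wait_for_completion=false returns as soon as the snapshot has started,
        # so the caller is not blocked while a large cluster is backed up.
        "url": f"{base_url}/_snapshot/{repository}/{name}?wait_for_completion=false",
    }


# Hypothetical host and repository name for illustration only.
when = datetime(2024, 1, 15, 2, 0, 0, tzinfo=timezone.utc)
print(build_snapshot_request("http://es.example.internal:9200", "backup-repo", when)["url"])
```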
CPU and Memory Usage
Snapshots are usually run on a single node, which automatically coordinates with the other nodes to ensure completeness of data. Backing up a cluster with a large volume of data will lead to high CPU and memory usage on the node performing the backup. This module makes backup requests to the cluster through the load balancer, which routes each request to one of the nodes. During a backup, if the selected node is unable to handle incoming requests, the load balancer will direct them to other nodes.
Frequency of Backups
How often you make backups depends entirely on the size of your deployment and the importance of your data. Larger clusters with high volume usage will typically need to be backed up more frequently than low volume clusters because of the amount of data change between snapshots. It's a safe bet to start off running backups on a nightly schedule and then continually tweak the schedule based on the demands of your cluster.
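The schedule itself is passed to the module as an AWS schedule expression (the schedule_expression input). The sketch below is a simplified sanity check for such expressions, written for illustration only. It covers the two forms AWS accepts, rate(...) and six-field cron(...), but it does not validate individual field values, and the function name is hypothetical. A nightly backup at 02:00 UTC, for example, could be written as cron(0 2 * * ? *).

```python
import re

# rate(<positive integer> <unit>), with a singular unit for 1 and plural otherwise.
RATE_PATTERN = re.compile(r"^rate\((\d+) (minute|minutes|hour|hours|day|days)\)$")
# cron(<minutes> <hours> <day-of-month> <month> <day-of-week> <year>)
CRON_PATTERN = re.compile(r"^cron\((.+)\)$")


def is_plausible_schedule(expression: str) -> bool:
    """Return True if the expression has the shape of an AWS schedule expression."""
    rate = RATE_PATTERN.match(expression)
    if rate:
        value, unit = int(rate.group(1)), rate.group(2)
        # "1 minute" and "5 minutes" are accepted; "1 minutes" and "5 minute" are not.
        return value > 0 and (unit.endswith("s") != (value == 1))

    cron = CRON_PATTERN.match(expression)
    if cron:
        fields = cron.group(1).split()
        if len(fields) != 6:
            return False
        day_of_month, day_of_week = fields[2], fields[4]
        # Simplified rule: exactly one of the two day fields must be "?".
        return (day_of_month == "?") != (day_of_week == "?")

    return False


for candidate in ["cron(0 2 * * ? *)", "rate(1 day)", "rate(1 days)", "cron(0 20 * ?)"]:
    print(candidate, is_plausible_schedule(candidate))
```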
Backup Notification
The time it takes to back up a cluster depends on the volume of data. Because the backup module is implemented as a Lambda function, and Lambda functions have a maximum execution time of 15 minutes, a backup may still be running when the function exits, so a separate notification Lambda is kicked off to confirm the result. A CloudWatch metric is incremented any time the notification Lambda confirms that a backup occurred, and an alarm connected to that metric notifies you whether or not it was updated.
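The notification step can be sketched roughly as follows: inspect the response of the Elasticsearch get-snapshot API and, only if the snapshot finished successfully, prepare the data point for the CloudWatch metric. The function names and sample values are hypothetical, and the real notification Lambda may work differently. The dictionary returned by metric_payload matches the keyword arguments accepted by the CloudWatch PutMetricData call (for example boto3's put_metric_data), but nothing is sent here.

```python
def backup_succeeded(status_response: dict) -> bool:
    """Return True if every snapshot in a get-snapshot response reports SUCCESS.

    Other states reported by Elasticsearch include IN_PROGRESS, PARTIAL and FAILED.
    An empty or missing snapshot list is treated as "not confirmed".
    """
    snapshots = status_response.get("snapshots", [])
    return bool(snapshots) and all(s.get("state") == "SUCCESS" for s in snapshots)


def metric_payload(namespace: str, metric_name: str) -> dict:
    """Build the arguments for incrementing the backup metric by one."""
    return {
        "Namespace": namespace,
        "MetricData": [{"MetricName": metric_name, "Value": 1, "Unit": "Count"}],
    }


# Hypothetical response and metric names for illustration only.
response = {"snapshots": [{"snapshot": "snapshot-2024-01-15-02-00-00", "state": "SUCCESS"}]}
if backup_succeeded(response):
    print(metric_payload("Custom/Elasticsearch", "backup-completed"))
```

Under this design, the alarm only needs to watch the metric: if no data point arrives within alarm_period, the backup is presumed to have failed or stalled.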
Restoring Backups
Restoring snapshots is handled by the elasticsearch-cluster-restore module.
Reference
- Inputs
- Outputs
Required
- alarm_period (number): How often, in seconds, the backup Lambda function is expected to run. You should factor in the amount of time it takes to back up your cluster.
- alarm_sns_topic_arns (list(string)): The ARNs of the SNS topics to notify if the CloudWatch alarm goes off because the backup job failed.
- bucket (string): The S3 bucket that the specified repository will be associated with and where all snapshots will be stored.
- cloudwatch_metric_name (string): The name for the CloudWatch Metric the AWS Lambda backup function will increment every time the job completes successfully.
The namespace for the CloudWatch Metric the AWS lambda backup function will increment every time the job completes successfully.
- elasticsearch_dns (string): The DNS name of the Load Balancer in front of the Elasticsearch cluster.
- name (string): The name of the Lambda function. Used to namespace all resources created by this module.
- region (string): The AWS region (e.g., us-east-1) where the backup S3 bucket exists.
- repository (string): The name of the repository that will be associated with the created snapshots.
- schedule_expression (string): An expression that defines the schedule for this Lambda job. For example, cron(0 20 * * ? *) or rate(5 minutes).
Optional
- elasticsearch_port (number, default: 9200): The port on which the API requests will be made to the Elasticsearch cluster.
- lambda_runtime (string, default: "nodejs14.x"): The runtime to use for the Lambda function. Should be a Node.js runtime.
- protocol (string, default: "http"): Specifies the protocol to use when making the request to the Elasticsearch cluster. Possible values are HTTP or HTTPS.
- run_in_vpc (bool, default: false): Set to true to give your Lambda function access to resources within a VPC.
- subnet_ids (list(string), default: []): A list of subnet IDs the Lambda function should be able to access within your VPC. Only used if run_in_vpc is true.
- vpc_id (string, default: null): The ID of the VPC the Lambda function should be able to access. Only used if run_in_vpc is true.