ZooKeeper Cluster
This folder contains a Terraform module for running a cluster of Apache ZooKeeper nodes. Under the hood, the cluster is powered by the server-group module, so it supports attaching ENIs and EBS Volumes, zero-downtime rolling deployment, and auto-recovery of failed nodes. This module assumes that you are deploying an AMI that has both ZooKeeper and Exhibitor installed.
Quick start
- See the root README for instructions on using Terraform modules.
- See the zookeeper-cluster example for sample usage.
- See vars.tf for all the variables you can set on this module.
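To give you a feel for how the pieces fit together, here is a minimal, hypothetical usage sketch. The source URL, version, and all values below are placeholders, so see the zookeeper-cluster example for real, working code:

module "zookeeper_cluster" {
  # Placeholder source path and version; use the real module URL and a pinned release.
  source = "git::git@github.com:<YOUR-ORG>/<ZOOKEEPER-REPO>.git//modules/zookeeper-cluster?ref=v0.0.1"

  cluster_name  = "zookeeper-stage"
  cluster_size  = 3                # must be odd; use 3, 5, or 7 in production
  ami_id        = "ami-abcd1234"   # an AMI with ZooKeeper and Exhibitor installed
  instance_type = "t2.micro"
  aws_region    = "us-east-1"
  vpc_id        = "vpc-abcd1234"
  subnet_ids    = ["subnet-aaaa", "subnet-bbbb", "subnet-cccc"]

  # A script that runs run-exhibitor on boot; see the User Data section below.
  user_data = file("${path.module}/user-data.sh")

  # The remaining required inputs (allowed_*_inbound_* rules, etc.) are omitted
  # for brevity; see vars.tf for the full list.
}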
Key considerations for using this module
Here are the key things to take into account when using this module:
- ZooKeeper AMI
- User Data
- Cluster size
- Health checks
- Rolling deployments
- Static IPs and ENIs
- Transaction logs and EBS Volumes
- Exhibitor
- Data backup
ZooKeeper AMI
You specify the AMI to run in the cluster using the ami_id input variable. We recommend creating a Packer template to define the AMI with the following modules installed:
- install-oracle-jdk: Install the Oracle JDK.
- install-supervisord: Install Supervisord as a process manager.
- install-zookeeper: Install ZooKeeper.
- install-exhibitor: Install Exhibitor.
- run-exhibitor: Start Exhibitor, which, in turn, will start ZooKeeper.
See the zookeeper-ami example for working sample code.
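For illustration, the provisioning step of such a Packer template (in HCL2 syntax) might look roughly like the sketch below. The source block is omitted, the repo URL and version tag are placeholders, and this assumes you install modules with the gruntwork-install helper; adapt it to however you actually install these modules:

# <REPO_URL> and <VERSION> are placeholders; the amazon-ebs source block is omitted.
build {
  sources = ["source.amazon-ebs.zookeeper"]

  provisioner "shell" {
    inline = [
      "gruntwork-install --module-name install-oracle-jdk --repo <REPO_URL> --tag <VERSION>",
      "gruntwork-install --module-name install-supervisord --repo <REPO_URL> --tag <VERSION>",
      "gruntwork-install --module-name install-zookeeper --repo <REPO_URL> --tag <VERSION>",
      "gruntwork-install --module-name install-exhibitor --repo <REPO_URL> --tag <VERSION>",
    ]
  }
}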
User Data
When your servers are booting, you need to tell them to start Exhibitor (which, in turn, will start ZooKeeper). The easiest way to do that is to specify a User Data script, via the user_data input variable, that runs the run-exhibitor script. See user-data.sh for an example.
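As a rough sketch of the wiring (the script body is illustrative; the install path shown is hypothetical, and the actual arguments to run-exhibitor are documented alongside the install-exhibitor module):

module "zookeeper_cluster" {
  # ... other inputs ...

  user_data = <<-EOF
    #!/usr/bin/env bash
    # Illustrative only: attach this node's ENI and EBS Volume (see the sections
    # below), then start Exhibitor, which in turn starts ZooKeeper. The install
    # path and arguments for run-exhibitor depend on your AMI; check the
    # install-exhibitor docs for the real invocation.
    /usr/local/bin/run-exhibitor
  EOF
}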
Cluster size
Although you can run ZooKeeper on just a single server, in production, we strongly recommend running multiple ZooKeeper servers in a cluster (called an ensemble) so that:
- ZooKeeper replicates your data to all servers in the ensemble, so if one server dies, you don't lose any data, and the other servers can continue serving requests.
- Since the data is replicated across all the servers, any of the ZooKeeper nodes can respond to a read request, so you can scale to more read traffic by increasing the size of the cluster.
Note that ZooKeeper achieves consensus by using a majority vote, which has three implications:
- Your cluster must have an odd number of servers to make it possible to achieve a majority.
- A ZooKeeper cluster can continue to operate as long as a majority of the servers are operational. That means a cluster with n nodes can tolerate (n - 1) / 2 failed servers. So a 1-node cluster cannot tolerate any failed servers, a 3-node cluster can tolerate 1 failed server, a 5-node cluster can tolerate 2 failed servers, and a 7-node cluster can tolerate 3 failed servers.
- Larger clusters actually make writes slower, since you have to wait on more servers to respond to the vote. Most use cases are much more read-heavy than write-heavy, so this is typically a good trade-off. In practice, because writes get more expensive as the cluster grows, it's unusual to see a ZooKeeper cluster with more than 7 servers.
Putting all of this together, we recommend that in production, you always use a 3, 5, or 7 node cluster depending on your availability and scalability requirements.
Health checks
We strongly recommend associating an Elastic Load Balancer (ELB) with your ZooKeeper cluster and configuring it to perform TCP health checks on the ZooKeeper client port (2181 by default). The zookeeper-cluster module allows you to associate an ELB with ZooKeeper, using the ELB's health checks to perform zero-downtime deployments (i.e., ensuring the previous node is passing health checks before deploying the next one) and to detect when a server is down and needs to be automatically replaced.
Note that we do NOT recommend connecting to the ZooKeeper cluster via the ELB. That's because you access an ELB via its domain name, and most ZooKeeper clients (including Kafka) cache DNS entries forever. So if the underlying IPs stored in DNS for the ELB change (which could happen at any time!), the ZooKeeper clients won't find out about it until after a restart. You should always connect directly to the ZooKeeper nodes themselves via their static IP addresses.
Check out the zookeeper-cluster example for working sample code that includes an ELB.
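As a sketch of what that can look like (resource names and thresholds below are placeholders, and the ELB is wired to the module via the documented elb_names input):

resource "aws_elb" "zookeeper" {
  name     = "zookeeper-stage"   # placeholder name
  internal = true
  subnets  = var.subnet_ids

  listener {
    lb_port           = 2181
    lb_protocol       = "tcp"
    instance_port     = 2181
    instance_protocol = "tcp"
  }

  health_check {
    target              = "TCP:2181"   # TCP health check on the ZooKeeper client port
    interval            = 30
    timeout             = 5
    healthy_threshold   = 2
    unhealthy_threshold = 2
  }
}

module "zookeeper_cluster" {
  # ... other inputs ...
  elb_names = [aws_elb.zookeeper.name]
}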
Rolling deployments
To deploy updates to a ZooKeeper cluster, such as rolling out a new version of the AMI, you need to do the following:
- Shut down ZooKeeper on one server.
- Deploy the new code on the same server.
- Wait for the new code to come up successfully and start passing health checks.
- Repeat the process with the remaining servers.
This module can carry out this process for you automatically by using the server-group module's support for zero-downtime rolling deployment.
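You control the pace of the rollout with the documented deployment_batch_size and script_log_level inputs; for example:

module "zookeeper_cluster" {
  # ... other inputs ...

  # Replace one server at a time (the default). Strongly recommended for
  # ZooKeeper in production so a quorum of nodes is always up during a deploy.
  deployment_batch_size = 1

  # Set to "DEBUG" when troubleshooting the rolling deploy script.
  script_log_level = "INFO"
}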
Static IPs and ENIs
To connect to ZooKeeper, either from other ZooKeeper servers or from ZooKeeper clients such as Kafka, you need to provide the list of IP addresses for your ZooKeeper servers. Most ZooKeeper clients read this list of IPs during boot and never update it afterward. That means you need a static list of IP addresses for your ZooKeeper nodes.
This is a problem in a dynamic cloud environment, where any of the ZooKeeper nodes could be replaced (either due to an outage or deployment) with a different server, with a different IP address, at any time. Using DNS doesn't help, as most ZooKeeper clients (including Kafka!) cache DNS results forever, so if the underlying IPs stored in the DNS record change, those clients won't find out about it until they are restarted.
Our solution is to use Elastic Network Interfaces (ENIs). An ENI is a network interface with a static IP address that you can attach to any server. This module creates an ENI for each ZooKeeper server and gives each (server, ENI) pair a matching eni-0 tag. You can use the attach-eni script in the User Data of each server to find an ENI with a matching eni-0 tag and attach it to the server during boot. That way, if a server goes down and is replaced, its replacement reattaches the same ENI and gets the same IP address.
See user-data.sh for an example.
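To make the mechanics concrete, here is a hypothetical sketch of roughly what that boot-time step amounts to, inlined as User Data and using the raw AWS CLI instead of the attach-eni script. Tag lookup details and the device index are illustrative, and a configured AWS region plus an instance profile with EC2 permissions are assumed:

module "zookeeper_cluster" {
  # ... other inputs ...

  user_data = <<-EOF
    #!/usr/bin/env bash
    # Illustrative sketch: look up this instance's ID, read its eni-0 tag, find
    # the ENI carrying the matching tag value, and attach it to this instance.
    INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
    TAG_VALUE=$(aws ec2 describe-tags \
      --filters "Name=resource-id,Values=$INSTANCE_ID" "Name=key,Values=eni-0" \
      --query 'Tags[0].Value' --output text)
    ENI_ID=$(aws ec2 describe-network-interfaces \
      --filters "Name=tag:eni-0,Values=$TAG_VALUE" \
      --query 'NetworkInterfaces[0].NetworkInterfaceId' --output text)
    aws ec2 attach-network-interface \
      --network-interface-id "$ENI_ID" \
      --instance-id "$INSTANCE_ID" \
      --device-index 1
  EOF
}

In practice, prefer the attach-eni script, which handles the details this sketch glosses over.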
Transaction logs and EBS Volumes
Every write to a ZooKeeper server is immediately persisted to disk for durability in ZooKeeper's transaction log. We recommend using a separate EBS Volume to store these transaction logs. This ensures the hard drive used for transaction logs does not have to contend with any other disk operations, which can significantly improve ZooKeeper performance.
This module creates an EBS Volume for each ZooKeeper server and gives each (server, EBS Volume) pair a matching ebs-volume-0 tag. You can use the persistent-ebs-volume module in the User Data of each server to find an EBS Volume with a matching ebs-volume-0 tag and attach it to the server during boot. That way, if a server goes down and is replaced, its replacement reattaches the same EBS Volume.
See user-data.sh for an example.
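For example, a dedicated transaction log volume can be requested via the documented ebs_volumes input (the size and type here are illustrative):

module "zookeeper_cluster" {
  # ... other inputs ...

  # One dedicated EBS Volume per server for the ZooKeeper transaction log, so
  # log writes don't contend with other disk I/O.
  ebs_volumes = [
    {
      type      = "gp2"
      size      = 50
      encrypted = true
    }
  ]
}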
Exhibitor
This module assumes that you are running an AMI with Exhibitor installed. Exhibitor performs several functions, including acting as a process supervisor for ZooKeeper and cleaning up old transaction logs. Exhibitor also exposes a UI you can use to see what's stored in your ZooKeeper cluster and to manage the cluster. By default, this UI is available on port 8080 of every ZooKeeper server. We also expose Exhibitor at port 80 via the ELB used for health checks in the zookeeper-cluster example.
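If you want to mirror that port-80 setup, one way (assuming the aws_elb resource sketched in the Health checks section) is an extra listener that forwards to Exhibitor:

resource "aws_elb" "zookeeper" {
  # ... listener and health_check blocks from the Health checks section ...

  # Forward port 80 on the ELB to the Exhibitor UI on each node.
  listener {
    lb_port           = 80
    lb_protocol       = "http"
    instance_port     = 8080   # the default value of exhibitor_port
    instance_protocol = "http"
  }
}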
Data backup
ZooKeeper's primary mechanism for backing up data is the replication within the cluster itself, since every node has a copy of all the data. It is rare to back up data beyond that, as the type of data typically stored in ZooKeeper is ephemeral in nature (e.g., which node is the current leader of a cluster), and it's unusual for older data to be of any use.
That said, if you need additional backups, you can create them from the Exhibitor UI, which offers Backup/Restore functionality that allows you to index the ZooKeeper transaction log and back up and restore specific transactions.
Reference
- Inputs
- Outputs
Required
allowed_client_port_inbound_cidr_blocks
list(string): A list of CIDR-formatted IP address ranges that will be allowed to connect to client_port.
allowed_client_port_inbound_security_group_ids
list(string): A list of security group IDs that will be allowed to connect to client_port.
allowed_exhibitor_port_inbound_cidr_blocks
list(string): A list of CIDR-formatted IP address ranges that will be allowed to connect to exhibitor_port.
allowed_exhibitor_port_inbound_security_group_ids
list(string): A list of security group IDs that will be allowed to connect to exhibitor_port.
allowed_health_check_port_inbound_cidr_blocks
list(string): A list of CIDR-formatted IP address ranges that will be allowed to connect to health_check_port.
allowed_health_check_port_inbound_security_group_ids
list(string): A list of security group IDs that will be allowed to connect to health_check_port.
ami_id
string: The ID of the AMI to run in this cluster. Should be an AMI that has ZooKeeper and Exhibitor installed by the install-zookeeper and install-exhibitor modules.
aws_region
string: The AWS region to deploy into.
cluster_name
string: The name of the ZooKeeper cluster (e.g. zookeeper-stage). This variable is used to namespace all resources created by this module.
cluster_size
number: The number of nodes to have in the cluster. This MUST be set to an odd number! We strongly recommend setting this to 3, 5, or 7 for production usage.
instance_type
string: The type of EC2 Instances to run for each node in the cluster (e.g. t2.micro).
The number of security group IDs in allowed_client_port_inbound_security_group_ids. We should be able to compute this automatically, but due to a Terraform limitation, we can't: https://github.com/hashicorp/terraform/issues/14677#issuecomment-302772685
The number of security group IDs in allowed_exhibitor_port_inbound_security_group_ids. We should be able to compute this automatically, but due to a Terraform limitation, we can't: https://github.com/hashicorp/terraform/issues/14677#issuecomment-302772685
The number of security group IDs in allowed_health_check_port_inbound_security_group_ids. We should be able to compute this automatically, but due to a Terraform limitation, we can't: https://github.com/hashicorp/terraform/issues/14677#issuecomment-302772685
The name of the S3 bucket to create and use for storing the shared config for the ZooKeeper cluster.
subnet_ids
list(string): The subnet IDs into which the EC2 Instances should be deployed. You should typically pass in one subnet ID for each node in the cluster (i.e., cluster_size subnets). We strongly recommend running ZooKeeper in private subnets.
user_data
string: A User Data script to execute while the server is booting. We recommend passing in a bash script that executes the run-exhibitor script, which should have been installed in the AMI by the install-exhibitor module.
vpc_id
string: The ID of the VPC in which to deploy the cluster.
Optional
allowed_ssh_cidr_blocks
list(string): A list of CIDR-formatted IP address ranges from which the EC2 Instances will allow SSH connections.
Default: []
allowed_ssh_security_group_ids
list(string): A list of security group IDs from which the EC2 Instances will allow SSH connections.
Default: []
If set to true, associate a public IP address with each EC2 Instance in the cluster. We strongly recommend against making ZooKeeper nodes publicly accessible.
Default: false
client_port
number: The port clients use to connect to ZooKeeper.
Default: 2181
connect_port
number: The port ZooKeeper nodes use to connect to other ZooKeeper nodes.
Default: 2888
custom_tags
map(string): Custom tags to apply to the ZooKeeper nodes and all related resources (i.e., security groups, EBS Volumes, ENIs).
Default: {}
deployment_batch_size
number: How many servers to deploy at a time during a rolling deployment. For example, if you have 10 servers and set this variable to 2, then the deployment will a) undeploy 2 servers, b) deploy 2 replacement servers, c) repeat the process for the next 2 servers. For ZooKeeper in production, we STRONGLY recommend keeping this at 1.
Default: 1
The maximum number of times to retry the Load Balancer's Health Check before giving up on the rolling deployment. After this number is hit, the Python script will cease checking the failed EC2 Instance deployment but continue with other EC2 Instance deployments.
Default: 360
dns_name_common_portion
string: The common portion of the DNS name to assign to each ENI in the zookeeper server group. For example, if set to confluent.acme.com, this module will create DNS records 0.confluent.acme.com, 1.confluent.acme.com, etc. Note that this value must be a valid record name for the Route 53 Hosted Zone ID specified in route53_hosted_zone_id.
Default: ""
dns_names
list(string): A list of DNS names to assign to the ENIs in the zookeeper server group. Make sure the list has n entries, where n = cluster_size. If this var is specified, it will override dns_name_common_portion. Example: [0.acme.com, 1.acme.com, 2.acme.com]. Note that the list entries must be valid records for the Route 53 Hosted Zone ID specified in route53_hosted_zone_id.
Default: []
The TTL (Time to Live) to apply to any DNS records created by this module.
Default: 300
ebs_volumes
list(object({ type = string, size = number, encrypted = bool })): A list that defines the EBS Volumes to create for each server. Each item in the list should be a map that contains the keys 'type' (one of standard, gp2, or io1), 'size' (in GB), and 'encrypted' (true or false). We recommend attaching an EBS Volume to ZooKeeper to use for transaction logs.
Default:
[
  {
    encrypted = true,
    size      = 50,
    type      = "gp2"
  },
  {
    encrypted = true,
    size      = 50,
    type      = "gp2"
  }
]
elb_names
list(string): A list of Elastic Load Balancer (ELB) names to associate with the ZooKeeper nodes. We recommend using an ALB or ELB for health checks. If you're using an Application Load Balancer (ALB), use target_group_arns instead.
Default: []
elections_port
number: The port ZooKeeper nodes use to connect to other ZooKeeper nodes during leader elections.
Default: 3888
Enable detailed CloudWatch monitoring for the servers. This gives you more granularity with your CloudWatch metrics, but also costs more money.
Default: false
If true, create an Elastic IP Address for each ENI and associate it with the ENI.
Default: false
enabled_metrics
list(string): A list of metrics the ASG should enable for monitoring all instances in a group. The allowed values are GroupMinSize, GroupMaxSize, GroupDesiredCapacity, GroupInServiceInstances, GroupPendingInstances, GroupStandbyInstances, GroupTerminatingInstances, GroupTotalInstances.
Default: []
Example
enabled_metrics = [
"GroupDesiredCapacity",
"GroupInServiceInstances",
"GroupMaxSize",
"GroupMinSize",
"GroupPendingInstances",
"GroupStandbyInstances",
"GroupTerminatingInstances",
"GroupTotalInstances"
]
exhibitor_port
number: The port Exhibitor uses for its Control Panel UI.
Default: 8080
If you set this to true, when you run terraform destroy, this tells Terraform to delete all the objects in the S3 bucket used for shared config storage. You should NOT set this to true in production! This property is only here so automated tests can clean up after themselves.
Default: false
Time, in seconds, after an instance comes into service before checking health. For CentOS servers, we have observed slower overall boot times and recommend a value of 600.
Default: 300
health_check_port
number: The port the Health Check uses for incoming requests.
Default: 5000
health_check_type
string: Controls how health checking is done. Must be one of EC2 or ELB.
Default: "EC2"
Whether the root volume should be encrypted.
Default: true
Whether the root volume should be destroyed on instance termination.
Default: true
If true, the launched EC2 instance will be EBS-optimized.
Default: false
root_volume_size
number: The size, in GB, of the root EBS volume.
Default: 50
root_volume_type
string: The type of volume. Must be one of: standard, gp2, or io1.
Default: "standard"
route53_hosted_zone_id
string: The ID of the Route53 Hosted Zone in which we will create the DNS records specified by dns_names. Must be non-empty if dns_name_common_portion or dns_names is non-empty.
Default: ""
script_log_level
string: The log level to use with the rolling deploy script. It can be useful to set this to DEBUG when troubleshooting the script.
Default: "INFO"
The Amazon Resource Name (ARN) of the KMS Key that will be used to encrypt/decrypt shared config files in the S3 bucket. The default value of null will use the managed aws/s3 key.
Default: null
If set to true, skip the health check, and start a rolling deployment without waiting for the server group to be in a healthy state. This is primarily useful if the server group is in a broken state and you want to force a deployment anyway.
Default: false
If set to true, skip the rolling deployment, and destroy all the servers immediately. You should typically NOT enable this in prod, as it will cause downtime! The main use case for this flag is to make testing and cleanup easier. It can also be handy in case the rolling deployment code has a bug.
Default: false
ssh_key_name
string: The name of an EC2 Key Pair that can be used to SSH to the EC2 Instances in this cluster. Set to an empty string to not associate a Key Pair.
Default: null
ssh_port
number: The port used for SSH connections.
Default: 22
target_group_arns
list(string): A list of target group ARNs of Application Load Balancer (ALB) targets to associate with the ZooKeeper nodes. We recommend using an ALB or ELB for health checks. If you're using an Elastic Load Balancer (AKA ELB Classic), use elb_names instead.
Default: []
tenancy
string: The tenancy of the instance. Must be one of: default or dedicated.
Default: "default"
wait_for
string: By passing a value to this variable, you can effectively tell this module to wait to deploy until the given variable's value is resolved, which is a way to make this module depend on some other module. Note that the actual value of this variable doesn't matter.
Default: ""
The maximum duration that Terraform should wait for ASG instances to be healthy before timing out. Setting this to '0' causes Terraform to skip all Capacity Waiting behavior.
Default: "10m"
Outputs
Other modules can depend on this output to ensure those modules only deploy after this module is done deploying. Note that the actual value of this output can be ignored.