ZooKeeper Cluster
This folder contains a Terraform module for running a cluster of Apache ZooKeeper nodes. Under the hood, the cluster is powered by the server-group module, so it supports attaching ENIs and EBS Volumes, zero-downtime rolling deployment, and auto-recovery of failed nodes. This module assumes that you are deploying an AMI that has both ZooKeeper and Exhibitor installed.
Quick start
- See the root README for instructions on using Terraform modules.
- See the zookeeper-cluster example for sample usage.
- See vars.tf for all the variables you can set on this module.
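To give you a feel for how the pieces fit together, here is a minimal, hypothetical usage sketch. The source URL, version, and all values below are placeholders, so see the zookeeper-cluster example for real, working code:

module "zookeeper_cluster" {
  # Placeholder source path and version; use the real module URL and a pinned release.
  source = "git::git@github.com:<YOUR-ORG>/<ZOOKEEPER-REPO>.git//modules/zookeeper-cluster?ref=v0.0.1"

  cluster_name  = "zookeeper-stage"
  cluster_size  = 3                # must be odd; use 3, 5, or 7 in production
  ami_id        = "ami-abcd1234"   # an AMI with ZooKeeper and Exhibitor installed
  instance_type = "t2.micro"
  aws_region    = "us-east-1"
  vpc_id        = "vpc-abcd1234"
  subnet_ids    = ["subnet-aaaa", "subnet-bbbb", "subnet-cccc"]

  # A script that runs run-exhibitor on boot; see the User Data section below.
  user_data = file("${path.module}/user-data.sh")

  # The remaining required inputs (allowed_*_inbound_* rules, etc.) are omitted
  # for brevity; see vars.tf for the full list.
}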
Key considerations for using this module
Here are the key things to take into account when using this module:
- ZooKeeper AMI
- User Data
- Cluster size
- Health checks
- Rolling deployments
- Static IPs and ENIs
- Transaction logs and EBS Volumes
- Exhibitor
- Data backup
ZooKeeper AMI
You specify the AMI to run in the cluster using the ami_id input variable. We recommend creating a Packer template to define the AMI with the following modules installed:
- install-oracle-jdk: Install the Oracle JDK.
- install-supervisord: Install Supervisord as a process manager.
- install-zookeeper: Install ZooKeeper.
- install-exhibitor: Install Exhibitor.
- run-exhibitor: Start Exhibitor, which, in turn, will start ZooKeeper.
See the zookeeper-ami example for working sample code.
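For illustration, the provisioning step of such a Packer template (in HCL2 syntax) might look roughly like the sketch below. The source block is omitted, the repo URL and version tag are placeholders, and this assumes you install modules with the gruntwork-install helper; adapt it to however you actually install these modules:

# <REPO_URL> and <VERSION> are placeholders; the amazon-ebs source block is omitted.
build {
  sources = ["source.amazon-ebs.zookeeper"]

  provisioner "shell" {
    inline = [
      "gruntwork-install --module-name install-oracle-jdk --repo <REPO_URL> --tag <VERSION>",
      "gruntwork-install --module-name install-supervisord --repo <REPO_URL> --tag <VERSION>",
      "gruntwork-install --module-name install-zookeeper --repo <REPO_URL> --tag <VERSION>",
      "gruntwork-install --module-name install-exhibitor --repo <REPO_URL> --tag <VERSION>",
    ]
  }
}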
User Data
When your servers are booting, you need to tell them to start Exhibitor (which, in turn, will start ZooKeeper). The easiest way to do that is to specify a User Data script, via the user_data input variable, that runs the run-exhibitor script. See user-data.sh for an example.
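As a rough sketch of the wiring (the script body is illustrative; the install path shown is hypothetical, and the actual arguments to run-exhibitor are documented alongside the install-exhibitor module):

module "zookeeper_cluster" {
  # ... other inputs ...

  user_data = <<-EOF
    #!/usr/bin/env bash
    # Illustrative only: attach this node's ENI and EBS Volume (see the sections
    # below), then start Exhibitor, which in turn starts ZooKeeper. The install
    # path and arguments for run-exhibitor depend on your AMI; check the
    # install-exhibitor docs for the real invocation.
    /usr/local/bin/run-exhibitor
  EOF
}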
Cluster size
Although you can run ZooKeeper on just a single server, in production, we strongly recommend running multiple ZooKeeper servers in a cluster (called an ensemble) so that:
- ZooKeeper replicates your data to all servers in the ensemble, so if one server dies, you don't lose any data, and the other servers can continue serving requests.
- Since the data is replicated across all the servers, any of the ZooKeeper nodes can respond to a read request, so you can scale to more read traffic by increasing the size of the cluster.
Note that ZooKeeper achieves consensus by using a majority vote, which has three implications:
- Your cluster must have an odd number of servers to make it possible to achieve a majority.
- A ZooKeeper cluster can continue to operate as long as a majority of the servers are operational. That means a cluster with n nodes can tolerate (n - 1) / 2 failed servers. So a 1-node cluster cannot tolerate any failed servers, a 3-node cluster can tolerate 1 failed server, a 5-node cluster can tolerate 2 failed servers, and a 7-node cluster can tolerate 3 failed servers.
- Larger clusters actually make writes slower, since you have to wait on more servers to respond to the vote. Most use cases are much more read-heavy than write-heavy, so this is typically a good trade-off. In practice, because writes get more expensive as the cluster grows, it's unusual to see a ZooKeeper cluster with more than 7 servers.
Putting all of this together, we recommend that in production, you always use a 3, 5, or 7 node cluster depending on your availability and scalability requirements.
Health checks
We strongly recommend associating an Elastic Load Balancer (ELB) with your ZooKeeper cluster and configuring it to perform TCP health checks on the ZooKeeper client port (2181 by default). The zookeeper-cluster module allows you to associate an ELB with ZooKeeper, using the ELB's health checks to perform zero-downtime deployments (i.e., ensuring the previous node is passing health checks before deploying the next one) and to detect when a server is down and needs to be automatically replaced.
Note that we do NOT recommend connecting to the ZooKeeper cluster via the ELB. That's because you access an ELB via its domain name, and most ZooKeeper clients (including Kafka) cache DNS entries forever. So if the underlying IPs stored in DNS for the ELB change (which could happen at any time!), the ZooKeeper clients won't find out about it until after a restart. You should always connect directly to the ZooKeeper nodes themselves via their static IP addresses.
Check out the zookeeper-cluster example for working sample code that includes an ELB.
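As a sketch of what that can look like (resource names and thresholds below are placeholders, and the ELB is wired to the module via the documented elb_names input):

resource "aws_elb" "zookeeper" {
  name     = "zookeeper-stage"   # placeholder name
  internal = true
  subnets  = var.subnet_ids

  listener {
    lb_port           = 2181
    lb_protocol       = "tcp"
    instance_port     = 2181
    instance_protocol = "tcp"
  }

  health_check {
    target              = "TCP:2181"   # TCP health check on the ZooKeeper client port
    interval            = 30
    timeout             = 5
    healthy_threshold   = 2
    unhealthy_threshold = 2
  }
}

module "zookeeper_cluster" {
  # ... other inputs ...
  elb_names = [aws_elb.zookeeper.name]
}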
Rolling deployments
To deploy updates to a ZooKeeper cluster, such as rolling out a new version of the AMI, you need to do the following:
- Shut down ZooKeeper on one server.
- Deploy the new code on the same server.
- Wait for the new code to come up successfully and start passing health checks.
- Repeat the process with the remaining servers.
This module can carry out this process for you automatically by using the server-group module's support for zero-downtime rolling deployment.
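You control the pace of the rollout with the documented deployment_batch_size and script_log_level inputs; for example:

module "zookeeper_cluster" {
  # ... other inputs ...

  # Replace one server at a time (the default). Strongly recommended for
  # ZooKeeper in production so a quorum of nodes is always up during a deploy.
  deployment_batch_size = 1

  # Set to "DEBUG" when troubleshooting the rolling deploy script.
  script_log_level = "INFO"
}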
Static IPs and ENIs
To connect to ZooKeeper, either from other ZooKeeper servers or from ZooKeeper clients such as Kafka, you need to provide the list of IP addresses for your ZooKeeper servers. Most ZooKeeper clients read this list of IPs during boot and never update it afterward. That means you need a static list of IP addresses for your ZooKeeper nodes.
This is a problem in a dynamic cloud environment, where any of the ZooKeeper nodes could be replaced (either due to an outage or deployment) with a different server, with a different IP address, at any time. Using DNS doesn't help, as most ZooKeeper clients (including Kafka!) cache DNS results forever, so if the underlying IPs stored in the DNS record change, those clients won't find out about it until they are restarted.
Our solution is to use Elastic Network Interfaces (ENIs). An ENI is a network interface with a static IP address that you can attach to any server. This module creates an ENI for each ZooKeeper server and gives each (server, ENI) pair a matching eni-0 tag. You can use the attach-eni script in the User Data of each server to find an ENI with a matching eni-0 tag and attach it to the server during boot. That way, if a server goes down and is replaced, its replacement reattaches the same ENI and gets the same IP address.
See user-data.sh for an example.
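To make the mechanics concrete, here is a hypothetical sketch of roughly what that boot-time step amounts to, inlined as User Data and using the raw AWS CLI instead of the attach-eni script. Tag lookup details and the device index are illustrative, and a configured AWS region plus an instance profile with EC2 permissions are assumed:

module "zookeeper_cluster" {
  # ... other inputs ...

  user_data = <<-EOF
    #!/usr/bin/env bash
    # Illustrative sketch: look up this instance's ID, read its eni-0 tag, find
    # the ENI carrying the matching tag value, and attach it to this instance.
    INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
    TAG_VALUE=$(aws ec2 describe-tags \
      --filters "Name=resource-id,Values=$INSTANCE_ID" "Name=key,Values=eni-0" \
      --query 'Tags[0].Value' --output text)
    ENI_ID=$(aws ec2 describe-network-interfaces \
      --filters "Name=tag:eni-0,Values=$TAG_VALUE" \
      --query 'NetworkInterfaces[0].NetworkInterfaceId' --output text)
    aws ec2 attach-network-interface \
      --network-interface-id "$ENI_ID" \
      --instance-id "$INSTANCE_ID" \
      --device-index 1
  EOF
}

In practice, prefer the attach-eni script, which handles the details this sketch glosses over.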
Transaction logs and EBS Volumes
Every write to a ZooKeeper server is immediately persisted to disk for durability in ZooKeeper's transaction log. We recommend using a separate EBS Volume to store these transaction logs. This ensures the hard drive used for transaction logs does not have to contend with any other disk operations, which can significantly improve ZooKeeper performance.
This module creates an EBS Volume for each ZooKeeper server and gives each (server, EBS Volume) pair a matching ebs-volume-0 tag. You can use the persistent-ebs-volume module in the User Data of each server to find an EBS Volume with a matching ebs-volume-0 tag and attach it to the server during boot. That way, if a server goes down and is replaced, its replacement reattaches the same EBS Volume.
See user-data.sh for an example.
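For example, a dedicated transaction log volume can be requested via the documented ebs_volumes input (the size and type here are illustrative):

module "zookeeper_cluster" {
  # ... other inputs ...

  # One dedicated EBS Volume per server for the ZooKeeper transaction log, so
  # log writes don't contend with other disk I/O.
  ebs_volumes = [
    {
      type      = "gp2"
      size      = 50
      encrypted = true
    }
  ]
}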
Exhibitor
This module assumes that you are running an AMI with Exhibitor installed. Exhibitor performs several functions, including acting as a process supervisor for ZooKeeper and cleaning up old transaction logs. Exhibitor also exposes a UI you can use to see what's stored in your ZooKeeper cluster and to manage the cluster. By default, this UI is available on port 8080 of every ZooKeeper server. We also expose Exhibitor at port 80 via the ELB used for health checks in the zookeeper-cluster example.
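If you want to mirror that port-80 setup, one way (assuming the aws_elb resource sketched in the Health checks section) is an extra listener that forwards to Exhibitor:

resource "aws_elb" "zookeeper" {
  # ... listener and health_check blocks from the Health checks section ...

  # Forward port 80 on the ELB to the Exhibitor UI on each node.
  listener {
    lb_port           = 80
    lb_protocol       = "http"
    instance_port     = 8080   # the default value of exhibitor_port
    instance_protocol = "http"
  }
}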
Data backup
ZooKeeper's primary mechanism for backing up data is the replication within the cluster itself, since every node has a copy of all the data. It is rare to back up data beyond that, as the type of data typically stored in ZooKeeper is ephemeral in nature (e.g., which node is the current leader of a cluster), and it's unusual for older data to be of any use.
That said, if you need additional backups, you can create them from the Exhibitor UI, which offers Backup/Restore functionality that allows you to index the ZooKeeper transaction log and back up and restore specific transactions.
Reference
- Inputs
- Outputs
Required
allowed_client_port_inbound_cidr_blocks
list(string): A list of CIDR-formatted IP address ranges that will be allowed to connect to client_port.
allowed_client_port_inbound_security_group_ids
list(string): A list of security group IDs that will be allowed to connect to client_port.
allowed_exhibitor_port_inbound_cidr_blocks
list(string): A list of CIDR-formatted IP address ranges that will be allowed to connect to exhibitor_port.
allowed_exhibitor_port_inbound_security_group_ids
list(string): A list of security group IDs that will be allowed to connect to exhibitor_port.
allowed_health_check_port_inbound_cidr_blocks
list(string): A list of CIDR-formatted IP address ranges that will be allowed to connect to health_check_port.
allowed_health_check_port_inbound_security_group_ids
list(string): A list of security group IDs that will be allowed to connect to health_check_port.
ami_id
string: The ID of the AMI to run in this cluster. Should be an AMI that has ZooKeeper and Exhibitor installed by the install-zookeeper and install-exhibitor modules.
aws_region
string: The AWS region to deploy into.
cluster_name
string: The name of the ZooKeeper cluster (e.g. zookeeper-stage). This variable is used to namespace all resources created by this module.
cluster_size
number: The number of nodes to have in the cluster. This MUST be set to an odd number! We strongly recommend setting this to 3, 5, or 7 for production usage.
instance_type
string: The type of EC2 Instances to run for each node in the cluster (e.g. t2.micro).
The number of security group IDs in allowed_client_port_inbound_security_group_ids. We should be able to compute this automatically, but due to a Terraform limitation, we can't: https://github.com/hashicorp/terraform/issues/14677#issuecomment-302772685
The number of security group IDs in allowed_exhibitor_port_inbound_security_group_ids. We should be able to compute this automatically, but due to a Terraform limitation, we can't: https://github.com/hashicorp/terraform/issues/14677#issuecomment-302772685
The number of security group IDs in allowed_health_check_port_inbound_security_group_ids. We should be able to compute this automatically, but due to a Terraform limitation, we can't: https://github.com/hashicorp/terraform/issues/14677#issuecomment-302772685
The name of the S3 bucket to create and use for storing the shared config for the ZooKeeper cluster.
subnet_ids
list(string): The subnet IDs into which the EC2 Instances should be deployed. You should typically pass in one subnet ID for each node in the cluster (i.e., cluster_size subnets). We strongly recommend running ZooKeeper in private subnets.
user_data
string: A User Data script to execute while the server is booting. We recommend passing in a bash script that executes the run-exhibitor script, which should have been installed in the AMI by the install-exhibitor module.
vpc_id
string: The ID of the VPC in which to deploy the cluster.
Optional
allowed_ssh_cidr_blocks
list(string): A list of CIDR-formatted IP address ranges from which the EC2 Instances will allow SSH connections.
Default: []
allowed_ssh_security_group_ids
list(string): A list of security group IDs from which the EC2 Instances will allow SSH connections.
Default: []
If set to true, associate a public IP address with each EC2 Instance in the cluster. We strongly recommend against making ZooKeeper nodes publicly accessible.
Default: false
client_port
number: The port clients use to connect to ZooKeeper.
Default: 2181
connect_port
number: The port ZooKeeper nodes use to connect to other ZooKeeper nodes.
Default: 2888
custom_tags
map(string): Custom tags to apply to the ZooKeeper nodes and all related resources (i.e., security groups, EBS Volumes, ENIs).
Default: {}
deployment_batch_size
number: How many servers to deploy at a time during a rolling deployment. For example, if you have 10 servers and set this variable to 2, then the deployment will a) undeploy 2 servers, b) deploy 2 replacement servers, c) repeat the process for the next 2 servers. For ZooKeeper in production, we STRONGLY recommend keeping this at 1.
Default: 1
The maximum number of times to retry the Load Balancer's Health Check before giving up on the rolling deployment. After this number is hit, the Python script will cease checking the failed EC2 Instance deployment but continue with other EC2 Instance deployments.
Default: 360
dns_name_common_portion
string: The common portion of the DNS name to assign to each ENI in the zookeeper server group. For example, if set to confluent.acme.com, this module will create DNS records 0.confluent.acme.com, 1.confluent.acme.com, etc. Note that this value must be a valid record name for the Route 53 Hosted Zone ID specified in route53_hosted_zone_id.
Default: ""
dns_names
list(string): A list of DNS names to assign to the ENIs in the zookeeper server group. Make sure the list has n entries, where n = cluster_size. If this var is specified, it will override dns_name_common_portion. Example: [0.acme.com, 1.acme.com, 2.acme.com]. Note that the list entries must be valid records for the Route 53 Hosted Zone ID specified in route53_hosted_zone_id.
Default: []
The TTL (Time to Live) to apply to any DNS records created by this module.
Default: 300
ebs_volumes
list(object({ type = string, size = number, encrypted = bool })): A list that defines the EBS Volumes to create for each server. Each item in the list should be a map that contains the keys 'type' (one of standard, gp2, or io1), 'size' (in GB), and 'encrypted' (true or false). We recommend attaching an EBS Volume to ZooKeeper to use for transaction logs.
Default:
[
  {
    encrypted = true,
    size      = 50,
    type      = "gp2"
  },
  {
    encrypted = true,
    size      = 50,
    type      = "gp2"
  }
]
elb_names
list(string): A list of Elastic Load Balancer (ELB) names to associate with the ZooKeeper nodes. We recommend using an ALB or ELB for health checks. If you're using an Application Load Balancer (ALB), use target_group_arns instead.
Default: []
elections_port
number: The port ZooKeeper nodes use to connect to other ZooKeeper nodes during leader elections.
Default: 3888
Enable detailed CloudWatch monitoring for the servers. This gives you more granularity with your CloudWatch metrics, but also costs more money.
Default: false
If true, create an Elastic IP Address for each ENI and associate it with the ENI.
Default: false
enabled_metrics
list(string): A list of metrics the ASG should enable for monitoring all instances in a group. The allowed values are GroupMinSize, GroupMaxSize, GroupDesiredCapacity, GroupInServiceInstances, GroupPendingInstances, GroupStandbyInstances, GroupTerminatingInstances, GroupTotalInstances.
Default: []
Example
enabled_metrics = [
"GroupDesiredCapacity",
"GroupInServiceInstances",
"GroupMaxSize",
"GroupMinSize",
"GroupPendingInstances",
"GroupStandbyInstances",
"GroupTerminatingInstances",
"GroupTotalInstances"
]
exhibitor_port
number: The port Exhibitor uses for its Control Panel UI.
Default: 8080
If you set this to true, when you run terraform destroy, this tells Terraform to delete all the objects in the S3 bucket used for shared config storage. You should NOT set this to true in production! This property is only here so automated tests can clean up after themselves.
Default: false
Time, in seconds, after an instance comes into service before checking health. For CentOS servers, we have observed slower overall boot times and recommend a value of 600.
Default: 300
health_check_port
number: The port the Health Check uses for incoming requests.
Default: 5000
health_check_type
string: Controls how health checking is done. Must be one of EC2 or ELB.
Default: "EC2"
Whether the root volume should be encrypted.
Default: true
Whether the root volume should be destroyed on instance termination.
Default: true
If true, the launched EC2 instance will be EBS-optimized.
Default: false
root_volume_size
number: The size, in GB, of the root EBS volume.
Default: 50
root_volume_type
string: The type of volume. Must be one of: standard, gp2, or io1.
Default: "standard"
route53_hosted_zone_id
string: The ID of the Route53 Hosted Zone in which we will create the DNS records specified by dns_names. Must be non-empty if dns_name_common_portion or dns_names is non-empty.
Default: ""
script_log_level
string: The log level to use with the rolling deploy script. It can be useful to set this to DEBUG when troubleshooting the script.
Default: "INFO"
The Amazon Resource Name (ARN) of the KMS Key that will be used to encrypt/decrypt shared config files in the S3 bucket. The default value of null will use the managed aws/s3 key.
Default: null
If set to true, skip the health check, and start a rolling deployment without waiting for the server group to be in a healthy state. This is primarily useful if the server group is in a broken state and you want to force a deployment anyway.
Default: false
If set to true, skip the rolling deployment, and destroy all the servers immediately. You should typically NOT enable this in prod, as it will cause downtime! The main use case for this flag is to make testing and cleanup easier. It can also be handy in case the rolling deployment code has a bug.
Default: false
ssh_key_name
string: The name of an EC2 Key Pair that can be used to SSH to the EC2 Instances in this cluster. Set to an empty string to not associate a Key Pair.
Default: null
ssh_port
number: The port used for SSH connections.
Default: 22
target_group_arns
list(string): A list of target group ARNs of Application Load Balancer (ALB) targets to associate with the ZooKeeper nodes. We recommend using an ALB or ELB for health checks. If you're using an Elastic Load Balancer (AKA ELB Classic), use elb_names instead.
Default: []
tenancy
string: The tenancy of the instance. Must be one of: default or dedicated.
Default: "default"
wait_for
string: By passing a value to this variable, you can effectively tell this module to wait to deploy until the given variable's value is resolved, which is a way to make this module depend on some other module. Note that the actual value of this variable doesn't matter.
Default: ""
The maximum duration that Terraform should wait for ASG instances to be healthy before timing out. Setting this to '0' causes Terraform to skip all Capacity Waiting behavior.
Default: "10m"
Outputs
Other modules can depend on this output to ensure those modules only deploy after this module is done deploying. Note that the actual value of this output can be ignored.