I set up a three-node etcd cluster on AWS EC2 to gain first-hand experience with distributed consensus, leader election, and fault tolerance. The project involved launching virtual machines, configuring firewall rules, installing etcd, and experimenting with cluster behavior under simulated failure conditions.
Visualizing RAFT Consensus
To better understand how RAFT consensus underpins etcd, I used an interactive RAFT simulator to step through a leader failing and a follower being promoted. The simulator was an invaluable tool for understanding leader election, log replication, and term changes during failures.
This tool provided an intuitive way to see how:
- One node is elected leader in normal operation
- Followers replicate log entries from the leader
- When the leader fails, a follower is promoted through a new election
- RAFT terms increase each time a new leader is chosen
Setting Up the Cluster
I launched three Debian-based EC2 instances, noted their private IPv4 addresses, and configured a shared security group with inbound rules for ports 2379 (client requests) and 2380 (peer-to-peer communication).
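For reference, the inbound rules can be expressed with the AWS CLI roughly as in the sketch below; the security group ID and VPC CIDR are placeholders for my actual values:

```bash
# Allow etcd client (2379) and peer (2380) traffic between the instances.
# sg-0123456789abcdef0 and 10.0.0.0/16 are placeholders for the real
# security group ID and VPC CIDR range.
aws ec2 authorize-security-group-ingress \
  --group-id sg-0123456789abcdef0 \
  --protocol tcp \
  --port 2379-2380 \
  --cidr 10.0.0.0/16
```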
On each instance, I installed etcd v3.5.21 and set up environment variables (`setupenv.sh`) with node names, IP addresses, and cluster tokens. Using these variables, I started etcd on each VM and verified that the three nodes successfully formed a cluster.
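My `setupenv.sh` and launch command looked roughly like the following sketch; the node name, private IPs, and cluster token are illustrative placeholders, and only `NODE_NAME` and `NODE_IP` change between nodes:

```bash
# setupenv.sh (sketch) -- values differ per node; IPs and token are placeholders
export NODE_NAME=node1
export NODE_IP=10.0.1.10          # this instance's private IPv4 address
export CLUSTER_TOKEN=etcd-demo-cluster
export CLUSTER="node1=http://10.0.1.10:2380,node2=http://10.0.1.11:2380,node3=http://10.0.1.12:2380"

# Start etcd using the variables above (same flags on every node)
etcd --name "$NODE_NAME" \
  --initial-advertise-peer-urls "http://$NODE_IP:2380" \
  --listen-peer-urls "http://$NODE_IP:2380" \
  --listen-client-urls "http://$NODE_IP:2379,http://127.0.0.1:2379" \
  --advertise-client-urls "http://$NODE_IP:2379" \
  --initial-cluster "$CLUSTER" \
  --initial-cluster-token "$CLUSTER_TOKEN" \
  --initial-cluster-state new
```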
Interacting with the Cluster
I used etcdctl to:
- List cluster members and verify connectivity
- Store and retrieve key/value pairs (e.g., writing and retrieving `mypid`)
- Inspect cluster health with the `endpoint status` command
Initially, Node 1 was elected as the leader.
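In concrete terms, the interaction looked roughly like the sketch below; the endpoint addresses and the `mypid` value are illustrative placeholders:

```bash
# Point etcdctl at all three nodes (placeholder private IPs)
ENDPOINTS=10.0.1.10:2379,10.0.1.11:2379,10.0.1.12:2379

# List cluster members and verify connectivity
etcdctl --endpoints=$ENDPOINTS member list

# Store and retrieve a key/value pair
etcdctl --endpoints=$ENDPOINTS put mypid 12345
etcdctl --endpoints=$ENDPOINTS get mypid

# Inspect per-endpoint status, including which node is the
# RAFT leader and the current term
etcdctl --endpoints=$ENDPOINTS endpoint status --write-out=table
```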
Simulating Failure
To observe RAFT consensus in action, I stopped the EC2 instance running the leader node. Using `etcdctl`, I confirmed that one of the follower nodes (Node 2 in my experiment) was promoted to leader and that the RAFT term had incremented.
Key/value pairs written earlier remained accessible, demonstrating etcd’s fault-tolerant storage guarantees. Once the stopped instance was restarted and etcd relaunched, it rejoined the cluster as a follower.
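The checks themselves were simple AWS CLI and etcdctl invocations along these lines; the instance ID and IPs are placeholders, and the endpoints listed exclude the stopped node:

```bash
# Stop the instance running the current leader (placeholder instance ID)
aws ec2 stop-instances --instance-ids i-0abc1234def567890

# Query the surviving nodes: the IS LEADER and RAFT TERM columns show
# that a follower has been promoted and the term has advanced
etcdctl --endpoints=10.0.1.11:2379,10.0.1.12:2379 endpoint status --write-out=table

# Data written before the failure is still readable
etcdctl --endpoints=10.0.1.11:2379,10.0.1.12:2379 get mypid
```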
Takeaways
This project gave me practical experience with:
- Deploying distributed systems on AWS EC2
- Configuring networking and firewall rules for inter-node communication
- Setting up and operating a three-node etcd cluster
- Observing RAFT consensus, leader election, and log replication in a live system
- Verifying fault tolerance by simulating leader node failures and confirming cluster recovery
The RAFT simulator was especially useful in helping me understand consensus theory and complemented the practical work of running etcd in a real cluster.