This section will dive into the implementation details of how various Habitat components work. These topics are for advanced users. It is not necessary to learn these concepts in order to use Habitat.
The Habitat Supervisor is similar in some ways to well-known process supervisors like systemd, runit, or smf. It accepts and passes POSIX signals to its child processes, restarts child processes if and when they fail, ensures that child processes terminate cleanly, and so on.
Because the basic functionality of process supervision is well-known, this document does not discuss those details. Instead, this document focuses strictly on the internals of the feature that makes the Habitat Supervisor special: the fact that each Supervisor is connected to others in a peer-to-peer, masterless network which we refer to as a ring. This allows Supervisors to share configuration data with one another and adapt to changing conditions in the ring by modifying their own configuration.
Supervisors are configured to form a ring by using the --peer argument and pointing them at peers that already exist. In a real-life deployment scenario, Supervisors in a ring would also have a shared encryption key, so that inter-Supervisor traffic is encrypted. (See the security documentation for more details.)
Supervisor rings can be very large, comprising thousands of Supervisors. The Supervisor communication protocol is low-bandwidth and designed not to interfere with your application's actual production traffic.
Rings are divided into service groups, each of which has a name. All Supervisors within a service group share the same configuration and topology.
Habitat uses a gossip protocol named "Butterfly". It combines a variant of SWIM for membership and failure detection (over UDP) with a ZeroMQ-based variant of Newscast for gossip. This protocol provides failure detection, service discovery, and leader election to the Habitat Supervisor.
Butterfly is an eventually consistent system - it says, with a very high degree of probability, that a given piece of information will be received by every member of the network. It makes no guarantees as to when that state will arrive; in practice, the answer is usually "quite quickly".
- Members: Butterfly keeps track of "members"; each Habitat Supervisor is a single member.
- Peer: All the members a given member is connected to are its "peers". A member is seeded with a list of "initial peers".
- Health: The status of a given member, from the perspective of its peers.
- Rumor: A piece of data shared with all the members of a ring; examples are election, membership, services, or configuration.
- Heat: How many times a given rumor has been shared with a given member.
- Ring: All the members connected to one another form a Ring.
- Incarnation: A counter used to determine which message is "newer".
Supervisors communicate with each other using UDP and ZeroMQ, over port 9638.
Butterfly encrypts traffic on the wire using Curve25519 and a symmetric key. If a ring is configured to use transport-level encryption, only members with a matching key are allowed to communicate.
Service Configuration and Files can both be encrypted with public keys.
Membership and Failure Detection
Butterfly servers keep track of what members are present in a ring, and are constantly checking each other for failure. Any given member is in one of four health states:
- Alive: this member is responding to health checks.
- Suspect: this member has stopped responding to our health check, and will be marked confirmed if we do not receive proof it is still alive soon.
- Confirmed: this member has been un-responsive long enough that we can cease attempting to check its health.
- Departed: this member has been intentionally kicked out of the ring for behavior unbecoming of a Supervisor, and is not allowed to rejoin. This is done via a human operator using the
The essential flow is:
- Randomize the list of all known members who are not Confirmed or Departed.
- Every 3.1 seconds, pop a member off the list, and send it a "PING" message.
- If we receive an "ACK" message before 1 second elapses, the member remains Alive.
- If we do not receive an "ACK" in 1 second, choose 5 peers (the "PINGREQ targets"), and send them a "PINGREQ(member)" message for the member who failed the PING.
- If any of our PINGREQ targets receive an ACK, they forward it to us, and the member remains Alive.
- If we do not receive an ACK via PINGREQ within 2.1 seconds, we mark the member as Suspect, and set an expiration timer of 9.3 seconds.
- If we do not receive an Alive status for the member within the 9.3 second suspicion expiration window, the member is marked as Confirmed.
- Move on to the next member, until the list is exhausted; start the process again.
When a Supervisor sends the PING, ACK and PINGREQ messages, it includes information about the 5 most recent members. This enables membership to be gossiped through the failure protocol itself.
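The flow above can be sketched in Python as a single probe step plus a suspicion timer. This is a minimal illustration, not Butterfly's actual code: the `ping` and `pingreq` callables are hypothetical stand-ins for the UDP exchange, and the indirect probes are tried one after another here, whereas a real implementation would send them concurrently.

```python
import random
from enum import Enum


class Health(Enum):
    ALIVE = "alive"
    SUSPECT = "suspect"
    CONFIRMED = "confirmed"
    DEPARTED = "departed"


# Timing values quoted in the text, in seconds.
PING_TIMEOUT = 1.0
PINGREQ_TIMEOUT = 2.1
SUSPICION_TIMEOUT = 9.3
PINGREQ_FANOUT = 5


def probe(target, peers, ping, pingreq):
    """Probe `target` once and return the health we should record for it.

    `ping(target, timeout)` and `pingreq(peer, target, timeout)` are
    hypothetical callables that return True when an ACK arrives in time.
    """
    if ping(target, PING_TIMEOUT):
        return Health.ALIVE

    # Direct PING failed: ask up to five other peers to probe on our behalf.
    candidates = [p for p in peers if p != target]
    helpers = random.sample(candidates, min(PINGREQ_FANOUT, len(candidates)))
    if any(pingreq(helper, target, PINGREQ_TIMEOUT) for helper in helpers):
        return Health.ALIVE

    return Health.SUSPECT


def expire_suspicion(health, seconds_suspected):
    """Promote a Suspect member to Confirmed once the suspicion window passes."""
    if health is Health.SUSPECT and seconds_suspected >= SUSPICION_TIMEOUT:
        return Health.CONFIRMED
    return health
```

A caller would invoke `probe` for the next member every 3.1 seconds and run `expire_suspicion` against each Suspect member as time passes.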
This process provides several nice attributes:
- It is resilient to partial network partitions.
- Due to the expiration of suspected members, confirmation of death spreads quickly.
- The amount of network traffic generated by a given node is constant, regardless of network size.
- The protocol uses single UDP packets which fit within 512 bytes.
Butterfly differs from SWIM in the following ways:
- Rather than sending messages to update member state, we send the entire member.
- We support encryption on the wire.
- Payloads are protocol buffers.
- We support "persistent" members - these are members who will continue to have the failure detection protocol run against them, even if they are confirmed dead. This enables the system to heal from long-lived total partitions.
- Members who are confirmed dead, but who later receive a membership rumor about themselves being suspected or confirmed, respond by spreading an Alive rumor with a higher incarnation. This allows members who return from a partition to re-join the ring gracefully.
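The last point, refuting a rumor about oneself, can be illustrated with a short Python sketch. The `Member` record and the `refute` helper are hypothetical names chosen for this example, and picking "one more than the highest incarnation seen" is an assumption; the text only requires that the new incarnation be higher.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Member:
    member_id: str
    incarnation: int
    health: str  # "alive", "suspect", "confirmed", or "departed"


def refute(me: Member, rumor: Member) -> Optional[Member]:
    """Return an Alive rumor that overrides `rumor`, or None if no reply is needed.

    Only rumors about ourselves that claim we are suspect or confirmed
    need to be refuted.
    """
    if rumor.member_id != me.member_id:
        return None
    if rumor.health not in ("suspect", "confirmed"):
        return None
    # Assumption: bump past whichever incarnation is larger so that the
    # refutation is strictly newer than the rumor being refuted.
    newer = max(me.incarnation, rumor.incarnation) + 1
    return Member(me.member_id, newer, "alive")
```

Because peers treat the message with the higher incarnation as newer, the returned Alive rumor replaces the stale Suspect or Confirmed state as it spreads.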
Butterfly uses ZeroMQ to disseminate rumors throughout the network. Its flow:
- Randomize the list of all known members who are not Confirmed dead.
- Every second, take 5 members from the list.
- Send each member every rumor that has a Heat lower than 3; update the heat for each rumor sent.
- When the list is exhausted, start the loop again.
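A minimal Python sketch of this loop is shown below. The `RumorHeat` class and the `send` callable are hypothetical and stand in for the ZeroMQ PUSH side; the fan-out of 5 and the heat limit of 3 come from the steps above.

```python
FANOUT = 5    # members contacted per round
MAX_HEAT = 3  # a rumor is no longer sent to a member once its heat reaches this


class RumorHeat:
    """Tracks how many times each rumor has been sent to each member."""

    def __init__(self):
        self._heat = {}  # (member, rumor) -> number of times sent

    def cold_rumors(self, member, rumors):
        """Return the rumors that still need to be sent to `member`."""
        return [r for r in rumors if self._heat.get((member, r), 0) < MAX_HEAT]

    def gossip_round(self, targets, rumors, send):
        """Send every still-cold rumor to up to FANOUT targets and update heat.

        `send(member, rumor)` is a hypothetical transport callable.
        """
        for member in targets[:FANOUT]:
            for rumor in self.cold_rumors(member, rumors):
                send(member, rumor)
                key = (member, rumor)
                self._heat[key] = self._heat.get(key, 0) + 1
```

Once every rumor has reached the heat limit for a member, `cold_rumors` returns an empty list and nothing is sent to that member.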
What's good about this system:
- ZeroMQ provides a scalable PULL socket that processes incoming messages from multiple peers as a single fair queue.
- It has no back-chatter - messages are PUSH-ed to members, but require no receipt acknowledgement.
- Messages are sent over TCP, giving them some durability guarantees.
- In common use, the gossip protocol becomes inactive; if there are no rumors to send to a given member, nothing is sent.
- Many more details about the operation of SWIM can be found in its paper.
- For information about the newscast approach to rumor dissemination, please refer to the paper.
The Habitat Supervisor performs leader election natively for service group topologies that require one, such as leader-follower.
Because Habitat is an eventually-consistent distributed system, the role of the leader is different than in strongly-consistent systems. It only serves as the leader for application level semantics, e.g. a database write leader. The fact that a Supervisor is a leader has no bearing upon other operations in the Habitat system, including rumor dissemination for configuration updates. It is not akin to a Raft leader, through which writes must all be funneled. This allows for very high scalability of the Habitat Supervisor ring.
Services grouped using a leader need to have a minimum of three Supervisors in order to break ties. It is also strongly recommended that you do not run the service group with an even number of members. Otherwise, in the event of a network partition with equal members on each side, both sides will elect a new leader, causing a full split-brain from which the algorithm cannot recover. Supervisors in a service group will warn you if you are using leader election and have an even number of Supervisors.
Protocol for electing a leader
When a service group starts in a leader topology, it will wait until there are sufficient members to form a quorum (at least three). At this point, an election cycle can happen. Each Supervisor injects the same kind of election rumor into the ring, targeted at the service group: the rumor demands an election and insists that the sending peer itself is the leader. This algorithm is known as Bully.
Every peer that receives this rumor does a simple lexicographic comparison of its GUID with the GUID of the peer contained in that rumor. The winner is the peer whose GUID is higher. The peer then adds a vote for the GUID of the winner, and shares the rumor with others, including the total number of votes of anyone who previously voted for this winner.
An election ends when a candidate peer X gets a rumor back from the ring saying that it (X) is the winner, with all members voting. At this point, it sends out a rumor saying it is the declared winner, and the election cycle ends.
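The comparison and vote-merging steps can be sketched in Python as follows. This is an illustrative model only: `ElectionRumor`, `receive`, and `is_decided` are hypothetical names, and votes are modeled as a set of voter GUIDs rather than a running count.

```python
from dataclasses import dataclass, field


@dataclass
class ElectionRumor:
    candidate: str                           # GUID of the proposed leader
    votes: set = field(default_factory=set)  # GUIDs of members voting for it


def receive(my_guid, mine, incoming):
    """Merge an incoming rumor into the one this peer holds and add our vote."""
    if incoming.candidate > mine.candidate:
        # Lexicographically higher GUID wins; adopt its votes.
        merged = ElectionRumor(incoming.candidate, set(incoming.votes))
    elif incoming.candidate == mine.candidate:
        # Same candidate: combine the votes seen so far.
        merged = ElectionRumor(mine.candidate, mine.votes | incoming.votes)
    else:
        merged = ElectionRumor(mine.candidate, set(mine.votes))
    merged.votes.add(my_guid)
    return merged


def is_decided(rumor, members):
    """The election ends once every member has voted for the candidate."""
    return rumor.votes >= set(members)
```

Starting each of three peers with a rumor naming itself and exchanging rumors for a couple of rounds converges on the peer with the highest GUID, at which point `is_decided` holds for its rumor.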
- For more information about the Bully algorithm, please see the paper "Elections in a Distributed Computing System" by Héctor García-Molina.
Habitat uses both symmetric encryption (for wire encryption) and asymmetric encryption (for everything else). If you are not familiar with the difference between the two, please consult this article.
When you have either wire encryption or service group encryption turned on, the messages use the Curve25519, Salsa20, and Poly1305 ciphers specified in Cryptography in NaCl.
Habitat packages are signed using BLAKE2b checksums. BLAKE2b is a cryptographic hash function that is faster than MD5, SHA-1, SHA-2, and SHA-3, yet provides at least as much security as the latest standard, SHA-3.
You can examine the first four lines of a .hart file to extract the signature from it, because it is an xz-compressed tarball with a metadata header. The hab pkg header command will do this for you.
The .hart file format is designed in this way to allow you to extract both the signature and the payload separately for inspection. To extract only the xz-compressed content, bypassing the signature, you could type this:
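As a rough equivalent in Python, the sketch below separates the metadata header from the compressed payload. It assumes that the header is plain text and is separated from the xz payload by a single blank line; that separator, and the helper names `split_hart` and `read_tarball_bytes`, are assumptions made for illustration rather than a statement of the file format.

```python
import lzma


def split_hart(data: bytes):
    """Split raw .hart bytes into (header_lines, xz_payload).

    Assumption: the text header ends at the first blank line, and everything
    after that blank line is the xz-compressed tarball.
    """
    header, separator, payload = data.partition(b"\n\n")
    if not separator:
        raise ValueError("no blank line separating header from payload")
    return header.decode("utf-8").split("\n"), payload


def read_tarball_bytes(data: bytes) -> bytes:
    """Return the decompressed tarball bytes, ignoring the header."""
    _, payload = split_hart(data)
    return lzma.decompress(payload)
```

The bytes returned by `read_tarball_bytes` can then be handed to the standard `tarfile` module via an `io.BytesIO` wrapper for inspection.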
This document provides developer documentation on how the Habitat system becomes self-sustaining. It is built upon the work from the Linux from Scratch project.
The instructions in this document may rapidly become out of date as we develop Habitat further. Should you have questions, please join us in Slack.
Part I: Setup
In order to bootstrap the system from scratch, you should be familiar with how the Linux From Scratch project works.
We add the following software to augment the Linux From Scratch toolchain:
- Statically built BusyBox - used for the unzip implementation
- Statically built Wget with OpenSSL support - used by the build program to download sources
- A copy of curl’s cacert.pem certificates - used by wget when connecting to SSL-enabled websites
Finally, we place a recent last-known-good copy of the hab binary inside
The entire tarball of bootstrap "tools" lives inside the stage1 studio tarball. This should be unpacked into /tools on a Linux host that will serve as the build environment until the system is self-sustaining through the rest of this procedure.
Part II: Stage 0
Freshening the stage1 tarball
From time to time, and especially with breaking changes to hab's core behavior, it is a good idea to update the software in the habitat-studio-stage1 tarball, even if that means skipping the work of rebuilding the toolchain.
Part III: Stage 1
In this stage, we rebuild all the base packages needed by Habitat using the tools (compiler, etc.) from the existing tools tarball. You will need to have a depot locally running on your system, the latest version of the studio, and you'll need a copy of the core-plans on your local disk.
Now in the stage1 Studio:
Part IV: Stage 2
In this stage, we rebuild all the base packages needed by Habitat using the tools (compiler, etc.) from the previous stage, thus making the system self-sustaining.
Part V: Remaining packages in world
In this stage, we rebuild all of the remaining packages using the base packages from the previous phase. We recommend that this stage be executed on a powerful machine, such as a c4.4xlarge on Amazon Web Services (AWS).
Update build host now: