Amazon’s ECS (EC2 Container Service) provides a compelling offering to developers and operators who are already very comfortable with AWS and its tooling. In this series of posts I explore deployment strategies for containerized applications built using Habitat on ECS.
The mission: Taking an application stack from “works on my laptop” to “works in production”.
Step 1, Picking an app
In my day job I work with a fairly complex microservices-based application, Chef Automate. Specifically, the containerized version published as the Automate Pilot, which is meant for non-production demo and trial purposes only. But once I took that demo apart and understood what made it tick, I realized it could be scaled up and out into a robust, production-ready app.
A personal stretch goal is to continue the underlying technical work we did for the whitepaper Scaling Chef Automate Beyond 50,000 Nodes, and to see whether we can reimagine those scaling solutions in a containerized version of the app — while adding high availability and possibly even saving on cost.
Step 2, Picking a Container Scheduler
ECS is a container scheduling system built on top of Docker and AWS EC2. As somebody who uses AWS daily, I found it pretty easy to understand how it worked and to integrate it into my workflow.
- You describe what you want to run with Task Definitions (which specify the containers to run) combined with ECS Services.
- These can be submitted directly via the AWS API, or more commonly via CloudFormation.
- You can even take a regular docker-compose.yml and import it straight into ECS! That lowers the barrier slightly when transitioning from local workstation development to ECS.
The reason I chose ECS for this project is that it provides an elegant path to running data persistence services (Postgres, Elasticsearch, etc.) in AWS’s managed offerings — RDS and the Elasticsearch Service — rather than trying to run those services in containers. The magic that ties it all together is CloudFormation, which allows you to provision the persistent data services (RDS, etc.) and your ECS cluster in a single command, from a single infrastructure specification.
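To make that concrete, here is a minimal sketch (not the template from this post) of how a single CloudFormation spec can declare both a managed data service and the ECS cluster. Resource names and sizing are placeholders:

```yaml
Resources:
  ECSCluster:
    Type: AWS::ECS::Cluster

  PostgresDB:                                # illustrative resource name
    Type: AWS::RDS::DBInstance
    Properties:
      Engine: postgres
      DBInstanceClass: db.t2.medium          # sizing is an assumption
      AllocatedStorage: "100"
      MasterUsername: chefadmin              # illustrative
      MasterUserPassword: !Ref DBPassword    # assumes a DBPassword parameter exists
```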
Why Habitat on ECS?
Unlike some competing solutions, ECS doesn’t provide you niceties like overlay networking and service discovery out of the box — which are really important if you want to run your service on more than one Docker host! For service discovery there are many proposed solutions that involve setting up Lambdas, Consul, Netflix Eureka, or installing agents that report to DNS. There are also 3rd-party offerings from Weave and Linkerd that provide both overlay networking and service discovery — of which I found the Weave offering compelling; it could honestly eliminate the need for Habitat if it had configuration management capabilities.
The compelling thing about Habitat is that it’s the automation that travels with the app so I didn’t have to build external services and add dependencies on those systems — which should simplify my deployment and reduce operational complexity. It also helped that much of Chef’s internal development is shifting from Omnibus-style apps to Habitized apps, so a lot of the hard work had already been done on Automate by awesome people like Elliott (hi!).
What’s hard about Habitat on ECS?
ECS’s lack of service discovery makes it harder for Habitat supervisors to initially find each other (the service discovery system needs service discovery to bootstrap itself!) and the lack of overlay networking makes it harder for supervisors to all have 2-way communication with each other.
That’s why the relevant section of the Habitat docs simply tells you to use Docker links (a deprecated feature).
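As a rough illustration of that pattern (the image names here are placeholders, not the docs’ exact example), a docker-compose.yml using links looks something like this: the mongodb container is the initial peer, and the app container links to it.

```yaml
version: "2"
services:
  mongodb:
    image: core/mongodb            # a Habitat-built MongoDB image (name illustrative)
  app:
    image: myorigin/myapp          # illustrative application image
    links:
      - mongodb                    # makes the hostname "mongodb" resolvable in this container
    command: --peer mongodb        # tell the Habitat supervisor where its initial peer lives
```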
What this does is leverage Docker’s (and ECS’s) Bridge mode networking, which is the default networking mode and is required in order for linking to work. In this case the initial peer is the mongodb container, which the other linked containers can find because linking takes care of service discovery (the hostname mongodb is resolvable from any other linked container) as well as 2-way communication (they are on the same bridged network, and so can communicate with each other unencumbered).
ECS has another networking mode (called Host mode) where containers don’t get their own IP addresses and instead share the IP address of the host. Any port that the container tries to listen on becomes a listening port on the host. More on that in a future post.
A “simple” deployment
Let’s port Automate Pilot’s docker-compose.yml file to ECS. I started with a fairly stock ECS CloudFormation snippet and replaced the example service with the container definitions from the docker-compose config. Here’s what that looks like:

What you see below is a complete and working CloudFormation template that is made up of 3 important parts:
- Parameter definitions — asking you important questions like what VPC to deploy into and server sizing (lines 2–35)
- The ECS Cluster — really the underlying EC2 instances that live in an AutoScaling group, plus associated networking and security rules (lines 359–493)
- The ECS Service and Task Definition — the part that once was our plucky docker-compose.yml file, but now presented in a fashion that ECS can schedule, run, and even monitor (lines 52–284)
Let’s dive deeper into that ECS TaskDefinition, because it deserves some unpacking. There are 9 container definitions, ranging from data persistence services (elasticsearch) to data processing services (rabbitmq) to API services (automate-nginx). There’s also a trick container called postgresql-data which lays down the initial Postgres data volume and then exits — we have to set that to Essential: false so that ECS is okay with it exiting.
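Here’s a hedged sketch of what that trick container’s definition might look like inside the TaskDefinition (the image tag, volume name, and mount path are illustrative, not the exact values from the template):

```yaml
- Name: postgresql-data
  Image: chefdemo/postgresql-data:stable        # illustrative image/tag
  Essential: false                              # ECS won't tear down the task when this container exits
  MountPoints:
    - SourceVolume: postgresql-data             # assumed volume name declared at the TaskDefinition level
      ContainerPath: /hab/svc/postgresql/data   # Habitat's conventional data path for the postgresql service
```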
Each container definition has some important components, sketched in the example after this list:
- The image name (which is pulled from Docker Hub)
- Tuning options (such as memory and ulimits)
- Links (as mentioned above for service discovery and networking between containers)
- Command-line arguments that are passed to Habitat: a peer name (to form a gossip ring) and bindings (dependencies on a service group)
- Environment variables passed to Habitat — TOML-formatted parameters which inject configuration to be used at runtime by this container (and also consumable by dependent services on other containers)
- Logging configuration, which is handy because it makes all container output available via CloudWatch Logs
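Here’s a hedged sketch of a single container definition showing each of those pieces. The names, ports, bind name, and TOML contents are illustrative assumptions, not the exact values from the real template:

```yaml
- Name: automate-nginx
  Image: chefdemo/automate-nginx:stable        # pulled from Docker Hub (tag illustrative)
  Memory: 512                                  # tuning options
  Links:
    - postgresql                               # linking provides discovery + networking between containers
  Command:
    - --peer
    - postgresql                               # join the gossip ring via the initial peer
    - --bind
    - database:postgresql.default              # assumed bind name; depends on the postgresql.default service group
  Environment:
    - Name: HAB_AUTOMATE_NGINX                 # Habitat-style config variable; contents are TOML
      Value: |
        port = 8443
  LogConfiguration:
    LogDriver: awslogs                         # send container output to CloudWatch Logs
    Options:
      awslogs-group: !Ref CloudwatchLogsGroup  # assumes a CloudWatch Logs group resource defined elsewhere
      awslogs-region: !Ref "AWS::Region"
      awslogs-stream-prefix: automate
```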
Another important aspect of this configuration is the use of Volumes — right now in a simple way to share the maintenance volumes. More on this in a future post; there’s a ton of interesting things we can do with them.
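As a rough sketch of that sharing (the volume name comes from the text above; the consuming container and mount path are assumptions), the TaskDefinition declares the volume once and each container that needs it mounts it:

```yaml
TaskDefinition:
  Type: AWS::ECS::TaskDefinition
  Properties:
    Volumes:
      - Name: maintenance                  # shared volume declared once at the task level
    ContainerDefinitions:
      - Name: automate-nginx               # illustrative consumer of the volume
        # ... other container properties ...
        MountPoints:
          - SourceVolume: maintenance
            ContainerPath: /var/opt/maintenance   # placeholder path
```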
Know your Habiterms: The Initial Peer
In this stack the initial peer is defined as the postgresql container — all of the other containers must be linked to it in order to join the gossip ring and enable and perform service discovery.
Know your Habiterms: Service bindings
Service bindings are a super important part of gluing services together in Habitat. A binding says “wait until the postgresql.default service group is available before starting” — which makes a number of order-of-operations and other orchestration problems magically disappear. It also allows you to explicitly depend on other services, and to know which ports to bind to or export.
Neat things about the underlying ECS cluster
One thing I really like about ECS is how it gives you full control over the underlying ECS instances. They’re essentially EC2 instances with a simple expectation: they run the ECS agent and the cluster name is put in the /etc/ecs/ecs.config file. That’s it! It means you could replace the stock ECS-Optimized Amazon Linux images with CoreOS or Weave images and it still works just as well (but with cool features).
In an upcoming post I’ll show exactly what I do with that, but for now I’ll show one very real problem I could solve easily: Elasticsearch 5 will refuse to start unless you raise a certain sysctl value (vm.max_map_count).
That’s not a setting that can be set inside the container; it has to be set on the host — and guaranteed consistent on every Docker host in your cluster.
CloudFormation makes that easy, because I can set it in the UserData (boot-time commands) for each instance in the cluster:
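A hedged sketch of what that looks like in the launch configuration’s UserData, assuming the stock ECS-Optimized Amazon Linux AMI and a cluster resource named ECSCluster (the resource name and surrounding properties are illustrative). It registers the instance with the cluster and raises vm.max_map_count so Elasticsearch 5 will start:

```yaml
ContainerInstances:
  Type: AWS::AutoScaling::LaunchConfiguration
  Properties:
    # ... ImageId, InstanceType, SecurityGroups, etc. ...
    UserData:
      Fn::Base64: !Sub |
        #!/bin/bash -xe
        echo ECS_CLUSTER=${ECSCluster} >> /etc/ecs/ecs.config
        sysctl -w vm.max_map_count=262144                    # Elasticsearch 5's bootstrap check requires at least 262144
        echo "vm.max_map_count=262144" >> /etc/sysctl.conf   # persist the setting across reboots
```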
In the next post, I’ll start to make this more robust by leveraging AWS RDS, ES and EFS services so that our application can survive termination of the containers or even the ECS instance!