In the first part of this series we got the Chef Automate Pilot container stack up and running on ECS. Now let’s make it survive termination of any container or EC2 instance without losing data by adding AWS RDS, EFS and Elasticsearch. A story told in 3 git commits:
RDS and EFS
We know that Chef Automate stores almost all of its state in PostgreSQL, but is also a Git server and stores repositories on disk. We need to get all of that data out of the disposable container volumes and into highly available persistence stores.
For starters, I’ve borrowed a chunk of code from another template to add the AWS::RDS::DBInstance resource and friends. That was the easy part.
Trickier is extracting the postgresql and postgresql-data containers. Once you remove those two from the ContainerDefinitions, you have to take care of two things:
- All of the other containers previously considered postgresql to be the Initial Peer, so I’ll just re-point that to the rabbitmq container because we’ll be keeping that one, and it is now the most depended-on container.
- Many of the containers also had binds to postgresql (--bind database:postgresql.default passed to the Habitat Supervisor), which have to be replaced with environment variables like so:
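As a rough sketch of what that replacement could look like: the Habitat Supervisor reads a HAB_PACKAGENAME environment variable containing TOML, so the ECS container definition can carry the RDS details. The container name, image, the DBInstance logical ID, the DBUser and DBPassword parameters, and the TOML keys below are all illustrative assumptions; the real keys are whatever the package lists in its Configuration section.

```yaml
# Fragment of an AWS::ECS::TaskDefinition.
# Container name, image, parameters and TOML keys are hypothetical.
ContainerDefinitions:
  - Name: workflow-server
    Image: example/workflow-server:latest   # placeholder image
    Essential: true
    Memory: 512
    Environment:
      # Habitat reads HAB_<PKGNAME> (upper case, dashes become underscores)
      # and applies its contents as runtime configuration.
      - Name: HAB_WORKFLOW_SERVER
        Value: !Sub |
          [database]
          host = "${DBInstance.Endpoint.Address}"
          port = ${DBInstance.Endpoint.Port}
          user = "${DBUser}"
          password = "${DBPassword}"
```

The multi-line `!Sub |` form is what keeps the TOML readable: CloudFormation substitutes the RDS endpoint attributes and the parameters in place, and the container receives plain TOML.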
This is where the magic of CloudFormation combines with the magic of Habitat: I can easily pass information about the RDS instance (hostname, username, password) into Habitat’s runtime configuration, and it does the right thing with it. As you can see, CloudFormation’s new-ish YAML format makes variable interpolation delightful in multi-line strings, especially compared to the JSON format.
Now you may wonder: how are you supposed to know what configuration to pass to a particular Habitat package? The awesome Habitat depot site has the answer (scroll down to the Configuration section to see all the variables in Habitat’s TOML config format). The Habitat docs describe the methodology for passing in runtime configuration via environment variables, although I was never able to get the JSON format to work reliably (Habitat auto-detects the format of the variable).
Removing the Habitat bindings and switching to environment variables worked fine for all of the services in the stack except the notifications service. I didn’t realize when making this first commit that notifications didn’t need to talk to Postgres at all, so I passed in the environment variables and then had to “fake it out” into skipping the startup-time bind wait by passing in a phony bind: database:rabbitmq.default (friends, don’t try this at home, it’s a bad idea).
Later on I realized that the notifications configuration file was missing an important bit of code to make the binding conditional, and I used pkg_binds_optional in the plan. That led to a PR back upstream. While waiting for that to get accepted, I built some of my own docker containers with those changes, which Habitat makes easy with the hab pkg export docker command. That way I could test those changes in my stack and get feedback quickly.
AWS EFS provides a way to persist data across container restarts, and even to give containers concurrent access to the same files (if you can do that safely), and underneath it’s good old-fashioned NFS. Except it isn’t, because EFS provides multi-AZ availability and NFS 4.1 isn’t exactly old-fashioned: it adds parallel access (pNFS), file locking and significant performance improvements. You still shouldn’t use it to host your database files, but for low-intensity IO it is totally fine and buys us a ton of flexibility.
As various AWS articles demonstrate, there are a few slightly awkward things about integrating EFS into your ECS cluster:
- You have to add the commands to mount the EFS filesystem on each of your ECS instances, as I do in the UserData section of the AutoScaling LaunchConfiguration (AWS if you are listening, let’s make that automagical!)
- You have to create an AWS::EFS::MountTarget for each subnet you operate in. That’s hard for shareable templates, where you may be operating in 2, 3 or even SIX subnets in your region! (psst, AWS, please just let users pass in an array of subnets like you do in most other services ;)
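Both steps can be sketched in CloudFormation roughly as follows. This is a minimal illustration, not the template from the commit: the logical IDs and the SubnetA, SubnetB, EfsSecurityGroup, EcsAmiId and InstanceType parameters are assumptions, as is the /mnt/efs mount point, and the other LaunchConfiguration properties an ECS instance needs are omitted.

```yaml
# Hypothetical logical IDs and parameters throughout.
FileSystem:
  Type: AWS::EFS::FileSystem

# One mount target per subnet; repeat for every subnet the cluster uses.
MountTargetA:
  Type: AWS::EFS::MountTarget
  Properties:
    FileSystemId: !Ref FileSystem
    SubnetId: !Ref SubnetA
    SecurityGroups:
      - !Ref EfsSecurityGroup   # must allow NFS (TCP 2049) from the ECS instances
MountTargetB:
  Type: AWS::EFS::MountTarget
  Properties:
    FileSystemId: !Ref FileSystem
    SubnetId: !Ref SubnetB
    SecurityGroups:
      - !Ref EfsSecurityGroup

LaunchConfiguration:
  Type: AWS::AutoScaling::LaunchConfiguration
  Properties:
    ImageId: !Ref EcsAmiId
    InstanceType: !Ref InstanceType
    # IamInstanceProfile, SecurityGroups, ECS cluster config etc. omitted
    UserData:
      Fn::Base64: !Sub |
        #!/bin/bash
        # Mount the EFS filesystem on the ECS host at boot.
        yum install -y nfs-utils
        mkdir -p /mnt/efs
        mount -t nfs4 -o nfsvers=4.1 ${FileSystem}.efs.${AWS::Region}.amazonaws.com:/ /mnt/efs
```

The mount targets need to exist before the instances boot, so in practice the Auto Scaling group would likely declare a DependsOn for them. Docker may also need a restart after the mount before containers can see it.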
Once that’s done, you can now tell ECS exactly where to put those volumes on the host (hint: onto the EFS mount) like so:
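A minimal sketch of that mapping, assuming the EFS filesystem is mounted at /mnt/efs on every host. The volume name, host directory, image and container path are placeholders rather than the values from the commit.

```yaml
# Fragment of a template's Resources section; names and paths are hypothetical.
TaskDefinition:
  Type: AWS::ECS::TaskDefinition
  Properties:
    Volumes:
      - Name: maintenance
        Host:
          SourcePath: /mnt/efs/maintenance   # a directory on the EFS mount
    ContainerDefinitions:
      - Name: automate-nginx
        Image: example/automate-nginx:latest   # placeholder image
        Memory: 256
        MountPoints:
          - SourceVolume: maintenance          # must match a name under Volumes
            ContainerPath: /path/inside/container
```

Because the SourcePath lives on EFS rather than on local instance storage, whichever instance the task lands on sees the same files.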
It took some digging around to realize exactly which containers’ data I should be mounting on EFS (I still don’t really know what that maintenance volume does), but in later commits I start putting the Habitat data volume for key containers there. So let’s move on to that!
AWS ES, A Signing Proxy, and HTTPS
Okay so sometimes I get a bit punchy with my commit messages. Just like that Death Star, we’re hilariously not fully operational yet :D
What we’re doing here is replacing the elasticsearch container with AWS’s ES (Elasticsearch) service, which is where all that sweet visualization and reporting data is going to go. Now it would be super cool if ES had a simple access control scheme like RDS (VPC SecurityGroups), but no.
AWS ES controls access by IAM roles, and isn’t integrated with VPC at all (your traffic goes to a public IP). Each request to ES must be signed the same way that AWS API requests are signed. Fortunately my team and I had already appropriated a useful bit of code for that: the aws-signing-proxy (credit to Chris Lunsford for the original code).
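To make the IAM-based access model concrete, here is a hedged sketch of an ES domain whose access policy only admits requests signed with the ECS instances' role. The logical IDs (ElasticsearchDomain, InstanceRole), instance type and sizes are assumptions for illustration, not the values used in this stack.

```yaml
# Hypothetical logical IDs and sizing.
ElasticsearchDomain:
  Type: AWS::Elasticsearch::Domain
  Properties:
    ElasticsearchClusterConfig:
      InstanceType: m4.large.elasticsearch
      InstanceCount: 2
    EBSOptions:
      EBSEnabled: true
      VolumeType: gp2
      VolumeSize: 20
    AccessPolicies:
      Version: "2012-10-17"
      Statement:
        - Effect: Allow
          Principal:
            AWS: !GetAtt InstanceRole.Arn   # role assumed by the ECS instances
          Action: "es:ESHttp*"              # HTTP verbs against the domain
          Resource: !Sub "arn:aws:es:${AWS::Region}:${AWS::AccountId}:domain/*"
```

With a policy like this, an unsigned request is rejected even though the endpoint is publicly reachable, which is why something has to sign requests on behalf of the containers.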
All I needed to do was habitize the proxy, which was super easy when using another Go-based application as an example.
One thing I realized was that you need to export http-port (in the plan.sh), just like the Habitat elasticsearch package. That way, binds that previously depended on elasticsearch can now depend on the aws-signing-proxy service as a drop-in replacement.
EFS-Mounted Volumes, Revisited
Watching the container logs in CloudWatch Logs, I noticed that other services were taking a while to start up because they were re-creating various files at init time. For example, the automate-nginx container had a bit of code that generates an openssl key at init time.
Thanks to some very forward-thinking developers, we can skip the expensive openssl key generation step on subsequent starts by mounting the Habitat data folder on EFS!
Habitat is already smart enough to instruct docker to mount the /hab/svc/servicename/data directory as a separate volume, as I found by digging around with docker inspect. So instructing docker to mount that on EFS made sense for most of the containers.
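For automate-nginx, that could look roughly like the fragment below. The volume name and the /mnt/efs host path are assumptions; the container path follows the /hab/svc/servicename/data pattern.

```yaml
# Fragment of AWS::ECS::TaskDefinition properties; volume name and host path are hypothetical.
Volumes:
  - Name: automate-nginx-data
    Host:
      SourcePath: /mnt/efs/automate-nginx/data
ContainerDefinitions:
  - Name: automate-nginx
    Image: example/automate-nginx:latest   # placeholder image
    Memory: 256
    MountPoints:
      - SourceVolume: automate-nginx-data
        ContainerPath: /hab/svc/automate-nginx/data   # Habitat service data directory
```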
One debatable service was rabbitmq. I chose not to mount this on EFS because I was concerned about the performance impact of doing so, particularly in high-scale scenarios (which is our ultimate goal, after all!). Also, in our experience RabbitMQ tends to be greatly impacted by slow disk when handling large durable queues, so let’s give it as fast a disk as we can.
Now we have a Chef Automate container stack that can survive a wide variety of faults up to and including instance termination — and can recover with all of its data in just a couple of minutes.
In the next post, I’ll start working on multi-host operations for even better availability, as well as options to scale out.