Consul is a powerful tool for building distributed systems. There are a handful of alternatives in this space, but Consul is the only one that really tries to provide a comprehensive solution for service discovery. As my last post points out, service discovery involves a little more than what Consul provides, but Consul covers what is probably the biggest piece of the puzzle.
Understanding Consul and the “Config Store”
The heart of Consul is a particular class of distributed datastore with properties that make it ideal for cluster configuration and coordination. Given their key-value abstraction and common use for shared configuration, we could call this class of datastore a “config store”. A common property of these key-value stores is that they provide mechanisms to watch keys for changes in real time. This feature is central to use cases such as electing masters, resource locking, and service presence.
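To make that watch mechanism concrete, here is a minimal sketch of a client watching a key through Consul’s blocking queries on the HTTP KV API. It assumes a local agent at 127.0.0.1:8500 and a hypothetical key named config/app, and keeps error handling to a minimum.

```go
// Sketch: watch a Consul key for changes using blocking queries.
// The agent address and the key "config/app" are assumptions.
package main

import (
	"fmt"
	"io"
	"net/http"
)

func main() {
	index := "0"
	for {
		// Block for up to 30s, or until the key's index moves past `index`.
		url := "http://127.0.0.1:8500/v1/kv/config/app?index=" + index + "&wait=30s"
		resp, err := http.Get(url)
		if err != nil {
			panic(err)
		}
		body, _ := io.ReadAll(resp.Body)
		resp.Body.Close()

		// X-Consul-Index tells us where to resume the next blocking query.
		if newIndex := resp.Header.Get("X-Consul-Index"); newIndex != "" && newIndex != index {
			fmt.Printf("key changed (index %s): %s\n", newIndex, body)
			index = newIndex
		}
	}
}
```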
What’s in Consul
Consul takes this forward by building on the config store with specific APIs around the semantics of common config-store functions, namely service discovery and locking. It also does this in a way that’s very thoughtful about those particular domains.
For example, a directory of services without service health is not a very useful one. This is why Consul also provides monitoring capabilities. Consul’s health checks are comparable to, and even compatible with, Nagios health checks. What’s more, Consul’s agent model makes it more scalable than centralized monitoring systems like Nagios.
A good way to think of Consul is as three layers. The middle layer is the actual config store, which is not that different from etcd or Zookeeper. The layers above and below are fairly unique to Consul.
Consul uses an efficient gossip protocol to connect a set of hosts into a cluster. The cluster is aware of its members and shares an event bus, which is primarily used to know when hosts join and leave the cluster. You can read more about gossip protocols here (http://highscalability.com/blog/2011/11/14/using-gossip-protocols-for-failure-detection-monitoring-mess.html)
The key-value store in Consul is very similar to etcd. It shares the same semantics and a similar basic HTTP API, but differs in subtle ways. For example, the API for reading values lets you optionally pick a consistency mode. This is great not just because it gives users a choice, but because it documents the realities of the different consistency levels. This transparency educates the user about the nuances of Consul’s replication model.
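For illustration, a consistent read is just an ordinary KV read with an extra query parameter. This is only a sketch: the key name and local agent address are assumptions, and the value comes back base64-encoded inside a JSON array.

```go
// Sketch: read a key with an explicit consistency mode ("consistent" forces
// the read through the Raft leader; "stale" lets any server answer).
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

func main() {
	resp, err := http.Get("http://127.0.0.1:8500/v1/kv/config/app?consistent")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	// Consul returns a JSON array; Value is base64, which Go decodes into []byte.
	var entries []struct {
		Key   string
		Value []byte
	}
	if err := json.NewDecoder(resp.Body).Decode(&entries); err != nil {
		log.Fatal(err)
	}
	for _, e := range entries {
		fmt.Printf("%s = %s\n", e.Key, e.Value)
	}
}
```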
On top of the key-value store are some other great features and APIs, including locks and leader election, which are pretty standard for what people originally called lock servers. Consul is also datacenter aware, so if you’re running multiple clusters, it will let you federate them. Nothing complicated, but it’s great to have built in, since spanning multiple datacenters is very common today.
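As a rough sketch of the locking primitive: create a session, then try to acquire a key with it, and whichever node succeeds holds the lock (or leadership). The key name, session TTL, and agent address below are illustrative assumptions.

```go
// Sketch: leader election / locking with Consul sessions.
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"strings"
)

func main() {
	// 1. Create a session; locks held by the session are released if it expires.
	req, _ := http.NewRequest("PUT", "http://127.0.0.1:8500/v1/session/create",
		strings.NewReader(`{"Name": "web-leader", "TTL": "15s"}`))
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	var session struct{ ID string }
	json.NewDecoder(resp.Body).Decode(&session)
	resp.Body.Close()

	// 2. Try to acquire the lock key; Consul answers true or false.
	req, _ = http.NewRequest("PUT",
		"http://127.0.0.1:8500/v1/kv/service/web/leader?acquire="+session.ID,
		strings.NewReader("candidate data"))
	resp, err = http.DefaultClient.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	var acquired bool
	json.NewDecoder(resp.Body).Decode(&acquired)
	resp.Body.Close()

	fmt.Println("acquired leadership:", acquired)
}
```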
However, the killer feature of Consul is its service catalog. Instead of using the key-value store to arbitrarily model your service directory as you would with etcd or Zookeeper, Consul exposes a specific API for managing services. Explicitly modelling services allows it to provide more value in two main ways: monitoring and DNS.
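As a sketch of what explicitly modelling services looks like in practice, registering a service is a single call to the local agent rather than a hand-rolled key layout. The service name, port, and tag below are placeholder assumptions.

```go
// Sketch: register a service with the local agent's service API.
package main

import (
	"log"
	"net/http"
	"strings"
)

func main() {
	payload := `{"Name": "web", "Port": 8080, "Tags": ["v1"]}`
	req, _ := http.NewRequest("PUT",
		"http://127.0.0.1:8500/v1/agent/service/register",
		strings.NewReader(payload))
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	resp.Body.Close()
	log.Println("registered:", resp.Status)
}
```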
Built-in Monitoring System
Monitoring is normally discussed independent of service discovery, but it turns out to be highly related. Over the years, we’ve gotten better at understanding the importance of monitoring service health in relation to service discovery.
With Zookeeper, a common pattern for service presence, or liveness, was to have the service register an “ephemeral node” value announcing its address. As an ephemeral node, the value would exist as long as the service’s TCP session with Zookeeper remained active. This seemed like a rather elegant solution to service presence. If the service died, the connection would be lost and the service listing would be dropped.

The problem with relying on a TCP connection for service health is that it doesn’t exactly mean the service is healthy. For example, if the TCP connection was going through a transparent proxy that accidentally kept the connection alive, the service could die and the ephemeral node might continue to exist.
Every service performs a different function and without testing that specific functionality, we don’t actually know that it’s working properly. Generic heartbeats can let us know if the process is running, but not that it’s behaving correctly enough to safely accept connections.
Specialized health checks are exactly what monitoring systems give us, and Consul gives us a distributed monitoring system. It then lets us choose whether to associate a check with a service, while also supporting the simpler TTL heartbeat model as an alternative. Either way, if a service is detected as unhealthy, it’s hidden from queries for active services.
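To make this concrete, here is a rough sketch of the TTL heartbeat model: register the service with a TTL check, have the service report in before the TTL expires, and query only the passing instances. The names and intervals are assumptions, not a prescribed setup.

```go
// Sketch: TTL heartbeat check plus a query that only returns healthy instances.
package main

import (
	"io"
	"log"
	"net/http"
	"strings"
)

func put(url, body string) {
	req, _ := http.NewRequest("PUT", url, strings.NewReader(body))
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	resp.Body.Close()
}

func main() {
	// Register a service with a TTL check; if we stop reporting, it turns critical.
	put("http://127.0.0.1:8500/v1/agent/service/register",
		`{"Name": "web", "Port": 8080, "Check": {"TTL": "10s"}}`)

	// The service heartbeats by marking its check as passing before the TTL lapses.
	put("http://127.0.0.1:8500/v1/agent/check/pass/service:web", "")

	// Query only instances whose checks are passing; unhealthy ones are hidden.
	resp, err := http.Get("http://127.0.0.1:8500/v1/health/service/web?passing")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	log.Println(string(body))
}
```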
Built-in DNS Server
Service discovery tools manage how processes and services in a cluster can find and talk to one another. This involves a directory of services, registering services in that directory, and then being able to look up and connect to services in that directory.

At its core, service discovery is about knowing when any process in the cluster is listening on a TCP or UDP port, and being able to look up and connect to that port by name. On its own, DNS is not a sufficient technology for service discovery: it resolves names to IPs, not to IPs with ports. So other than identifying the IPs of hosts in the cluster, the DNS interface at first glance seems to provide limited value, if any, for our concept of service discovery.
However, it does serve SRV records for services, and this is huge. The built-in DNS resolvers in our environments don’t look up SRV records; however, library support for doing SRV lookups ourselves is about as ubiquitous as HTTP. This took me a while to realize. It means we all already have a client, even more lightweight than HTTP, made specifically for looking up a service.
So let’s look at SRV records. An SRV record is a specification of data in the Domain Name System defining the location, i.e. the hostname and port number, of servers for specified services. It is defined in RFC 2782, and its type code is 33. Some Internet protocols, such as the Session Initiation Protocol (SIP) and the Extensible Messaging and Presence Protocol (XMPP), often require SRV support by network elements. SRV is the best standard API for simple service discovery lookups, and I hope more service discovery systems implement it.
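Here is a small sketch of what such an SRV lookup can look like, pointing a resolver directly at the local Consul agent’s DNS interface (port 8600 by default). The service name “web” is an assumption; Consul answers SRV queries of the form name.service.consul.

```go
// Sketch: resolve a service's host and port via an SRV query against
// Consul's DNS interface. The agent address and service name are assumptions.
package main

import (
	"context"
	"fmt"
	"log"
	"net"
)

func main() {
	r := &net.Resolver{
		PreferGo: true,
		// Send all queries to the local Consul agent's DNS port.
		Dial: func(ctx context.Context, network, address string) (net.Conn, error) {
			d := net.Dialer{}
			return d.DialContext(ctx, network, "127.0.0.1:8600")
		},
	}
	// With empty service/proto, LookupSRV queries the name directly.
	_, addrs, err := r.LookupSRV(context.Background(), "", "", "web.service.consul")
	if err != nil {
		log.Fatal(err)
	}
	for _, a := range addrs {
		fmt.Printf("%s:%d\n", a.Target, a.Port)
	}
}
```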
We can build on SRV records from Consul DNS to generically solve service interconnection in Docker clusters. None of this could be realized if Consul didn’t provide a built-in DNS server.
We need to design containers to be self-contained, runtime-configurable appliances as much as possible. In a later post, I’ll try to show how we can use Consul and other tools to load-balance a Docker cluster.