Getting started with Envoy Proxy for microservices resilience
Envoy Proxy Overview
Envoy Architecture
Envoy and the Network Stack
The Envoy Mesh
Envoy Configuration Overview
Up Next
Using microservices to solve real-world problems always involves more than simply writing the code. You need to test your services. You need to figure out how to do continuous deployment. You need to work out clean, elegant, resilient ways for them to talk to each other.
A really interesting tool that can help with the “talk to each other” bit is the Envoy Proxy from Lyft.
Envoy Proxy Overview
Envoy Proxy is a modern, high performance, small footprint edge and service proxy. Envoy adds resilience and observability to your services, and it does so in a way that’s transparent to your service implementation. It might feel odd to see us call out something that identifies itself as a proxy – after all, there are a ton of proxies out there, and the 800-pound gorillas are NGINX and HAProxy, right? Here’s some of what’s interesting about Envoy:
- It can proxy any TCP protocol.
- It can do SSL. Either direction.
- It makes HTTP/2 a first class citizen, and can translate between HTTP/2 and HTTP/1.1 (either direction).
- It has good flexibility around discovery and load balancing.
- It’s meant to increase visibility into your system.
- In particular, Envoy can generate a lot of traffic statistics and such that can otherwise be hard to get.
- In some cases (like MongoDB and Amazon DynamoDB) Envoy actually knows how to look into the wire protocol and do transparent monitoring.
- It’s less of a nightmare to set up than some others.
- It’s a sidecar process, so it’s completely agnostic to your services’ implementation language(s).
Envoy is also extensible in some fairly sophisticated — and complex — ways, but we’ll dig into that later — possibly much later. For now we’re going to keep it simple. (If you’re interested in more details about Envoy, Matt Klein gave a great talk at the 2017 Microservices Practitioner Summit.)
Being able to proxy any TCP protocol, including using SSL, is a pretty big deal. Want to proxy Websockets? Postgres? Raw TCP? Go for it. Also note that Envoy can both accept and originate SSL connections, which can be handy at times: you can let Envoy do client certificate validation, but still have an SSL connection to your service from Envoy.
Of course, HAProxy can do arbitrary TCP and SSL too, but all it can do with HTTP/2 is forward the whole stream to a single backend server that supports it. NGINX can’t do arbitrary protocols (although to be fair, Envoy can’t do e.g. FastCGI, because Envoy isn’t a web server). Neither open-source NGINX nor HAProxy handles service discovery very well (though NGINX Plus has some options here). And neither has quite the same stats support that a properly configured Envoy does.
Overall, what we’re finding is that Envoy looks promising for supporting many of our needs with just a single piece of software, rather than needing to mix and match things. One final note: Envoy Proxy is an official, graduated CNCF project, with a huge community. So unlike HAProxy and NGINX, which are controlled by a vendor, Envoy has vendor-neutral governance, which is an important consideration for many projects.
Envoy Architecture
While I said that Envoy is less of a nightmare to set up than some other things I’ve worked with, you’ll note that I didn’t say it was necessarily easy. Envoy’s learning curve is a bit steep at first, and it’s instructive to look at why.
Envoy and the Network Stack
Let’s say you want to write an HTTP network proxy. There are two obvious ways to approach this: work at the level of HTTP, or work at the level of TCP.
At the HTTP level, you’d read an entire HTTP request off the wire, parse it, look at the headers and the URL, and decide what to do. Then you’d read the entire response from the back end, and send it to the client. This is an OSI Layer 7 (Application) proxy: the proxy has full knowledge of what exactly the user is trying to accomplish, and it gets to use that knowledge to do very clever things.
The downside is that it’s complex and slow: think of the latency it’s introducing by reading and parsing the entire request before making any decisions! Worse, sometimes the highest-level protocol simply doesn’t have the information that you need for your decisions. A good example of this is SSL: before the SNI extension was added, the SSL client would never state which host it was trying to connect to. So although HTTP servers handled virtual hosts just fine (with the HTTP/1.1 Host header), the same was not possible for SSL connections.
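To make the SSL point concrete, here is a minimal sketch using Python's standard ssl module. It builds the first handshake message a modern client would send, without any network connection, and checks that the requested host name travels in it in the clear. The host name service1.example.com is purely hypothetical, and this only illustrates what the Server Name Indication (SNI) extension adds; it is not how Envoy itself inspects connections.

```python
import ssl


def client_hello_bytes(server_name: str) -> bytes:
    """Return the first TLS handshake message a client would send for server_name."""
    context = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    incoming = ssl.MemoryBIO()  # bytes from the (absent) server
    outgoing = ssl.MemoryBIO()  # bytes the client wants to send
    tls = context.wrap_bio(incoming, outgoing, server_hostname=server_name)
    try:
        tls.do_handshake()
    except ssl.SSLWantReadError:
        pass  # expected: the client is now waiting for the server's reply
    return outgoing.read()


# Hypothetical host name, used only for illustration.
hello = client_hello_bytes("service1.example.com")
print(b"service1.example.com" in hello)
```

A proxy that can read this field can choose a backend before decrypting anything, which is exactly the information that was missing before the extension existed.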
So maybe a better choice is operating down at the TCP level: just read and write bytes, and use IP addresses, TCP port numbers, etc., to make your decisions about how to handle things. This is an OSI Layer 3 (Network) or Layer 4 (Transport) proxy, depending on who you talk to. We’ll borrow from Envoy’s terminology and call it a Layer 3/4 proxy.
Things can be very fast in this model, and certain things become very elegant and simple (see our SSL example above). On the other hand, suppose you want to proxy different URLs to different back ends? That’s not possible with the typical L3/4 proxy: higher-level application information isn’t accessible down at these layers.
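The difference between the two approaches can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the routing tables, backend names, and port numbers are hypothetical, and a real proxy would stream data rather than hold a whole request in memory.

```python
from typing import Optional

# Hypothetical routing tables, for illustration only.
PORT_TO_BACKEND = {5432: "postgres-backend", 443: "tls-backend"}
PREFIX_TO_BACKEND = [("/service1", "service1"), ("/service2", "service2")]


def route_l4(destination_port: int) -> Optional[str]:
    """Layer 3/4 decision: only addresses and port numbers are visible."""
    return PORT_TO_BACKEND.get(destination_port)


def route_l7(raw_request: bytes) -> Optional[str]:
    """Layer 7 decision: parse the HTTP request line and route on the URL path."""
    request_line = raw_request.split(b"\r\n", 1)[0].decode("ascii")
    _method, path, _version = request_line.split(" ")
    for prefix, backend in PREFIX_TO_BACKEND:
        if path.startswith(prefix):
            return backend
    return None


print(route_l4(5432))
print(route_l7(b"GET /service2/users HTTP/1.1\r\nHost: example.com\r\n\r\n"))
```

The Layer 3/4 function never looks at the payload, so it is cheap but cannot tell /service1 from /service2. The Layer 7 function can, at the cost of parsing the request first.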
Envoy deals with the fact that both of these approaches have real limitations by operating at layers 3, 4, and 7 simultaneously. This is extremely powerful, and can be very performant… but you generally pay for it with configuration complexity.
The challenge is to keep simple things simple while allowing complex things to be possible, and Envoy does a tolerably good job of that for things like HTTP proxying.
The Envoy Mesh
The next bit that’s a little surprising about Envoy is that most applications involve two layers of Envoys, not one:
- First, there’s an “edge Envoy” running all by itself somewhere. The job of the edge Envoy is to give the rest of the world a single point of ingress. Incoming connections from outside come here, and the edge Envoy decides where they go internally.
- Second, each instance of a service has its own Envoy running alongside it, a separate process next to the service itself. These “service Envoys” keep an eye on their services, and remember what’s running and what’s not.
- All of the Envoys form a mesh, and share routing information amongst themselves.
- If desired (as it typically will be), interservice calls can go through the Envoy mesh as well. We’ll get into this later.
Note that you could, of course, only use the edge Envoy, and dispense with the service Envoys. However, with the full mesh, the service Envoys can do health monitoring and such, and let the mesh know if it’s pointless to try to contact a down service. Also, Envoy’s statistics gathering works best with the full mesh (more on that in a separate article, though).
All the Envoys in the mesh run the same code, but they are of course configured differently… which brings us to the Envoy configuration file.
Envoy Configuration Overview
Envoy’s configuration starts out looking simple: it consists primarily of listeners and clusters.
A listener tells Envoy a TCP port on which it should listen, and a set of filters with which Envoy should process what it hears. A cluster tells Envoy about one or more backend hosts to which Envoy can proxy incoming requests. So far so good. There are two big ways that things get much less simple, though:
- Filters can – and usually must – have their own configuration, which is often more complex than the listener’s configuration!
- Clusters get tangled up with load balancing, and with external things like DNS.
Since we’ve been talking about HTTP proxying, let’s continue with a look at the http_connection_manager filter.
The filter configuration for http_connection_manager includes a virtual_hosts array. Each element of the array is a dictionary with the following attributes:
- name: a human-readable name for the service
- domains: an array of DNS-style domain names, one of which must match the domain name in the URL for this virtual_host to match
- routes: an array of route dictionaries.
Each route dictionary needs to include, at minimum:
- prefix: URL path prefix for this route
- cluster: Envoy cluster to handle this request
- timeout_ms: timeout for giving up if something goes wrong
All of this means that the simplest case of HTTP proxying — listening on a specified port for HTTP, then routing to different hosts depending on the URL — is actually pretty simple to configure in Envoy.
An example: to proxy URLs starting with /service1 to the service1 cluster, and URLs starting with /service2 to the service2 cluster, you could use:

    "virtual_hosts": [
      {
        "name": "service",
        "domains": ["*"],
        "routes": [
          {"timeout_ms": 0, "prefix": "/service1", "cluster": "service1"},
          {"timeout_ms": 0, "prefix": "/service2", "cluster": "service2"}
        ]
      }
    ]
That’s it. Note that we use domains ["*"] so that this virtual host matches whatever domain name the request uses.
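To show how a configuration like the one above turns into a routing decision, here is a rough Python sketch. It is a simplification rather than Envoy's actual matching logic: it assumes only exact domain names and the "*" wildcard, takes the first route whose prefix matches, and the function name pick_cluster is made up for this example.

```python
from typing import Optional

# The virtual_hosts example from the text, written as a Python structure.
VIRTUAL_HOSTS = [
    {
        "name": "service",
        "domains": ["*"],
        "routes": [
            {"timeout_ms": 0, "prefix": "/service1", "cluster": "service1"},
            {"timeout_ms": 0, "prefix": "/service2", "cluster": "service2"},
        ],
    }
]


def pick_cluster(virtual_hosts: list, host: str, path: str) -> Optional[str]:
    """Return the cluster name for a request, or None if nothing matches."""
    for vhost in virtual_hosts:
        # Simplification: only exact names and the "*" wildcard are handled.
        if "*" in vhost["domains"] or host in vhost["domains"]:
            for route in vhost["routes"]:
                if path.startswith(route["prefix"]):
                    return route["cluster"]
            return None  # the virtual host matched, but no route did
    return None


print(pick_cluster(VIRTUAL_HOSTS, "example.com", "/service1/users"))
print(pick_cluster(VIRTUAL_HOSTS, "example.com", "/unknown"))
```

Because routes are checked in order, a more specific prefix would need to appear before a more general one in the routes array.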
Of course, we would still need to define the service1 and service2 clusters referenced in virtual_hosts. That is done in the cluster_manager configuration, whose clusters array holds one dictionary per cluster with the following attributes:
- name: a human-readable name for the cluster
- type: how will this cluster know which hosts are up?
- lb_type: how will this cluster handle load balancing?
- hosts: an array of URLs defining the hosts in the cluster (usually these are tcp:// URLs, in fact).
The possible values for type are:
- static: every possible host is listed in the cluster
- strict_dns: Envoy will monitor DNS, and every matching A record will be assumed valid
- logical_dns: Envoy will basically use the DNS to add hosts, but will not discard them if they’re no longer returned by DNS (think round-robin DNS with hundreds of hosts; we’ll get more into this in subsequent articles)
- sds: Envoy will query an external REST service to find cluster members
And the possible values for lb_type are:
- round_robin: cycle over all healthy hosts in order
- weighted_least_request: select two random healthy hosts and pick the one with the fewest requests (this is O(1), where scanning all healthy hosts would be O(n). Lyft claims that research indicates that the O(1) algorithm “is nearly as good” as the full scan.)
- random: pick a random host
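The three policies are easy to sketch in Python. This is an illustration of the ideas as described above, not Envoy's implementation: the Host class, its active_requests counter, and the function names are all assumptions made for the example, and weighting is left out entirely.

```python
import itertools
import random
from dataclasses import dataclass


@dataclass
class Host:
    """Hypothetical host record; active_requests stands in for real load data."""
    name: str
    active_requests: int = 0


def round_robin(healthy_hosts):
    """Return an iterator that yields healthy hosts in order, wrapping around."""
    return itertools.cycle(healthy_hosts)


def least_request_of_two(healthy_hosts, rng=random):
    """Sample two distinct hosts at random and keep the less busy one (O(1))."""
    if len(healthy_hosts) < 2:
        return healthy_hosts[0]
    first, second = rng.sample(healthy_hosts, 2)
    return first if first.active_requests <= second.active_requests else second


def pick_random(healthy_hosts, rng=random):
    """Pick any healthy host uniformly at random."""
    return rng.choice(healthy_hosts)


hosts = [Host("a", 3), Host("b", 0), Host("c", 7)]
rr = round_robin(hosts)
print([next(rr).name for _ in range(4)])  # a, b, c, then a again
print(least_request_of_two(hosts).name)
print(pick_random(hosts).name)
```

The two-choices function never scans the whole host list, which is where the O(1) cost comes from; it simply accepts that it will sometimes miss the globally least-loaded host.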
One interesting note about load balancing: a cluster can also define a panic threshold where, if the number of healthy hosts in the cluster falls below the panic threshold, the cluster will decide that the health-check algorithm is broken, and assume all the hosts in the cluster are healthy. This could lead to surprises, so it’s good to be aware of it!
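A rough sketch of the panic-threshold idea follows. It assumes the threshold is expressed as a fraction of the cluster's hosts, and the 0.5 default and the function name are illustrative choices for this example rather than values taken from the text.

```python
def hosts_for_balancing(all_hosts, healthy_hosts, panic_threshold=0.5):
    """Return the hosts the load balancer should choose from.

    Assumption: panic_threshold is a fraction of the cluster size.
    """
    if not all_hosts:
        return []
    healthy_fraction = len(healthy_hosts) / len(all_hosts)
    if healthy_fraction < panic_threshold:
        # Too few healthy hosts: distrust the health checks and use everyone.
        return list(all_hosts)
    return list(healthy_hosts)


cluster = ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"]
print(hosts_for_balancing(cluster, ["10.0.0.1", "10.0.0.2", "10.0.0.3"]))
print(hosts_for_balancing(cluster, ["10.0.0.1"]))
```

In the second call only one of four hosts is healthy, so the sketch falls back to the full list. That is the surprise to watch for: traffic can be sent to hosts that really are down.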
A simple case for an edge Envoy might be something like:

    "clusters": [
      {
        "name": "service1",
        "type": "strict_dns",
        "lb_type": "round_robin",
        "hosts": [{"url": "tcp://service1:80"}]
      },
      {
        "name": "service2",
        "type": "strict_dns",
        "lb_type": "round_robin",
        "hosts": [{"url": "tcp://service2:80"}]
      }
    ]
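Since we’ve marked these clusters with type strict_dns, Envoy relies on finding service1 and service2 in DNS, which could be appropriate for something like using docker-compose to bring up the containers. For the service Envoy running next to service1, the cluster might look like this instead:

    "clusters": [
      {
        "name": "service1",
        "type": "static",
        "lb_type": "round_robin",
        "hosts": [{"url": "tcp://127.0.0.1:5000"}]
      }
    ]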
Same idea, just a different target: rather than redirecting to some other host, we always go to our service on the local host.
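Since routes refer to clusters by name, one easy mistake is a route that points at a cluster nobody defined. The following Python sketch checks for that using only the fields discussed in this article. The function name undefined_clusters is made up for the example, and this is a hand-rolled sanity check, not a feature of Envoy.

```python
def undefined_clusters(virtual_hosts, clusters):
    """Return the sorted names of clusters that routes reference but nobody defines."""
    defined = {cluster["name"] for cluster in clusters}
    referenced = {
        route["cluster"]
        for vhost in virtual_hosts
        for route in vhost["routes"]
    }
    return sorted(referenced - defined)


virtual_hosts = [
    {
        "name": "service",
        "domains": ["*"],
        "routes": [
            {"timeout_ms": 0, "prefix": "/service1", "cluster": "service1"},
            {"timeout_ms": 0, "prefix": "/service2", "cluster": "service2"},
        ],
    }
]
clusters = [
    {
        "name": "service1",
        "type": "static",
        "lb_type": "round_robin",
        "hosts": [{"url": "tcp://127.0.0.1:5000"}],
    }
]

print(undefined_clusters(virtual_hosts, clusters))  # ['service2']
```

Here the routes from the edge Envoy example are checked against the single-cluster service Envoy configuration, so service2 is reported as missing.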
Up Next
So that’s the ten-thousand-foot view of Envoy, plus a bit of a dive down into Envoy’s background and configuration. Next up, we’ll tackle actually deploying a simple application using Kubernetes, Postgres, Flask, and Envoy, and watch how things go as we scale it up and down. Stay tuned.