Matt Klein on Testing Microservices and Building Cloud Platforms
Be sure to check out the additional episodes of the "Livin' on the Edge" podcast.
Key takeaways from the podcast included:
- Creating a fast development loop is essential for engineering productivity. Sometimes engineers have to “bend the rules” in order to make this work for them.
- It is very challenging to make a development environment look like production, particularly when working with a large service-based system or dealing with large amounts of data.
- Developers should take time to understand the “test pyramid” model. Creating unit tests that can be run locally without external dependencies can provide fast feedback.
- The judicious use of mocks and virtual services can provide effective component and integration testing. A service-based system should make strong use of APIs and contracts (e.g. interface definition languages, such as Protobuf), which can be used to automatically generate skeletons for mock components (see the sketch after this list).
- The continuous delivery pipeline should include additional tests, and also be capable of supporting incremental release and testing in production, e.g. canary releasing and dark launching.
- Minimizing the number of services that are directly dependent on a data store enables more effective testing of a system in general. Stateful services can be tested in isolation, using additional testing techniques. Stateless services interacting with other services can be tested via contracts and interactions.
- Technical leaders should ensure that any desire to maintain a staging environment (or series of pre-production environments) provides a positive return on investment.
- Systems that synchronize local and remote development environments, such as Telepresence, Garden, and Tilt, are interesting, but engineers should again take care to ensure their usage provides net positive ROI across their team.
- The Lyft team continually runs a rider/driver simulation on their staging environment. This typically identifies large or problematic bugs before they are released to production.
- The Lyft rider/driver simulation can also be used for load testing or performance testing within a production environment. The system has been designed to recognize this test traffic, and not trigger certain analytics or data mutation that would skew real customer ride data.
- When a team is considering building a cloud platform or improving their development workflows, they should always start by identifying their most impactful/blocking problems. When applying solutions, teams should aim to “keep it simple”.
- The future of development and platforms may be skewed towards function-based approaches. Frameworks like Kubernetes and Envoy will be around for a long time, and they will increasingly be pushed down into a platform that supports a variety of architectures.
- In the near-term future, organizations are dealing with a lot of core/vintage systems (that make money), and so the immediate goals of many teams are to bridge the gap between the core and cloud native technologies.
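As a concrete illustration of the contract-driven mocking takeaway above, here is a minimal Python sketch. `RidesClient` is a hypothetical stand-in for an IDL-generated client stub; `unittest.mock.create_autospec` enforces the stub's method signatures, so a test that drifts from the contract fails immediately:

```python
from unittest.mock import create_autospec

class RidesClient:
    """Hypothetical stand-in for a client stub generated from an IDL (e.g. Protobuf)."""
    def get_ride(self, ride_id: str) -> dict:
        raise NotImplementedError

def fetch_ride_status(client: RidesClient, ride_id: str) -> str:
    """Code under test: depends only on the client's contract."""
    return client.get_ride(ride_id)["status"]

def test_fetch_ride_status():
    # autospec mirrors RidesClient's signatures, so calling get_ride with
    # the wrong arguments raises a TypeError instead of silently passing.
    mock_client = create_autospec(RidesClient, instance=True)
    mock_client.get_ride.return_value = {"id": "r-1", "status": "en_route"}
    assert fetch_ride_status(mock_client, "r-1") == "en_route"
    mock_client.get_ride.assert_called_once_with("r-1")
```

In a real system the client class would itself be generated from the Protobuf or Thrift definition, so mocks across teams stay consistent with the shared contract.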
Transcript
Daniel (00:03):
Hello everyone. I'm Daniel Bryant and I'd like to welcome you to the Ambassador Livin' on the Edge podcast, the show that focuses on all things related to cloud native platforms, creating effective developer workflows and building modern APIs. Today, I'm joined by Matt Klein, creator of the Envoy proxy. I've followed Matt's work for many years now and not only have I learned a bunch from him about modern networking, service mesh, and API gateways, but I've also learned a lot about creating effective dev to prod workflows. Matt's got some great ideas around implementing continuous delivery pipelines, how we release functionality from them, and in this podcast, I was keen to dive into these topics in more detail.
Daniel (00:35):
Matt has a fantastic community presence. He's often seen in the Envoy community and the CNCF community, and he's not shy of voicing an opinion or two as well. His arguments are always well-reasoned, so when Matt talks about interesting technology or a new approach I always add it to my research backlog. If you like what you hear today, I would definitely encourage you to pop over to our website, that's getambassador.io, where we have a range of articles, white papers and videos that provide more information for engineers working in the Kubernetes and cloud space. You can also find links there to our latest releases, such as the Ambassador Edge Stack, our open source Ambassador API gateway, and also our CNCF hosted Telepresence Kubernetes tool too. Hey Matt, welcome to the show and thanks for joining me today. Could you introduce yourself to the listeners, please?
Matt (01:14):
Thank you for having me. My name is Matt. I'm a software engineer at Lyft. At Lyft I focus mostly on system reliability, where I spend about 50% of my time. I spend the other 50% of my time doing industry work, where I lead the open source project called Envoy. I think the highlight continues to be seeing the just fantastic growth of Envoy and just the entire community around it is absolutely awesome.
Daniel (01:42):
So today I wanted to chat, pick your brains Matt, around developer experience and developer loops, the ability to rapidly code, test, deploy, release, and verify. Now, without naming any names, protecting the innocent, could you describe your worst developer experience or your worst dev loop, from idea to code to production?
Matt (02:00):
Oh wow. I'm going to answer that in a slightly different way, which is that I always personally have focused on circumventing any bad developer experience. And by that, I mean there have been many times in my career where I have gone out of my way to, let's say, not use an organization's official development tooling. I've gotten a service to be able to build on my local machine so that I can edit on it directly without using the sanctioned Docker containers, et cetera, et cetera, et cetera. So I am very notorious for making sure that I am able to efficiently get work done. And I think everyone has a different definition of what that means, and that's obviously the focus of this podcast, but I take it very seriously that I can make fast, fast progress. So I don't know that I can talk about the worst case scenario mainly because I tend to mould whatever is the sanctioned approach into something that actually works for me.
Daniel (03:09):
That totally makes sense, Matt. What about a best developer loop, then? Can you share your best dev ex?
Matt (03:14):
Yeah. I think one thing that has become really clear, at least from a recent industry perspective, is that we have... Obviously lots of companies have moved towards a microservice world, a serverless world. There's obviously increasing uptake of things like Kubernetes. And what I've seen from an industry perspective is that a lot of organizations try very hard to have the development experience model what people actually run in production. What that means is potentially giving every developer their own Kubernetes cluster, and then when you layer things on top, having a service mesh, actually having Envoy on there and the entire Envoy config and all the different services, et cetera, et cetera, et cetera. What I can tell you is that I have never personally seen an organization do that well, and by do it well, I mean, do it in a way that is not painful for the people that are developing software on that system, just because it is incredibly difficult to make a development environment both look like production and actually stay up to date like production.
Matt (04:35):
So when you talk about my personal best development flow, I tend to take a similar approach for most projects that I work on, and that approach is making sure that I can run a fairly sophisticated set of tests on my local machine without any external dependencies. I think you'll hear a lot of people talk about... They call it the test pyramid, where most of your tests are supposed to be unit tests, and then there are integration tests and end-to-end tests... people have different words for these things. But what I have found personally, and I've had fairly great success with this, is that investing in a good set of tools so that on my local machine I can get very high quality test coverage is quite important. And that means, for example, for services, it actually means investing in the capability of not just doing unit tests with mocks, because although that is useful, that alone is fairly notorious for not having good end to end coverage.
Matt (05:45):
What I have found very useful is actually building what I would call local machine integration tests. So for example, having fake clients that run through my service and then actually fake servers, which potentially mock actual responses or actual flows. And what I found through not only working on Envoy, but on similar systems and different types of internet services over the last 10 years, is that by having a good mix of these things, a good mix of unit tests with mocking to hit some of the more difficult edge cases, plus investing in a fairly sophisticated local integration test framework, I can get very, very good coverage of the system without leaving my local machine. And then a whole separate conversation is that I trust the testing-in-production process, through feature flags and all of the normal ways that people do things to actually touch their code in production.
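As a rough, self-contained illustration of the "local machine integration tests" Matt describes, the sketch below runs a fake upstream server and drives the code under test through a real HTTP call, all in one process. The endpoint and field names are hypothetical:

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class FakeUpstream(BaseHTTPRequestHandler):
    """Fake server standing in for a real dependency (e.g. an ETA service)."""
    def do_GET(self):
        body = json.dumps({"driver_id": "d-42", "eta_minutes": 3}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        pass  # keep test output quiet

def get_eta(base_url: str) -> int:
    """Code under test: makes a real HTTP call to its dependency."""
    with urllib.request.urlopen(f"{base_url}/eta") as resp:
        return json.load(resp)["eta_minutes"]

def test_get_eta_against_fake_upstream():
    server = HTTPServer(("127.0.0.1", 0), FakeUpstream)  # port 0: any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        assert get_eta(f"http://127.0.0.1:{server.server_port}") == 3
    finally:
        server.shutdown()
```

The same pattern extends to fake clients driving a real service instance; the point is that the whole loop runs locally with no external dependencies.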
Matt (06:48):
So my personal best flow is basically being able to do 95 to 97% of my development on my local machine. And then for the things that I can't do, to be able to actually run them in a staging environment or even in a partially deployed production environment and just see how that actually goes. One small anecdote is that in the last five years now, actually, that I've been working on Envoy at Lyft, I think I can count on one hand the number of times that I've had to actually do development with other physical systems at Lyft. It just hasn't happened. We've been able to model almost all of the behavior through a sophisticated set, again, of fake clients and fake servers, and that's not just a service client, that could be a backend, a rate limit service or auth service or something along those lines. I think folks can get a lot farther than they probably would think by focusing on a fast local development flow.
Daniel (08:03):
Yeah, that totally makes sense, Matt. You mentioned effectively mocking out the dependencies, mocking and stubbing. Is that scalable though? I could almost see a thing where every developer is creating their own virtual service.
Matt (08:14):
Correct. Right. So where I have seen the more successful organizations tackle this is as organizations move towards structured services, so basically using an IDL, whether that be Protobuf or Thrift or something like that. If the API itself is structured, it allows a central tool to basically generate mock services. And that is a very powerful way that people can tackle this problem. So that, for example, if people, say, write services in Go or Python or Java or something along those lines, as long as you have a common understanding of what the RPC layer speaks, you can relatively, and I'm using air quotes here, you can relatively easily generate service stubs and service clients that you can set expectations on. And that can be a very powerful way of writing these end to end tests, because the reality, again, is that if you're writing a service, even in a large microservice architecture, typically you're talking to a subset of the overall service graph.
Matt (09:28):
So it's like you have your typical communication partners. And again, it's not that any of these things are perfect. There's always going to be bugs. But at the end of the day, if you have a service architecture, you have an API for a reason, and that API is so that you can be decoupled from the development of your service partner. So I've always found it somewhat ironic that people are theoretically moving to a microservice or a service oriented architecture, yet they feel that before they can ship code, they have to spin up 75 services and run this thing. And that's where, I think, the phrase now is distributed monolith. I think that that's a really good indication that you have a distributed monolith: when you feel that you need to spin up 80 services and do some type of integration tests to feel confident that you haven't actually broken something.
Matt (10:26):
If you've developed APIs and you've stuck with them for the vast majority of development, you should be able to adhere to these contracts. And again, that's not to say that there aren't going to be bugs. And that's why you also have to focus on the production rollout component of staged rollout, canary deployments and things like that. I mean, there's a whole other topic here, but I'm a big believer that if you focus on a quote fast inner loop development process, which is mostly your local machine, coupled with a robust production deployment system, I personally think that is the fastest way of developing. I think organizations, and again, I'm not going to name names here, but I think that organizations that invest a huge amount of resources in trying to do that replication of your development environment to look like production, I have personally, and I know obviously there's lots out there, so I probably haven't seen it, I'm sure some people are doing this, but I have not personally seen that done well in a way that I think is a net positive ROI.
Daniel (11:42):
Yeah, I think my experience broadly aligns with that too, Matt. One of the things I've seen folks struggle with is data management. Have you got any thoughts on that? As in, testing with production-like data: how do you capture that, how do you recreate it, how do you test with that?
Matt (11:55):
Super, super hard problem, and I'm going to be perfectly honest with you, which is that is not my forte. There are lots of people that have a lot more experience with doing data migrations and general data management. So I don't feel super comfortable speaking to this point, but I think it's always a topic of conversation, which is how do you make test data look like production data, et cetera, et cetera, et cetera. And again, it's just one of those areas where I don't know that I've ever seen it done well, just because it's an incredibly hard problem. You have production data volumes that are always going to dwarf what you have in dev. You have historical data, which is very, very hard to model. You have PII concerns, so it's not like you can give every developer production data. It just doesn't make sense.
Matt (12:46):
So this comes back to... You're not looking for a 100% solution in dev, but you're looking for the 90 or the 95% solution. So you're looking for a situation in which you can mock not only your clients and your servers to do what I would call local end to end or local integration testing, but you're also going to want to mock, and have some local capability for, any data stores your service talks to. And again, not all services do that. So one other general design pattern that I've seen work well is to limit the number of services that actually talk to data stores, for this very reason. You can limit the number of entities in your service graph that actually have to deal with this problem, which makes it simpler for the majority of services.
Matt (13:36):
I think with the capability of some local environment, again, in which you can use something that looks like your data store and populate it with test data, you can get pretty far. Is it going to be a 100% solution? No. Is it going to be pretty good? Yes. As you find issues, can you backfill it with problem cases? Sure. But again, this comes back to the need for a very rigorous production deployment process. And there are people that are absolute experts in terms of how to do data migration, how to do data shadowing, how to do data comparisons. And I'm not that person. I think that's a fascinating topic, but I think what I would say is that, again, I've never seen it worthwhile to try to model everything that happens in production in dev, because it's almost an asymptotic curve. You can get very far for a reasonable effort, but that last 10% is a Herculean effort, and honestly it might not even be possible. So I think knowing when to stop, and then knowing when to shift your investment into safe production deployment mechanisms, in my personal opinion, is a better use of engineering resources.
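A hedged sketch of that "local data store populated with test data" idea, using in-memory SQLite as a stand-in for the production store; the schema and seed rows here are hypothetical:

```python
import sqlite3

def make_test_store() -> sqlite3.Connection:
    """In-memory stand-in for the production data store."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE rides (id TEXT PRIMARY KEY, status TEXT)")
    # Seed with test data; backfill problem cases as production reveals them.
    conn.executemany(
        "INSERT INTO rides VALUES (?, ?)",
        [("r-1", "completed"), ("r-2", "cancelled")],
    )
    return conn

def count_by_status(conn: sqlite3.Connection, status: str) -> int:
    """Code under test: one of the few services allowed to touch the store."""
    (n,) = conn.execute(
        "SELECT COUNT(*) FROM rides WHERE status = ?", (status,)
    ).fetchone()
    return n

def test_count_cancelled():
    assert count_by_status(make_test_store(), "cancelled") == 1
```

This is the 90 to 95% solution Matt mentions: no production data volumes, no PII, but enough fidelity to catch most logic bugs locally.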
Daniel (15:00):
Yeah, totally makes sense. I like your mention of ROI. That's something I'm continually working on. Can I get your thoughts, Matt, on things like connecting up local and remote dev environments? Obviously we've got CNCF hosted tools like Telepresence, and a bunch of other ones: Garden, Skaffold.
Matt (15:12):
I mean, this really comes back to what I was saying before, which is that it's not that these things don't work. I have seen organizations do them and I work at organizations that use these products. I think there are interesting things to consider. There are latency concerns. So you have to start thinking about what you run locally, what you run remotely, how the code syncing works, et cetera, et cetera, et cetera. And I'm not saying that it can't work. People are investing in these things that are just very technically complicated. And that's what I was saying before, is that for me personally, it's not clear that investment ends up, again, having that positive ROI, versus allowing people to do the majority of their development on their local machine without leaving that machine. And then in the small set of cases we can figure out a way of running a staging type cluster where we can run services and maybe do some extra testing, and then obviously have our production rollout system. So I think a lot of these systems like Telepresence are personally very interesting. It's not clear to me how scalable they actually are, but that's me.
Daniel (16:37):
Yeah, that's intriguing, Matt. A valued opinion, very much. Maybe dialing it back a little bit: you mentioned the rollout process, so canary-ing, dark launching, there's a bunch of techniques. Could you share some of your experience around that? What do you recommend, perhaps, for folks that are looking to get started on the journey? Is a canary the easiest thing? Is a dark launch the easiest thing? What's the best way to approach it?
Matt (16:57):
I think both canary and dark launch slash feature flags are probably both required and relatively easy. What I would say is that most companies' staging environments, to be honest, are a useless dumping ground of nothingness. This is one area where I am comfortable talking about Lyft, and for years Lyft's staging environment, basically what it was used for is we would test potentially destructive AWS actions, like making load balancers, or doing things like that. There was very little use to it from a product perspective. One thing changed a few years ago at Lyft. I think the single biggest thing that we did at Lyft to increase system reliability is we invested in what we call our simulated rides platform. And basically what that means is that this is a sophisticated type of end to end test, where we basically have fake drivers and fake passengers.
Matt (18:09):
And in our staging environment, they take fake rides, and we use the system both to do load testing against prod and to run continuous fake traffic in our staging environment. Now, is the simulated rides system perfect? Does it cover every production flow? No, of course not. Does it cover a lot of the most important flows? Yes. And it has been really incredible for catching bugs. And what that system has done is that it has made staging useful, because now that we have a portion of the system running full time in staging, we actually catch bugs in staging. If something seriously breaks in staging, it will get caught during that staging deploy.
Matt (18:57):
But coming back to your question, I think for most companies that have not invested in that type of environment for their staging, where they're running actual system traffic, staging doesn't have that much value. I think a lot of companies have staging environments that are mostly dead. From a production standpoint, I think investing really early on in per-zone, per-cell deployments is very important from a blast radius perspective. And then obviously quality observability of the system is required. And then some type of feature flagging system, whether that be using a product like LaunchDarkly or something in house. I think that a feature flagging slash dark launch system, plus careful staged rollout slash blue green deploys, et cetera, can get people quite far.
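A minimal sketch of the feature flag plus staged rollout combination; in practice a product like LaunchDarkly or an in-house system fills this role, and the flag name and rollout table here are hypothetical. Hashing the (flag, user) pair gives each user a stable bucket, so a canary can be dialed from 5% upward without users flapping in and out of the feature:

```python
import hashlib

ROLLOUT_PERCENT = {"new_pricing_engine": 5}  # flag -> % of users enabled

def is_enabled(flag: str, user_id: str) -> bool:
    """Deterministic percentage rollout: same user, same answer every call."""
    percent = ROLLOUT_PERCENT.get(flag, 0)
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < percent

if __name__ == "__main__":
    enabled = sum(is_enabled("new_pricing_engine", f"u-{i}") for i in range(10_000))
    print(f"enabled for ~{enabled / 100:.1f}% of users")  # roughly 5%
```

Dialing the percentage up per zone or per cell, as Matt suggests, keeps the blast radius of a bad release small.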
Daniel (19:53):
Something that you mentioned there which I was keen to pick up on: you mentioned the simulated ride system doing load testing in production? Did I understand that right?
Matt (19:58):
Yeah, so we use the same system to both run continuously in staging, but when we are running production load tests, we can also point that system against prod, just to verify that obviously our production system is up to a certain standard.
Daniel (20:15):
Very cool. And I presume that's some kind of dial you can turn to ramp up the traffic and you siphon the traffic off to a no-op in terms of the backend.
Matt (20:23):
Yeah. So without getting into the deep technical details, we have a way of identifying things that are simulated. So again, it's not perfect by any means, but it has been a very, very successful program.
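Based on Matt's description, and only as a hypothetical sketch rather than Lyft's actual implementation, recognizing simulated traffic can be as simple as propagating a marker with each request and short-circuiting the analytics path for it:

```python
SIMULATED_HEADER = "x-simulated-traffic"  # hypothetical marker header

def should_record_analytics(headers: dict) -> bool:
    """Real rides feed business metrics; simulated rides are skipped."""
    return headers.get(SIMULATED_HEADER) != "true"

def test_simulated_traffic_skips_analytics():
    assert should_record_analytics({}) is True
    assert should_record_analytics({SIMULATED_HEADER: "true"}) is False
```

The ride flow itself still executes end to end for simulated traffic; only the side effects that would skew real customer data are gated.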
Daniel (20:38):
Yeah. Sounds intriguing. Very intriguing. If folks are looking to get started with some of the ideas that you mentioned, feature flagging and so forth... Obviously with Envoy, you can use it as a service mesh, you can use it at the edge. Where do you think's the best place to get started? Should it be the edge and work inwards, or should folks pick a technology like Istio or something, and just get started with that as a service mesh? What do you think the best approach is?
Matt (20:59):
I think my general advice for people is to start with the problem. I think these days we have a lot of shiny technology, and frankly we have a lot of smaller organizations that are potentially taking on technology that's complicated and not necessarily fully robust yet, and again it comes back to ROI. It's not completely clear to me that smaller organizations building sophisticated things on top of Kubernetes with service mesh and all of that is a great use of time, particularly for smaller orgs. So my advice to people is always keep it simple for as long as possible, and then start with the problems that you're actually trying to solve. So a problem statement would be, I want to have fast dev or something like that, but then I want to be able to safely roll out in production, and that might lead you towards using something like LaunchDarkly or some type of feature flagging system.
Matt (22:01):
Or maybe your problem statement would be, I want to do blue green deploys, or I want to have some type of deploy rollout mechanism, and that might lead you to one of the PaaS providers that might allow you to do that for free. And then in terms of service mesh or API gateway, again, I would start with the problem statement. Are you having unexplained networking issues, or do you need timeouts and retries and that type of functionality, or do you want extra observability, or something like that? I think that would lead you towards a particular technical solution. Or on the API gateway side, do you need rate limiting and auth, et cetera, et cetera. So that's always my advice: keep it simple, and don't start with the technology; start with a problem and then figure out what is the simplest technology that will solve that problem.
Daniel (22:57):
Great advice, Matt. I think the irony is I've definitely learned this, but only as my career has progressed. I think many of us, when we start, we're like, "Technology, technology, technology." And then we learn.
Matt (23:06):
I am notoriously a late adopter. I have made my career by... I like to follow everyone else by five to seven years. It's kind of ironic given that I now do all this work in the cloud native space and we're at the forefront of certain things, but at the same time, people will often laugh or complain, for example, that Envoy is written in C++. And a lot of that is a late adopter mindset. It's that I've always taken the approach of trying to build on stable technologies whenever possible. And obviously as engineers, that's the trade off that we have to make. You have to always balance doing something that might be more efficient but is less proven, versus what you know.
Daniel (24:00):
Yeah. Well said, Matt, well said. Final question then, before I let you go: what do you think the future will look like, say, in five years' time? Do you think we'll be developing functions as a service? Will we still be doing microservices? What will the platforms look like? Anything you want to dive into there?
Matt (24:14):
My feeling is that in the 5 to 10 year timeframe, I think we are going to move much more towards a functional or a serverless type deployment. I do think that as time goes on, there will be a consolidation onto a fewer number of platforms as a service or functions as a service, just because from a development perspective, that's what developers want. They just want to write their service code, do some caching, talk to a database, do some networking, and not have to think about it. So from the average developer standpoint, if you look at the cloud offerings like Google Cloud Functions or Amazon Fargate, or those types of systems, I don't know why anyone wouldn't want that type of environment. Now, we are in very early days. There's a lot of engineering to do to make that stable enough that larger organizations would be willing to run their real time applications on that type of system.
Matt (25:17):
But I think that's where we're going. Although I expect Kubernetes to be around for quite some time, and probably Envoy will be around for quite some time, and a bunch of these other technologies. I think increasingly people won't ever interact with Kubernetes or interact with Envoy. They're going to interact with systems that are built on top, and those systems, as we converge, can become a little more opinionated because they're not dealing with all of this legacy, and that makes them simpler to use. But right now I think we're in a really early slash messy time in this cloud native space, which is that we have a lot of legacy, we're trying to bridge it to this new world, it's super messy, it's hard to find products that can do that legacy to new bridging. That's going to be a multi year journey, until the majority of organizations have a more consistent infrastructure.
Daniel (26:11):
Super. Well Matt, really appreciate your time today. Thanks very much.
Matt (26:15):
Thank you for having me.