LIVIN' ON THE EDGE PODCAST

Sam Newman on Microservice Ownership, Local Development, and Release Trains

About

In this episode of the Ambassador Livin’ on the Edge podcast, Sam Newman, independent technology consultant and author of Building Microservices, discusses the concept of microservice ownership, explores how best to approach the local development of microservice-based systems, and shares his opinions on the release train model of deployment.

Episode Guests

Sam Newman
Independent technology consultant and author of Building Microservices
Sam Newman, formerly of ThoughtWorks and now an independent consultant, is passionate about the convergence of technology areas like development, ops, security, and usability. He wrote "Building Microservices", published by O'Reilly, and has worked globally in both development and IT operations. He aims to help people create superior software systems. Besides contributing to open-source projects and presenting at conferences, Sam is proficient in languages including Java, Ruby, Python, JavaScript, and Clojure, and also in infrastructure automation and cloud systems.

Be sure to check out the additional episodes of the “Livin' on the Edge” podcast.

Key takeaways from the podcast included:

  • Although the microservices architecture pattern is widely adopted across the industry, the term “microservices” can mean many things to many people. When reading an article that claims microservices are bad or that engineers should be building “macroservices” or “nanoservices”, take time to understand the related terminology, goals, and constraints.

  • An important decision point around embracing a microservices-based architecture is that of ownership. With a weak ownership model, engineers are expected to be able to work on any service across the system. With a strong ownership model, small cohesive teams of engineers are responsible and accountable for specific services.

  • Weak ownership can be effective at a small scale. Engineers can iterate quickly across the entire portfolio of services. However, maintaining consistent quality can be an issue, and further collaboration challenges emerge as the number of services increases. Local development machines also typically have to be capable of running any of the services, their dependencies, and associated middleware and data stores.

  • Strong ownership of services, as demonstrated by Google, can be effective at any scale. With a smaller codebase (and smaller required mental model) engineers can maintain high quality by encouraging others outside of their team to contribute via pull requests. Localized ownership of services typically means that local developer machines are easier to maintain.

  • Language choice also impacts local microservice development environment configuration. It is relatively easy to run multiple Python or Go applications. However, only a few JVM- or CLR-based applications can run simultaneously on a single machine. Here tooling such as Telepresence and Azure Dev Spaces can be useful.

  • Making it easy to do the right thing and incentivising engineers to make the best choices is key to rapidly delivering business value to customers. Flaky, non-deterministic tests can slow down release cycles. And rewarding short-term hacks can discourage developers from continuously improving processes and systems.

  • Adopting a “release train model” can be a good stepping stone for the ultimate goal of each service within a system being independently deployable. The release train model should not be seen as the end state when attempting to adopt continuous delivery.

  • Progressive delivery is a good umbrella term for “a whole bunch of techniques regarding how we roll out software”. Separating deployment and release of functionality is key to being able to deliver value rapidly and safely. Canary releasing and dark launching are very useful techniques.

  • Implementing appropriate observability in applications, the platform, and supporting infrastructure is vital to enable rapid feedback for engineering and business teams. Charity Majors, Liz Fong-Jones, Cindy Sridharan, and Ben Sigelman are people to follow in this space.

  • Successfully rolling out system-wide log aggregation is a vital prerequisite (and test) for adopting microservices: “If you, as an organization, cannot do it, I'm sorry, microservices are too hard for you”.

  • The Kubernetes ecosystem is very powerful, but also very complicated for developers to work with. Organizations adopting this technology need to take the time to learn new mental models and choose appropriate tooling in order to implement long-standing best practices like continuous delivery.

  • Sam’s “Monolith to Microservices” book provides a number of useful migration patterns, such as the strangler fig, parallel run, and branch by abstraction.





Transcript

Daniel (00:03):

Hello everyone. I'm Daniel Bryant. And I'd like to welcome you to the Ambassador Livin' on the Edge Podcast, the show that focuses on all things related to cloud native platforms, creating effective developer workflows and building modern APIs. Today I'm joined by Sam Newman, independent technology consultant and author of the O'Reilly book, Building Microservices. Anyone interested in the topic of microservices has surely bumped into Sam's work. I learned a lot from the first book, which was published when I was doing work in the Java space and building Java microservices. A second edition of this book is currently underway and Sam has recently published another book called Monolith to Microservices. This contains great content focused around migration patterns like canary releasing and parallel runs and also explores how to evolve systems using patterns like the strangler fig. In this podcast, I was keen to pick Sam's brains around how to design and test microservices.


Daniel (00:49):

And this ultimately led us to discuss the concept of service ownership. I was also keen to learn about his experiences setting up local development environments for microservices, and to learn about patterns for how this functionality is propagated through environments and released ultimately to customers. If you like what you hear today, I would definitely encourage you to pop over to our website. That's www.getambassador.io, where we have a range of articles, white papers and videos that provide more information for engineers working in the Kubernetes and cloud space. You can also find links there to our latest release of the Edge Stack, our open source Ambassador API gateway, and our CNCF hosted Telepresence tool too. Hello, Sam, and welcome to the podcast.


Sam (01:26):

Thanks for having me, Daniel.


Daniel (01:27):

Could you introduce yourself, please, Sam, and share a few career highlights for us?


Sam (01:30):

So I'm an independent consultant, been doing software for about 20 years. My main focus is sort of cloud, continuous integration, continuous delivery, and I'm probably better known for microservice related stuff.


Daniel (01:42):

Oh, yes.


Sam (01:42):

And a career highlight. I've published two books. So my career highlight was my most recent book. I like finishing something and handing it over and people seeing it. Both the books I've had released have been a big career highlight for me.


Daniel (01:54):

So it's a tradition in this podcast to talk about developer experience and developer loops, and we always start a podcast without naming names, protect the guilty and the innocent here. Can you describe your worst dev experience from that coding, testing, releasing and verifying?


Sam (02:07):

Okay. So I'll tell you two stories. One of the worst was very early on in my career. It's 2005 and we had a classic separate test team problem. We were trying to do more tests ourselves and we were doing it, but the client we were working for also insisted on hiring a separate test team. And one of these individuals, who had an excellent education, had come from a top tier university well known for its computer science credentials, had basically come up with a test suite called white box that was the flakiest thing you've ever seen in your entire life. Like it would just fail at random times, but this guy was so smart that he'd worked out that these tests passed more frequently overnight than they did during the day. It was just classic non-deterministic test behavior, but rather than fixing the underlying flaky issue, he'd done some statistical analysis and worked out that they passed more often at three o'clock in the morning.


Sam (03:01):

So he would schedule his check-ins with Windows Scheduler to happen at like two o'clock in the morning in the hope that the tests would be more likely to pass. You'd come in in the morning and the tests would be broken and he just wouldn't be there. He'd turn up at 11 o'clock. So that was bad. The worst one, which I didn't see myself, was a colleague of mine who was at a company and went into a client. They were trying to get them to move towards continuous integration. No one was playing ball and they couldn't work out why. They eventually found out that the way software got done is a developer would build it on their laptop and would get the latest build of the software to the trading floor. Traders felt that it was an edge to get the latest version of the software.


Sam (03:38):

And in exchange for you being the one that got the software to the trader, there'd be a little envelope for you at the end of the month, cash in it. So trying to get these builds fully automated to the point where there was no ceremony did not play well. But yeah, those were the two worst that I'd come across I think in my experiences.


Daniel (03:56):

They're pretty good, Sam. That's quite unique because I like the trader and the cash envelope. I have not heard of those before. That's very interesting.


Sam (04:02):

It was just like, it's thick with bills and it's like, "Oh, okay." Well, everything's based on incentives, isn't it?


Daniel (04:09):

That it is. That it is. Nice. So I'm definitely keen to explore your book Monolith to Microservices. I've actually got a copy here beside me, Sam, as we're talking. But first I wanted to look at some of the things around developer experience. I've sort of hinted in that question. Obviously microservices means you're typically building a distributed system and this is new to a lot of folks. Yeah, it certainly was for me when I bumped into your work, I know five or more years ago, I guess. I wanted to run through sort of local dev to delivering value in production. Have you got any advice for folks, broad brush strokes I appreciate, for setting up local development? Because I think there's a tendency for people as they're building microservices is to want to run everything on their laptop.


Sam (04:45):

Yeah. I think the very first challenge that comes up is the context of ownership and what you own. Backing up. Right? It's all about fast feedback. So what I want is fast, high quality feedback on my laptop, right? I want that "does it work now?" And I want to get feedback as quickly as possible. And so running loads of processes on your laptop is going to suck because my laptop, all it's trying to do is a virtual background in Zoom, and the fans on my MacBook Pro are spinning, let alone running seven or eight JVM based processes. So you always want fast feedback. So you can kind of limit how many things you're having to run. And then when you run tests, you want to limit the scope of those tests, get fast feedback. I mean, this is all bread and butter type stuff. The challenge comes then when I see organizations that have kind of quite strong ownership versus quite weak ownership of microservice stacks.


Sam (05:35):

So if you've got a team where you've got 50 developers and you could work on any one of 50 different microservices, you've got a roving portfolio. I could be anywhere. And in those situations, people more readily want to run more stuff because they don't know which bit they're going to be working on. On the other hand, if you have maybe a stronger ownership, "This is the team, we work on these two services, and this is the scope of what we own", then I think those teams are much more aware of saying, "This is our bit of the world and we can stub things out because that's sort of outside our responsibility." So it's almost like I can tell people, you shouldn't need to run that many things locally. You should either kind of use something like Telepresence or something clever to sort of shell out to the cloud effectively, all the amazing stuff that Microsoft has that does this, kind of crazy tools they've got, or just say that is not your business.


Sam (06:24):

But if you're in an organization that has more of that, anybody should be able to change anything and will change anything model, which I don't actually think is very effective at scale, if you're going to do that, it becomes a much harder conversation to have about you don't need all this stuff because actually you might. And that's a real problem. It's also more pronounced, of course, with the technology stacks. I can quite happily have a whole load of Python processes running locally and no one cares, right? If I'm doing the same thing with the JVM or the CLR, my laptop really does care about that. And there I've played around with some hybrid models where like pretending I'm running separate JVMs, but they're actually all one JVM, but you get into so many ways in which that is not anywhere near production-like because then you can't run it in a container. And pre-containers. Right? The whole problem was my laptop is quite different to my CI environment. My CI environment quite different to my prod environments. So silly things like, oh, on this operating system, there's case sensitivity in path names.


Daniel (07:20):

Yeah, classic.


Sam (07:21):

That's going away with containers because we now have a production-like operating system giving us that more production-like experience. But then that means if we're doing a JVM or a CLR based system, that's going to be quite a big impact, whereas if I'm running Go processes or Node processes, it's not quite as bad a state of affairs. After speaking to lots and lots of people, I still think this idea of strong ownership of microservices is the one that seems to align the best, which is a microservice in general is owned by a team and a team might own one or more microservices, but where you start getting into the world where I've got multiple teams and all of them own these microservices, you find yourself in a weird sort of commune, hippie-esque, collective model, and it's like, it just doesn't work. Five, 10, 15 people, absolutely you can make that work. No question. 50, 100. It doesn't. It breaks down. And I think people then say, "But I can send pull requests," and it's like, "That's not how pull requests work."


Sam (08:17):

Because I spent a couple of years working at Google, where we had 10,000 engineers all working on one big source tree, but that wasn't really a collective ownership model. And so Martin Fowler has definitions of ownership. So he talks about strong, weak and collective ownership. And if you actually look at his definitions, well, Google practices strong ownership: there are certain people that are allowed to change certain source files. If somebody else wants to change those source files, they have to ask for permission. That's what happens. I want to change your source code. Here is a commit. Will you let me merge that commit? That's a strong ownership model, somebody's acting as the gatekeeper. And so when I was there, I know things have changed, every directory structure would have an OWNERS file in it. That OWNERS file said who owns the code in that directory. And you had to have somebody from the OWNERS file say, "Yes, you're allowed to put that in."
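The owners-file gatekeeping described here can be sketched in a few lines. This is only an illustration, not Google's tooling: the `OWNERS` mapping, the directory names, and the people in it are all hypothetical, and a real system would read owner lists from files in the source tree rather than from a dictionary.

```python
from pathlib import PurePosixPath

# Hypothetical in-memory stand-in for per-directory owners files:
# directory -> set of people allowed to approve changes beneath it.
OWNERS = {
    "": {"root-admin"},
    "billing": {"alice", "bob"},
    "billing/invoices": {"carol"},
}


def owners_for(path: str) -> set[str]:
    """Return the owners listed for the nearest enclosing directory."""
    directory = PurePosixPath(path).parent
    while True:
        # PurePosixPath renders the repository root as ".".
        key = "" if str(directory) == "." else str(directory)
        if key in OWNERS:
            return OWNERS[key]
        if key == "":
            return set()
        directory = directory.parent


def can_merge(path: str, approvers: set[str]) -> bool:
    """A change merges only if at least one owner has approved it."""
    return bool(owners_for(path) & approvers)
```

Anyone can propose a commit, but `can_merge` only returns true once somebody from the relevant owners list has signed off, which is what makes this a strong rather than a collective ownership model.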


Daniel (09:08):

Interesting.


Sam (09:08):

And that was a strong ownership model. And I think people look from the outside and misunderstand how those things work effectively when you actually get to grips with it.


Daniel (09:15):

It's classic. And we use technology sometimes or even process to fix like underlying ownership or organizational issues.


Sam (09:23):

Yeah. Yeah. This is one of those areas though, where we can look at that technology and say, "You want to run 10, 15, 20 services on your laptop. Well, you don't actually have to." And sometimes the answer is you don't actually have to, but then it's about, well, how do you scope what your active new development is? And there the model that seems most common is I'll actually have a cluster of things. So I intend to work on these four. So I'll have some sort of a template effectively that allows me to stub out their dependencies and we'll use something like Mountebank or something to just stub that out. But then if they are working in an environment where from day to day or week to week they could be changing what they're working on, unless you've got a heavy degree of standardization about how you spin things up, that's a really difficult world to be in.
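To make the idea of stubbing out a dependency concrete, here is a minimal in-process sketch. It does not use Mountebank (which stubs over the network); instead it assumes the service you own receives its downstream client as a parameter. `StubInventory`, `can_fulfil`, and the SKU names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class StubInventory:
    """Hypothetical stand-in for an inventory service owned by another team."""
    canned_stock: dict[str, int] = field(default_factory=dict)
    calls: list[str] = field(default_factory=list)

    def stock_level(self, sku: str) -> int:
        self.calls.append(sku)  # record the call so a test can inspect it
        return self.canned_stock.get(sku, 0)


def can_fulfil(order: dict[str, int], inventory) -> bool:
    """Logic in the service we own: every order line must be in stock."""
    return all(inventory.stock_level(sku) >= qty for sku, qty in order.items())
```

Because the stub returns canned answers, the team can exercise its own service locally without running the real inventory service, its data store, or its middleware.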


Daniel (10:04):

Yeah, it's tricky, I guess in the end, because I often hear folks that have worked at Google. I chatted to Kelsey Hightower a couple of weeks ago and he said the exact same as you, Sam. The Google dev experience is fantastic, but they have whole teams working on their developer experience. And most of us are not in that fortunate of a position to be there, are we?


Sam (10:20):

I was actually in their tools team when I was there. So I was one of, well, I think at that point, the tools team was somewhere between 100 and 200 people. So for 10,000 engineers, you can easily afford to have 100 to 200 people doing everything from a distributed peer to peer object, linking object cache linker thing. I was working on this system that helps store all test results and assistance. Whenever you ran a build, you could see where the test was and you could code, all that. You could afford to do that when you've got 10,000 engineers. That's noise, but those models don't necessarily apply. So you're taking something like what Google does, or maybe what Facebook does and saying, "We'll do that just because they've open sourced some of it."


Daniel (11:01):

Yeah.


Sam (11:02):

And then Google haven't actually open sourced any of it really, but you can't recreate that experience and nor should you, because there's also an extent to which Google is solving problems that, (a), maybe only they really have and, (b), a lot of them are problems they don't need to have, but that's a different conversation.


Daniel (11:20):

Yeah. I think Adrian Cockcroft said it really well a couple times in presentations at QCon, like people are just copying the outputs and not understanding the stuff. And I heard you say the same in your talks as well. It makes a lot of sense. Yeah. But it's the voice of experience speaking there. You've been through this, a lot of us, perhaps haven't had that experience.


Sam (11:34):

Yeah. And it's also often a lot of us don't have the ability or the access to scratch a bit deeper into this. A talk by somebody at Uber or at Google or at Facebook automatically gets magnified because the name is attached. And then something that they say is now the position of the whole company. So I got sent this thing recently, which is Uber's moving away from microservices. They're adopting macroservices. I dug into it, I read the article about the guy and the guy says, "I'm only one team. And I don't know what anyone else in Uber's doing. And there's about five of us in this team." And I dug into what the definition of macroservices was. And there is no definition of macroservices. And it's just that for them, they felt that they had too many services. They wanted to have fewer services so they merged them back together again. And that's the whole story. Microservices is this. Macroservices. It's like, "Guys, just."


Sam (12:25):

It took me less than two minutes to get that information because the guy had actually, to be fair to him, he'd written an article and actually explained it. All the Segment stuff as well. That was fine because it was a really interesting talk. I quite enjoyed that talk, the Segment talk. Yeah.


Daniel (12:38):

Yeah. But I'm thinking you didn't like the fire drill in that talk.


Sam (12:39):

Well, I was able to speak to the presenter about that. And she was talking about a series of design decisions they went through where they kind of had microservices and they moved away from them. And that actually, for them, it didn't work. And then someone said, "Oh, that means Segment don't use microservices." And I spoke to her and she says, "No, no, actually we're only one team and it's four people. And we had 60 microservices to manage and we had four people," and I'm like, "That was too many." And they moved away from it. But you take these things out of context, then suddenly you're... Anyway.


Daniel (13:11):

It's the Hacker News or Reddit effect. I chatted to Alex at QCon. That was really smart, well thought out, but of course everyone in the audience thought they knew better. "Oh, should do this. Should do that." I mean, I've been there. I've been that person, but I think it's about taking that time to go a little deeper, as you said, Sam, pause, check the context, understand what's going on rather than jump to the click bait title.


Sam (13:30):

It's not as much fun. You don't get any manifestos out of that though. I never did do the microservices manifesto, but it would be full of too much snark to actually be read by anybody. But I do understand people are time poor, but also when people share these things, they get translated as hot takes quite quickly. And then whole industries are born.


Daniel (13:49):

Agreed, agreed.


Sam (13:50):

Says the microservices expert.


Daniel (13:57):

I wanted to move on to look at releasing because I find that quite tricky. It's very easy to create a distributed monolith where you have to release it all in lock step. I think the first system I built, I did that, to be completely fair. I've bumped into release trains quite a lot, actually, in organizations. I've heard you talk about these kind of, the Scaled Agile Framework, which I'm sure we've all got our opinions on that one, but like release train is almost a best practice in that. Have you any thoughts on or hot takes maybe on the release train?


Sam (14:21):

Yeah. I've heard less polite definitions of what SAFe stands for, but I won't repeat those in the podcast. So putting aside my feelings about SAFe, the release train was always kind of a very useful mechanism to help people adopt proper continuous delivery. And so the release train, I always viewed as like training, wasn't it? That was my focus before microservices. I spent five, 10 years focusing on CI/CD and I got to microservices as a result of that and realizing sometimes the architecture needs to change. So this is my background. And you'd go into organizations and one of the biggest problems was that they would have some functionality they wanted to put out of the door, but it wasn't quite ready so they'd just delay the release and those releases would get less and less frequent. And there's all the sorts of associated issues with that. And so the kind of thing with the release train was getting them used to a cadence, which is saying, "Look, every four weeks, you're going to release what's ready to go live at four weeks. If your software isn't ready, it waits four weeks."


Sam (15:22):

And you get people used to that rhythm and then you increase the release cadence and then you will increase the release cadence. And then you get rid of the release train and move to release on demand. But it's like training wheels on a bike. It's something you move through and move beyond. The problem with something like SAFe is it effectively codifies the release train as being the best way to release software and almost as something aspirational. And that, to an extent, is my problem with SAFe as a whole really: it sort of codifies mediocre development practices as being somehow aspirational. And it's a great way for enterprises never to actually have to change. And guess what? You get an A1 laminated org chart for your money.


Sam (16:05):

So for me, it was like the release train. And then when you look at a microservice architecture, if you practice a release train across your application, which might consist of multiple services, you very quickly get to a situation where multiple services are being deployed at the same time as part of each release train leaving. And that means that even if you theoretically had microservices that could be deployed independently over time, that's not what's going to happen. Release train ends up driving you towards that. So if you have got people that are on a release train, a couple of things, remember it's like training wheels. Increase that release cadence. The other thing you can do is move to a release train per microservice. So if you've got an invoicing service, a payroll service, an order processing service, and you're working on those, maybe each one has its own release train.


Sam (16:49):

That might be a way to, or at least a per team release train. So at least you say with this team, if you own these services, you can have a release train, but it only applies to your services. Other people have their own release trains. And that also can make it easier to help the people that are most ready to move to release on demand get rid of their release train. But if you have that sort of like cadence set on a program level, and everyone has to stick to that cadence, it means those people that want to move faster are constrained and aren't able to do that. But if you can localize those release cadences within each team, people can be in charge and I think more readily move away from that. And you de-risk the "we've got a distributed big ball of mud" problem.
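The effect of a program-wide cadence versus a per-team cadence is simple arithmetic, sketched below. The function name, the dates, and the 28-day and 7-day cadences are assumptions chosen for illustration; the only rule taken from the discussion is that work which misses a train waits for the next one.

```python
from datetime import date, timedelta


def next_departure(ready: date, first_train: date, cadence_days: int) -> date:
    """Return the first train departure on or after the day the work is ready."""
    if ready <= first_train:
        return first_train
    elapsed = (ready - first_train).days
    trains_needed = -(-elapsed // cadence_days)  # ceiling division
    return first_train + timedelta(days=trains_needed * cadence_days)


# Work finished one day after a train left (hypothetical dates).
ready = date(2024, 1, 2)
program_wide = next_departure(ready, date(2024, 1, 1), cadence_days=28)
team_local = next_departure(ready, date(2024, 1, 1), cadence_days=7)
```

With these inputs the program-wide train holds the change until 29 January, while a team running its own weekly train ships on 8 January. Shortening `cadence_days` team by team is one way to move towards release on demand.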


Daniel (17:31):

Big bang type thing.


Sam (17:33):

Yep. Yep.


Daniel (17:33):

Yeah. And like my father famously says the only thing you get with a big bang is a big bang, right?


Sam (17:37):

Yes, yes.


Daniel (17:39):

Yeah. So I wanted to move on, Sam, a bit to the progressive delivery kind of space. I think I've bumped into you talking about this. I'm a big fan of James Governor's work. I'm sure you are too. What's your thoughts around the whole progressive delivery GitOps kind of thing? I've heard some folks say progressive delivery is sort of more of an extension of continuous delivery. GitOps is just really best practice for Kubernetes. But I see some value deeper than that, but I'm kind of curious what your thoughts are.


Sam (18:00):

Yeah. I mean, for me, progressive delivery is a good umbrella term for a whole bunch of techniques regarding how we roll out software. And I think we've, in the past, thought quite simplistically about how we roll our software out. And really we've been, James talks about this very early on. I think back in 2010, he did a really nice set of articles over at InformIT where he talks about this idea that we've got to separate, in our heads, the act of deployment from the act of release. So for many of us "I've deployed into production" is the same thing as "I've released my software". And so he gives the example of the blue-green deployment as the classic example of separating these two steps. I've deployed my software into production, but it's not live because I've deployed into blue and we're currently live out of green. When I do the switch over, now it's live.


Sam (18:44):

So if you can separate those two concepts in your head, you can start thinking differently about how you roll that software out. So I think progressive delivery as an umbrella term for the different ways in which I might roll my software out makes sense, because we've now got tools and technologies that allow us to go beyond sort of fairly simple things like blue-green deployments. We can now do things more easily, like parallel runs and roll outs and yeah, and arguably parallel runs are basically a form of dark launching. So those sort of techniques having an umbrella term, I think makes sense because then you can point at it and say, "Look, you've got the continuous delivery process and then the delivery process ends in delivery. Well, actually there's nuances around that part of it. So what is your technique going to be for how you roll that software out?"
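The split between deployment and release in a blue-green setup can be modelled as a toy class. This is a sketch under simplifying assumptions: the `BlueGreen` class and its method names are hypothetical, and in a real system the "switch" would be a load balancer or router change rather than an attribute assignment.

```python
class BlueGreen:
    """Two slots; only the slot marked live receives user traffic."""

    def __init__(self, live_version: str):
        self.slots = {"green": live_version, "blue": None}
        self.live = "green"

    @property
    def idle(self) -> str:
        return "blue" if self.live == "green" else "green"

    def deploy(self, version: str) -> None:
        """Deployment: install into the idle slot. Users see no change."""
        self.slots[self.idle] = version

    def release(self) -> None:
        """Release: switch traffic over to the idle slot."""
        if self.slots[self.idle] is None:
            raise RuntimeError("nothing deployed to the idle slot")
        self.live = self.idle

    def serving(self) -> str:
        return self.slots[self.live]
```

Calling `deploy("v2")` leaves users on the old version until `release()` flips the live slot, and calling `release()` again switches back to the previous version, which is the rollback path.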


Sam (19:26):

I mean, for me GitOps is a separate thing in a way, which is using version control as your source of truth, your desired state management, right, which is, "This is what I want my system to look like. I version control that and I've got some tooling that makes sure that those things bridge together." I might use GitOps to implement my progressive delivery rollout potentially, but I kind of try and separate those two things out. I also know I get a lot of people angry because I used to do that with Puppet and Chef and all this sort of stuff and when they, when those sorts of folks saw the GitOps stuff, they said, "We've been doing that for years," and they start frothing at the mouth. They're totally right.
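Desired state management boils down to comparing what version control says should exist with what is actually running, then acting on the difference. The sketch below shows that comparison only. It is not how Flux, Spinnaker, Puppet, or Chef are implemented, and the service names and versions in the usage note are made up.

```python
def plan(desired: dict[str, str], actual: dict[str, str]) -> list[tuple[str, str]]:
    """Compute the actions that would move `actual` to match `desired`."""
    actions = []
    for name in sorted(desired.keys() | actual.keys()):
        if name not in actual:
            actions.append(("create", name))
        elif name not in desired:
            actions.append(("delete", name))
        elif desired[name] != actual[name]:
            actions.append(("update", name))
    return actions


def reconcile(desired: dict[str, str], actual: dict[str, str]) -> dict[str, str]:
    """Apply the plan to a copy of `actual` and return the converged state."""
    converged = dict(actual)
    for action, name in plan(desired, actual):
        if action == "delete":
            del converged[name]
        else:
            converged[name] = desired[name]
    return converged
```

For example, with `desired = {"orders": "v3", "payroll": "v1"}` checked into version control and `actual = {"orders": "v2", "legacy": "v9"}` running, the plan is to delete `legacy`, update `orders`, and create `payroll`. Running the same reconciliation again produces an empty plan.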


Daniel (20:05):

Yeah.


Sam (20:05):

Actually there is value in having a term that allows you to talk about the thing. What do I talk about when I do Puppet? Is it infrastructure automation? Is that really all of this? It's not really just infrastructure automation. And so I have some sympathy for Alexis and the folks at Weave trying to name.


Daniel (20:24):

Yeah. Yeah.


Sam (20:26):

Define it. I think if anything, the unfortunate thing is GitOps has been narrowly viewed as just what you're doing on Kubernetes, which...


Daniel (20:32):

Yeah. Interesting. Even I fall into that trap to be fair, Sam.


Sam (20:35):

Yeah. And yeah, I mean, I was involved in some of the usual conversations around this stuff and historically a lot of these conversations were going on because the story around CI/CD on Kubernetes was pretty poor. And some of the solutions that were coming up in that space really did not fit what I think is kind of good practice around CD.


Daniel (20:53):

Yeah. People were shelling out to kubectl, weren't they for a while? Things like that.


Sam (20:56):

Yeah. Yeah, absolutely. And the lack of awareness of prior art, maybe in that space as well. And so there was an attempt, certainly Alexis and the folks at Weave. You have people like David Aronchick and things as well, he was at Google at the time, now moved onto Microsoft, were really trying to get efforts moving, I'm thinking Glasius as well at Google, were trying to say, "Well, how should CI/CD look in that, given that we've got tools that have new capabilities, there's a bit of a vacuum and a lot of the tooling that works really well in other contexts hasn't embraced Kubernetes yet?" I still think that a lot of the solutions in that space seem to have just missed out a whole bunch of what continuous delivery is. And it's got very, very branch heavy, Daniel, eh, very Git flow.


Sam (21:46):

It's very Git flow. It's very, "Let's not do continuous integration because that's scary." So I do have some fairly firm problems with that side of things. And so I've fallen out with a few vendors. It's one of the benefits of being independent. Right?


Daniel (22:03):

Yes, totally.


Sam (22:04):

I know you get it. And so I think the stuff at least around GitOps, which is like version controlling stuff, is a really good idea. Just get that idea really firmed in. Desired state management. That's a really great idea. Let's show you how you can bring those tools together in a way that works natively in a Kubernetes landscape, absolutely. And I think it's sensible to talk about it, to be honest with you, whether or not it's right for everybody, because at the moment you're having to buy into Flux or Spinnaker or something similar, and that's yet another tool to get your head around. Like for me, it's like if I can get desired state management to stick in people's heads and version control to stick in people's heads, once those ideas are stuck, then joining those two ideas together and now talking about those next steps, it's not that far a journey to take people on.


Daniel (22:48):

No, I like it, Sam. I like it. You mentioned a couple of patterns there, which I'm keen to dig into in just a second, but I wanted to get your thoughts around observability because for me, that's often closing that feedback loop, right? Continuous delivery doesn't end with deployment, as you rightly said. You've got to ask, "Did we hit our KPIs? Are we breaching our service level agreements at an operational level?" And what's your take on observability for microservices, like logging, metrics, all that kind of stuff?


Sam (23:11):

My favorite tweet probably ever, which shows you how sad I am, that sums this situation up really well is by the Honest Status Page. And it says, "We turned our monolith into microservices so that every production outage is like a murder mystery."


Daniel (23:27):

Love it. Yeah.


Sam (23:27):

That's the issue, right? You've got a whole load more sources. I mean, fundamentally it comes down for me to the troubleshooting flow. I mean, yeah, you can talk about gathering trends and it's easy. The big thing is something's gone wrong. What the hell's gone wrong? And in that situation you need to find out what's going on or at least get to recovery and then gather the data you need to fix the problem going forward. So there are some smarter people out there to talk to about this than me. You've got Liz and Charity and Cindy in this space just to name but three. But the thing for me, it does start with some basics of people. And I say log aggregation for me is pretty much the only prerequisite I have for microservices full stop. I say, if an organization, if they are interested in adopting microservices, I say, "Get log aggregation before you do anything else."


Sam (24:12):

And it's for two reasons. One, it's super useful: really early on, the thing that's going to give you so much return on investment in terms of understanding what the hell's going on is having good log aggregation. The second reason is that implementing a log aggregation solution is not, in the grand scheme of things, hard. If you, as an organization, cannot do it, I'm sorry, microservices are too hard for you, because it requires some joined-up thinking, getting operations, whoever handles operations, to roll out some small, very small changes. I mean, we're talking about running one logging daemon where your microservices live and pushing it somewhere centrally. This is not difficult work. So if you can't do that as a simple prerequisite, that's probably a sign that other things are going to go wrong. Once you've got log aggregation in place, and it's simple things like standardizing log formats, the fundamental kind of mindset shift is moving away from this model where you go to a computer to ask if it's okay, because that computer is now potentially a container.
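
As a rough illustration of standardizing log formats so that a logging daemon can ship them somewhere central, the Python sketch below emits one JSON object per line using the standard `logging` module. The field names (`timestamp`, `service`, `level`, `message`) and the helper `build_logger` are assumptions chosen for this example, not a format the speakers prescribe.

```python
import json
import logging
import sys
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render every log record as a single JSON line with a fixed set of fields."""

    def __init__(self, service: str):
        super().__init__()
        self.service = service  # hypothetical service name stamped on each line

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "service": self.service,
            "level": record.levelname,
            "message": record.getMessage(),
        }
        return json.dumps(entry)


def build_logger(service: str, stream=sys.stdout) -> logging.Logger:
    """Create a logger that writes JSON lines to a stream a log shipper can read."""
    logger = logging.getLogger(service)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()  # avoid duplicate handlers if called twice
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter(service))
    logger.addHandler(handler)
    logger.propagate = False
    return logger
```

If every service writes the same fields to standard output, the aggregation side can index them without per-service parsing rules.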


Sam (25:16):

It's ephemeral. It might not be there. It might be there, in which case great. But you have to assume the machine may have gone. So you've got to get the data from those machines to you. That's why log aggregation is so important. In the same way, if you are relying on stats, gathering your stats somewhere centrally in something like Prometheus, if you really must use Prometheus, gathering that data centrally, that's super important. And once you're aggregating all of that data, the next sort of problem people hit is around performance and understanding where time is being spent. So distributed tracing obviously is super important here. That for me is not something I do day one. If I've got log aggregation in place and basic stats gathering in place, then I start getting worried about my latency. Then I'm going to look for distributed tracing tools, but I would do correlation IDs before that. And we've talked about that stuff in the past. Once I've got my log aggregation in place, every single thing should have a correlation ID. Get that in.
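
A minimal Python sketch of the correlation ID idea follows: reuse the ID a caller sent, mint one if the request has none, attach it to every log record, and pass it on to downstream calls. The header name `X-Correlation-ID` and the helper names are assumptions for illustration; real stacks often do this in middleware.

```python
import contextvars
import logging
import uuid

# Hypothetical header name; pick whatever your organization standardizes on.
CORRELATION_HEADER = "X-Correlation-ID"

_correlation_id = contextvars.ContextVar("correlation_id", default="-")


def start_request(headers: dict) -> str:
    """Reuse the caller's correlation ID if present, otherwise generate one."""
    correlation_id = headers.get(CORRELATION_HEADER) or uuid.uuid4().hex
    _correlation_id.set(correlation_id)
    return correlation_id


def outgoing_headers() -> dict:
    """Headers to add to any downstream call so the ID follows the request."""
    return {CORRELATION_HEADER: _correlation_id.get()}


class CorrelationFilter(logging.Filter):
    """Attach the current correlation ID to every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.correlation_id = _correlation_id.get()
        return True
```

With the filter added to a handler, a format string such as `"%(correlation_id)s %(message)s"` lets you pull every log line for one request out of the aggregated logs.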


Sam (26:07):

Once you've got that in your microservices stack, the places where you've got the hooks for your correlation ID generation and logging are where you put your hooks for distributed tracing. So that becomes a sensible thing. So that's my general case progression of how I build that platform. The bigger issue then, another one, is, "Is it working?" And I think with simpler deployment topologies, our view about "is it working" is really that we don't look to say, "Is it working?" We look to say, "Are there any problems?" We look for the presence of errors. We look for a red flashing light on a dashboard. We look for a CPU being redlined. And with a fairly simple system, I mean, simple in terms of deployment topology, the presence of an error, like a redlined CPU, is often a good proxy for whether there's anything broken. It does mean something. The presence of an error ceases to be as meaningful when you have more moving parts. Instead, you really need to ask the actual question, which is, "Is it working?"


Daniel (27:03):

Yes.


Sam (27:03):

I might have errors. This CPU might not be happy. I might be having an out of memory killer over here. This container's thrashing all over the place, but can people still sign up? Can we still make money? And then as your sources of noise increase, you do then start needing to move into sort of semantic monitoring, making value statements about your system that have to be true. And then to an extent, the errors are things that you might investigate if those value statements aren't found to be true. And that takes a while, I think, to move through that progression, because often, especially in more established enterprise organizations, the role of monitoring and alerting is so siloed that those people don't have awareness about what good looks like from a business context.
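
The semantic monitoring idea can be sketched in a few lines of Python: instead of alerting on individual errors, state what must be true for the business and check that. The metric names (`signups_last_5m`, `orders_paid_last_5m`) and thresholds below are hypothetical, chosen only to mirror the "can people still sign up, can we still make money" questions.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ValueStatement:
    """A statement about the system that has to be true for it to be 'working'."""
    description: str
    holds: Callable[[dict], bool]


# Hypothetical statements and metric names, for illustration only.
STATEMENTS = [
    ValueStatement("people can still sign up", lambda m: m["signups_last_5m"] > 0),
    ValueStatement("we are still taking money", lambda m: m["orders_paid_last_5m"] > 0),
]


def broken_statements(metrics: dict, statements=STATEMENTS) -> list[str]:
    """Return the descriptions of value statements that do not currently hold."""
    return [s.description for s in statements if not s.holds(metrics)]
```

A redlined CPU with both statements holding is something to investigate later; a quiet dashboard with a broken statement is the real outage.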


Daniel (27:53):

Yeah. That's the key thing. A business context is key.


Sam (27:55):

Yep. Yep.


Daniel (27:57):

Yeah. Great. You mentioned a bunch of industry names. I'll definitely link them in the show notes, too, because there's Charity, Cindy, and Ben Sigelman, who I'm a big fan of too. They've written lots of good stuff around this. I'm starting to realize it's not as easy as I think sometimes, this notion of observability. Right?


Sam (28:11):

Yeah. And I mean, again, it's like progressive delivery in a way, right? It's a good umbrella term for a whole bunch of new thinking in this space. Even if a lot of that new thinking is old ideas being packaged together, it's still at least a term on which we can hook articles and have conversations. So I think it's as good a term as any we've got.


Daniel (28:30):

I like it. I like it. So in the sort of final five or so minutes we've got here, Sam, I wouldn't mind having a look at the Monolith to Microservices book.


Sam (28:37):

Yeah.


Daniel (28:38):

The kind of thing that stood out in the book when I was reviewing it was "evolutionary patterns to transform your monolith", and the "evolutionary" and the "patterns" really jumped out to me. I like this idea of how you pitch it, a lot about iterating, evolving, not trying to do big bangs. I've heard you talk a lot about patterns of late. What would your go-to pattern be in terms of evolving a monolithic system toward microservices, say?


Sam (29:00):

I mean, they're all a bit different. I would say maybe pattern is not the right word. It's the metaphor that I buy into. I'm trying to get people to buy into this. And that's, like I say, that adopting a microservice architecture is not like flicking a switch. It's like turning a dial, turning the dial on your stereo to get to the volume you like. You turn it up till you find your happy place and then you go from there. And that's the same thing in microservices. You've got to buy into this idea that it is going to be a progressive journey. There are some patterns that are really easy to implement if your architecture fits those problems. The strangler fig application pattern springs to mind, right, which works really well if you can do call interception. So it's basically a way where a call's going to come into your existing monolithic system. This is typically used underneath the UI, although you can do it in the UI as well.


Sam (29:44):

And rather than that call to request some functionality being served by the monolith, if that functionality has been migrated to a microservice, that call is intercepted and diverted. So it's worked really, really well with HTTP-based systems. Although I spoke to a team called Homegate in Switzerland that actually used this with FTP, and I've seen it on messaging as well, messaging interceptors and things like that. And the nice thing about that pattern is that it works really well because you don't actually have to change the monolith. It's almost unaware that anything's happening. It might be aware it's getting fewer calls, as they're now being diverted away somewhere else. And that pattern was actually mostly used, and I've used it multiple times, when doing rebuilds of existing system stacks, moving from monolith to monolith, and it works surprisingly well. There are some areas where it doesn't work well.
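
The call interception at the heart of the strangler fig pattern can be sketched as a small routing layer in front of the monolith. This Python example is an in-process stand-in, assuming plain functions in place of real HTTP services; `StranglerProxy`, `monolith`, and `orders_service` are hypothetical names, and in practice the interception would usually live in a reverse proxy or API gateway.

```python
from __future__ import annotations

from typing import Callable

Handler = Callable[[str], str]


def monolith(path: str) -> str:
    """Stand-in for the existing monolith, which is left unchanged."""
    return f"monolith handled {path}"


def orders_service(path: str) -> str:
    """Stand-in for a newly extracted microservice."""
    return f"orders microservice handled {path}"


class StranglerProxy:
    """Intercept calls and divert migrated paths; everything else falls through."""

    def __init__(self, fallback: Handler):
        self._fallback = fallback
        self._migrated: dict[str, Handler] = {}

    def migrate(self, prefix: str, handler: Handler) -> None:
        """Start diverting calls under this path prefix to the new handler."""
        self._migrated[prefix] = handler

    def rollback(self, prefix: str) -> None:
        """Send calls back to the monolith if the migration needs reverting."""
        self._migrated.pop(prefix, None)

    def handle(self, path: str) -> str:
        for prefix, handler in self._migrated.items():
            if path.startswith(prefix):
                return handler(path)
        return self._fallback(path)
```

Because diverting a path is a routing change rather than a change to the monolith, rolling back is equally cheap, which is part of why the pattern is easy to de-risk.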


Sam (30:28):

So you've got to be able to intercept the call coming into the system and move that call's functionality effectively over into the new microservice. So if what I'm trying to migrate is more of a side effect, that's not going to work. So the example I think I give in the book is I might have a call coming into a system that says, "Place order." I could intercept that call. As a result of placing the order, I might want to award some loyalty points to you for buying so many Justin Bieber CDs or whatever it is you're going to buy. I don't know. That functionality is triggered as a side effect of the order call coming in. I can't get hold of that loyalty points awarding functionality on the edge of my system.


Daniel (31:07):

Yeah. Yeah.


Sam (31:08):

So, but I could get the orders and then have the orders call back into the monolith to do it. So there's that pattern. That's really such a straightforward pattern. And it's a nice one to start with because it's an easy one to roll out. It's very easy to de-risk. It conceptually makes a lot of sense and it can also work well in situations where you have different people doing the microservice migrations from the people who work on the monolith system, because they don't actually have to coordinate too much.
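
One reading of the workaround described here is that the order functionality is extracted first, and the new order service then calls back into the monolith for the loyalty points side effect that still lives there. The Python sketch below illustrates that reading only; the class names, method names, and the figure of 10 points per order are all hypothetical.

```python
class Monolith:
    """Stand-in for the monolith, which still owns loyalty points."""

    def __init__(self):
        self.loyalty_points = {}

    def award_loyalty_points(self, customer: str, points: int) -> None:
        self.loyalty_points[customer] = self.loyalty_points.get(customer, 0) + points


class OrderService:
    """Extracted order microservice that calls back for the side effect."""

    def __init__(self, monolith: Monolith, points_per_order: int = 10):
        self._monolith = monolith
        self._points_per_order = points_per_order  # hypothetical business rule
        self.orders = []

    def place_order(self, customer: str, item: str) -> dict:
        order = {"customer": customer, "item": item}
        self.orders.append(order)
        # The side effect cannot be intercepted at the edge, so call back in.
        self._monolith.award_loyalty_points(customer, self._points_per_order)
        return order
```

In a real system the callback would be a network call or a message, and the loyalty logic could be migrated later once it is reachable on its own.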


Daniel (31:34):

So wrapping up, I guess, what do you think the future of developer experience is going to look like in, say, five years' time? Are we all going to be building our nanoservices, microservices, macroservices, call them what you will? How are we going to be building those, do you think? Will we have more platforms, or will Kubernetes get pushed down into these platforms? What do you think the future looks like?


Sam (31:52):

Oh, I hope no one knows Kubernetes exists in about five years. I mean, look, Kubernetes is not a developer-friendly experience. It just isn't. It's just not, in any way, shape or form. And the tooling's got a lot better in this space. I remember this fantastic talk by Al at GOTO Berlin on all the tools that make developer experience and Kubernetes effective, and it's a brilliant talk and they know their stuff, but it was like, "I need to lie down."


Daniel (32:19):

It's too much. Yeah.


Sam (32:21):

So for me, there's almost two things. There's the logical architectural style stuff, and then the physical deployment experiences. And the cloud function abstraction is the best new developer-friendly abstraction I've seen for deploying software since Heroku. Right? Heroku's still the gold standard of PaaS as far as I'm concerned. It's still brilliant, even if maybe the software itself hasn't kicked on. The cloud function concept is brilliant. The current crop of executions leaves a lot to be desired in terms of usability. I mean, you see how Heroku is, and then you compare that to, say, Lambda. Lambda sucks, from a developer experience point of view, compared to, say, Azure cloud functions. You look at debug cycles, debug processing on cloud functions on Microsoft Azure. It's amazing. I can run a function in the cloud and debug it from my laptop.


Sam (33:13):

I can have a function running on my laptop, but triggered from a cloud function from Azure, and debug it locally. I mean, Microsoft is insane. The durable functions they've got, but even then it's still a train wreck, right? Comparatively. So for me, that's what we're going to have. That is the function, however big you want to define a function as being. It's like it's quantum, it's like UNA texts. That as a deployment model makes perfect sense. It's what most developers need. What we need are better executions on those models. I am, as a result of this, a bit disheartened to see what's happened with Knative, as in Google taking their ball away, not playing with the rest of the kids anymore, because in my experience, Google haven't always done a great job when they develop these things in isolation. They've got some great engineering, but they don't necessarily always, I think, have a good awareness about how these things actually get used.


Sam (34:03):

And like if you saw early generation Kubernetes, which was based on internal Google concepts and ideas, people were like, for about five years, "What the hell is a pod?" Like, it makes sense in Borg. I get that. You're not using Borg. So for me, that is some kind of function primitive for most software delivery. The question really is going to be, the public cloud stuff is already head and shoulders above what you can run on premise, and so are the ancillary serverless offerings that support that sort of cloud function delivery. I just still think we're on a bit of a fool's errand thinking we're actually going to be able to recreate that quality of experience in the private cloud, but a lot of people are going to try.


Daniel (34:48):

Yes. Oh yes. Well, a consultant's got to make money, Sam, right?


Sam (34:50):

Absolutely. I don't do that kind of work, so I can think stuff like this. If I did, I'd be quiet. I mean, it is the function. That is the way forward and yeah, there are things to be worked out, but I think we're very much gen one with that stuff. And I'm feeling it's a good primitive that I could understand enough of how it maps to actual executions. We've got so many more options around how to run container workloads now. Really some impressive stuff has been done in that area. And again, as an old Unix programmer, it pains me to say this, an awful lot of that smart thinking has come from Microsoft in this regard. So long may there continue to be a plurality of ideas around this stuff. And maybe Google will realize that going all out with Knative was a bad idea, but who can tell?


Daniel (35:38):

Only time will tell, Sam. Only time will tell. All these insights have been fantastic. Really enjoyed chatting to you, as always. Learned a bunch. Thanks so much for joining today, Sam.


Sam (35:44):

You are welcome. Thanks, Daniel.
