Monsters in the Cloud? Yes! They do exist and if you're running apps in the cloud, it's likely that you've already been bitten by one and may not even know.
Mike and Tobias Kunze from Glasnostic talk about the never-ending Day-2 in the cloud and what cloud operators and software developers need to pay attention to in order to identify and mitigate the operational monsters that can wreak havoc in your cloud environments.
Ep4 - Tobias Kunze - "Operational Monsters in the Cloud"
Mike: Today we begin episode four with a quote from the movie Pirates of the Caribbean, which reads: "You're off the edge of the map mate, here there be monsters". Why a quote like this? Because today we're going to focus on some of the operational monsters that lie in wait for us in the cloud.
My guest today is Tobias Kunze. Tobias is a serial cloud entrepreneur. His first startup company was an early innovator with Platform-as-a-Service (PaaS). In 2011, that company was acquired by Red Hat, which has since rebranded the portfolio into the OpenShift Platform. So, for those of you running Docker containers managed by Kubernetes, you may already be using some of this guy's earlier work.
Today, Tobias is the CEO and founder of Glasnostic based in Menlo Park, California. His latest company deals with cloud operations, services, and application control. But before I start talking about how all this ties with operational monsters, I want to welcome Tobias to our program today. Glad to have you on the show!
Tobias: My pleasure. Thanks, Mike.
Mike: Great. I'm really delighted to have you here. I think you're working on some interesting technologies and that's what we're here to talk about, it's the technologies that run the whole cloud operations.
Mike: And I did mention your company deals with CloudOps, that sounds distinctly like Day-2 operations. What does that mean?
Tobias: It absolutely is, right? We do Day-2 operations. And what that means is whatever happens once software is running. You may recall Day-1 is really the configuration, the deployment, the initial setup of software. And then once it's running the only thing we really do is monitor. Maybe trace things. But when something goes awry in whatever way, if we catch it, the only thing we can really do is roll back.
Mike: How long does Day-2 last? It depends, I guess. How long does it last in a normal situation?
Tobias: So, in a small environment, Day-2 lasts from when something is deployed until it is redeployed and a new Day-2 starts. But if you think about it, you have 10 or 20 or 400 teams deploying in parallel. Day-2 is always. You deploy something new, it's the same Day-2, because you're connected to everything else that's running. There's not really a single application anymore.
Mike: Okay. So just to say what's going on here is Day-0 is when you're setting up what you're going to build, you build the stuff that you're going to build. Day-1 is when you release it. And then Day-2 is everything that happens until that application is redeployed or gotten rid of.
Tobias: Exactly! Build, deploy, run.
Mike: You focused this company on application control. What brought you this direction?
Tobias: A very simple thought, really. The stack that we work with grows taller and taller. So, the networking level is pretty far removed today from what applications do at the top. Think about it. You have many pieces running in many places today. They are all connected directly or indirectly.
Mike: Haven't we always done that? I mean, we've always had three-tier applications, right?
Tobias: Yes. Three-tier applications are very small. You know who they are, you know what the tiers are. You build this as you always build applications, you create the blueprint, you develop it, you release it, you know exactly how you're going to iterate on it.
If you have a landscape that consists of many of these applications that are connected, because guess what, the interesting bits are always outside of your application, so you have to call into the dependencies, then the situation is way different, because now you're running in different clouds, you're running in different VPCs, you're running on premises, hybrid, maybe on edge, and all these pieces need to work together in a predictable fashion. And that is where the application level really matters. Here we can't really try to fiddle with things on the network level anymore.
Mike: Why not? Why can't you just fiddle with it to make it work? And besides you can always change the code, can't you?
Tobias: That's one of the things that engineers always like to think, right? Something happens, oh, let me look at it, let me spend a couple hours and figure out what's going on, and then I'm going to release a patch. So, my fix will go into production like hours later. That of course doesn't really work anymore if you have hundreds of these things running, potentially hundreds of fixes being deployed all the time. And keep in mind, this is a very dynamic environment, right? It is rapidly evolving; we're not just scaling out. We are releasing new versions of something. We are putting new infrastructure next to existing infrastructure, then turning the old one off and so forth. All these cloudy processes really create and amp up the dynamism of the entire landscape.
Mike: I liked the cloudy infrastructure here because that's exactly what it's doing because it has gone to such an n-tier plus that there's so many other things taking place. And because of that, the applications are more difficult to figure out where things are going.
Mike: So where do the monsters come into play? What monsters are out there? Help us understand.
Tobias: You've heard about the "Noisy Neighbors". Even in the old Linux top or Unix top, there's "Steal Time" in there. There are things we don't see. We are presented with a virtual world and outside of that, there's all kinds of things happening.
Mike: What is a "Noisy Neighbor"?
Tobias: Oh, very simple. I'm calling a database and somebody else is hammering it. That's a Noisy Neighbor. I don't get the quality of service I expect because somebody else is doing something with it.
Mike: What can you do about it?
Tobias: Well, you as a coder can't, but the operations side of course can and then balance the resource allocation properly.
Mike: Okay. So, before we get to what you do with some of these, let's go through some of the others, then the other monsters here. You mentioned the Noisy Neighbors. What other things are out there?
Tobias: Well, we've heard of the "Thundering Herds", right? That brought Robinhood down a year ago. There are some flowery names around this, like "Retry Storms". Everybody's seen those.
Mike: Let's talk about Robinhood. What happened in that case? Because you're right. Robinhood as everyone knows is a huge investment capability, it allows you to buy and sell stocks. What happened in their outage?
Tobias: Very high level and I'm sure we probably don't know the entire picture either, but the public message was our DNS services got overrun, too many people at the same time trying to hit them, they went down, and it took time to get them back up. That's always the story, something like that.
Clearly, if you had the ability to exert a little bit of back pressure, like slow things just a little bit down, not drastically, just a little bit.
Mike: Isn't that counterintuitive? Don't we always want things to go faster? Why apply back pressure?
Tobias: Yes, you can step on the gas pedal and let things go faster. And you're going to end up in the wall. What you really want to do is control things, have things communicate with each other in a controlled way. And that controlled way is really a top-down view, it's a managerial perspective where we say, well, we are managing to a certain SLO (Service Level Objective). We want to make sure the services over there have enough space, enough head room, and there's enough capacity here and so forth.
We need to be able to interrupt feedback loops. Developers are super familiar with the term Circuit Breaking. That same concept happens on the operational side, when I need to just disentangle some direct couplings that bring the business down. And those are feedback loops, the cascading effects, the compound ripple effects and so forth.
Mike: Those are starting to sound the same. It sounds like they're all convoluted here. Is that a fact?
Tobias: Well, if you look at the tip of the iceberg, you've got a couple of names. Below that there aren't a lot of names because frankly they're really complex. If you shine a light on them, you'll always realize that it's highly non-linear chains of events. You have something that, you know, may start out as a CPU issue here, it becomes a latency issue over there, becomes a retry storm, then turns into asymptomatic slowness somewhere else. Meanwhile, your systems throw thousands of alerts all the time.
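For readers less familiar with the term, circuit breaking can be sketched in a few lines. This is only a minimal illustration of the general pattern, not anything described in the conversation as a product feature; the names `CircuitBreaker` and `CircuitOpenError`, the thresholds, and the injectable clock are all assumptions made for the example.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker refuses a call without attempting it."""


class CircuitBreaker:
    """Hypothetical breaker: opens after repeated failures, retries after a cool-down."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures   # consecutive failures before opening
        self.reset_after = reset_after     # seconds to wait before a trial call
        self.clock = clock                 # injectable so the sketch is testable
        self.failures = 0
        self.opened_at = None              # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Still cooling down: fail fast instead of feeding the loop.
                raise CircuitOpenError("circuit open; call skipped")
            # Cool-down elapsed: allow one trial call (half-open state).
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0  # any success closes the circuit fully
        return result
```

The point of the sketch is the fail-fast branch: once a dependency is clearly struggling, callers stop adding load to it, which is what breaks the feedback loop.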
Mike: I think you said something here that's very important because as you shine the light on it, you find out it may not be the monster that you think it is. It may be something cute and furry under your bed, but the fact is the monster may be a little bit different or in a different place. And doesn't that all come down to having the proper visibility, knowing what's out there and being able to see what's there?
Tobias: Exactly. What happens in Day-2 is largely unpredictable. So, the instrument cluster in your cockpit, like the tracing, the debugging tools you have, don't really help you there. You're only looking at one "flight". You need to have a radar screen. You need to get the full global visibility of who is affecting whom, who is talking to whom. Should they be talking? Is what we are seeing here what we expected to see? Is it plausible? These kinds of questions need to be answered to guide us to where we actually want to exert control.
Mike: Okay. Let's take this analogy that you're using here because I am a pilot. And so, if I'm up flying from one place to another, traffic control really doesn't care why I'm flying from one place to another. They just want to separate me from the other airplanes. Is it the same thing that you're talking about here?
Tobias: 100%! You as a pilot, your responsibility is to get the plane from the ground at point A to the ground at point B. And that's your only and top priority. For air traffic control, your flight is not their responsibility, the airspace is. So, in the same analogy we are not looking at what a single application does, what a single thread of execution does, because monsters are outside of the thread of execution. We are looking at the cloud operations as a whole. We have a holistic view and need to make sure that all the services happening there don't step on each other, don't fall over.
Mike: Now, again, we do have visibility to a lot of this stuff. But you just said services. That's a little bit different because we have packet visibility. We have information about what's going on. That could be observability. We've got log information. We've got all that at our fingertips, and that helps us. But what you're talking about, the interaction between the services, is at the service level, is that correct?
Tobias: These applications are really just a handful of services that may loosely belong to each other, maybe share a git repo. But they're embedded in hundreds of other services. Because in the old world, when we did three-tier or two-tier applications, when we needed to update the data, we would have an ETL process on the back end periodically changing how we operate, right? But today that's 5,500 other applications. And those applications are different technologies, different life cycles, different departments, joint ventures, could be external partners. They're all over the place.
Mike: So, I have to ask the question, how do you see this kind of information? What technologies are you using? What technologies are available? It sounds like there must be a huge, huge database if you're going to interconnect every service? It's like taking everything that's in every airplane. Is that the way you're doing it?
Tobias: In an old J2EE environment, you would look at the class loader. The class loader today is in the network because everything's connected. That's the unified medium. The lingua franca is the wire data, the wire exchange. So, we are looking at the network, but we're not looking at the network from, like, a ThousandEyes or a Kentik perspective when we look at what is appearing, what's the routing and so forth. We are looking at it at the highest level of the SDN, as close as possible to the actual application endpoint. And then we measure component behaviors.
Mike: It sounds more like policy-based routing or something like that?
Tobias: Absolutely! If we detect something that's pathological, we may do policy-based routing in the sense that we apply back pressure against that service.
Mike: Okay. Back pressure. Talk about back pressure.
Tobias: It will always be some form of reduction of capacity or shaping of capacity. Back pressure is really you have your eyes set on the downstream that you want to back pressure against to relieve pressure from a system that's upstream. You could make sure there's quality of service, which is really a rebalancing, a calibration of resource utilization to downstream services.
If you leave workloads unrestrained in your environment, they can do whatever they want. They can completely kill your performance because all of a sudden, they go into a tailspin or they spew tracing logs, or, you know, kill your bandwidth. These things happen all the time. So back pressure here really means to very surgically limit what a single endpoint or a group of endpoints does. And so, you literally push against them. You rate limit what can happen. And that's a stabilization strategy or tactic, and it's the Swiss army knife of everything. In fact, I make it a habit to read every public postmortem that companies publish on outages, and 99.9% of all of them could have been solved with just being able to back pressure.
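The "rate limit what can happen" idea can be illustrated with a token bucket, one common way to rate limit an endpoint. This is a minimal sketch under stated assumptions, not a description of how any particular product does it: the class name `TokenBucket`, its parameters, and the injectable clock are hypothetical and chosen only for the example.

```python
import time


class TokenBucket:
    """Hypothetical per-endpoint limiter: admits `rate_per_sec` requests on
    average, with short bursts of up to `burst` requests."""

    def __init__(self, rate_per_sec, burst, clock=time.monotonic):
        self.rate = float(rate_per_sec)
        self.capacity = float(burst)
        self.clock = clock            # injectable so the sketch is testable
        self.tokens = float(burst)    # start with a full bucket
        self.last = clock()

    def allow(self):
        """Return True if one more request may pass right now."""
        now = self.clock()
        # Refill in proportion to elapsed time, never beyond capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # caller should delay, queue, or shed this request
```

What happens to a request when `allow()` returns False is a design choice: delaying or queuing it slows the sender down gently, while rejecting it outright sheds load more aggressively.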
Mike: Really? That's a big statement. So, in the banking industry when everybody comes in to look at their paychecks, sometimes what happens is the bank thinks they are in a Denial-of-Service Attack because everybody's hitting the bank at the same time. And in your case what you're saying here is things can be done differently.
Tobias: Absolutely! Shape the demand for a certain amount of time. Meanwhile, you can scale up. So, your scaling actions don't create new ripple effects.
Mike: So, you ease into the situation.
Tobias: You ease it in because now you can shape. As soon as you have runtime control, a lot of these monsters are not really a problem anymore.
Mike: But to have runtime control, I think the big part of that is the C part, the control. If you're going to do this, you're going to have to do it actively, correct? So, does that mean that you're all inline?
Tobias: Absolutely. Absolutely.
Mike: What else can be done other than back pressure? What do you do?
Tobias: You can have bulkheads inserted. As I mentioned earlier, these landscapes are extremely unpredictable. And by inserting well defined bulkheads between.
Mike: Explain these terms because not everyone on this podcast will be familiar with them. What do you mean by bulkheads?
Tobias: So, bulkheads, Mike, come from the shipbuilding engineering space, where you don't want anything that penetrates your hull to sink the ship. So you insert small walls into the hull so that any compartment that gets flooded only floods that compartment and doesn't sink the ship. So, let's say it's a preventive, very simple, very effective mechanism.
Mike: And how do we apply that to the cloud?
Tobias: We can do the same thing in the cloud. Think about it. There's configuration drift all over the place and you don't even know it. You run something over there in this zone, you run something in that zone, and they shouldn't be talking to each other. But of course, critical services should be able to fail over. In order to prevent drift issues, you may decide you want to have a bulkhead between those two zones that only allows a certain amount of traffic between these systems, interactions really, to occur. And everything else will be automatically slowed down so much that the drift will be apparent immediately. So, this is really a symmetric way of protecting each zone from the other. Because if you can't do that, what typically happens is you notice that this happened when you get the bill at the end of the month, and that's too late. Meanwhile, your engineers have spent a lot of time trying to figure out why things are so slow.
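In code, the bulkhead pattern is often sketched as a cap on how many interactions may be in flight across a boundary at once. The example below is a minimal illustration with hypothetical names (`Bulkhead`, `BulkheadFullError`); note that it rejects excess calls rather than slowing them down as described above, which is a simplification made to keep the sketch short.

```python
import threading


class BulkheadFullError(Exception):
    """Raised when the compartment has no free slots."""


class Bulkhead:
    """Hypothetical bulkhead: at most `max_concurrent` calls may cross it at once."""

    def __init__(self, max_concurrent):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, fn, *args, **kwargs):
        # Non-blocking acquire: excess traffic is turned away immediately,
        # so a flood in one zone cannot tie up the other zone's resources.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError("compartment full; call rejected")
        try:
            return fn(*args, **kwargs)
        finally:
            self._slots.release()


# One bulkhead per pair of zones; the zone names are illustrative only.
bulkheads = {("zone-a", "zone-b"): Bulkhead(max_concurrent=2)}
```

Because the limit is attached to the pair of zones rather than to either service, it protects both sides symmetrically, and unexpected cross-zone traffic shows up quickly as rejections.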
Mike: Okay. So that's the key. What are the risks there? There have to be risks because now you could bring down something you don't want to bring down or slow up something you don't want to slow down. How do you mitigate that?
Tobias: Yeah, there's a people risk. People doing something wrong, obviously. So that's a permission issue, do you want to allow people to do that. The technical issue is solved today. Our technology is very flexible. We can use other insertion points. We can insert ourselves. It depends on the customer.
Mike: Tobias, I want to back up a second because we've been talking about these monsters, however, in reality, the whole cloud infrastructure has given us so much. It allows us to scale. It allows us to roll things out so much quicker. These are all positive things. How big are the monsters in the big picture here?
Tobias: They increase linearly at least, if not exponentially. There are three characteristics you need to look out for, and you want them because those are a sign of success. It's complexity, it's utilization, and it's change.
Mike: Those are the three primary constructs here. Okay. Continue on.
Tobias: So, complexity means you're running a lot of services. Very likely on totally different lifecycles. That's a reflection of what you do as a business. So, more services mean your business does more. You may refactor over time, turn services off and so forth, but that just means you're upping the rate of change.
And then if these systems are really active and highly utilized, that means they're doing something. So, you want that, but you don't want idle systems hanging around, maybe just waiting on the next quarterly report or something like that. You want these things to be active. And as they're active, they exert more of a load and demand on the entire architecture. So that's really the complexity, large scale nature and the connectivity between systems. It's the high utilization and the rapid evolution of your entire landscape.
Mike: And it sounds like the monsters will be bigger or more prevalent in the larger infrastructures. So, if you have a smaller infrastructure that you're doing, you know, native cloud, you have a handful of developers. These aren't going to be the big issues.
Mike: What type of infrastructures are you seeing, where you need to exert back pressure, where you need to do the bulkheads? Where are you seeing that really take place?
Tobias: In an abstract way, anything that has like 50 or more services, logical services, running. And they're fairly highly connected. And you have at least something like five teams working on that landscape. Now add scaling to that, add multi-region setups to that, add cross-region hybrid setups to that, right? That increases the unpredictability of the entire landscape. So, beyond that it depends. If you're maximally utilized, it becomes exponentially more difficult to get it right. If you are only looking at 5% utilization in your architecture, nothing is ever going to go wrong, you can basically do anything.
So, it's a very important aspect. How big is your utilization? Of course, it's also directly related to your cloud cost. So, the pressure from the business is to get high utilization, meaning no idle resources.
Mike: What companies would most likely fit these scenarios? Give me some types of companies.
Tobias: So, our bread and butter are service providers, managed service providers, cloud service providers, hybrid service providers, who essentially run a lot of other people's workloads.
Mike: Highly complex, very large,
Tobias: Highly complex, very unpredictable. Their customers' workloads do all kinds of weird things, and the customers don't even know. So how do you control that? How do you run this safely in a multi-tenant environment? Or even if it's dedicated, there will be shared infrastructure underneath in some shape or form.
Mike: What about government infrastructures? DISA? There are some huge applications that take place in those environments. Are those areas that should be of concern?
Tobias: Absolutely! All these governmental infrastructures are multi-region, they built their own clouds. They tie data centers together to make them look as one, so you get unpredictability right off the bat, because now I don't know where it's going to run, but it's behaving very differently physically. So yes, absolutely. The additional issue with governments is that you often have to have tenant separations, mandated separations between departments. And then levels of secrecy. So, it gets very complex, very quickly.
Mike: Just to recap here, it sounds like where we need to be most cautious about these types of things is when it goes to scale. And that could be a large company. It could be a telco, it could be a government situation, it really doesn't matter. But it comes down to scale and complexity.
Mike: So, Tobias, we've talked about how this works for the applications. How does it really change things for the developers?
Tobias: Developers are an interesting breed because they're insulated from what's going on in Day-2. By and large they're working in Day-0, they're building things. They're sometimes taking part in the release cycle, that's the Day-1 piece, but whatever happens there they only hear from the monitoring people, right? Oh, something's gone wrong here, can you diagnose it?
Mike: They've got their heads buried in another area. So how do you bring that together?
Tobias: So, what they do feel, though, very directly, what very directly affects them, is that they release some piece of code. It works, it's unit tested, it gets released. And one out of whatever, 3, 4, 5 times, something else breaks. So, somebody else gets mad because your code broke something for whatever reason. You stepped on something and it's not in your IDE, it's not in the staging environment, it's in production as it happened today.
Mike: But it worked fine in test.
Tobias: Exactly, it works until it breaks. And that is an experience that almost every developer now has. Now the key question here is really, do you think you can fix this in code? And the very clear answer to that is no, you can't. And if you did fix it in code, you would cause another problem tomorrow, because fixing it in code means making your code more specific, which means it's less resilient, and now it's causing another outage or another degradation tomorrow.
So that is the paradigm shift where we need to realize that in the old world, we really had to deal only with two things. If we had correct code and enough resources, our production would be fine. And now in the new world there's this whole new component and that's the dependencies. Stuff you depend on, others depending on you. The general connectedness, the entanglement. And this is all stuff you don't own as a developer.
Mike: Tobias, this is a vital point that could easily be missed. In cloud environments, programmers no longer control everything in code. Years ago, when I was in charge of a healthcare master provider system, I was personally responsible for every change to every doctor, pharmacy, or hospital across the system. And just like you're saying, I did it all in code because it was a closed environment. However, what you're saying is today's cloud environments intermingle clouds and services, all working together, and there could be thousands of these. So just as you stated, individual developers no longer own the entanglement. In fact, they can't. That's why operational control has to be performed outside of the developers at a level above the codebase. As a result, technologies such as yours perform traffic control across the systems. They're looking for monsters and when they find them, they apply back pressure to keep the traffic flowing.
I think this is great information today for our listeners and if they want to learn more about this topic, where should they look?
Tobias: Our website is glasnostic.com like Glasnost and Perestroika for the younger folks. Or email me at firstname.lastname@example.org and also my Twitter is open. So, it's @tkunze
Mike: Great. I'll list information on your blogs; you have some cool topics there. I'll just throw out a couple. There's one about Kindergarten Ops, How to Control Application Behaviors. I also love Mission Control for Microservices. And of course, if people have questions about what we're talking about today, the Gigamon Community has lots of different options for you to ask questions and get answers.
So, with that, I want to thank you very much today, Tobias. It's been a lot of fun and I think we've got some cool things going on here. Thanks for everything you've done.
Tobias: Thanks for having me, Mike.
Hybrid/Public Cloud Group: Gigamon technical Community