[Video Podcast] K8s Maxxing with AI-Native Platform Engineering Stack with OpenChoreo
OpenChoreo is a developer platform for Kubernetes that lets developers and AI agents build, deploy, and operate apps, resources, and agentic workloads. It provides development and platform abstractions, a Backstage-powered developer portal, CI/CD, GitOps, and observability.
Key Takeaways
- OpenChoreo is not a Kubernetes distribution; it installs on top of any existing Kubernetes cluster and provides a unified control plane that orchestrates CNCF tooling across data, observability, and workflow planes.
- Agents are first-class citizens. Two MCP servers (control plane and observability) expose roughly 100 tools, with the same permission model and guardrails applied identically to human users and AI agents.
- The abstraction layer solves the agent context problem by hiding raw Kubernetes internals behind higher-level OpenChoreo concepts; agents get the right context in fewer tokens.
- Pre-built SRE, FinOps, and Architect agents ship out of the box. The SRE agent ingests logs, metrics, and config changes, generates root-cause reports, and can be authorized to apply fixes autonomously.
- The permission model supports incremental agent trust. Each agent has its own identity, and read-only or write tools are granted per identity, enabling a walk-before-you-run adoption path.
You can view the podcast here.
Transcript
Introduction 0:00
Bret Fisher: Welcome to DevOps and Docker Talk. On here, we talk about all cloud-native things, and another Kubernetes project has begun and showed up as a CNCF sandbox project this year, and that is OpenChoreo. This project is all about adding AI agents and essentially a new abstraction on Kubernetes itself, that interfaces between what we're all now using as our primary interface to everything software, which is agent harnesses, and what the Kubernetes API and other APIs provide that are typically on Kubernetes, like your monitoring, your observability, or whatever else you might need. This project is adding a new abstraction layer. I'm calling it an abstraction layer because it provides API, MCP, CLI, and web UI to front-end everything else you typically deal with, whether that's your GitOps tools like Argo, your orchestration, your storage, your observability, Kubernetes itself, your authentication, all that stuff.
Bret Fisher: We are now in a world where we're all playing around with agent SDKs and trying to figure out what we can automate in our infrastructure safely with reliable LLM prompts. These are all things we're probably thinking about. If you're not doing it, you're thinking about it, or you're waiting for someone to tell you how to get it done. OpenChoreo is a new project that came out of WSO2, and they were running it on their own SaaS infrastructure and realized it might be useful to others, which was also how the Docker project got started, by a cloud company realizing that their infrastructure tooling was maybe the most useful part of their product, and released it for free. So OpenChoreo came out of a business hosting a legitimate tool for Kubernetes users and felt like it needed to be open source. Now they're going to be running that on their own platform for customers eventually, but for now they're just focused on getting it open source. Now that it's officially a CNCF sandbox project, that means they've met the minimum level of maturity to explore whether this could succeed in the marketplace, whether those of us building or running open source tools on top of Kubernetes would be interested in an agent platform on top of Kubernetes.
Bret Fisher: Specifically, this is about us administrating and managing these environments with agent harnesses through the OpenChoreo abstraction (their MCP, API, and CLI). This is not about running inference or building custom models. This isn't about being an AI ops or MLOps person. And this isn't about running custom agents that we built in pods inside of a cluster, although that's certainly possible. The main focus is on exactly what I talk about on this show as well as my other podcast, Agentic DevOps. This one I'm trying to keep focused on CNCF projects, Kubernetes and Docker projects. The Agentic DevOps podcast is really about the patterns, behaviors, and tools for just being a DevOps or platform engineer or SRE who's using AI to get their job done.
Bret Fisher: On the show this time I have Lakmal and Sameera from the project team over at WSO2, talking about this, getting into the details of what it does and doesn't do, and also explaining to me how it's not what I actually thought it was when we first started talking about it. In the show notes, there are links to the actual demos. We tend to cut those out of the podcast version, but you can see how the product works and how you can use agents to interface with Kubernetes. We could use agents before, but we didn't necessarily have all of that infrastructure in place besides just a raw API. This helps solve a lot of that with a new level of abstraction. Let's get into it. Lakmal, tell us about yourself. Who are you? Why are you here?
Lakmal Warusawithana: I am working as a VP and Distinguished Engineer at WSO2, and also GM of the Choreo BU. I started my career as a systems engineer in 2001 and for the last 25 years I've been engaging with building many platforms. Here we are today.
Bret Fisher: Sameera, how about you?
Sameera Jayasoma: I'm Sameera Jayasoma, one of the co-maintainers of OpenChoreo. I started my career in programming language design, compiler engineering, and platform architecture. You'll hear more about abstractions and compilation throughout this talk; that's me.
Bret Fisher: We'll get a gist by the end of what you're really into. So you all reached out. This is a new project. I'm always excited about new CNCF projects because once they've been accepted into the sandbox process we get an idea, it's like reading the tea leaves a little bit. Not everything is a winner, not every project makes it to graduation. But it's an interesting way to separate the hobby projects from things that teams are taking seriously, because you don't really go for sandbox status unless you're taking things seriously. That's where I wake up and start paying attention to new projects. I'm excited to talk about this today because I'm obsessed with AI agents, how we manage infrastructure with agents involved, how we put MCP in our clusters safely, how we sit on our harness as our primary mechanism for interacting with 20 to 40 years of complexity that we've all built up. The number of abstractions we all have to deal with today is insane, and it's amazing that anyone can go to a conference and have a common conversation because there are a thousand tools we're all using. How did the project get started?
Lakmal Warusawithana: We started this software offering called Choreo SaaS five years back. It started as an iPaaS (an integration platform as a service), then it evolved into something where you could run all other services within the platform itself, so it evolved into an internal developer platform a couple of years later. The name comes from "choreography"; we shortened it to "Choreo." It's like choreographing your microservices in a certain way, where Kubernetes does the orchestration and we do the choreography for all your microservices. We were offering this to enterprise customers, fully managed by WSO2. We were running Choreo version 2, and two years back we started the Choreo version 3 project, the next version of our current offering, and eventually it became OpenChoreo. At WSO2 all our software is open source, so we thought: why not open source the software we're running ourselves? That's how OpenChoreo started. This January we donated it to the CNCF and now we are a sandbox project.
Bret Fisher: That's a similar path to how Docker was invented, because it was a platform company saying "this piece of our platform is pretty interesting, but it's maybe a utility at this point, so let's just give it away." So it's not a bad idea to do this.
What Makes OpenChoreo Different 8:29
Bret Fisher: What was the unique problem you were trying to solve? How does the team look at it from the perspective of "this is a unique thing" in a world where we've got dozens of Kubernetes distros? First off, do you even think of this as a distribution of Kubernetes? Do you see it as vanilla Kubernetes with some things bolted on, or how is this different from something like Rancher?
Sameera Jayasoma: OpenChoreo is something we have built on top of Kubernetes. You can use any Kubernetes distro (Rancher, EKS, AKS, any Kubernetes) and then just install OpenChoreo on top of it. The idea is that you have Kubernetes and all the other tools, and you have a developer platform. What we've seen in the industry is that you can build this developer platform with Kubernetes and Argo and all the things glued on, and then you expose that layer to developers and build a developer experience around it: portals, MCPs. But we've figured out that that way you expose the platform complexity to your developers. They have to learn Kubernetes, they have to learn all the tools. What's unique here is that we build a middle abstraction layer on top of Kubernetes and other tools. Through six years we've figured out that abstraction layer, and it helps us build a developer experience layer where developers are abstracted away from Kubernetes. They can still see what's going on, but they may not need to learn a lot of things about Kubernetes and other tools. That's one of the unique layers in the platform.
Bret Fisher: So when we talk specifically about OpenChoreo, is it installing all the parts? For the audio audience, we're looking at a visualization. I love diagrams because they help me understand the conceptualness of something. To me it looks like a full-fledged, fully built-out Kubernetes cluster with all the extras, all the necessary things that an enterprise cluster would have. There are components like KEDA for scaling, Cilium for networking, API gateways, Argo Workflows, Buildpacks, observability layers with Prometheus and all the OpenTelemetry stuff. And then over to the side there's OpenChoreo as this control plane. It's almost like the control plane on top of the control plane. Is this really just focused on the OpenChoreo control plane components on the left and showing how it interacts with the rest, or is this actually helping us implement all of this?
Lakmal Warusawithana: The main idea behind having a unified control plane is that platform engineers, developers, or even our AI agents can interact with the control plane via different protocols, and then the control plane does all the orchestration across all those other planes. The data plane with Cilium network policy, scale-to-zero with KEDA: all these things can be orchestrated using a control plane, with governance enforced through the control plane. That's the main idea. What I've seen as a problem is that even when we expose all these services to the end user, whether it's a platform engineer or a developer, you can't enforce governance. But with this single control layer, we can have guardrails, policy enforcement, and a single pane of glass. That helps to orchestrate across the different planes. Those planes are built using modular architecture in OpenChoreo. For example, if you take an API gateway, it can come from the WSO2 API gateway, kgateway, or Kong gateway; it isn't tied to a single vendor. It's building an entire ecosystem around it. Users can pick and choose what they want to use in their data plane, and the abstraction layer helps orchestrate all those different tooling choices.
Bret Fisher: So if I'm a platform engineer listening to the show, that probably means I have platforms today, I've already got Kubernetes in production, I've already got a lot of these components. Is the idea that this project is for implementing the OpenChoreo control plane components and then including the system components that link or integrate with the APIs of those other things? Is that where you're drawing the boundary of where your project ends?
Sameera Jayasoma: Yes, that's a good way to put it. Our main component is the control plane, and at the same time there are certain system components running in all the other planes as well: data planes, observability plane, workflow planes. If you're already running these projects, it's just a matter of installing our system components and organizing your platform architecture in this way. And what Lakmal said about abstractions is that when you think about this developer platform, the primary users are platform engineers and the primary consumers are developers. They will use the experience plane (UI, MCP server, and things like that). They will say "I want a project, I want a component, I want this component deployed on a dev environment," and the control plane understands that and executes and compiles that into Kubernetes and other projects in a way that those projects understand. That's the idea behind all those arrows in the diagram.
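To make the "compiles that into Kubernetes" idea concrete, here is a minimal Python sketch of expanding a high-level component request into lower-level manifests. It is an illustration only, not OpenChoreo's implementation: the function name `compile_component` and the input fields (`name`, `project`, `image`, `port`, `replicas`) are hypothetical, and only the Deployment and Service shapes follow the standard Kubernetes API.

```python
def compile_component(component: dict, environment: str) -> list:
    """Expand a hypothetical high-level component spec into Kubernetes manifests."""
    # Hypothetical naming scheme: one workload per component per environment.
    name = f"{component['name']}-{environment}"
    labels = {"app": name, "project": component["project"]}

    deployment = {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "replicas": component.get("replicas", 1),
            "selector": {"matchLabels": {"app": name}},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [
                        {
                            "name": component["name"],
                            "image": component["image"],
                            "ports": [{"containerPort": component["port"]}],
                        }
                    ]
                },
            },
        },
    }

    service = {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "selector": {"app": name},
            "ports": [{"port": 80, "targetPort": component["port"]}],
        },
    }
    return [deployment, service]


# A developer (or agent) only states intent at the higher level.
manifests = compile_component(
    {"name": "cart", "project": "shop", "image": "example/cart:1.0", "port": 8080},
    environment="dev",
)
```

The caller supplies four fields and an environment name; everything Kubernetes-specific is derived, which is the kind of detail the abstraction layer is described as hiding from developers and agents.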
Bret Fisher: When I look at the diagram, from a platform team perspective, I recognize all the names; it's a lot of terminology from different CNCF projects. Are you thinking of AI as the main interaction point? I see there's a web console, but do you think AI is going to be increasingly the way we interact with Kubernetes? When I look at this diagram, all the arrows from how things interact with the Kubernetes cluster seem to go through the OpenChoreo control plane. You've got the Backstage UI, presumably for humans, but then you've got MCP, CLI, and API as other interaction endpoints. When I see this diagram, I think: this allows me to put another abstraction layer for my AI to work through, so I'm not running my AI directly against the Kubernetes native API. I've got this enhanced layer with additional functionality. I see you have authorization and access control, which is a big thing for me lately: how do we minimize the blast radius of giving agents access to different tokens and not giving them root "god" rights to the cluster every day, which I see a lot of people doing by just giving them kubectl admin abilities? Is that the goal of where you see agents being responsible for clusters?
Lakmal Warusawithana: Yes, exactly, that's described correctly. What we want is this: the developer platform gives the golden path, the guardrails, and the policies to our human developers, and in the same way we want to expose this same golden path to the AI agents that interact with the platform.
Agents as First-Class Citizens 16:41
Lakmal Warusawithana: In OpenChoreo, agents are first-class citizens. In the current release, we're exposing two MCP servers: the OpenChoreo control plane MCP server and the OpenChoreo observability MCP server. External agents can interact with these MCPs as well as internal built-in agents. When they interact with the MCP servers, the permissions and all the guardrails apply the same as for a developer user in the platform. It's a unified experience whether it's a human developer, a human SRE, or an agent working as a developer or SRE.
Bret Fisher: So this shows up as a bunch of MCP tools I could plug into locally. If I've got Claude Code running, I plug in the MCP endpoints. What's the estimated tool count? Are we dealing with 100 tools, 20 tools?
Lakmal Warusawithana: It depends on the user persona interacting with MCP. We have a developer persona and an SRE persona. They get different tooling based on their role, and based on that, they can do different activities.
Sameera Jayasoma: I would say roughly 100 tools altogether.
Bret Fisher: Not an insignificant amount of tools. So what you're saying is based on who I'm authing as, which role I'm authing as, I get a scoped set of MCP tools based on my permissions.
Lakmal Warusawithana: Yes, exactly.
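The role-scoped tool list described here can be sketched in a few lines of Python. This is a hedged illustration, not OpenChoreo's code: the tool names, permission strings, and role names below are all hypothetical, and the real MCP servers may model permissions differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str
    required_permission: str


# Hypothetical catalog; these are not OpenChoreo's real MCP tool names.
CATALOG = [
    Tool("list_components", "component:read"),
    Tool("deploy_component", "component:write"),
    Tool("query_logs", "logs:read"),
    Tool("update_config", "config:write"),
]

# Hypothetical role-to-permission mapping for illustration.
ROLE_PERMISSIONS = {
    "developer": {"component:read", "component:write", "logs:read"},
    "sre": {"component:read", "logs:read", "config:write"},
    "viewer": {"component:read"},
}


def tools_for_role(role: str) -> list:
    """Return only the tool names the given role is permitted to call."""
    granted = ROLE_PERMISSIONS.get(role, set())
    return [tool.name for tool in CATALOG if tool.required_permission in granted]
```

An unknown role receives no tools at all, and the same filter would apply whether the caller is a human or an agent, which mirrors the "same permissions, same guardrails" point made in the conversation.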
Bret Fisher: So these agents are coming. You're listing AI agent modules: you've got SRE, FinOps, and Architect. What does that agent mean to me? If I'm on my local harness, are those long-running agents sitting in the cluster? How do I interact with them?
Sameera Jayasoma: We have internal agents and external agents. They both use the same experience plane, MCP servers, and tools. Claude Code is an external agent. The FinOps, SRE, and Architect agents are platform agents running in the cluster, and they are reacting to certain events. For example, if there's an alert (say, a lot of 500 errors are going on), the SRE agent will react and give you a report on what's going on and why. You can configure the agent with certain alert types: if you are consuming high memory above a certain limit, the SRE agent will also react. The FinOps agent is similar in that way.
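The event-driven behavior described above, where a configured alert type wakes a platform agent, might look roughly like the following Python sketch. The rule names, metric names, and thresholds are invented for illustration; the conversation does not specify how OpenChoreo actually defines alert rules.

```python
from dataclasses import dataclass


@dataclass
class AlertRule:
    name: str
    metric: str
    threshold: float
    agent: str  # which platform agent should react when the rule fires


# Hypothetical rules mirroring the two examples from the conversation.
RULES = [
    AlertRule("http-5xx-spike", "http_5xx_per_min", 50.0, "sre"),
    AlertRule("high-memory", "memory_utilization", 0.9, "sre"),
]


def agents_to_trigger(metrics: dict) -> list:
    """Return (agent, rule name) pairs for every rule whose threshold is exceeded."""
    fired = []
    for rule in RULES:
        value = metrics.get(rule.metric)
        # A missing metric never fires a rule.
        if value is not None and value > rule.threshold:
            fired.append((rule.agent, rule.name))
    return fired
```

For example, `agents_to_trigger({"http_5xx_per_min": 120.0, "memory_utilization": 0.5})` fires only the 5xx rule, so only that incident would be handed to the SRE agent.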
Bret Fisher: So each of those agents comes predefined with a scope, and presumably through the control plane, I'm plugging those into my messaging apps, into Slack or whatever, and those are how the agents reach out to me. Do they go through a middle layer like an alerting system the way Prometheus does for alerts?
Lakmal Warusawithana: Yes, at the moment it's going through a middle layer, the alerting system. We have log-based alerting with OpenSearch and the Prometheus layer for metrics. Agents interact with the same layer. The alerting system can be configured with PagerDuty, a Slack channel, or other channels; that's how the architecture currently works.
Bret Fisher: This feels like it's designed to be batteries-included. A lot of the conversations I have (I'm lucky to be in the Agentic DevOps guild I run, and we meet weekly) are around what teams are trying to do right now with agents related to infrastructure. People are in that middle mode right now. The local agent harness has become common knowledge; we've all experimented with GUIs, TUIs, multi-agent management, sessions, and skills. But the minute I have to start running what I call "server agents", these long-running, always-looping agents that are sitting there checking something, continually polling for something, or waiting for a webhook if they're event-driven, it gets harder. These things are probably going to be everywhere, and you're providing a few out of the box defined to do specific things. So I don't have to go find an agent SDK and make up what an agent might do in my cluster, because that's kind of what a lot of us are doing right now: we're writing little agents everywhere, making it up as we go, because we don't really know which parts to automate or which parts they'd be good at. But it feels like this system helps with the context problem, because that's always a really hard problem: what context do I need to give my agent to make it useful and not hallucinate, and how do I get at that stuff at the time it needs it?
Lakmal Warusawithana: Exactly. The main idea is that we have this ecosystem, and within the ecosystem, we have the agents. The community or even OpenChoreo maintainers will release agents, and then our users can pick and choose which agents to run in their OpenChoreo system. For the SRE agent, the intention is: when something happens, the SRE agent can trigger, and we feed all the logs, metrics, config changes, and code changes to the SRE agent within that window. It can then generate a post-incident report, root cause analysis report, and alert the human SRE. When they come into the system, they already have the root cause report. And we're not stopping there, the agent can also provide remediation actions: "This is a quick fix if you want to fix this issue." At the moment, it's human-in-the-loop, so the human SRE can apply those fixes. But we have the permission model where we can give certain agents the ability to automatically apply fixes themselves rather than waiting for a human. We can say: "If this is only a config change, I will allow my SRE agent to apply that change and fix the issue." So eventually it can take on more activity, but it always has human control built in by design. We know what context we need to feed into this agent to provide better results; that's how the built-in agents are acting.
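The gate between "human applies the fix" and "agent applies the fix itself" can be pictured as a small policy check. The Python sketch below is an assumption-laden illustration: `Remediation`, `AgentPolicy`, `decide`, and the change-type labels are hypothetical names, not part of OpenChoreo's permission model as documented here.

```python
from dataclasses import dataclass, field


@dataclass
class Remediation:
    description: str
    change_type: str  # hypothetical labels such as "config" or "code"


@dataclass
class AgentPolicy:
    # Change types the agent may apply without waiting for a human.
    # Empty by default, so every fix stays human-in-the-loop.
    auto_apply: set = field(default_factory=set)


def decide(policy: AgentPolicy, fix: Remediation) -> str:
    """Route a proposed fix to auto-apply or to human approval."""
    if fix.change_type in policy.auto_apply:
        return "auto-apply"
    return "await-human-approval"


# Trust the SRE agent with config-only changes, nothing else.
sre_policy = AgentPolicy(auto_apply={"config"})
```

With this policy, `decide(sre_policy, Remediation("raise memory limit", "config"))` is auto-applied, while a code rollback still waits for a human SRE, matching the "if this is only a config change" example.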
Sameera Jayasoma: And these agents are also a good starting point for you to come up with your own agents, because the context problem is largely solved within the platform. These agents work against the OpenChoreo abstraction layer rather than exposing raw Kubernetes details, and that's what we've figured out: those abstraction fields help us give the right context.
The Future of Agentic DevOps 24:50
Bret Fisher: I was writing an article over the last week around what an AI Kubernetes platform even means. What I had to define in this newsletter is what I'm calling "type three AI on Kubernetes." If you've been going to KubeCons since the invention of ChatGPT, for most of the history of what we now call Gen AI, we are not talking about AIOps and MLOps. For those of us that go to KubeCon, I've been ranting for a couple of years now that there's been a weird disconnect, when a lot of people in the industry talk about Kubernetes and AI, they're actually talking about running AI, making AI, building models, reinforcement learning, and running inference. That's not my world. To me that's a specialty, and while it might get easier, I feel like there's still a significant majority of us that won't be doing that. Our new job is to build out these agents and understand where the features and the edges of models can be implemented so we can automate and build more.
Bret Fisher: I see this whole thing as just like VMs, the cloud, the invention of the PC. Over my 30-year career, we've had major pivoting points in technology that allowed us to scale ourselves as humans. We went from managing 10 servers in the 90s to 100 servers with VMs to a thousand servers with the cloud. Maybe containers allowed us to run 10,000 pods per admin. And now this is the next level. What you all are building is exactly that layer of abstraction; the agents and their context management and permission management, which is this new level of abstraction that will hide some of the complexity. We don't have to know every kubectl command in the world anymore. In fact, I don't even know how we all get Kubernetes certified, because if we're not going to be running kubectl and knowing every option, what is a Kubernetes admin test other than just understanding architecture?
Bret Fisher: This post took me a week to write because I was trying to theorize: we're really talking about agentic operations. That's the reason I have the new podcast; that's the reason this project is exciting; this is that new layer, the abstraction we're all searching for. I think we've understood the local harness now after having those for a little over a year. But this new nebulous layer sitting in front of our infrastructure is still not fully realized. You all are pretty early in the game in terms of saying, "We can help, we have this defined component." I'm trying to draw a diagram of a maturity path: we started with the agent harness locally, and the end goal is we've maximized the agent assistance: the dozen agents, two dozen agents, maximizing all the functions and features of what models can do for us as DevOps engineers, platform engineers, SREs. We're all managing this infrastructure. There are a dozen tools and a thousand features we now need to learn. We're learning skills and agent files and MCPs that can go awry. But eventually the idea is we're improving productivity, improving management, making things more secure, and reducing outages. A year ago we were all nervous about whether our jobs were going away. But to me, this is the job. This is the new job. Where do the agents help, where do they not help, where do we still need the human in the loop? Does any of that resonate?
Lakmal Warusawithana: I think you're correct. It's evolved in the last one year, within one year we can run agents in production, helping different personas engage within the platform. I believe the generating code problem is almost solved now with Claude Code, Cursor, or Codex. People can generate code and write applications within minutes. We call it "vibe coding", but the moment they hit deployment, the vibe is ending, because the coded application has to be promoted into production, but the platform doesn't support it. They have to go away from the agent, either create a ticket or go to the self-service portal to deploy the application. What we've seen with our skills and the MCP server is that they can use the same agent they're using to develop their code and say, "I want to deploy my application into the development environment, you figure out how to do it." So now we call it "vibe deployment." OpenChoreo tries to fill the gap on the deployment part, the operational side of it. You can write your code using an agent and now use the same agent to deploy your application into production, with the support of OpenChoreo's abstraction as well as the tools supported in the MCP servers. That's where we want to play in this platform engineering side of it.
Sameera Jayasoma: And abstractions help agents. The way I think about that is: more abstractions, fewer tokens. They don't have to learn deep Kubernetes YAMLs and kubectl. The same applies to programming languages: writing assembly versus writing in Java or C# or Go, right? Fewer tokens in that way. They have more context with less token budget. That's what I'm working on these days, whether we can establish that.
Bret Fisher: We're very quickly understanding that tokens are finite, just like early in our Kubernetes journey, we realized how much infrastructure we actually need to just run the apps. When you build out a full-scale Kubernetes cluster, there is a not-insignificant amount of infrastructure to manage the infrastructure, and I don't think we all really understood the eventuality of dozens of different controllers and dedicated nodes in the control plane. I feel like now the same thing is happening with tokens. We got all excited and we're going to put agents everywhere, and we're going to very quickly end up in a world where we've got budgets and we have to optimize our agents. One of the conversations we're having in the Agentic DevOps guild is around evals: how can we use evals to figure out which model to use? If my agents are really just behaving with a bunch of skills and context management stuff, how can I programmatically determine that maybe I could use Gemma or Minimax or Qwen or some open-weight model that's super cheap for my very specific niche agent, whereas maybe it doesn't need Opus? None of us want to put these in production and then find out a month later that it's too expensive to run and have to shut it down.
Deploying with Agents: Live Demo 32:55
Bret Fisher: We've had Innerhive on the show and Mindrol on the show, both AI startups building agents to help with infrastructure CI/CD and fault remediation. The approach I've been seeing everybody take is the normal pattern of engineering: walk before you run, run before you sprint. On day one, you might be read-only. Agents can't do things, they can just look things up, they're a little helper assistant with no control over anything, just here to help you find information in this sea of infinite information about your infrastructure. Then, like with Mindrol, they're now to the point where for very specific use cases they're letting the agent have a tiny bit of control, because it's predictable. They can test to make sure it works, that it's as declarative as possible, and they start to let that happen. Is that something we can do with OpenChoreo? Does it have a default stance of read-only? How do the permissions work?
Lakmal Warusawithana: It's not read-only; you can control it. We have the permission model where you can give different tools to different agents. It can be read-only tools, it can be write/operational tools as well. Based on what you want to do, you can allow these permissions, because agents have their own identity, and within that identity, they have permissions. Based on the permissions, they can use different tools. If you are confident enough that your agent is performing well, you can give it write permissions to do some level of activity; for example, if there is a configuration issue, the agent can go and fix the configuration itself. You can fine-tune how your agent can interact with your platform.
Bret Fisher: I've been wondering about how permissions are going to work. So many of our permission models today feel like they're not designed for an agentic world. And how are we going to track the actions of these things? Every little step they take will hopefully be put into some sort of observability platform. I'm trying to envision how to correlate this to the things we know today: GitOps repos with Argo CD definition files that get pulled in because Argo or Flux is watching a different repo and looking for YAML changes. Does this change that workflow? I see GitOps on the diagram. How does this correlate to the GitOps loop that so many teams have adopted?
Sameera Jayasoma: The CRDs: you can put all of these in your GitHub repo and then configure Argo CD or Flux, and the normal workflow will work as it is. So there are two modes here. I think we are using more of a click-ops approach in the demo, but you can configure the same thing with your GitOps.
Bret Fisher: For audio listeners, we're looking at a dashboard. We're vibe-opsing, not click-ops, because we're not clicking on the dashboard; we're using the agent. But just to be clear: the agent is causing OpenChoreo to create new resources inside the cluster for deploying the Google microservices demo. That is an alternative to having an agent make a new PR in YAML and push that to GitHub and then have Argo or Flux pick it up later. I could give the agents skills so that they know that's my workflow and I'm using it as a read-only partner. Just like we're not supposed to go into the AWS console and click-ops away, I don't necessarily want OpenChoreo changing my infrastructure outside of my GitOps loop. Do you think the GitOps approach is less important if we have this sort of agent chain of events, or do you see enterprise teams possibly shifting to more of this vibe approach? Or do you think GitOps is here to stay as the more mature, safe approach?
Lakmal Warusawithana: When you come to lower environments, agents directly creating custom resources will play a big role. But maybe in production, when you promote to production, people will use GitOps or a declarative way of defining things. Eventually, if the human is confident enough with their agents and trusts them, I would say agents can directly call the MCPs and create custom resources directly within the cluster itself. Not this year, but maybe in the future.
Sameera Jayasoma: One advantage of GitOps is that you have the tracing: you know exactly what happened over the years. For dev environments, you probably don't need GitOps; you can give developers full capability to do whatever they want. But for other environments, my view is you would still use GitOps to control the workflow.
Bret Fisher: I started to wonder if agents will eventually change this. Everyone's already writing YAML with agents; teams using Claude Code are not handwriting Kubernetes YAML anymore. And there is a mode in Argo where you can do things retroactively, make changes but document them after you're done. I wonder if agents create a future where GitOps is more of a system of record done after the fact, and agents push us into a faster evolution of this.
Sandboxing & SRE Auto-Remediation 39:26
Bret Fisher: A constant conversation we're having is around sandboxing. Specifically with local agents: every AI I open up by default, even with Claude Code's built-in sandboxing, chances are it has access to my AWS, Docker CLI, kubectl, Terraform, and those keys are all already set up and it could just deploy to things. One of the things I'm trying to work on with some friends is where to draw a boundary around sandboxes. This almost feels like a perfect scenario: these MCP tools and skills are maybe all in a Docker sandbox, which is more like a VM with a harness running inside it. I only use that particular harness because it has these particular permissions, and I only spin it up when I want to. It's almost like a per-environment configuration and each Docker sandbox has its own isolated configuration, including all your keys. So I'd have my prod harness, my staging harness, and I can interact with those environments and keep them safe away from my normal day-to-day harness. A lot of small teams have too many keys on their local machines. If you're a team of three, chances are you've got the production Terraform key on your machine. Now that we have agents, people are starting to get more concerned: what if it accidentally picks the wrong environment? Can OpenChoreo manage multiple clusters in one instance, or is it per cluster?
Lakmal Warusawithana: You can manage multiple clusters within one OpenChoreo.
Bret Fisher: So I almost treat it like an environmental gateway for my agents. I call this "production" and I have this Docker sandbox where the keys to production are only accessible from that sandbox. And maybe I'll also have some GitHub keys in there so my agents in that sandbox can write to the GitOps repo and make PRs for me. I'm now realizing how these things are coming together: I can have one local agent that can see the infrastructure through OpenChoreo, but can also implement GitOps changes based on the information it's getting out of that infrastructure. If the pods are in a pull-backoff loop, the agent can determine that, write the GitOps diff, push that to a PR, and that's how I'm going to be iterating on my cluster now. Or maybe I'm doing it in Slack while I'm at lunch through a Slack bot. Can we use a private LLM to interact with OpenChoreo instead of Claude Code?
Lakmal Warusawithana: Yes. It's not only Claude Code; it's your model, your choice. You can use Codex or any other agent. It's just MCP servers you're interacting with. You can use any LLM in your local agent, and in the built-in agents you can also configure your LLM, whether that's OpenAI or whatever model you want. For us, it's just configuration.
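The pull-backoff scenario Bret describes above can be sketched in a few lines. This is a minimal illustration, not OpenChoreo's implementation: it assumes "pull-backoff" refers to the Kubernetes `ImagePullBackOff` and `ErrImagePull` waiting reasons, and it works on plain dictionaries shaped like Kubernetes pod objects rather than calling any real MCP tool. The function name and the sample pods are hypothetical.

```python
# Waiting reasons Kubernetes reports when a container image cannot be pulled.
PULL_FAILURE_REASONS = {"ImagePullBackOff", "ErrImagePull"}


def pods_failing_image_pull(pods):
    """Return (pod name, container name, image) for containers stuck pulling an image.

    `pods` is a list of dicts shaped like Kubernetes pod objects
    (metadata.name, status.containerStatuses[].state.waiting.reason).
    """
    failing = []
    for pod in pods:
        statuses = pod.get("status", {}).get("containerStatuses", [])
        for cs in statuses:
            # A container is either waiting, running, or terminated;
            # only the waiting state carries a pull-failure reason.
            waiting = cs.get("state", {}).get("waiting") or {}
            if waiting.get("reason") in PULL_FAILURE_REASONS:
                failing.append((pod["metadata"]["name"], cs["name"], cs.get("image")))
    return failing


# Hypothetical sample data: one pod with a bad image tag, one healthy pod.
pods = [
    {
        "metadata": {"name": "web-1"},
        "status": {"containerStatuses": [
            {"name": "app", "image": "registry.example.com/web:typo",
             "state": {"waiting": {"reason": "ImagePullBackOff"}}},
        ]},
    },
    {
        "metadata": {"name": "api-1"},
        "status": {"containerStatuses": [
            {"name": "app", "image": "registry.example.com/api:1.2.0",
             "state": {"running": {"startedAt": "2025-01-01T12:00:00Z"}}},
        ]},
    },
]

failing = pods_failing_image_pull(pods)
print(failing)
```

A finding like this is the kind of signal an agent could use as the trigger for drafting the GitOps change and opening a PR, as discussed above.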
Bret Fisher: Picking the model for the agents is a good thing because I've got a feeling we're going to need cheaper models for the cluster for the always-on activities. Does the agent have visibility into inter-cell dependency traffic, or is it completely blinded to anything outside its assigned project?
Lakmal Warusawithana: It depends on the agent and what context it's interacting with. For example, the SRE agent, when troubleshooting, can go beyond one project, because some components interact with other components in other projects. If there are interactions, the agent can look at what's happening in those other components in other projects. Based on the context the agent is interacting with, it can go beyond a single project to multiple projects.
Bret Fisher: Just to be clear on the scope: it doesn't appear you're trying to boil the ocean and make OpenChoreo the single control plane for AWS, GCP, and all the other things we manage. It's trying to remain Kubernetes-focused, right? The lowest layer is Kubernetes, and I don't see clouds listed here.
Lakmal Warusawithana: We are running on top of Kubernetes. We are not even managing Kubernetes with OpenChoreo; we build on top of it. You can run OpenChoreo on EKS or AKS, but we are not managing EKS or AKS. That's how we architected it. But OpenChoreo has a resource abstraction where you can manage cloud resources within OpenChoreo. We have integrated with Crossplane, so the OpenChoreo control plane can talk with the Crossplane integration, and via Crossplane, it can create an S3 bucket in AWS and do the lifecycle management of that bucket. It provides a single unified control plane to manage it, but it's up to users if they want to use it that way.
Sameera Jayasoma: Just to add to that, there are multiple planes: control plane, data plane, observability plane. You can run each plane in its own Kubernetes cluster, or all planes in one cluster. You can run your control plane locally and your data plane in AWS. You can configure it the way you want. If the data plane is in AWS and the control plane is somewhere else, you don't have to expose the Kubernetes API server to the Internet; an agent running in the data plane creates an outbound connection to the control plane. That's sort of the standard these days.
Bret Fisher: Basically, if you're going to use this to help manage something, it needs to be something running in Kubernetes, probably around the CNCF ecosystem of tooling. This isn't trying to replace Terraform or control CloudFormation or run other infrastructure. That gives me a nice boundary. When we talk about the future of what it means to be an SRE, a DevOps platform engineer, where do the edges of these tools exist? Sure, I could manage everything from Claude Code in theory, but I probably need very scoped things to keep it in line. Having this thing well defined as running on top of Kubernetes means things you can do in Kubernetes it can help with, but maybe it's not the right tool to do everything all in one place. There's a ton of context outside of Kubernetes it would need. At the end of the day we're going to have dozens of agents, and we're going to have an agent management plane where we're all just staring at agent configs and skill files; that's our new job. More markdown.
Getting Started & Wrap-Up 47:04
Bret Fisher: How can people get started? Is this a Helm chart?
Sameera Jayasoma: Step one: you've got to have Kubernetes running somewhere. If you go to the openchoreo.dev documentation page, there are a couple of ways to get started. First, if you want to just try out OpenChoreo, we have a quickstart guide; you just need Docker. There's one command that will install Kubernetes in Docker and then install OpenChoreo on top. Once it's done you get a UI, you get MCP access, all that. It takes just 10 minutes, it won't pollute your local environment, and you can just destroy it when you're done. The second option is to have a Kubernetes cluster on your local machine and install OpenChoreo there. The third option is to install OpenChoreo in cloud environments. That's how we've structured the getting started experience.
Bret Fisher: Does it have databases as part of its deployment, like a Redis? What does the persistence layer look like for those of us deploying this on our clusters?
Sameera Jayasoma: For the whole OpenChoreo control plane, the main persistence layer is etcd, whatever Kubernetes uses. For Backstage, you can configure Postgres or any other database that Backstage recommends. Apart from that, there's no other persistence.
Lakmal Warusawithana: For the SRE agent, you can configure it to save incident history and provide context in future incidents; you can configure Postgres or any other database for that.
Bret Fisher: These agents are doing things and we probably want a history of that. Is it logging through Kubernetes to my existing monitoring and logging platform? If I want to see what my agent did in the last 24 hours, do I just look at my normal monitoring and logging platform?
Lakmal Warusawithana: Yes, it does all the audit logging itself. Depending on the log module, it can be OpenSearch with different indices for logs. What the agent did, what the human did: everything is tracked within the logs.
Bret Fisher: Are agents pulling infrastructure data in real time, or is there a caching layer you have to deal with to optimize in the OpenChoreo layer?
Lakmal Warusawithana: At the moment, it's real-time. We are not using any vector DB or caching. We give the agent a time-interval context: "Within this time interval you have to pull the logs and metrics." It will just look in that interval and get all the data to troubleshoot.
Bret Fisher: So once it sees an event, it already knows exactly when that event happened and can limit itself so that you're not burning a bunch of tokens by boiling the ocean and scanning every log to infinity. Does it need to look at Grafana graphs, or is it just doing Prometheus queries and getting data in real time?
Lakmal Warusawithana: It goes to the observability MCP server; it's querying through the MCP servers.
Sameera Jayasoma: This is directly from the observability MCP server talking to OpenSearch and getting the data.
Bret Fisher: So it's a little bit simpler than having to constantly store everything as a cache, which in theory would make the agent faster, but also has a lot of complexity and resource utilization. That gets back to the point that at some point everybody has that infrastructure where they realize they have a cluster with more resources used for the infrastructure than for the actual apps. You know, the app only costs $100 a month to run but the infrastructure costs $10,000 a day.
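The time-interval approach Lakmal describes above, where the agent is handed a window and only pulls logs and metrics inside it, can be illustrated with a small sketch. This is an assumption-laden example rather than OpenChoreo's code: the window sizes, the function names, and the in-memory log records are all hypothetical, and a real agent would send the interval to the observability MCP server instead of filtering a local list.

```python
from datetime import datetime, timedelta, timezone


def incident_window(event_time, before=timedelta(minutes=15), after=timedelta(minutes=5)):
    """Return the (start, end) interval to inspect around an incident.

    The default window sizes are illustrative assumptions.
    """
    return event_time - before, event_time + after


def logs_in_window(records, start, end):
    """Keep only the records whose timestamp falls inside [start, end]."""
    return [r for r in records if start <= r["timestamp"] <= end]


# Hypothetical incident time and log records.
event = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
records = [
    {"timestamp": event - timedelta(hours=2), "message": "routine startup"},
    {"timestamp": event - timedelta(minutes=3), "message": "connection refused"},
    {"timestamp": event + timedelta(minutes=1), "message": "pod restarted"},
    {"timestamp": event + timedelta(hours=1), "message": "unrelated cron job"},
]

start, end = incident_window(event)
relevant = logs_in_window(records, start, end)
for record in relevant:
    print(record["timestamp"].isoformat(), record["message"])
```

Bounding the query this way is what keeps the agent from scanning every log, which is the token-saving point Bret raises.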
Bret Fisher: This has been very cool. I could talk to you for another hour about this because I'm really interested in the patterns and architecture design that you've learned. We're going to have to do the same thing for our cloud infrastructure and our line-of-business apps: we're going to have to figure out how to give agents a deeper understanding of that infrastructure so we can depend on them more and rely less on the human tribal knowledge we all have about how these systems are put together. My friends call it "the sins of the data center" and this thing is going to have to know all that. Viktor Farcic, who runs the DevOps Toolkit channel, talks about how Claude Code is the center of the universe for him and that anybody creating a project that isn't expected to be local-harness-first is creating an outdated project. This feels like it's doing exactly what he was predicting over the last year: the harness is your gateway to all this infrastructure, and everything else needs to be there to support that. All right, people get started at openchoreo.dev and you're on LinkedIn, X, and CNCF Slack?
Lakmal Warusawithana: We have X and LinkedIn pages, and also the CNCF Slack channel. We actively engage there. And we use GitHub Discussions a lot.
Bret Fisher: I love GitHub Discussions. I wish they were used more rather than issues; a lot of things just need to start in discussions before they go to issues. Thank you both so much for being here. Last quick thing: what's next? What big PRs are coming down the pike?
Lakmal Warusawithana: More agents.
Sameera Jayasoma: More agents.
Bret Fisher: Could have guessed it. More AI overlords. Thank you so much for being here. Thanks everybody in chat for being with us and we'll see you in the next one.
Get Started with OpenChoreo
OpenChoreo is open source and free to use. The best next step is to try it yourself.
- Check out the website
- Try out the quickstart guide
- Star the repository
- Join #openchoreo on Slack

