Version: v0.11.x

Observability & Alerting

OpenChoreo provides an optional observability plane: an observability stack for monitoring applications deployed on the platform.

This guide covers how to configure and use logging, metrics, traces, and alerting capabilities.

Overview

OpenChoreo's default observability architecture consists of:

Pillar	Components
Logs	Fluent Bit as the log collector and OpenSearch as the log storage
Metrics	Prometheus to collect and store metrics
Traces	OpenTelemetry Collector as the trace collector and processor, and OpenSearch as the trace storage
Alerting	Prometheus Alertmanager for metric alerts, and OpenSearch Alerting for log alerts

All observability data is accessible through the Observer API, which provides a unified interface for querying logs, metrics, and traces.

Architecture

Single-Cluster Setup

In single-cluster mode, all planes run in the same Kubernetes cluster. Observability data is collected directly from the data planes and build plane via the agents deployed in the observability plane.

Multi-Cluster Setup

In multi-cluster mode, the observability plane runs on a dedicated cluster. Data planes and build plane deploy local collectors that publish observability data to the observability plane through gateway ingress.

In this setup:

- Data Plane deploys Fluent Bit to collect logs, Prometheus Agent to collect metrics, and OpenTelemetry Collector to collect traces
- Build Plane deploys Fluent Bit to collect build logs
- All collectors publish data to the Gateway in the Observability Plane
- The Observer API queries OpenSearch and Prometheus to serve unified observability data

For detailed multi-cluster setup instructions, see Multi-Cluster Connectivity.

Prerequisites

- OpenChoreo control plane installed
- Data plane and, optionally, build plane installed (the planes to be observed)
- Observability plane installed (see Installation)

Resource Requirements

Component	CPU Request	CPU Limit	Memory Request	Memory Limit
OpenSearch (per node)	100m	1000m	1Gi	1Gi
Observer	100m	200m	128Mi	200Mi
Prometheus	100m	200m	128Mi	256Mi
Fluent Bit	100m	200m	128Mi	256Mi
OpenTelemetry Collector	50m	100m	100Mi	200Mi

Installing the Observability Plane

Refer to Getting Started or Multi-Cluster Connectivity for instructions on installing the observability plane in single-cluster or multi-cluster mode.

Observability

Logs

OpenChoreo collects container logs using Fluent Bit and stores them in OpenSearch. By default, logs are collected from all containers in the cluster, except for the Fluent Bit containers. Collected logs are enriched with Kubernetes metadata to support querying by OpenChoreo labels.

Log Collection Configuration

The Fluent Bit configuration can be customized via Helm values:

fluent-bit:
  enabled: true
  config:
    inputs: |
      [INPUT]
          Name tail
          Path /var/log/containers/*.log
          Tag kube.*
          # ... additional configuration
    filters: |
      [FILTER]
          Name kubernetes
          Match kube.*
          Merge_Log On
    outputs: |
      [OUTPUT]
          Name opensearch
          Host opensearch
          Port 9200
          Match kube.*

Logs Retention Configuration

By default, collected logs are retained for 30 days. The retention period can be configured as required.

Querying Logs

The Observer API provides a REST API for querying logs of a specific OpenChoreo component. This can be accessed via the Backstage portal, OpenChoreo CLI or OpenChoreo MCP server. Observer handles the authentication and authorization based on OpenChoreo user identity.

Metrics

OpenChoreo collects metrics using Prometheus and kube-state-metrics. The metrics stack provides:

- Container resource metrics (CPU, memory)
- HTTP request metrics (when instrumented via Hubble with Cilium CNI)

Available Metrics

Metric Type	Description	Source
CPU Usage	Container CPU utilization	cAdvisor
Memory Usage	Container memory consumption	cAdvisor
HTTP Requests	Request counts, latency	Hubble (Requires Cilium CNI)

Querying Metrics

The Observer API provides a REST API for querying metrics of a specific OpenChoreo component. This can be accessed via the Backstage portal, OpenChoreo CLI or OpenChoreo MCP server. Observer handles the authentication and authorization based on OpenChoreo user identity.

Traces

OpenChoreo supports distributed tracing using OpenTelemetry. The OpenTelemetry Collector receives traces via OTLP, enriches them with Kubernetes metadata, applies sampling policies, and exports them to OpenSearch.

Trace Pipeline

The OpenTelemetry Collector processes traces through the following pipeline:

- Receivers: Accept traces via OTLP (gRPC on port 4317, HTTP on port 4318)
- Processors:
  - k8sattributes: Enriches traces with OpenChoreo labels
  - tail_sampling: Applies rate limiting to control trace volume
- Exporters: Send processed traces to OpenSearch (index: otel-traces-*)

Instrumenting Applications

Applications must be instrumented to send traces to the OpenTelemetry Collector. Configure your application to send OTLP traces to one of the following endpoints:

Protocol	Endpoint
HTTP	`http://opentelemetry-collector.openchoreo-observability-plane.svc.cluster.local:4318/v1/traces`
gRPC	`opentelemetry-collector.openchoreo-observability-plane.svc.cluster.local:4317`

Example: OpenTelemetry SDK Configuration (Go)

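import (
    "context"
    "log"

    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp"
)

// Create an OTLP/HTTP trace exporter that targets the in-cluster collector.
ctx := context.Background()
exporter, err := otlptracehttp.New(ctx,
    otlptracehttp.WithEndpoint("opentelemetry-collector.openchoreo-observability-plane:4318"),
    otlptracehttp.WithInsecure(), // plain HTTP, no TLS
)
if err != nil {
    log.Fatalf("failed to create OTLP trace exporter: %v", err)
}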

Trace Sampling Configuration

The OpenTelemetry Collector uses tail-based sampling to control the volume of traces stored. Configure sampling via Helm values:

opentelemetryCollectorCustomizations:
  tailSampling:
    decisionWait: 10s           # Time to wait before making sampling decision
    numTraces: 100              # Number of traces to keep in memory
    expectedNewTracesPerSec: 10 # Expected new traces per second
    spansPerSecond: 10          # Maximum spans per second rate limit
    decisionCache:
      sampledCacheSize: 10000
      nonSampledCacheSize: 1000
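
To make the `spansPerSecond` setting more concrete, the following is a minimal Go sketch of a span budget applied per one-second window. It is an illustration under simplifying assumptions: the `rateLimiter` type and its `allow` method are hypothetical names, and this is not the OpenTelemetry Collector's implementation, which also waits for `decisionWait` before deciding on each trace.

```go
package main

import "fmt"

// rateLimiter is a hypothetical, simplified model of a spans-per-second
// limit: it admits traces until the span budget for the current one-second
// window is used up. It is not the collector's actual implementation.
type rateLimiter struct {
	spansPerSecond int   // budget per window, like tailSampling.spansPerSecond
	window         int64 // Unix second of the current window
	used           int   // spans admitted in the current window
}

// allow reports whether a trace with spanCount spans, decided at Unix time
// nowSec, fits into the remaining budget of the current window.
func (r *rateLimiter) allow(nowSec int64, spanCount int) bool {
	if nowSec != r.window {
		// A new second started: reset the budget.
		r.window = nowSec
		r.used = 0
	}
	if r.used+spanCount > r.spansPerSecond {
		return false // sampling this trace would exceed the limit
	}
	r.used += spanCount
	return true
}

func main() {
	limiter := &rateLimiter{spansPerSecond: 10}
	fmt.Println(limiter.allow(100, 6)) // first trace in the window
	fmt.Println(limiter.allow(100, 6)) // would bring the window to 12 spans
	fmt.Println(limiter.allow(101, 6)) // next second, budget reset
}
```

In this model a low `spansPerSecond` value drops most traces under load, so the value should be chosen with the expected trace volume in mind.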

Querying Traces

The Observer API provides a REST API for querying traces of a specific OpenChoreo project. This can be accessed via the Backstage portal, OpenChoreo CLI or OpenChoreo MCP server. Observer handles the authentication and authorization based on OpenChoreo user identity.

Alerting

OpenChoreo provides alerting based on logs and resource usage metrics. Alert rules are defined as traits on components and are automatically created for each environment by the control plane during component releases. Notification channels define where alert notifications are delivered: when an alert is triggered, the notification is sent through the channel configured for the rule.

Alert Rule Configuration

OpenChoreo ships a default trait named observability-alertrule that can be used to define alert rules on components. Platform engineers can define their own traits to create custom alert rules as required.

traits:
  - name: observability-alertrule
    instanceName: high-error-rate-log-alert
    parameters:
      description: "Triggered when error logs count exceeds 50 in 5 minutes."
      severity: "critical"
      source:
        type: "log"
        query: "status:error"
      condition:
        window: 5m
        interval: 1m
        operator: gt
        threshold: 50

Override the environment-specific parameters for the alert rule in the ReleaseBinding CR.

spec:
  traitOverrides:
    high-error-rate-log-alert:
      enabled: true
      enableAiRootCauseAnalysis: false
      notificationChannel: devops-email-notifications

Alert Source Types

Type	Description	Use Case
`log`	Log-based alerting	Error patterns, specific log messages
`metric`	Metric-based alerting	Resource utilization (CPU, memory)

Alert Condition Operators

Operator	Description
`gt`	Greater than
`lt`	Less than
`gte`	Greater than or equal
`lte`	Less than or equal
`eq`	Equal to
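
The operators above compare an observed value (for example, the number of matching log lines in the window) against the configured threshold. A minimal Go sketch of that comparison follows; the `evaluate` function and its error handling for unknown operators are assumptions made for illustration, not OpenChoreo's implementation.

```go
package main

import "fmt"

// evaluate applies one of the documented condition operators to an observed
// value and a threshold. The function name and the error returned for
// unknown operators are assumptions made for this sketch.
func evaluate(operator string, value, threshold float64) (bool, error) {
	switch operator {
	case "gt":
		return value > threshold, nil
	case "lt":
		return value < threshold, nil
	case "gte":
		return value >= threshold, nil
	case "lte":
		return value <= threshold, nil
	case "eq":
		return value == threshold, nil
	default:
		return false, fmt.Errorf("unknown operator %q", operator)
	}
}

func main() {
	// 51 matching log lines in the window, with operator gt and threshold 50.
	fired, err := evaluate("gt", 51, 50)
	fmt.Println(fired, err)
}
```

With the example rule shown earlier (`operator: gt`, `threshold: 50`), exactly 50 matching log lines would not satisfy the condition, while 51 would.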

Notification Channels

Configure notification channels to receive alerts. Platform engineers can configure notification channels per environment. The first notification channel created in an environment is marked as the default channel, which is used by alert rules that don't specify a channel.

apiVersion: openchoreo.dev/v1alpha1
kind: ObservabilityAlertsNotificationChannel
metadata:
  name: my-notification-channel
  namespace: default-organization
spec:
  environment: development
  isEnvDefault: true
  type: email
  config:
    from: alerts@example.com
    to:
      - team@example.com
      - oncall@example.com
    smtp:
      host: smtp.example.com
      port: 587
      auth:
        username:
          secretKeyRef:
            name: smtp-credentials
            key: username
        password:
          secretKeyRef:
            name: smtp-credentials
            key: password
      tls:
        insecureSkipVerify: false
    template:
      subject: "[${alertSeverity}] ${alertName} Triggered"
      body: |
        Alert: ${alertName}
        Severity: ${alertSeverity}
        Time: ${alertTimestamp}
        Description: ${alertDescription}
        Component: ${component}
        Project: ${project}
        Environment: ${environment}
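
The `${...}` placeholders in the template can be read as plain string substitution. The sketch below illustrates the idea in Go with the standard library's `os.Expand`; the `render` helper and the sample values are hypothetical, and OpenChoreo's own rendering may behave differently (for example, for unknown placeholders).

```go
package main

import (
	"fmt"
	"os"
)

// render substitutes ${name} placeholders using the values map. Placeholders
// without a value are replaced with an empty string. Using os.Expand is an
// assumption for illustration, not necessarily how OpenChoreo renders
// notification templates.
func render(template string, values map[string]string) string {
	return os.Expand(template, func(name string) string {
		return values[name]
	})
}

func main() {
	values := map[string]string{
		"alertSeverity": "critical",
		"alertName":     "high-error-rate-log-alert",
	}
	// Prints: [critical] high-error-rate-log-alert Triggered
	fmt.Println(render("[${alertSeverity}] ${alertName} Triggered", values))
}
```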

AI-Powered Root Cause Analysis

When enableAiRootCauseAnalysis is enabled on an alert rule, OpenChoreo's RCA Agent automatically analyzes the alert and generates a root cause analysis report.

See RCA Agent for configuration details.

Viewing Observability Data

OpenSearch Dashboards

Enable OpenSearch Dashboards for visual exploration of logs and traces:

helm upgrade --install openchoreo-observability-plane oci://ghcr.io/openchoreo/helm-charts/openchoreo-observability-plane \
--version 0.11.0 \
--namespace openchoreo-observability-plane \
--reuse-values \
--set openSearchCluster.dashboards.enable=true

Port-forward OpenSearch Dashboards to view logs and traces:

kubectl port-forward svc/opensearch-dashboards 5601:5601 -n openchoreo-observability-plane

Open http://localhost:5601 in your browser to access OpenSearch Dashboards.

Grafana

Enable Grafana for metrics visualization:

helm upgrade --install openchoreo-observability-plane oci://ghcr.io/openchoreo/helm-charts/openchoreo-observability-plane \
--version 0.11.0 \
--namespace openchoreo-observability-plane \
--reuse-values \
--set prometheus.grafana.enabled=true

Port-forward Grafana to view metrics:

kubectl port-forward svc/grafana 5000:80 -n openchoreo-observability-plane

Open http://localhost:5000 in your browser to access Grafana.

Default credentials: admin / admin

Configuration Reference

Key Helm Values

Value	Default	Description
`fluent-bit.enabled`	`true`	Enable log collection
`prometheus.enabled`	`true`	Enable metrics collection
`opentelemetry-collector.enabled`	`true`	Enable OpenTelemetry Collector for traces
`openSearch.enabled`	`false`	Enable OpenSearch single node mode
`openSearchCluster.enabled`	`true`	Enable OpenSearch HA mode
`openSearchCluster.dashboards.enable`	`false`	Enable OpenSearch Dashboards for HA mode
`prometheus.grafana.enabled`	`true`	Enable Grafana
`rca.enabled`	`false`	Enable AI RCA Agent

For complete configuration options, see the Observability Plane Helm Reference.

Troubleshooting

Logs Not Appearing

Verify Fluent Bit is running:

kubectl get pods -n openchoreo-observability-plane -l app.kubernetes.io/name=fluent-bit

Check Fluent Bit logs:

kubectl logs -n openchoreo-observability-plane -l app.kubernetes.io/name=fluent-bit

Verify OpenSearch is healthy:

kubectl get pods -n openchoreo-observability-plane -l app=opensearch

Metrics Not Available

Verify Prometheus is running:

kubectl get pods -n openchoreo-observability-plane -l app.kubernetes.io/name=prometheus

Check if ServiceMonitors are being discovered:
```
kubectl get servicemonitors --all-namespaces
```

Traces Not Appearing

Verify OpenTelemetry Collector is running:

kubectl get pods -n openchoreo-observability-plane -l app.kubernetes.io/name=opentelemetry-collector

Check OpenTelemetry Collector logs for errors:

kubectl logs -n openchoreo-observability-plane -l app.kubernetes.io/name=opentelemetry-collector

Verify your application is configured to send traces to the correct endpoint (port 4317 for gRPC, port 4318 for HTTP).
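
If it is unclear whether the collector ports are reachable from a workload, a small TCP check can help narrow the problem down. The following Go sketch is one way to do this, assuming it runs inside the cluster and that the collector is exposed at the service hostname listed under Instrumenting Applications; the `reachable` helper is a hypothetical name, and it only tests connectivity, not OTLP itself.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// reachable reports whether a TCP connection to addr (host:port) can be
// opened within the timeout. It only checks connectivity, not whether the
// collector accepts OTLP data.
func reachable(addr string, timeout time.Duration) bool {
	conn, err := net.DialTimeout("tcp", addr, timeout)
	if err != nil {
		return false
	}
	conn.Close()
	return true
}

func main() {
	// Assumed collector service hostname, taken from the endpoint table.
	host := "opentelemetry-collector.openchoreo-observability-plane.svc.cluster.local"
	for _, port := range []string{"4317", "4318"} {
		addr := net.JoinHostPort(host, port)
		fmt.Printf("%s reachable: %v\n", addr, reachable(addr, 2*time.Second))
	}
}
```

A `false` result points to a networking or service discovery problem rather than an instrumentation problem.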

Alert Not Firing

Verify the alert rule status after a component is deployed:

kubectl get observabilityalertrules -n <namespace>
kubectl describe observabilityalertrule <name> -n <namespace>

Check the Observer logs for alert processing errors:

kubectl logs -n openchoreo-observability-plane deployment/observer

Related Documentation

- RCA Agent - AI-powered root cause analysis
- Deployment Topology - Multi-plane architecture overview
- Multi-Cluster Connectivity - Connecting planes across clusters
- Observability Plane Helm Reference - Complete Helm configuration options
