Quickstart

From zero to correlated incidents in an hour

Install the Infrawatch agent via Helm, point it at your OpenTelemetry collector, and watch signals correlate automatically.

Prerequisites

  • A Kubernetes cluster (1.24+), plus Helm 3 installed on the machine you deploy from
  • An existing OTel Collector deployment or compatible endpoint
  • Your Infrawatch API key, found in the dashboard under Settings → API Keys

Step 1 — Add the Helm repository

Step 2 — Install the agent

The agent starts accepting signals immediately. By default it listens on port 4317 (gRPC) and port 4318 (HTTP/protobuf).
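To check that both listeners are reachable, a plain TCP connection test is often enough. The sketch below assumes you have forwarded the agent's ports to your machine (for example with `kubectl port-forward`; this guide does not name the agent's Service). `port_open` is a hypothetical helper, and a successful connect only shows that something is listening on the port, not that it completed an OTLP exchange.

```python
import socket


def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds.

    This only shows that something accepts connections on the port; it does
    not perform a gRPC or OTLP/HTTP exchange.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, or timed out
        return False


if __name__ == "__main__":
    # Assumes the agent's ports are forwarded to localhost (e.g. via kubectl port-forward).
    for port, proto in ((4317, "gRPC"), (4318, "HTTP/protobuf")):
        status = "reachable" if port_open("127.0.0.1", port) else "not reachable"
        print(f"{port} ({proto}): {status}")
```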

Step 3 — Point your OTel collector at the agent

In your otel-collector-config.yaml, add an OTLP exporter targeting the agent service:

Concepts

Correlation windows

Infrawatch groups signals that arrive within a configurable time window (default: 5 minutes, range: 30s–60m) and share a topology relationship. Signals from the same service, same node, or related upstream/downstream services are correlated into a single incident candidate. The window is tunable per cluster — tighter for high-churn deployments, wider for batch jobs with delayed failure propagation.
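The grouping rule above can be sketched in a few lines of Python. This is an illustration, not Infrawatch's implementation: `Signal`, `related`, and `correlate` are hypothetical names, the window is assumed to be measured from the first signal in a candidate, and "related upstream/downstream services" is simplified to a direct edge in a set of caller/callee pairs.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Signal:
    ts: float      # arrival time in seconds
    service: str   # service.name
    node: str      # Kubernetes node the signal came from


@dataclass
class Candidate:
    opened_at: float                             # arrival time of the first signal
    signals: list = field(default_factory=list)


def related(a: Signal, b: Signal, edges: set) -> bool:
    """Same service, same node, or a direct caller/callee edge in either direction."""
    return (
        a.service == b.service
        or a.node == b.node
        or (a.service, b.service) in edges
        or (b.service, a.service) in edges
    )


def correlate(signals, edges: set, window: float = 300.0) -> list:
    """Group signals into incident candidates (hypothetical sketch).

    A signal joins the first candidate that is still open, i.e. opened no more
    than `window` seconds earlier, and already holds a related signal.
    Otherwise it opens a new candidate.
    """
    candidates = []
    for sig in sorted(signals, key=lambda s: s.ts):
        for cand in candidates:
            in_window = sig.ts - cand.opened_at <= window
            if in_window and any(related(sig, other, edges) for other in cand.signals):
                cand.signals.append(sig)
                break
        else:
            candidates.append(Candidate(opened_at=sig.ts, signals=[sig]))
    return candidates


if __name__ == "__main__":
    calls = {("checkout", "cache")}  # checkout calls cache
    demo = [
        Signal(0, "cache", "node-a"),
        Signal(60, "checkout", "node-b"),  # related to cache via the call edge
        Signal(90, "billing", "node-c"),   # unrelated: opens a new candidate
        Signal(400, "cache", "node-a"),    # outside every open 300 s window
    ]
    for i, cand in enumerate(correlate(demo, calls)):
        print(i, [s.service for s in cand.signals])
```

With the default 300-second window, the cache and checkout signals land in one candidate while the later cache signal opens a new one; widening the window to 1000 seconds folds it back into the first candidate, which is the trade-off between high-churn and batch workloads described above.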

Fingerprinting

Each unique signal event is assigned a fingerprint based on error message pattern, service name, namespace, and OTLP resource attributes. Recurring fingerprints (same pattern, different incident window) are deduplicated into a single incident stream rather than N duplicate alerts. Config changes are fingerprinted separately and indexed for correlation lookup — a Helm chart diff that appears before multiple OOMKill incidents will surface in the root cause heatmap.
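A fingerprint of this kind can be approximated by normalizing the message into a pattern and hashing it together with the other fields. The masking rules (hex literals, then digit runs), the SHA-256 digest, and the function names below are assumptions for illustration; the source does not specify how Infrawatch derives patterns.

```python
import hashlib
import json
import re

# Hypothetical normalization rules: hex literals first, then any digit run.
_HEX = re.compile(r"\b0x[0-9a-fA-F]+\b")
_NUM = re.compile(r"\d+")


def message_pattern(msg: str) -> str:
    """Mask variable parts so messages that differ only in values share a pattern."""
    return _NUM.sub("<n>", _HEX.sub("<hex>", msg))


def fingerprint(msg: str, service: str, namespace: str, resource: dict) -> str:
    """Stable ID from pattern, service name, namespace and resource attributes."""
    canonical = json.dumps(
        {"pattern": message_pattern(msg), "service": service,
         "namespace": namespace, "resource": resource},
        sort_keys=True,  # attribute order must not change the fingerprint
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


def group_into_streams(events) -> dict:
    """Map each fingerprint to one incident stream instead of N separate alerts."""
    streams = {}
    for ev in events:
        fp = fingerprint(ev["msg"], ev["service"], ev["namespace"], ev["resource"])
        streams.setdefault(fp, []).append(ev)
    return streams
```

In a scheme like this, only stable resource attributes (for example `k8s.cluster.name`, not `k8s.pod.name`) should feed the hash; a per-pod attribute would give every replica its own fingerprint and defeat deduplication.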

Topology matching

Infrawatch builds a live service dependency graph from your trace span attributes (service.name, service.namespace, span parent/child relationships). When signals arrive, they are resolved against this graph to determine upstream/downstream blast radius. A memory pressure event on a shared cache will automatically surface all dependent services in the incident view — even if those services haven't fired their own alerts yet.
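The graph-building and blast-radius steps can be sketched as a breadth-first search over reversed call edges. The `Span` shape is a simplified, hypothetical stand-in for OTLP span data (real spans carry `service.name` as a resource attribute and their IDs as bytes), and services are keyed by name only; a fuller version would key on both `service.namespace` and `service.name`.

```python
from collections import defaultdict, deque
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    span_id: str
    parent_span_id: Optional[str]  # None for a root span
    service: str                   # value of the service.name resource attribute


def build_callers(spans) -> dict:
    """Return {callee: {direct callers}} from parent/child span relationships."""
    by_id = {s.span_id: s for s in spans}
    callers = defaultdict(set)
    for s in spans:
        parent = by_id.get(s.parent_span_id)
        if parent is not None and parent.service != s.service:
            callers[s.service].add(parent.service)
    return dict(callers)


def blast_radius(callers: dict, service: str) -> set:
    """Breadth-first search for every service that directly or transitively calls `service`."""
    seen, queue = set(), deque([service])
    while queue:
        for caller in callers.get(queue.popleft(), ()):
            if caller not in seen:
                seen.add(caller)
                queue.append(caller)
    seen.discard(service)  # a call cycle can lead back to the affected service itself
    return seen
```

For a trace where `frontend` calls `checkout` and `search`, and both of those call `cache`, `blast_radius(build_callers(spans), "cache")` returns `{"checkout", "search", "frontend"}`: the dependents surface even though none of them has reported an error.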

Configuration reference

Key                              Default     Description
config.apiKey                    (none)      Required. Your Infrawatch API key.
config.clusterName               "default"   Label shown in the incident view for this cluster.
config.correlationWindow         300         Seconds to hold signals open for correlation.
config.dedupeWindow              3600        Seconds before a closed fingerprint can reopen.
config.otlpPort                  4317        gRPC listener port.
config.otlpHttpPort              4318        HTTP/protobuf listener port.
agent.resources.limits.memory    "256Mi"     Container memory limit.
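Before installing, you can catch out-of-range overrides locally. The checks below are a hypothetical pre-flight sketch, not part of the chart: the 30–3600 second bound on `config.correlationWindow` comes from the Correlation windows section, while the positive `dedupeWindow` and distinct-port rules are assumptions rather than documented chart behavior.

```python
# Defaults copied from the configuration reference table; None means "must be set".
DEFAULTS = {
    "config.apiKey": None,
    "config.clusterName": "default",
    "config.correlationWindow": 300,
    "config.dedupeWindow": 3600,
    "config.otlpPort": 4317,
    "config.otlpHttpPort": 4318,
    "agent.resources.limits.memory": "256Mi",
}


def check_values(overrides: dict) -> list:
    """Merge overrides onto the documented defaults and return a list of problems.

    Hypothetical pre-flight check, not part of the Infrawatch chart.
    """
    cfg = {**DEFAULTS, **overrides}
    problems = []
    if not cfg["config.apiKey"]:
        problems.append("config.apiKey is required")
    if not 30 <= cfg["config.correlationWindow"] <= 3600:
        problems.append("config.correlationWindow must be 30-3600 seconds")
    if cfg["config.dedupeWindow"] <= 0:  # assumption: must be positive
        problems.append("config.dedupeWindow must be positive")
    for key in ("config.otlpPort", "config.otlpHttpPort"):
        if not 1 <= cfg[key] <= 65535:
            problems.append(f"{key} must be a valid TCP port")
    if cfg["config.otlpPort"] == cfg["config.otlpHttpPort"]:  # assumption
        problems.append("gRPC and HTTP listeners need different ports")
    return problems
```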