Prerequisites
- Kubernetes cluster (1.24+) with Helm 3 installed
- An existing OTel Collector deployment or compatible endpoint
- Your Infrawatch API key — retrieve from the dashboard under Settings → API Keys
Step 1 — Add the Helm repository
$ helm repo add infrawatch https://charts.infrawatch.com
$ helm repo update
Step 2 — Install the agent
$ helm install infrawatch-agent infrawatch/agent \
    --namespace monitoring \
    --create-namespace \
    --set config.apiKey="YOUR_API_KEY" \
    --set config.clusterName="prod-us-east"
The agent accepts signals as soon as it is running. By default it listens on ports 4317 (gRPC) and 4318 (HTTP/protobuf).
Step 3 — Point your OTel Collector at the agent
In your otel-collector-config.yaml, add an OTLP exporter targeting the agent service:
exporters:
  otlp/infrawatch:
    endpoint: infrawatch-agent.monitoring.svc.cluster.local:4317
    tls:
      insecure: false
service:
  pipelines:
    traces:
      exporters: [otlp/infrawatch, ...]
    metrics:
      exporters: [otlp/infrawatch, ...]
Concepts
Correlation windows
Infrawatch groups signals that arrive within a configurable time window (default: 5 minutes, range: 30s–60m) and share a topology relationship. Signals from the same service, the same node, or related upstream/downstream services are correlated into a single incident candidate. The window is tunable per cluster through config.correlationWindow (in seconds): use a tighter window for high-churn deployments and a wider one for batch jobs with delayed failure propagation.
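The windowing idea can be illustrated with a minimal Python sketch. This is not Infrawatch's implementation: the `Signal` type and `correlate` function are hypothetical, the only topology relationship modeled is "same service", and a group is assumed to close once a signal arrives more than `window` seconds after the group's first signal.

```python
from dataclasses import dataclass


@dataclass
class Signal:
    ts: float      # arrival time in seconds (hypothetical field)
    service: str   # service that emitted the signal
    message: str


def correlate(signals, window=300.0):
    """Group same-service signals that arrive within `window` seconds
    of the first signal in their group (illustrative rule only)."""
    open_groups = {}  # service -> group currently accepting signals
    groups = []
    for sig in sorted(signals, key=lambda s: s.ts):
        group = open_groups.get(sig.service)
        # Start a new incident candidate when none is open for this
        # service or the open one is older than the window.
        if group is None or sig.ts - group[0].ts > window:
            group = []
            groups.append(group)
            open_groups[sig.service] = group
        group.append(sig)
    return groups


signals = [
    Signal(0, "checkout", "OOMKilled"),
    Signal(50, "cart", "timeout"),
    Signal(120, "checkout", "restart"),
    Signal(400, "checkout", "OOMKilled"),
]
groups = correlate(signals, window=300)
```

With the default 300-second window, the two checkout signals at 0s and 120s land in one candidate, while the one at 400s opens a new candidate. A wider window would merge all three.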
Fingerprinting
Each unique signal event is assigned a fingerprint based on error message pattern, service name, namespace, and OTLP resource attributes. Recurring fingerprints (same pattern, different incident window) are deduplicated into a single incident stream rather than N duplicate alerts. Config changes are fingerprinted separately and indexed for correlation lookup — a Helm chart diff that appears before multiple OOMKill incidents will surface in the root cause heatmap.
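As a rough illustration of fingerprinting, the sketch below hashes a normalized message pattern together with the service name, namespace, and resource attributes. The normalization rules, the `fingerprint` function, and the 16-character truncation are assumptions made for this example, not the agent's actual algorithm.

```python
import hashlib
import json
import re


def normalize(message: str) -> str:
    """Reduce a message to a pattern by masking variable parts."""
    message = re.sub(r"0x[0-9a-fA-F]+", "<hex>", message)  # hex ids
    message = re.sub(r"\d+", "<n>", message)               # numbers
    return message.strip().lower()


def fingerprint(message, service, namespace, resource_attrs):
    """Hash the pattern plus identifying attributes (hypothetical scheme)."""
    payload = json.dumps(
        {
            "pattern": normalize(message),
            "service": service,
            "namespace": namespace,
            "attrs": resource_attrs,
        },
        sort_keys=True,  # stable key order, so equal inputs hash equally
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


attrs = {"k8s.node.name": "node-a"}
fp_a = fingerprint("Timeout after 30 ms", "cart", "shop", attrs)
fp_b = fingerprint("Timeout after 4500 ms", "cart", "shop", attrs)
fp_c = fingerprint("Timeout after 30 ms", "checkout", "shop", attrs)
```

Here `fp_a` and `fp_b` are equal because only the masked number differs, so both events would fold into the same incident stream. `fp_c` differs because the service name is part of the fingerprint.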
Topology matching
Infrawatch builds a live service dependency graph from your trace span attributes (service.name, service.namespace, span parent/child relationships). When signals arrive, they are resolved against this graph to determine upstream/downstream blast radius. A memory pressure event on a shared cache will automatically surface all dependent services in the incident view — even if those services haven't fired their own alerts yet.
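The blast-radius lookup can be sketched as a graph walk. The example below assumes a simplified span shape (`span_id`, `parent_id`, `service`) and treats a parent span's service as a dependent of the child span's service. The function names are hypothetical and the real graph is likely richer than this.

```python
from collections import defaultdict, deque


def build_dependents(spans):
    """Map each service to the set of services that call it,
    derived from span parent/child relationships."""
    service_of = {s["span_id"]: s["service"] for s in spans}
    dependents = defaultdict(set)
    for s in spans:
        parent_service = service_of.get(s["parent_id"])
        # Ignore root spans and calls that stay inside one service.
        if parent_service and parent_service != s["service"]:
            dependents[s["service"]].add(parent_service)
    return dependents


def blast_radius(dependents, service):
    """Return every service that depends on `service`, directly or transitively."""
    seen, queue = set(), deque([service])
    while queue:
        current = queue.popleft()
        for caller in dependents.get(current, ()):
            if caller not in seen and caller != service:
                seen.add(caller)
                queue.append(caller)
    return seen


spans = [
    {"span_id": 1, "parent_id": None, "service": "frontend"},
    {"span_id": 2, "parent_id": 1, "service": "checkout"},
    {"span_id": 3, "parent_id": 2, "service": "cache"},
    {"span_id": 4, "parent_id": 1, "service": "cart"},
    {"span_id": 5, "parent_id": 4, "service": "cache"},
]
graph = build_dependents(spans)
affected = blast_radius(graph, "cache")
```

For a signal on `cache`, `affected` contains `checkout`, `cart`, and `frontend`, which mirrors the shared-cache example: dependents are surfaced even though none of them emitted a signal.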
Configuration reference
| Key | Default | Description |
|---|---|---|
| config.apiKey | — | Required. Your Infrawatch API key. |
| config.clusterName | "default" | Label shown in the incident view for this cluster. |
| config.correlationWindow | 300 | Seconds to hold signals open for correlation. |
| config.dedupeWindow | 3600 | Seconds before a closed fingerprint can reopen. |
| config.otlpPort | 4317 | gRPC listener port. |
| config.otlpHttpPort | 4318 | HTTP/protobuf listener port. |
| agent.resources.limits.memory | "256Mi" | Container memory limit. |
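When overrides are scripted, the keys above can be turned into `--set` flags for the Helm command shown in Step 2. The helper below is a hypothetical convenience, not part of Infrawatch. It also assumes that the 30s–60m correlation window range applies to config.correlationWindow expressed in seconds.

```python
def helm_set_args(overrides):
    """Build a list of `--set key=value` arguments from a dict of chart values."""
    window = overrides.get("config.correlationWindow")
    # Assumed bounds: 30 seconds to 60 minutes, taken from the documented range.
    if window is not None and not 30 <= int(window) <= 3600:
        raise ValueError(
            "config.correlationWindow must be between 30 and 3600 seconds"
        )
    args = []
    for key, value in sorted(overrides.items()):
        args += ["--set", f"{key}={value}"]
    return args


args = helm_set_args(
    {"config.clusterName": "prod-us-east", "config.correlationWindow": 120}
)
```

The resulting list can be appended to a `helm install` or `helm upgrade` argument list. Rejecting out-of-range values before calling Helm is a design choice of this sketch.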