Architecture

This page outlines a pragmatic, operations-focused view of an observability stack.

High‑level layers

  1. Instrumentation (generation)
     • Application SDKs, auto‑instrumentation, and exporters emit logs/metrics/traces (L/M/T)
     • Standard protocols: OTLP over gRPC/HTTP; legacy inputs supported via receivers

  2. Ingestion/collection
     • Agents/DaemonSets/sidecars and/or a gateway OpenTelemetry Collector receive telemetry
     • Fan‑in from nodes, apps, infrastructure, cloud services

  3. Processing/enrichment
     • Pipelines apply sampling, filtering, redaction, resource detection, attribute transforms
     • Batching, retry, timeouts, queueing/backpressure control to stabilize flow
     • Routing by signal/tenant/team to specific backends

  4. Export/transport
     • OTLP over gRPC or HTTP to remote backends; TLS/mTLS and auth (API keys, OIDC)
     • Circuit breakers, exponential backoff, persistent queues for resilience

  5. Storage backends (by signal)
     • Metrics: Prometheus/Thanos/Mimir, VictoriaMetrics
     • Traces: Tempo/Jaeger/Elastic APM
     • Logs: Loki/Elasticsearch/OpenSearch
     • Sometimes a columnar data lake/warehouse for long‑term retention and cost control

  6. Query/visualization & alerting
     • Grafana/Kibana/Tempo/Jaeger UIs; SLOs and alert rules
     • Routing to Alertmanager, PagerDuty, Opsgenie, Slack, email

Where the OpenTelemetry Collector fits

The OTel Collector spans multiple layers:

  • Layer 2 (Ingestion/collection): receivers (otlp, prometheus, filelog, k8s events)
  • Layer 3 (Processing/enrichment): processors (batch, memory_limiter, attributes/resource, transform, tail_sampling, routing)
  • Layer 4 (Export/transport): exporters (otlp over gRPC, otlphttp, prometheusremotewrite, loki; Tempo/Jaeger are reached via OTLP), TLS/mTLS, retries/queues

It is not a storage backend or UI (does not cover layers 5 or 6).
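
Concretely, these three roles map onto the sections of a collector config file. A minimal single‑pipeline sketch (endpoints are placeholders):

receivers:          # layer 2: ingestion/collection
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
processors:         # layer 3: processing/enrichment
  memory_limiter:
    check_interval: 1s
    limit_mib: 512
  batch: {}
exporters:          # layer 4: export/transport
  otlp:
    endpoint: backend.example.com:4317   # placeholder backend address
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp]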

Reference view (Kubernetes + OpenTelemetry)

  • Workloads emit OTLP → local collector (DaemonSet agent or sidecar)
  • Optional gateway collector (centralized, stateless) → processes and routes signals
  • Backends per signal; Grafana on top for dashboards, logs, traces, exemplars

Apps/SDKs ──OTLP──> Node/Sidecar Collector ──OTLP──> Gateway Collector ──> Backends
                                              │                          ├─> Metrics TSDB
                                              │                          ├─> Traces Store
                                              │                          └─> Logs Store
                                              └── kube/system exporters (kube-state, cAdvisor)
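
For the first hop, SDKs are usually pointed at the local collector through the standard OTel environment variables. A Kubernetes container sketch, assuming a DaemonSet collector listening on each node (service name and port are illustrative):

env:
  - name: NODE_IP                        # node IP via the downward API
    valueFrom:
      fieldRef:
        fieldPath: status.hostIP
  - name: OTEL_EXPORTER_OTLP_ENDPOINT    # the node-local collector
    value: http://$(NODE_IP):4318
  - name: OTEL_EXPORTER_OTLP_PROTOCOL
    value: http/protobuf
  - name: OTEL_SERVICE_NAME
    value: checkout                      # hypothetical service name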

Collector deployment patterns

  • Sidecar: per‑pod isolation and per‑workload configuration; higher resource overhead
  • DaemonSet (agent): per‑node collector for all workloads; good default
  • Gateway: centralized fan‑in; enforces org‑wide policy (sampling, PII scrubbing)
  • Common to use Agent (DaemonSet) + Gateway for scale and control (sketch below)
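
A sketch of the agent half of that pattern, assuming a gateway Service named otel-gateway: the DaemonSet collector enriches telemetry with Kubernetes metadata and forwards everything upstream (metrics/logs pipelines would be analogous):

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 256
  k8sattributes: {}    # adds pod/namespace/node metadata (needs RBAC)
  batch: {}
exporters:
  otlp:
    endpoint: otel-gateway.observability.svc.cluster.local:4317  # hypothetical gateway
    tls:
      insecure: true   # in-cluster hop; use mTLS where policy requires it
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, k8sattributes, batch]
      exporters: [otlp]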

Signal‑specific notes

  • Metrics: prefer low‑cardinality labels; use histograms; remote write to long‑term TSDB
  • Traces: sampling strategies (tail‑based at the gateway for best value; head‑based at the source for low overhead); see the sketch after this list
  • Logs: structure at source (JSON); drop/trim noisy lines early; labels/indices budgeted
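
Tail‑based sampling at the gateway is a processor configuration. A sketch with illustrative policies (keep all errors, keep slow traces, sample 10% of the rest), wired into the gateway's traces pipeline:

processors:
  tail_sampling:
    decision_wait: 10s          # buffer spans this long before deciding per trace
    policies:
      - name: keep-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: keep-slow
        type: latency
        latency:
          threshold_ms: 500
      - name: baseline
        type: probabilistic
        probabilistic:
          sampling_percentage: 10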

Reliability and cost levers

  • Backpressure: memory_limiter processor plus exporter retry/queue settings (retry_on_failure, sending_queue) in the OTel Collector; see the fragment after this list
  • Batching: reduce connection churn and backend CPU
  • Redaction: attributes, redaction, and transform processors for PII/compliance
  • Multi‑route: split traffic by environment/tenant to different clusters/backends
  • Retention tiers: hot (short), warm (mid), cold (cheap/archival)
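
The backpressure and batching levers live mostly in exporter settings. A fragment that would extend the exporter stanzas in the sketches above (values are illustrative, not tuned recommendations):

exporters:
  otlp:
    endpoint: backend.example.com:4317   # placeholder backend address
    retry_on_failure:
      enabled: true
      initial_interval: 1s               # exponential backoff from here
      max_interval: 30s
      max_elapsed_time: 5m               # give up after this long
    sending_queue:
      enabled: true
      num_consumers: 4
      queue_size: 5000                   # buffered batches before dropping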

Minimal OTel pipeline (conceptual)

Receivers → processors → exporters, per signal:

receivers: otlp, prometheus, filelog
processors: memory_limiter, batch, attributes/resource, transform, tail_sampling
exporters: otlphttp(tempo), prometheusremotewrite(mimir), loki
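
One way to realize this as an actual collector config, assuming Tempo, Mimir, and Loki backends at placeholder addresses (transform and tail_sampling omitted for brevity; newer Loki versions also accept OTLP directly):

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
  prometheus:
    config:
      scrape_configs:
        - job_name: otel-collector      # scrape the collector's own metrics
          static_configs:
            - targets: ['localhost:8888']
  filelog:
    include: [/var/log/pods/*/*/*.log]
processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 512
  batch: {}
exporters:
  otlphttp:
    endpoint: https://tempo.example.com:4318
  prometheusremotewrite:
    endpoint: https://mimir.example.com/api/v1/push
  loki:
    endpoint: https://loki.example.com/loki/api/v1/push
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlphttp]
    metrics:
      receivers: [otlp, prometheus]
      processors: [memory_limiter, batch]
      exporters: [prometheusremotewrite]
    logs:
      receivers: [otlp, filelog]
      processors: [memory_limiter, batch]
      exporters: [loki]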

Ops best practices

  • Start with a single protocol (OTLP) end‑to‑end
  • Keep metrics cardinality in check; gate label additions
  • Make sampling an explicit decision (prove value, then tune)
  • Treat collectors as stateless and horizontally scalable
  • Version and test pipelines as code; lint configs; add golden queries/alerts

See also

  • OpenTelemetry Collector: pipelines and processors (receivers → processors → exporters)
  • OTLP protocol (gRPC/HTTP)
  • Backend options: Mimir/Thanos, Tempo/Jaeger, Loki/Elasticsearch