Elasticsearch Log Correlation: Connecting the Dots in Distributed Systems

Elasticsearch · Logging · Distributed Systems · Observability

When a request fails in a microservices architecture, the error often surfaces in one service while the root cause lives in another. Elasticsearch log correlation is how the dots get connected.

The Correlation ID Pattern

Every incoming request gets a unique correlation ID (UUID). This ID propagates through every service call: HTTP headers, Kafka message metadata, database audit columns. When searching Elasticsearch for a correlation ID, the entire request lifecycle across all services becomes visible.
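Assuming the correlation ID is indexed as a keyword field (the field name `correlation_id` here is illustrative), the lookup is a single term query, sorted by timestamp to reconstruct the request timeline:

```json
{
  "query": {
    "term": {"correlation_id": "4f2a0c1e-example-uuid"}
  },
  "sort": [{"@timestamp": "asc"}]
}
```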

A Spring Boot filter that reads X-Correlation-ID from incoming requests (or generates one if missing) and stores it in the MDC (Mapped Diagnostic Context) makes this automatic. Every log statement includes the correlation ID without any developer effort.
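The core of that filter is a few lines of header-or-generate logic. A minimal sketch (the Spring filter wiring and MDC calls are omitted; the class and method names are illustrative):

```java
import java.util.Map;
import java.util.UUID;

public class CorrelationId {
    static final String HEADER = "X-Correlation-ID";

    // Return the incoming correlation ID, or generate a fresh UUID if the
    // header is missing or blank.
    static String resolve(Map<String, String> headers) {
        String id = headers.get(HEADER);
        return (id == null || id.isBlank()) ? UUID.randomUUID().toString() : id;
    }
}
```

In a real filter, the resolved value would be placed in the MDC (e.g. `MDC.put("correlationId", id)`) before the filter chain proceeds, and removed in a `finally` block so pooled threads don't leak IDs between requests.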

The Time-Window Query

Correlation IDs find related logs for a single request, but systemic issues need a wider lens: when checkout latency spikes, you need every error across every service in that time window.

A standard investigation query:

{
  "size": 0,
  "query": {
    "bool": {
      "must": [
        {"range": {"@timestamp": {"gte": "now-15m", "lte": "now"}}},
        {"term": {"level": "ERROR"}}
      ]
    }
  },
  "aggs": {
    "by_service": {"terms": {"field": "service.name"}}
  }
}

This shows error counts per service in a recent window. The service with the most errors is usually where the problem started.

Common Mistakes

A common mistake is logging everything at DEBUG level, which generates enormous log volume. Elasticsearch ingestion cannot keep up, and queries over long time ranges time out. A better approach: structured logging with INFO as the default, DEBUG only enabled via feature flag per service, and aggressive index lifecycle management (hot-warm-cold architecture with short hot retention). This keeps log volume manageable and queries fast enough to use during incidents.
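An ILM policy implementing the hot-warm-cold idea might look like the following sketch. The phase timings and sizes are illustrative assumptions, not recommendations; tune them to your ingest rate and retention requirements:

```json
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {"max_age": "1d", "max_primary_shard_size": "50gb"}
        }
      },
      "warm": {
        "min_age": "2d",
        "actions": {
          "shrink": {"number_of_shards": 1},
          "forcemerge": {"max_num_segments": 1}
        }
      },
      "cold": {
        "min_age": "7d",
        "actions": {"set_priority": {"priority": 0}}
      },
      "delete": {
        "min_age": "30d",
        "actions": {"delete": {}}
      }
    }
  }
}
```

The short hot phase keeps recent, incident-relevant indices on the fastest hardware, while shrink and forcemerge in the warm phase cut the segment and shard overhead of older, read-only data.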