OpenTelemetry Configuration

1. Overview

EMS services emit traces, metrics, and (via a Logback appender) correlated logs to a central OpenTelemetry Collector running in the cluster’s observability namespace. The collector forwards to the observability backend (currently an in-house stack; future: Tempo/Loki/Prometheus or a vendor SaaS).

Two instrumentation paths:

  1. Java agent (opentelemetry-javaagent) — auto-instruments HTTP servers, JDBC, Hazelcast, Kafka, etc. at bytecode level. Baked into the service’s Docker image via Jib’s extraDirectories.

  2. SDK / starter (opentelemetry-spring-boot-starter) — enables manual @WithSpan annotations, @Counted, @Timed method-level instrumentation, and custom span creation via Tracer.

Both are active simultaneously in admin-service. Spring Web auto-instrumentation is disabled in admin-service’s application-otlp.yml — the agent would double-count if Spring Web also reported. Manual spans and JDBC spans cover the rest.

2. Topology

[Figure: otel-topology]

Collector endpoint in-cluster: http://opentelemetry-collector.observability.svc.cluster.local:4317 (gRPC) or :4318 (HTTP). Deployed via ~/dev/idl-xnl-jhb-rc01/argocd/opentelemetry-collector.yml.

3. Admin-Service Configuration

Full detail — this is the reference implementation.

3.1. Profile-gated

Active only when the otlp profile is enabled. The ArgoCD prod manifest sets config.profiles: "prod,kubernetes,otlp" (or similar; see ArgoCD Deployment Patterns).

3.2. POM configuration

The agent is always baked into the image (regardless of which Spring profile is active); the otlp profile only adds the Spring-side dependencies that activate application-otlp.yml.

In the main <build> section (always runs):

  • Maven plugin: maven-dependency-plugin copy execution, bound to the package phase. Downloads io.opentelemetry.javaagent:opentelemetry-javaagent:<version> into ${agent-extraction-root} (= ${project.build.directory}/jib-agents) as ${opentelemetry-javaagent-filename} (= opentelemetry-javaagent.jar).

  • Jib <extraDirectories> then copies that directory into the image at ${agent-install-location} (= /javaagent). The image always has /javaagent/opentelemetry-javaagent.jar regardless of profile.

  • Jib <jvmFlags> always include -javaagent:/javaagent/opentelemetry-javaagent.jar plus -Dotel.{logs,traces,metrics}.exporter=otlp.

  • Jib <environment> sets OTEL_SERVICE_NAME=${project.artifactId} so traces are tagged with the service name.

Under the otlp profile (Spring-side only):

  • BOM: io.opentelemetry.instrumentation:opentelemetry-instrumentation-bom:<version>

  • Dependency: io.opentelemetry.instrumentation:opentelemetry-spring-boot-starter

  • Dependency: io.opentelemetry.instrumentation:opentelemetry-instrumentation-annotations

  • Dependency: io.opentelemetry.instrumentation:opentelemetry-logback-appender-1.0:<version>-alpha

The agent operates at bytecode level and works without any Spring-side dependency. The Spring starter adds @WithSpan annotation support, the logback appender for trace correlation, and SDK-level autoconfiguration. With the otlp profile inactive, the agent is still loaded but it falls back to its own auto-instrumentation only — no Spring @WithSpan, no logback trace correlation.

Anti-pattern (do not copy from older internal scaffolds): downloading the agent into src/main/jib/opt/otel/ with Jib <extraDirectories> pointing at a different path. The path mismatch silently drops the agent from the image at build time; on startup the JVM cannot open the jar named by -javaagent: (HotSpot normally refuses to start in that case), so OTel emits nothing. Polluting the source tree is also wrong: agent jars belong in target/.

See Jib Docker Build § OTel javaagent for the full downloader + Jib XML.
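The anti-pattern above comes down to two strings drifting apart: the directory Jib copies the agent into, and the path named in the -javaagent: flag. A minimal sketch of that consistency rule follows, assuming a hypothetical helper class AgentPathCheck. It is not part of the EMS build; the real values come from the Maven properties agent-install-location and opentelemetry-javaagent-filename.

```java
// Hypothetical helper, for illustration only: not part of the EMS build.
final class AgentPathCheck {

    private AgentPathCheck() {
    }

    /** Builds the flag the JVM needs for an agent installed at installLocation/fileName. */
    static String expectedFlag(String installLocation, String fileName) {
        // Drop one trailing slash so "/javaagent/" and "/javaagent" give the same result.
        String dir = installLocation.endsWith("/")
                ? installLocation.substring(0, installLocation.length() - 1)
                : installLocation;
        return "-javaagent:" + dir + "/" + fileName;
    }

    /** True when the configured jvmFlag points at the file Jib actually copies into the image. */
    static boolean flagMatchesImage(String jvmFlag, String installLocation, String fileName) {
        return expectedFlag(installLocation, fileName).equals(jvmFlag);
    }
}
```

With the values from this section, expectedFlag("/javaagent", "opentelemetry-javaagent.jar") yields -javaagent:/javaagent/opentelemetry-javaagent.jar, while a flag that still points at /opt/otel/ fails the check.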

3.3. Runtime config (application-otlp.yml)

spring:
  jpa:
    properties:
      hibernate.generate_statistics: true            # admin-service only — drop for non-JPA services

management:
  metrics:
    export:
      otlp:
        enabled: true

otel:
  java:
    global-autoconfigure:
      enabled: true
  exporter:
    otlp:
      endpoint: 'http://opentelemetry-collector.observability.svc.cluster.local:4317'
    jaeger:
      enabled: false
    zipkin:
      enabled: false
  springboot:
    resource:
      enable: true
  resource:
    attributes:
      'service.version': '${project.version}'
      'deployment.environment': production
  instrumentation:
    annotations:
      enabled: true
    logback-appender.enabled: true
    spring-web.enabled: false
    spring-webmvc.enabled: false
    spring-webflux.enabled: false

Key decisions:

  • global-autoconfigure: true — picks up Spring Boot auto-config

  • Jaeger and Zipkin exporters disabled — we only export OTLP to the collector

  • Resource attributes include service.version (from project.version) and deployment.environment — both surfaced in the backend UI for filtering

  • Spring Web/WebMVC/WebFlux auto-instrumentation disabled — the javaagent already instruments these at bytecode level; keeping the Spring-SDK version enabled produces duplicate spans

  • Logback appender enabled — every log line emitted through Logback carries the current trace/span context so logs correlate in the backend

3.4. Javaagent at runtime

Jib bakes the javaagent at /javaagent/opentelemetry-javaagent.jar (the path is parameterised by the Maven properties agent-install-location and opentelemetry-javaagent-filename so it stays consistent across services). The container’s jvmFlags include:

<jvmFlag>-javaagent:${agent-install-location}/${opentelemetry-javaagent-filename}</jvmFlag>
<jvmFlag>-Dotel.logs.exporter=otlp</jvmFlag>
<jvmFlag>-Dotel.traces.exporter=otlp</jvmFlag>
<jvmFlag>-Dotel.metrics.exporter=otlp</jvmFlag>

Plus environment:

<environment>
    <OTEL_SERVICE_NAME>${project.artifactId}</OTEL_SERVICE_NAME>
</environment>

The exporter selectors (-Dotel.{logs,traces,metrics}.exporter=otlp) are needed because the agent’s default behaviour for some signal types changed across versions. Setting them explicitly avoids surprise.

Customisation (sampling rate, instrumentation toggles, custom resource attributes) is done via env vars (OTEL_TRACES_SAMPLER=parentbased_traceidratio, OTEL_TRACES_SAMPLER_ARG=0.1, etc.) injected by the Helm chart from the ArgoCD valuesObject — not via an agent.properties file. The properties-file approach is supported by the agent but adds an extra config artefact for no benefit.
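For intuition on what OTEL_TRACES_SAMPLER=parentbased_traceidratio with OTEL_TRACES_SAMPLER_ARG=0.1 means, the sketch below derives a ratio decision deterministically from the trace ID, so every service that sees the same trace ID reaches the same verdict. It is a simplified illustration, not the SDK's actual code. The class name RatioSamplerSketch is hypothetical, and the parent-based part (follow the caller's sampled flag when a parent exists) is left out.

```java
// Simplified illustration of trace-ID-ratio sampling; not the OpenTelemetry SDK implementation.
final class RatioSamplerSketch {

    private final double ratio;
    private final long idUpperBound;

    RatioSamplerSketch(double ratio) {
        if (Double.isNaN(ratio) || ratio < 0.0 || ratio > 1.0) {
            throw new IllegalArgumentException("ratio must be in [0.0, 1.0]");
        }
        this.ratio = ratio;
        // Scale the ratio onto the non-negative 63-bit range.
        this.idUpperBound = (long) (ratio * Long.MAX_VALUE);
    }

    /** Decides from the trace ID alone, so the result is stable across services. */
    boolean shouldSample(String traceIdHex) {
        if (traceIdHex == null || traceIdHex.length() != 32) {
            throw new IllegalArgumentException("trace ID must be 32 hex characters");
        }
        if (ratio == 0.0) {
            return false;
        }
        if (ratio == 1.0) {
            return true;
        }
        // Use the low 64 bits of the trace ID as the pseudo-random input.
        long low64 = Long.parseUnsignedLong(traceIdHex.substring(16), 16);
        // Clear the sign bit to get a value in [0, Long.MAX_VALUE].
        long value = low64 & Long.MAX_VALUE;
        return value < idUpperBound;
    }
}
```

Because the verdict depends only on the trace ID and the ratio, a trace is either kept whole or dropped whole as long as every hop uses the same ratio.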

4. Registration-portal and admin-portal

registration-portal currently has no OTel configured: otlp profile absent, no dependencies, no agent. This is a known gap; adding it is a backlog item.

admin-portal should launch with OTel enabled from day one. Clone admin-service’s otlp profile config:

  • Same POM dependencies

  • Same application-otlp.yml — except service.version adjusted per-service

  • Same Jib extraDirectory for the javaagent

  • Same jvmFlag -javaagent:/javaagent/opentelemetry-javaagent.jar

Gateway-specific spans worth adding manually:

  • POST /api/session/tenant — wrap the token-exchange call in a span with attributes user.sub, tenant.requested, tenant.current-before

  • AdminServiceJwtRelayFilter — wrap the proxy call in a span with admin-service.endpoint attribute

  • TenantResolutionFilter — a short span identifying the resolution source (domain / header / session)

These make debugging multi-tenant auth issues tractable in trace view.

5. Metrics

Micrometer + OTel bridge emit the standard JVM + HTTP + Hazelcast metrics. Additional EMS-custom metrics live in admin-service/src/main/java/…​/config/MetricsConfiguration.java. OtlpMetricsNamingConvention keeps names dot-separated (OTel style) rather than underscore-separated (Prometheus style).

Selected metrics:

  • http.server.requests — rate, p95/p99 latency, status-code distribution per URI (automatic)

  • jdbc.connections.active / jdbc.connections.max — HikariCP pool state

  • hazelcast.partition.is-migrating — cluster rebalance indicator

  • jvm.memory.used / jvm.gc.pause — standard JVM

  • Custom: ems.import.duration / ems.import.rows — import-specific timers, see ImportAsyncConfiguration
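To make the dot-separated versus underscore-separated distinction concrete, the sketch below maps metric names like those listed above to a rough Prometheus-style form. The class MetricNameStyles is hypothetical and deliberately simplified: a real Prometheus naming convention also appends unit and type suffixes, which are omitted here. It is not the code in OtlpMetricsNamingConvention.

```java
// Hypothetical illustration of the two naming styles; not the EMS naming convention class.
final class MetricNameStyles {

    private MetricNameStyles() {
    }

    /** OTel style: the dot-separated name is kept unchanged. */
    static String otelStyle(String name) {
        return name;
    }

    /** Rough Prometheus style: every character outside [a-zA-Z0-9_] becomes an underscore. */
    static String prometheusStyle(String name) {
        return name.replaceAll("[^a-zA-Z0-9_]", "_");
    }
}
```

For example, prometheusStyle("hazelcast.partition.is-migrating") gives hazelcast_partition_is_migrating, whereas otelStyle leaves the name as emitted.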

Dashboards live in the observability backend; owner: Solution Architect / Ops.

6. Tracing Patterns

6.1. Business-flow spans

Group multiple API calls that belong to the same user journey under a "business flow" span. See design-journal/2026-03/end-to-end-distributed-tracing.adoc for the design. Pattern:

import io.opentelemetry.instrumentation.annotations.WithSpan;

@WithSpan("membership-registration")
public void registerMembership(...) {
    // child spans from the agent's auto-instrumentation (HTTP, JDBC) roll up under this span
}

Useful for showing "registration took 3.4s" with breakdown across the participant, payment, and email sub-operations.

6.2. W3C traceparent propagation

Frontend → gateway → admin-service all propagate the W3C traceparent header. registration-portal’s interceptor (when OTel lands for it) should extract any existing trace from the browser’s performance-navigation entries and attach to it; otherwise it should generate a new root.

Cross-cluster propagation (e.g. admin-service → WordPress → RunSignup) honours the same convention where supported.
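The traceparent header has a fixed layout in the W3C Trace Context specification: a version, a 32-hex-character trace ID, a 16-hex-character parent span ID, and 2 hex characters of flags, joined by hyphens. The sketch below illustrates the "attach to an existing trace, otherwise start a new root" rule. The TraceParent class is hypothetical and handles only the version 00 layout; in the services themselves the agent's propagators do this work, and nothing here is EMS code.

```java
import java.security.SecureRandom;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of W3C traceparent handling (version 00 layout only).
final class TraceParent {

    private static final Pattern VERSION_00 =
            Pattern.compile("00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})");
    private static final SecureRandom RANDOM = new SecureRandom();

    final String traceId;
    final String parentId;
    final boolean sampled;

    private TraceParent(String traceId, String parentId, boolean sampled) {
        this.traceId = traceId;
        this.parentId = parentId;
        this.sampled = sampled;
    }

    /** Returns the parsed header, or empty when it is missing or malformed. */
    static Optional<TraceParent> parse(String header) {
        if (header == null) {
            return Optional.empty();
        }
        Matcher m = VERSION_00.matcher(header.trim());
        if (!m.matches()) {
            return Optional.empty();
        }
        String traceId = m.group(1);
        String parentId = m.group(2);
        // All-zero IDs are invalid per the specification.
        if (isAllZeros(traceId) || isAllZeros(parentId)) {
            return Optional.empty();
        }
        // Bit 0 of the flags byte is the "sampled" flag.
        boolean sampled = (Integer.parseInt(m.group(3), 16) & 0x01) != 0;
        return Optional.of(new TraceParent(traceId, parentId, sampled));
    }

    /** Starts a new root trace; marking it sampled is an assumption of this sketch. */
    static TraceParent newRoot() {
        return new TraceParent(randomHex(16), randomHex(8), true);
    }

    /** Continues the incoming trace when the header is valid, otherwise starts a new root. */
    static TraceParent continueOrStart(String incomingHeader) {
        return parse(incomingHeader).orElseGet(TraceParent::newRoot);
    }

    /** Serialises back to header form; only the sampled flag bit is preserved. */
    String toHeader() {
        return "00-" + traceId + "-" + parentId + "-" + (sampled ? "01" : "00");
    }

    private static boolean isAllZeros(String hex) {
        return hex.chars().allMatch(c -> c == '0');
    }

    private static String randomHex(int byteCount) {
        byte[] bytes = new byte[byteCount];
        StringBuilder sb = new StringBuilder(byteCount * 2);
        do {
            RANDOM.nextBytes(bytes);
            sb.setLength(0);
            for (byte b : bytes) {
                sb.append(String.format("%02x", b & 0xff));
            }
        } while (isAllZeros(sb.toString())); // regenerate in the unlikely all-zero case
        return sb.toString();
    }
}
```

A real propagator would also mint a fresh span ID for each new child span before forwarding the header; the sketch only shows parsing and the continue-or-start decision.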

7. Logs-to-Traces Correlation

opentelemetry-logback-appender-1.0 emits every log line with the current trace_id and span_id as structured attributes. In the backend UI, clicking a trace shows the correlated log lines. In kubectl logs, the trace/span IDs appear in the log pattern (%X{trace_id}, %X{span_id} via MDC).

No action needed per-log-line; the appender handles it globally when active.

8. Sampling

Current default: 100% (all traces exported). Low volume; backend handles it. When volume grows:

  • Head-based sampling at the collector: the decision is made up front, before the trace's outcome is known, so it cannot depend on errors or total duration. Use it to drop about 90% of low-interest traces (health checks, readiness probes).

  • Tail-based sampling: the collector decides after collecting the full trace, based on total duration and error status. Use it to keep 100% of error traces and a high percentage of slow traces.

Tune at the collector, not at the service. Services always emit; the collector filters.
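The tail-based decision can be sketched as a pure function over a finished trace's summary. Everything in the block is an assumption made for illustration: the class name, the slow threshold, and the baseline ratio. For simplicity it keeps every slow trace rather than a percentage of them. In the cluster this logic would live in collector configuration, not in service code.

```java
import java.time.Duration;

// Hypothetical illustration of a tail-sampling policy; real filtering belongs in the collector.
final class TailSamplingSketch {

    private final Duration slowThreshold;
    private final double baselineRatio;

    TailSamplingSketch(Duration slowThreshold, double baselineRatio) {
        if (baselineRatio < 0.0 || baselineRatio > 1.0) {
            throw new IllegalArgumentException("baselineRatio must be in [0.0, 1.0]");
        }
        this.slowThreshold = slowThreshold;
        this.baselineRatio = baselineRatio;
    }

    /**
     * Decides whether to keep a completed trace.
     *
     * @param hasError      true when any span in the trace ended with an error status
     * @param totalDuration end-to-end duration of the trace
     * @param roll          uniform random value in [0.0, 1.0), passed in so the decision is testable
     */
    boolean keep(boolean hasError, Duration totalDuration, double roll) {
        if (hasError) {
            return true; // always keep error traces
        }
        if (totalDuration.compareTo(slowThreshold) >= 0) {
            return true; // keep traces at or above the slow threshold
        }
        return roll < baselineRatio; // keep only a baseline share of everything else
    }
}
```

With a 2-second threshold and a 0.1 baseline, a 3.4 s registration trace is always kept, while a 50 ms health check survives only when its roll falls below 0.1.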

9. Known Gaps

  • registration-portal lacks OTel — backlog item; not a blocker but reduces end-to-end trace visibility.

  • Frontend instrumentation — browser-originated OTLP proxied via gateway is designed (see design-journal/2026-03/end-to-end-distributed-tracing.adoc) but not implemented. Adds browser-to-admin-service trace root.

  • No SLO tracking — metrics exist but service-level objectives are not formalised. Future work.

10. Reference

File | Role
admin-service/src/main/resources/config/application-otlp.yml | OTel runtime config for admin-service
admin-service/src/main/java/…​/config/MetricsConfiguration.java | Custom metrics registry + OTel bridge
admin-service/src/main/java/…​/config/apidoc/OtlpMetricsNamingConvention.java | OTel-style naming convention
admin-service/pom.xml (main build + otlp profile) | javaagent download and Jib config (main build); Spring-side dependencies (otlp profile)
~/dev/idl-xnl-jhb-rc01/argocd/opentelemetry-collector.yml | OTel Collector ArgoCD Application
design-journal/2026-03/end-to-end-distributed-tracing.adoc | Cross-cutting tracing design including frontend

11. Change History

Date | Change
2026-04-24 | Initial draft. Grounded in application-otlp.yml and admin-service otlp Maven profile.