We're excited to announce that Nexus now provides complete observability through OpenTelemetry integration. Metrics arrived in version 0.3.5, distributed tracing followed in 0.4.0, and logs export landed in 0.4.1, so Nexus now delivers monitoring, tracing, and logging capabilities, all through industry-standard OpenTelemetry protocols.
OpenTelemetry provides a vendor-neutral, standardized approach to observability that integrates seamlessly with your existing monitoring stack. Whether you're using Prometheus, Grafana Cloud, Datadog, or AWS CloudWatch, Nexus's telemetry data flows directly into your preferred backend with minimal configuration.
Getting started requires just a few lines in your `nexus.toml`:
```toml
[telemetry]
service_name = "nexus-production"

[telemetry.exporters.otlp]
enabled = true
endpoint = "http://otel-collector:4317"
protocol = "grpc"
timeout = 10000

# Optional: fine-tune trace sampling
[telemetry.traces]
sample_rate = 0.1 # Sample 10% of requests
```
This configuration enables all three observability signals—metrics, traces, and logs—sending them to your OpenTelemetry collector for processing.
Our metrics implementation follows OpenTelemetry semantic conventions, providing standardized measurements across three key areas:
Track every interaction with language models through metrics like `gen_ai.client.operation.duration` and token usage counters. Monitor time-to-first-token for streaming responses, track token consumption by model and client, and identify performance bottlenecks across providers.
Measure tool performance with `mcp.tool.call.duration` and related metrics. Understand which tools are most frequently used, monitor success rates and error patterns, and track search operations with keyword and result count attributes.
Keep tabs on HTTP request latency with `http.server.request.duration` and Redis operation health. Monitor connection pool utilization, track rate limiting impact, and ensure optimal backend performance.
Learn more about available metrics and their attributes in our metrics documentation.
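Duration metrics like these are exported as OpenTelemetry histograms, so percentiles are estimated from bucket counts on the backend rather than computed exactly. A minimal sketch of that estimation, with made-up bucket boundaries and counts for illustration:

```python
def estimate_percentile(bounds, counts, q):
    """Estimate the q-th percentile (0..1) from explicit histogram buckets.

    bounds: upper bucket boundaries in seconds, e.g. [0.1, 0.5, 1.0];
    counts: observations per bucket, with one extra entry for the
    overflow bucket (observations above the last boundary).
    """
    total = sum(counts)
    target = q * total
    cumulative = 0
    for i, count in enumerate(counts):
        if cumulative + count >= target and count > 0:
            lower = bounds[i - 1] if i > 0 else 0.0
            # The overflow bucket has no upper bound; clamp to the last boundary.
            upper = bounds[i] if i < len(bounds) else bounds[-1]
            # Linear interpolation within the bucket.
            return lower + (upper - lower) * (target - cumulative) / count
        cumulative += count
    return bounds[-1]

# Example: 80 requests under 100ms, 15 under 500ms, 5 under 1s, none above.
p95 = estimate_percentile([0.1, 0.5, 1.0], [80, 15, 5, 0], 0.95)  # 0.5 seconds
```

This is the same interpolation most backends (e.g. Prometheus's `histogram_quantile`) apply, which is why choosing bucket boundaries near your latency targets matters.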
Version 0.4.0 introduced distributed tracing that visualizes request flows across your entire AI infrastructure. Each request generates a hierarchical span structure showing:
- HTTP Request Spans: Root spans capturing method, route, status, and client identification
- MCP Operation Spans: Detailed tool interaction tracking with authentication context
- LLM Operation Spans: Model parameters, token usage, and response details
- Redis Operation Spans: Rate limiting checks and connection pool metrics
Configure sampling rates from 0.0 to 1.0 to balance observability with overhead:
```toml
[telemetry.traces]
sample_rate = 0.05 # 5% sampling for production
max_events_per_span = 128
max_attributes_per_span = 128
```
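Head-based sampling makes the keep/drop decision once, when a trace starts. A simplified sketch modeled on OpenTelemetry's `TraceIdRatioBased` sampler (how Nexus decides internally is an implementation detail, but deriving the decision from the trace ID keeps it consistent across services):

```python
import random

def trace_id_ratio_sample(trace_id: int, sample_rate: float) -> bool:
    """Deterministic head sampling: the same trace ID always yields the
    same decision, so every service handling a request agrees on it.
    Compares the low 64 bits of the 128-bit trace ID against a threshold."""
    threshold = int(sample_rate * (1 << 64))
    return (trace_id & ((1 << 64) - 1)) < threshold

# With sample_rate = 0.05, roughly 5% of random trace IDs are kept.
random.seed(0)
kept = sum(trace_id_ratio_sample(random.getrandbits(128), 0.05)
           for _ in range(10_000))
```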
The tracing system maintains W3C trace context propagation, ensuring spans remain connected across service boundaries. Explore configuration options in our tracing guide.
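W3C trace context travels in the `traceparent` HTTP header, whose format is fixed by the spec: a version, a 16-byte trace ID, an 8-byte parent span ID, and trace flags, all hex-encoded and dash-separated. A small parsing sketch (the header value below is the example from the W3C specification):

```python
from dataclasses import dataclass

@dataclass
class TraceParent:
    version: str
    trace_id: str   # 32 hex chars (16 bytes)
    parent_id: str  # 16 hex chars (8 bytes)
    flags: str      # 2 hex chars; bit 0 = sampled

def parse_traceparent(header: str) -> TraceParent:
    version, trace_id, parent_id, flags = header.split("-")
    assert len(trace_id) == 32 and len(parent_id) == 16
    return TraceParent(version, trace_id, parent_id, flags)

tp = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
sampled = int(tp.flags, 16) & 0x01 == 1  # the downstream service honors this bit
```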
The latest release completes the observability trinity with structured log export. Every log entry is automatically enriched with trace correlation when emitted within an active span:
- Automatic Trace Correlation: Logs include trace_id and span_id for request tracking
- Rich Attributes: Source location, module path, and custom resource attributes
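Concretely, a correlated log record carries the same trace and span IDs as the span that was active when it was emitted. The sketch below follows OTLP/JSON field naming; the body and attribute values are illustrative, not an exact schema:

```json
{
  "severityText": "INFO",
  "body": "tool call completed",
  "traceId": "4bf92f3577b34da6a3ce929d0e0e4736",
  "spanId": "00f067aa0ba902b7",
  "attributes": {
    "code.filepath": "src/mcp/tools.rs",
    "code.namespace": "nexus::mcp"
  }
}
```

Matching `traceId` against a trace in your tracing backend is what lets you jump from a single log line to the full request that produced it.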
Control log verbosity and export separately if needed:
```toml
# Optional: separate log export endpoint
[telemetry.logs.exporters.otlp]
enabled = true
endpoint = "http://loki-otlp:4318"
protocol = "http"
```
Detailed configuration options are available in our logs documentation.
All telemetry features are designed for production use:
- Compile-time optimization: Zero overhead when telemetry is disabled
- Efficient batching: Asynchronous export with configurable batch settings
- Smart defaults: Delta temporality histograms and automatic high-cardinality limiting
- Flexible backends: Works with any OTLP-compatible collector
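Delta temporality means each export reports only the change since the previous export, rather than a running total since process start, which keeps points small and makes restarts easy to handle. The two views are interchangeable via a running difference (a toy sketch):

```python
from itertools import accumulate

def cumulative_to_delta(points):
    """Convert cumulative counter readings (running totals) into per-interval deltas."""
    return [curr - prev for prev, curr in zip([0] + points, points)]

# A counter read at four export intervals, cumulative vs. delta view:
cumulative = [10, 25, 25, 40]
delta = cumulative_to_delta(cumulative)       # [10, 15, 0, 15]
assert list(accumulate(delta)) == cumulative  # the two views are equivalent
```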
These observability features provide the foundation for advanced capabilities we're building, including automatic anomaly detection, cost tracking per model and client, and intelligent alerting based on usage patterns. The OpenTelemetry integration ensures Nexus fits seamlessly into your existing observability stack while providing AI-specific insights that generic monitoring tools miss.
To get started with the latest version:
```bash
docker pull ghcr.io/grafbase/nexus:stable
```