Before we wrote a single line of AmberTrace, we spent three months talking to engineering teams building AI products in production. Startups, scaleups, internal platform teams. Over 30 conversations.
We kept hearing the same two problems.
Problem 1: LLM Costs That Nobody Can Explain
Every team that had shipped an LLM feature to production had, at some point, opened their OpenAI invoice and felt a sinking feeling.
Not because the number was necessarily catastrophic — though sometimes it was — but because they couldn't explain it. The provider's billing dashboard tells you how much you spent. It doesn't tell you which endpoint, which user, which model, or which background job was responsible.
One team we talked to had a retry loop silently hammering the API for 40 minutes after a rate-limit event. They found out at the end of the month. Another had a system prompt that had grown from 400 to 1,400 tokens over six weeks of incremental edits, more than tripling the fixed input-token overhead on every single call. Nobody noticed until the bill arrived.
The underlying issue isn't carelessness. It's that there's no standard instrumentation for LLM cost attribution. Traditional APM tools track HTTP latency and database queries. They don't track tokens per endpoint, model routing decisions, or cost per user session.
So teams fly blind — and discover problems after they've already cost money.
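To make that concrete: without standard instrumentation, cost attribution means hand-rolling something like the sketch below against each provider's SDK. This uses OpenAI's Python client; the model name and per-million-token prices are illustrative placeholders, not current rates.

```python
from openai import OpenAI

client = OpenAI()

# Illustrative prices per million tokens; check your provider's pricing page.
PRICE_PER_M_INPUT = 2.50
PRICE_PER_M_OUTPUT = 10.00

def chat_with_cost(endpoint: str, messages: list[dict]) -> tuple[str, float]:
    """Call the model and attribute the spend to a logical endpoint."""
    response = client.chat.completions.create(model="gpt-4o", messages=messages)
    usage = response.usage
    cost = (usage.prompt_tokens * PRICE_PER_M_INPUT
            + usage.completion_tokens * PRICE_PER_M_OUTPUT) / 1_000_000
    # In practice this print becomes a metric, a log line, or a spreadsheet,
    # and it has to be repeated for every provider and every call site.
    print(f"{endpoint}: {usage.prompt_tokens} in / "
          f"{usage.completion_tokens} out -> ${cost:.6f}")
    return response.choices[0].message.content, cost
```

Multiply that by every provider, every model, and every call site, and it's clear why most teams never build it.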
Problem 2: AI Agents Are a Black Box in Production
The second problem is subtler, but in some ways more serious as the industry moves toward agentic architectures.
An AI agent doesn't just make one API call. It reasons, decides which tools to call, executes steps, evaluates its own output, and loops. By the time it returns a result, there may have been 15 model calls involved. Possibly more.
When that agent produces a wrong answer or takes an unexpected action in production, where do you start debugging?
The application logs show inputs and outputs. Everything in between — the intermediate reasoning, the tool invocations, the model decisions at each step — is invisible unless you've deliberately built tracing for it. Most teams haven't, because building it from scratch is expensive and there's been no standard approach.
The result: teams are shipping increasingly complex agentic systems with essentially zero visibility into their runtime behavior.
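For a sense of what "deliberately built tracing" entails, here is a minimal sketch of hand-instrumenting a single tool call with the OpenTelemetry Python API. The tool registry and attribute names are illustrative; a real agent needs this for every reasoning step, retry, and nested call.

```python
from opentelemetry import trace

tracer = trace.get_tracer("agent")

# Toy registry standing in for your agent's real tools.
TOOLS = {"search": lambda query: f"results for {query!r}"}

def run_tool(tool_name: str, args: dict):
    # One hand-written span per tool invocation, with whatever attributes
    # you remembered to record. Reasoning steps and model calls still need
    # their own spans, plus error handling and parent/child wiring.
    with tracer.start_as_current_span(f"tool.{tool_name}") as span:
        span.set_attribute("tool.name", tool_name)
        span.set_attribute("tool.args", repr(args))
        result = TOOLS[tool_name](**args)
        span.set_attribute("tool.result.preview", str(result)[:200])
        return result
```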
Why Existing Tools Don't Solve This
The obvious question is: why not use Datadog, Honeycomb, or one of the LLM-specific tools that have emerged?
General-purpose APM tools are excellent at what they do, but they're built around traditional distributed systems primitives — HTTP spans, database queries, function timing. Plugging LLM calls in gives you latency data. It doesn't give you token counts, cost attribution by endpoint, or agent execution traces with tool call trees.
The newer LLM-specific tools we evaluated mostly fell into one of two categories: they required wrapping your entire LLM client in a proprietary SDK, or they were tightly coupled to a specific framework like LangChain. Neither felt right. You shouldn't have to rewrite your application to observe it, and you shouldn't be locked into a vendor-specific standard when an open one already exists.
Why We Built on OpenTelemetry
OpenTelemetry is the industry-standard framework for distributed tracing and observability. It's vendor-neutral, widely supported, and already the default for most modern backend services.
Our core thesis: LLM observability should extend OpenTelemetry, not replace it.
This is the decision that shaped everything about how AmberTrace works. Instead of a proprietary instrumentation layer, AmberTrace hooks into the OpenTelemetry layer that many teams already have. That means:
- One line of setup — no changes to your existing LLM call logic (see the sketch after this list)
- LLM traces live alongside your other telemetry, in tools your team already uses
- No vendor lock-in — standard OTEL spans that any compatible backend can consume
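In practice, the intended shape is roughly the sketch below. To be clear, the entrypoint name and options here are hypothetical illustrations of the one-line setup, not AmberTrace's documented API.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider

import ambertrace  # the package from `pip install ambertrace`

# Your existing OpenTelemetry setup, unchanged.
trace.set_tracer_provider(TracerProvider())

# Hypothetical one-line hook: attach to the provider you already configured.
ambertrace.instrument()

# From here, existing OpenAI / Anthropic / Google calls emit standard
# OTEL spans into whatever backend your provider already exports to.
```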
We talked to enough engineering teams to know that "add another monitoring sidecar" is a hard sell. "Extend the observability stack you already have" is a much easier one.
What AmberTrace Gives You
Three things that are otherwise painful to get:
Per-endpoint LLM cost tracking — in real time. Not in your monthly invoice. Right now, broken down by endpoint, model, and user. Cost anomalies surface in minutes, not weeks.
Full agent execution traces. Every reasoning step, every tool call, every model decision — captured as structured, queryable spans. When something breaks, you see exactly what happened, not just the final output.
Zero-code instrumentation. `pip install ambertrace`, configure once, and your existing OpenAI, Anthropic, and Google API calls are automatically traced. No wrapper classes, no SDK migration.
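To make those three concrete, here is roughly what one auto-captured span's attributes could look like. The attribute names follow OpenTelemetry's still-evolving GenAI semantic conventions where they exist; the values and the endpoint/user fields are illustrative.

```python
# Illustrative attributes on a single auto-captured LLM span.
llm_span_attributes = {
    "gen_ai.system": "openai",            # provider
    "gen_ai.request.model": "gpt-4o",     # model routing decision
    "gen_ai.usage.input_tokens": 1412,    # feeds real-time cost tracking
    "gen_ai.usage.output_tokens": 286,
    "http.route": "/api/summarize",       # ties spend back to an endpoint
    "enduser.id": "user_8241",            # and to a user session
}
```

Because these are plain OTEL attributes, cost per endpoint is just a query over spans in whatever backend you already run.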
Who This Is For
AmberTrace is built for AI/ML engineers and tech leads at startups shipping LLM-powered features to production — whether that's a RAG pipeline, a customer-facing assistant, a document workflow, or a multi-step agent.
If you're still prototyping, it's probably not the right moment. If you're running LLM workloads in production and either can't explain your bill or can't debug your agents, that's exactly where AmberTrace helps.
We've talked to over 30 teams building in this space. The SDK is live on GitHub. If this resonates — try it. Setup takes about 15 minutes.
AmberTrace is an OpenTelemetry-native LLM observability platform. Zero-code instrumentation for OpenAI, Anthropic, and Google APIs.