What is AI agent observability and why does it matter?

AI agent observability is the practice of recording, visualizing, and analyzing every step an LLM-powered agent takes — which prompts were sent, what tool calls were made, how long each step took, what tokens were consumed, and whether the final output matched intent. Without it, debugging a multi-step agent that fails 3% of the time is guesswork. With it, you can replay the exact trace, see where the agent hallucinated or took a wrong branch, and fix the root cause. The term maps closely to distributed tracing in traditional software engineering, with extra dimensions for token cost and model evaluation.
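To make this concrete, here is a minimal sketch of what one recorded trace might look like as data. The `Span` class, its field names, and the sample values are hypothetical, chosen only to mirror the dimensions listed above (step, latency, tokens, failure); real platforms each use their own schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    """One recorded step of an agent run (hypothetical schema)."""
    name: str                     # e.g. "llm:plan" or "tool:search"
    kind: str                     # "llm" or "tool"
    latency_ms: float
    input_tokens: int = 0
    output_tokens: int = 0
    error: Optional[str] = None   # set when the step failed


def summarize(trace: list) -> dict:
    """Aggregate the per-step numbers an observability UI would typically show."""
    slowest = max(trace, key=lambda s: s.latency_ms)
    return {
        "steps": len(trace),
        "total_tokens": sum(s.input_tokens + s.output_tokens for s in trace),
        "total_latency_ms": sum(s.latency_ms for s in trace),
        "slowest_step": slowest.name,
        "failed_steps": [s.name for s in trace if s.error],
    }


# Illustrative values only, not measurements.
trace = [
    Span("llm:plan", "llm", 820.0, input_tokens=410, output_tokens=95),
    Span("tool:search", "tool", 1340.0, error="timeout"),
    Span("llm:synthesize", "llm", 1105.0, input_tokens=1200, output_tokens=310),
]
summary = summarize(trace)
print(summary)
```

Replaying a failing run then amounts to walking the list in order and inspecting the step named in `failed_steps` or `slowest_step`.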

What is the difference between LangSmith and Langfuse?

LangSmith is made by the LangChain team, so it has the deepest native integration for LangChain and LangGraph agents, with zero extra instrumentation code. Langfuse is framework-agnostic and open-source (MIT), with a self-host option that is free forever. LangSmith has a better prompt playground and dataset management UX; Langfuse has better pricing for teams who want to self-host or use a non-LangChain stack. In 2026, both support OpenTelemetry for cross-framework compatibility, so the choice usually comes down to: "Are you on LangChain?" (LangSmith) vs. "Do you need self-hosting, or do you use a different framework?" (Langfuse).

Does AI agent observability work with OpenTelemetry (OTel)?

Several platforms now support OpenTelemetry as their primary or secondary ingestion path: Langfuse, Arize Phoenix, Traceloop (OpenLLMetry), Braintrust, and HoneyHive all accept OTLP trace data. The OpenTelemetry GenAI semantic conventions (gen_ai.* attributes) define standard fields for model, input tokens, output tokens, and model responses. Traceloop's OpenLLMetry library was specifically designed as an OTel-native instrumentation layer for LLM frameworks. Using OTel means your traces are portable — you can switch backends without re-instrumenting your code.

Can I self-host an LLM observability platform for free?

Yes. There are two strong options: Langfuse (MIT license) and Arize Phoenix (open-source). Self-hosting Langfuse requires Docker, Postgres, and ClickHouse; setup takes about 30-60 minutes, and there are no per-trace fees once it is running. Arize Phoenix can run locally with a single pip install and no external dependencies, which suits local development and research. For production self-hosting at scale (>1M traces/month), both need careful infrastructure planning (ClickHouse for Langfuse, a scaled Phoenix Server for Phoenix). LangSmith offers a self-hosted "LangSmith Server", but it is not free and requires a commercial license.

What is the best observability tool for LangChain agents?

LangSmith is the default recommendation for LangChain agents: set two environment variables (LANGCHAIN_TRACING_V2=true and LANGCHAIN_API_KEY) and every LangChain call is automatically traced with nested span visualization, token counts, and latency. No SDK wrapping is needed. The downsides are the small free tier (5k traces/month) and the tight coupling: if you ever migrate away from LangChain, you need to re-instrument. Langfuse is a close second: it has a first-class LangChain callback handler and stays framework-agnostic if your stack evolves.
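As a rough sketch, the two variables can be set in the shell or at the top of the entry-point script, before any LangChain code runs. The variable names are the ones given above; the key value is a placeholder, and `tracing_configured` is a hypothetical helper added here only to sanity-check the setup.

```python
import os

# Set the two variables named above. setdefault() leaves any value that is
# already exported in the shell untouched. The key below is a placeholder.
os.environ.setdefault("LANGCHAIN_TRACING_V2", "true")
os.environ.setdefault("LANGCHAIN_API_KEY", "<your-langsmith-api-key>")


def tracing_configured(env=os.environ) -> bool:
    """Return True when both variables needed for tracing are present."""
    enabled = env.get("LANGCHAIN_TRACING_V2", "").lower() == "true"
    return enabled and bool(env.get("LANGCHAIN_API_KEY"))


print(tracing_configured())
```

In practice the API key would come from a secret store or the deployment environment rather than from source code.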

How much does LLM observability cost per month?

Cost varies significantly by platform and usage. Helicone has the most generous free tier (100k requests/month free). Langfuse cloud free tier covers 50k observations/month. LangSmith free tier is 5k traces/month — which is tight for any real agent workload. Paid tiers typically range from $39/month (LangSmith Plus per seat) to $200+/month for medium-scale production. Braintrust charges primarily for evaluation runs (LLM tokens consumed by the eval), not raw traces. Self-hosting Langfuse or Arize Phoenix eliminates per-trace costs entirely after fixed infra costs ($20-50/month on a small VPS).
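Because the free tiers are counted in different units (requests, observations, traces), a quick back-of-the-envelope check can help. The sketch below uses the free-tier limits quoted above; the unit mapping (one trace per agent run, one request per LLM call, a configurable number of observations per run) is an assumption made for illustration, not something taken from vendor documentation.

```python
# Free-tier limits quoted above (per month), keyed by the unit each one counts.
FREE_TIERS = {
    "Helicone": ("requests", 100_000),
    "Langfuse Cloud": ("observations", 50_000),
    "LangSmith": ("traces", 5_000),
}


def free_tier_fit(runs_per_month: int, llm_calls_per_run: int,
                  observations_per_run: int) -> dict:
    """Return, per platform, whether estimated monthly usage fits the free tier.

    Assumed unit mapping (hypothetical, for illustration only):
      trace       = one agent run
      request     = one LLM API call
      observation = one recorded span or event within a run
    """
    usage = {
        "traces": runs_per_month,
        "requests": runs_per_month * llm_calls_per_run,
        "observations": runs_per_month * observations_per_run,
    }
    return {name: usage[unit] <= limit
            for name, (unit, limit) in FREE_TIERS.items()}


# Example: 10,000 agent runs per month, 3 LLM calls and 6 observations per run.
fit = free_tier_fit(10_000, llm_calls_per_run=3, observations_per_run=6)
print(fit)
```

Under these assumptions, the example workload fits Helicone's free tier but exceeds the Langfuse Cloud and LangSmith limits, which is why the unit each platform bills on matters as much as the headline number.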

What is the difference between LLM observability and evaluation (evals)?

Observability is about recording what happened (traces, spans, token counts, latencies). Evaluation (evals) is about judging whether what happened was correct — usually using rubrics, LLM-as-judge scoring, or human annotation. Some platforms combine both: Braintrust is eval-first with observability as the data collection layer. LangSmith has both. Arize Phoenix has strong eval libraries (phoenix.evals) for RAG hallucination detection. Pure observability tools like Helicone and Traceloop do not have built-in eval capabilities — you'd need a separate eval pipeline.
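The split can be sketched in a few lines. Everything here is hypothetical: `traced` stands in for an observability SDK, `answer` for an LLM call, and the keyword check in `evaluate` for a real rubric, LLM-as-judge scorer, or human annotation.

```python
import time


def traced(fn):
    """Observability: record what happened (inputs, output, latency)."""
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        output = fn(*args, **kwargs)
        wrapper.records.append({
            "step": fn.__name__,
            "args": args,
            "output": output,
            "latency_s": time.perf_counter() - start,
        })
        return output
    wrapper.records = []
    return wrapper


@traced
def answer(question: str) -> str:
    # Stand-in for an LLM call; returns a canned string.
    return "Paris is the capital of France."


def evaluate(record: dict, expected_keyword: str) -> bool:
    """Evaluation: judge whether the recorded output was correct."""
    return expected_keyword.lower() in record["output"].lower()


answer("What is the capital of France?")
record = answer.records[0]
print(record["step"], evaluate(record, "Paris"))
```

The trace record exists whether or not anyone scores it; the eval is a separate pass that consumes those records, which is why some tools ship only the first half.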

Which AI agent observability platforms support CrewAI?

As of June 2026, Langfuse, Arize Phoenix, Traceloop (OpenLLMetry), and HoneyHive all have documented CrewAI integrations. LangSmith does not natively instrument CrewAI (since CrewAI is a separate framework), but you can wrap CrewAI tool calls manually with the LangSmith SDK. Traceloop's OpenLLMetry auto-instrumentor covers CrewAI with a one-line init. Langfuse has a CrewAI callback handler in its documentation. Helicone is proxy-based and sees only individual LLM API calls, not CrewAI agent orchestration steps.

AI Agent Observability Platforms

Select your LLM framework and deployment requirements — the filter shows only platforms that actually integrate with your stack, ranked by fit. Real OTel support facts, self-host options, and genuine downsides included.

TL;DR

LangChain teams: LangSmith is zero-config — set 2 env vars, done. $39/mo per seat; free tier is 5k traces/month (tight for production)
Self-host required: Langfuse (MIT, free forever; needs Postgres + ClickHouse) or Arize Phoenix (OSS, single pip install locally; scaled Phoenix Server for production)
Framework-agnostic / OTel-native: Traceloop (OpenLLMetry) exports OTLP spans to any backend — Datadog, Grafana, Honeycomb
Zero code change: Helicone proxy intercepts OpenAI/Anthropic calls — free for 100k requests/month, adds ~10-30ms latency
Eval-first teams: Braintrust or HoneyHive — both combine tracing with LLM-as-judge scoring and CI/CD regression tests

Filter by your stack

Select your LLM framework, deployment preference, team size, and budget to see only the platforms that actually integrate with your stack — ranked by fit.

8 platforms match your stack, ranked by fit

Langfuse

Open Source · Self-Host · OTel

Open-source LLM observability — self-host or cloud, any framework

Best for: Teams who need self-host or framework-agnostic tracing

Pricing: From $59/mo · G2: 4.7/5 · GitHub: ★ 9k

LangSmith

Self-Host · OTel

First-party tracing for LangChain & LangGraph with dataset management

Best for: Teams already using LangChain or LangGraph

Pricing: From $39/mo · G2: 4.4/5

Arize Phoenix

Open Source · Self-Host · OTel

Open-source AI observability with OTel-first design and evals

Best for: ML teams doing RAG evaluation and research-grade tracing

Pricing: Free / open-source · G2: 4.5/5 · GitHub: ★ 4k

Braintrust

OTel

Eval-first observability — logging, evals, and prompt playground unified

Best for: Product teams running systematic evaluations and prompt experiments

Pricing: Free tier · G2: 4.6/5

Helicone

Open Source · Self-Host

Proxy-based observability — zero code change, request/response logging

Best for: Solo developers and small teams wanting instant cost visibility with no code changes

Pricing: Free / open-source · GitHub: ★ 3k

Traceloop (OpenLLMetry)

Open Source · Self-Host · OTel

OTel-native SDK — pipe LLM traces to any backend (Datadog, Grafana, etc.)

Best for: Platform teams already running Datadog/Grafana who want LLM traces in existing infra

Pricing: Free / open-source · GitHub: ★ 2k

HoneyHive

OTel

AI pipeline observability with automated regression testing and CI/CD integration

Best for: Teams shipping AI features who want regression tests gating each deployment

Pricing: From $49/mo · G2: 4.5/5

Lunary

Open Source · Self-Host

Open-source observability for LLM chatbots and agents — lightweight and self-hostable

Best for: Solo developers building LLM chatbots who want quick self-host setup

Pricing: Free / open-source · GitHub: ★ 1.5k

How We Tested

We evaluated each platform by instrumenting a reference multi-step AI agent that makes 2–3 LLM calls per request (a RAG pipeline with tool use and a final synthesis step) using GPT-4o and Claude 3.7 Sonnet. Each tool was set up from a fresh account or a fresh self-hosted Docker environment and measured on time-to-first-trace, span completeness, and cost accuracy.

Integration coverage facts (which frameworks each tool supports) were verified against official documentation in June 2026. Pricing data reflects published pricing pages as of June 2026 — enterprise tiers without public pricing are labeled "contact".

OTel support was tested by sending traces via an OTLP exporter and checking for standard gen_ai.* semantic convention attributes in the received spans. "Supported" means the platform ingests and displays OTel spans with agent-meaningful attributes; partial support is noted in each tool's detail.
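A check of this kind can be sketched as follows. This is an illustration, not the harness used for the tests above: the attribute names come from the OTel GenAI semantic conventions, but the choice of which ones to require, the helper name, and the sample span are assumptions.

```python
# Attribute names from the OTel GenAI semantic conventions. Which ones to
# treat as required is an assumption made for this sketch.
EXPECTED_GEN_AI_KEYS = (
    "gen_ai.system",
    "gen_ai.response.model",
    "gen_ai.usage.input_tokens",
    "gen_ai.usage.output_tokens",
)


def missing_gen_ai_attributes(span_attributes: dict) -> list:
    """Return the expected gen_ai.* keys absent from a received span."""
    return [key for key in EXPECTED_GEN_AI_KEYS if key not in span_attributes]


# Hypothetical attributes read back from a backend after ingestion.
received = {
    "gen_ai.system": "openai",
    "gen_ai.response.model": "gpt-4o",
    "gen_ai.usage.input_tokens": 512,
}
missing = missing_gen_ai_attributes(received)
print("full support" if not missing else f"partial support, missing: {missing}")
```

An empty list would correspond to "Supported" in the sense used above; a non-empty one is the kind of gap noted as partial support.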

Disclosure: We have no commercial relationship with any of the listed platforms. Strengths and downsides reflect our own hands-on testing and publicly documented user feedback (GitHub issues, Reddit r/LLMDevs, Hacker News).

OpenTelemetry (OTel) Quick Guide for AI Agents

OpenTelemetry is a vendor-neutral standard for distributed tracing. In 2024-2025, the OTel community added GenAI semantic conventions — standard span attribute names for LLM calls (gen_ai.system, gen_ai.usage.input_tokens, gen_ai.response.model). This means: instrument once, export anywhere.
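As an illustration of what "instrument once" means in practice, the sketch below maps two differently shaped provider usage payloads onto the same gen_ai.* attribute names. The helper `to_gen_ai_attributes` and the sample values are hypothetical; the model string for the Anthropic example is an illustrative label, not an exact model ID. The payload field names follow the OpenAI style (prompt_tokens, completion_tokens) and the Anthropic style (input_tokens, output_tokens).

```python
def to_gen_ai_attributes(system: str, model: str, usage: dict) -> dict:
    """Map a provider usage payload onto gen_ai.* span attribute names."""
    # OpenAI-style payloads report prompt_tokens / completion_tokens;
    # Anthropic-style payloads report input_tokens / output_tokens.
    input_tokens = usage.get("input_tokens", usage.get("prompt_tokens"))
    output_tokens = usage.get("output_tokens", usage.get("completion_tokens"))
    return {
        "gen_ai.system": system,
        "gen_ai.response.model": model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
    }


openai_attrs = to_gen_ai_attributes(
    "openai", "gpt-4o", {"prompt_tokens": 120, "completion_tokens": 30}
)
anthropic_attrs = to_gen_ai_attributes(
    "anthropic", "claude-3-7-sonnet", {"input_tokens": 120, "output_tokens": 30}
)
print(openai_attrs)
print(anthropic_attrs)
```

With the OTel SDK, a dictionary like this could be attached to a span via `span.set_attributes(...)`, so any OTLP-compatible backend receives the same field names regardless of provider.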

When OTel makes sense for your agent

✓ Existing Datadog/Grafana stack: use Traceloop to send LLM traces alongside your existing service traces for one unified view
✓ Vendor-agnostic requirement: regulated industries or orgs with strict vendor lock-in policies, since OTel spans are portable
✓ Multi-framework agents: if your agent mixes LangChain tools with raw Anthropic SDK calls, OTel normalizes the trace across both
✗ Pure LangChain + LangSmith: the native SDK gives richer data (prompt versions, dataset links) than OTel can carry, so skip OTel here

Minimum OTel setup (Python, any framework)

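# pip install opentelemetry-sdk opentelemetry-exporter-otlp-proto-http
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

# The exporter reads its target from the OTEL_EXPORTER_OTLP_ENDPOINT environment
# variable; point it at Langfuse, Phoenix, or your own OTel collector.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
trace.set_tracer_provider(provider)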

For LangChain specifically, Traceloop's SDK (traceloop-sdk) bundles opentelemetry-instrumentation-langchain, so a single Traceloop.init() call auto-instruments every chain and tool call, with no manual span wrapping needed.

Building an AI agent? After choosing your observability platform, use the AI Agent Workflow Builder Picker to select your orchestration framework, and the LLM API Cost Calculator to estimate token costs before your agent hits production scale.
