Logging and AI Agents at Customer.io

Author's note: This blog post was co-written with Claude Opus 4.7.

At Customer.io, we index over 52 billion backend service logs per day. In May 2026, we turned off our legacy self-hosted logging cluster, replacing it with VictoriaLogs and OpenTelemetry, and started teaching our AI agents to investigate production incidents on their own.

As a result, we halved our logging infrastructure costs while making a mountain of observability data accessible to AI agents for triage, troubleshooting, and even resolving customer issues. Here's how we got there—and where it's going next.

Why this matters

Observability isn't just a nice-to-have — it's the difference between knowing what's broken and guessing, and ultimately between a customer's campaign going out on time or not.

Metrics, traces, logs, and profiles are the raw material behind every dashboard we trust, every alert that pages someone at 2 a.m., and every incident retrospective. At Customer.io, just about everyone touches the observability stack at some point: engineers checking in on a fresh release, support staff chasing down a customer report, and an on-call SRE investigating an alert.

The whole thing only works if it's fast, fresh, and trustworthy. The moment any of that wobbles, people start tailing service logs by hand and reading tea leaves — and nobody makes good decisions from tea leaves.

When that happens, the blast radius isn't just internal: it shows up as slower incident response, customer-reported bugs that stay open longer, and delayed message sends for the businesses that rely on us.

The stack, briefly

At the end of last year, most of our observability stack was self-hosted on GCP: Prometheus for metrics, Grafana for dashboards, AlertManager for alert routing, and a self-hosted logging solution. Even Honeycomb, which stored traces, was fronted by a self-hosted Refinery sampler.

The problem

Our Q4 2025 developer survey delivered a clear message: searching internal service logs is painful. The mandate was about as open-ended as mandates get: "Searching logs is slow, flaky, and not trustworthy. Please make it better."

Users of our logging platform regularly reported that queries timed out, routine searches ran slowly, and log indexing fell behind as the system approached performance limits. The platform itself was industry-standard, but our implementation was difficult to maintain and did not meet our internal users’ needs. During peak load, ingestion could lag by several minutes, and searches were slow enough to time out entirely — exactly when teams most needed reliable logs to troubleshoot production incidents. Keeping the platform healthy also required frequent manual intervention: expanding volumes, tuning ingestion, and adjusting capacity as traffic grew, all of which added to the operational burden.

We started where any reasonable team would: by trying to fix what we already had. Our self-hosted logging platform got the tuning treatment. Indexing sped up a bit. Search got a little snappier. Yet developers still didn't trust it — we'd burned through enough goodwill that our team had stockpiled an entire arsenal of workarounds, and the logging platform had quietly become a tool of last resort. Performance gains alone weren't going to change that.

The breakup

With the platform improved, but our users still unhappy, we stepped back and asked bigger questions: should we stay self-hosted or move to a hosted solution? Where do people want to use logs? How do people actually use them today?

The hosted solutions were cost-prohibitive at our log volumes — we generate significant logging data that we can't afford to send to a hosted vendor. No amount of reduced operational burden could offset our projected bill.

The "where" question, though, pointed somewhere useful. Grafana was already home for metrics and dashboards, so putting logs there too would keep them one tab and one mental context away from whatever you were debugging. Whatever we chose had to play nicely with Grafana.

That's how we ended up at VictoriaLogs.

The migration

In November 2025, we stood up a VictoriaLogs pilot for staging logs. VictoriaLogs handled the load with dramatically less configuration and dramatically less compute than the legacy logging platform. The promises held up:

Indexing is simple. VictoriaLogs auto-infers streams from log content, which means far less manual config than our old platform required, and developers don't need to think about the shape of their logs.
OpenTelemetry (OTEL) for ingesting logs. The OTEL pipeline required less compute, was easier to reason about, and was built on an open standard rather than a single vendor's pipeline. It was also being used elsewhere in our observability stack for traces, so we could leverage expertise we already had.
Storage and query patterns hold up under real production load.
Integration with Grafana was seamless. Users could query logs, include logs in dashboards, and plot log frequency on graphs, leveraging an existing data stream in new ways.

By the numbers

Across our Production US, Production EU, and Staging environments:

	Log ingestion over 24h	30-day retention	Memory / CPU
VictoriaLogs	52 billion logs; 59 TiB of data	520 billion logs; 86 TiB of data	776 GiB memory; 336 CPU cores
Compared to legacy	Increased; more services are sending logs to VictoriaLogs	Down from 1.7 TiB of memory; down from 639 CPU cores

After five months of testing, ramp-up, and user feedback, we finally turned off our old logging stack. We switched to VictoriaLogs for all production logging — cutting our infrastructure bill roughly in half from ~$40,000/month to ~$20,000/month and reducing operational headaches thanks to VictoriaLogs’ simple indexing strategy.
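The headline reductions can be checked from the figures quoted above. The following is a quick sketch with illustrative variable names, assuming 1 TiB = 1024 GiB and, for the per-second rate, that the daily log volume arrives evenly over 24 hours (real traffic will peak higher):

```python
# Figures as quoted in this post; variable names are illustrative only.
GIB_PER_TIB = 1024

legacy_memory_gib = 1.7 * GIB_PER_TIB  # "Down from 1.7 TiB of Memory"
new_memory_gib = 776
legacy_cpu_cores = 639
new_cpu_cores = 336
legacy_cost_usd = 40_000               # approximate monthly bill before
new_cost_usd = 20_000                  # approximate monthly bill after
logs_per_day = 52_000_000_000


def reduction_pct(before: float, after: float) -> float:
    """Return the percentage drop from `before` to `after`."""
    return (before - after) / before * 100


memory_drop = reduction_pct(legacy_memory_gib, new_memory_gib)
cpu_drop = reduction_pct(legacy_cpu_cores, new_cpu_cores)
cost_drop = reduction_pct(legacy_cost_usd, new_cost_usd)

# Average rate only, assuming a perfectly even spread across the day.
avg_logs_per_second = logs_per_day / 86_400

print(f"memory: {memory_drop:.1f}% lower")
print(f"cpu:    {cpu_drop:.1f}% lower")
print(f"cost:   {cost_drop:.1f}% lower")
print(f"average ingest: {avg_logs_per_second:,.0f} logs/s")
```

On these numbers, memory drops by roughly 55%, CPU by roughly 47%, and the average ingest rate works out to about 600,000 logs per second.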

Since then, we've expanded our VictoriaLogs coverage to capture an even larger volume of logs, which our legacy logging platform couldn't handle due to cost or performance constraints. VictoriaLogs has taken it in stride.

Teaching agents to read logs

From triaging and assigning customer tickets to approving code reviews, Customer.io takes an AI-first approach to problem-solving. Naturally, once logs were in a good place, the next question was: how do we make all this data legible to our AI tooling?

The `/observability` skill

An engineer on the DevX team built a Claude Code skill called /observability that takes a description of whatever you're chasing and hunts for it in VictoriaLogs directly via the VictoriaLogs MCP server. It surfaces the relevant logs and then builds a narrative around them — what happened, the likely impact on users, and what to look at next.

Here’s an example of finding a log containing a reference ID using the skill, with customer-identifying information replaced with placeholders:

Input Prompt

/observability:log-check I'm trying to find an error reference in VictoriaLogs and can't seem to find it. 01JXYZ123ABC456DEF789GH0IJ, in environment 12345 or 12343 around May 7th.

The skill reads its bundled reference docs, then searches VictoriaLogs using a wildcard *: query across both US and EU production clusters.

What it found

The reference 01JXYZ123ABC456DEF789GH0IJ was located in US Prod, environment 12345. It's a frontend error event recorded by ui_api — the customer hit a payload size limit (HTTP 413) while working in the journey composer.

Field	Value
Time	2026-05-07 15:09:48 UTC
Service	ui_api (analytics tracking event)
Error message	Our server can't do what we asked it to.: request entity too large (reference 01JXYZ123ABC456DEF789GH0IJ)
Environment	12345
Page	/workspaces/12345/journeys/composer/actions/42
Referrer	/workspaces/12343/journeys/campaigns/15/overview/workflow/actions?actionId=87
User agent	Chrome 147 on Windows

The skill then searched for corresponding backend errors — and found none.

Key observations

The customer retried — the same user hit the same error ~2 minutes earlier with a different reference ID, on the same page and action.
No backend error logged — the 413 was likely rejected at the edge/proxy layer before reaching the application, so it only appears as a frontend analytics event.
Cross-workspace navigation — the customer navigated from workspace 12343 to 12345, matching the environments mentioned in the original query.

Why it's hard to find

The reference ID isn't in a top-level log field — it's buried inside a nested JSON blob (an analytics event payload). A field-specific query like delivery_id: or request_id: won't match it. The skill's wildcard *: search is the only way to surface it in VictoriaLogs.
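To make the difference concrete, here is a small, self-contained sketch of field-specific versus any-field matching. It does not call VictoriaLogs; the record layout and the field names `payload`, `event`, and `properties` are hypothetical stand-ins for the nested analytics payload described above.

```python
import json

REFERENCE = "01JXYZ123ABC456DEF789GH0IJ"

# Hypothetical log record: the reference ID only appears inside a JSON
# string stored in one field, never as its own top-level field.
record = {
    "_time": "2026-05-07T15:09:48Z",
    "service": "ui_api",
    "payload": json.dumps({
        "event": "frontend_error",
        "properties": {
            "message": f"request entity too large (reference {REFERENCE})",
        },
    }),
}


def field_match(rec: dict, field: str, needle: str) -> bool:
    """Field-specific search: inspects one named top-level field only."""
    return needle in str(rec.get(field, ""))


def any_field_match(rec: dict, needle: str) -> bool:
    """Any-field search: inspects every field value, serialized blobs included."""
    return any(needle in str(value) for value in rec.values())


print(field_match(record, "request_id", REFERENCE))   # field does not exist
print(field_match(record, "delivery_id", REFERENCE))  # field does not exist
print(any_field_match(record, REFERENCE))             # found inside payload
```

The two field-specific lookups come back empty because the record has no such fields, while the any-field scan finds the ID inside the serialized payload. That is the behavior the skill relies on when it falls back to a wildcard search.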

Next steps

The skill concluded with actionable guidance: if you need to understand what the customer was trying to save that was too large, VictoriaLogs won't have that — you'd need to check the journey composer's API to see what payload size limit applies to that action.

<aside>
⏲️

This response took a total of 2m 16s

</aside>

VictoriaLogs ships with the very powerful LogsQL query language, but even powerful languages have a learning curve, and LogsQL is no exception. The skill flattens that curve to almost nothing — users get a much more accessible interface to the mountain of data we have in VictoriaLogs, without needing to memorize syntax first.
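For a sense of what the skill saves people from writing by hand, here is a rough sketch of preparing a raw query against the VictoriaLogs HTTP endpoint `/select/logsql/query`, which returns one JSON object per line. Several things here are assumptions: the host name is hypothetical, the query is a plain phrase filter (which LogsQL applies to the message field by default) rather than the any-field search the skill uses, and the skill itself goes through the VictoriaLogs MCP server rather than this code path. The request is built but not sent.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request


def build_logsql_request(base_url: str, query: str, start: str, end: str,
                         limit: int = 50) -> Request:
    """Prepare (but do not send) a POST to the LogsQL query endpoint."""
    body = urlencode(
        {"query": query, "start": start, "end": end, "limit": limit}
    ).encode()
    url = f"{base_url.rstrip('/')}/select/logsql/query"
    return Request(url, data=body, method="POST")


def parse_json_lines(text: str) -> list[dict]:
    """Parse a JSON-lines response body into a list of log records."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


# Hypothetical host; the time window brackets the date from the example prompt.
request = build_logsql_request(
    "http://victorialogs.example:9428",
    '"01JXYZ123ABC456DEF789GH0IJ"',
    start="2026-05-07T00:00:00Z",
    end="2026-05-08T00:00:00Z",
)
print(request.get_method(), request.full_url)

# Parsing a made-up two-line response body.
sample_body = '{"_msg": "first"}\n{"_msg": "second"}\n'
print(parse_json_lines(sample_body))
```

Sending the request with `urllib.request.urlopen(request)` and feeding the decoded body to `parse_json_lines` would complete the round trip. Even this small case involves picking the right filter, time window, and endpoint, which is the overhead the skill removes.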

Caution and security

Exposing backend observability data to LLMs might raise some red flags. All of our LLM subscriptions use enterprise plans to ensure customer data is never used for AI training.

Where this is going

Now that we have proven how effective LLMs are at troubleshooting problems when given adequate data and tools, we are doubling down.

Giving more observability data to bots

While writing this post, our DevX team turned on the Grafana MCP server (https://grafana.com/docs/grafana/latest/developer-resources/mcp/), making not only logs but also metrics, dashboards, and profiling data accessible to LLM agents. This allows workflows that correlate logs with metrics and then link directly to dashboards. Once we enable write operations via MCP, agents will be able to create dashboards on the fly, greatly reducing the effort required to visualize data for investigations and incident postmortems.

Giving observability data to autonomous agents

LLMs operated by developers and support staff extract logs and turn them into a useful story, but the next step is teaching our internal autonomous-agent tooling to do the same. The aim is to enable bots to:

Answer production debugging questions on demand.
Triage customer issues and route them to the right team.
Triage and fix customer issues automatically.

The future of observability is exciting. Agents can read observability data directly and synthesize correlations that would take most of us much longer to find, or that we might miss entirely. Pairing that capability with the ability to create and edit disposable dashboards on the fly means we expect to get much more use out of the data we already have. If AI-first problem-solving sounds exciting, consider applying to work at Customer.io.

You should also explore our suite of AI features, including the Agent, MCP server, and CLI.