Debugging Coral With Coral
Observability usually has a one-way feel to it. An application emits logs, metrics, and traces; a backend stores them; humans use that backend’s UI to understand what happened.
Coral changes the direction. Coral is a SQL query layer for agents, so one of the things an agent can query is the OTLP backend that stores Coral’s own telemetry. When Coral emits a slow trace, the agent can ask Coral about it, then look at the Rust code that produced it. In other words, we can use a code agent, powered by Coral, to debug and improve Coral itself.
The loop:
- Run a Coral command.
- Coral emits OpenTelemetry signals.
- OpenObserve, Datadog, or another OTLP-compatible backend stores them.
- The agent uses coral sql to inspect those signals.
- The agent connects the telemetry evidence back to Coral’s source.
The agent is useful here because it can move across operational data and source code without switching tools. It asks Coral what happened, reads the Rust that made it happen, patches it, and asks Coral again whether the behavior changed.
This isn’t a replacement for an observability UI. UIs are still better for exploring traces interactively or sharing a link with a teammate. The loop is for the case where the question is already specific:
- Why is this Coral query slow?
- Which Coral module owns the slow operation?
- Did this source request fail before or after reaching the engine?
- Did a code change reduce the duration of the same span?
- Did the fix change behavior, or only move time somewhere else?
The rest of this post walks through setting up telemetry from Coral, plugging in an agent, and the prompts I’ve found most useful.
Table of Contents
- Getting Telemetry Out
- Adding the Agent
- The Prompts
- A Small Amount of SQL
- Datadog Instead of OpenObserve
- Bringing in the Backlog
- Summary
Getting Telemetry Out
Coral emits traces, logs, and metrics through OpenTelemetry. Telemetry is disabled by default and is configured through the [otel] section in Coral’s config.toml.
For a local OpenTelemetry collector, the minimal setup looks like this:
[otel]
endpoint = "http://localhost:4318"
The value is the OTLP/HTTP base URL. Coral appends /v1/traces, /v1/logs, and /v1/metrics itself, so the endpoint should point at the collector base, not at one specific signal path.
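To make the base-versus-signal-path distinction concrete, here is a small sketch that prints the three URLs implied by a base endpoint. It only illustrates the path rule described above and is not Coral's own code. The otlp_signal_urls helper is a hypothetical name, and stripping a trailing slash is a choice of this sketch, not documented Coral behavior.

```shell
#!/bin/sh
# Hypothetical helper: print the per-signal OTLP/HTTP URLs implied by a base endpoint.
# It mirrors the rule "base URL plus /v1/<signal>"; it is an illustration, not Coral code.
otlp_signal_urls() {
  base="${1%/}"   # this sketch drops one trailing slash so the join has no "//"
  for signal in traces logs metrics; do
    printf '%s/v1/%s\n' "$base" "$signal"
  done
}

otlp_signal_urls "http://localhost:4318"
```

If the endpoint in config.toml already ended in /v1/traces, the same rule would produce a doubled path, which is why the value should stop at the collector base.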
For this debugging workflow it is useful to set the service name explicitly:
[otel]
endpoint = "http://localhost:4318"
service_name = "coral"
That gives us a stable filter later on:
WHERE service_name = 'coral'
The default trace filter captures Coral application and engine spans, while keeping DataFusion spans disabled to avoid too much noise. When debugging query execution itself, turn DataFusion tracing on:
[otel]
endpoint = "http://localhost:4318"
service_name = "coral"
trace_filter = "coral_app=trace,coral_engine=trace,coral_engine::datafusion=trace"
Now a query can produce spans not only for coral.cli, grpc, and coral.query, but also for physical execution operators such as FilterExec, AggregateExec, ProjectionExec, or DataSourceExec.
That is where the setup becomes interesting. Those spans describe the internals of Coral’s own query execution, and they are just data in an observability backend.
Adding the Agent
Without an agent, the workflow is still useful: run a query, look at traces, find the slow bit.
With an agent, the whole debugging loop can be delegated:
- Reproduce the behavior with a Coral command.
- Query the telemetry backend through Coral.
- Summarize what happened.
- Locate the relevant code path.
- Patch Coral.
- Re-run the command and compare telemetry.
That last step matters. The agent should not merely make a plausible code change. It should use telemetry as the before-and-after evidence for the change.
For example, suppose an OpenObserve query through Coral feels slow. A typical human debugging flow might start in the OpenObserve UI, then switch to the code, then switch back to the terminal. The agent can keep the whole investigation in one place:
Use Coral to run a representative OpenObserve query. Then use Coral again to
query OpenObserve traces for the Coral service over the same time window.
Tell me which Coral operation took the most time, which source/backend it
belongs to, and which Rust module is likely responsible. Do not change code yet.
That prompt asks for evidence first. Once the agent has identified a likely culprit, the next prompt can move from diagnosis to implementation:
Based on the telemetry you just found, inspect the relevant Coral code path and
make the smallest change that improves the observed behavior. After the change,
run the same Coral query again and compare the new trace summary with the
previous one.
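The before-and-after comparison does not need to be sophisticated. As a rough sketch, assume each trace summary has been saved as a tab-separated file with one operation name and its maximum duration in microseconds per line. That file shape, the compare_summaries name, and the sample numbers are all assumptions made for illustration; how the summaries get exported is left open.

```shell
#!/bin/sh
# Assumed input: two tab-separated files, one line per operation:
#   <operation_name>\t<max_duration_us>
# Output: operation, before, after, and the signed difference (or "new" / "gone").
compare_summaries() {
  before_file="$1"; after_file="$2"
  awk -F '\t' '
    FILENAME == ARGV[1] { before[$1] = $2; next }   # first file: remember "before" durations
    {
      seen[$1] = 1
      if ($1 in before) printf "%s\t%d\t%d\t%+d\n", $1, before[$1], $2, $2 - before[$1]
      else              printf "%s\t-\t%d\tnew\n", $1, $2
    }
    END {
      # operations that existed before the change but no longer appear
      for (op in before) if (!(op in seen)) printf "%s\t%d\t-\tgone\n", op, before[op]
    }
  ' "$before_file" "$after_file"
}

# Made-up sample data, only to show the shape of the comparison.
tmp=$(mktemp -d)
printf 'coral.query\t1200000\nhttp.request\t900000\nJsonExec\t50000\n' > "$tmp/before.tsv"
printf 'coral.query\t700000\nhttp.request\t880000\nSortExec\t4000\n'  > "$tmp/after.tsv"
compare_summaries "$tmp/before.tsv" "$tmp/after.tsv"
```

A negative difference means the span got faster. The "new" and "gone" rows matter as much as the deltas: they are the hint that a change moved time somewhere else instead of removing it.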
The important point is not that the agent knows Coral’s internals upfront. It is that Coral exposes enough operational evidence for the agent to find its way.
The Prompts
Here are the prompts I would use for this style of work.
Start by making the agent verify that telemetry is flowing:
Check whether Coral telemetry is reaching OpenObserve. Run a small Coral
command to generate fresh activity, then use Coral's OpenObserve source to
summarize which signals are present. Metrics may arrive slightly later, so do
not treat a short delay as a failure.
Then ask for an inventory, not a fix:
Use Coral to query the OpenObserve traces for service_name = 'coral' over the
last hour. I do not need every span. Summarize the operation names, counts, and
slowest durations. Explain what this says about Coral's execution path.
When investigating a concrete behavior, bind the telemetry to the repro:
Run this Coral command, then inspect Coral's traces and logs for the same time
window: <COMMAND>. Identify the slowest or failing operation and map it back to
the most likely crate and module in the Coral repository.
Before editing code, force a hypothesis:
Before changing code, state the hypothesis supported by the telemetry. Include
the span or log evidence, the code path you expect to inspect, and what result
would confirm the fix after re-running the command.
Then allow the patch:
Implement the smallest code change for that hypothesis. Keep adapters thin and
move behavior into the appropriate Coral crate. Afterward, run the relevant
Rust checks and repeat the telemetry query to compare before and after.
Finally, ask for the write-up:
Write a short debugging note: the original symptom, the Coral command used to
reproduce it, the telemetry evidence, the code change, and the after-state.
Keep the note factual and include the exact spans or metrics that changed.
Those prompts are deliberately staged. The agent observes first, explains next, changes code only after that, and then verifies the result with the same telemetry source.
A Small Amount of SQL
Coral ships bundled sources for both OpenObserve and Datadog.
For OpenObserve, configure the source with the usual connection inputs:
export OPENOBSERVE_URL="http://localhost:5080"
export OPENOBSERVE_ORG="default"
export OPENOBSERVE_USERNAME="root@example.com"
export OPENOBSERVE_PASSWORD="..."
coral source add openobserve
OpenObserve log, metric, and trace searches need a stream plus a bounded time window. The source expects timestamps in Unix microseconds:
now=$(date +%s)
start=$(( (now - 3600) * 1000000 ))
end=$(( now * 1000000 ))
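An easy mistake here is passing seconds or milliseconds instead of microseconds. A small guard can catch that before any query runs. This is a sketch of my own, not part of Coral: check_window_us is a hypothetical name, and the 16-digit rule is simply what present-day Unix timestamps look like in microseconds.

```shell
#!/bin/sh
# Hypothetical guard: succeed only if both values look like Unix microseconds
# and describe a non-empty window. Present-day microsecond timestamps have
# 16 digits (seconds have 10, milliseconds 13).
check_window_us() {
  s="$1"; e="$2"
  case "$s$e" in
    ''|*[!0-9]*) echo "window values must be non-negative integers" >&2; return 1 ;;
  esac
  if [ "${#s}" -lt 16 ] || [ "${#e}" -lt 16 ]; then
    echo "expected Unix microseconds (16 digits)" >&2; return 1
  fi
  if [ "$s" -ge "$e" ]; then
    echo "start must be before end" >&2; return 1
  fi
}

now=$(date +%s)
start=$(( (now - 3600) * 1000000 ))
end=$(( now * 1000000 ))
check_window_us "$start" "$end" && echo "window ok: $start .. $end"
```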
With that in place, we can start asking questions.
The agent does not need a large library of SQL. Most investigations can start from one compact trace summary:
coral sql "
  SELECT
    service_name,
    operation_name,
    span_status,
    COUNT(*) AS spans,
    MAX(duration) AS max_duration_us
  FROM openobserve.traces
  WHERE stream = 'default'
    AND start_time = $start
    AND end_time = $end
  GROUP BY service_name, operation_name, span_status
  ORDER BY spans DESC
  LIMIT 50
"
In one local run, this reported operations such as:
coral.cli
grpc
coral.query
http.request
http.response
FilterExec
RepartitionExec
AggregateExec
JsonExec
ProjectionExec
SortExec
DataSourceExec
That already tells us a lot.
There was a CLI invocation. It crossed the local gRPC boundary. A query was planned and executed. The HTTP backend made requests. DataFusion ran physical operators. All of that can now be sliced with SQL.
For everything else, I would keep using prompts rather than pasting more SQL into the article:
Using the same time window, inspect Coral logs in OpenObserve only if the trace
summary shows errors or unexplained latency. Summarize the relevant log lines;
do not dump the full result set.
If the investigation depends on counters or histograms, check Coral query
metrics for the same window. Remember that metrics can arrive slightly later
than traces and logs; wait or widen the window before calling them missing.
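Widening the window for late metrics is plain arithmetic on the same microsecond values. A minimal sketch, assuming the window is widened on both sides by a fixed slack: the widen_window_us name and the two-minute slack are illustrative choices of mine, not something Coral or OpenObserve prescribes.

```shell
#!/bin/sh
# Hypothetical helper: widen a [start, end] window (Unix microseconds) by a
# slack given in seconds, on both sides. Prints "<start_us> <end_us>".
widen_window_us() {
  start_us="$1"; end_us="$2"; slack_s="$3"
  slack_us=$(( slack_s * 1000000 ))
  printf '%s %s\n' "$(( start_us - slack_us ))" "$(( end_us + slack_us ))"
}

# Example with fixed values so the result is reproducible:
# a one-hour window widened by 120 seconds on each side.
set -- $(widen_window_us 1700000000000000 1700003600000000 120)
metrics_start=$1
metrics_end=$2
echo "metrics window: $metrics_start .. $metrics_end"
```

Keeping the original window for traces and logs and using the widened one only for metrics keeps the before-and-after comparison of spans on identical bounds.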
Datadog Instead of OpenObserve
The same idea works with Datadog. Coral can export telemetry to a Datadog Agent or another Datadog-compatible OTLP path, and Coral’s bundled datadog source can query the resulting logs, spans, and metrics.
Configure the Datadog source like this:
export DD_SITE="datadoghq.com"
export DD_API_KEY="..."
export DD_APPLICATION_KEY="..."
coral source add datadog
The prompts barely change:
Use Coral's Datadog source to inspect spans for service:coral over the last
hour. Summarize the slowest span resources and operation names, then map the
most suspicious one back to the Coral codebase. Do not edit code yet.
Now inspect matching Datadog logs for service:coral in the same time window.
Correlate any warnings or errors with the slow spans you found.
The backend changed, but the shape of the workflow did not. Emit telemetry, then ask the agent to query it as tables and connect it back to the code.
Bringing in the Backlog
The same pattern is not limited to observability backends. Coral also has sources for systems such as GitHub and Linear.
That matters for the agent workflow because the agent does not need a separate GitHub integration, a separate Linear integration, and a separate telemetry integration. Coral is the interface.
For example:
Use Coral to find open GitHub issues for the Coral repository labeled bug or
performance. Pick one that has enough detail to reproduce, summarize the issue,
then inspect recent Coral telemetry to see whether we have evidence for the
same behavior.
Or:
Use Coral to list open Linear issues assigned to the Coral project. Group them
by area, pick the highest-impact debugging issue, and propose the first Coral
command and telemetry query you would run before touching code.
This is where the loop starts to feel less like “query my traces” and more like “operate the project”. The agent can read the backlog, pick a concrete issue, reproduce it with Coral, inspect the traces and logs through Coral, patch the code, and verify the result, without installing another tool beyond Coral itself.
Summary
Once telemetry is flowing, Coral’s observability data is just another data source. That means a code agent does not have to treat traces, logs, and metrics as something outside its normal workflow. It can query them with Coral, reason about them, and connect them directly to the Rust code that produced them.
The loop is compact: reproduce the behavior, query the telemetry, form a hypothesis, change the smallest relevant code path, and query the telemetry again. Coral provides both sides of that loop: it is the system being observed and the interface used to inspect the observations.
That is the core idea: make Coral report what it is doing, give the code agent access to those reports through Coral, and use the evidence to improve Coral itself.