ai-agentdata-engineeringtext-to-sqltrinoicebergai

AI Agents That Do Data Engineering — Beyond Text-to-SQL to Pipelines

How to delegate the SQL writing, pipeline debugging, quality checks, and catalog cleanup a data engineer repeats every day to an AI agent. We cover the architecture, self-correcting loop, and production guardrails of a data engineering agent running on Trino, Iceberg, and PySpark.

Data DynamicsJune 18, 202612 min read

Look closely at a data engineer's day and, surprisingly, a lot of it goes to repeatable work. Skimming a new table's schema to find the right join and writing the SQL, reading a stack trace when a pipeline breaks and fixing it, checking how many NULLs crept into the loaded data, filling in catalog descriptions. None of it is hard on its own, but bundled together it eats the whole day.

These tasks share a trait. The goal is clear, the tools are fixed, and the result is verifiable. That's exactly the kind of work AI agents are good at. This article goes beyond "ask a chatbot for SQL" to how you design a data engineering agent that, on top of a data platform, plans on its own, runs queries, and fixes its own failures.

What you'll learn

How a data engineering agent fundamentally differs from a Text-to-SQL chatbot

The architecture by which an agent reaches the catalog, query engine, and storage

The four things agents do well: Text-to-SQL, pipeline authoring, quality checks, and cataloging

The self-correcting loop that reads an error and fixes itself when a query fails

The guardrails you must lay down before going to production

If agent concepts themselves (ReAct, Plan-and-Execute, Tool Use, memory) feel unfamiliar, skim the AI Agent Complete Guide first and this article will go much more smoothly. If function calling and MCP are new, the AI Agent · MCP · A2A Introductory Guide pairs well too.

1. What's Different From a Chatbot — "Asking" vs "Getting It Done"

"Text-to-SQL" is a familiar phrase. Put in natural language, get SQL back. But that's just the agent's first step. Real data engineering isn't about writing SQL — it's about owning the result all the way to the query producing the right answer.

Let's see the difference in a picture.

Loading diagram…

The crux is whether the loop is closed. A chatbot tosses out the SQL and stops, so if it hallucinates a column name or builds the wrong join, no one knows until a human catches it. An agent runs it, reads the error and fixes it on failure, and when the result looks off (e.g., zero rows) changes its hypothesis and tries again. This observe → correct cycle is the decisive line between chatbot and agent.

Aspect	Text-to-SQL chatbot	Data engineering agent
Output	SQL string	Verified result / applied change
Execution	Human does it	Agent does it (under guardrails)
Failure handling	Human debugs	Reads error and self-corrects
Schema awareness	Dumps it all into the prompt	Retrieves only what's needed (agentic)
Scope	Single query	Pipelines, quality, catalog

2. Architecture — How the Agent Reaches the Data Platform

An agent's capabilities come down to which tools you put in its hands. The tools a data engineering agent needs are exactly those a human engineer uses: a catalog to find metadata, an engine to run queries, storage where the data lives, and an execution environment to run transforms.

Loading diagram…

One key design principle here: the agent never touches data directly. It always reaches the platform only through the narrow channel of a tool. Why does this matter? Because each channel becomes the single place to attach permissions, auditing, and limits (covered in §5). Our own Argus catalog agent stands on exactly this philosophy — not wiring AI code into the API server, but splitting it out as an independent agent that reaches the catalog only through tools. How to register tools as OpenAI schemas and bind them to execution functions is laid out at code level in the Argus agent's tools & function calling article.

3. The Four Things Agents Do Well

3-1. Text-to-SQL — An Agent That "Searches" the Schema

Naive Text-to-SQL stuffs every schema in the database into the prompt. In a lakehouse with hundreds of tables the context window blows up, costs explode, and you only raise the odds of the LLM picking the wrong table.

An agent takes a different approach. It retrieves only the schema it needs, on demand — this is "Agentic RAG." If static RAG retrieves once and answers once, an agent holds retrieval as a tool and repeats it on its own.

Loading diagram…

To search schemas by meaning, you have to embed table/column descriptions. That part builds on the Embedding Model Guide and Vector Database Comparison, and the retrieval pipeline overall on the RAG Complete Guide. An agent running on Trino needs to know the dialect and patterns, so hand it Trino SQL Patterns and Trino Federation in Practice as its "textbooks" too.

3-2. Authoring & Fixing Pipelines — An Agent That Writes PySpark

Move from querying to transforming and the stage becomes PySpark. Give the agent a goal like "build a daily sales aggregate table from the source orders table," and it checks the schema, writes the transform code, and submits it with tests.

# A transform the agent generated — always shipped with a test
from pyspark.sql import functions as F
 
def build_daily_sales(orders):
    return (
        orders
        .filter(F.col("status") == "PAID")
        .withColumn("order_date", F.to_date("created_at"))
        .groupBy("order_date")
        .agg(F.sum("amount").alias("gross_sales"),
             F.countDistinct("customer_id").alias("buyers"))
    )

There's one thing the agent forgets less than humans: it writes the test alongside. The pattern of splitting transform logic into verifiable units to test is laid out in the PySpark Testing Guide, and you should demand that same discipline from the agent. If you're curious about the design principles of agents that write code well (SWE agents), the Why Spec-Driven Development series is a good reference.

3-3. Data Quality Checks — An Agent That Generates and Runs Rules

Loaded data isn't the finish line. NULL ratios, uniqueness constraints, value ranges — it has to pass quality rules before you can trust it. An agent looks at the schema and samples, proposes appropriate quality rules, and runs them via a tool like Deequ to fetch the results.

Loading diagram…

How to define and run the rules themselves is detailed in PySpark Data Quality (Deequ). The agent's role isn't to replace the rule engine but to rapidly produce rules and interpret failures.

3-4. Catalog & Metadata — The Area We've Already Built

Table descriptions, column comments, PII tagging, business-term mapping. This tedious-but-important work is the first killer use case for an agent. And this isn't theory — it's an area we already build and operate.

Architecture (three execution modes: batch, polling daemon, assistant): A Data Catalog AI Agent Built With Only the Standard Library
LLM layer (OpenAI-compatible, local models): The Argus agent's model/LLM article
Governance (AI generates but humans approve, PII safeguards): The AI That Governs the Catalog

The governance article especially is a living example of the guardrails we'll discuss in §5, so if you're seriously considering a data engineering agent, it's well worth a read.

4. The Self-Correcting Loop — Using Failure as Fuel

The point where an agent beats a chatbot is how it handles failure. A human reads the error when they get a column name wrong and re-checks the catalog. The agent does the same.

Loading diagram…

This loop must have a ceiling. Retry forever and costs explode, and you fall into "loop hell," circling on the same mistake. Usually you combine: ① a max retry count (e.g., 3), ② a token/cost budget, and ③ stall detection like "stop if the query is identical to the previous one." How to observe and cut off these limits is operational territory, covered in detail in Part 1 of the upcoming Operations trilogy (AgentOps observability).

5. Guardrails — The Real 90% of Production

Building a demo takes a day, but 90% of getting to production is keeping the agent from causing accidents. Autonomy and safety are a trade-off, and guardrails fill the gap between them.

Loading diagram…

The guardrails almost always adopted in practice:

Read-only by default. Writes, DDL, and deletes only via a separate approval path.
Human-in-the-loop. Hard-to-reverse actions stop at a proposal; a human applies. Argus's suggest/apply split is exactly this pattern — see the governance article.
Always dry-run first. Look at the plan and scan size with EXPLAIN, and block before execution if it exceeds limits.
Row, timeout, and cost limits. So a runaway query can't eat the cluster. This aligns with Trino resource groups.
Authority via the delegated user token. The agent must not carry admin rights at all times. Argus's serve mode — which takes a delegated user token and acts only with that person's permissions — is a good model.
Every tool call to an audit log. What was done, why, and with what result must be traceable.

Anyone who's lived through an incident knows these guardrails aren't "nice to have" but "no production without them."

6. Production Pitfalls — The Distance Between Demo and Reality

Finally, the pitfalls you actually meet once it's live.

Schema hallucination. It confidently writes plausible columns that don't exist. → Don't inject the schema via prompt; have it retrieve via tools, and always verify with EXPLAIN.
Cost explosion. One self-correcting loop runs several LLM calls + Trino full scans. → Enforce token/scan budgets and a retry ceiling.
Quietly wrong joins. No error, but the result is wrong (Cartesian product, wrong key). → Put a sanity check step on result row counts/distributions into the loop.
Governance bypass. If the agent skips RLS/masking and reads the source directly, that's a data incident. → Always with delegated authority, routed through the catalog's policies.
No observability. If you can't see why the agent threw that query, debugging is impossible. → Tracing is not optional; it's required.

The last two — governance and observability — are so important we split them out into the dedicated Operations trilogy.

Closing — And a Preview of the Operations Trilogy

A data engineering agent isn't "magic that replaces people"; it's a well-designed system that handles repeatable, verifiable work through tools in a closed loop. Text-to-SQL is just the first step, and the real value comes from fixing its own failures, verifying quality, and running safely inside guardrails.

There's a wide river between building a good demo and trusting it in production. The next Operations trilogy lays the three bridges across it — observability, evaluation, security — one at a time.

Part 1 · AgentOps: Agent Observability and Tracing — what the agent was thinking when it called that tool, how many tokens and how much it cost, and how to reproduce a failure.
Part 2 · Agent Evaluation — When There's More Than One Right Answer — task success rate, trajectory evaluation, regression testing, and the traps of LLM-as-judge.
Part 3 · Agent Security — Beyond Prompt Injection — least-privilege tool permissions, sandboxing, the confused deputy, and MCP trust boundaries.

In one sentence: An agent's value isn't "generating SQL" but "owning a verified result," and what holds up that ownership is the closed loop and the guardrails.