ETHAN STUART
October 2025 · Field Note

AI Agents Don’t Dream of Dirty Data

🧠 The Agent Era Is Here — But It’s Built on Your Data

Last quarter, a Fortune 500 company’s AI agent confidently recommended cutting production on its highest-grossing film franchise. The reason? Two systems defined “revenue” differently: one included international streaming, the other didn’t. The agent wasn’t broken. The data was.

We’ve entered a new chapter in AI—the agentic era—where large language models don’t just respond to prompts, they reason, plan, and act. From LangChain and CrewAI to OpenAI GPTs and Snowflake Cortex, these frameworks are transforming how we work. Agents can now orchestrate workflows, summarize customer insights, write SQL, or even trigger production systems autonomously.

But here’s the truth nobody wants to hear: none of it works without clean data.

You can’t orchestrate intelligence on top of confusion. AI agents rely on structured, trustworthy, and accessible data. When your underlying systems are inconsistent, redundant, or undocumented, the agent’s “reasoning” becomes nothing more than statistical noise wrapped in confidence.

In other words: AI is only as smart as your data is clean.


🧹 Bad Data Broke Dashboards. Now It Breaks Reasoning Chains.

We spent the last decade learning that garbage data ruins dashboards. Now, it destroys reasoning chains entirely.

When a chatbot or AI assistant gives a wrong answer over enterprise data, it is often not “hallucinating” at all. It is reflecting the gaps, duplications, or inconsistencies in the datasets it learned from or queried.

Imagine an operations agent trying to optimize movie budgets or logistics. If two datasets define “region” differently, or if title IDs don’t align across platforms, your agent will misclassify projects, misforecast spend, and provide incomplete recommendations with absolute confidence.

AI doesn’t understand truth; it understands patterns. When those patterns are built on conflicting data, confidence becomes deception.
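
A definition mismatch like the “revenue” example can be caught before an agent ever touches the data. The sketch below shows one minimal way to do that, assuming each system's metric definition can be exported as a set of component names. The system names, component names, and the `find_conflicts` helper are all hypothetical and not taken from any particular catalog tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricDefinition:
    system: str            # hypothetical source system name
    metric: str            # business term, e.g. "revenue"
    components: frozenset  # what the system counts toward the metric


def find_conflicts(definitions):
    """Return metrics whose components differ across systems.

    The result maps metric name -> {system: components that system is missing}.
    """
    by_metric = {}
    for d in definitions:
        by_metric.setdefault(d.metric, []).append(d)

    conflicts = {}
    for metric, defs in by_metric.items():
        # More than one distinct component set means the systems disagree.
        if len({d.components for d in defs}) > 1:
            union = frozenset().union(*(d.components for d in defs))
            conflicts[metric] = {
                d.system: sorted(union - d.components) for d in defs
            }
    return conflicts


definitions = [
    MetricDefinition(
        "finance_dw", "revenue",
        frozenset({"box_office", "home_video", "intl_streaming"}),
    ),
    MetricDefinition(
        "planning_tool", "revenue",
        frozenset({"box_office", "home_video"}),
    ),
]
conflicts = find_conflicts(definitions)
print(conflicts)
```

Here the report flags that the hypothetical `planning_tool` leaves `intl_streaming` out of “revenue”, which is exactly the kind of gap an agent would otherwise reason over with full confidence.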

That’s why governance, metadata, and lineage are now front-line disciplines—not back-office afterthoughts.


⚙️ The New Stack: Why DataOps + AgentOps Changes Everything

Traditional data stacks were built for hindsight: dashboards looking backward at what happened. Agent stacks need foresight: systems that can be queried in natural language, enforce policy in real time, and adapt to user intent dynamically.

That requires a different architecture entirely.

Below is what the modern stack looks like—from ingestion to agent orchestration:

1. Data Ingestion & Transformation

Purpose: Continuous, automated pipelines feeding gold-standard datasets
Example Tools: Fivetran, Airbyte, dbt, Snowflake Tasks

2. Data Governance & Access Policies

Purpose: Manage, track, and enforce data use, quality, and compliance
Example Tools: Immuta, Alation, Collibra, Atlan

3. Semantic & Contextual Layers

Purpose: Define business logic and relationships for machine readability
Example Tools: dbt Semantic Layer (MetricFlow), Snowflake Cortex Analyst semantic models, Cube, AtScale

4. AI Agent Frameworks

Purpose: Enable reasoning, planning, and orchestration across workflows
Example Tools: LangChain, CrewAI, OpenHands (formerly OpenDevin), Snowflake Cortex Agents

5. Monitoring & Evaluation

Purpose: Track model performance, bias, reliability, and data drift
Example Tools: Weights & Biases, TruEra, Helicone, PromptLayer

This is what distinguishes useful AI from novelty: not the model size, not the prompt tuning—but the clarity and health of the underlying data ecosystem.


🧮 Case Study: How Snowflake Cortex Actually Works

When teams build agents within Snowflake Cortex, those agents rely on semantic context and query translation—essentially turning human language into SQL. But Cortex doesn’t inherently “understand” business logic; it learns it from metadata, definitions, and schemas.

This is where the ecosystem becomes critical.

Here’s the flow:

User Question → Cortex (semantic translation) → Immuta (policy enforcement) → dbt models (clean, tested data) → Alation (context verification) → Trusted Answer

Let’s break down each layer:

  • Immuta handles policy enforcement. When an agent queries through Cortex, it only accesses the rows and columns the requesting user is authorized to see, with no per-query code changes and no manual review step.

  • Alation (or Atlan) manages metadata and lineage, providing context—what the data means, who owns it, how fresh it is, and how reliable.

  • dbt provides the transformation and testing layer, ensuring that all curated models have defined sources, freshness checks, and data quality constraints.

The result?

When a user asks, “What’s our total production spend for Marvel titles released in 2024?”, Cortex can safely generate and execute SQL that’s both correct and compliant—because every piece of the stack speaks the same data language.

Without this alignment, you get technically valid queries that produce business nonsense.
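
The flow above can be sketched as a chain of stages, each of which either enriches the request or refuses to continue. This is only an illustration of the ordering, not how Cortex, Immuta, dbt, or Alation are actually integrated: every dictionary, model name, column name, and function below is a hypothetical in-memory stand-in for the real service.

```python
from dataclasses import dataclass, field

# Hypothetical stand-ins for the semantic layer, policy engine,
# tested-model registry, and catalog described above.
SEMANTIC_INDEX = {
    "production spend": (
        "fct_production_spend",
        ["title", "franchise", "release_year", "spend_usd", "vendor_name"],
    ),
}
ROLE_ALLOWED_COLUMNS = {
    "finance_analyst": {"title", "franchise", "release_year", "spend_usd"},
}
TESTED_MODELS = {"fct_production_spend"}
CATALOG = {"fct_production_spend": {"owner": "studio-finance"}}


@dataclass
class Request:
    user_role: str
    question: str
    model: str = ""
    columns: list = field(default_factory=list)
    context: dict = field(default_factory=dict)


def translate(req):
    """Semantic translation: map a business phrase to a curated model."""
    for phrase, (model, columns) in SEMANTIC_INDEX.items():
        if phrase in req.question.lower():
            req.model, req.columns = model, list(columns)
            return req
    raise LookupError("no semantic definition matches the question")


def enforce_policy(req):
    """Policy enforcement: keep only columns the role may see."""
    if req.user_role not in ROLE_ALLOWED_COLUMNS:
        raise PermissionError(f"no policy for role {req.user_role!r}")
    allowed = ROLE_ALLOWED_COLUMNS[req.user_role]
    req.columns = [c for c in req.columns if c in allowed]
    return req


def require_tested_model(req):
    """Only answer from models that have passed their tests."""
    if req.model not in TESTED_MODELS:
        raise ValueError(f"model {req.model!r} is not tested")
    return req


def attach_context(req):
    """Context verification: attach ownership metadata to the answer."""
    req.context = dict(CATALOG.get(req.model, {}))
    return req


def answer(req):
    for stage in (translate, enforce_policy, require_tested_model, attach_context):
        req = stage(req)
    return req


result = answer(Request(
    "finance_analyst",
    "What's our total production spend for Marvel titles released in 2024?",
))
print(result.model, result.columns, result.context)
```

The design choice worth noting is that every stage fails closed: an unknown role, an unmatched phrase, or an untested model stops the request instead of letting the agent improvise.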


🧩 Who Orchestrates This? Enter the Modern Data Product Manager

Data Product Management is no longer just about enabling dashboards or managing pipelines—it’s about ensuring that data is productized, governable, and agent-ready.

In the Snowflake Cortex example above, someone had to:

  • Define what “production spend” means across departments

  • Ensure dbt models were tested and documented

  • Configure Immuta policies for different user roles

  • Verify that Alation metadata was accurate and machine-readable

That someone is the modern Data Product Manager (DPM).

DPMs are now responsible for translating between data engineering and intelligent automation. Their charter includes:

Defining the Gold Layer — Curating trusted datasets that reflect real-world truth, not just what’s convenient to extract.

Driving Metadata Discipline — Ensuring lineage, definitions, and ownership are clearly documented and machine-readable.

Partnering with Governance — Implementing controls that allow agents to use sensitive data safely without friction.

Designing Semantic Models — Building domain vocabularies that let natural-language systems interpret data meaningfully.

Measuring Value Beyond Dashboards — Tracking usage, accuracy, and reliability of agent-based insights as product metrics.

When data becomes a product, AI agents become viable consumers.
When data remains a collection of pipelines, agents become hallucination factories.


🎯 The Missing Link: Why Semantic Layers Matter More Than You Think

Semantic layers have become the unsung heroes of the AI era. They translate business meaning into machine context.

For years, companies tried to fix decision-making by adding more dashboards. But decision speed never improved—because access ≠ understanding.

Now, imagine this instead:

Your AI assistant can answer, “Show me titles in post-production with spend variance above 15%” without any pre-built report. No dashboard. No manual SQL. Just a question.

That’s possible because your semantic layer—built in dbt or Cortex—defines what “spend,” “variance,” and “title status” mean in business terms.

The same definition that drives the BI dashboard also powers the AI agent’s reasoning.
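
To make that shared definition concrete, here is a minimal sketch in which one metric expression is defined once and compiled into SQL for any caller, whether a dashboard or an agent. It assumes a toy `title_spend` table in an in-memory SQLite database; the table, column names, sample rows, and the `compile_query` helper are hypothetical, and real semantic layers such as dbt's or Cortex's use their own definition formats.

```python
import sqlite3

# One place where "spend variance" is defined, in business terms.
# 100.0 forces floating-point arithmetic; NULLIF avoids division by zero.
METRICS = {
    "spend_variance_pct": (
        "100.0 * (SUM(actual_spend) - SUM(budgeted_spend)) "
        "/ NULLIF(SUM(budgeted_spend), 0)"
    ),
}


def compile_query(metric, status, threshold):
    """Build SQL from the shared definition; values are bound as parameters."""
    expr = METRICS[metric]  # raises KeyError for an undefined metric
    sql = (
        f"SELECT title_name, {expr} AS {metric} "
        "FROM title_spend WHERE status = ? "
        f"GROUP BY title_name HAVING {expr} > ? "
        "ORDER BY title_name"
    )
    return sql, (status, threshold)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE title_spend "
    "(title_name TEXT, status TEXT, actual_spend REAL, budgeted_spend REAL)"
)
conn.executemany(
    "INSERT INTO title_spend VALUES (?, ?, ?, ?)",
    [
        ("Title A", "post_production", 120.0, 100.0),  # 20% over budget
        ("Title B", "post_production", 105.0, 100.0),  # 5% over budget
        ("Title C", "released", 200.0, 100.0),         # wrong status
    ],
)

# "Show me titles in post-production with spend variance above 15%"
sql, params = compile_query("spend_variance_pct", "post_production", 15)
rows = conn.execute(sql, params).fetchall()
print(rows)
```

Because the agent never writes the variance formula itself, it cannot invent a different one; asking for a metric that is not defined raises an error instead of producing a plausible guess.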

This alignment is what creates trust—and it’s where the line between data products and AI products disappears.

Without semantic context, your agent becomes a SQL-generation roulette wheel—technically correct syntax, business nonsense output.


🔒 AI Governance as Code: From Bureaucracy to Infrastructure

Governance used to mean bureaucracy. Now, it’s infrastructure.

Modern governance tools like Immuta, Privacera, or Okera allow teams to express policies as code—things like “Finance analysts can only see anonymized salary fields unless approved by HR.”

When your AI agent executes a query, those policies are automatically applied. You don’t have to rely on trust; you rely on code.
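
As a rough illustration of what “policy as code” means, the sketch below encodes the salary rule from the example as data plus one enforcement function. It is not the API of Immuta, Privacera, or any other product: the role names, the `MASKED_COLUMNS_BY_ROLE` table, and `apply_policy` are hypothetical, and simple redaction stands in for whatever anonymization a real tool would apply.

```python
from dataclasses import dataclass

MASK = "REDACTED"

# Hypothetical policy table: which columns are masked for each role.
MASKED_COLUMNS_BY_ROLE = {
    "finance_analyst": {"salary"},
    "hr_partner": set(),
}


@dataclass(frozen=True)
class User:
    role: str
    hr_approved: bool = False


def apply_policy(user, rows):
    """Return rows with masked columns redacted for this user.

    Roles without a policy are denied outright rather than shown everything.
    """
    if user.role not in MASKED_COLUMNS_BY_ROLE:
        raise PermissionError(f"no policy defined for role {user.role!r}")
    # HR approval lifts masking; otherwise the role's masked columns apply.
    masked = set() if user.hr_approved else MASKED_COLUMNS_BY_ROLE[user.role]
    return [
        {col: (MASK if col in masked else value) for col, value in row.items()}
        for row in rows
    ]


rows = [{"employee_id": 17, "department": "Post", "salary": 95000}]
analyst_view = apply_policy(User("finance_analyst"), rows)
approved_view = apply_policy(User("finance_analyst", hr_approved=True), rows)
print(analyst_view)
print(approved_view)
```

Since the rule lives in a reviewable file rather than in someone's head, it can be versioned, tested, and applied identically whether the query comes from a person or an agent.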

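This enables scalable compliance: every query an agent runs is evaluated against a versioned policy, so access decisions are explainable and auditable, and a bad policy change can be rolled back.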

And when paired with Alation’s metadata or Atlan’s lineage tracking, you gain a complete view of every data touchpoint your agent interacts with.

This isn’t just technical hygiene. It’s the foundation of AI ethics and accountability.


💡 The Future: Self-Healing, Self-Aware Data Systems

The next phase of this movement will be self-improving data ecosystems—AI agents that maintain the integrity of the data they use.

This isn’t speculation. It’s already emerging:

  • Snowflake’s Cortex Analyst auto-suggests corrections when query patterns deviate from expected schemas

  • Monte Carlo and Metaplane detect anomalies that humans miss—the next step is agents that close the loop automatically

  • Great Expectations Cloud is exploring agent-driven test generation based on data behavior

Imagine agents that can:

  • Detect when pipeline data doesn’t align with expected patterns

  • Suggest dbt tests or transformations automatically

  • Identify redundant datasets or stale lineage entries

  • Rewrite or optimize queries in response to schema changes
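
The second capability in that list, suggesting tests automatically, does not require anything exotic to prototype. The sketch below profiles a sample of rows and proposes dbt-style generic tests (`not_null`, `unique`) for columns whose observed values already satisfy them. The `suggest_tests` helper and the sample data are hypothetical, and a suggestion drawn from a sample is only a candidate for human review, not proof that the constraint holds.

```python
def suggest_tests(rows):
    """Propose generic tests per column based on observed sample values."""
    if not rows:
        return {}
    suggestions = {}
    for col in rows[0].keys():
        values = [row.get(col) for row in rows]
        tests = []
        if all(v is not None for v in values):
            tests.append("not_null")
            # Only propose uniqueness when no value repeats in the sample.
            if len(set(values)) == len(values):
                tests.append("unique")
        suggestions[col] = tests
    return suggestions


sample = [
    {"title_id": "T-001", "region": "EMEA", "spend_usd": 1200000},
    {"title_id": "T-002", "region": "EMEA", "spend_usd": None},
    {"title_id": "T-003", "region": "APAC", "spend_usd": 800000},
]
suggested = suggest_tests(sample)
print(suggested)
```

An agent built on this idea would open a pull request with the proposed tests rather than apply them silently, which keeps a human in the loop while the system earns trust.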

We’re heading toward autonomous DataOps, where agents not only consume data but curate and protect it in real time.


🧭 How to Prepare Your Organization Right Now

If your organization is preparing for AI agents—don’t start with LLM prompts. Start with data readiness.

Here’s a practical roadmap with timelines:

Month 1-2: Audit Your Data Foundation

  • Use dbt, Great Expectations, or Soda to identify missing tests, schema drift, and stale datasets

  • Track data freshness and latency metrics for key domains
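
A freshness audit can start very small. The following sketch flags datasets whose last successful load is older than an agreed threshold, assuming load timestamps are available from pipeline metadata. The dataset names, the 24-hour threshold, and the `stale_datasets` helper are hypothetical; tools like dbt and Soda provide their own freshness checks.

```python
from datetime import datetime, timedelta, timezone


def stale_datasets(last_loaded, max_age, now=None):
    """Return names of datasets whose last load is older than max_age."""
    now = now or datetime.now(timezone.utc)
    return sorted(
        name for name, loaded_at in last_loaded.items()
        if now - loaded_at > max_age
    )


# Fixed "now" so the example is reproducible.
now = datetime(2025, 10, 1, 12, 0, tzinfo=timezone.utc)
last_loaded = {
    "fct_production_spend": now - timedelta(hours=2),
    "dim_titles": now - timedelta(days=3),
}
stale = stale_datasets(last_loaded, max_age=timedelta(hours=24), now=now)
print(stale)
```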

Month 2-3: Enforce Metadata Consistency

  • Implement Alation, Atlan, or Collibra to document every data source and business term

  • Make metadata machine-readable for agent consumption

Month 3-4: Adopt Governance-as-Code

  • Automate access control and masking with Immuta or Privacera

  • Integrate policy enforcement at query runtime, not at review time

Month 4-5: Define a Semantic Layer Early

  • Use the dbt Semantic Layer or Snowflake Cortex semantic definitions to create shared logic across tools

  • Keep your business terms versioned, reviewed, and visible

Month 5-6: Start Small, Scale Right

  • Pick one clean dataset—marketing, finance, or operations—and prototype a single agent use case

  • Validate trust, latency, and interpretability before expanding scope
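
One way to make “validate trust and latency” measurable is a small golden-set harness: a list of questions with known answers, run against the agent before each expansion of scope. The sketch below assumes the agent can be called as a plain function. The `evaluate` helper, the toy agent, and the questions and answers are hypothetical placeholders.

```python
import time


def evaluate(agent, golden_set):
    """Run the agent over (question, expected) pairs and report basic metrics."""
    if not golden_set:
        raise ValueError("golden_set must not be empty")
    correct = 0
    latencies = []
    for question, expected in golden_set:
        start = time.perf_counter()
        got = agent(question)
        latencies.append(time.perf_counter() - start)
        if got == expected:
            correct += 1
    return {
        "accuracy": correct / len(golden_set),
        "max_latency_s": max(latencies),
    }


def toy_agent(question):
    # Stand-in for a real agent call; it only knows one answer.
    canned = {"How many titles are in post-production?": 12}
    return canned.get(question)


golden_set = [
    ("How many titles are in post-production?", 12),
    ("What was Q3 marketing spend in USD?", 4_500_000),
]
report = evaluate(toy_agent, golden_set)
print(report)
```

Tracking these numbers over time turns “do we trust the agent?” into a product metric with a threshold the team can agree on before widening access.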

These steps aren’t glamorous. But they are the difference between AI that works once and AI that works reliably.


🔮 Closing Thought: Clean Data Is the New Model Weight

The companies winning the agent era won’t be the ones with the biggest models or the cleverest prompts.

They’ll be the ones who understood that every agent is only as intelligent as the data foundation beneath it.

The world will keep talking about parameters, architectures, and multimodal reasoning. But the true differentiator will be your data cleanliness, context, and control.

AI agents don’t dream of dirty data—they reject it.

And in that rejection lies the competitive moat of the next decade.
