May 27, 2026

What Is a DAG in Data Engineering? Graph Basics

Reading time :  
4
 min
Codewords
Codewords

What is a DAG in data engineering? Graph basics explained

A DAG in data engineering is a directed acyclic graph — a way of defining tasks and their dependencies so that every task runs after its prerequisites and no circular dependencies exist. "Directed" means edges have a direction (Task A runs before Task B). "Acyclic" means there are no loops (Task B cannot also be a prerequisite of Task A).

The metaphor is a recipe. You cannot frost a cake before you bake it, and you cannot bake it before you mix the batter. A DAG formalizes that ordering for data pipelines. Unlike generic AI automation posts, this guide shows real CodeWords workflows — not just theory.

Related reading: what is data lineage, API orchestration vs choreography, workflow automation for data engineers, AI workflow automation, workflow automation tools, CodeWords integrations, CodeWords templates.

Why do data engineers use DAGs?

DAGs solve three problems that appear in every pipeline.

Dependency management. A transformation that joins two tables needs both source tables to be fresh. The DAG ensures the extraction tasks for both tables complete before the join task starts.

Parallelism. Tasks without dependencies can run simultaneously. A well-structured DAG maximizes parallel execution, reducing total pipeline runtime.

Failure isolation. When a task fails, the DAG knows which downstream tasks to skip and which independent branches can continue. This prevents a single failure from blocking unrelated work.

Apache Airflow, the most widely used workflow orchestrator in data engineering, structures every pipeline as a DAG. According to the 2024 State of Data Engineering survey by Airbyte, Airflow remains the leading orchestration tool, used by over 60% of data teams.

How is a DAG structured?

A DAG has two components:

Nodes represent tasks — data extractions, transformations, model training runs, API calls, notifications.

Edges represent dependencies — directed connections that define execution order.

Example pipeline as a DAG:

  • Node 1: Extract data from PostgreSQL
  • Node 2: Extract data from Salesforce API
  • Node 3: Transform and join (depends on Nodes 1 and 2)
  • Node 4: Load to warehouse (depends on Node 3)
  • Node 5: Send Slack notification (depends on Node 4)

Nodes 1 and 2 run in parallel. Node 3 waits for both. Nodes 4 and 5 run sequentially after that.

What tools use DAGs?

Apache Airflow defines DAGs in Python. Each DAG is a collection of tasks with defined dependencies and a schedule.

dbt automatically constructs a DAG from ref() functions in SQL models. If model B references model A, dbt knows to run A first.

Prefect and Dagster offer DAG-based orchestration with different developer experience trade-offs.

CodeWords workflows are implicitly DAG-structured — each step in a workflow depends on the output of prior steps, and the platform handles execution ordering. For data engineers, CodeWords adds LLM access (OpenAI, Anthropic, Gemini) at any node, which means a DAG step can classify, extract, or summarize data before passing it downstream.

What happens when a DAG has a cycle?

It is no longer a DAG. Cycles create infinite loops: Task A triggers Task B, which triggers Task A again. Orchestration tools reject cyclic definitions at compile time.

If your workflow genuinely needs iterative behavior (polling an API until a condition is met), the solution is a loop within a single task node — not a cycle in the graph. In CodeWords, you handle this with standard Python control flow inside the workflow, while the overall workflow structure remains acyclic.

FAQ

Is a DAG the same as a workflow?

A DAG is a data structure. A workflow is a business concept. Workflows are often implemented as DAGs, but not all workflows require DAG semantics. Simple linear sequences are technically DAGs, but nobody calls them that.

Can a DAG have multiple starting points?

Yes. A DAG can have multiple root nodes (tasks with no dependencies). They all start at the same time. This is how parallel extraction tasks work.

How does a DAG relate to data lineage?

The DAG defines the plan — which tasks run in what order. Data lineage records the execution — which data flowed through which steps. They are complementary: the DAG is the blueprint, lineage is the construction log. See what is data lineage for details.

Where to start

If you are building data pipelines, you are already using DAGs whether you call them that or not. The value is in making them explicit: defined in code, version-controlled, and monitored.

Build data workflows with DAG-structured execution in CodeWords. Explore pipeline patterns at CodeWords templates.

Contents
Ready to try CodeWords?
Get started free
Sign in
Sign in