What is a DAG in data engineering? Graph basics explained
A DAG in data engineering is a directed acyclic graph — a way of defining tasks and their dependencies so that every task runs after its prerequisites and no circular dependencies exist. "Directed" means edges have a direction (Task A runs before Task B). "Acyclic" means there are no loops (Task B cannot also be a prerequisite of Task A).
The metaphor is a recipe. You cannot frost a cake before you bake it, and you cannot bake it before you mix the batter. A DAG formalizes that ordering for data pipelines. Unlike generic AI automation posts, this guide shows real CodeWords workflows — not just theory.
Why do data engineers use DAGs?
DAGs solve three problems: dependency management (ensuring source tables are fresh before joins run), parallelism (tasks with no dependency on each other can run simultaneously), and failure isolation (when a task fails, only the tasks downstream of it are blocked, while independent branches continue).
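Failure isolation is easier to see in code. The sketch below is a minimal illustration, not how any particular orchestrator implements it. The task names and the `blocked_by` helper are hypothetical. Given a failed task, it walks the edges to find everything downstream and treats the rest of the graph as safe to keep running.

```python
from collections import deque

# Hypothetical pipeline: each task maps to the tasks that depend on it.
downstream = {
    "extract_orders": ["join_orders_customers"],
    "extract_customers": ["join_orders_customers"],
    "extract_web_logs": ["sessionize_logs"],
    "join_orders_customers": ["load_sales_mart"],
    "sessionize_logs": [],
    "load_sales_mart": [],
}

def blocked_by(failed, downstream):
    """Return every task that transitively depends on the failed task."""
    blocked = set()
    queue = deque(downstream[failed])
    while queue:
        task = queue.popleft()
        if task not in blocked:
            blocked.add(task)
            queue.extend(downstream[task])
    return blocked

failed_task = "extract_orders"
blocked = blocked_by(failed_task, downstream)
# Everything that is neither the failed task nor downstream of it can still run.
unaffected = set(downstream) - blocked - {failed_task}

print("blocked:", sorted(blocked))
print("unaffected:", sorted(unaffected))
```

In this example the web log branch shares no edge with the failed extraction, so it is unaffected. The customer extraction can also still run, even though the join that consumes it is blocked.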
Apache Airflow, the most widely used workflow orchestrator in data engineering, structures every pipeline as a DAG. According to the 2024 State of Data Engineering survey, Airflow remains the leading orchestration tool, used by over 60% of data teams.
How is a DAG structured?
A DAG has nodes (tasks — extractions, transformations, API calls) and edges (dependencies — directed connections defining execution order). Example: Node 1 (Extract from PostgreSQL) and Node 2 (Extract from Salesforce API) run in parallel, then Node 3 (Transform and join) waits for both, then Node 4 (Load to warehouse) and Node 5 (Send Slack notification) run sequentially.
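The five-node example above can be sketched with Python's standard-library `graphlib` module. The task names are made up for illustration, and the sketch assumes the Slack notification waits for the warehouse load. The `execution_batches` helper is hypothetical. It groups tasks into batches, where every task in a batch has all of its prerequisites finished and could run in parallel.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it must wait for (its prerequisites).
pipeline = {
    "extract_postgres": set(),
    "extract_salesforce": set(),
    "transform_join": {"extract_postgres", "extract_salesforce"},
    "load_warehouse": {"transform_join"},
    "notify_slack": {"load_warehouse"},
}

def execution_batches(graph):
    """Group tasks into batches whose prerequisites are all complete."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # tasks with no unmet prerequisites
        batches.append(ready)
        sorter.done(*ready)  # mark the batch finished so dependents unlock
    return batches

batches = execution_batches(pipeline)
for number, batch in enumerate(batches, start=1):
    print(number, batch)
```

The two extractions land in the first batch together, and each later batch holds a single task, which matches the parallel-then-sequential shape described above.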
What tools use DAGs?
Apache Airflow defines DAGs in Python. dbt automatically constructs a DAG from ref() functions in SQL models. Prefect and Dagster offer DAG-based orchestration with different developer experience trade-offs. CodeWords workflows are implicitly DAG-structured — each step depends on the output of prior steps, and the platform adds LLM access at any node.
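To give a feel for how a DAG can be inferred from `ref()` calls, here is a deliberately simplified sketch. It is not dbt's implementation: dbt renders models through Jinja, while this example uses a regular expression. The model names, SQL, and `build_dependencies` helper are all hypothetical.

```python
import re

# Hypothetical models keyed by name. A real dbt project stores these as .sql files.
models = {
    "stg_orders": "select * from raw.orders",
    "stg_customers": "select * from raw.customers",
    "orders_enriched": """
        select o.*, c.region
        from {{ ref('stg_orders') }} o
        join {{ ref('stg_customers') }} c on o.customer_id = c.id
    """,
    "revenue_by_region": (
        "select region, sum(amount) from {{ ref('orders_enriched') }} group by 1"
    ),
}

# Matches {{ ref('model_name') }} with single or double quotes (illustrative only).
REF_PATTERN = re.compile(r"\{\{\s*ref\(\s*['\"](\w+)['\"]\s*\)\s*\}\}")

def build_dependencies(models):
    """Map each model to the set of models it references through ref()."""
    return {name: set(REF_PATTERN.findall(sql)) for name, sql in models.items()}

deps = build_dependencies(models)
for name, upstream in deps.items():
    print(name, "<-", sorted(upstream))
```

The point is that nobody declares the edges by hand. They fall out of the references inside each model, and the resulting mapping can then be ordered like any other DAG.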
What happens when a DAG has a cycle?
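It is no longer a DAG. A cycle makes the execution order undefined: Task A waits for Task B, which waits for Task A, so neither can ever start (or, in a trigger-based system, each keeps triggering the other forever). Orchestration tools reject cyclic definitions when the pipeline is parsed or compiled, before any task runs. If your workflow genuinely needs iterative behavior (polling an API until a condition is met), handle this with a loop within a single task node.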
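Both ideas can be sketched with the standard library. The `validate` helper shows one way a definition could be rejected up front, using `graphlib.CycleError`. The `poll_until` helper shows iteration kept inside a single task rather than expressed as a cycle between tasks. Both helper names and their parameters are hypothetical, and real orchestrators have their own validation and retry mechanisms.

```python
import time
from graphlib import CycleError, TopologicalSorter

def validate(graph):
    """Return a valid execution order, or raise ValueError if the graph has a cycle."""
    try:
        return list(TopologicalSorter(graph).static_order())
    except CycleError as err:
        # The second element of err.args is the list of nodes forming the cycle.
        raise ValueError(f"not a DAG, cycle found: {err.args[1]}") from err

def poll_until(check, max_attempts=5, delay_seconds=0.0):
    """Loop inside one task: call check() until it returns True or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        if check():
            return attempt
        time.sleep(delay_seconds)
    raise TimeoutError(f"condition not met after {max_attempts} attempts")

# A cyclic definition is rejected before anything runs.
try:
    validate({"task_a": {"task_b"}, "task_b": {"task_a"}})
except ValueError as err:
    print(err)

# Simulated API that reports "ready" on the third call.
responses = iter([False, False, True])
attempts_needed = poll_until(lambda: next(responses))
print("ready after", attempts_needed, "attempts")
```

From the graph's point of view the polling task is still one node with a clear start and finish, so the pipeline stays acyclic.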
FAQ
Is a DAG the same as a workflow?
A DAG is a data structure. A workflow is a business concept. Workflows are often implemented as DAGs, and even a simple linear sequence is technically a DAG: each step has exactly one prerequisite and there are no loops.
How does a DAG relate to data lineage?
The DAG defines the plan: which tasks run in what order. Data lineage records the execution: which data flowed through which steps. The DAG is the blueprint; lineage is the construction log.
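A toy example may make the distinction concrete. In this sketch, `plan` is the DAG, and `lineage` is a log written while the plan executes. The task functions, record fields, and `run_with_lineage` helper are hypothetical. Real lineage tools capture far more detail, such as datasets, columns, and run IDs.

```python
from graphlib import TopologicalSorter

# The plan (blueprint): each task maps to the tasks it depends on.
plan = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
}

# Hypothetical task bodies: each receives its upstream outputs and returns rows.
tasks = {
    "extract": lambda inputs: [1, 2, 3, 4],
    "transform": lambda inputs: [x * 10 for x in inputs["extract"] if x % 2 == 0],
    "load": lambda inputs: inputs["transform"],
}

def run_with_lineage(plan, tasks):
    """Execute tasks in dependency order and record what each one read and produced."""
    outputs, lineage = {}, []
    for name in TopologicalSorter(plan).static_order():
        inputs = {dep: outputs[dep] for dep in plan[name]}
        outputs[name] = tasks[name](inputs)
        lineage.append(
            {"task": name, "read_from": sorted(plan[name]), "rows_out": len(outputs[name])}
        )
    return lineage

lineage = run_with_lineage(plan, tasks)
for record in lineage:
    print(record)
```

The plan is identical on every run, but the lineage records change with the data. Here the transform step drops two of the four extracted rows, and only the log shows that.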
Build data workflows with DAG-structured execution in CodeWords. Explore pipeline patterns at CodeWords templates.




