The Grammar of Data: Define Once, Run Anywhere with Cross-Engine Expressions

By Simon Späti (guest) | June 30, 2026

[Figure: Cross-Engine Expressions, The Grammar of Data. batting (noun), .filter(yr==2015) (verb), .aggregate(hits) (verb), .into_backend(pg) (modifier). Express. Verify. Run.]

Grammars, for languages or any other field, are a beautiful thing. They compress complex systems into a handful of rules. In written language, for example, we know when to capitalize a letter or how to start a sentence. There are clear rules. Grammars also help us remember: we do not need to recall every little rule, only apply a few of them in a structured way.

For text editing, we have Vim motions that help us navigate a text document with thousands of possible shortcuts. Because there is a grammar, we do not need to memorize them all; we learn the structure and combine the parts. But what if you work in data? What if we had the same thing there: a grammar for data engineering, a language that defines it?

What if we could express our needs declaratively and decisively, in a way that leads to reproducible outcomes and works with the many tools and execution engines already out there? That is what this article discusses: how existing tooling such as Ibis provides some of these capabilities, and how xorq extends them with full lineage and transparency for humans and an executable memory for tabular data, all manifested in a single git repository.

Expressions for Data Engineering Workloads

Having a grammar for data engineering means we can express the workloads in a declarative manner, and then be sure we can deterministically reproduce and apply that exact definition.

It’s similar to the concept of a Declarative Data Stack I introduced a while back, but it gives the stack not only configurations but also a language with in-built manifestation and execution engines.

Write a declarative expression, manifest it into a validated and hashed build, then execute it on any engine.

In the above image, we see:

1. How to express (write) our transformations and business logic. It’s the context of every ML or DE pipeline.
2. We can build the expression into a manifest that has a unique hash, runs input validations, tracks lineage, creates a deterministic cache, and produces a human-readable expr.yaml you can diff and review in a PR.
3. Lastly, we can execute it in any execution engine with the same manifest.
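To make the manifest step concrete, here is a minimal sketch in plain Python of how an expression could be validated and turned into a content hash. Everything in it is an assumption for illustration: the dict-based expression tree, `validate`, `build_manifest`, and the 12-character hash are hypothetical and do not reflect how xorq serializes or hashes builds.

```python
import hashlib
import json

# Hypothetical expression tree: plain dicts standing in for a real expression.
expr = {
    "op": "aggregate",
    "metric": "hits",
    "input": {
        "op": "filter",
        "predicate": {"column": "yearID", "eq": 2015},
        "input": {
            "op": "source",
            "table": "batting",
            "schema": {"yearID": "int64", "hits": "int64"},
        },
    },
}


def validate(node):
    """Walk down to the source and require that it declares a schema."""
    if node["op"] == "source":
        if not node.get("schema"):
            raise ValueError("source must declare a schema")
        return
    validate(node["input"])


def build_manifest(node):
    """Validate, serialize canonically, and derive a content hash."""
    validate(node)
    # Sorted keys and fixed separators make the serialization deterministic.
    canonical = json.dumps(node, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return {"hash": digest[:12], "expr": canonical}


manifest = build_manifest(expr)
# Rebuilding from the serialized form yields the same hash.
rebuilt = build_manifest(json.loads(manifest["expr"]))
print(manifest["hash"] == rebuilt["hash"])
```

Because the hash depends only on the canonical serialization, the same definition always maps to the same build, and any change to the logic shows up as a different hash.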

This is hugely powerful. It separates three concerns: defining the logic, verifying it in the manifest step, and executing it. The result is a composable data stack, as Wes McKinney called it, with the option of multiple compute engines.

How the DE Language Works: Different Expression Types

Every grammar starts with nouns, and here the noun is the source, a node that holds data but carries no transformation yet. It might be an in-memory table, a registered connection to a warehouse, or just a lazy pointer to a file on disk that hasn’t been read. Sources are simply referenced, the way a noun refers to a thing before any verb acts on it.

The verbs in our language are transforms such as filter, select, mutate, aggregate, join, order, limit. Each one takes a source (or another transformed expression) and returns a new, immutable expression. Nothing upstream is mutated; each verb only describes what should happen next.

Looking at a definition such as .filter(...).aggregate(...).mutate(...), we can see this as a sentence. The moment a verb is applied, the expression stops being a plain noun and becomes a statement, a description of “data plus what should happen to it.” But the sentence isn’t spoken yet, it stays inert, fully composed but unexecuted, until something finally asks it to run. That’s the deferred part of the grammar: writing the sentence and saying it out loud are two different acts.
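The noun, verb, and deferred sentence can be sketched with a toy class in plain Python. The `Expr` class and its methods are hypothetical, invented for this illustration, and are not the Ibis or xorq API; the point is only that each verb returns a new immutable object and nothing runs until `execute()` is called.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Expr:
    """Toy deferred expression: a source (noun) plus a tuple of pending verbs."""

    rows: tuple
    steps: tuple = ()

    def filter(self, predicate):
        # Returns a new expression; the original is left untouched.
        return Expr(self.rows, self.steps + (("filter", predicate),))

    def aggregate(self, column):
        return Expr(self.rows, self.steps + (("sum", column),))

    def execute(self):
        # Only here are the recorded verbs actually applied to the data.
        data = list(self.rows)
        for kind, arg in self.steps:
            if kind == "filter":
                data = [row for row in data if arg(row)]
            elif kind == "sum":
                data = [{arg: sum(row[arg] for row in data)}]
        return data


batting = Expr(rows=(
    {"yearID": 2014, "hits": 150},
    {"yearID": 2015, "hits": 120},
    {"yearID": 2015, "hits": 80},
))

# Composing the sentence does no work yet.
sentence = batting.filter(lambda row: row["yearID"] == 2015).aggregate("hits")
result = sentence.execute()
print(result)
```

The source `batting` still has no steps after the sentence is composed, which is the immutability the grammar relies on.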

There’s a third part of speech worth naming: the template. Instead of writing a sentence about a specific noun, you can write one about a noun’s shape, a schema with no rows behind it. A template says “given something with a column of this type, here is what I’ll do to it,” and only later gets bound to an actual source, at which point the placeholder resolves and it becomes an ordinary statement again.
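A template can be sketched the same way: steps written against a schema, bound to real rows only later. The helper `make_template`, the schema dicts, and the type names are assumptions made up for this example, not an existing API.

```python
def make_template(schema, steps):
    """Return a template: steps written against a schema, with no rows behind it."""

    def bind(source_schema, rows):
        # Binding checks that the source offers the columns and types expected.
        for column, dtype in schema.items():
            if source_schema.get(column) != dtype:
                raise TypeError(f"column {column!r} must be {dtype}")

        def run():
            data = rows
            for step in steps:
                data = step(data)
            return data

        return run  # still deferred: nothing is computed until run() is called

    return bind


recent_hits = make_template(
    schema={"yearID": "int64", "hits": "int64"},
    steps=[
        lambda rows: [row for row in rows if row["yearID"] == 2015],
        lambda rows: sum(row["hits"] for row in rows),
    ],
)

batting_schema = {"playerID": "string", "yearID": "int64", "hits": "int64"}
batting_rows = [
    {"playerID": "a", "yearID": 2015, "hits": 120},
    {"playerID": "b", "yearID": 2014, "hits": 150},
]

statement = recent_hits(batting_schema, batting_rows)  # placeholder resolved
total = statement()
print(total)
```

Binding the template to a source whose column types do not match raises a `TypeError` before anything is computed, which mirrors the idea of validating against a shape rather than against data.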

And we have modifiers that ride alongside a statement without changing what it computes. They’re small tags of metadata that say “this expression also represents a fitted model” or “this is a saved reference to something else.” It’s like a footnote with additional metadata that doesn’t change the surface meaning, but adds context for later use.
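As a rough illustration of a modifier, the toy wrapper below attaches metadata to a deferred computation without touching its result. `Tagged`, its `tag` method, and the tag keys are hypothetical names chosen for this sketch.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Tagged:
    """Toy modifier: wraps a deferred computation and carries metadata."""

    compute: object  # zero-argument callable producing the result
    tags: dict = field(default_factory=dict)

    def tag(self, **metadata):
        # Returns a new wrapper with the extra metadata merged in.
        return Tagged(self.compute, {**self.tags, **metadata})

    def execute(self):
        return self.compute()  # tags are never consulted here


plain = Tagged(lambda: sum([120, 80]))
annotated = plain.tag(kind="fitted_model", ref="hypothetical-entry-name")

print(plain.execute() == annotated.execute())
print(annotated.tags)
```

Both objects compute the same value; only the footnote differs.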

The grammar composes the same way regardless of which engine eventually executes it. There are more parts, but with just these four (noun, verb, template, modifier) you can read and write arbitrarily complex data pipelines, the same way learning a handful of verb-and-object combinations in a text editor lets you compose arbitrarily complex edits.

Avoiding the “Inner-Platform Effect” of Repeated Tools

With this grammar, we no longer re-implement the same logic in every tool. We express and manifest the logic once and reuse it with different execution engines, which is exactly what Ibis and xorq allow. In software engineering, the inner-platform effect names a similar anti-pattern: rebuilding, inside a system, a poorer copy of the platform it already runs on.

Why a Grammar is Really Good for LLMs

Having a grammar is really good for LLMs, too. It helps them first to declare data artifacts and second to execute them reproducibly.

On top of that, expressions are LLM-agnostic: we can swap the LLM we use and keep the expression. The same holds for other artifacts, since a chart, a data catalog, or a metric is just an expression too.

Model Once, Represent Everywhere: Expressing the Full Data Stack with a Single Expression

Like UDA (Unified Data Architecture) from Netflix, we define our expressions once and represent them everywhere. Netflix built UDA to solve duplicated models, inconsistent terminology, and siloed systems, where the same concept like ‘actor’ or ‘movie’ gets modeled differently across teams, with no shared foundation. Their answer was a full knowledge graph with a metamodel, making the conceptual model part of the actual control plane.

Not everyone needs Netflix-scale tooling, though. For a code-first approach, xorq gives you the same core principle: define once, execute anywhere. You write a declarative Ibis expression, serialize it as a content-addressed YAML artifact, and run it against any supported engine, fully reproducibly.

The difference worth noting: UDA is a semantic layer defining what data means across systems. Xorq is a computational layer defining what transformations do across engines. Both reject the same anti-pattern of re-implementing the same logic for every system.

Entering Xorq: The Horizontal Data Architecture

Xorq is an executable memory system for tabular data that works horizontally across your data stack, supporting everything from discovery with a catalog to defining transformation logic to modeling.

It offers declarative transformations in a pandas-like style, and you can build ML pipelines and prepare data with the same semantics in a single stack that is integrated horizontally rather than vertically. This gives your agents a catalog of executable pipelines and turns short-lived agent work, such as wrangling scripts, sklearn pipelines, and ad-hoc tables, into durable, composable, executable artifacts that any future agent or human can discover, reproduce, and reuse.

Vertical siloed stacks bind each tool to its own compute; the horizontal stack defines one grammar of expressions that runs across many engines.

The horizontal data stack shows what Xorq brings to the table. Xorq originated as a git-native semantic layer, meant to make life easier for data analysts who, fresh out of college, build semantic models for a living.

The alternative was point-and-click tools: dragging tables and drawing joins manually, only to add more reporting tools on top to create pixel-perfect reports. It also didn’t scale performance-wise, so cubes were needed to make it faster, which added another layer of complexity.

And there was no lineage from source to dashboard. The question was: “What if we could do this end-to-end data engineering workflow locally?” This is what the horizontal data stack and xorq provide.

To add semantic layer capabilities, Julien Hurault and Hussain built the Boring Semantic Layer + the Xorq catalog, providing a semantic model you define in Python, check into git, and query from the CLI.

Compressing Logic into a Single Executable

Compression of a full data stack into a single executable is hard, but xorq tries exactly this with the help of Ibis, git, uv, and DataFusion.

Xorq’s design choices show even more clearly what it is and what it enables:

Ibis as expression layer (v9.5.0+, partial): Declarative dataframe expressions compiled to multiple backends (xorq supports a subset of the Ibis API, not the full surface)
Git for state and storage: The catalog is a git repo of entries with git-annex support for large files
uv for reproducible environments: Each entry ships with a wheel and pinned requirements.txt.
DataFusion for embedded compute: Pipelines execute in-process with SQL and UDFs

Composable Data Engines

Another big advantage of expressions and a grammar for data engineering is that switching between backends is easy, with no change to the transformation or business logic. You only choose the backend, whether that is Apache Arrow Flight, DuckDB, or any other engine.

We write the definitions and express our tabular data and computations. The engine, in this case xorq, can build it into a manifest file that is deterministic and hashed.
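The idea of one definition and several engines can be sketched with the standard library alone. In this toy example, the dict-based expression and the two runner functions are assumptions for illustration and have nothing to do with how Ibis or xorq compile expressions; one runner evaluates in pure Python, the other compiles to SQL and runs in SQLite.

```python
import sqlite3

ROWS = [(2014, 150), (2015, 120), (2015, 80)]

# Hypothetical declarative expression: filter on a year, then sum a column.
expr = {"table": "batting", "where": ("yearID", 2015), "sum": "hits"}


def run_python(expr, rows):
    """Engine 1: evaluate the expression over in-memory tuples."""
    columns = {"yearID": 0, "hits": 1}
    col, value = expr["where"]
    kept = [row for row in rows if row[columns[col]] == value]
    return sum(row[columns[expr["sum"]]] for row in kept)


def run_sqlite(expr, rows):
    """Engine 2: compile the same expression to SQL and run it in SQLite."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE batting (yearID INTEGER, hits INTEGER)")
    con.executemany("INSERT INTO batting VALUES (?, ?)", rows)
    col, value = expr["where"]
    # Identifiers come from the trusted expression; the value is a bound parameter.
    sql = f'SELECT SUM("{expr["sum"]}") FROM "{expr["table"]}" WHERE "{col}" = ?'
    (total,) = con.execute(sql, (value,)).fetchone()
    con.close()
    return total


results = {"python": run_python(expr, ROWS), "sqlite": run_sqlite(expr, ROWS)}
print(results)
```

The expression never changes; only the function that interprets it does, which is the separation between logic and engine described above.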

Xorq uses Ibis as the expression layer for single-backend logic, then builds the cross-engine expression tree into a serialized YAML artifact. When moving data between backends, xorq transfers Apache Arrow RecordBatch streams between them, so each backend acts as a RecordBatch transducer. No CSV serialization or JSON encoding is needed, which makes backend switching fast and memory-efficient. You write declarative Ibis expressions that run like a tool, and xorq extends Ibis with caching, multi-engine execution, and UDFs.
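The streaming idea can be approximated without Arrow. The sketch below moves a filtered result between two separate in-memory SQLite connections in small batches; lists of tuples stand in for Arrow RecordBatches, and the `batches` helper and table names are hypothetical. It is not how xorq transfers data, only an illustration of batch-wise transfer without an intermediate file format.

```python
import sqlite3


def batches(cursor, size):
    """Yield lists of rows; a stand-in for a stream of Arrow RecordBatches."""
    while True:
        chunk = cursor.fetchmany(size)
        if not chunk:
            return
        yield chunk


# Source engine holding the raw table.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE awards (playerID TEXT, lgID TEXT)")
source.executemany(
    "INSERT INTO awards VALUES (?, ?)",
    [("a", "NL"), ("b", "AL"), ("c", "NL"), ("d", "NL")],
)

# Target engine that will receive the filtered rows.
target = sqlite3.connect(":memory:")
target.execute("CREATE TABLE nl_awards (playerID TEXT, lgID TEXT)")

# Filter in the source engine, then stream the result across batch by batch.
cursor = source.execute("SELECT playerID, lgID FROM awards WHERE lgID = 'NL'")
batch_sizes = []
for batch in batches(cursor, size=2):
    target.executemany("INSERT INTO nl_awards VALUES (?, ?)", batch)
    batch_sizes.append(len(batch))

moved = target.execute("SELECT COUNT(*) FROM nl_awards").fetchone()[0]
print(batch_sizes, moved)
```

Only one batch is held in memory at a time, which is the property that makes this kind of transfer memory-efficient.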

Here’s an example of using DuckDB and Postgres in conjunction:

import xorq.api as xo

# Connect to engines
pg = xo.postgres.connect_env()
db = xo.duckdb.connect()

# Load data from different sources
batting = pg.table("batting")
awards = xo.examples.awards_players.fetch(backend=db)

# Filter in respective engines
recent = batting.filter(batting.yearID == 2015)
nl_awards = awards.filter(awards.lgID == "NL")

# Move data to postgres for join
result = recent.join(
    nl_awards.into_backend(pg),
    ["playerID"]
)

result.execute()

Move data between different engines within a single expression using into_backend(), here Postgres and DuckDB

You can see how easily you can choose the best-suited execution engine for each step: in the above example, Postgres filters the recent batting rows, DuckDB filters the NL (National League) awards, and into_backend() moves the filtered awards into Postgres, where the two are joined.

Engines supported by xorq as of now, with the ability to move data between them, are (check Supported backends for the latest):

Embedded: DataFusion, DuckDB, SQLite, Pandas
Warehouses: Snowflake, Databricks, Trino, Postgres
Lakehouse: PyIceberg
Arrow Flight: GizmoSQL (DuckDB over Arrow Flight SQL)

Cross-Engine Expression Tree

With different engines supported, we can use the same compressed, executable logic across engines. We can build expression graphs before executing them. With one expression and many engines, it looks like this:

expr = penguins.into_backend(xo.sqlite.connect())
expr.ls.backends

The output of building a cross-engine expression is a directory containing your serialized pipeline with a unique hash identifying each build and its artifacts and expressions. When executed, the output is the resulting object or data.

Expressions are the tools, and Arrow is the pipe. A Unix pipe streams text between small programs; xorq pipes Arrow streams between expressions: unix : programs :: xorq : arrow-transforms

That executes like this:

In [6]: expr.to_pyarrow_batches()
Out[6]: <pyarrow.lib.RecordBatchReader at 0x15dc3f570>

This is quite brief and may feel abstract if you have never used it, but we will go into more examples and details in another article.

A Shared Language for Data

This article introduces a new way of describing data transformations for machine learning or data engineering pipelines in a direct and simple way that works locally with any execution engine, without changing the code itself.

It’s a good fit if you need a trusted harness for a data engineering persona: define the logic once and run it on the engine that works best for your workload and data engineering environment.

We looked at how to write -> manifest -> execute with xorq, its advantages, and why you might use it for modeling once and representing everywhere. When we add AI agents to the mix, this helps us pull the right lever: instead of reaching for bigger, more expensive models or more tokens, we improve accuracy with more semantic understanding, through a grammar the model can learn and apply, even manifesting before execution, and run the results deterministically every time. This is a big step up from working only with agentic Skills files, which are free-form Markdown and pull data from all over or are not defined precisely enough. It’s all about having high-quality context in the right format, with a clear definition that humans and AI agents can exchange and use to help each other.

There’s a lot more to come: showcasing the horizontal data stack and the use cases it supports, how building expressions differs from running computations, and how data catalogs fit into the picture.

Check out the xorq code and star it on GitHub, or read more behind the scenes in the Xorq documentation.

They also have a macOS desktop app coming up that does it all in one unified app, geared towards non-technical users. Join the waitlist for that.