How an expression becomes a pipeline
Most pipeline code runs once and disappears into a result. You get a dataframe, but you lose the logic that produced it. Xorq keeps the logic around as a first-class object called an expression.
An expression is a description of a computation. Not the result. Not the code that runs it. The description itself.
This matters because a description can be inspected, hashed, cached, serialized, versioned, and sent to a different machine. The result of running df.groupby("species").mean() in pandas is a dataframe. The result of writing the same thing as a Xorq expression is a graph that describes grouping by species and computing means. You choose when that graph executes, where it executes, and whether to cache the result.
This post walks through what you can do with an expression: write one, execute it, cache intermediate results, move data between engines, compose with functions, build a portable artifact, and run it somewhere else.
You have the iris dataset. You want total sepal width by species, but only for flowers with sepal length above five.
```python
import xorq.api as xo

expr = (
    xo.read_csv("flowers.csv")
    .filter([xo._.sepal_length > 5])
    .group_by("species")
    .agg(xo._.sepal_width.sum())
)
```

At this point, nothing has run. No data has been read. `expr` is a tree of operations: read the iris table, filter rows where sepal length exceeds five, group by species, sum sepal width. Each method call returns a new expression. The original is never modified.
```python
result = expr.execute()
```

Now the work happens. Xorq compiles the expression graph and runs it. The return value is a pandas DataFrame.
Pandas gave you a dataframe. A xorq expression is a description that produces a dataframe. That description can be hashed, cached, and sent to a different engine.
```python
expr.to_parquet("output.parquet")    # write directly to file
batches = expr.to_pyarrow_batches()  # stream Arrow batches
```

Execution is the only step that touches real data. Everything before it is graph construction.
Expressions can include cache boundaries. When the graph executes, cached nodes check whether they already have a stored result. If they do, the upstream computation is skipped entirely.
```python
from xorq.caching import ParquetCache

cached_expr = expr.cache(ParquetCache.from_kwargs())
```

The cache key is derived from the expression graph itself. Same expression, same key. Change a filter predicate and you get a new key, and the cache misses. No manual invalidation. No TTLs to configure by default.[^1]
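The structural keying can be sketched in plain Python. This is an illustration of the idea, not xorq's actual hashing scheme; the nested tuples stand in for an expression graph:

```python
import hashlib


def cache_key(node) -> str:
    """Toy structural cache key: hash a deterministic serialization
    of the graph. A stand-in for xorq's scheme, not its real code."""
    return hashlib.sha256(repr(node).encode()).hexdigest()[:12]


# Two graphs that differ only in the filter predicate.
graph_a = ("group_by", ("filter", ("read_csv", "flowers.csv"), "sepal_length > 5"), "species")
graph_b = ("group_by", ("filter", ("read_csv", "flowers.csv"), "sepal_length > 6"), "species")

assert cache_key(graph_a) == cache_key(graph_a)  # same structure, same key
assert cache_key(graph_a) != cache_key(graph_b)  # changed predicate, new key
```

Because the key is a pure function of the graph, there is nothing to invalidate by hand: editing the expression is the invalidation.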
A common pattern is caching an expensive join or aggregation, then building further transformations on top:
```python
expensive = (
    big_table
    .join(dimension_table, "id")
    .group_by("category")
    .agg(total=xo._.amount.sum())
    .cache(ParquetCache.from_kwargs())
)

# This only recomputes the filter, not the join + aggregation
filtered = expensive.filter(xo._.total > 1000)
result = filtered.execute()
```

Place `.cache()` where it matters and the expression graph handles the rest.
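A toy memoization sketch shows why repeated executions skip the join and aggregation. Everything here (`expensive_agg`, the dict cache, `run_filtered`) is a hypothetical stand-in, not xorq's API:

```python
# Count how often the "expensive" upstream computation actually runs.
calls = {"expensive_agg": 0}
cache = {}


def expensive_agg():
    # Stand-in for the join + aggregation above.
    calls["expensive_agg"] += 1
    return [("a", 1500), ("b", 800)]  # (category, total)


def cached(key, compute):
    if key not in cache:      # cache miss: run the upstream computation
        cache[key] = compute()
    return cache[key]         # cache hit: upstream is skipped entirely


def run_filtered(threshold):
    rows = cached("join+agg", expensive_agg)
    return [r for r in rows if r[1] > threshold]


run_filtered(1000)
run_filtered(500)   # second run reuses the cached aggregation
assert calls["expensive_agg"] == 1
```

Downstream transformations can change freely; only the subgraph above the cache boundary determines whether the stored result is reused.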
A single expression can span multiple backends.
```python
pg = xo.postgres.connect_examples()
datafusion = xo.connect()

batting = (
    pg.table("batting")
    .filter(xo._.yearID == 2015)
    .into_backend(datafusion, "batting")
)
```

`xo.connect()` is xorq's default DataFusion backend, the same one the first example used implicitly when we called `xo.read_csv()`. Here we make it explicit because there are two backends. The filter runs on Postgres, where the data lives; then `into_backend()` pulls the result as Arrow batches into DataFusion. From that point on, DataFusion handles the rest of the graph.
Arrow is the data format that moves between engines. No CSV serialization, no JSON encoding, no ORM translation. Just columnar batches.
`pipe()` applies a function to an expression and returns whatever the function returns.
```python
def add_discount(table):
    return table.mutate(
        discount_value=table.price * table.discount
    )

expr = sales_table.pipe(add_discount)
```

`pipe()` is still lazy. It returns a new expression, not a result. You can chain pipes and place cache boundaries between them, so if `clean` and `add_features` are expensive but stable, you cache their output and only `train_model` re-executes when you iterate on it:
```python
expr = (
    raw_data
    .pipe(clean)
    .pipe(add_features)
    .cache(ParquetCache.from_kwargs())
    .pipe(train_model)
)
```

The Python code that defines your expression depends on your environment: installed packages, import paths, local files, database connections that may or may not be available. A build captures the expression as a self-contained artifact that doesn't depend on any of that.
```shell
xorq build my_pipeline.py -e expr
# → builds/28ecab08754e
```

The 12-character directory name is a hash of the expression structure. Same expression, same hash. The build directory contains everything needed to reconstruct and run the expression on another machine[^2], with no access to the original Python code, environment, or database required:
```
builds/28ecab08754e/
├── expr.yaml            # The computation graph as YAML
├── profiles.yaml        # Database connection profiles
├── expr_metadata.json   # Expression kind and schema
├── build_metadata.json  # Xorq version, build timestamp
└── database_tables/
    └── *.parquet        # Snapshotted input data
```
The expr.yaml is human-readable. You can diff it, review it in a PR, and reconstruct the expression from it alone.
A cache speeds up repeated execution on your machine. A build serves a different purpose: sharing and reproducing. Someone else can verify your computation, run it against new data, or use it as a starting point for their own work — without your code, your packages, or your environment. In practice, sharing happens through the xorq catalog, which versions and organizes builds so teammates and agents can discover and build on each other’s work. We’ll cover the catalog in a companion post. The build includes snapshotted input data as parquet, so if the data was captured at build time, they don’t even need access to the original database.
```shell
xorq run builds/28ecab08754e -o output.parquet
```

Or in Python:
```python
from xorq.ibis_yaml.compiler import load_expr

expr = load_expr("builds/28ecab08754e")
result = expr.execute()
```

The expression you get back is the same lazy graph you started with. You can add more transformations, cache it, or move it to a different backend. The artifact is a starting point, not a dead end.
Because the hash is derived from the expression structure, rebuilding the same expression produces the same hash. If you modify the expression, you get a new hash and a new build. Cache nodes inside the expression still work: if the upstream portion of the graph hasn’t changed, cached results are reused.
```python
# Original
expr_v1 = data.filter(xo._.year == 2015).cache(ParquetCache.from_kwargs())

# Modified: only the filter changed, upstream cache still valid
expr_v2 = data.filter(xo._.year == 2016).cache(ParquetCache.from_kwargs())
```

The lifecycle, end to end:

```
write      filter, join, aggregate, UDF, cache, into_backend
  ↓
  ├→ execute   .execute(), .to_parquet(), .to_pyarrow_batches()
  │
  └→ build     xorq build → content-addressed YAML artifact
       ↓
     share     git push, catalog add, copy the directory
       ↓
     run       xorq run → load from YAML, execute anywhere
       ↓
     iterate   change the expression, get a new hash, reuse cached results
```
Each step is optional. You can write and execute without ever building. You can build without using the catalog. The lifecycle isn’t a mandatory pipeline — it’s a set of capabilities you use when they help.
Expressions are the building block of the xorq system. Everything else — the catalog, caching, multi-engine execution, builds — is built on top of them. Once you have an intuition for how expressions work, the higher-level concepts follow naturally.
[^1]: Xorq does offer a TTL-aware cache for workloads where time-based expiry makes sense, but the default cache is purely structural.

[^2]: Builds snapshot in-memory tables and can bundle local files, but expressions that read from a local path (e.g. a parquet file on disk) require that path to exist on the target machine, or to be loaded as a memtable first so the data is snapshotted into the build.