Xorq in 5 Minutes


Xorq is a Python library that turns data pipeline code into portable, cacheable, versionable artifacts. You write expressions. Xorq handles caching, multi-engine execution, and reproducibility. Why Xorq walks through the real pipeline problems these concepts solve.

Expressions

An expression describes a computation without running it.

import xorq.api as xo
from xorq.caching import ParquetCache

expr = (
    xo.deferred_read_parquet("sales.parquet")
    .filter(xo._.amount > 100)
    .group_by("region")
    .agg(total=xo._.amount.sum())
    .cache(ParquetCache.from_kwargs())
)

result = expr.execute()

Nothing runs until .execute(). The expression is a graph of operations that can be inspected, hashed, and serialized before any data moves.
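Deferred execution is easiest to see in miniature. Here is a toy sketch of the idea, assuming nothing about Xorq's internals: operations are recorded on the expression and only applied when .execute() is called.

```python
# Toy sketch of a deferred expression: operations are recorded,
# not executed, until .execute() is called. Illustrative only --
# this is not Xorq's actual implementation.

class Expr:
    def __init__(self, data, ops=()):
        self.data = data      # source rows
        self.ops = list(ops)  # recorded operations, not yet run

    def filter(self, predicate):
        # Return a new expression with one more recorded step.
        return Expr(self.data, self.ops + [("filter", predicate)])

    def execute(self):
        # Only now does any data move.
        rows = self.data
        for kind, fn in self.ops:
            if kind == "filter":
                rows = [r for r in rows if fn(r)]
        return rows

sales = [{"region": "east", "amount": 250}, {"region": "west", "amount": 40}]
expr = Expr(sales).filter(lambda r: r["amount"] > 100)

# Before execution, the expression is just an inspectable graph:
print(len(expr.ops))   # one recorded filter step, no data touched
print(expr.execute())  # [{'region': 'east', 'amount': 250}]
```

Because the graph exists before execution, it can be hashed and serialized first, which is what the caching and build features below rely on.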

Caching

.cache() marks a point in the graph where results are stored. The cache key is derived from the expression itself. Same expression, same key. Change the expression, the key changes and the cache recomputes automatically.

No naming scheme. No expiration timers. No invalidation logic.
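The mechanism can be sketched in a few lines. This toy version hashes a canonical serialization of an expression graph; Xorq does the equivalent over its real operation graph.

```python
import hashlib
import json

# Toy sketch of deriving a cache key from the expression itself.
# Illustrative only -- Xorq hashes its actual operation graph.

def cache_key(expr_graph: dict) -> str:
    # Canonical serialization -> stable hash. Same expression, same key.
    canonical = json.dumps(expr_graph, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()[:12]

a = {"op": "agg", "by": "region", "source": {"op": "filter", "gt": 100}}
b = {"source": {"gt": 100, "op": "filter"}, "by": "region", "op": "agg"}
c = {"op": "agg", "by": "region", "source": {"op": "filter", "gt": 200}}

assert cache_key(a) == cache_key(b)  # same expression, same key
assert cache_key(a) != cache_key(c)  # changed expression, new key
```

Invalidation falls out of the design: editing the expression changes the key, so the stale entry is simply never looked up again.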

Multi-engine

Expressions can span databases. .into_backend() moves data between engines using Arrow — no temp files, no serialization code.

pg = xo.postgres.connect_env()
duckdb = xo.duckdb.connect()  # or xo.snowflake.connect_env(), xo.databricks.connect_env()
                               # only this line changes — none of the code below

expr = (
    pg.table("users")
    .filter(xo._.active)
    .into_backend(duckdb)
    .group_by("region")
    .agg(count=xo._.id.count())
)

The filter runs in Postgres. The aggregation runs in DuckDB. One expression, two engines. Swap duckdb for Snowflake or Databricks and nothing else changes.
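The shape of that handoff can be sketched with two separate SQLite connections standing in for the two engines. This is only an illustration of the pattern, assuming nothing about Xorq's transport; Xorq itself streams Arrow batches between real backends.

```python
import sqlite3

# Toy sketch of the cross-engine pattern, with two SQLite
# connections as stand-in "engines". Illustrative only.

source = sqlite3.connect(":memory:")  # stand-in for Postgres
target = sqlite3.connect(":memory:")  # stand-in for DuckDB

source.execute("CREATE TABLE users (id INTEGER, region TEXT, active INTEGER)")
source.executemany("INSERT INTO users VALUES (?, ?, ?)",
                   [(1, "east", 1), (2, "east", 0), (3, "west", 1)])

# Step 1: the filter runs in the source engine.
active = source.execute(
    "SELECT id, region FROM users WHERE active = 1").fetchall()

# Step 2: results are handed to the target engine
# (the .into_backend() moment).
target.execute("CREATE TABLE users (id INTEGER, region TEXT)")
target.executemany("INSERT INTO users VALUES (?, ?)", active)

# Step 3: the aggregation runs in the target engine.
counts = dict(target.execute(
    "SELECT region, COUNT(id) FROM users GROUP BY region").fetchall())
print(counts)  # {'east': 1, 'west': 1}
```

The point of the sketch is the division of labor: each engine does the work it is good at, and the intermediate result is the only thing that crosses the boundary.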

Builds

xorq build compiles an expression into a self-contained artifact — a directory containing the computation graph as YAML, connection profiles, and any snapshotted data. The directory name is a hash of the expression. Same expression, same hash.

xorq build pipeline.py -e expr
# builds/28ecab08754e

xorq run builds/28ecab08754e -o output.parquet

You can diff builds, check them into version control, or send them to another machine. xorq run reconstructs the expression from the artifact and executes it.
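Content-addressing is what makes this safe to share. A toy sketch of the round trip, using JSON where Xorq uses YAML and omitting connection profiles and snapshots:

```python
import hashlib
import json
import pathlib
import tempfile

# Toy sketch of a content-addressed build artifact: the directory
# name is a hash of the serialized expression, so identical
# expressions always build to the same path. Illustrative only.

def build(expr_graph: dict, builds_root: pathlib.Path) -> pathlib.Path:
    serialized = json.dumps(expr_graph, sort_keys=True)
    digest = hashlib.sha256(serialized.encode()).hexdigest()[:12]
    out = builds_root / digest
    out.mkdir(parents=True, exist_ok=True)
    (out / "expr.json").write_text(serialized)
    return out

def run(build_dir: pathlib.Path) -> dict:
    # Reconstruct the expression from the artifact alone.
    return json.loads((build_dir / "expr.json").read_text())

root = pathlib.Path(tempfile.mkdtemp())
graph = {"op": "agg", "source": "sales.parquet"}
first = build(graph, root)
second = build(graph, root)

assert first == second      # same expression, same hash, same directory
assert run(first) == graph  # the artifact round-trips
```

Because the artifact is self-describing, "send it to another machine" means copying a directory, not reproducing an environment.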

The catalog

The catalog is a git-backed registry for build artifacts.

xorq catalog add builds/28ecab08754e --alias sales-summary
xorq catalog push

Your teammate pulls:

xorq catalog pull
xorq catalog list-aliases
# sales-summary

Every add, remove, and alias change is a git commit. Versioning, history, and collaboration come from git itself. No external metadata service.

Lineage

Every output column can be traced back through the expression graph to its source.

xorq lineage sales-summary
Lineage for column 'total':
Field:total #1
└── Aggregate #2
    └── Sum #3
        └── Field:amount #4
            └── Filter #5
                └── Read:sales.parquet #6

Lineage is a property of the expression, not a separate system to deploy.
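Since every node in the graph knows its parent, tracing a column is just a walk from output back to source. A toy sketch over a hand-built graph mirroring the output above (illustrative only; Xorq derives this from its real operation graph):

```python
# Toy sketch of lineage as a graph walk: follow parent links
# from the output column back to the source. Illustrative only.

graph = {
    "name": "Field:total",
    "parent": {"name": "Aggregate",
               "parent": {"name": "Sum",
                          "parent": {"name": "Field:amount",
                                     "parent": {"name": "Filter",
                                                "parent": {"name": "Read:sales.parquet",
                                                           "parent": None}}}}},
}

def lineage(node):
    # Collect node names from the output column back to the source.
    out = []
    while node is not None:
        out.append(node["name"])
        node = node["parent"]
    return out

print(lineage(graph))
# ['Field:total', 'Aggregate', 'Sum', 'Field:amount', 'Filter',
#  'Read:sales.parquet']
```

No extra bookkeeping is needed: the trace exists because the expression does.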

Agents

For AI coding agents, the catalog is shared memory across sessions. An agent searches existing entries before computing anything new. If a prior session already answered the question, it’s a cache hit. If not, the agent composes a new expression from existing parts, executes it, and adds the result. The next session builds on that.

session 1: agent computes churn-by-channel → adds to catalog
session 2: same question → cache hit, no recomputation
session 3: related question → extends existing entry

The hash is the handshake. Agents don’t need to know who computed an entry or when. Same expression, same hash, same result.
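That handshake can be sketched in a few lines. The catalog here is a plain dict keyed by expression hash, and the session log is only for illustration; none of this reflects Xorq's actual catalog format.

```python
import hashlib
import json

# Toy sketch of "the hash is the handshake": a shared catalog keyed
# by expression hash, so any session that builds the same expression
# gets a cache hit. Illustrative only.

catalog = {}  # shared across sessions: expression hash -> result

def expr_hash(graph: dict) -> str:
    canonical = json.dumps(graph, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()[:12]

def answer(graph: dict, compute, log: list):
    key = expr_hash(graph)
    if key in catalog:
        log.append("cache hit")  # a prior session already answered
        return catalog[key]
    log.append("computed")       # new work, stored for later sessions
    catalog[key] = compute()
    return catalog[key]

churn = {"op": "agg", "metric": "churn", "by": "channel"}
log = []
answer(churn, lambda: {"email": 0.12}, log)  # session 1: computes
answer(churn, lambda: {"email": 0.12}, log)  # session 2: cache hit
print(log)  # ['computed', 'cache hit']
```

The second session never needs to know the first session existed; matching hashes are the entire protocol.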

What Xorq is not

Not an orchestrator. Xorq doesn’t schedule pipelines or manage retries. Use Airflow, Dagster, or cron for that.

Not a streaming engine. Xorq is batch-oriented. Use Kafka or Flink for real-time event processing.

Not a database. Xorq composes across databases. It doesn’t replace them.

Next

  • Why Xorq — nine pipeline problems and how Xorq solves them
  • Quickstart — install and run your first expression