Xorq is a Python library that turns data pipeline code into portable, cacheable, versionable artifacts. You write expressions; Xorq handles caching, multi-engine execution, and reproducibility. The "Why Xorq" guide walks through the real pipeline problems these concepts solve.
An expression describes a computation without running it.
```python
import xorq.api as xo
from xorq.caching import ParquetCache

expr = (
    xo.deferred_read_parquet("sales.parquet")
    .filter(xo._.amount > 100)
    .group_by("region")
    .agg(total=xo._.amount.sum())
    .cache(ParquetCache.from_kwargs())
)

result = expr.execute()
```

Nothing runs until `.execute()`. The expression is a graph of operations that can be inspected, hashed, and serialized before any data moves.
`.cache()` marks a point in the graph where results are stored. The cache key is derived from the expression itself: same expression, same key. Change the expression and the key changes, so the cache recomputes automatically.
No naming scheme. No expiration timers. No invalidation logic.
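The content-addressed idea can be shown in a few lines. This is an illustrative sketch, not xorq's internal implementation: the key is a hash of the expression's serialized structure, so identical expressions share a key and any change produces a new one.

```python
import hashlib
import json

def cache_key(expr_graph: dict) -> str:
    # Serialize the operation graph deterministically, then hash it.
    canonical = json.dumps(expr_graph, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()[:12]

pipeline = {
    "read": "sales.parquet",
    "filter": {"column": "amount", "op": ">", "value": 100},
    "group_by": ["region"],
    "agg": {"total": "sum(amount)"},
}

key_a = cache_key(pipeline)
key_b = cache_key(pipeline)           # identical expression
pipeline["filter"]["value"] = 200
key_c = cache_key(pipeline)           # changed expression

assert key_a == key_b   # same expression, same key
assert key_a != key_c   # changed expression, new key -> recompute
```

Because the key falls out of the expression, there is nothing to name, expire, or invalidate by hand.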
Expressions can span databases. `.into_backend()` moves data between engines using Arrow — no temp files, no serialization code.
```python
import xorq.api as xo

pg = xo.postgres.connect_env()
duckdb = xo.duckdb.connect()  # or xo.snowflake.connect_env(), xo.databricks.connect_env()
# only this line changes — none of the code below

expr = (
    pg.table("users")
    .filter(xo._.active == True)
    .into_backend(duckdb)
    .group_by("region")
    .agg(count=xo._.id.count())
)
```

The filter runs in Postgres. The aggregation runs in DuckDB. One expression, two engines. Swap `duckdb` for Snowflake or Databricks and nothing else changes.
`xorq build` compiles an expression into a self-contained artifact — a directory containing the computation graph as YAML, connection profiles, and any snapshotted data. The directory name is a hash of the expression. Same expression, same hash.
```shell
xorq build pipeline.py -e expr
# builds/28ecab08754e

xorq run builds/28ecab08754e -o output.parquet
```

You can diff builds, check them into version control, or send them to another machine. `xorq run` reconstructs the expression from the artifact and executes it.
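Content-addressed build directories can be sketched in a few lines of plain Python. This is illustrative only — xorq's actual on-disk format uses YAML graphs and connection profiles — but it shows why rebuilding the same expression is idempotent: the directory name is derived from the expression, so the same input always lands in the same place.

```python
import hashlib
import json
import pathlib
import tempfile

def build(expr_graph: dict, builds_root: pathlib.Path) -> pathlib.Path:
    # Serialize the expression graph, name the build dir after its hash.
    payload = json.dumps(expr_graph, sort_keys=True).encode()
    digest = hashlib.sha256(payload).hexdigest()[:12]
    build_dir = builds_root / digest
    build_dir.mkdir(parents=True, exist_ok=True)
    (build_dir / "expr.json").write_text(payload.decode())
    return build_dir

root = pathlib.Path(tempfile.mkdtemp())
graph = {"read": "sales.parquet", "agg": {"total": "sum(amount)"}}

first = build(graph, root)
second = build(graph, root)   # same expression, same directory
assert first == second
```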
The catalog is a git-backed registry for build artifacts.
```shell
xorq catalog add builds/28ecab08754e --alias sales-summary
xorq catalog push
```

Your teammate pulls:
```shell
xorq catalog pull
xorq catalog list-aliases
# sales-summary
```

Every add, remove, and alias change is a git commit. Versioning, history, and collaboration come from git itself. No external metadata service.
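A file-backed registry makes the "git gives you versioning for free" point concrete. This sketch is illustrative, not xorq's catalog format: each alias is a small file mapping a name to a build hash, so committing the directory to git records every change as history.

```python
import json
import pathlib
import tempfile

class Catalog:
    """Minimal alias registry: one JSON file per alias (illustrative)."""

    def __init__(self, root: pathlib.Path):
        self.root = root
        root.mkdir(parents=True, exist_ok=True)

    def add(self, alias: str, build_hash: str) -> None:
        (self.root / f"{alias}.json").write_text(
            json.dumps({"alias": alias, "build": build_hash})
        )
        # In a git-backed catalog this change would now be committed:
        #   git add <alias>.json && git commit -m "add alias"

    def list_aliases(self):
        return sorted(p.stem for p in self.root.glob("*.json"))

    def resolve(self, alias: str) -> str:
        return json.loads((self.root / f"{alias}.json").read_text())["build"]

cat = Catalog(pathlib.Path(tempfile.mkdtemp()) / "catalog")
cat.add("sales-summary", "28ecab08754e")
assert cat.list_aliases() == ["sales-summary"]
assert cat.resolve("sales-summary") == "28ecab08754e"
```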
Every output column can be traced back through the expression graph to its source.
```shell
xorq lineage sales-summary
```

```
Lineage for column 'total':
Field:total #1
└── Aggregate #2
    └── Sum #3
        └── Field:amount #4
            └── Filter #5
                └── Read:sales.parquet #6
```
Lineage is a property of the expression, not a separate system to deploy.
For AI coding agents, the catalog is shared memory across sessions. An agent searches existing entries before computing anything new. If a prior session already answered the question, it’s a cache hit. If not, the agent composes a new expression from existing parts, executes it, and adds the result. The next session builds on that.
```
session 1: agent computes churn-by-channel → adds to catalog
session 2: same question → cache hit, no recomputation
session 3: related question → extends existing entry
```
The hash is the handshake. Agents don’t need to know who computed an entry or when. Same expression, same hash, same result.
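The check-before-compute loop is simple enough to sketch. This is an illustrative model, not xorq's agent integration: the agent hashes the expression it is about to run, looks that key up in the shared catalog, and only computes on a miss.

```python
import hashlib
import json

catalog = {}  # hash -> result, shared across sessions

def run(expr_graph: dict, compute):
    # The hash is the handshake: same expression, same key, same result.
    key = hashlib.sha256(
        json.dumps(expr_graph, sort_keys=True).encode()
    ).hexdigest()[:12]
    if key in catalog:
        return catalog[key], "cache hit"
    result = compute()
    catalog[key] = result
    return result, "computed"

expr = {"metric": "churn", "by": "channel"}

# Session 1: computes the answer and catalogs it.
_, status1 = run(expr, lambda: {"email": 0.12, "ads": 0.08})
# Session 2: same expression -> cache hit; compute is never called.
_, status2 = run(expr, lambda: 1 / 0)

assert (status1, status2) == ("computed", "cache hit")
```

Neither session needs to know who produced the entry or when; the key alone establishes that the results are interchangeable.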
Not an orchestrator. Xorq doesn’t schedule pipelines or manage retries. Use Airflow, Dagster, or cron for that.
Not a streaming engine. Xorq is batch-oriented. Use Kafka or Flink for real-time event processing.
Not a database. Xorq composes across databases. It doesn’t replace them.