You wrote a pipeline. It works. It produces the right numbers. On your machine.
Your boss asks a question. You can answer it in five minutes if they walk over and look at your screen. Or you can deploy a dashboard and they’ll see it in 30 minutes if your org is fast, a day if it’s not. But the thing that runs on your laptop and the thing a non-technical person can look at are different media. There’s always a translation step — and that step is where things break, get stale, or get lost.
This is familiar to anyone who’s worked with notebooks. The notebook runs on your machine. It depends on your data, your environment, your cached intermediate results. A colleague asks to run it and you spend an hour explaining which cells to run in what order and which paths to change. Your boss asks for a tweak and you either do it live or tell them to wait for the next dashboard refresh. The notebook is a single-player artifact trapped on a single machine.
The industry noticed this problem. Their answer was heavyweight platforms — Domino Data Lab, Hex, Databricks notebooks, Saturn Cloud. The pitch: run your notebook in our managed environment, and collaboration comes for free. In practice, you get a Docker container you didn’t configure, a proprietary caching layer you can’t inspect, and a monthly bill that scales with compute. The collaboration problem gets buried under infrastructure, not solved. Your notebook is still a notebook. It’s just running on someone else’s computer now.
The collaboration problem doesn’t need a platform. It needs two things: a way to make pipeline artifacts portable (so they leave your laptop intact) and a way to give them identity (so they can be found, versioned, and reused). That’s a Python library problem, not an infrastructure problem. Apache Arrow handles the portability — data moves between engines as columnar batches without serialization. A content-addressed cache handles identity — same computation, same hash, automatic deduplication.
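To make the portability half concrete, here is a minimal sketch in plain pyarrow (nothing xorq-specific): a columnar batch round-trips through Arrow's IPC stream format, and the wire layout matches the in-memory layout, so the receiving engine skips row-by-row deserialization.

```python
import pyarrow as pa

# A record batch: the in-memory columnar unit engines exchange.
batch = pa.RecordBatch.from_pydict(
    {"channel": ["email", "ads", "organic"], "churn_rate": [0.12, 0.31, 0.08]}
)

# Write it to Arrow's IPC stream format...
sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, batch.schema) as writer:
    writer.write_batch(batch)

# ...and read it back. The reader maps the same columnar buffers the
# writer produced; no row-by-row serialization happens in between.
reader = pa.ipc.open_stream(sink.getvalue())
assert reader.read_next_batch().equals(batch)
```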
The Xorq catalog is built on top of these primitives. If you’ve read *Xorq in 5 minutes*, you know that Xorq expressions describe computations that can be cached, run across engines, and compiled into portable artifacts. This post is about what happens after that — when the pipeline needs to be shared, versioned, discovered, and reproduced by people and agents who aren’t you.
You computed churn rates by channel. It took two hours to get the joins right and the numbers validated. A week later, a colleague starts the same analysis from scratch. They don’t know your work exists. Or they found a file called `churn_by_channel_v2_final.parquet` on a shared drive but don’t know if it’s current, what logic produced it, or whether it’s the same thing they need.
This happens on every data team. The shared drive fills up with parquet exports, notebooks, and Python scripts with dates, version numbers, and initials in their names. Two analysts independently compute the same metric and get slightly different numbers because they made slightly different assumptions. Neither realizes the other’s work exists.
```
shared-drive/analyses/
  churn_by_channel_v2_final.parquet
  churn_by_channel.ipynb
  churn-analysis-march.parquet
  churn_pipeline_danny.py
  churn-model-danny-fixed/
  user_features_DONT_DELETE.csv
```
Git doesn’t solve this either. You can version the code, but the output — the parquet files, the trained models, the cached results — can’t go in git. So the code lives in git and the artifacts live somewhere else, and the link between them is maintained by convention and memory.
Every xorq build already has an identity: an input-addressed hash computed from the expression and its inputs. Same computation over the same data produces the same hash, regardless of who ran it or when — whether or not the build is ever catalogued.
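The mechanics are easy to picture. Here is an illustrative sketch of input addressing (not xorq's actual hashing code): canonicalize the expression together with the identities of its inputs, then hash the result.

```python
import hashlib
import json

def build_hash(expression_repr: str, input_ids: list[str]) -> str:
    """Illustrative input-addressed hash, not xorq's real scheme:
    the expression plus the identities of its inputs, canonicalized,
    then digested."""
    canonical = json.dumps(
        {"expr": expression_repr, "inputs": sorted(input_ids)},
        sort_keys=True,
    )
    return hashlib.sha256(canonical.encode()).hexdigest()[:12]

# Same computation over the same data -> same hash, whoever runs it.
a = build_hash("churn.group_by('channel').agg(rate='mean')", ["users@snap-01"])
b = build_hash("churn.group_by('channel').agg(rate='mean')", ["users@snap-01"])
assert a == b
```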
The catalog adds two things on top. First, it packages multiple builds into a single git repo — each entry carries its expression graph and connection profiles. Second, it optionally assigns each entry a human-readable alias that points to the hash.
```
xorq catalog add builds/28ecab08754e --alias churn-by-channel
```

Now `churn-by-channel` is an address anyone on the team can use. If your colleague independently writes the same computation, they get the same hash. The catalog tells them it already exists.
```
xorq catalog list-aliases
# churn-by-channel
# sales-summary
# user-features
```

The hash is native to xorq and it’s for machines. The alias is what the catalog adds, and it’s for people.
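Conceptually, the alias layer is just a small mapping stored in the same repo as the entries. A hypothetical sketch (this layout, one file per alias under `aliases/`, is invented for illustration; the real on-disk format may differ):

```python
from pathlib import Path

catalog = Path("data-catalog")   # hypothetical catalog checkout
aliases = catalog / "aliases"    # hypothetical layout: one file per alias
aliases.mkdir(parents=True, exist_ok=True)

# The alias file is nothing more than a name that holds a hash.
(aliases / "churn-by-channel").write_text("28ecab08754e")

def resolve(alias: str) -> str:
    """People pass the alias; machines get back the hash."""
    return (aliases / alias).read_text().strip()

assert resolve("churn-by-channel") == "28ecab08754e"
```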
The weekly report shows different churn numbers than last week. Your manager asks what changed. You check the code — same as last week. You check the data — the upstream table got a backfill on Thursday. Or maybe a colleague tweaked the query and forgot to mention it. Or maybe both happened and the effects cancelled out in some columns but not others.
This is a versioning problem, but not a code versioning problem. Git tracks your Python files. It doesn’t track the artifacts those files produce. The artifact that actually ran in production and the code in your repo can drift apart silently.
Teams try to solve this with naming conventions (v1, v2, v3), with dates in filenames, or with manual changelogs in Confluence. None of these are reliable because they depend on humans remembering to update them.
Every catalog operation — adding an entry, removing one, reassigning an alias — is a git commit. The catalog is a git repository.
```
f3e5caf add alias: churn-by-channel -> 7f4e1a9bc302
a1b2c3d add: 7f4e1a9bc302
9d8e7f6 add: 28ecab08754e (aliases churn-by-channel)
c0a1b2d initial commit
```
This tells you that churn-by-channel originally pointed to 28ecab08754e, then was reassigned to 7f4e1a9bc302. Both artifacts are still in the catalog. You can load either one, compare their expression graphs, and see exactly what changed.
```
xorq catalog log churn-by-channel
f3e5caf21e4b 7f4e1a9bc302 2026-03-16 14:30:00
9d8e7f6a1b2c 28ecab08754e 2026-03-09 10:15:00
```
This isn’t a changelog somebody wrote. It’s the actual history of what was published and when.
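That property comes straight from the mechanism: every mutation writes files and commits them in one step, so the git history is the audit log. A sketch with plain git through subprocess, reusing the hypothetical layout from the earlier sketch (illustrative, not xorq's internals):

```python
import subprocess
from pathlib import Path

def catalog_add(repo: Path, entry_hash: str, alias: str | None = None) -> None:
    """Illustrative catalog mutation: write the entry (and optionally an
    alias file pointing at it), then record everything in one commit."""
    entry_dir = repo / "entries" / entry_hash
    entry_dir.mkdir(parents=True, exist_ok=True)
    (entry_dir / "metadata.json").write_text("{}")  # placeholder metadata
    message = f"add: {entry_hash}"
    if alias:
        alias_dir = repo / "aliases"
        alias_dir.mkdir(exist_ok=True)
        (alias_dir / alias).write_text(entry_hash)
        message += f" (aliases {alias})"
    # One commit per mutation: the entry and its alias can't drift apart.
    subprocess.run(["git", "-C", str(repo), "add", "-A"], check=True)
    subprocess.run(["git", "-C", str(repo), "commit", "-m", message], check=True)
```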
You built something useful — a feature engineering pipeline, a cleaned dataset, a churn model. A colleague on another team wants to use it. Now begins the handoff.
You send them the notebook. They can’t run it because they don’t have the same Postgres credentials. You send them a parquet export. They ask which version is current. You put it on S3 and tell them the path. Next week, you update the pipeline and forget to update the file. They’re working off stale results and neither of you knows. Someone suggests a model registry. Now you’re evaluating MLflow, writing migration scripts, and managing a service that needs to stay running.
The pipeline was 20 lines of Python. Sharing it shouldn’t require a new service.
The catalog is a git repo. Sharing it uses the same tools you already use for code.
```
xorq catalog push
```

Your colleague pulls:

```
xorq catalog pull
```

They get every entry, every alias, the full history. If they don’t have the catalog yet:

```
xorq catalog clone git@github.com:your-org/data-catalog.git
```

The same access controls you already use — SSH keys, GitHub permissions, deploy tokens — work for the catalog. No new service to deploy. No new credentials to manage.
Teams that get far enough build some kind of registry — a database tracking which models are deployed, which datasets exist, which pipelines are active. Over time it drifts. Someone deletes a file but the registry row stays. Someone adds a model but forgets to register it. Eventually people stop trusting the registry and go back to checking the filesystem directly.
The catalog stores entries, aliases, metadata, and the index in one place — a git repo. Every mutation is a single atomic commit. There’s no separate database that can drift.
```
xorq catalog check
# OK
```

This validates that every alias points to a real entry, every entry has both an archive and a metadata file, and the index matches the filesystem. In normal operation, inconsistency can’t happen, because entries and their metadata are committed together.
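The check itself reduces to a few invariants over the repo's contents. A hedged sketch, assuming the same hypothetical layout as the earlier sketches (`entries/<hash>/` holding an archive plus metadata, `aliases/<name>` holding a hash):

```python
from pathlib import Path

def check_catalog(repo: Path) -> list[str]:
    """Illustrative integrity check over a hypothetical catalog layout."""
    problems = []
    entries = {p.name for p in (repo / "entries").iterdir() if p.is_dir()}
    # Every alias must point at an entry that actually exists.
    for alias in (repo / "aliases").iterdir():
        target = alias.read_text().strip()
        if target not in entries:
            problems.append(f"alias {alias.name} -> missing entry {target}")
    # Every entry must carry both its archive and its metadata.
    for entry in entries:
        for required in ("archive.zip", "metadata.json"):
            if not (repo / "entries" / entry / required).exists():
                problems.append(f"entry {entry} missing {required}")
    return problems

issues = check_catalog(Path("data-catalog"))
print("OK" if not issues else "\n".join(issues))
```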
**Access control.** The catalog uses git permissions. Per-entry access control requires splitting entries across multiple catalogs or managing it at the git hosting level.

**Large binary storage.** Catalog entries are zip files in git. For builds with large snapshotted datasets, you’ll hit git’s performance limits. Git LFS or an external blob store may be needed.

**Real-time synchronization.** The catalog syncs via git push and git pull. Two people adding entries simultaneously will need to pull and merge.

**Schema evolution.** The catalog versions artifacts, not schemas. If your expression’s output schema changes between builds, both builds are stored, but the catalog doesn’t provide migration tools or compatibility checks.
| Problem | What happens today | With the catalog |
|---|---|---|
| Work has no address | Files on shared drives with invented names | Content-addressed hash + human-readable alias |
| Numbers changed, nobody knows why | Check git for code, hope the artifact matches | Every catalog operation is a git commit |
| Sharing means IT support | S3 buckets, model registries, deployment pipelines | `git push` / `git pull` |
| People recompute existing work | Search Slack, ask around, give up and redo it | `xorq catalog list-aliases` |
| Production doesn’t match testing | Rebuild per environment, hope for consistency | Alias reassignment, same hash everywhere |
| Can’t reproduce last week’s numbers | Reconstruct environment from logs | Load the build artifact, re-execute |
| Metadata drifts from reality | Registry and artifacts diverge | Atomic git commits, `xorq catalog check` |