Why the Catalog


Draft — this post is a work in progress and has not been published yet.

The gap between your laptop and everyone else

You wrote a pipeline. It works. It produces the right numbers. On your machine.

Your boss asks a question. You can answer it in five minutes if they walk over and look at your screen. Or you can deploy a dashboard and they’ll see it in 30 minutes if your org is fast, a day if it’s not. But the thing that runs on your laptop and the thing a non-technical person can look at are different media. There’s always a translation step — and that step is where things break, get stale, or get lost.

This is familiar to anyone who’s worked with notebooks. The notebook runs on your machine. It depends on your data, your environment, your cached intermediate results. A colleague asks to run it and you spend an hour explaining which cells to run in what order and which paths to change. Your boss asks for a tweak and you either do it live or tell them to wait for the next dashboard refresh. The notebook is a single-player artifact trapped on a single machine.

The industry noticed this problem. Their answer was heavyweight platforms — Domino Data Lab, Hex, Databricks notebooks, Saturn Cloud. The pitch: run your notebook in our managed environment, and collaboration comes for free. In practice, you get a Docker container you didn’t configure, a proprietary caching layer you can’t inspect, and a monthly bill that scales with compute. The collaboration problem gets buried under infrastructure, not solved. Your notebook is still a notebook. It’s just running on someone else’s computer now.

The collaboration problem doesn’t need a platform. It needs two things: a way to make pipeline artifacts portable (so they leave your laptop intact) and a way to give them identity (so they can be found, versioned, and reused). That’s a Python library problem, not an infrastructure problem. Apache Arrow handles the portability — data moves between engines as columnar batches without serialization. A content-addressed cache handles identity — same computation, same hash, automatic deduplication.
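The content-addressing idea can be sketched in a few lines of plain Python (a toy illustration, not xorq's actual hashing scheme): hash a canonical serialization of the computation, and identical computations collapse to the same address.

```python
import hashlib
import json

def computation_hash(expr: dict) -> str:
    """Toy content-addressed identity: hash a canonical serialization
    of a computation description. (Illustrative only; xorq's real hash
    is computed from the expression graph and its inputs.)"""
    canonical = json.dumps(expr, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()[:12]

# Two people writing the same computation get the same address,
# regardless of how they happened to spell it out.
a = computation_hash({"op": "group_by", "table": "events", "keys": ["channel"]})
b = computation_hash({"keys": ["channel"], "op": "group_by", "table": "events"})
assert a == b

# A different computation gets a different address.
c = computation_hash({"op": "group_by", "table": "events", "keys": ["region"]})
assert a != c
```

The same property is what makes deduplication automatic: there is no registration step to forget, because identity falls out of the computation itself.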

The Xorq catalog is built on top of these primitives. If you’ve read Xorq in 5 minutes, you know that Xorq expressions describe computations that can be cached, run across engines, and compiled into portable artifacts. This post is about what happens after that — when the pipeline needs to be shared, versioned, discovered, and reproduced by people and agents who aren’t you.


Problem 1: Your work has no address

The problem

You computed churn rates by channel. It took two hours to get the joins right and the numbers validated. A week later, a colleague starts the same analysis from scratch. They don’t know your work exists. Or they found a file called churn_by_channel_v2_final.parquet on a shared drive but don’t know if it’s current, what logic produced it, or whether it’s the same thing they need.

This happens on every data team. The shared drive fills up with parquet exports, notebooks, and Python scripts with dates, version numbers, and initials in their names. Two analysts independently compute the same metric and get slightly different numbers because they made slightly different assumptions. Neither realizes the other’s work exists.

shared-drive/analyses/
  churn_by_channel_v2_final.parquet
  churn_by_channel.ipynb
  churn-analysis-march.parquet
  churn_pipeline_danny.py
  churn-model-danny-fixed/
  user_features_DONT_DELETE.csv

Git doesn’t solve this either. You can version the code, but the output — the parquet files, the trained models, the cached results — can’t go in git. So the code lives in git and the artifacts live somewhere else, and the link between them is maintained by convention and memory.

How the catalog makes your work findable

Every xorq build already has an identity: an input-addressed hash computed from the expression and its inputs. Same computation over the same data produces the same hash, regardless of who ran it or when — whether or not the build is ever catalogued.

The catalog adds two things on top. First, it packages multiple builds into a single git repo — each entry carries its expression graph and connection profiles. Second, it optionally assigns each entry a human-readable alias that points to the hash.

xorq catalog add builds/28ecab08754e --alias churn-by-channel

Now churn-by-channel is an address anyone on the team can use. If your colleague independently writes the same computation, they get the same hash. The catalog tells them it already exists.

xorq catalog list-aliases
# churn-by-channel
# sales-summary
# user-features

The hash is native to xorq and it’s for machines. The alias is what the catalog adds, and it’s for people.
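What the alias layer buys you can be sketched with a toy registry (hypothetical code, not xorq's implementation): hashes underneath for machines, names on top for people, and duplicate work flagged at add time.

```python
class ToyCatalog:
    """Minimal sketch of the catalog's two-layer naming.
    (Hypothetical illustration, not xorq's actual data model.)"""

    def __init__(self):
        self.entries = {}   # build hash -> entry metadata
        self.aliases = {}   # human-readable name -> build hash

    def add(self, build_hash, alias=None):
        if build_hash in self.entries:
            # Same computation, same hash: the work already exists.
            return f"exists: {build_hash} (alias: {self.aliases_for(build_hash)})"
        self.entries[build_hash] = {}
        if alias:
            self.aliases[alias] = build_hash
        return f"added: {build_hash}"

    def aliases_for(self, build_hash):
        return [a for a, h in self.aliases.items() if h == build_hash]

cat = ToyCatalog()
cat.add("28ecab08754e", alias="churn-by-channel")
# A colleague who independently produces the same build is told it exists:
print(cat.add("28ecab08754e"))
# → exists: 28ecab08754e (alias: ['churn-by-channel'])
```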


Problem 2: The numbers changed and nobody knows why

The problem

The weekly report shows different churn numbers than last week. Your manager asks what changed. You check the code — same as last week. You check the data — the upstream table got a backfill on Thursday. Or maybe a colleague tweaked the query and forgot to mention it. Or maybe both happened and the effects cancelled out in some columns but not others.

This is a versioning problem, but not a code versioning problem. Git tracks your Python files. It doesn’t track the artifacts those files produce. The artifact that actually ran in production and the code in your repo can drift apart silently.

Teams try to solve this with naming conventions (v1, v2, v3), with dates in filenames, or with manual changelogs in Confluence. None of these are reliable because they depend on humans remembering to update them.

How the catalog makes artifact changes traceable

Every catalog operation — adding an entry, removing one, reassigning an alias — is a git commit. The catalog is a git repository.

f3e5caf  add alias: churn-by-channel -> 7f4e1a9bc302
a1b2c3d  add: 7f4e1a9bc302
9d8e7f6  add: 28ecab08754e (aliases churn-by-channel)
c0a1b2d  initial commit

This tells you that churn-by-channel originally pointed to 28ecab08754e, then was reassigned to 7f4e1a9bc302. Both artifacts are still in the catalog. You can load either one, compare their expression graphs, and see exactly what changed.

xorq catalog log churn-by-channel
f3e5caf21e4b  7f4e1a9bc302  2026-03-16 14:30:00
9d8e7f6a1b2c  28ecab08754e  2026-03-09 10:15:00

This isn’t a changelog somebody wrote. It’s the actual history of what was published and when.
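Conceptually, that history is an append-only log that can be replayed. The sketch below models it with plain Python records (in the real catalog, each record is an actual git commit; the timestamps and operation names here are invented for illustration):

```python
# Append-only log of catalog operations, mirroring the git history above.
# Each record: (commit, operation, alias, build hash, timestamp).
log = [
    ("c0a1b2d", "init",  None,               None,           "2026-03-01 09:00:00"),
    ("9d8e7f6", "add",   "churn-by-channel", "28ecab08754e", "2026-03-09 10:15:00"),
    ("a1b2c3d", "add",   None,               "7f4e1a9bc302", "2026-03-16 14:29:00"),
    ("f3e5caf", "alias", "churn-by-channel", "7f4e1a9bc302", "2026-03-16 14:30:00"),
]

def alias_history(alias):
    """Replay the log to recover every hash an alias has pointed to --
    roughly what `xorq catalog log <alias>` reads out of git."""
    return [(commit, build, ts)
            for commit, op, name, build, ts in log
            if name == alias]

# Newest first, as in the CLI output above.
for commit, build, ts in reversed(alias_history("churn-by-channel")):
    print(commit, build, ts)
```

Because the log is append-only, reassigning an alias never loses the old target: both builds stay addressable.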


Problem 3: Sharing your work means becoming IT support

The problem

You built something useful — a feature engineering pipeline, a cleaned dataset, a churn model. A colleague on another team wants to use it. Now begins the handoff.

You send them the notebook. They can’t run it because they don’t have the same Postgres credentials. You send them a parquet export. They ask which version is current. You put it on S3 and tell them the path. Next week, you update the pipeline and forget to update the file. They’re working off stale results and neither of you knows. Someone suggests a model registry. Now you’re evaluating MLflow, writing migration scripts, and managing a service that needs to stay running.

The pipeline was 20 lines of Python. Sharing it shouldn’t require a new service.

How the catalog replaces you with a git pull

The catalog is a git repo. Sharing it uses the same tools you already use for code.

xorq catalog push

Your colleague pulls:

xorq catalog pull

They get every entry, every alias, the full history. If they don’t have the catalog yet:

xorq catalog clone git@github.com:your-org/data-catalog.git

The same access controls you already use — SSH keys, GitHub permissions, deploy tokens — work for the catalog. No new service to deploy. No new credentials to manage.

Problem 4: Your metadata system drifts from reality

The problem

Teams that get far enough build some kind of registry — a database tracking which models are deployed, which datasets exist, which pipelines are active. Over time it drifts. Someone deletes a file but the registry row stays. Someone adds a model but forgets to register it. Eventually people stop trusting the registry and go back to checking the filesystem directly.

How the catalog ties metadata to the computation itself

The catalog stores entries, aliases, metadata, and the index in one place — a git repo. Every mutation is a single atomic commit. There’s no separate database that can drift.

xorq catalog check
# OK

This validates that every alias points to a real entry, every entry has both an archive and a metadata file, and the index matches the filesystem. In normal operation, inconsistency can’t happen — because entries and their metadata are committed together.
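As a sketch, the invariants that check enforces look roughly like this (toy data structures standing in for the real repo layout; not xorq's implementation):

```python
def check_catalog(entries, aliases, index):
    """Toy version of the invariants `xorq catalog check` validates:
    - every alias points to an existing entry
    - every entry has both an archive and a metadata file
    - the index matches the set of entries on disk
    (Sketch only; the real command inspects the git repo itself.)"""
    problems = []
    for name, h in aliases.items():
        if h not in entries:
            problems.append(f"dangling alias: {name} -> {h}")
    for h, files in entries.items():
        if not {"archive", "metadata"} <= files:
            problems.append(f"incomplete entry: {h}")
    if set(index) != set(entries):
        problems.append("index out of sync with entries")
    return problems or ["OK"]

entries = {"28ecab08754e": {"archive", "metadata"}}
aliases = {"churn-by-channel": "28ecab08754e"}
print(check_catalog(entries, aliases, index=["28ecab08754e"]))
# → ['OK']
```

The reason the check almost always says OK is the point: because every mutation lands as one commit touching entry, metadata, and index together, there is no window in which they can disagree.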


Problems the catalog does not solve

Access control. The catalog uses git permissions. Per-entry access control requires splitting entries across multiple catalogs or managing at the git hosting level.

Large binary storage. Catalog entries are zip files in git. For builds with large snapshotted datasets, you’ll hit git’s performance limits. Git LFS or an external blob store may be needed.

Real-time synchronization. The catalog syncs via git push and git pull. Two people adding entries simultaneously will need to pull and merge.

Schema evolution. The catalog versions artifacts, not schemas. If your expression’s output schema changes between builds, both builds are stored, but the catalog doesn’t provide migration tools or compatibility checks.


Summary

| Problem | What happens today | With the catalog |
|---|---|---|
| Work has no address | Files on shared drives with invented names | Content-addressed hash + human-readable alias |
| Numbers changed, nobody knows why | Check git for code, hope the artifact matches | Every catalog operation is a git commit |
| Sharing means IT support | S3 buckets, model registries, deployment pipelines | git push / git pull |
| People recompute existing work | Search Slack, ask around, give up and redo it | xorq catalog list-aliases |
| Production doesn't match testing | Rebuild per environment, hope for consistency | Alias reassignment, same hash everywhere |
| Can't reproduce last week's numbers | Reconstruct environment from logs | Load the build artifact, re-execute |
| Metadata drifts from reality | Registry and artifacts diverge | Atomic git commits, xorq catalog check |