xorq to Earth: Introducing Declarative, Multi-Engine Pipelines

Hussain Sultan

March 27, 2025

Today, we’re excited to announce xorq (https://github.com/xorq-labs/xorq), an open source computational framework that greatly simplifies building and deploying multi-engine ML pipelines with first-class support for pandas.

xorq wraps your data processing logic with execution, planning and production deployment capabilities so that you can focus on the data and iterate faster. It’s similar to Snowpark, except that it works across multiple query engines.

Why we developed xorq

Before founding xorq, we were data scientists building complex ML pipelines at leading tech companies. We were frustrated by the brittleness of bespoke tooling required to fluidly develop and deploy ML pipelines:

  • Having to juggle separate SQL jobs, pandas scripts, and ML-framework-specific transformations
  • Painfully slow iteration process - every small change meant a full pipeline re-run
  • Moving a working pipeline from local dev to production often involved rewriting it as a separate microservice
  • Manually caching intermediate results in order to experiment
  • Disaggregating infrastructure to scale preprocessing

We wanted a more ergonomic way to build, cache, and serve pipelines—without locking ourselves into a single engine. And it was with that idea that Dan, Daniel, and I founded xorq on a mission to help Python developers make a quantum leap in ML development and innovation.

xorq is pronounced “zork”

We’ll tell the story behind the name xorq on another day, but for now, just know that it’s from out of this world, and we pronounce it “zork.”

You know what else is out of this world? GitHub Stars (🙏).

What makes xorq unique

  • xorq is a multi-engine system: Seamlessly move data between different query engines, allowing you to leverage the strengths of each engine while maintaining a unified workflow — powered by Arrow and Arrow Flight.
  • xorq provides deferred and composable primitives to express complex data processing logic.
  • Portable Python UDFs: xorq features a portable DataFusion-backed UDF engine with first-class support for pandas dataframes. Under the hood, Ibis pushes your data access logic to the target backend for processing and handles the translation of backend-specific features.
  • Built-in caching: We reuse previous results if nothing changes, saving time and costs and eliminating the risk of mistaken pipeline lineage caused by manual caching missteps.
  • One pipeline definition: You can export your full pipeline to YAML for version control, static analysis, and (coming soon) push-button deployment.
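
The deferred-execution and caching ideas in the list above can be illustrated with a small, stdlib-only sketch. This is not xorq's implementation or API: `Deferred`, `cache_key`, `execute`, and the `OPS` table are hypothetical names made up for this example, and the in-memory dict stands in for whatever storage a real cache would use. The point is only that a step is first described as data, a key is derived from that description, and work is redone only when the description changes.

```python
import hashlib
import json


# Hypothetical illustration only; none of these names come from xorq.
class Deferred:
    """A pipeline step that records what to do instead of doing it."""

    def __init__(self, op, args, parents=()):
        self.op = op                   # name of the operation
        self.args = args               # JSON-serializable arguments
        self.parents = tuple(parents)  # upstream Deferred steps

    def describe(self):
        # Nested, deterministic description of this step and everything upstream.
        return {
            "op": self.op,
            "args": self.args,
            "parents": [p.describe() for p in self.parents],
        }

    def cache_key(self):
        # Identical descriptions always hash to the same key.
        payload = json.dumps(self.describe(), sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()


OPS = {
    "source": lambda args, inputs: list(args["rows"]),
    "filter_gt": lambda args, inputs: [x for x in inputs[0] if x > args["min"]],
    "scale": lambda args, inputs: [x * args["factor"] for x in inputs[0]],
}

CACHE = {}    # cache_key -> result
RUN_LOG = []  # names of the steps that actually ran


def execute(step):
    key = step.cache_key()
    if key in CACHE:
        return CACHE[key]  # nothing changed upstream, so reuse the result
    inputs = [execute(p) for p in step.parents]
    RUN_LOG.append(step.op)
    result = OPS[step.op](step.args, inputs)
    CACHE[key] = result
    return result


source = Deferred("source", {"rows": [1, 5, 10]})
filtered = Deferred("filter_gt", {"min": 3}, [source])
scaled = Deferred("scale", {"factor": 2}, [filtered])

first = execute(scaled)   # runs source, filter_gt, and scale
second = execute(scaled)  # served from the cache; nothing reruns

# Changing only the last step reuses the cached upstream results.
rescaled = Deferred("scale", {"factor": 3}, [filtered])
third = execute(rescaled)
```

Because the key is computed from the step's description rather than chosen by hand, a stale result cannot be picked up by accident after a definition changes, which is the kind of manual caching mistake the bullet on built-in caching refers to.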

Who xorq is for

xorq is for Python developers who are building increasingly heterogeneous ML pipelines. It is especially valuable to people who require the ability to:

As such, xorq is well-suited to ML and AI use cases such as:

  • Fraud, marketing, and risk modeling in financial services (e.g., with XGBoost)
  • Model lifecycle and experiment management
  • Ad-hoc multi-engine augmentation for SQL-like operations from Python
  • RAG/LLM pipelines

Installation

Get started easily with:

pip install xorq

Writing your first UDxF

Here's a simple example demonstrating xorq's ease-of-use:

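import xorq as xo
import xorq.vendor.ibis.expr.datatypes as dt


# Declare a pandas UDF: it receives a DataFrame with the columns named in
# `schema` and returns one boolean per row.
@xo.udf.make_pandas_udf(
    schema=xo.schema({"title": str, "url": str}),
    return_type=dt.bool,
    name="url_in_title",
)
def url_in_title(df):
    # Fall back to an empty string when a value is None.
    return df.apply(
        lambda s: (s.url or "") in (s.title or ""),
        axis=1,
    )


con = xo.connect()

# Describe a deferred read of the example parquet file and add the UDF
# result as a new column.
name = "hn-data-small.parquet"
expr = xo.deferred_read_parquet(
    con,
    xo.options.pins.get_path(name),
    name,
).mutate(**{"url_in_title": url_in_title.on_expr})

# Nothing runs until execute() is called; head() previews the result.
expr.execute().head()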

Let us know what you think

We really look forward to your feedback on the new release of xorq. Here are some resources to help you get started:

What’s next

We're targeting our V1 release for June. Between now and then, our roadmap includes several key enhancements:

  1. Remote, Apache Arrow-native cache for collaboration - Enabling seamless teamwork with distributed caching capabilities
  2. Push-button deployment via Flight endpoints - Serve cached artifacts through Arrow Flight endpoints for instant production deployment
  3. Integration with package managers - Seamless compatibility with uv and/or pixi for reproducible, deterministic execution environments
  4. Enhanced observability and logging - Comprehensive monitoring and debugging tools to track pipelines

Let us know what you think!

FAQs

What does it mean to be pandas-style?

pandas-style: users can define functions that receive a pandas DataFrame and return a Python object castable to a pyarrow object, as opposed to functions that must receive and return pyarrow objects.

What does sklearn-style mean?

sklearn-style: users can either create deferred pipelines directly referencing scikit-learn classes (which conform to the fit-transform / fit-predict API) or create their own deferred operations by providing the fit and transform/predict methods. See https://scikit-learn.org/stable/developers/develop.html#apis-of-scikit-learn-objects
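
As a rough illustration of the fit/transform contract described above, the following stdlib-only sketch defines a class with `fit` and `transform` methods and a wrapper that postpones both calls until a result is requested. `MeanCenterer`, `DeferredFitTransform`, and its `execute` method are hypothetical names chosen for this example; they are not xorq or scikit-learn APIs, and the sketch makes no claim about how xorq defers these operations internally.

```python
from statistics import fmean


class MeanCenterer:
    """Duck-typed transformer following the fit/transform convention."""

    def fit(self, X):
        # Learn state from the training data and store it on the instance.
        self.mean_ = fmean(X)
        return self  # returning self allows chaining, as scikit-learn does

    def transform(self, X):
        # Apply the learned state to new data.
        return [x - self.mean_ for x in X]


class DeferredFitTransform:
    """Records a fit on `train` and a transform on `new` without running either."""

    def __init__(self, estimator, train, new):
        self.estimator = estimator
        self.train = train
        self.new = new

    def execute(self):
        # Fit and transform happen only here, when a result is requested.
        fitted = self.estimator.fit(self.train)
        return fitted.transform(self.new)


step = DeferredFitTransform(MeanCenterer(), train=[1.0, 2.0, 3.0], new=[2.0, 4.0])

# Building the step does not fit anything: no learned state exists yet.
fitted_before = hasattr(step.estimator, "mean_")

centered = step.execute()
```

Any object exposing the same two methods could be dropped into the wrapper, which is what makes the convention convenient for user-defined operations.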

Free xorq Training

Spend 30 minutes with xorq engineering to get on the fast path to better ML engineering.
