xorq to Earth: Introducing Declarative, Multi-Engine Pipelines

Today, we’re excited to announce xorq (https://github.com/xorq-labs/xorq), an open-source computational framework that simplifies building and deploying multi-engine ML pipelines, with first-class support for pandas.

xorq wraps your data processing logic with execution, planning and production deployment capabilities so that you can focus on the data and iterate faster. It’s similar to Snowpark, except that it works across multiple query engines.


Why we developed xorq

Before founding xorq, we were data scientists building complex ML pipelines at leading tech companies. We were frustrated by the brittleness of bespoke tooling required to fluidly develop and deploy ML pipelines:

Juggling separate SQL jobs, pandas scripts, and ML-framework-specific transformations
Painfully slow iteration: every small change meant a full pipeline re-run
Rewriting a working pipeline as a separate microservice to move it from local dev to production
Manually caching intermediate results to experiment
Disaggregating infrastructure to scale preprocessing

We wanted a more ergonomic way to build, cache, and serve pipelines without locking ourselves into a single engine. With that idea in mind, Dan, Daniel, and I founded xorq on a mission to help Python developers make a quantum leap in ML development and innovation.

xorq is pronounced “zork”

We’ll tell the story behind the name xorq on another day, but for now, just know that it’s from out of this world, and we pronounce it “zork.”

You know what else is out of this world? GitHub stars (🙏). You can star the project at https://github.com/xorq-labs/xorq.

What makes xorq unique

Multi-engine: Seamlessly move data between different query engines, powered by Arrow and Arrow Flight, so you can leverage the strengths of each engine while maintaining a unified workflow.
Deferred and composable: xorq provides deferred, composable primitives for expressing complex data processing logic.
Portable Python UDFs: xorq features a portable DataFusion-backed UDF engine with first-class support for pandas DataFrames. Under the hood, Ibis pushes your data access logic down to the target backend for processing and handles the translation, including the differences between backend-specific features.
Built-in caching: xorq reuses previous results if nothing has changed, saving time and cost and avoiding the lineage mistakes that manual caching invites.
One pipeline definition: You can export your full pipeline to YAML for version control, static analysis, and (coming soon) push-button deployment.
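To make the ideas of deferred steps and definition-based caching concrete, here is a minimal sketch in plain Python. It is not xorq's implementation or API: the Deferred class, its methods, and the dict used as a cache are all hypothetical, and the cache key covers only step names and parameters (not function bodies or input data).

```python
import hashlib
import json


class Deferred:
    """Hypothetical deferred step: records what to compute without running it."""

    def __init__(self, name, func, parent=None, **params):
        self.name = name
        self.func = func
        self.parent = parent
        self.params = params

    def key(self):
        # The cache key is a hash of this step's definition plus its parent's key,
        # so changing any upstream parameter changes every downstream key.
        payload = json.dumps(
            {
                "name": self.name,
                "params": self.params,
                "parent": self.parent.key() if self.parent else None,
            },
            sort_keys=True,
        )
        return hashlib.sha256(payload.encode()).hexdigest()

    def pipe(self, name, func, **params):
        # Compose a new deferred step on top of this one.
        return Deferred(name, func, parent=self, **params)

    def execute(self, cache):
        k = self.key()
        if k in cache:  # nothing changed: reuse the previous result
            return cache[k]
        upstream = self.parent.execute(cache) if self.parent else None
        cache[k] = self.func(upstream, **self.params)
        return cache[k]


calls = []  # records which steps actually ran


def load(_, n):
    calls.append("load")
    return list(range(n))


def scale(rows, factor):
    calls.append("scale")
    return [r * factor for r in rows]


cache = {}
expr = Deferred("load", load, n=4).pipe("scale", scale, factor=10)
first = expr.execute(cache)   # runs load and scale
second = expr.execute(cache)  # served entirely from the cache
# Changing only the last step reuses the cached load result.
changed = Deferred("load", load, n=4).pipe("scale", scale, factor=3).execute(cache)
print(calls)
```

In this toy version, the second execution runs nothing, and changing only the scale factor reruns scale while reusing the cached load step.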


Who xorq is for

xorq is for Python developers who are building increasingly heterogeneous ML pipelines. It is especially valuable to people who need to:

Update or query data stored across multiple data engines.
Perform processing that is not supported by the engine in which the data is stored (e.g., applying AsOf processing to the results of Trino queries).
Frequently change and rerun pipelines through sklearn-style interfaces.
Abstract and compose pipeline stages from other teams and users.
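For readers unfamiliar with the AsOf processing mentioned in the list above: it commonly refers to an as-of join, which matches each row to the most recent row in another table at or before its timestamp. Here is a minimal sketch using pandas directly (not xorq's API), with made-up trade and quote data:

```python
import pandas as pd

# Made-up sample data; both frames must be sorted by the join key.
trades = pd.DataFrame(
    {
        "ts": pd.to_datetime(["2024-01-01 09:00:01", "2024-01-01 09:00:05"]),
        "qty": [10, 20],
    }
)
quotes = pd.DataFrame(
    {
        "ts": pd.to_datetime(
            ["2024-01-01 09:00:00", "2024-01-01 09:00:04", "2024-01-01 09:00:06"]
        ),
        "price": [100.0, 101.0, 102.0],
    }
)

# For each trade, take the latest quote at or before the trade's timestamp.
joined = pd.merge_asof(trades, quotes, on="ts")
print(joined)
```

Each trade picks up the price of the last quote that precedes it, which is the kind of operation that is awkward to express in an engine without native as-of support.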

As such, xorq is well-suited to ML and AI use cases such as:

Fraud, marketing, and risk modeling in financial services (e.g., with XGBoost)
Model lifecycle and experiment management
Ad-hoc multi-engine augmentation for SQL-like operations from Python
RAG/LLM pipelines

Installation

Get started easily with:

pip install xorq

Writing your first UDxF

Here's a simple example: a pandas UDF that checks, row by row, whether the value in the url column appears in the title column.

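import xorq as xo
import xorq.vendor.ibis.expr.datatypes as dt


# A pandas UDF: the function receives the declared columns as a pandas
# DataFrame and returns one boolean per row.
@xo.udf.make_pandas_udf(
    schema=xo.schema({"title": str, "url": str}),
    return_type=dt.bool,
    name="url_in_title",
)
def url_in_title(df):
    # Treat missing values as empty strings, then test substring membership.
    return df.apply(
        lambda s: (s.url or "") in (s.title or ""),
        axis=1,
    )


con = xo.connect()

# Build a deferred expression; nothing is read or computed yet.
name = "hn-data-small.parquet"
expr = xo.deferred_read_parquet(
    con,
    xo.options.pins.get_path(name),
    name,
).mutate(**{"url_in_title": url_in_title.on_expr})

# Execution happens here.
expr.execute().head()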
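To see what the UDF body computes without running xorq, the same row-wise check can be tried on a small pandas DataFrame. This is only a sketch: the sample rows are made up, and dtype=object is an assumption chosen so that the missing URL stays as None.

```python
import pandas as pd

# Made-up rows for illustration; dtype=object keeps the missing URL as None.
df = pd.DataFrame(
    {
        "title": ["Show HN: example.com launch", "Ask HN: best editor?", "No link here"],
        "url": ["example.com", "foo.io", None],
    },
    dtype=object,
)


def url_in_title(df):
    # Same body as the UDF in the example, applied row by row (axis=1).
    return df.apply(lambda s: (s.url or "") in (s.title or ""), axis=1)


result = url_in_title(df)
print(result.tolist())
```

Note that a missing URL is replaced by an empty string, and an empty string is contained in every string, so such rows come back True. Whether that is the desired behavior depends on the use case.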

Let us know what you think

We really look forward to your feedback on the new release of xorq. Here are some resources to help you get started:

xorq Quickstart
xorq documentation
xorq project on GitHub
Connect with us on the xorq Discord server
Schedule a free 30-minute xorq training session with us

What’s next

We're targeting our V1 release for June. Between now and then, our roadmap includes several key enhancements:

Remote, Apache Arrow-native cache for collaboration: distributed caching so that teams can share cached results
Push-button deployment via Flight endpoints: serve cached artifacts through Arrow Flight endpoints for production deployment
Integration with package managers: compatibility with uv and/or pixi for reproducible, deterministic execution environments
Enhanced observability and logging: monitoring and debugging tools to track pipelines


FAQs

What does it mean to be pandas-style?

pandas-style means that users can define functions that receive a pandas DataFrame and return a Python object castable to a pyarrow object, rather than having to receive and return pyarrow objects.

What does sklearn-style mean?

sklearn-style means that users can either create deferred pipelines that directly reference scikit-learn classes (which conform to the fit-transform / fit-predict API) or create their own deferred operations by providing the fit and transform/predict methods. See https://scikit-learn.org/stable/developers/develop.html#apis-of-scikit-learn-objects for details of that API.
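The fit-transform contract that both routes rely on can be sketched with scikit-learn alone. The snippet below runs eagerly and uses no xorq API; ClipOutliers is a made-up transformer for illustration, combined with a stock scikit-learn class in a regular Pipeline.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class ClipOutliers(TransformerMixin, BaseEstimator):
    """Hypothetical custom step: learn per-column bounds in fit, apply them in transform."""

    def __init__(self, lower=5, upper=95):
        self.lower = lower
        self.upper = upper

    def fit(self, X, y=None):
        # Learn the clipping bounds from the training data only.
        self.low_, self.high_ = np.percentile(X, [self.lower, self.upper], axis=0)
        return self

    def transform(self, X):
        # Apply the bounds learned in fit.
        return np.clip(X, self.low_, self.high_)


X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# A stock scikit-learn class and a custom step share the same fit/transform API.
pipeline = Pipeline([("clip", ClipOutliers()), ("scale", StandardScaler())])
Xt = pipeline.fit_transform(X)
print(Xt.ravel())
```

Any object that provides fit and transform (or predict) in this way can take part in such a pipeline, which is the property the sklearn-style description above depends on.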
