Today, we’re excited to announce xorq (https://github.com/xorq-labs/xorq), an open source computational framework that greatly simplifies building and deploying multi-engine ML pipelines with first-class support for pandas.
xorq wraps your data processing logic with execution, planning, and production deployment capabilities so that you can focus on the data and iterate faster. It’s similar to Snowpark, except that it works across multiple query engines.
Why we developed xorq
Before founding xorq, we were data scientists building complex ML pipelines at leading tech companies. We were frustrated by the brittleness of bespoke tooling required to fluidly develop and deploy ML pipelines:
- Having to juggle separate SQL jobs, pandas scripts, and ML framework specific transformations
- Painfully slow iteration process - every small change meant a full pipeline re-run
- Moving a working pipeline from local dev to production often involved rewriting it as a separate microservice
- Manually caching intermediary results to experiment
- Disaggregating infrastructure to scale preprocessing
We wanted a more ergonomic way to build, cache, and serve pipelines—without locking ourselves into a single engine. And it was with that idea that Dan, Daniel, and I founded xorq on a mission to help Python developers make a quantum leap in ML development and innovation.
xorq is pronounced “zork”
We’ll tell the story behind the name xorq on another day, but for now, just know that it’s from out of this world, and we pronounce it “zork.”
You know what else is out of this world? GitHub Stars (🙏).
What makes xorq unique
- xorq is a multi-engine system: Seamlessly move data between different query engines, allowing you to leverage the strengths of each engine while maintaining a unified workflow — powered by Arrow and Arrow Flight.
- xorq provides deferred and composable primitives to express complex data processing logic.
- Portable Python UDFs: xorq features a portable DataFusion-backed UDF engine with first-class support for pandas dataframes. Under the hood, Ibis pushes your data access logic to the target backend for processing and handles the differences between backend-specific features.
- Built-in caching: We reuse previous results if nothing changes, saving time and cost and eliminating the pipeline lineage mistakes that manual caching invites.
- One pipeline definition: You can export your full pipeline to YAML for version control, static analysis, and (coming soon) push-button deployment.
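To make the built-in caching idea concrete, here is a minimal sketch of one common way to "reuse previous results if nothing changes": key each result by a hash of the pipeline definition. This is an illustration of the general technique only, not xorq's implementation; the `ResultCache` class and the `source` / `steps` fields are hypothetical names invented for this example.

```python
import hashlib
import json


class ResultCache:
    """Toy in-memory cache keyed by a hash of the pipeline definition (hypothetical)."""

    def __init__(self):
        self._store = {}
        self.computations = 0  # counts how often the work was actually done

    @staticmethod
    def key_for(definition: dict) -> str:
        # Canonical JSON (sorted keys) so equal definitions hash identically.
        payload = json.dumps(definition, sort_keys=True).encode()
        return hashlib.sha256(payload).hexdigest()

    def get_or_compute(self, definition: dict, compute):
        key = self.key_for(definition)
        if key not in self._store:
            self.computations += 1
            self._store[key] = compute()
        return self._store[key]


cache = ResultCache()
pipeline = {"source": "hn-data-small.parquet", "steps": ["filter", "url_in_title"]}

first = cache.get_or_compute(pipeline, lambda: sum(range(1000)))
second = cache.get_or_compute(pipeline, lambda: sum(range(1000)))  # cache hit
runs_after_repeat = cache.computations

# Changing the definition changes the key, so the work runs again.
changed = dict(pipeline, steps=["filter"])
third = cache.get_or_compute(changed, lambda: sum(range(10)))
runs_after_change = cache.computations
```

Because the key is derived from the definition itself, there is no hand-maintained cache name that can drift out of sync with the pipeline, which is the kind of lineage mistake manual caching tends to produce.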
Who xorq is for
xorq is for Python developers who are building increasingly heterogeneous ML pipelines. It is especially valuable to people who require the ability to:
- Update or query data stored across multiple data engines.
- Perform processing on data that is not supported by the engine in which it’s stored (e.g., applying AsOf processing to the results of Trino queries).
- Iteratively change and rerun pipelines on a frequent basis with sklearn-style interfaces.
- Abstract and compose pipeline stages from other teams and users.
As such, xorq is well-suited to ML and AI use cases such as:
- Fraud, marketing, and risk modeling in financial services (e.g., with XGBoost)
- Model lifecycle and experiment management
- Ad-hoc multi-engine augmentation for SQL-like operations from Python
- RAG/LLM pipelines
Installation
Get started easily with:
pip install xorq
Writing your first UDxF
Here's a simple example demonstrating xorq's ease-of-use:
import xorq as xo
import xorq.vendor.ibis.expr.datatypes as dt


@xo.udf.make_pandas_udf(
    schema=xo.schema({"title": str, "url": str}),
    return_type=dt.bool,
    name="url_in_title",
)
def url_in_title(df):
    # Row-wise check: is the URL string contained in the title?
    return df.apply(
        lambda s: (s.url or "") in (s.title or ""),
        axis=1,
    )


con = xo.connect()
name = "hn-data-small.parquet"

# Build the expression and attach the UDF result as a new column.
expr = xo.deferred_read_parquet(
    con,
    xo.options.pins.get_path(name),
    name,
).mutate(**{"url_in_title": url_in_title.on_expr})

# Run the expression and preview the first rows.
expr.execute().head()
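To see what the function body computes without a xorq connection, the same row-wise logic can be run on a small hand-made DataFrame in plain pandas. This is only a sketch: the sample rows and the helper name `url_in_title_rows` are made up for illustration and are not part of the Hacker News dataset or the xorq API.

```python
import pandas as pd


def url_in_title_rows(df: pd.DataFrame) -> pd.Series:
    # Same row-wise logic as the UDF body above, without the xorq decorator.
    return df.apply(
        lambda s: (s.url or "") in (s.title or ""),
        axis=1,
    )


# dtype=object keeps missing values as None, which the `or ""` fallback expects.
sample = pd.DataFrame(
    {
        "title": ["Launch notes for example.com", "Ask HN: thoughts?", None],
        "url": ["example.com", None, "example.org"],
    },
    dtype=object,
)

result = url_in_title_rows(sample)
print(result.tolist())  # [True, True, False]
```

Note the second row: a missing url falls back to the empty string, and the empty string is contained in every string, so that row evaluates to True.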
Let us know what you think
We really look forward to your feedback on the new release of xorq. Here are some resources to help you get started:
- xorq Quickstart
- xorq documentation
- xorq project on GitHub
- Connect with us on the xorq Discord server
- Schedule a free 30-minute xorq training session with us
What’s next
We're targeting our V1 release for June. Between now and then, our roadmap includes several key enhancements:
- Remote, Apache Arrow-native cache for collaboration - Enabling seamless teamwork with distributed caching capabilities
- Push-button deployment via Flight endpoints - Serve cached artifacts through Arrow Flight endpoints for instant production deployment
- Integration with package managers - Seamless compatibility with uv and/or pixi for reproducible, deterministic execution environments
- Enhanced observability and logging - Comprehensive monitoring and debugging tools to track pipelines
Let us know what you think!
FAQs
What does it mean to be pandas-style?
pandas-style: users can define functions that receive a pandas DataFrame and return a Python object castable to a pyarrow object, rather than having to receive and return pyarrow objects.
What does sklearn-style mean?
sklearn-style: users can either create deferred pipelines that directly reference scikit-learn classes (which conform to the fit-transform / fit-predict API) or create their own deferred operations by providing the fit and transform / predict methods. See https://scikit-learn.org/stable/developers/develop.html#apis-of-scikit-learn-objects
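As a rough illustration of the fit-transform / fit-predict convention itself, here are two tiny classes written in plain Python. They are hypothetical examples, not xorq or scikit-learn classes, and the sketch does not show how xorq wraps such objects into deferred operations.

```python
from statistics import fmean


class MeanCenterer:
    """Hypothetical transformer following the fit / transform convention."""

    def fit(self, values):
        # Learn state from the training data; return self so calls can chain.
        self.mean_ = fmean(values)
        return self

    def transform(self, values):
        # Apply the learned state to any data without re-fitting.
        return [v - self.mean_ for v in values]


class ThresholdClassifier:
    """Hypothetical predictor following the fit / predict convention."""

    def fit(self, values, labels):
        # Decision boundary: midpoint between the two class means.
        positives = [v for v, y in zip(values, labels) if y]
        negatives = [v for v, y in zip(values, labels) if not y]
        self.threshold_ = (fmean(positives) + fmean(negatives)) / 2
        return self

    def predict(self, values):
        return [v > self.threshold_ for v in values]


train = [1.0, 2.0, 3.0, 6.0]
labels = [False, False, True, True]

centerer = MeanCenterer().fit(train)
centered_train = centerer.transform(train)
classifier = ThresholdClassifier().fit(centered_train, labels)

# New data reuses the state learned during fit.
centered_new = centerer.transform([3.0, 4.0])
predictions = classifier.predict(centered_new)
```

The key property is the split between learning state in fit and applying it in transform or predict, which is what lets a fitted step be reused on new data.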