By Daniel Mesejo | March 24, 2024

Welcome once again to LETSQL’s exploration series!
In the previous blog posts, we showed how we optimize ML preprocessing and how to leverage User Defined Aggregate Functions (UDAFs) for training. In this post, we go a step further and apply database-style optimizations to an end-to-end machine learning inference pipeline. We also show a working example of compiling a simple XGBoost model into SQL so that it benefits from database-style optimizations such as predicate pushdown, projection pushdown, and constant folding.
You can find the complete code on GitHub.
Machine Learning is transforming every industry by drawing insights from high-value data stored in databases, data warehouses, and lakes.
Data analysts and data scientists use complex analytics queries with pre-trained models to predict new data outcomes. Often, these queries involve data processing operators and relational-specific operators, like filters or joins, due to the data being stored in a relational DB.
While real-time ML inference is gaining traction, most enterprise use cases still run as batch workflows. We observe that these batch use cases can benefit from SQL's relational machinery, and show an example of compiling an XGBoost model into CASE statements derived from its trees.
We will demonstrate how cross-domain optimization can enhance the UX and performance of prediction queries, using Microsoft's Length of Stay dataset. The task is to find which patients with a low number of re-admissions within the last 180 days (rcount < 2) are likely to have a long stay (lengthofstay > 1)1. We already have a trained, simple XGBRegressor model ("model.json"), and the data is stored in a PostgreSQL database.
The dataset and the full description of the features can be found here.
The usual solution involves downloading all the data from Postgres into a pandas DataFrame, applying some preprocessing, running the XGBRegressor on top of the DataFrame, and finally filtering the result.
The problems with the previous workflow can be put into two major categories: UX and Performance.
A non-exhaustive list of UX problems includes:
Regarding performance, the workflow has two significant issues that need attention:
Some of these areas are active subjects of research, and I encourage readers to check out the academic papers here and here. It is therefore essential that a good tool abstracts such details away from the user while providing a performant solution that takes advantage of relational algebra and its many years of research. There must be a better way!
Ideally, we would like a tool that lets us declaratively specify our query and that compiles it into the following SQL query:
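The target query itself is not reproduced here, but based on the predicate used later in the post (predict_xgb('model.json', patients.*) > 1), it would be along these lines:

```sql
SELECT patients.eid
FROM patients
WHERE patients.rcount < 2
  AND predict_xgb('model.json', patients.*) > 1;
```

Here predict_xgb is a hypothetical scalar function naming the trained model; the rest is plain SQL that any Postgres client can run once the function is inlined.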
The core idea is that we read and parse the model.json file and convert it into an SQL query that can be executed by Postgres. Here is what a single tree looks like when compiled to SQL2:
CASE
WHEN patients.rcount < 2
AND (
patients.psychologicaldisordermajor IS NULL
OR patients.psychologicaldisordermajor >= 1
)
AND (
patients.hematocrit IS NULL OR patients.hematocrit >= 9.69999981
)
THEN 1.24346876
WHEN patients.rcount < 2
AND (
patients.psychologicaldisordermajor IS NULL
OR patients.psychologicaldisordermajor >= 1
)
AND patients.hematocrit < 9.69999981
THEN 1.66323793
WHEN patients.rcount < 2
AND patients.psychologicaldisordermajor < 1
AND (
patients.hemo IS NULL OR patients.hemo >= 1
)
THEN 1.3903867
WHEN patients.rcount < 2 AND patients.psychologicaldisordermajor < 1 AND patients.hemo < 1
THEN 0.698453844
END
Additionally, such a tool could beneficially do the following optimizations out-of-the-box3:
- Predicate pushdown: rcount < 2 is pushed upward and into the XGBRegressor, resulting in multiple subtrees being pruned and therefore fewer features being needed.
- Pushing the output predicate (lengthofstay above 1) into the model. This is a topic of research and left for future implementation.

In the rest of the post we will create a function transpile_predict that takes a query such as the one above and converts it into a query that can be executed by Postgres. For this we will use sqlglot, a SQL transpiler. From the README on GitHub:
SQLGlot is a no-dependency SQL parser, transpiler, optimizer, and engine. It can be used to format SQL or translate between 21 different dialects like DuckDB, Presto / Trino, Spark / Databricks, Snowflake, and BigQuery. It aims to read a wide variety of SQL inputs and output syntactically and semantically correct SQL in the targeted dialects.
The full transpile_predict function is built up step by step below; the complete code is in the GitHub repo.
The first step is to parse the query. For this, sqlglot provides the function parse_one, which parses a given SQL string and returns a syntax tree (AST). Once we have the AST, we extract the node corresponding to the predict_xgb function and load the model from it:
The second step is to get the trees from the model:
If the model is simple enough, the next step is to inline it. To achieve this, we first transform the reduced list of trees into CASE statements. Given the predicate rcount < 2, we can also prune some branches:
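A minimal, self-contained sketch of that conversion, operating on XGBoost's get_dump JSON format (the real implementation in the repo may differ):

```python
def tree_to_case(tree, table="patients", known=None):
    """Convert one dumped XGBoost tree into a SQL CASE expression.

    `known` is an optional (feature, threshold) pair asserting that
    `feature < threshold` already holds (e.g. ("rcount", 2) from the
    WHERE clause), which lets us prune the contradicting branches.
    """
    def walk(node, conds):
        if "leaf" in node:
            yield conds, node["leaf"]
            return
        col = f'{table}.{node["split"]}'
        thr = node["split_condition"]
        kids = {c["nodeid"]: c for c in node["children"]}
        # If the WHERE clause guarantees feature < known_thr <= thr,
        # the "yes" (feature < thr) branch always wins: prune the other.
        if known and node["split"] == known[0] and known[1] <= thr:
            yield from walk(kids[node["yes"]], conds)
            return
        lo, hi = f"{col} < {thr}", f"{col} >= {thr}"
        # Missing values follow the "missing" branch, hence the IS NULL clause
        if node["missing"] == node["yes"]:
            lo = f"({col} IS NULL OR {lo})"
        else:
            hi = f"({col} IS NULL OR {hi})"
        yield from walk(kids[node["yes"]], conds + [lo])
        yield from walk(kids[node["no"]], conds + [hi])

    whens = " ".join(
        f"WHEN {' AND '.join(conds) or 'TRUE'} THEN {leaf}"
        for conds, leaf in walk(tree, [])
    )
    return f"CASE {whens} END"
```

Applied to a tree whose root splits on rcount at 2 with known=("rcount", 2), the entire "no" subtree disappears from the generated SQL.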
And finally replace the predict_xgb function with the inlined trees:
The code for each of the functions can be found in the GitHub repo.
A simple benchmark shows that the Postgres-only solution is about 5 times faster:
The main takeaway should not be the improved performance4 but rather the better UX gained by integrating relational algebra and machine learning, with the added security benefit of transmitting only the necessary data, like eid.
In the example, we were lucky that the model was small enough to be inlined. But what happens if the model is not small enough? If you were paying attention, you probably caught the line:
One solution is to split the model in two: a cheap part (bmi > 35) to be inlined, and a more complex one (bmi <= 35) to be run by a specialized XGBoost inference operator (like gbdt-rs).
Another challenge is how to do predicate pushdown into the model: since we only require predict_xgb('model.json', patients.*) > 1, we may not need to examine all the trees.
The LETSQL team is working hard to build a multi-engine scheduler and optimizer that can take advantage of the relational algebra and machine learning. But, we can’t do it alone. That’s why we need your feedback and suggestions on how to improve your experience. Share your thoughts with us in the comments section or on our community forum, and let’s redefine the boundaries of data science and machine learning integration together. To stay updated with our latest developments, sign up for our newsletter or visit letsql.dev.
Thanks to Dan Lovell and Hussain Sultan for the comments and the thorough review.
The task is a toy problem to showcase the gap between relational algebra and machine learning.↩︎
Perhaps this type of stress-testing is useful when you have a lot of CASE statements. Of course, that test is for UNION statements, but a similar concept may apply to CASE statements.↩︎
The optimizations listed can be found in the papers listed in the Resources section.↩︎
Further improvements could be achieved by tuning the DB for this specific type of workload.↩︎