SoCC 2023
Conference paper

A Comparison of End-to-End Decision Forest Inference Pipelines

Decision forests, including RandomForest, XGBoost, and LightGBM, dominate machine learning tasks over tabular data. Recently, several frameworks have been developed for decision forest inference, such as ONNX, TreeLite from Amazon, TensorFlow Decision Forest from Google, HummingBird from Microsoft, Nvidia FIL, and lleaves. While these frameworks are fully optimized for inference computations, they are all decoupled from databases and general data management frameworks, which leads to cross-system performance overheads. We first provided a DICT model to understand the performance gaps between decoupled and in-database inference. We further identified that, for in-database inference, in addition to the popular UDF-centric representation that encapsulates the ML computation in one User Defined Function (UDF), there also exists a relation-centric representation that breaks decision forest inference down into several fine-grained SQL operations. The relation-centric representation can achieve significantly better performance for large models. We optimized both implementations and conducted a comprehensive benchmark comparing them to the aforementioned decoupled inference pipelines and to existing in-database inference pipelines such as SparkSQL and PostgresML. The evaluation results validated the DICT model and demonstrated the superior performance of our in-database inference design compared to the baselines.
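To make the distinction between the two in-database representations concrete, the following is a minimal sketch and not the paper's implementation. It assumes a hypothetical toy forest of two single-split trees, hypothetical table names (`nodes`, `samples`, `features`), and SQLite, chosen only because it ships with Python. The UDF-centric variant hides the whole forest behind one function registered with the database; the relation-centric variant stores the trees as rows and expresses traversal and averaging as joins and an aggregate.

```python
import sqlite3

# Hypothetical toy forest: two single-split trees; the root is always node 0.
# Row layout: (tree_id, node_id, feature, threshold, left_id, right_id, leaf_value)
NODES = [
    (0, 0, 0, 0.5, 1, 2, None),
    (0, 1, None, None, None, None, 0.0),
    (0, 2, None, None, None, None, 1.0),
    (1, 0, 1, 2.0, 1, 2, None),
    (1, 1, None, None, None, None, 0.0),
    (1, 2, None, None, None, None, 1.0),
]
# Hypothetical input rows: (sample_id, f0, f1)
SAMPLES = [(0, 0.2, 3.0), (1, 0.9, 3.0), (2, 0.1, 1.0)]


def forest_predict(f0, f1):
    """UDF-centric: the whole forest is evaluated inside one opaque function."""
    feats = (f0, f1)
    trees = {}
    for tree_id, node_id, feature, thr, left, right, leaf in NODES:
        trees.setdefault(tree_id, {})[node_id] = (feature, thr, left, right, leaf)
    total = 0.0
    for tree in trees.values():
        node = 0
        while tree[node][4] is None:  # descend until a leaf is reached
            feature, thr, left, right, _ = tree[node]
            node = left if feats[feature] <= thr else right
        total += tree[node][4]
    return total / len(trees)


# Relation-centric: trees are rows; traversal is a recursive join and the
# ensemble average is an ordinary GROUP BY aggregate.
RELATIONAL_SQL = """
WITH RECURSIVE walk(sample_id, tree_id, node_id) AS (
    SELECT s.sample_id, t.tree_id, 0
    FROM samples s CROSS JOIN (SELECT DISTINCT tree_id FROM nodes) t
  UNION ALL
    SELECT w.sample_id, w.tree_id,
           CASE WHEN f.val <= n.threshold THEN n.left_id ELSE n.right_id END
    FROM walk w
    JOIN nodes n ON n.tree_id = w.tree_id AND n.node_id = w.node_id
    JOIN features f ON f.sample_id = w.sample_id AND f.feature = n.feature
)
SELECT w.sample_id, AVG(n.leaf_value)
FROM walk w
JOIN nodes n ON n.tree_id = w.tree_id AND n.node_id = w.node_id
WHERE n.leaf_value IS NOT NULL
GROUP BY w.sample_id
ORDER BY w.sample_id
"""

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE nodes (tree_id INT, node_id INT, feature INT, "
    "threshold REAL, left_id INT, right_id INT, leaf_value REAL)"
)
conn.execute("CREATE TABLE samples (sample_id INT, f0 REAL, f1 REAL)")
conn.execute("CREATE TABLE features (sample_id INT, feature INT, val REAL)")
conn.executemany("INSERT INTO nodes VALUES (?, ?, ?, ?, ?, ?, ?)", NODES)
conn.executemany("INSERT INTO samples VALUES (?, ?, ?)", SAMPLES)
# Long-format copy of the samples so that a node's feature id can be joined on.
conn.executemany(
    "INSERT INTO features VALUES (?, ?, ?)",
    [(sid, i, v) for sid, *vals in SAMPLES for i, v in enumerate(vals)],
)

conn.create_function("forest_predict", 2, forest_predict)
udf_result = conn.execute(
    "SELECT sample_id, forest_predict(f0, f1) FROM samples ORDER BY sample_id"
).fetchall()
rel_result = conn.execute(RELATIONAL_SQL).fetchall()

print(udf_result)
print(rel_result)
```

Both queries return the same per-sample averages. The difference is in what the database can see: the UDF is a black box to the query optimizer, whereas the relational form exposes joins and aggregation that the engine can plan and partition itself. That visibility is one plausible reason a relation-centric design could scale better for large models.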