Publication
SoCC 2023
Conference paper
A Comparison of End-to-End Decision Forest Inference Pipelines
Abstract
Decision forests, including RandomForest, XGBoost, and LightGBM, dominate machine learning tasks over tabular data. Recently, several frameworks were developed for decision forest inference, such as ONNX, TreeLite from Amazon, TensorFlow Decision Forest from Google, HummingBird from Microsoft, Nvidia FIL, and lleaves. While these frameworks are fully optimized for inference computations, they are all decoupled from databases and general data management frameworks, which leads to cross-system performance overheads. We first provided a DICT model to understand the performance gaps between decoupled and in-database inference. We further identified that for in-database inference, in addition to the popular UDF-centric representation that encapsulates the ML inference in one User Defined Function (UDF), there also exists a relation-centric representation that breaks down decision forest inference into several fine-grained SQL operations. The relation-centric representation can achieve significantly better performance for large models. We optimized both implementations and conducted a comprehensive benchmark comparing them to the aforementioned decoupled inference pipelines and to existing in-database inference pipelines such as Spark-SQL and PostgresML. The evaluation results validated the DICT model and demonstrated the superior performance of our in-database inference design compared to the baselines.
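To make the distinction between the two in-database representations concrete, the following is a minimal sketch and not the paper's implementation. It assumes a hypothetical two-tree toy forest, SQLite as a stand-in database, and invented table and column names (`nodes`, `samples`, `left_id`, `right_id`, `leaf_value`). The UDF-centric variant hides the whole forest traversal behind one scalar function, while the relation-centric variant stores tree nodes as rows and expresses the traversal with joins and an aggregation.

```python
import sqlite3

# Hypothetical toy forest (not from the paper): two depth-1 trees.
# Row layout: (tree_id, node_id, feature, threshold, left_id, right_id, leaf_value)
# Internal nodes have leaf_value NULL; leaves have feature NULL.
NODES = [
    (0, 0, 0, 0.5, 1, 2, None),
    (0, 1, None, None, None, None, 1.0),
    (0, 2, None, None, None, None, 3.0),
    (1, 0, 1, 2.0, 1, 2, None),
    (1, 1, None, None, None, None, 10.0),
    (1, 2, None, None, None, None, 20.0),
]
SAMPLES = [(0, 0.2, 5.0), (1, 0.9, 1.0)]  # (sample_id, f0, f1)


def build_db():
    """Create an in-memory database holding the model and the input rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE nodes (tree_id INT, node_id INT, feature INT, "
        "threshold REAL, left_id INT, right_id INT, leaf_value REAL)"
    )
    conn.execute("CREATE TABLE samples (sample_id INT, f0 REAL, f1 REAL)")
    conn.executemany("INSERT INTO nodes VALUES (?,?,?,?,?,?,?)", NODES)
    conn.executemany("INSERT INTO samples VALUES (?,?,?)", SAMPLES)
    return conn


def forest_predict(f0, f1):
    """Walk every tree in Python and sum the leaf values."""
    by_key = {(t, n): (feat, thr, l, r, leaf) for t, n, feat, thr, l, r, leaf in NODES}
    features = (f0, f1)
    total = 0.0
    for tree_id in sorted({t for t, *_ in NODES}):
        node_id = 0  # every tree's root is node 0 in this toy layout
        while True:
            feat, thr, l, r, leaf = by_key[(tree_id, node_id)]
            if feat is None:  # reached a leaf
                total += leaf
                break
            node_id = l if features[feat] <= thr else r
    return total


def udf_centric(conn):
    """UDF-centric: the entire forest is one opaque scalar function."""
    conn.create_function("forest_predict", 2, forest_predict)
    sql = "SELECT sample_id, forest_predict(f0, f1) FROM samples ORDER BY sample_id"
    return conn.execute(sql).fetchall()


def relation_centric(conn):
    """Relation-centric: traversal as joins over a node table, then SUM."""
    sql = """
    WITH RECURSIVE walk(sample_id, tree_id, node_id) AS (
        -- start every sample at the root of every tree
        SELECT s.sample_id, n.tree_id, n.node_id
        FROM samples s JOIN nodes n ON n.node_id = 0
        UNION ALL
        -- step from each internal node to the matching child
        SELECT w.sample_id, w.tree_id,
               CASE WHEN (CASE n.feature WHEN 0 THEN s.f0 ELSE s.f1 END)
                         <= n.threshold
                    THEN n.left_id ELSE n.right_id END
        FROM walk w
        JOIN nodes n ON n.tree_id = w.tree_id AND n.node_id = w.node_id
        JOIN samples s ON s.sample_id = w.sample_id
        WHERE n.feature IS NOT NULL
    )
    SELECT w.sample_id, SUM(n.leaf_value)
    FROM walk w
    JOIN nodes n ON n.tree_id = w.tree_id AND n.node_id = w.node_id
    WHERE n.leaf_value IS NOT NULL
    GROUP BY w.sample_id
    ORDER BY w.sample_id
    """
    return conn.execute(sql).fetchall()


conn = build_db()
udf_rows = udf_centric(conn)
rel_rows = relation_centric(conn)
print(udf_rows)
print(rel_rows)
```

Both queries return the same per-sample sums on this toy data. The sketch says nothing about relative performance; it only shows that the relation-centric form exposes the model to the query engine as ordinary rows and operators, whereas the UDF-centric form keeps it opaque.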