C.A. Micchelli, W.L. Miranker
Journal of the ACM
Decision forests, including RandomForest, XGBoost, and LightGBM, dominate machine learning tasks over tabular data. Recently, several frameworks were developed for decision forest inference, such as ONNX, TreeLite from Amazon, TensorFlow Decision Forest from Google, HummingBird from Microsoft, Nvidia FIL, and lleaves. While these frameworks are fully optimized for inference computations, they are all decoupled from databases and general data management frameworks, which leads to cross-system performance overheads. We first provided a DICT model to understand the performance gaps between decoupled and in-database inference. We further identified that for in-database inference, in addition to the popular UDF-centric representation that encapsulates the ML model into one User Defined Function (UDF), there also exists a relation-centric representation that breaks down decision forest inference into several fine-grained SQL operations. The relation-centric representation can achieve significantly better performance for large models. We optimized both implementations and conducted a comprehensive benchmark to compare them to the aforementioned decoupled inference pipelines and to existing in-database inference pipelines such as SparkSQL and PostgresML. The evaluation results validated the DICT model and demonstrated the superior performance of our in-database inference design compared to the baselines.
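To make the relation-centric idea concrete, here is a minimal sketch of what expressing forest inference as fine-grained SQL operations could look like. It is not the authors' implementation: the `node` and `sample` schemas, the two toy stumps, and the use of SQLite with a recursive join are all assumptions chosen for illustration. Each tree node is a row, each feature value is a row, and prediction is a join-driven walk from the root followed by an aggregate over the reached leaves.

```python
import sqlite3


def build_db():
    """Create an in-memory database holding a toy forest and two samples.

    The schema is hypothetical: internal nodes carry (feature, threshold,
    left_id, right_id); leaf nodes have feature = NULL and a leaf_value.
    """
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE node (tree_id INT, node_id INT, feature INT, threshold REAL,
                           left_id INT, right_id INT, leaf_value REAL);
        CREATE TABLE sample (sample_id INT, feature INT, value REAL);
    """)
    con.executemany("INSERT INTO node VALUES (?,?,?,?,?,?,?)", [
        (0, 0, 0, 0.5, 1, 2, None),          # tree 0 root: feature 0 <= 0.5 ?
        (0, 1, None, None, None, None, 0.0),  # tree 0 left leaf
        (0, 2, None, None, None, None, 1.0),  # tree 0 right leaf
        (1, 0, 1, 2.0, 1, 2, None),          # tree 1 root: feature 1 <= 2.0 ?
        (1, 1, None, None, None, None, 0.0),  # tree 1 left leaf
        (1, 2, None, None, None, None, 1.0),  # tree 1 right leaf
    ])
    con.executemany("INSERT INTO sample VALUES (?,?,?)", [
        (0, 0, 0.2), (0, 1, 3.0),
        (1, 0, 0.9), (1, 1, 5.0),
    ])
    return con


# Start every (sample, tree) pair at the root (node_id = 0), then repeatedly
# join the current node with the sample's value for that node's feature.
# Leaves have feature = NULL, so they match no sample row and the walk stops.
PREDICT_SQL = """
WITH RECURSIVE walk(sample_id, tree_id, node_id) AS (
    SELECT s.sample_id, n.tree_id, n.node_id
    FROM (SELECT DISTINCT sample_id FROM sample) AS s
    JOIN node AS n ON n.node_id = 0
    UNION ALL
    SELECT w.sample_id, w.tree_id,
           CASE WHEN s.value <= n.threshold THEN n.left_id ELSE n.right_id END
    FROM walk AS w
    JOIN node AS n ON n.tree_id = w.tree_id AND n.node_id = w.node_id
    JOIN sample AS s ON s.sample_id = w.sample_id AND s.feature = n.feature
)
SELECT w.sample_id, AVG(n.leaf_value) AS prediction
FROM walk AS w
JOIN node AS n ON n.tree_id = w.tree_id AND n.node_id = w.node_id
WHERE n.feature IS NULL
GROUP BY w.sample_id
ORDER BY w.sample_id
"""


def predict(con):
    """Return [(sample_id, mean leaf value over all trees), ...]."""
    return con.execute(PREDICT_SQL).fetchall()


if __name__ == "__main__":
    print(predict(build_db()))
```

In this toy setup the forest prediction is simply the mean of the leaf values reached in each tree. A UDF-centric design would instead pass each sample's features to a single function that hides the whole traversal from the query optimizer.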
Saurabh Paul, Christos Boutsidis, et al.
JMLR
Joxan Jaffar
Journal of the ACM
Kenneth L. Clarkson, Elad Hazan, et al.
Journal of the ACM