The proposed solution enables a payment network and banks to collaboratively train an ensemble model, in particular a random forest, without learning anything about each other’s private datasets. Choosing an ensemble model lets the team take advantage of the well-known ability of ensembles to reduce variance and increase accuracy. Conventionally, a random forest consists of greedy decision trees, in which the feature at each node is chosen using some judiciously defined criterion, such as information gain. The solution instead trains a random forest consisting of random decision trees (RDTs). In a random decision tree, features for the tree nodes are chosen at random instead of using a selection criterion, so the structure of the tree is built independently of the training data. The training data is used only to determine the labels associated with the leaf nodes of the tree.
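To make the RDT idea concrete, here is a minimal plaintext sketch, not the team's implementation. The function names, the dictionary-based node layout, the random split thresholds, and the assumption that every feature is scaled to [0, 1) are all illustrative choices that do not come from the solution itself. The point it shows is that the tree structure is built before any data is seen, and the data only fills in the label counts at the leaves.

```python
import random


def build_random_tree(feature_names, depth, rng):
    """Build a tree structure without looking at any training data."""
    if depth == 0:
        # Leaves hold per-label counts, filled in later from the data.
        return {"leaf": True, "counts": {0: 0, 1: 0}}
    return {
        "leaf": False,
        "feature": rng.choice(feature_names),  # chosen at random, no criterion
        "threshold": rng.random(),             # assumes features lie in [0, 1)
        "left": build_random_tree(feature_names, depth - 1, rng),
        "right": build_random_tree(feature_names, depth - 1, rng),
    }


def find_leaf(tree, row):
    """Route one example (a dict of feature values) down to its leaf."""
    node = tree
    while not node["leaf"]:
        go_left = row[node["feature"]] < node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node


def fit_leaf_counts(tree, rows, labels):
    """The only step that touches training data: count labels per leaf."""
    for row, label in zip(rows, labels):
        find_leaf(tree, row)["counts"][label] += 1


def predict_positive_fraction(tree, row):
    """Fraction of label-1 examples in the leaf the row falls into."""
    counts = find_leaf(tree, row)["counts"]
    total = counts[0] + counts[1]
    return counts[1] / total if total else 0.5


# Hypothetical feature names and toy data, for illustration only.
rng = random.Random(0)
tree = build_random_tree(["amount", "hour"], depth=3, rng=rng)
rows = [{"amount": 0.1, "hour": 0.9}, {"amount": 0.8, "hour": 0.2}]
fit_leaf_counts(tree, rows, labels=[0, 1])
print(predict_positive_fraction(tree, rows[1]))
```

A forest would repeat this with several independently drawn trees and average their leaf statistics at prediction time.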
The proposed solution allows the payment network (PN) and each bank to engineer complex features locally. Incorporating statistical features of transaction graphs, including attributes of account nodes and their neighborhoods, can significantly boost the accuracy of the trained model. The PN applies a pipeline of proven graph-based financial crime detection techniques to its own data and feeds the results into an ensemble of privacy-preserving decision trees, which incorporates the influence of the bank data without exposing it to the PN (or vice versa). The features extracted by a participant stay with that participant, and both the training and inference protocols are designed to keep those features private from the other participants, including the aggregator.
A benefit of using RDTs is that tree structures can be built independently of the training data. For ease of exposition, the PN builds the tree structures. The most challenging part of the training process is to (privately) compute the label for each leaf node, which may depend on both PN and bank features. The team proposed a novel protocol based on homomorphic encryption (HE) that enables the PN and banks to jointly compute the labels of leaf nodes. At the end of this protocol, the PN learns nothing about any bank’s account dataset, and no bank learns anything about the PN’s transaction dataset or the other banks’ account datasets.
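The protocol itself is not spelled out here, so the following is only a toy illustration of the property that HE-based counting relies on: with an additively homomorphic scheme, encrypted per-leaf counts can be summed without decrypting them. The sketch uses a textbook Paillier construction as a stand-in. The choice of scheme, the tiny hard-coded primes, the function names, and the two-party example data are all assumptions made for illustration and are not secure or representative of the team's protocol.

```python
import math
import secrets

# Toy key material built from two known Mersenne primes.
# These sizes are far too small for real use.
P = 2**31 - 1
Q = 2**61 - 1
N = P * Q
N_SQUARED = N * N
G = N + 1
LAMBDA = (P - 1) * (Q - 1) // math.gcd(P - 1, Q - 1)  # lcm(p - 1, q - 1)
MU = pow(LAMBDA, -1, N)  # this shortcut holds because g = n + 1


def encrypt(message: int) -> int:
    """Encrypt a non-negative integer smaller than N."""
    while True:
        r = secrets.randbelow(N - 1) + 1
        if math.gcd(r, N) == 1:
            break
    return (pow(G, message, N_SQUARED) * pow(r, N, N_SQUARED)) % N_SQUARED


def decrypt(ciphertext: int) -> int:
    """Recover the plaintext using the private values LAMBDA and MU."""
    x = pow(ciphertext, LAMBDA, N_SQUARED)
    return ((x - 1) // N * MU) % N


def add_encrypted(c1: int, c2: int) -> int:
    """Multiplying ciphertexts adds the underlying plaintexts."""
    return (c1 * c2) % N_SQUARED


# Hypothetical per-leaf counts held by two parties (illustrative data).
party_a_counts = [3, 0, 5]
party_b_counts = [1, 2, 0]

enc_a = [encrypt(c) for c in party_a_counts]
enc_b = [encrypt(c) for c in party_b_counts]

# An aggregator without the private key can still combine the counts.
enc_totals = [add_encrypted(x, y) for x, y in zip(enc_a, enc_b)]

# Only the key holder can read the per-leaf totals.
totals = [decrypt(c) for c in enc_totals]
print(totals)
```

In a real deployment one would use a vetted HE library and proper key sizes. The sketch also leaves open who holds the decryption key and how leaf membership is determined across PN and bank features, which the actual protocol must handle.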
To protect against inference-time attacks, we incorporate differential privacy by building on the techniques presented here, wherein each bank adds calibrated Laplace noise when computing the number of labels of ‘red’ leaf nodes under HE.
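As a rough illustration of what "calibrated" means, the sketch below applies the standard Laplace mechanism to a single count in plaintext. It is not the team's method: in the solution the noise is added under HE, and the privacy budget, the sensitivity, and the function names used here are assumptions. For a count that changes by at most 1 when one record is added or removed, the noise scale is sensitivity divided by epsilon.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def noisy_count(true_count: int, epsilon: float, rng: random.Random,
                sensitivity: float = 1.0) -> float:
    """Laplace mechanism: smaller epsilon means a larger noise scale."""
    scale = sensitivity / epsilon
    return true_count + laplace_noise(scale, rng)


# Illustrative values only; epsilon = 1.0 is not taken from the solution.
rng = random.Random(0)
print(noisy_count(true_count=10, epsilon=1.0, rng=rng))
```

The released value is a real number that can be negative or non-integer. Whether and how to round or clip it is a separate design decision not covered by this sketch.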
We believe our unique combination of privacy techniques, scalability of deployment, and extensibility to new features makes our solution extremely compelling for real-world deployment. We are continuing to develop our approach and will submit results to a top privacy conference.
This work was carried out by a multi-lab IBM Research team with the following members: Nathalie Baracaldo Angel, Nir Drucker, Naoise Holohan, Keith Houck, Swanand Kadhe, Ryo Kawahara, Alan King, Eyal Kushnir, Heiko Ludwig, Ambrish Rawat, Hayim Shaul, Mikio Takeuchi and Yi Zhou.