Towards high-precision and reusable entity resolution algorithms over sparse financial datasets
We describe our approach to the FEIII Data Challenge, which requires matching entities across multiple financial datasets (FFIEC, SEC, and LEI). We use a high-level language (HIL) with constructs for expressing both the matching logic and the policies that avoid or reduce ambiguity among matches. This lets us produce highly accurate results in a sparse setting where only name and location attributes are available. As part of the high-level specification, we also use a Smart-Term Generation (STG) component, which provides a sophisticated subroutine for normalizing company names. The specification is reusable: the same HIL specification (modulo changes to attribute names) applies uniformly between FFIEC and SEC, between FFIEC and LEI, and between LEI and SEC. Our approach used only the data provided by the organizers, without analyzing any additional (external) datasets. For the task of linking FFIEC records to SEC, we achieved 92.82% precision, 84.32% recall, and an 88.38% F1-score. The precision and F1-score were the highest reported across all participants, and recall was 1.3 percentage points below the maximum of 85.63%. For the task of linking FFIEC records to LEI, we achieved 99.14% precision, 92.54% recall, and a 95.72% F1-score, which is 1.72 percentage points below the maximum reported F1-score of 97.44%. In this short paper, we describe our method, analyze our results, and outline possible directions for improvement.
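To make the overall shape of such a pipeline concrete, the following Python sketch normalizes company names, pairs records that agree on normalized name and location, and then applies a simple policy that discards ambiguous links. It is an illustration only: it is not HIL and does not reproduce the STG component. The suffix list, the record fields (`id`, `name`, `city`, `state`), the sample records, and the keep-only-one-to-one policy are all assumptions made for the example.

```python
import re
from collections import defaultdict

# Hypothetical list of legal-form tokens; a real normalizer would be far richer.
LEGAL_SUFFIXES = {"inc", "corp", "corporation", "co", "company", "ltd", "llc", "na"}


def normalize_name(name):
    """Lowercase, drop punctuation, and strip trailing legal-form tokens."""
    text = name.lower().replace("&", " and ").replace(".", "")
    tokens = re.sub(r"[^a-z0-9]+", " ", text).split()
    # Keep at least one token so a name is never reduced to the empty string.
    while len(tokens) > 1 and tokens[-1] in LEGAL_SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


def block_key(record):
    """Matching key built from the only attributes assumed available: name and location."""
    return (
        normalize_name(record["name"]),
        record["city"].strip().lower(),
        record["state"].strip().upper(),
    )


def match(left, right):
    """Return {left_id: right_id}, keeping only links that are one-to-one."""
    index = defaultdict(list)
    for rec in right:
        index[block_key(rec)].append(rec["id"])

    # Candidate links: every pair of records that share a key.
    candidates = [
        (rec["id"], rid) for rec in left for rid in index.get(block_key(rec), [])
    ]

    # Assumed ambiguity policy: drop any record that appears in more than one link.
    left_count = defaultdict(int)
    right_count = defaultdict(int)
    for lid, rid in candidates:
        left_count[lid] += 1
        right_count[rid] += 1
    return {
        lid: rid
        for lid, rid in candidates
        if left_count[lid] == 1 and right_count[rid] == 1
    }


# Made-up records for illustration; none of these come from the challenge data.
ffiec = [
    {"id": "F1", "name": "First National Bank, N.A.", "city": "Springfield", "state": "IL"},
    {"id": "F2", "name": "Acme Trust Co.", "city": "Dover", "state": "DE"},
]
sec = [
    {"id": "S1", "name": "FIRST NATIONAL BANK", "city": "springfield", "state": "il"},
    {"id": "S2", "name": "Acme Trust Company", "city": "Dover", "state": "DE"},
    {"id": "S3", "name": "ACME TRUST", "city": "Dover", "state": "DE"},
]

links = match(ffiec, sec)
print(links)  # {'F1': 'S1'}
```

Here `F2` is left unlinked because it matches two SEC records equally well. Dropping such cases trades recall for precision, which is one plausible way to read "policies to avoid or reduce ambiguities"; the actual policies may differ. Because the logic depends only on field names, the same functions can be applied to any other pair of datasets by renaming attributes.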