PETS 2023

A Utility-Preserving De-Identification Approach with Relation Extraction Filtering


The volume of information generated each day is increasing at a staggering rate, and much of it is free text (e.g., reports, contracts, and medical notes). The ability to leverage such information depends on compliance with privacy regulations, which vary across countries and make de-identification of unstructured documents a non-trivial task. Existing solutions mainly rely on modern Named Entity Recognition (NER) methods, which leverage context to resolve the ambiguity of free text. However, these solutions do not preserve the utility of the de-identified text, since they can mark non-sensitive entities as Personal Information, producing a high number of false positives. This work presents a novel utility-preserving approach to the de-identification of unstructured documents. The method combines NER and Relation Extraction (RE) techniques to expand the information associated with each entity by linking related entities, thereby increasing the contextual information available about each detected entity. We empirically demonstrate that our approach improves the quality of entity detection, thus increasing the overall quality of the data de-identification process and the utility of the processed corpus.
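
To make the idea of relation-based filtering concrete, the following is a minimal sketch and not the implementation evaluated in this work. It assumes that NER and RE have already run and produced entity and relation records, and it assumes one particular filtering rule chosen purely for illustration: an entity is treated as sensitive only if it is a person or is linked to a person by some relation. The `Entity` and `Relation` types, the `filter_sensitive` function, the label names, and the example sentence are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    id: int
    text: str
    label: str  # hypothetical NER label, e.g. "PERSON" or "DATE"


@dataclass(frozen=True)
class Relation:
    head: int   # id of the first linked entity
    tail: int   # id of the second linked entity
    label: str  # hypothetical RE label, e.g. "born_on"


def filter_sensitive(entities, relations, anchor_labels=frozenset({"PERSON"})):
    """Keep anchor entities and entities directly linked to an anchor.

    Assumed rule (illustrative only): an entity with no relation to a
    person is considered non-sensitive and is left in the text.
    """
    anchors = {e.id for e in entities if e.label in anchor_labels}
    linked = set(anchors)
    for r in relations:
        if r.head in anchors:
            linked.add(r.tail)
        if r.tail in anchors:
            linked.add(r.head)
    return [e for e in entities if e.id in linked]


# Toy input for the sentence:
# "Alice Jones was born on 12 March 1980. The guideline was revised in 2019."
entities = [
    Entity(0, "Alice Jones", "PERSON"),
    Entity(1, "12 March 1980", "DATE"),
    Entity(2, "2019", "DATE"),
]
relations = [Relation(0, 1, "born_on")]

kept = filter_sensitive(entities, relations)
print([e.text for e in kept])
```

Under this assumed rule, the birth date is retained as sensitive because it is related to a person, while the standalone year is not masked, which is the kind of false positive a NER-only pipeline might otherwise produce.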