Facilitating Federated Genomic Data Analysis by Identifying Record Correlations while Ensuring Privacy
As sequencing costs fall and computing devices become pervasive, genomic data collection continues to grow. Identifying related records is a fundamental step in creating high-quality datasets for genomic research. However, genomic data may reveal sensitive information about individuals. In this paper, we present a privacy-preserving solution for identifying samples with a high degree of kinship across federated datasets. In a client-server setting, the researchers perform a lightweight synchronization to decide which metadata to share with the server. To improve privacy, we propose a framework based on random shuffling, synthetic record generation, and a variant of local differential privacy. Furthermore, we provide a detailed privacy analysis and extensive evaluations on real genomic data from OpenSNP. The experimental results show that our proposed scheme is secure against honest-but-curious servers and identifies related samples efficiently and with high accuracy.
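
To make two of the ingredients named above concrete, the following is a minimal sketch of how a researcher might perturb and shuffle records locally before sharing them. It is an illustration under stated assumptions, not the mechanism proposed in this paper: it uses standard k-ary randomized response as a stand-in for the paper's variant of local differential privacy, assumes SNP values are encoded as 0, 1, or 2, treats `epsilon` as a per-SNP budget, and omits synthetic record generation entirely. The names `keep_probability`, `perturb_record`, and `prepare_upload` are hypothetical.

```python
import math
import random

# Assumed encoding (not stated in the abstract): each SNP is 0, 1, or 2.
GENOTYPES = (0, 1, 2)


def keep_probability(epsilon, k=len(GENOTYPES)):
    """Probability of reporting the true value under k-ary randomized response."""
    return math.exp(epsilon) / (math.exp(epsilon) + k - 1)


def perturb_record(record, epsilon, rng):
    """Perturb each SNP independently; epsilon is a per-SNP privacy budget."""
    p = keep_probability(epsilon)
    noisy = []
    for value in record:
        if rng.random() < p:
            # Keep the true genotype with probability p.
            noisy.append(value)
        else:
            # Otherwise report one of the other genotypes uniformly at random.
            noisy.append(rng.choice([g for g in GENOTYPES if g != value]))
    return noisy


def prepare_upload(records, epsilon, seed=None):
    """Perturb every record locally, then shuffle the record order."""
    rng = random.Random(seed)
    noisy = [perturb_record(r, epsilon, rng) for r in records]
    rng.shuffle(noisy)  # hides the original position of each record
    return noisy


if __name__ == "__main__":
    toy_records = [[0, 1, 2, 1], [2, 2, 0, 1], [1, 0, 0, 2]]  # made-up data
    print(prepare_upload(toy_records, epsilon=1.0, seed=7))
```

A smaller `epsilon` lowers the probability of keeping the true genotype (it approaches 1/3 as `epsilon` approaches 0), which strengthens privacy but weakens the kinship signal that the server can recover. Because each SNP is perturbed independently here, the total budget for a record grows with the number of SNPs shared, which is one reason a practical scheme would restrict the shared metadata.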