Focus: Function clone identification on cross-platform

Lirong Fu; Shouling Ji; Changchang Liu; Peiyu Liu; Fuzheng Duan; Zonghui Wang; Whenzhi Chen; Ting Wang

doi:10.1002/int.22752

International Journal of Intelligent Systems

Paper

22 Nov 2021

Abstract

Automatic identification of function clones across platforms aims to determine whether two functions are identical without access to the source code, a fundamental challenge in vulnerability search, code plagiarism detection, and malware classification. With the rapid development of deep neural networks in program analysis, state-of-the-art neural network-based function clone identification methods represent functions as embeddings using a graph neural network (GNN). However, this representation of functions brings two challenges. (1) The feature engineering that accurately maps the raw data of binary code to machine learning features is complicated. (2) A highly accurate embedding of functions requires a customized GNN that focuses on the features most critical to identifying binary code. To the best of our knowledge, a comprehensive work that overcomes both challenges is still missing. In this paper, we propose a novel prototype named Focus, which is designed to accurately and efficiently identify similar functions. To address the first challenge, inspired by natural language processing techniques that effectively learn text semantics across natural languages, Focus learns representative semantic features of functions with a customized learning model. To address the second challenge, Focus employs a multi-head attention mechanism to capture the critical features of a function. Through extensive experiments, we demonstrate that Focus achieves high accuracy of function clone identification across eight architectures. In particular, the identification performance (AUC value) of Focus is 97% for cross-platform and 99% for single-platform identification. Furthermore, the evaluation in real-world applications shows that Focus identifies 24 vulnerable functions among the top-30 candidates, which is one time higher than the baseline approaches.
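
As a rough illustration of how embedding-based clone identification is typically scored, the sketch below compares pairs of function embeddings by cosine similarity and computes an AUC value over labeled pairs. It is not the Focus implementation: the embedding vectors are hypothetical toy values, the helper names `cosine_similarity` and `auc` are assumptions introduced here, and the abstract does not state which similarity measure Focus uses.

```python
import math
from itertools import product


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def auc(scores, labels):
    """AUC as the probability that a randomly chosen clone pair (label 1)
    scores higher than a randomly chosen non-clone pair (label 0).
    Ties count as half a win."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p, n in product(pos, neg)
    )
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: (embedding of function A, embedding of function B, label).
# Label 1 means the two functions are clones; label 0 means they are not.
pairs = [
    ([1.0, 0.0, 1.0], [0.9, 0.1, 1.1], 1),
    ([0.0, 1.0, 0.0], [0.1, 0.9, 0.0], 1),
    ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0], 0),
    ([0.0, 0.0, 1.0], [1.0, 1.0, 0.0], 0),
]

scores = [cosine_similarity(a, b) for a, b, _ in pairs]
labels = [label for _, _, label in pairs]
toy_auc = auc(scores, labels)
print(f"AUC on toy pairs: {toy_auc:.2f}")
```

An AUC of 1.0 means every clone pair is ranked above every non-clone pair, while 0.5 corresponds to chance-level ranking. The reported 97% and 99% figures are AUC values of this kind, measured on the paper's own data sets.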

Conference paper