Does encoding matter? A novel view on the quantitative genetic trait prediction problem

Dan He; Laxmi Parida

doi:10.1109/BIBM.2015.7359667

BIBM 2015

Conference paper

16 Dec 2015

Abstract

Given a set of biallelic molecular markers, such as SNPs, with genotype values encoded numerically on a collection of plant, animal, or human samples, the goal of genetic trait prediction is to predict quantitative trait values by simultaneously modeling all marker effects. Genetic trait prediction is usually formulated as a linear regression model, which requires quantitative encodings of the genotypes. There is a large body of work on prediction algorithms, but none of the existing work has investigated how the encoding affects the genetic trait prediction problem. In this work, we view the problem from a novel angle: as a multiple regression problem on categorical data, which requires encoding the categorical data as numerical data. We evaluate various encoding mechanisms and investigate theoretically how different encodings affect the performance of genetic trait prediction algorithms. To our knowledge, this is the first analysis of different encoding mechanisms for the genetic trait prediction problem. We further propose two novel encoding methods and show that they generate numerical features with higher predictive power. Our experiments show that our methods are superior to the other encoding methods for both the single-marker model and the epistasis model.
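To make the notion of "encoding" concrete, here is a minimal sketch of two widely used textbook encodings for a single biallelic marker: an additive allele count and a one-hot indicator vector. These are not the two novel encodings proposed in the paper, which the abstract does not describe. The genotype labels (`AA`, `Aa`, `aa`), the treatment of `a` as the counted allele, and all function names are assumptions made for illustration.

```python
# Illustrative only: two common encodings for biallelic genotypes.
# These are NOT the paper's proposed encodings; labels and names are hypothetical.

GENOTYPES = ("AA", "Aa", "aa")  # assumed labels for the three genotype classes


def additive_encoding(genotype):
    """Map a genotype to the number of copies of allele 'a' (0, 1, or 2)."""
    return genotype.count("a")


def one_hot_encoding(genotype):
    """Map a genotype to a 3-element indicator vector, one slot per class."""
    return [1 if genotype == g else 0 for g in GENOTYPES]


def encode_samples(samples, encoder):
    """Encode a samples-by-markers table of genotype labels into numeric rows."""
    encoded = []
    for row in samples:
        features = []
        for genotype in row:
            value = encoder(genotype)
            # One-hot returns a list per marker; flatten it into the feature row.
            if isinstance(value, list):
                features.extend(value)
            else:
                features.append(value)
        encoded.append(features)
    return encoded


# Three samples, two markers each (made-up data).
samples = [["AA", "Aa"], ["aa", "AA"], ["Aa", "aa"]]
additive = encode_samples(samples, additive_encoding)
one_hot = encode_samples(samples, one_hot_encoding)
```

In a linear model, the additive encoding forces the heterozygote effect to lie midway between the two homozygote effects, whereas the one-hot encoding lets each genotype class carry its own coefficient at the cost of three features per marker. This is the kind of difference that makes the choice of encoding relevant to prediction.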
