Performance evaluation of different encoding strategies for quantitative genetic trait prediction
Given the genotype values of a set of biallelic molecular markers, such as Single Nucleotide Polymorphisms (SNPs), on a collection of plant, animal or human samples, quantitative genetic traits of these samples, such as weight, height and fruit size, can be predicted effectively. Quantitative genetic trait prediction has received great attention because it helps breeding companies develop more effective breeding strategies. Although much work has been proposed for the prediction task, relatively little attention has been paid to the effect of how the genotype values are encoded. Quantitative genetic trait prediction is usually formulated as a linear regression model. In the regression model, genotypes need to be encoded numerically according to their types: one heterozygous type and two homozygous types. The traditional encoding assigns 0 and 2 to the two homozygous types and 1 to the heterozygous type. In this work, we evaluated five existing genetic encoding models as well as two recently proposed encoding methods that treat genetic trait prediction as a multiple regression problem on categorical data. We also considered the scenario of epistasis, where multiple markers may interact with each other. For this scenario, we evaluated the performance of five statistically intuitive encoding strategies and eight biologically oriented encoding strategies, as well as extensions of the two recent encoding methods mentioned above. We showed that, overall, the two recent encoding methods achieve better prediction accuracy in both the single-marker scenario and the epistasis scenario.
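To make the notion of a genotype encoding concrete, the following minimal Python sketch contrasts the traditional 0/1/2 encoding described above with a plain dummy (one-hot) encoding that treats each genotype type as a category. The genotype labels "AA", "Aa" and "aa", the choice of which homozygous type receives 0, the function names, and the product-based interaction feature are all assumptions made for illustration; in particular, the one-hot encoding is a generic categorical encoding and is not claimed to be either of the two recent methods evaluated in this work.

```python
# Hypothetical genotype labels for one biallelic marker: two homozygous
# types ("AA", "aa") and one heterozygous type ("Aa"). Which homozygous
# type is coded 0 and which is coded 2 is an assumption of this sketch.
ADDITIVE_CODES = {"AA": 0, "Aa": 1, "aa": 2}
CATEGORIES = ("AA", "Aa", "aa")


def encode_additive(genotypes):
    """Traditional encoding: homozygous types as 0 and 2, heterozygous as 1."""
    return [ADDITIVE_CODES[g] for g in genotypes]


def encode_categorical(genotypes):
    """Illustrative dummy (one-hot) encoding: one indicator per genotype type."""
    return [[1 if g == c else 0 for c in CATEGORIES] for g in genotypes]


def pairwise_interaction(codes_a, codes_b):
    """Simple illustrative epistasis feature: product of two markers' codes."""
    return [a * b for a, b in zip(codes_a, codes_b)]


# Toy genotypes for two markers across four samples (made-up values).
marker_1 = ["AA", "Aa", "aa", "Aa"]
marker_2 = ["aa", "Aa", "AA", "aa"]

x1 = encode_additive(marker_1)
x2 = encode_additive(marker_2)
one_hot_1 = encode_categorical(marker_1)
interaction = pairwise_interaction(x1, x2)

print(x1)           # numeric column for marker 1
print(one_hot_1)    # three indicator columns for marker 1
print(interaction)  # one extra column for the marker pair
```

Each encoding yields the feature columns that a regression model would then be fitted on. The additive encoding forces the heterozygous effect to lie midway between the two homozygous effects, whereas the categorical encoding lets each genotype type have its own coefficient.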