Comparing active site sequence representations for kinase-ligand affinity prediction

Jannis Born, Tien Huynh, Astrid Stroobants, Wendy Cornell, Eric Martin, Matteo Manica

We have previously reported the extension of the PaccMann string-based model to proteochemometric activity classification and molecule generation of kinase inhibitors. That work trained a single model on hundreds of kinase family members, with each kinase represented by its active site residues rather than its full sequence. Here we compare the impact of the specific choice of active site residues, exploring the active site definitions of Sheridan et al. (29 residues) and Martin et al. (16 residues). For predicting the activity of unseen ligands, the Martin representation outperformed the Sheridan one, and the representation combining residues from Sheridan and Martin performed best of all. For predicting the activity of unseen kinases, none of the three representations was superior. These latest results support our earlier findings that superior performance in activity prediction can be achieved by representing the target with a subset of key residues rather than the full sequence, additionally offering improvements in speed.
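
As an illustration of representing a target by a subset of key residues, and of combining two residue definitions, the following minimal Python sketch extracts residues at selected positions from a sequence. Everything in it is an assumption made for illustration: the function name `extract_active_site`, the toy sequence, and the position sets are hypothetical and are not the actual Sheridan or Martin residue definitions, nor the preprocessing used for PaccMann.

```python
from typing import Iterable


def extract_active_site(sequence: str, positions: Iterable[int]) -> str:
    """Return the residues at the given 0-based positions, in ascending order."""
    ordered = sorted(set(positions))
    return "".join(sequence[i] for i in ordered)


# Hypothetical toy sequence and position sets (NOT the published definitions).
toy_sequence = "ACDEFGHIKLMNPQRSTVWY"
definition_a = {1, 4, 7, 10}    # stand-in for one active site definition
definition_b = {4, 7, 15, 18}   # stand-in for a second definition

# The combined representation is the union of both position sets.
combined = definition_a | definition_b

site_a = extract_active_site(toy_sequence, definition_a)
site_b = extract_active_site(toy_sequence, definition_b)
site_combined = extract_active_site(toy_sequence, combined)

print(site_a, site_b, site_combined)
```

Each resulting string is much shorter than the full sequence, which is the source of the speed benefit for a string-based model; in practice the positions would have to refer to a shared alignment so that they are comparable across kinases.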