Grasp detection from visual data is a recognition problem in which the goal is to identify image regions with high graspability with respect to a given quality metric. Existing deep-learning-based approaches mainly focus on predicting grasps, and the quality of their predictions is largely influenced by the choice of CNN architecture and the objective function used for learning grasp representations. This paper presents a deep learning framework, termed EnsembleNet, which learns to both produce and evaluate grasps in a unified manner. To achieve this, we formulate grasp detection as a two-step procedure: (i) grasp generation, where EnsembleNet generates four different grasp representations (regression grasp, joint regression-classification grasp, segmentation grasp, and heuristic grasp), and (ii) grasp evaluation, where EnsembleNet produces confidence scores for the generated grasps and selects the grasp with the highest score as the output. We evaluated EnsembleNet for grasp detection on RGB-D object datasets. The experiments show that the grasps produced by EnsembleNet are more accurate than those of the independent CNN models and state-of-the-art grasp detection methods.
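
The selection step of the two-step procedure can be illustrated with a minimal sketch. It is not the EnsembleNet implementation: the `Grasp` parameterization, the `select_grasp` helper, and all numeric values are hypothetical and chosen only for illustration. The sketch assumes that each of the four grasp representations has already been converted to a common candidate format and assigned a confidence score.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Grasp:
    # Hypothetical oriented-rectangle parameterization (an assumption,
    # not taken from the paper): center, rotation, and rectangle size.
    x: float
    y: float
    theta: float
    width: float
    height: float


def select_grasp(
    candidates: Dict[str, Grasp], scores: Dict[str, float]
) -> Tuple[str, Grasp]:
    """Return the (name, grasp) pair with the highest confidence score."""
    if not candidates:
        raise ValueError("no grasp candidates were generated")
    if candidates.keys() != scores.keys():
        raise ValueError("every candidate needs exactly one score")
    best = max(candidates, key=lambda name: scores[name])
    return best, candidates[best]


# Step (i): one candidate per grasp representation (made-up values).
candidates = {
    "regression": Grasp(120.0, 85.0, 0.30, 40.0, 20.0),
    "regression_classification": Grasp(118.0, 88.0, 0.35, 42.0, 20.0),
    "segmentation": Grasp(125.0, 80.0, 0.25, 38.0, 18.0),
    "heuristic": Grasp(110.0, 90.0, 0.00, 45.0, 22.0),
}

# Step (ii): confidence scores from the evaluation stage (made-up values).
scores = {
    "regression": 0.71,
    "regression_classification": 0.64,
    "segmentation": 0.83,
    "heuristic": 0.40,
}

best_name, best_grasp = select_grasp(candidates, scores)
print(best_name, best_grasp)
```

Keeping generation and scoring as separate mappings keyed by representation name mirrors the split between the two steps, so a candidate source can be added or removed without changing the selection logic.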