This paper presents an efficient framework for recognizing objects and detecting grasps from RGB-D images of real scenes. The framework uses a novel architecture of hierarchical cascaded forests, in which object-class and grasp-pose probabilities are computed at different levels of an image hierarchy (e.g., the patch and object levels) and fused to infer the class and the grasp of unseen objects. We introduce a novel training objective function that minimizes the uncertainty of the class labels and the ground-truth grasps at the leaves of the forests, enabling the framework to perform both recognition and grasp detection. The objective function is learned from features extracted from RGB-D point clouds of the objects. To this end, we propose a novel method for encoding an RGB-D point cloud into a representation that facilitates the use of large convolutional neural networks to extract discriminative features from RGB-D images. We evaluate our framework on challenging object datasets, where we demonstrate that it outperforms state-of-the-art methods in terms of object-recognition and grasp-detection accuracy. We also show experiments using live video streams from a Kinect sensor mounted on our in-house robotic platform.
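To make the hierarchical fusion idea concrete, the following is a minimal sketch of fusing class probabilities computed at two levels of an image hierarchy. The function names, the averaging-based fusion rule, and the weighting parameter are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def fuse_hierarchy_probs(patch_probs, object_probs, w_patch=0.5):
    """Hypothetical fusion sketch: average the patch-level class
    probabilities over all patches, blend them with the object-level
    estimate using weight w_patch, and renormalize."""
    patch_mean = np.mean(patch_probs, axis=0)          # (n_classes,)
    fused = w_patch * patch_mean + (1.0 - w_patch) * object_probs
    return fused / fused.sum()

# Toy example: 3 patches voting over 4 object classes.
patch_probs = np.array([[0.7, 0.1, 0.1, 0.1],
                        [0.6, 0.2, 0.1, 0.1],
                        [0.2, 0.5, 0.2, 0.1]])
object_probs = np.array([0.4, 0.3, 0.2, 0.1])

fused = fuse_hierarchy_probs(patch_probs, object_probs)
predicted_class = int(np.argmax(fused))  # class with highest fused posterior
```

The same fusion pattern could be applied to grasp-pose probabilities; in the paper's actual method, the per-level estimates come from the cascaded forests rather than the fixed arrays used here.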