Text-based games are becoming popular in reinforcement learning as simulation environments for real-world scenarios. They are usually imperfect-information games, and their interactions take place only in the textual modality. To tackle these games, it is effective to complement the missing information with knowledge from outside the game, such as human commonsense. However, previous works obtained such knowledge only from textual sources. In this paper, we study the advantage of employing commonsense reasoning obtained from visual datasets, such as scene graph datasets. In general, images convey more comprehensive information to humans at a glance than text does. This property allows us to extract commonsense relationship knowledge that is more useful for acting effectively in a game. To analyze the effectiveness of introducing scene graph datasets, we compare the statistics of spatial relationships available in Visual Genome, a scene graph dataset, with those in ConceptNet, a text-based knowledge base. We also conducted experiments on a text-based game task that requires commonsense reasoning. Our experimental results demonstrate that our proposed methods achieve higher or competitive performance compared with existing state-of-the-art methods.