Compositional action recognition is a novel challenge in the computer vision community that focuses on recognizing different combinations of verbs and nouns instead of treating subject-object interactions in videos only as individual instances. Existing methods tackle this task either by simply ignoring appearance information or by fusing object appearances with dynamic instance tracklets. However, these strategies usually do not perform well on unseen action instances. To address this, we propose a novel learning framework called Counterfactual Debiasing Network (CDN), which improves model generalization by removing the interference introduced by the visual appearances of objects and subjects. CDN explicitly learns the appearance information in action representations and later removes its effect in a causal inference manner. Specifically, we use tracklets and video content to model the factual inference, which considers both appearance and structure information. In contrast, the counterfactual inference leverages only video content with appearance information. With the two inferences, we construct a causal graph that captures the bias introduced by appearance information and removes it by subtracting the result of the counterfactual inference from that of the factual inference. In this way, CDN can better recognize unseen action instances by debiasing the effect of appearances. Extensive experiments on the Something-Else dataset show the effectiveness of CDN over existing state-of-the-art methods.
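
The subtraction step described above can be illustrated with a minimal sketch. It is not the CDN implementation: the function names `debias_scores` and `predict`, the three-class setup, and all score values are hypothetical and chosen only to show how subtracting the counterfactual (appearance-only) result from the factual (appearance plus structure) result can change the predicted action.

```python
def debias_scores(factual_scores, counterfactual_scores):
    """Subtract the counterfactual result from the factual one, class by class.

    Both arguments are per-class scores (e.g. logits). How the two branches
    produce these scores is outside the scope of this sketch.
    """
    if len(factual_scores) != len(counterfactual_scores):
        raise ValueError("both branches must score the same set of classes")
    return [f - c for f, c in zip(factual_scores, counterfactual_scores)]


def predict(scores):
    """Return the index of the highest-scoring class."""
    return max(range(len(scores)), key=lambda i: scores[i])


# Hypothetical scores for three action classes (illustrative values only).
# Factual branch: video content plus tracklets (appearance and structure).
factual = [2.0, 1.5, 0.5]
# Counterfactual branch: video content only (appearance).
counterfactual = [1.8, 0.2, 0.1]

debiased = debias_scores(factual, counterfactual)

print("factual prediction:", predict(factual))
print("debiased prediction:", predict(debiased))
```

In this toy case the factual scores favor class 0 mostly because the appearance-only branch already scores it highly. Once that contribution is subtracted, class 1, whose score depends more on structure, becomes the prediction.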