Sequence to sequence models attempt to capture the correlation between all the words in the input and output sequences. While this is quite useful for machine translation where the correlation among the words is indeed quite strong, it becomes problematic for conversation modelling where the correlation is often at a much abstract level. In contrast, humans tend to focus on the essential concepts discussed in the conversation context and generate responses accordingly. In this paper, we attempt to mimic this response generating mechanism by learning the essential concepts in the context and response in an unsupervised manner. The proposed model, referred to as Mask & Focus maps the input context to a sequence of concepts which are then used to generate the response concepts. Together, the context and the response concepts generate the final response. In order to learn context concepts from the training data automatically, we mask words in the input and observe the effect of masking on response generation. We train our model to learn those response concepts that have high mutual information with respect to the context concepts, thereby guiding the model to focus on the context concepts. Mask & Focus achieves significant improvement over the existing baselines in several established metrics for dialogues.