Collaborative Communication platforms (e.g., Slack) support multi-party conversations which contain a large number of messages on shared channels. Multiple conversations intermingle within these messages. The task of conversation disentanglement is to cluster these intermingled messages into conversations. Existing approaches are trained using loss functions that optimize only local decisions, i.e. predicting reply-to links for each message and thereby creating clusters of conversations. In this work, we propose an end-to-end reinforcement learning (RL) approach that directly optimizes a global metric. We observe that using existing global metrics such as variation of information and adjusted rand index as a reward for the RL agent deteriorates its performance. This behaviour is because these metrics completely ignore the reply-to links between messages (local decisions) during reward computation. Therefore, we propose a novel thread-level reward function that captures the global metric without ignoring the local decisions. Through experiments on the Ubuntu IRC dataset, we demonstrate that the proposed RL model improves the performance on both link-level and conversation-level metrics.