Learning disentangled multimodal representations for the fashion domain
In many visual domains, such as fashion and furniture, searching for products on online platforms requires matching textual queries to image content. For example, the user provides a search query in natural language (e.g., "pink floral top") and the results are of a different modality (e.g., a set of images of pink floral tops). Recent work on multimodal representation learning enables such cross-modal matching by learning a common representation space for text and images. While such representations ensure that the n-dimensional representation of "pink floral top" is very close to the representations of the corresponding images, they do not ensure that the first k1 (< n) dimensions correspond to color, the next k2 (< n) dimensions correspond to style, and so on. In other words, they learn entangled representations in which individual dimensions do not correspond to specific attributes. We propose two simple variants that learn disentangled common representations for the fashion domain, in which each dimension corresponds to a specific attribute (color, style, silhouette, etc.). The proposed variants can be integrated with any existing multimodal representation learning method. We use a large dataset of over 700K fashion items crawled from multiple fashion e-commerce portals to evaluate the learned representations on four applications from the fashion domain: cross-modal image retrieval, visual search, image tagging, and query expansion. Our experimental results show that the proposed variants improve performance on each of these applications while learning disentangled representations.
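To make the notion of a disentangled layout concrete, the following minimal sketch shows how a vector whose first k1 dimensions encode color, the next k2 encode style, and so on could be sliced and compared attribute by attribute. It is an illustration only, not the proposed method: the block sizes, the attribute ordering, the helper names (`split_by_attribute`, `per_attribute_similarity`), and the example vectors are all assumptions made for this sketch.

```python
import math

# Hypothetical layout: the block sizes (k1, k2, k3) are illustrative only.
ATTRIBUTE_BLOCKS = {"color": 2, "style": 3, "silhouette": 2}


def split_by_attribute(vector, blocks=ATTRIBUTE_BLOCKS):
    """Slice an n-dimensional vector into consecutive per-attribute sub-vectors."""
    if len(vector) != sum(blocks.values()):
        raise ValueError("vector length must equal the sum of block sizes")
    parts, start = {}, 0
    for name, size in blocks.items():
        parts[name] = vector[start:start + size]
        start += size
    return parts


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def per_attribute_similarity(text_vec, image_vec, blocks=ATTRIBUTE_BLOCKS):
    """Compare a text embedding and an image embedding block by block."""
    text_parts = split_by_attribute(text_vec, blocks)
    image_parts = split_by_attribute(image_vec, blocks)
    return {name: cosine(text_parts[name], image_parts[name]) for name in blocks}


# Made-up embeddings: the image agrees with the query on the color and
# silhouette blocks but differs on the style block.
query_vec = [1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0]
image_vec = [2.0, 0.0, 1.0, 0.0, 0.0, 1.0, 1.0]
scores = per_attribute_similarity(query_vec, image_vec)
```

With an entangled representation, only a single similarity over all n dimensions is meaningful; with a layout like the one assumed here, a mismatch can be attributed to a specific block (in this toy example, style).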