About cookies on this site Our websites require some cookies to function properly (required). In addition, other cookies may be used with your consent to analyze site usage, improve the user experience and for advertising. For more information, please review your options. By visiting our website, you agree to our processing of information as described in IBM’sprivacy statement. To provide a smooth navigation, your cookie preferences will be shared across the IBM web domains listed here.
Publication
WACV 2018
Conference paper
Learning disentangled multimodal representations for the fashion domain
Abstract
In many visual domains (like fashion, furniture, etc.) the search for products on online platforms requires matching textual queries to image content. For example, the user provides a search query in natural language (e.g.,pink floral top) and the results obtained are of a different modality (e.g., the set of images of pink floral tops). Recent work on multimodal representation learning enables such cross-modal matching by learning a common representation space for text and image. While such representations ensure that the n-dimensional representation of pink floral top is very close to representation of corresponding images, they do not ensure that the first k1 (< n) dimensions correspond to color, the next k2 (< n) correspond to style and so on. In other words, they learn entangled representations where each dimension does not correspond to a specific attribute. We propose two simple variants which can learn disentangled common representations for the fashion domain wherein each dimension would correspond to a specific attribute (color, style, silhoutte, etc.). Our proposed variants can be integrated with any existing multimodal representation learning method. We use a large fashion dataset of over 700K fashion items crawled from multiple fashion e-commerce portals to evaluate the learned representations on four different applications from the fashion domain, namely, cross-modal image retrieval, visual search, image tagging, and query expansion. Our experimental results show that the proposed variants lead to better performance for each of these applications while learning disentangled representations.