Fashion is considered to be the world’s second-largest polluter, after oil and gas, in part because of the severe environmental cost of managing unsold dead inventory. Fear of counterfeiting and loss of brand value forces fashion houses to take unsustainable measures such as destroying unsold inventory. At the heart of this problem lies the mismatch between supply and demand: inaccurate demand forecasts and the urge to avoid stock-outs lead to overproduction. Unlike other retail industries, which have rich historical sales time series for forecasting, the fashion industry is heavily trend-driven, and most products are new designs with no historical sales data to forecast from. In this paper, we propose algorithms for new product demand forecasting based on available product attributes, images, and external factors. Existing work on new product demand forecasting is mostly based on K-Nearest Neighbor (KNN) approaches, which lack the ability to model complex non-linear relations between multi-modal data sources, sales, and external factors. To address this, we propose and empirically evaluate several multi-modal attention-based encoder-decoder models that can effectively forecast the sales time series of new products in the fast fashion domain. We also study the impact of various multi-modal fusion techniques on new product time-series forecasting; these techniques enable effective gradient flow towards the individual data sources, leading to more information gain. To overcome the black-box nature of our models, we incorporate self-attention and cross-attention techniques and empirically validate their efficacy in producing effective explanations. We conduct experiments on a large-scale fashion data set (comprising 10,290 products distributed across 45 categories) and report results and findings that illustrate the benefits of modeling new product time-series forecasting as a multi-modal encoder-decoder sequence problem, as opposed to the conventional KNN approaches.