We present a conditional generative method that maps low-dimensional embeddings of image and natural language to a common latent space hence extracting semantic relationships between them. The embedding specific to a modality is first extracted and subsequently a constrained optimization procedure is performed to project the two embedding spaces to a common manifold. Based on this, we present a method to learn the conditional probability distribution of the two embedding spaces; first, by mapping them to a shared latent space and generating back the individual embeddings from this common space. However, in order to enable independent conditional inference for separately extracting the corresponding embeddings from the common latent space representation, we deploy a proxy variable trick - wherein, the single shared latent space is replaced by two separate latent spaces. We design an objective function, such that, during training we can force these separate spaces to lie close to each other, by minimizing the Euclidean distance between their distribution functions. Experimental results demonstrate that the learned joint model can generalize to learning concepts of double MNIST digits with additional attributes of colors, thereby enabling the generation of specific colored images from the respective text data.