23 Jul 2020
5 minute read

Image captioning as an assistive technology

IBM Research’s Science for Social Good initiative pushes the frontiers of artificial intelligence in service of positive societal impact. Partnering with non-profits and social enterprises, IBM Researchers and student fellows since 2016 have used science and technology to tackle issues including poverty, hunger, health, education, and inequalities of various sorts. For example, one project in partnership with the Literacy Coalition of Central Texas developed technologies to help low-literacy individuals better access the world by converting complex images and text into simpler and more understandable formats.

Working on a similar accessibility problem as part of the initiative, our team recently participated in the 2020 VizWiz Grand Challenge to design and improve systems that make the world more accessible for the blind. Posed with input from the blind, the challenge is focused on building AI systems for captioning images taken by visually impaired individuals.

IBM Research was honored to win the competition by overcoming several challenges that are critical in assistive technology but do not arise in generic image captioning problems. For full details, please check our winning presentation.

Image captioning has witnessed steady progress since 2015, thanks to the introduction of neural caption generators built on convolutional and recurrent neural networks.1,2 This progress, however, has been measured on a curated dataset, namely MS-COCO. The limited variety of images and contexts in this dataset restricts the usefulness of systems trained on MS-COCO as an assistive technology for the visually impaired. This motivated the introduction of the VizWiz Challenges for captioning images taken by people who are blind.

Figure 1: Comparison between outputs of Descriptive Image Captioning and Goal-Oriented Image Captioning

To sum up, in their current state image captioning technologies produce terse and generic descriptive captions. For image captioning to mature into an assistive technology, we need a paradigm shift towards goal-oriented captions, where the caption not only faithfully describes a scene from everyday life but also answers specific needs that help the blind to achieve a particular task, for example finding the expiration date of a food can or knowing whether the weather is decent from a picture taken through the window. The VizWiz Challenges datasets offer a great opportunity for us and the machine learning community at large to reflect on accessibility issues and on the challenges of designing and building an assistive AI for the visually impaired.

For our winning image captioning system, we had to rethink the design to take into account both accessibility and utility perspectives.

Figure 2: Example output of IBM's Goal-Oriented Image Captioning System, identifying a bottle of wine.

First, on accessibility: images taken by visually impaired people are captured with phones and may be blurry or rotated away from their upright orientation. Our machine learning pipeline therefore needs to be robust to those conditions and correct the angle of the image, while still giving the blind user a sensible caption when image conditions are not ideal. To address this, we use a ResNeXt network pretrained on billions of Instagram images taken with phones,3 and a pretrained network to correct the angles of the images.4
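The angle-correction step can be pictured with a minimal sketch. It assumes, purely for illustration, that a rotation classifier reports how many clockwise quarter turns were applied to an upright image, and that the image is a small 2D grid. The names `rotate90`, `upright`, and `predict_turns` are hypothetical and are not taken from our pipeline.

```python
from typing import Callable, List

Image = List[List[int]]  # toy stand-in for a pixel grid


def rotate90(image: Image, turns: int = 1) -> Image:
    """Rotate a 2D grid clockwise by `turns` quarter turns."""
    for _ in range(turns % 4):
        image = [list(row) for row in zip(*image[::-1])]
    return image


def upright(image: Image, predict_turns: Callable[[Image], int]) -> Image:
    """Undo the rotation reported by a (hypothetical) rotation classifier.

    `predict_turns` is assumed to return k in {0, 1, 2, 3}, meaning the
    input looks like an upright image rotated clockwise by k quarter turns.
    """
    k = predict_turns(image) % 4
    # Rotating by the remaining (4 - k) quarter turns completes a full circle.
    return rotate90(image, (4 - k) % 4)
```

In a real system the classifier would be a trained network; here any callable with the same contract can be plugged in.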

Second, on utility: we augment our system with reading and semantic scene understanding capabilities. Many of the VizWiz images contain text that is crucial to the blind person's goal and task at hand. We therefore equip our pipeline with optical character recognition (OCR), covering both text detection and recognition.5,6 We perform OCR on four orientations of the image and select the orientation in which the majority of recognized words are sensible dictionary words. To improve the semantic understanding of the visual scene, we also augment our pipeline with object detection and recognition.7
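The orientation vote described above can be sketched as follows. This is a simplified illustration and not our actual implementation. It assumes the four orientations are quarter turns, scores each one by the fraction of recognized tokens found in a dictionary, and takes the OCR engine as a caller-supplied function. The names `best_orientation`, `dictionary_score`, and `ocr` are hypothetical.

```python
from typing import Callable, List, Sequence, Set, Tuple

Image = List[List[int]]  # toy stand-in for a pixel grid


def rotate90(image: Image) -> Image:
    """Rotate a 2D grid 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def dictionary_score(words: Sequence[str], dictionary: Set[str]) -> float:
    """Fraction of recognized tokens that are dictionary words."""
    if not words:
        return 0.0
    hits = sum(1 for w in words if w.lower() in dictionary)
    return hits / len(words)


def best_orientation(
    image: Image,
    ocr: Callable[[Image], Sequence[str]],  # assumed OCR interface
    dictionary: Set[str],
) -> Tuple[int, List[str]]:
    """Run OCR at 0, 90, 180 and 270 degrees and keep the best-scoring one."""
    best_angle, best_words, best_score = 0, [], -1.0
    current = image
    for angle in (0, 90, 180, 270):
        words = list(ocr(current))
        score = dictionary_score(words, dictionary)
        if score > best_score:  # ties keep the earliest orientation
            best_angle, best_words, best_score = angle, words, score
        current = rotate90(current)
    return best_angle, best_words
```

The returned words are the ones read at the winning angle, so they can be passed on to the captioning model without running OCR again.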

Finally, we use a multimodal transformer to fuse the visual features with the detected texts and objects, which are embedded using fastText.8 To ensure that words coming from OCR and object detection are used, we incorporate a copy mechanism in the transformer that allows it to choose between copying an out-of-vocabulary token and predicting an in-vocabulary token.9 We train our system with cross-entropy pretraining followed by CIDEr optimization, using Self-Critical Sequence Training, a technique introduced by our team at IBM in 2017.10
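Two ideas from this paragraph can be illustrated with a small, hedged sketch. The first function shows one simplified way a decoder can choose between generating and copying: scores for vocabulary words and scores for source tokens (for example OCR output) are normalized together, and probabilities of identical words are summed. The second shows the self-critical training signal for one sampled caption, where the reward of the greedy caption serves as the baseline. The function names and the plain-list inputs are assumptions for illustration and do not reflect the exact formulation in our system.

```python
import math
from collections import defaultdict
from typing import Dict, Sequence


def copy_or_generate(
    vocab: Sequence[str],
    gen_scores: Sequence[float],
    source_tokens: Sequence[str],
    copy_scores: Sequence[float],
) -> Dict[str, float]:
    """Joint softmax over 'generate' and 'copy' scores for one decoding step."""
    if len(vocab) != len(gen_scores) or len(source_tokens) != len(copy_scores):
        raise ValueError("each token needs exactly one score")
    scores = list(gen_scores) + list(copy_scores)
    if not scores:
        raise ValueError("no candidate tokens")
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    probs: Dict[str, float] = defaultdict(float)
    # A word that is both in the vocabulary and in the source gets both masses.
    for word, e in zip(list(vocab) + list(source_tokens), exps):
        probs[word] += e / total
    return dict(probs)


def scst_loss(
    sample_logprobs: Sequence[float],
    sample_reward: float,
    greedy_reward: float,
) -> float:
    """Self-critical loss for one sampled caption.

    The reward (e.g. CIDEr) of the greedy caption is the baseline, so
    samples that beat greedy decoding have their log-probability raised.
    """
    advantage = sample_reward - greedy_reward
    return -advantage * sum(sample_logprobs)
```

With this setup an OCR token such as a brand name can receive probability mass even though it never appears in the fixed vocabulary.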

Our work on goal-oriented captions is a step towards assistive technologies for the blind, and it opens the door to many interesting research questions about meeting the needs of the visually impaired. It will be interesting to train our system using goal-oriented metrics and to make it more interactive, in the form of a visual dialog with mutual feedback between the AI system and the visually impaired user.

Figure 3: Examples of captions produced by the IBM captioning system.

IBM researchers involved in the VizWiz competition (listed alphabetically): Pierre Dognin, Igor Melnyk, Youssef Mroueh, Inkit Padhi, Mattia Rigotti, Jerret Ross and Yair Schiff.



  1. Vinyals, Oriol et al. “Show and Tell: A Neural Image Caption Generator.” 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2015).
  2. Karpathy, Andrej, and Li Fei-Fei. “Deep Visual-Semantic Alignments for Generating Image Descriptions.” IEEE Transactions on Pattern Analysis and Machine Intelligence 39.4 (2017).
  3. Dhruv Mahajan et al. “Exploring the Limits of Weakly Supervised Pre-training”. In: CoRR abs/1805.00932 (2018). arXiv: 1805.00932.
  4. Spyros Gidaris, Praveer Singh, and Nikos Komodakis. “Unsupervised Representation Learning by Predicting Image Rotations”. (2018). arXiv: 1803.07728.
  5. Jeonghun Baek et al. “What Is Wrong With Scene Text Recognition Model Comparisons? Dataset and Model Analysis”. In: International Conference on Computer Vision (ICCV). 2019.
  6. Youngmin Baek et al. “Character Region Awareness for Text Detection”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019, pp. 9365–9374.
  7. Mingxing Tan, Ruoming Pang, and Quoc V. Le. “EfficientDet: Scalable and Efficient Object Detection”. In: arXiv preprint arXiv:1911.09070 (2019).
  8. Piotr Bojanowski et al. “Enriching Word Vectors with Subword Information”. In: Transactions of the Association for Computational Linguistics 5 (2017), pp. 135–146. ISSN: 2307-387X.
  9. Jiatao Gu et al. “Incorporating Copying Mechanism in Sequence-to-Sequence Learning”. In: CoRR abs/1603.06393 (2016). arXiv: 1603.06393.
  10. Steven J. Rennie et al. “Self-critical Sequence Training for Image Captioning”. In: CoRR abs/1612.00563 (2016). arXiv: 1612.00563.