To sum up, in its current state of the art, image captioning technology produces terse and generic descriptive captions. For it to mature into an assistive technology, we need a paradigm shift toward goal-oriented captions: captions that not only faithfully describe a scene from everyday life, but also answer specific needs that help a blind user accomplish a particular task, for example, finding the expiration date on a food can or knowing whether the weather is decent by taking a picture out the window. The VizWiz Challenge datasets offer a great opportunity for us, and for the machine learning community at large, to reflect on the accessibility issues and challenges involved in designing and building assistive AI for the visually impaired.
In our winning image captioning system, we had to rethink the design from both an accessibility and a utility perspective.
First, on accessibility: images taken by visually impaired people are captured with phones and may be blurry and incorrectly oriented. Our machine learning pipeline therefore needs to be robust to these conditions, correcting the angle of the image while still providing the blind user with a sensible caption even when the image quality is far from ideal. To address this, we use a ResNeXt network pretrained on billions of Instagram images taken with phones,3 and a pretrained network to correct the orientation of the images,4 as sketched below.
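For readers who want a concrete picture of this step, here is a minimal sketch in PyTorch. It assumes the publicly released Instagram-pretrained ResNeXt weights available through torch.hub, and a hypothetical upright-image classifier `rotation_net` standing in for our rotation-correction network; the exact architecture we used is not reproduced here.

```python
import torch
import torchvision.transforms as T
from PIL import Image

# Instagram-pretrained ResNeXt backbone (weakly supervised, released via torch.hub).
backbone = torch.hub.load("facebookresearch/WSL-Images", "resnext101_32x8d_wsl")
backbone.fc = torch.nn.Identity()  # keep the 2048-d pooled features
backbone.eval()

preprocess = T.Compose([
    T.Resize(256),
    T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def correct_orientation(image: Image.Image, rotation_net) -> Image.Image:
    """Try the four 90-degree rotations and keep the one that the
    (hypothetical) upright-image classifier scores highest."""
    candidates = [image.rotate(angle, expand=True) for angle in (0, 90, 180, 270)]
    batch = torch.stack([preprocess(c) for c in candidates])
    with torch.no_grad():
        upright_scores = rotation_net(batch)[:, 0]  # assumed: column 0 = "upright"
    return candidates[int(upright_scores.argmax())]

def visual_features(image: Image.Image) -> torch.Tensor:
    """Pooled visual features later fed to the captioning model."""
    with torch.no_grad():
        return backbone(preprocess(image).unsqueeze(0)).squeeze(0)  # shape (2048,)
```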
Second, on utility: we augment our system with reading and semantic scene understanding capabilities. Many of the VizWiz images contain text that is crucial to the blind user's goal and task at hand, so we equip our pipeline with optical character detection and recognition (OCR).5, 6 We run OCR on the four 90-degree orientations of the image and select the orientation that yields the largest number of words found in an English dictionary, as sketched below. To improve the semantic understanding of the visual scene, we further augment our pipeline with object detection and recognition.7
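The orientation-selection heuristic is simple enough to sketch directly. The snippet below assumes a `run_ocr` callable that wraps some detector and recognizer (for example a CRAFT-based pipeline) and returns the recognized strings for a PIL image, plus a plain English word list; both names are placeholders rather than our exact implementation.

```python
def load_dictionary(path="words.txt"):
    """Plain English word list, one lowercase word per line (placeholder path)."""
    with open(path) as f:
        return {line.strip().lower() for line in f if line.strip()}

def pick_text_orientation(image, run_ocr, dictionary):
    """Run OCR on the four 90-degree rotations of a PIL image and keep the
    rotation whose recognized tokens contain the most dictionary words."""
    best_angle, best_tokens, best_hits = 0, [], -1
    for angle in (0, 90, 180, 270):
        tokens = run_ocr(image.rotate(angle, expand=True))
        hits = sum(token.lower() in dictionary for token in tokens)
        if hits > best_hits:
            best_angle, best_tokens, best_hits = angle, tokens, hits
    return best_angle, best_tokens
```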
Finally, we fuse the visual features with the detected texts and objects, embedded using fastText, in a multimodal transformer.8 To ensure that words coming from OCR and object detection can appear in the caption, we incorporate a copy mechanism in the transformer that lets it choose between copying an out-of-vocabulary token and predicting an in-vocabulary token.9 We train the system with cross-entropy pretraining followed by CIDEr optimization using Self-Critical Sequence Training (SCST), a technique introduced by our team at IBM in 2017.10
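To illustrate the copy mechanism, the following sketch shows a single decoding step in the spirit of Gu et al.'s pointer/copy formulation: a learned gate mixes the usual softmax over the fixed vocabulary with a copy distribution over the OCR and object tokens. Tensor shapes, argument names, and the gating module are illustrative assumptions, not the released architecture.

```python
import torch
import torch.nn.functional as F

def copy_augmented_step(decoder_state,       # (B, d) transformer state at this step
                        vocab_logits,        # (B, V) scores over the fixed vocabulary
                        ocr_object_feats,    # (B, N, d) fastText embeddings of OCR/object tokens
                        ocr_object_ids,      # (B, N) indices of those tokens in the extended vocabulary
                        gate_layer,          # torch.nn.Linear(d, 1)
                        extended_vocab_size):
    # Attention of the decoder state over OCR/object tokens -> copy scores.
    copy_logits = torch.einsum("bd,bnd->bn", decoder_state, ocr_object_feats)
    p_copy = F.softmax(copy_logits, dim=-1)          # (B, N)
    p_gen = F.softmax(vocab_logits, dim=-1)          # (B, V)

    # Gate deciding between generating from the vocabulary and copying a token.
    g = torch.sigmoid(gate_layer(decoder_state))     # (B, 1)

    # Combine both distributions over the extended vocabulary.
    p_final = torch.zeros(decoder_state.size(0), extended_vocab_size,
                          device=decoder_state.device)
    p_final[:, :p_gen.size(1)] += g * p_gen
    p_final.scatter_add_(1, ocr_object_ids, (1 - g) * p_copy)
    return p_final                                   # (B, extended_vocab_size)
```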
Our work on goal-oriented captions is a step toward blind assistive technologies, and it opens the door to many interesting research questions that meet the needs of the visually impaired. It will be interesting to train our system with goal-oriented metrics and to make it more interactive, in the form of a visual dialog with mutual feedback between the AI system and the visually impaired user.
IBM researchers involved in the VizWiz competition (listed alphabetically): Pierre Dognin, Igor Melnyk, Youssef Mroueh, Inkit Padhi, Mattia Rigotti, Jerret Ross, and Yair Schiff.
Oriol Vinyals et al. “Show and Tell: A Neural Image Caption Generator”. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2015. ↩
Andrej Karpathy and Li Fei-Fei. “Deep Visual-Semantic Alignments for Generating Image Descriptions”. In: IEEE Transactions on Pattern Analysis and Machine Intelligence 39.4 (2017). ↩
Dhruv Mahajan et al. “Exploring the Limits of Weakly Supervised Pre-training”. In: CoRR abs/1805.00932 (2018). arXiv: 1805.00932. ↩
Spyros Gidaris, Praveer Singh, and Nikos Komodakis. “Unsupervised Representation Learning by Predicting Image Rotations”. (2018). arXiv: 1803.07728. ↩
Jeonghun Baek et al. “What Is Wrong With Scene Text Recognition Model Comparisons? Dataset and Model Analysis”. In: International Conference on Computer Vision (ICCV). 2019. ↩
Youngmin Baek et al. “Character Region Awareness for Text Detection”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019, pp. 9365–9374. ↩
Mingxing Tan, Ruoming Pang, and Quoc V. Le. “EfficientDet: Scalable and Efficient Object Detection”. In: arXiv preprint arXiv:1911.09070 (2019). ↩
Piotr Bojanowski et al. “Enriching Word Vectors with Subword Information”. In: Transactions of the Association for Computational Linguistics 5 (2017), pp. 135–146. ISSN: 2307-387X. ↩
Jiatao Gu et al. “Incorporating Copying Mechanism in Sequence-to-Sequence Learning”. In: CoRR abs/1603.06393 (2016). arXiv: 1603.06393. ↩
Steven J. Rennie et al. “Self-critical Sequence Training for Image Captioning”. In: CoRR abs/1612.00563 (2016). arXiv: 1612.00563. ↩