
REAL-MM-RAG: A Real-World Multi-Modal Retrieval Benchmark

Abstract

Accurate multi-modal document retrieval is crucial for Retrieval-Augmented Generation (RAG), yet existing benchmarks do not fully capture real-world challenges with their current design. We introduce REAL-MM-RAG, an automatically generated benchmark designed to address four key properties essential for real-world retrieval: (i) multi-modal documents, (ii) enhanced difficulty, (iii) Realistic-RAG queries, and (iv) accurate labeling. Additionally, we propose a multi-difficulty-level scheme based on query rephrasing to evaluate models' semantic understanding beyond keyword matching. Our benchmark reveals significant model weaknesses, particularly in handling table-heavy documents and in robustness to query rephrasing. To mitigate these shortcomings, we curate a rephrased training set and introduce a new finance-focused, table-heavy dataset. Fine-tuning on these datasets enables models to achieve state-of-the-art retrieval performance on the REAL-MM-RAG benchmark. Our work offers a better way to evaluate and improve retrieval in multi-modal RAG systems while also providing training data and models that address current limitations. Our benchmark is available at this project page.

Related Work