Risks and potentials of using EMV for internet payments
Els van Herreweghen, Uta Wille
USENIX Workshop on Smartcard Technology 1999
Retrieval-augmented generation (RAG) has recently become a very popular task for Large Language Models (LLMs). Evaluating them on multi-turn RAG conversations, where the system is asked to generate a response to a question in the context of a preceding conversation, is an important and often overlooked task with several additional challenges. We present mtRAG, an end-to-end human-generated multi-turn RAG benchmark that reflects several real-world properties across diverse dimensions for evaluating the full RAG pipeline. mtRAG contains 110 conversations averaging 7.7 turns each across four domains for a total of 842 tasks. We also explore automation paths via synthetic data and LLM-as-a-Judge evaluation. Our human and automatic evaluations show that even state-of-the-art LLM RAG systems struggle on mtRAG. We demonstrate the need for strong retrieval and generation systems that can handle later turns, unanswerable questions, non-standalone questions, and multiple domains. mtRAG is available at https://github.com/ibm/mt-rag-benchmark.