Is the dataset suitable for evaluating RAG? According to the paper, it is intended for evaluating VLMs' VQA performance.