Towards Robust Visual Question Answering: Integrating LLMs with Advanced Image Processing Techniques
Keywords:
Visual Question Answering, Multimodal Learning, Zero-Shot Generalization, Image Processing, Large Language Models
Abstract
Large Language Models (LLMs) such as GPT-4 have extended their capabilities to tasks that require understanding across modalities, particularly Visual Question Answering (VQA). Despite their strength on language tasks, LLMs face challenges when applied directly to VQA because of the mismatch between how visual and textual data are processed. To bridge this gap, our research introduces Img2LLM, a novel framework that integrates advanced image processing techniques with LLMs to enhance VQA performance without extensive multimodal training. Img2LLM uses adaptive image descriptors to generate context-relevant, question-answer formatted prompts for LLMs, enabling effective zero-shot application to VQA tasks. Our approach significantly outperforms prior methods such as Flamingo and sets new zero-shot benchmarks on diverse datasets, including A-OKVQA, demonstrating both improved accuracy and efficiency in VQA.
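To make the prompting idea concrete, the sketch below shows one way image-derived text could be assembled into a question-answer formatted prompt for a frozen, text-only LLM. The `ImageEvidence` structure, the `build_vqa_prompt` helper, and the example captions and QA pairs are illustrative assumptions, not the paper's implementation; in the actual pipeline the captions and exemplar question-answer pairs would be produced automatically from the image.

```python
# Minimal sketch of Img2LLM-style prompt assembly for zero-shot VQA.
# The captions and synthetic QA pairs below are placeholders; in practice
# they would come from an image captioner and a question generator run
# over salient regions of the image.

from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ImageEvidence:
    captions: List[str]                 # natural-language descriptions of the image
    exemplar_qa: List[Tuple[str, str]]  # synthetic (question, answer) pairs grounded in the captions


def build_vqa_prompt(evidence: ImageEvidence, question: str) -> str:
    """Convert image-derived text into a question-answer formatted prompt
    that a frozen, text-only LLM can complete."""
    lines = ["Context: " + " ".join(evidence.captions), ""]
    for q, a in evidence.exemplar_qa:
        lines.append(f"Question: {q}")
        lines.append(f"Answer: {a}")
        lines.append("")
    lines.append(f"Question: {question}")
    lines.append("Answer:")
    return "\n".join(lines)


if __name__ == "__main__":
    evidence = ImageEvidence(
        captions=["A man in a red jacket is riding a bicycle on a wet street."],
        exemplar_qa=[
            ("What is the man wearing?", "a red jacket"),
            ("What is the man riding?", "a bicycle"),
        ],
    )
    prompt = build_vqa_prompt(evidence, "What is the weather like?")
    print(prompt)  # this prompt can then be passed to any frozen text-only LLM
```

The exemplar QA pairs play the role of in-context demonstrations, so no multimodal fine-tuning is required: the LLM only ever sees text.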
License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.