Towards Robust Visual Question Answering: Integrating LLMs with Advanced Image Processing Techniques

Authors

  • MENDYGALIYEVA AIGERIM, Astana, Kazakhstan

Keywords:

Visual Question Answering, Multimodal Learning, Zero-Shot Generalization, Image Processing, Large Language Models

Abstract

Large Language Models (LLMs) such as GPT-4 have extended their capabilities to tasks that require understanding across modalities, particularly Visual Question Answering (VQA). Despite their strength on language tasks, LLMs face challenges when applied directly to VQA because of the gap between visual and textual data processing. To bridge this gap, our research introduces Img2LLM, a framework that integrates advanced image processing techniques with LLMs to enhance VQA performance without extensive multimodal training. Img2LLM uses adaptive image descriptors to generate context-relevant, question-answer formatted prompts for LLMs, enabling effective zero-shot application to VQA tasks. Our approach outperforms strong multimodal baselines such as Flamingo and achieves competitive results on diverse datasets, including A-OKVQA, demonstrating both improved accuracy and efficiency in VQA.
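
To make the prompt-construction idea in the abstract concrete, the sketch below (Python) assembles a zero-shot VQA prompt from image captions and synthetic question-answer exemplars. It is a minimal illustration only: the function name build_vqa_prompt and all example inputs are hypothetical, and the caption and question-generation modules that would produce these inputs in the actual Img2LLM pipeline are not shown.

```python
# Minimal sketch of Img2LLM-style prompt assembly, assuming captions and
# synthetic question-answer exemplars have already been produced by an
# off-the-shelf captioner and question generator (not shown here).

from typing import List, Tuple


def build_vqa_prompt(captions: List[str],
                     exemplar_qa: List[Tuple[str, str]],
                     question: str) -> str:
    """Assemble a zero-shot VQA prompt: image descriptions first, then
    question-answer exemplars derived from the image, then the target question."""
    context = " ".join(captions)
    exemplars = "\n".join(f"Question: {q} Answer: {a}" for q, a in exemplar_qa)
    return (
        f"Context: {context}\n"
        f"{exemplars}\n"
        f"Question: {question} Answer:"
    )


if __name__ == "__main__":
    # Illustrative inputs only; a real pipeline would derive these from the image.
    captions = ["A man in a red jersey kicks a soccer ball on a grass field."]
    exemplar_qa = [("What sport is being played?", "soccer"),
                   ("What color is the jersey?", "red")]
    print(build_vqa_prompt(captions, exemplar_qa, "Where is the man playing?"))
```

The resulting text prompt can be passed to any off-the-shelf LLM, which is what allows the approach to avoid multimodal training: the visual content reaches the model only through the generated captions and exemplar question-answer pairs.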

Published

2024-10-20

How to Cite

MENDYGALIYEVA AIGERIM. (2024). Towards Robust Visual Question Answering: Integrating LLMs with Advanced Image Processing Techniques. Research Retrieval and Academic Letters, (7). Retrieved from https://ojs.publisher.agency/index.php/RRAL/article/view/4412