Audio Visual GK Questions

Multimodal Fine-Tuning of LLMs for Robust Document Visual Question Answering

Abstract: Document Visual Question Answering (DocVQA) necessitates comprehension of both the spatial layout and the textual content. Multimodal pretraining is a foundational component of existing ...

IEEE

Audio-Visual Semantic Graph Network for Audio-Visual Event Localization

Abstract: Audio-visual event localization (AVEL) aims to identify both the categories and temporal boundaries of events that are both audible and visible in unconstrained videos. However, the inherent ...
