Posts in the 'Paper Review/Multimodal' category

RA-TTA: Retrieval-Augmented Test-Time Adaptation for Vision-Language Models (ICLR 2025)
Youngjun Lee, Doyoung Kim, Junhyeok Kang, Jihwan Bang, Hwanjun Song, Jaegil Lee
https://openreview.net/forum?id=V3zobHnS61
This paper proposes Retrieval-Augmented TTA (RA-TTA), which adapts a VLM to the test distribution using external knowledge drawn from a large-scale web image database. This is my first encounter with the concept of TTA, so I will study it as I go and do my best to review it well. Let's start the review! 😊 📌 Abstract & Introducti..

LLaVA: Visual Instruction Tuning (NeurIPS 2023)
Haotian Liu, Chunyuan Li, Qingyang Wu, Yong Jae Lee
https://arxiv.org/abs/2304.08485
Today I will review LLaVA: Large Language and Vision Assistant, a paper that efficiently connects an LLM with a high-performance image encoder through Visual Instruction Tuning to enable multimodal chat capabilities! 😄 📌 Abstract & Introduction This study is the first to take on building a Vision Language assistant that can perform general-purpose tasks through Visual Instruction Tuning..

BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models (ICML 2023)
Junnan Li, Dongxu Li, Silvio Savarese, Steven Hoi
https://arxiv.org/abs/2301.12597
Today I will review BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models, a paper that efficiently connects an LLM with a high-performance image encoder to enable multimodal tasks! 😄 📌 Abstract & Introduction ..

Flamingo: a Visual Language Model for Few-Shot Learning (NeurIPS 2022)
Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, and 23 others
https://arxiv.org/abs/2204.14198
Today I will review Flamingo: a Visual Language Model for Few-Shot Learning, the paper behind Flamingo, a multimodal model specialized for few-shot learning that combines a large language model with a vision encoder to process images and text together and adapts quickly to new tasks from only a handful of examples! 📌 Abstract & Introduction..

BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation (ICML 2022)
Junnan Li, Dongxu Li, Caiming Xiong, Steven Hoi
https://arxiv.org/abs/2201.12086
Today I will review Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation, the paper behind BLIP, a model proposed for unified multimodal vision-language learning! 📌 Abstract In the Vision Language domain, pre-training, across various..

Robust Speech Recognition via Large-Scale Weak Supervision (ICML 2023)
Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, Ilya Sutskever
https://arxiv.org/abs/2212.04356
Today I will review Robust Speech Recognition via Large-Scale Weak Supervision, the paper behind Whisper, a model that achieves robust speech recognition and translation performance through large-scale weakly supervised training! 📌 Abstract The authors of this paper use the vast amount of transcribed speech data on the internet to simply..

Align before Fuse: Vision and Language Representation Learning with Momentum Distillation (NeurIPS 2021)
Junnan Li, Ramprasaath R. Selvaraju, Akhilesh D. Gotmare, Shafiq Joty, Caiming Xiong, Steven C.H. Hoi
https://arxiv.org/abs/2107.07651
Today I will review the ALBEF paper, Align before Fuse: Vision and Language Representation Learning with Momentum Distillation, which substantially improves performance by aligning the modalities before fusing the image and text representations..

CLIP: Learning Transferable Visual Models From Natural Language Supervision (ICML 2021)
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, and 6 others
https://arxiv.org/abs/2103.00020
Today I will review Learning Transferable Visual Models From Natural Language Supervision, the paper that introduced CLIP (Contrastive Language-Image Pretraining), which combines images and text to substantially improve zero-shot performance! 😁 📌 Ab..
