Title: Deep Learning for Vision and Language Reasoning
Abstract

Understanding visual information together with natural language has become a central goal in recent research communities. Notable efforts have been made toward bridging the fields of computer vision and natural language processing, opening the door to tasks ranging from visual question answering to video-grounded dialogue. However, it is widely accepted that in order to develop truly intelligent AI systems, we need to bridge the gap between perception and cognition. The purpose of this tutorial is to present the history of and recent approaches to various vision-and-language reasoning tasks, including visual/video question answering and visual/video dialogue. We will provide an intuitive explanation of these topics in detail, from basic building blocks such as attention and Transformers to recent trends such as causal learning. The tutorial will also cover recent advances in vision-language pre-training methods based on Transformer architectures, which show state-of-the-art performance on various downstream tasks. The limitations of current approaches are also discussed.

Program Schedule

First Part (1h30m)
    ● 1. Visual Question Answering
        ○ 1.1 Attention-based Approaches
        ○ 1.2 Debiasing Approaches
        ○ 1.3 Causal Inference Approaches
    ● 2. Video Question Answering
        ○ 2.1 Single-modal Video Question Answering
        ○ 2.2 Multi-modal Video Question Answering

Second Part (1h30m)
    ● 1. Visual Dialogue
        ○ 1.1 Attention-based Approaches
        ○ 1.2 Co-reference Approaches
        ○ 1.3 Causal Inference Approaches
    ● 2. Video-grounded Dialogue
        ○ 2.1 RNN Seq2seq Approaches
        ○ 2.2 Transformer Approaches
    ● 3. Vision-language Pre-training

Lecturers

Junyeong Kim
Post-doc Researcher, Korea Advanced Institute of Science and Technology, South Korea

Junyeong Kim is a post-doc researcher in the Artificial Intelligence and Machine Learning Lab., School of Electrical Engineering, Korea Advanced Institute of Science and Technology (KAIST). He received his B.S., M.S., and Ph.D. degrees in Electrical Engineering from KAIST in 2015, 2017, and 2021, respectively. His research interests lie in video-language inference, including video question answering, video-grounded dialogue, and vision-language reasoning. He focuses on developing AI agents that can 'observe' and 'converse' as a human does. He has written five top-tier conference papers, including at CVPR, AAAI, and ECCV. He received the Outstanding Ph.D. Thesis Award in 2021 from the School of Electrical Engineering, KAIST.
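Since attention is the basic building block named in the abstract, a minimal sketch of scaled dot-product attention (the core operation inside Transformers) may help orient attendees. This is a generic NumPy illustration, not code from the tutorial itself; the shapes and toy inputs are assumptions for demonstration only.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # similarity of each query to each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax over keys
    return weights @ V                              # weighted sum of value vectors

# Toy example: 2 queries attend over 3 key/value pairs of dimension 4.
rng = np.random.default_rng(0)
Q = rng.standard_normal((2, 4))
K = rng.standard_normal((3, 4))
V = rng.standard_normal((3, 4))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (2, 4): one output vector per query
```

In vision-and-language models, the queries typically come from one modality (e.g. question words) and the keys/values from the other (e.g. image regions or video frames), so each word learns to attend to the most relevant visual evidence.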