Large Language Models: Science and Stakes (June 3-14, 2024)

The Summer School is free for all online, with YouTube for live overflow. A Zoom link will appear here soon.

Abstract: Over the last decade, multimodal vision-language (VL) research has seen impressive progress. We can now automatically caption images in natural language, answer natural language questions about images, retrieve images using complex natural language queries, and even generate images from natural language descriptions. Despite this tremendous progress, current VL research faces several challenges that limit the applicability of state-of-the-art VL systems. Even large VL systems based on multimodal large language models (LLMs), such as GPT-4V, struggle to count objects in images and to identify fine-grained differences between similar images, and they lack sufficient visual grounding (i.e., they make up visual facts). In this talk, I will first present our work on building a parameter-efficient multimodal LLM. I will then present our more recent work studying and tackling the following outstanding challenges in VL research: visio-linguistic compositional reasoning, robust automatic evaluation, and geo-diverse cultural understanding.

Aishwarya Agrawal is an Assistant Professor in the Department of Computer Science and Operations Research at the University of Montreal. She is also a Canada CIFAR AI Chair and a core academic member of Mila — Quebec AI Institute, and she spends one day a week at Google DeepMind as a Research Scientist. Aishwarya's research interests lie at the intersection of computer vision, deep learning, and natural language processing, with the goal of developing artificial intelligence (AI) systems that can "see" (i.e., understand the contents of an image: who, what, where, doing what?) and "talk" (i.e., communicate that understanding to humans in free-form natural language).
