The 2026 meeting of the Computer Vision and Pattern Recognition conference (CVPR 2026) will be held Wednesday, June 3rd through Sunday, June 7th in Denver, Colorado. Google is proud to be a Platinum Sponsor of CVPR 2026, featuring research from across Google, including Google Research, Google DeepMind, and Cloud.
Attending in person? Visit the Google booth (#557) at the Colorado Convention Center to explore our latest advancements in computer vision and machine perception.
Continue below to learn more about how Google researchers are engaged at CVPR 2026 (Google affiliations highlighted in bold).
All session times are provided in Mountain Daylight Time (MDT).
Join us at the Google booth, #557, for live demos and Q&As (times are subject to change).
Fri, Jun 5 | 11:00AM — 11:30AM
Vision Banana: Image Generators are Generalist Vision Learners
A unified model that treats visual perception as an image generation task via text-guided instruction tuning. It achieves state-of-the-art performance on a diverse suite of 2D and 3D visual understanding benchmarks, showing that generative pre-training provides a powerful foundation for computer vision tasks. Bring your own image!
Presenter: Songyou Peng
Fri, Jun 5 | 12:00PM — 12:30PM
Proactive Multimodal Agents in Intelligent Eyewear
This demo features an AI agent that connects physical environments to digital tasks by interpreting first-person video from smart glasses. It demonstrates how egocentric context, such as recognized objects, can trigger automated app or browser actions for e-commerce, navigation, or media retrieval.
Presenters: Meiqi Guo, Lei Shu, Shoubin Yu, Boqing Gong
Fri, Jun 5 | 1:00PM — 1:30PM
Learning from Single-Life Videos - Can we train on the experiences of only a single individual?
Presenters: Dima Damen, Sayna Ebrahimi, Tengda Han
Fri, Jun 5 | 4:00PM — 4:30PM & Sat, Jun 6 | 12:00PM — 12:30PM
Discover Android XR
Experience a live demonstration of the latest Android XR features and experiences, including cutting-edge computer vision and spatial intelligence running natively on the new Android XR platform. We are showcasing how to use XR glasses as a portable private display, Gemini on Android XR, auto-spatialization of 2D content, and more.
Presenters: Federico Tombari, Lukas Hoyer, Ivana Tosic Rodgers
Sat, Jun 6 | 1:00PM — 1:30PM
TIPSv2: Advancing Vision-Language Pretraining with Enhanced Patch-Text Alignment
A foundational image-text encoder with spatial awareness, leading to strong results for vision and multimodal applications.
Presenters: Andre Araujo, Erik de Godoy, Gabriele Berton
Sat, Jun 6 | 4:30PM — 5:00PM
BlazeEdit: Generalist Image Editing on Mobile Devices with Image-to-Image Diffusion Models
A compact, 195M-parameter image-to-image diffusion model optimized for on-device use. By removing text-conditioning, this multi-task architecture enables object removal, outpainting, and relighting in just 290ms on a Pixel 10, offering a fast, private, and efficient editing experience on the edge.
Presenters: Fei Deng, Yanwu Xu, Zhipeng Bao, Karthik Raveendran
Sat, Jun 6 | 5:30PM — 6:00PM
Project Astra 3D
The Project Astra 3D team presents 3DCodeBench, a benchmark designed to demonstrate the proficiency of Gemini models in generating diverse 3D objects through code execution. This work illustrates a future where Gemini models autonomously interface with software to assist artists in the automated creation of 3D assets.
Presenters: Lei Shu, Yipeng Gao
Sun, Jun 7 | 12:00PM — 12:30PM
Multimodal at Frontier AI in Google DeepMind
Presenter: Alireza Fathi
Join us at the Google booth, #557, for live demos and Q&As (times are subject to change).
Kiosk 1: Fri, Jun 5 | 12:00PM — 1:00PM & Kiosk 1: Sat, Jun 6 | 12:00PM — 1:00PM
BlazeEdit: Generalist Image Editing on Mobile Devices with Image-to-Image Diffusion Models
A compact, 195M-parameter image-to-image diffusion model optimized for on-device use. By removing text-conditioning, this multi-task architecture enables object removal, outpainting, and relighting in just 290ms on a Pixel 10, offering a fast, private, and efficient editing experience on the edge.
Presenters: Fei Deng, Yanwu Xu, Zhipeng Bao, Karthik Raveendran
Kiosk 2: Fri, Jun 5 | 12:00PM — 1:00PM
Efficiently Reconstructing Dynamic Scenes One 🎯 D4RT at a Time
Presenters: Skanda Koppula, Mehdi S. M. Sajjadi
Kiosk 1: Fri, Jun 5 | 4:00PM — 5:00PM
Vision Banana: Image Generators are Generalist Vision Learners
A unified model that treats visual perception as an image generation task via text-guided instruction tuning. It achieves state-of-the-art performance on a diverse suite of 2D and 3D visual understanding benchmarks, showing that generative pre-training provides a powerful foundation for computer vision tasks. Bring your own image!
Presenter: Songyou Peng
Kiosk 2: Fri, Jun 5 | 4:00PM — 5:00PM
TIPSv2: Advancing Vision-Language Pretraining with Enhanced Patch-Text Alignment
A foundational image-text encoder with spatial awareness, leading to strong results for vision and multimodal applications.
Presenters: Andre Araujo, Erik de Godoy, Gabriele Berton
Kiosk 2: Sat, Jun 6 | 12:00PM — 1:00PM
Project Genie: Experimenting with infinite, interactive worlds
Presenter: Hang Qi
Kiosk 1: Sat, Jun 6 | 4:30PM — 5:30PM & Kiosk 1: Sun, Jun 7 | 12:00PM — 1:00PM
Discover Android XR
Experience a live demonstration of the latest Android XR features and experiences, including cutting-edge computer vision and spatial intelligence running natively on the new Android XR platform. We are showcasing how to use XR glasses as a portable private display, Gemini on Android XR, auto-spatialization of 2D content, and more.
Presenters: Federico Tombari, Lukas Hoyer, Ivana Tosic Rodgers
Oral: Fri, Jun 5 | 1:12PM — 1:25PM, Mile High Ballroom 3A - 4A (Oral Session 2D: Spatio-Temporal Reconstruction)
Efficiently Reconstructing Dynamic Scenes One 🎯 D4RT at a Time
Chuhan Zhang, Guillaume Le Moing, Skanda Koppula, Ignacio Rocco, Liliane Momeni, Junyu Xie*, Shuyang Sun, Rahul Sukthankar, Joëlle K. Barral, Raia Hadsell, Zoubin Ghahramani, Andrew Zisserman, Junlin Zhang, Mehdi S. M. Sajjadi
Poster: Fri, Jun 5 | 4:00PM — 6:00PM, Exhibit Hall A & F (Poster Session 2, #20)
Oral: Sat, Jun 6 | 9:12AM — 9:25AM, Mile High Ballroom 3A - 4A (Oral Session 3D: Multimodal Modeling)
FINER: MLLMs Hallucinate Under Fine-grained Negative Queries
Rui Xiao, Sanghwan Kim, Yongqin Xian, Zeynep Akata, Stephan Alaniz
Poster: Sat, Jun 6 | 11:45AM — 1:45PM, Exhibit Hall F (Poster Session 3, #20)
Oral: Sun, Jun 7 | 10:02AM — 10:15AM, Mile High Ballroom 3A - 4A (Oral Session 5D: Human-Centric Modeling & Lighting)
Relightable Holoported Characters: Capturing and Relighting Dynamic Human Performance from Sparse Views
Kunwar Maheep Singh, Jianchun Chen, Vladislav Golyanik, Stephan J. Garbin, Thabo Beeler, Rishabh Dabral, Marc Habermann, Christian Theobalt
Poster: Sun, Jun 7 | 11:45AM — 1:45PM, Exhibit Hall F (Poster Session 5, #24)
Oral: Fri, Jun 5 | 1:50PM — 2:02PM, Mile High Ballroom 1A - 2A (Oral Session 2C: Gaussian Splatting & Reconstruction)
Selfi: Self-Improving Reconstruction Engine via 3D Geometric Feature Alignment
Youming Deng, Songyou Peng, Junyi Zhang, Kathryn Heal, Tiancheng Sun, John Flynn, Steve Marschner, Lucy Chai
Poster: Fri, Jun 5 | 4:00PM — 6:00PM, Exhibit Hall A & F (Poster Session 2, #17)
Oral: Sat, Jun 6 | 9:00AM — 9:12AM, Mile High Ballroom 1A - 2A (Oral Session 3C: Generative Editing)
3D-LATTE: Latent Space 3D Editing from Textual Instructions
Maria Parelli, Michael Oechsle, Michael Niemeyer, Federico Tombari, Andreas Geiger
Poster: Sat, Jun 6 | 11:45AM — 1:45PM, Exhibit Hall F (Poster Session 3, #13)
Oral: Sat, Jun 6 | 2:12PM — 2:25PM, Bluebird Ballroom (Oral Session 4A: Geometric Understanding)
Featurising Pixels from Dynamic 3D Scenes with Linear In-Context Learners
Nikita Araslanov, Martin Sundermeyer, Hidenobu Matsuki, David Joseph Tan, Federico Tombari
Poster: Sat, Jun 6 | 4:45PM — 6:45PM, Exhibit Hall A & F (Poster Session 4, #2)
Oral: Sat, Jun 6 | 9:37AM — 9:50AM, Mile High Ballroom 3A - 4A (Oral Session 3D: Multimodal Modeling)
PAS: A Training-Free Stabilizer for Temporal Encoding in Video LLMs
Bowen Sun, Yujun Cai, Ming-Hsuan Yang, Hang Wu, Yiwei Wang
Poster: Sat, Jun 6 | 11:45AM — 1:45PM, Exhibit Hall F (Poster Session 3, #22)
Fri, Jun 5 | 10:45AM — 12:45PM, Exhibit Hall A & F (Poster Session 1, #443)
Agile Deliberation: Concept Deliberation for Subjective Visual Classification
Leijie Wang*, Otilia Stretcu, Wei Qiao, Thomas Denby, Krishnamurthy Viswanathan, Enming Luo, Chun-Ta Lu, Tushar Dogra, Ranjay Krishna, Ariel Fuxman
Fri, Jun 5 | 10:45AM — 12:45PM, Exhibit Hall A & F (Poster Session 1, #152)
Differences That Matter: Auditing Models for Capability Gap Discovery and Rectification
Qihao Liu*, Chengzhi Mao, Yaojie Liu, Alan Yuille, Wen-Sheng Chu
Fri, Jun 5 | 4:00PM — 6:00PM, Exhibit Hall A & F (Poster Session 2, #505)
OVI-MAP: Open-Vocabulary Instance-Semantic Mapping
Zilong Deng, Federico Tombari, Marc Pollefeys, Johanna Wald, Daniel Barath
Fri, Jun 5 | 4:00PM — 6:00PM, Exhibit Hall A & F (Poster Session 2, #102)
Radiance Meshes for Volumetric Reconstruction
Alexander Mai, Trevor Hedstrom, George Kopanas, Janne Kontkanen, Falko Kuester, Jonathan T. Barron
Fri, Jun 5 | 4:00PM — 6:00PM, Exhibit Hall A & F (Poster Session 2, #618)
Representing 3D Faces with Learnable B-Spline Volumes
Prashanth Chandran, Daoye Wang, Timo Bolkart
Sat, Jun 6 | 4:45PM — 6:45PM, Exhibit Hall A & F (Poster Session 4, #119)
MDS-VQA: Model-Informed Data Selection for Video Quality Assessment
Jian Zou, Xiaoyu Xu, Zhihua Wang, Yilin Wang, Balu Adsumilli, Kede Ma
Sun, Jun 7 | 11:45AM — 1:45PM, Exhibit Hall F (Poster Session 5, #396)
CURVE: A Benchmark for Cultural and Multilingual Long Video Reasoning
Darshan Singh, Arsha Nagrani, Kawshik Manikantan, Harman Singh, Dinesh Tewari, Tobias Weyand, Cordelia Schmid, Anelia Angelova, Shachi Dave
Sun, Jun 7 | 11:45AM — 1:45PM, Exhibit Hall F (Poster Session 5, #379)
Elastic3D: Controllable Stereo Video Conversion with Guided Latent Decoding
Nando Metzger*, Prune Truong, Goutam Bhat, Konrad Schindler, Federico Tombari
Sun, Jun 7 | 11:45AM — 1:45PM, Exhibit Hall F (Poster Session 5, #501)
Learning Latent Transmission and Glare Maps for Lens Veiling Glare Removal
Xiaolong Qian, Qi Jiang, Lei Sun, Zongxi Yu, Kailun Yang, Peixuan Wu, Jiacheng Zhou, Yao Gao, Yaoguang Ma, Ming-Hsuan Yang, Kaiwei Wang
Sun, Jun 7 | 3:30PM — 5:30PM, Exhibit Hall A (Poster Session 6, #659)
Image Diffusion Preview with Consistency Solver
Fu-Yun Wang*, Hao Zhou, Liangzhe Yuan, Sanghyun Woo, Boqing Gong, Bohyung Han, Ming-Hsuan Yang, Han Zhang, Yukun Zhu, Ting Liu, Long Zhao
Sun, Jun 7 | 3:30PM — 5:30PM, Exhibit Hall A (Poster Session 6, #63)
A Mixed Diet Makes DINO an Omnivorous Vision Encoder
Rishabh Kabra, Maks Ovsjanikov, Drew A. Hudson, Ye Xia, Skanda Koppula, Andre Araujo, João Carreira, Niloy J. Mitra
Sun, Jun 7 | 3:30PM — 5:30PM, Exhibit Hall A (Poster Session 6, #651)
Visual Diffusion Models are Geometric Solvers
Nir Goren, Shai Yehezkel, Omer Dahary, Andrey Voynov, Or Patashnik, Daniel Cohen-Or
Archon: A Unified Multimodal Model for Holistic Digital Human Generation
Chong Bao*, Shichen Liu, Lijun Yu, David Futschik, Stylianos Moschoglou, Shefali Srivastava, Ziqian Bai, Feitong Tan, Guofeng Zhang, Zhaopeng Cui, Sean Fanello, Yinda Zhang
Beyond Objects: Contextual Synthetic Data Generation for Fine-Grained Classification
William Yang, Xindi Wu, Zhiwei Deng, Esin Tureci, Olga Russakovsky
CaptionFormer: Unified Segmentation, Tracking, and Captioning for Spatio-Temporal Objects
Gabriel Fiastre, Antoine Yang, Cordelia Schmid
CRIT: Graph-Based Automatic Data Synthesis to Enhance Cross-Modal Multi-Hop Reasoning
Junyoung Sung, Seungwoo Lyu, Minjun Kim, Sumin An, Arsha Nagrani, Paul Hongsuck Seo
Designing Instance-Level Sampling Schedules via REINFORCE with James-Stein Shrinkage
Peiyu Yu*, Suraj Kothawade, Sirui Xie, Ying Nian Wu, Hongliang Fei
Efficiently Reconstructing Dynamic Scenes One D4RT at a Time
Chuhan Zhang, Guillaume Le Moing, Skanda Koppula, Ignacio Rocco, Liliane Momeni, Junyu Xie*, Shuyang Sun, Rahul Sukthankar, Joëlle K. Barral, Raia Hadsell, Zoubin Ghahramani, Andrew Zisserman, Junlin Zhang, Mehdi S. M. Sajjadi
Ego2Web: A Web Agent Benchmark Grounded in Egocentric Videos
Shoubin Yu*, Lei Shu, Antoine Yang, Yao Fu, Srinivas Sunkara, Maria Wang, Jindong Chen, Mohit Bansal, Boqing Gong
ESAM++: Efficient Online 3D Perception on the Edge
Qin Liu, Lavisha Aggarwal, Saptarashmi Bandyopadhyay, Vikas Bahirwani, Marc Niethammer, Ehsan Adeli, Andrea Colaco
Eulerian Gaussian Splatting using Hashed Probability Pyramids
Mia Gaia Polansky, George Kopanas, Stephan Garbin, Todd Zickler, Dor Verbin
Feed-forward Gaussian Registration for Head Avatar Creation and Editing
Malte Prinzler*, Paulo Gotardo, Siyu Tang, Timo Bolkart
Gaze Target Estimation Anywhere with Concepts
Xu Cao, Houze Yang, Vipin Gunda, Zhongyi Zhou, Tianyu Xu, Adarsh Kowdle, Inki Kim, Jim M. Rehg
GUIDE: A Benchmark for Understanding and Assisting Users in Open-Ended GUI Tasks
Saelyne Yang, Jaesang Yu, Yi-Hao Peng, Kevin Qinghong Lin, Jae Won Cho, Yale Song, Juho Kim
Minerva-Ego: Spatiotemporal Hints for Egocentric Video Understanding
Arsha Nagrani, Jasper Uijlings, Shyamal Buch, Tobias Weyand, Sudheendra Vijayanarasimhan, Bo Hu, Ramin Mehran, David A. Ross, Cordelia Schmid
Mining Attribute Subspaces for Efficient Fine-tuning of 3D Foundation Models
Yu Jiang, Hanwen Jiang, Ahmed Abdelkader, Wen-Sheng Chu, Brandon Y. Feng, Zhangyang Wang, Qixing Huang
Mobile-VTON: High-Fidelity On-Device Virtual Try-On
Zhenchen Wan, Ce Chen, Runqi Lin, Jiaxin Huang, Tianxi Chen, Yanwu Xu, Tongliang Liu, Mingming Gong
MOSAIC-GS: Monocular Scene Reconstruction via Advanced Initialization for Complex Dynamic Environments
Svitlana Morkva, Maximum Wilder-Smith, Michael Oechsle, Alessio Tonioni, Marco Hutter, Vaishakh Patil
MotionV2V: Editing Motion in a Video
Ryan Burgert, Charles Herrmann, Forrester Cole, Michael S. Ryoo, Neal Wadhwa, Andrey Voynov, Nataniel Ruiz
ORBIT: Benchmarking SfM in the Wild with 360° Video
Sara Sabour, Richard Tucker, Marcus Brubaker, Saurabh Saxena, Junhwa Hur, Andrea Tagliasacchi, Deqing Sun, David J. Fleet, Richard Szeliski, Noah Snavely
Physical Simulator In-the-Loop Video Generation
Lin Geng Foo, Mark He Huang, Alexandros Lattas, Stylianos Moschoglou, Thabo Beeler, Christian Theobalt
POGA: Paraphrased and Oppositional Graph Alignment for Fine-Grained Cross-Modal Retrieval
Junfeng Zhang, Zhe Xue, Yuankai Qi, Junping Du, Xiangyang Kong, Yishuo Yan, Amin Beheshti, Jian Yang, Anton van den Hengel, Ming-Hsuan Yang
Progressive Neural Architecture Search
Chenxi Liu*, Barret Zoph, Maxim Neumann, Jonathon Shlens, Wei Hua, Li-Jia Li, Li Fei-Fei, Alan Yuille, Jonathan Huang, Kevin Murphy
PyraTok: Language-Aligned Pyramidal Tokenizer for Video Understanding and Generation
Onkar Susladkar, Tushar Prakash, Adheesh Juvekar, Kiet A. Nguyen, Dong-Hwan Jang, Inderjit S. Dhillon, Ismini Lourentzou
Recurrent Video Masked Autoencoders
Daniel Zoran, Nikhil Parthasarathy, Yi Yang, Drew A. Hudson, João Carreira, Andrew Zisserman
Robust Promptable Video Object Segmentation
Sohyun Lee, Yeho Gwon, Lukas Hoyer, Konrad Schindler, Christos Sakaridis, Suha Kwak
SAGA: Source Attribution of Generative AI Videos
Rohit Kundu, Vishal Mohanty, Hao Xiong, Shan Jia, Athula Balachandran, Amit K. Roy-Chowdhury
Seeing Beyond 8bits: Subjective and Objective Quality Assessment of HDR-UGC Videos
Shreshth Saini, Bowen Chen, Neil Birkbeck, Yilin Wang, Balu Adsumilli, Alan C. Bovik
Seeing without Pixels: Perception from Camera Trajectories
Zihui Xue*, Kristen Grauman, Dima Damen, Andrew Zisserman, Tengda Han
Sensor2Sensor: Cross-Embodiment Sensor Conversion for Autonomous Driving
Jiahao Wang, Bo Sun, Yijing Bai, Vincent Casser, Songyou Peng, Zehao Zhu, Meng-Li Shih, Xander Masotto, Shih-Yang Su, Kanaad Parvate, Tiancheng Ge, Linn Bieske, Dragomir Anguelov, Mingxing Tan, Chiyu Max Jiang
SpatialStack: Layered Geometry-Language Fusion for 3D VLM Spatial Reasoning
Jian Zhang, Shijie Zhou, Bangya Liu, Achuta Kadambi, Zhiwen Fan
Spherical Voronoi: Directional Appearance as a Differentiable Partition of the Sphere
Francesco Di Sario, Daniel Rebain, Dor Verbin, Marco Grangetto, Andrea Tagliasacchi
Talking Together: Synthesizing Co-Located 3D Conversations from Audio
Mengyi Shan, Shouchieh Chang, Ziqian Bai, Shichen Liu, Yinda Zhang, Luchuan Song, Rohit Pandey, Sean Fanello, Zeng Huang
TIPSv2: Advancing Vision-Language Pretraining with Enhanced Patch-Text Alignment
Bingyi Cao, Koert Chen, Kevis-Kokitsi Maninis, Kaifeng Chen*, Arjun Karpur*, Ye Xia, Sahil Dua, Tanmaya Dabral, Guangxing Han, Bohyung Han*, Joshua Ainslie, Alex Bewley, Mithun Jacob, René Wagner, Washington Ramos, Krzysztof Choromanski, Mojtaba Seyedhosseini, Howard Zhou, André Araujo
Understanding, Accelerating, and Improving MeanFlow Training
Jin-Young Kim, Hyojun Go, Lea Bogensperger, Julius Erbach, Nikolai Kalischek, Federico Tombari, Konrad Schindler, Dominik Narnhofer
Unique Lives, Shared World: Learning From Single-Life Videos
Tengda Han, Sayna Ebrahimi, Dilara Gokay, Li Yang Ku, Maks Ovsjanikov, Iva Babukova, Daniel Zoran, Viorica Pătraucean, João Carreira, Andrew Zisserman, Dima Damen
VISTA: A Test-Time Self-Improving Video Generation Agent
Do Xuan Long*, Xingchen Wan, Hootan Nakhost, Chen-Yu Lee, Tomas Pfister, Sercan Ö. Arık
VLIC: Vision-Language Models as Perceptual Judges for Human-Aligned Image Compression
Kyle Sargent, Ruiqi Gao, Philipp Henzler, Charles Herrmann, Aleksander Hołyński, Li Fei-Fei, Jiajun Wu, Jason Zhang
VULCAN: Tool-Augmented Multi Agents for Iterative 3D Object Arrangement
Zhengfei Kuang*, Rui Lin, Long Zhao, Gordon Wetzstein, Saining Xie, Sanghyun Woo
Watch and Learn: Learning to Use Computers from Online Videos
Chan Hee Song*, Yiwen Song, Palash Goyal, Yu Su, Oriana Riva, Hamid Palangi, Tomas Pfister
WaTeRFlow: Watermark Temporal Robustness via Flow Consistency
Utae Jeong, Sumin In, Hyunju Ryu, Jaewan Choi, Feng Yang, Jongheon Jeong, Seungryong Kim, Sangpil Kim
What Are You Doing? A Closer Look at Controllable Human Video Generation
Emanuele Bugliarello, Anurag Arnab, Roni Paiss, Pieter-Jan Kindermans, Cordelia Schmid
ZipMap: Linear-Time 3D Reconstruction via Test-Time Training
Haian Jin, Rundi Wu, Tianyuan Zhang, Ruiqi Gao, Jonathan T. Barron, Noah Snavely, Aleksander Hołyński
Sun, Jun 7 | 7:30AM — 9:00AM, Exhibit Hall A (Findings Posters, #179)
TAPNext++: What's Next for Tracking Any Point (TAP)?
Sebastian Jung*, Artem Zholus, Martin Sundermeyer, Carl Doersch, Ross Goroshin, David Joseph Tan, Sarath Chandar, Rudolph Triebel, Federico Tombari
Sun, Jun 7 | 7:30AM — 9:00AM, Exhibit Hall A (Findings Posters, #223)
Di3PO - Diptych Diffusion DPO for Targeted Improvements in Image Generation
Sanjana Reddy, Ishaan Malhi, Sally Ma, Praneet Dutta
Wed, Jun 3 | 7:00AM — 11:00AM, Room 201
Accelerated Diffusion Models: From Theory to Interactive World Models
Panelists: Robin Rombach, Jiaming Song, Ruiqi Gao, Zhengyang Geng
Wed, Jun 3 | 12:30PM — 5:00PM, Room 301/302
From Perception to Simulation: The Emergence of World Models in Multi-Modal Reasoning
Organizers: Yujun Cai, Jianfei Cai, Yiwei Wang, Ming-Hsuan Yang
Thu, Jun 4 | 12:30PM — 5:00PM, Room 201
The Road to Convergence: Evolution of Unified Multimodal Models
Organizers: Jindong Wang, Hao Chen, Jiakui Hu, Zhaolong Su, Sharon Li
Fri, Jun 5 | 4:00PM — 6:00PM, Exhibit Hall F
FoundYou: A Unified Model for Personalized Segmentation and Retrieval
Presenters: Gabriele Trivigno, Marcos Alfaro, Claudia Cuttano, Gabriele Berton, Luis Paya, Carlo Masone
Sun, Jun 7 | 11:45AM — 1:45PM, Exhibit Hall F
SparkVSR: Interactive Video Super-Resolution via Sparse Keyframe Propagation
Presenters: Jiongze Yu, Xiangbo Gao, Pooja Verlani, Akshay Gadde, Yilin Wang, Balu Adsumilli, Zhengzhong Tu
Wednesday, June 3
Wed, Jun 3 | 7:30AM — 12:30PM, Room 506
GRAIL-V: Grounded Retrieval & Agentic Intelligence for Vision-Language
Panelist: Ming-Hsuan Yang
Wed, Jun 3 | 8:00AM — 12:55PM, Room 111
Generative AI for XR and Identity-based Applications
Speaker: Karan Ahuja
Wed, Jun 3 | 8:10AM — 12:35PM, Room 601
Multimodal Spatial Intelligence
Organizers: Phillip Y. Lee, Songyou Peng, Leonidas Guibas
Wed, Jun 3 | 8:20AM — 12:30PM, Room 113
Multimodal Alignment for a Pluralistic Society (MAPS)
Speaker: Lora Aroyo
Organizers: Negar Rostamzadeh, Aishwarya Agrawal
Wed, Jun 3 | 8:25AM — 1:00PM, Room 203
IPA: Interactive Physical AI
Speaker: Maja Matarić
Wed, Jun 3 | 8:30AM — 12:30PM, Room 607
Foundation Models for Medical Vision
Organizer: Yuyin Zhou
Wed, Jun 3 | 8:30AM — 12:30PM, Room 607
Generative AI for Sign Language
Organizer: Stefanos Zafeiriou
Wed, Jun 3 | 8:30AM — 12:00PM, Room 705/707
Video World Models: Interaction, Memory, and Efficiency
Speakers: Jack Parker-Holder, Sherry Yang
Wed, Jun 3 | 8:30AM — 1:00PM, Room 102/104
Vision-based Assistants in the Real-World
Speakers: Michael Ryoo, Yao Qin
Wed, Jun 3 | 8:30AM — 4:50PM, Room 703
Visual General Intelligence
Speaker: Robert Geirhos
Wed, Jun 3 | 9:00AM — 6:00PM, Mile High 3B
Urban Scene Modeling: Structured, Semantic, and Synthetic 3D Habitats
Speaker: Daniel Barath
Wed, Jun 3 | 9:15AM — 4:00PM, Room 605
Medical Computer Vision
Speaker: Maddie Traverse
Wed, Jun 3 | 9:30AM — 12:00PM, Room 109
Subtle Visual Computing @CVPR 2026
Speaker: Xin Liu
Wed, Jun 3 | 1:00PM — 6:00PM, Mile High 1AB
Machine Unlearning for Vision
Organizer: Bernt Schiele
Wed, Jun 3 | 1:00PM — 6:00PM, Room 709
MetaFood (MTF)
Speaker: Dima Damen
Organizer: Jinheng Xie
Wed, Jun 3 | 1:00PM — 4:40PM, Room 105
Monitoring the World through an Imperfect Lens
Organizer: Bill Freeman
Wed, Jun 3 | 1:00PM — 6:00PM, Four Seasons 4
"What is Next in Multimodal Foundation Models?"
Organizer: Sivan Doveh
Wed, Jun 3 | 1:20PM — 5:30PM, Mile High 4AB
Rediscovering Intelligence: Can AI Still Learn from Humans?
Speaker: Dima Damen
Wed, Jun 3 | 1:25PM — 5:15PM, Room 506
Test-Time Scaling for Computer Vision
Organizer: Jindong Gu
Wed, Jun 3 | 1:30PM — 5:00PM, Mile High 4EF
Multi-Agent Robotic Systems: Scaling with Compositional Intelligence
Speaker: Dhruv Shah
Organizer: Fangchen Liu
Wed, Jun 3 | 1:30PM — 5:45PM, Room 705/707
Open-World 3D Scene Understanding with Foundation Models
Speaker: Aleksander Hołyński
Organizers: Johanna Wald, Federico Tombari, Leonidas J. Guibas
Wed, Jun 3 | 1:45PM — 5:30PM, Room 607
Transformers for Vision and Multimodal AI
Speaker: Sherry Yang
Thursday, June 4
Thu, Jun 4 | 7:50AM — 12:30PM, Room 704/706
Long-Form Video Understanding, Generation and Action
Speaker: Ruben Villegas
Thu, Jun 4 | 7:55AM — 12:45PM, Room 502
Any-to-Any Multimodal Learning
Organizer: Chenyu Wang
Thu, Jun 4 | 8:00AM — 12:10PM, Exhibit Hall A 106
Computer Vision for Children
Speaker: Dima Damen
Panelist: Boqing Gong
Organizer: Zhongyi Zhou
Advisory Board: Yinda Zhang
Thu, Jun 4 | 8:00AM — 12:30PM, Room 607
Geometry-Free Novel View Synthesis and Controllable Video Models
Speaker: Aleksander Hołyński
Organizer: Leonidas Guibas
Thu, Jun 4 | 8:00AM — 12:00PM, Room 704/706
Knowledge-Intensive Multimodal Reasoning
Organizer: Wenhao Chai
Thu, Jun 4 | 8:00AM — 1:00PM, Room 504
Low‑Level Vision Frontiers with Generative AI, Preference Optimization, and Agentic Systems
Speaker: Kangfu Mei
Thu, Jun 4 | 8:00AM — 5:20PM, Mile High 3B
Video Generative Models: Benchmarks and Evaluation
Speaker: Ming-Hsuan Yang
Organizers: Sicong Jiang, Yilin Wang, Pooja Verlani
Thu, Jun 4 | 8:25AM — 12:35PM, Mile High 4CD
Personalization in Generative AI
Speaker: Nataniel Ruiz
Thu, Jun 4 | 8:30AM — 5:30PM, Room 107
Embodied Artificial Intelligence
Speakers: Lewis Chiang, Ruiqi Gao
Thu, Jun 4 | 8:30AM — 12:50PM, Room 110
Physically Grounded Human Perception and Modeling
Speaker: Dima Damen
Organizer: Thabo Beeler
Thu, Jun 4 | 8:30AM — 1:00PM, Room 103
Safe Artificial Intelligence for All Domains
Organizer: Larissa Triess
Thu, Jun 4 | 8:30AM — 5:00PM, Four Seasons 4
Video Large Language Models
Speaker: Ruben Villegas
Organizers: Venkata Sai Nikhil Thodupunuri, Ravi Vayuvegula
Thu, Jun 4 | 8:35AM — 12:15PM, Room 712
Open-World Vision
Speaker: Boqing Gong
Organizer: Yunhan Zhao
Thu, Jun 4 | 8:45AM — 5:00PM, Room 205
Generative Models for Computer Vision
Speaker: Sherry Yang
Thu, Jun 4 | 8:45AM — 1:00PM, Room 709
VizWiz Grand Challenge: Interpreting Images and Videos Taken by Blind People
Speakers: Cordelia Schmid, Shaun Kane
Thu, Jun 4 | 9:00AM — 5:00PM, Room 708
Adversarial Machine Learning on Computer Vision: Safety of Vision-Language Agents
Speaker: Florian Tramèr
Thu, Jun 4 | 9:00AM — 5:00PM, Room 605
Embodied Reasoning in Action: Workshop and Challenge on Embodied Reasoning for Robotic Manipulation
Organizer: Wentao Yuan
Thu, Jun 4 | 9:00AM — 5:30PM, Room 109
Human-Interactive Generation and Editing
Speakers: Shuyang Sun, Zhengqi Li, Jack Parker-Holder
Thu, Jun 4 | 9:00AM — 11:30AM, Mile High 1CD
Sight and Sound
Organizers: Arsha Nagrani, William Freeman, Andrew Zisserman
Thu, Jun 4 | 9:05AM — 5:00PM, Room 501
Visual Concepts
Organizer: Shenhan Qian
Thu, Jun 4 | 9:35AM — 2:00PM, Mile High 4EF
UG2+ Workshop and Challenge: Bridging the Gap between Computational Photography and Visual Perception
Organizers: Patrick Rim, Hyoungseob Park
Thu, Jun 4 | 1:00PM — 6:00PM, Room 506
4D Vision: Modeling the Dynamic World
Speakers: Dima Damen, Noah Snavely
Organizer: Leonidas Guibas
Thu, Jun 4 | 1:00PM — 5:30PM, Room 2E/2H
BigMAC: Big Model Adaptation for Computer Vision
Speaker: Cordelia Schmid
Organizer: Aida Nematzadeh
Thu, Jun 4 | 1:00PM — 5:30PM, Room 709
CV4Science: Using Computer Vision for the Sciences
Speaker: Bill Freeman
Thu, Jun 4 | 1:00PM — 6:00PM, Room 603
Generative 3D Reconstruction
Speaker: Philipp Henzler
Organizers: Daniel Barath, Fabian Manhardt, Marie-Julie Rakotosaona, Michael Niemeyer, Federico Tombari, Michael Oechsle, Keisuke Tateno
Thu, Jun 4 | 1:00PM — 5:55PM, Room 504
Image Matching: Local Features and Beyond
Speaker: Paul-Edouard Sarlin
Organizer: Eduard Tulls
Thu, Jun 4 | 1:00PM — 6:00PM, Mile High 4CD
Journey to the Awards: Generative AI for Movie-Grade Video Production (J2A)
Speaker: Janne Kontkanen
Thu, Jun 4 | 1:00PM — 6:00PM, Room 110
Medical Reasoning with Vision Language Foundation Models
Organizer: Xiaoxiao Li
Thu, Jun 4 | 1:00PM — 4:50PM, Mile High 3A
Multi-Modal Reasoning for AI Agents
Organizer: Annie Chen
Thu, Jun 4 | 1:30PM — 5:30PM, Room 4AB
Appearance Understanding and Generation
Speaker: Dor Verbin
Thu, Jun 4 | 1:30PM — 5:15PM, Room 102/104
Simulation for Autonomous Driving
Organizer: Shimon Whiteson
Thu, Jun 4 | 2:00PM — 5:30PM, Room 210/212
See the World in a Different Light: Physical Appearance Modeling and Relighting in the Age of Generative AI
Speakers: Ira Kemelmacher-Shlizerman, Dor Verbin
Organizers: Jianchun Chen, Yingyan Xu
Boqing Gong
Yale Song