May 25, 2023
Posted by James Manyika, SVP Google Research and Technology & Society, and Jeff Dean, Chief Scientist, Google DeepMind and Google Research
Wednesday, May 10th was an exciting day for the Google Research community as we watched the results of months and years of our foundational and applied work get announced on the Google I/O stage. With the quick pace of announcements on stage, it can be difficult to convey the substantial effort and unique innovations that underlie the technologies we presented. So today, we’re excited to reveal more about the research efforts behind some of the many compelling announcements at this year's I/O.
Our next-generation large language model (LLM), PaLM 2, is built on advances in compute-optimal scaling, scaled instruction fine-tuning and an improved dataset mixture. By fine-tuning and instruction-tuning the model for different purposes, we have been able to integrate state-of-the-art capabilities into over 25 Google products and features, where it is already helping to inform, assist and delight users. For example:
Perhaps even more exciting for developers, we have opened up the PaLM APIs & MakerSuite to provide the community opportunities to innovate using this groundbreaking technology.
Figure: PaLM 2 has advanced coding capabilities that enable it to find code errors and make suggestions in a number of different languages.
Our Imagen family of image generation and editing models builds on advances in large Transformer-based language models and diffusion models. This family of models is being incorporated into multiple Google products, including:
Figure: I/O Flip presents custom card decks designed using DreamBooth.
Phenaki, Google’s Transformer-based text-to-video generation model, was featured in the I/O pre-show. Phenaki can synthesize realistic videos from textual prompt sequences by leveraging two main components: an encoder-decoder model that compresses videos to discrete embeddings, and a Transformer model that translates text embeddings to video tokens.
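One way to picture the "discrete embeddings" step is as a nearest-neighbor lookup against a codebook, which turns continuous vectors into integer tokens that a Transformer can then predict. The sketch below is a toy illustration under that assumption, not Phenaki's actual tokenizer, which is a learned encoder-decoder. The codebook values and the function names `quantize` and `dequantize` are hypothetical.

```python
from typing import List, Sequence

Vector = Sequence[float]


def quantize(embeddings: Sequence[Vector], codebook: Sequence[Vector]) -> List[int]:
    """Map each continuous embedding to the index of its nearest codebook vector."""
    tokens = []
    for emb in embeddings:
        # Squared Euclidean distance from this embedding to every codebook entry.
        dists = [sum((a - b) ** 2 for a, b in zip(emb, code)) for code in codebook]
        tokens.append(dists.index(min(dists)))
    return tokens


def dequantize(tokens: Sequence[int], codebook: Sequence[Vector]) -> List[List[float]]:
    """Look up the codebook vector for each token (the lossy inverse of quantize)."""
    return [list(codebook[t]) for t in tokens]


# Hypothetical 2-D codebook with four entries; a real one is learned and far larger.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]

# Stand-ins for the continuous embeddings an encoder might emit for video patches.
patch_embeddings = [(0.1, -0.05), (0.9, 0.2), (0.2, 0.8)]

video_tokens = quantize(patch_embeddings, codebook)
print(video_tokens)  # [0, 1, 2]
```

In a full system, a separate model would be trained to predict such token sequences from text embeddings, and a decoder would map the predicted tokens back to pixels.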
Among the new ARCore features announced by the AR team at I/O, the Scene Semantics API can recognize pixel-wise semantics in an outdoor scene. This helps users create custom AR experiences based on the features in the surrounding area. The API is powered by an outdoor semantic segmentation model that builds on our recent work on the DeepLab architecture and an egocentric outdoor scene understanding dataset. The latest ARCore release also includes an improved monocular depth model that provides higher accuracy in outdoor scenes.
Figure: The Scene Semantics API uses a DeepLab-based semantic segmentation model to provide accurate pixel-wise labels in outdoor scenes.
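To make "pixel-wise labels" concrete, the sketch below treats a segmentation result as a 2-D grid of integer class ids and computes what fraction of the frame each class covers, which is the kind of signal an app could use to trigger an effect when enough sky is visible. This is an illustration only and does not use the ARCore SDK. The label ids, the `LABELS` table, and the function `label_fractions` are assumptions made for this example.

```python
from collections import Counter
from typing import Dict, Sequence

# Hypothetical label ids, chosen for illustration; not the ARCore label set.
LABELS = {0: "unlabeled", 1: "sky", 2: "building", 3: "tree", 4: "road"}


def label_fractions(label_map: Sequence[Sequence[int]]) -> Dict[str, float]:
    """Return the fraction of pixels assigned to each label in a 2-D label map."""
    counts = Counter(pixel for row in label_map for pixel in row)
    total = sum(counts.values())
    # Ids missing from LABELS get a readable fallback name instead of failing.
    return {LABELS.get(i, f"id_{i}"): n / total for i, n in counts.items()}


# A tiny 3x4 "frame": sky on top, a tree and a building, road at the bottom.
label_map = [
    [1, 1, 1, 1],
    [1, 1, 3, 2],
    [4, 4, 3, 2],
]

fractions = label_fractions(label_map)

# Example policy: only show a sky-anchored effect when sky covers over 30% of the frame.
show_sky_effect = fractions.get("sky", 0.0) > 0.3
print(fractions["sky"], show_sky_effect)  # 0.5 True
```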
Chirp is Google's family of state-of-the-art Universal Speech Models trained on 12 million hours of speech to enable automatic speech recognition (ASR) for 100+ languages. The models can perform ASR on under-resourced languages, such as Amharic, Cebuano, and Assamese, in addition to widely spoken languages like English and Mandarin. Chirp covers such a wide variety of languages by combining self-supervised learning on an unlabeled multilingual dataset with fine-tuning on a smaller set of labeled data. Chirp is now available in the Google Cloud Speech-to-Text API, allowing users to perform inference on the model through a simple interface. You can get started with Chirp here.
At I/O, we launched MusicLM, a text-to-music model that generates 20 seconds of music from a text prompt. You can try it yourself on AI Test Kitchen, or see it featured during the I/O pre-show, where electronic musician and composer Dan Deacon used MusicLM in his performance.
MusicLM, which consists of models powered by AudioLM and MuLan, can make music (from text, humming, images or video) and musical accompaniments to singing. AudioLM generates high-quality audio with long-term consistency. It maps audio to a sequence of discrete tokens and casts audio generation as a language modeling task. To synthesize longer outputs efficiently, it uses a novel approach we’ve developed called SoundStorm.
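The idea of casting audio generation as language modeling can be sketched in a few lines: turn samples into integer tokens, learn which token tends to follow which, then generate by repeatedly predicting the next token. The toy below swaps in much simpler pieces than the real system: uniform amplitude quantization stands in for a learned audio tokenizer, and a bigram count table stands in for a Transformer. All function names here are hypothetical.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List, Sequence


def tokenize(samples: Sequence[float], levels: int = 8) -> List[int]:
    """Uniformly quantize samples in [-1, 1] into integer tokens 0..levels-1."""
    tokens = []
    for s in samples:
        s = max(-1.0, min(1.0, s))  # clip out-of-range samples
        tokens.append(min(levels - 1, int((s + 1.0) / 2.0 * levels)))
    return tokens


def fit_bigram(tokens: Sequence[int]) -> Dict[int, Counter]:
    """Count, for every token, how often each other token follows it."""
    counts: Dict[int, Counter] = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def generate(counts: Dict[int, Counter], start: int, length: int) -> List[int]:
    """Greedy generation: always emit the most frequent follower of the last token."""
    out = [start]
    for _ in range(length - 1):
        followers = counts.get(out[-1])
        if not followers:  # token never seen with a successor; stop early
            break
        out.append(followers.most_common(1)[0][0])
    return out


# A short synthetic "recording": a sine wave with a period of 8 samples.
waveform = [math.sin(2 * math.pi * t / 8) for t in range(64)]
tokens = tokenize(waveform)
model = fit_bigram(tokens)
continuation = generate(model, tokens[-1], 16)
print(continuation)
```

Greedy bigram prediction quickly falls into short loops, which hints at why modeling long-range structure in token sequences calls for far more capable models.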
Our dubbing efforts leverage dozens of ML technologies to translate the full expressive range of video content, making videos accessible to audiences across the world. These technologies have been used to dub videos across a variety of products and content types, including educational content, advertising campaigns, and creator content, with more to come. We use deep learning technology to achieve voice preservation and lip matching and enable high-quality video translation. We’ve built this product to include human review for quality and safety checks to help prevent misuse, and we make it accessible only to authorized partners.
We are applying our AI technologies to solve some of the biggest global challenges, like mitigating climate change, adapting to a warming planet and improving human health and wellbeing. For example:
With our continued investment in AI technologies, we are emphasizing responsible AI development with the goal of making our models and tools useful and impactful while also ensuring fairness, safety and alignment with our AI Principles. Some of these efforts were highlighted at I/O, including:
It’s inspiring to be part of a community of so many talented individuals who are leading the way in developing state-of-the-art technologies, responsible AI approaches and exciting user experiences. We are in the midst of a period of incredible and transformative change for AI. Stay tuned for more updates about the ways in which the Google Research community is boldly exploring the frontiers of these technologies and using them responsibly to benefit people’s lives around the world. We hope you're as excited as we are about the future of AI technologies and we invite you to engage with our teams through the references, sites and tools that we’ve highlighted here.