
Peike Li
Peike Li is a Research Scientist dedicated to building the next generation of general-purpose artificial intelligence. His research focuses on the core challenge of real-time multimodal reasoning, aiming to create foundation models that can natively understand, reason about, and generate information across a seamless spectrum of modalities, including streaming video, audio, and text.
At Google, he has developed and scaled novel, efficient multi-modal generative models and large language models, pushing the boundaries of low-latency multimodal perception and creative generation. He holds a PhD in Computer Science from the University of Technology Sydney. For more information, please visit: https://gogoduck912.github.io
Authored Publications
Bridging Sign and Spoken Languages: Pseudo Gloss Generation for Sign Language Translation
Trevor Cohn
Jianyuan Guo
Advances in Neural Information Processing Systems (NeurIPS) (2025)
Abstract
Sign Language Translation (SLT) aims to map sign language videos to spoken language text. A common approach leverages gloss annotations as an intermediate representation, decomposing SLT into two sub-tasks: video-to-gloss recognition and gloss-to-text translation. While effective, this paradigm relies on expert-annotated gloss labels, which are costly and increasingly unavailable in many datasets, limiting scalability.
To address this challenge, we propose a gloss-free pseudo gloss generation framework that eliminates the need for human-annotated glosses while preserving the structured intermediate representation. Specifically, we prompt a Large Language Model (LLM) with example text-gloss pairs to extract potential sign-related gloss words from the text by leveraging its in-context learning capability.
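As a rough illustration of this in-context prompting step, the sketch below assembles a few-shot prompt from example text-gloss pairs and splits the LLM's completion into pseudo gloss tokens. The example pairs, the prompt wording, and the `query_llm` callable are illustrative placeholders rather than the paper's actual prompt or model.

```python
# Minimal sketch of pseudo-gloss generation via in-context prompting.
# Example pairs and the `query_llm` interface are hypothetical.

EXAMPLE_PAIRS = [
    ("am montag wird es wieder regnen", "MONTAG REGEN WIEDER"),
    ("heute ist es im norden freundlich", "HEUTE NORD FREUNDLICH"),
]

def build_prompt(sentence: str) -> str:
    """Assemble a few-shot prompt asking the LLM to extract
    sign-related gloss words from a spoken-language sentence."""
    lines = ["Extract the sign glosses for each sentence."]
    for text, gloss in EXAMPLE_PAIRS:
        lines.append(f"Sentence: {text}\nGlosses: {gloss}")
    lines.append(f"Sentence: {sentence}\nGlosses:")
    return "\n\n".join(lines)

def generate_pseudo_gloss(sentence: str, query_llm) -> list[str]:
    """`query_llm` is any callable mapping a prompt string to the
    model's text completion (placeholder for the actual LLM call)."""
    completion = query_llm(build_prompt(sentence))
    return completion.strip().split()
```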
To mitigate the inherent misalignment between generated pseudo glosses and sign sequences in the video, we further refine their order by formulating the alignment as a weakly supervised learning problem.
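One simple way to picture the reordering step: if the vision encoder yields frame-level gloss probabilities, each pseudo gloss can be anchored to the frame where its probability peaks and the glosses sorted by that anchor. The heuristic below is only a hedged sketch of such an alignment, not necessarily the weakly supervised formulation used in the paper.

```python
import torch

def reorder_pseudo_gloss(frame_probs: torch.Tensor,
                         gloss_ids: list[int]) -> list[int]:
    """Reorder pseudo glosses to follow the temporal order of the video.

    frame_probs: (T, V) per-frame probabilities over the gloss vocabulary,
        e.g. from a vision encoder (illustrative shapes only).
    gloss_ids: vocabulary indices of the generated pseudo glosses.
    """
    # Anchor each gloss at the frame where its probability peaks,
    # then sort the glosses by that anchor frame.
    anchors = [int(frame_probs[:, g].argmax()) for g in gloss_ids]
    order = sorted(range(len(gloss_ids)), key=lambda i: anchors[i])
    return [gloss_ids[i] for i in order]
```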
With the reordered pseudo glosses, additional alignment losses such as CTC can be incorporated to enhance supervision. We train our SLT model, comprising a vision encoder and a translator, under a three-stage pipeline, effectively bridging the gap between sign and spoken language.
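Once the pseudo glosses are in video order, they can serve as CTC targets for frame-level gloss predictions. A minimal PyTorch sketch, with made-up shapes and vocabulary size standing in for the actual encoder outputs:

```python
import torch
import torch.nn as nn

# Illustrative only: T frames, batch B, gloss vocabulary V (blank at index 0).
T, B, V = 120, 2, 500
ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)

# Frame-level gloss log-probabilities, shape (T, B, V), e.g. from the vision encoder.
log_probs = torch.randn(T, B, V).log_softmax(dim=-1)
# Reordered pseudo-gloss ids used as weak targets, shape (B, S).
targets = torch.randint(1, V, (B, 8))
input_lengths = torch.full((B,), T, dtype=torch.long)
target_lengths = torch.full((B,), 8, dtype=torch.long)

loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
```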
Despite its simplicity, our approach outperforms previous state-of-the-art gloss-free frameworks across three SLT benchmarks and achieves competitive results with gloss-based methods.