Bridging Sign and Spoken Languages: Pseudo Gloss Generation for Sign Language Translation

Trevor Cohn
Jianyuan Guo
Advances in Neural Information Processing Systems (NeurIPS) (2025)

Abstract

Sign Language Translation (SLT) aims to map sign language videos to spoken language text. A common approach leverages gloss annotations as an intermediate representation, decomposing SLT into two sub-tasks: video-to-gloss recognition and gloss-to-text translation. While effective, this paradigm relies on expert-annotated gloss labels, which are costly and increasingly unavailable in many datasets, limiting scalability.
To address this challenge, we propose a gloss-free pseudo gloss generation framework that eliminates the need for human-annotated glosses while preserving the structured intermediate representation. Specifically, we prompt a Large Language Model (LLM) with example text-gloss pairs to extract potential sign-related gloss words from the text by leveraging its in-context learning capability.
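As a rough illustration of this in-context prompting step, the sketch below assembles a few-shot prompt from (text, gloss) example pairs. The prompt wording, example pairs, and choice of LLM are placeholders for exposition, not the paper's actual template; the completion returned by any instruction-following LLM would serve as the pseudo gloss sequence.

```python
# Minimal sketch of in-context pseudo gloss generation. The template and
# example pairs below are illustrative assumptions, not the paper's prompt.

def build_prompt(examples: list[tuple[str, str]], query: str) -> str:
    """Assemble a few-shot prompt from (text, gloss) example pairs."""
    lines = ["Extract sign-related gloss words from the sentence."]
    for text, gloss in examples:
        lines.append(f"Text: {text}\nGloss: {gloss}")
    lines.append(f"Text: {query}\nGloss:")  # the LLM completes this slot
    return "\n\n".join(lines)

# Hypothetical text-gloss demonstration pairs.
examples = [
    ("the weather will be cold tomorrow", "TOMORROW WEATHER COLD"),
    ("it will rain in the north", "NORTH RAIN"),
]
prompt = build_prompt(examples, "snow is expected in the evening")

# `prompt` is sent to an instruction-following LLM; its completion is
# taken as the pseudo gloss sequence for the query sentence.
print(prompt)
```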
To mitigate the inherent misalignment between generated pseudo glosses and sign sequences in the video, we further refine their order by formulating the alignment as a weakly supervised learning problem.
With the reordered pseudo glosses, additional alignment losses such as Connectionist Temporal Classification (CTC) can be incorporated to strengthen supervision. We train our SLT model, comprising a vision encoder and a translator, under a three-stage pipeline, effectively bridging the gap between sign and spoken language.
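For concreteness, a CTC alignment loss between the vision encoder's frame-level outputs and a reordered pseudo gloss sequence can be computed with PyTorch's standard `nn.CTCLoss`. The shapes, vocabulary size, and random tensors below are placeholder assumptions, not values from the paper.

```python
import torch
import torch.nn as nn

# Hypothetical sizes: T video frames, batch B, gloss vocabulary V (blank id = 0).
T, B, V = 120, 2, 500

# Stand-in for per-frame vision-encoder outputs over the gloss vocabulary.
log_probs = torch.randn(T, B, V + 1).log_softmax(dim=-1)

# Stand-in for reordered pseudo gloss token ids (blank id 0 excluded).
targets = torch.randint(1, V + 1, (B, 12))
input_lengths = torch.full((B,), T, dtype=torch.long)
target_lengths = torch.full((B,), 12, dtype=torch.long)

# CTC marginalizes over all monotonic alignments of glosses to frames,
# which is why the pseudo glosses must first be put in signing order.
ctc = nn.CTCLoss(blank=0, zero_infinity=True)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
```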
Despite its simplicity, our approach outperforms previous state-of-the-art gloss-free frameworks across three SLT benchmarks and achieves results competitive with gloss-based methods.