Making group conversations more accessible with sound localization

July 2, 2025

Samuel Yang, Research Scientist, Google Research, and Sagar Savla, Product Manager, Google DeepMind

We explore an approach that uses multi-microphone localization to enhance mobile captioning with speaker diarization and directional guidance.

Speech-to-text capabilities on mobile devices, such as Live Transcribe, have become invaluable for hearing and speech accessibility, language translation, note-taking, and meeting transcripts. However, when multiple people participate in a conversation, existing mobile automatic speech recognition (ASR) apps typically concatenate all transcribed speech together, making it difficult to follow who is saying what. This limitation creates cognitive overload for users, who must simultaneously process the transcript, try to identify speakers, and participate in the conversation. Current solutions that rely on machine learning (ML) are difficult to deploy in mobile scenarios: audio-visual speech separation requires speakers to be visible to a camera, and speaker embedding approaches require a model to learn a unique voiceprint for each speaker.

In “SpeechCompass: Enhancing Mobile Captioning with Diarization and Directional Guidance via Multi-Microphone Localization”, recipient of a Best Paper Award at CHI 2025, we explore an approach that enhances mobile captioning with speaker diarization (separating speakers in an ASR transcript) and real-time localization of incoming sound. SpeechCompass creates user-friendly transcripts for group conversations by providing color-coded visual separation for each speaker and directional indicators (arrows) that help users determine where speech is coming from. This multi-microphone approach lowers computational costs, reduces latency, and better preserves privacy.

SpeechCompass-2

Left: Existing mobile transcription apps concatenate transcribed text together. Right: SpeechCompass indicates the direction of incoming speech, allowing visually separated transcripts with colors and directional indicators (such as arrows) in the user interface.

Efficient real-time audio localization

We implement SpeechCompass in two different forms: as a phone case prototype with four microphones connected to a low-power microcontroller and as software for existing phones with two microphones. The phone case design provides optimal microphone placement to enable 360-degree sound localization. The software implementation offers only 180-degree localization on devices with two or more microphones, such as the Pixel phone. In both implementations, the phone is used for speech recognition and transcripts are visualized using a mobile application.

SpeechCompass-5

Implementation of the prototype phone case and its internal electronics. A) Mobile application interface with a mounted prototype case. B) Prototype case with a flexible PCB microphone mount and a main PCB. C) Top and bottom views of the main PCB (STM32).

Because audible sound has relatively long wavelengths, it reflects around indoor environments, causing reverberation that makes audio, especially speech, difficult to localize precisely. To address this challenge, we apply a localization algorithm based on time difference of arrival (TDOA). Audio signals arrive at each microphone at slightly different times, so the algorithm estimates the TDOA between microphone pairs with cross-correlation and uses it to predict the angle of arrival of the sound. Specifically, we use Generalized Cross Correlation with Phase Transform (GCC-PHAT), which improves noise robustness and is computationally efficient. We then apply statistical estimation techniques, such as kernel density estimation, to improve the localizer’s precision. A pair of omnidirectional microphones always suffers from “front–back” confusion (sounds directly in front of and behind the array produce identical signals), limiting localization to 180 degrees. Using three or more microphones resolves this ambiguity and makes 360-degree localization possible.
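
To make this concrete, below is a minimal Python sketch of TDOA estimation for a single microphone pair. It is our illustration of the general technique, not the SpeechCompass implementation: the function names, the regularization constant, and the far-field angle conversion are assumptions.

```python
import numpy as np
from scipy.stats import gaussian_kde

def gcc_phat(sig, ref, fs, max_tau):
    """Estimate the time difference of arrival (seconds) of `sig`
    relative to `ref` using GCC-PHAT."""
    n = len(sig) + len(ref)  # zero-pad so the correlation does not wrap
    R = np.fft.rfft(sig, n=n) * np.conj(np.fft.rfft(ref, n=n))
    R /= np.abs(R) + 1e-12   # PHAT weighting: keep phase, discard magnitude
    cc = np.fft.irfft(R, n=n)
    max_shift = min(int(fs * max_tau), n // 2)  # only physically possible lags
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (np.argmax(np.abs(cc)) - max_shift) / fs

def angle_of_arrival(tdoa, mic_spacing_m, speed_of_sound=343.0):
    """Far-field angle (degrees) for one microphone pair. A single pair
    resolves only 0-180 degrees because of front-back confusion."""
    cos_theta = np.clip(tdoa * speed_of_sound / mic_spacing_m, -1.0, 1.0)
    return float(np.degrees(np.arccos(cos_theta)))

def smoothed_angle(recent_angles):
    """Take the mode of recent angle estimates via kernel density
    estimation, suppressing outliers caused by reverberation."""
    grid = np.linspace(0.0, 180.0, 181)
    return float(grid[np.argmax(gaussian_kde(recent_angles)(grid))])
```

With three or more microphones, angle estimates from multiple pairs can be fused to cover the full 360 degrees and resolve the front-back ambiguity.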

SpeechCompass-1

SpeechCompass system diagram, including the phone case hardware and the phone application.

Unlike ML approaches to speaker diarization from a single audio stream, the SpeechCompass multi-microphone approach offers several advantages:

  • Lower computational and memory costs: Because there are no model weights, the algorithm can run on small microcontrollers with limited memory and compute.
  • Reduced latency: Extracting speaker embeddings requires accumulating enough audio to cluster the speakers, which can introduce noticeable lag.
  • Greater privacy preservation: SpeechCompass assumes that different speakers are in physically separate locations and requires neither video nor any uniquely identifying information, such as speaker embeddings (a signature of an individual’s voice).
  • Language-agnostic operation: SpeechCompass compares differences between the audio waveforms without any assumptions about their content, so it works across languages and even for sounds beyond speech.
  • Instant reconfiguration: SpeechCompass can be reconfigured instantly, simply by moving the phone.
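
To make the first two advantages concrete, here is a minimal sketch of how diarization can fall out of localization alone: each transcript segment is tagged with the nearest known speaker direction, and a new speaker is created when no direction is close enough. The greedy clustering and the 25-degree threshold are illustrative assumptions, not details from the paper.

```python
def assign_speaker(angle_deg, speaker_angles, threshold_deg=25.0):
    """Map an angle-of-arrival estimate to a speaker index, creating a
    new speaker when no known direction is within the threshold."""
    def circular_distance(a, b):
        d = abs(a - b) % 360.0
        return min(d, 360.0 - d)

    if speaker_angles:
        nearest = min(range(len(speaker_angles)),
                      key=lambda i: circular_distance(angle_deg, speaker_angles[i]))
        if circular_distance(angle_deg, speaker_angles[nearest]) <= threshold_deg:
            return nearest
    speaker_angles.append(angle_deg)
    return len(speaker_angles) - 1

# Example: three segments arriving from roughly two directions.
speakers = []
print([assign_speaker(a, speakers) for a in (10.0, 170.0, 15.0)])  # [0, 1, 0]
```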

User interface for visualizing speaker direction

We used Android’s speech-to-text capabilities to develop a mobile application that augments speech transcripts with localization data sent from the prototype phone case’s microphones via USB. The Android application provides multiple speech transcript visualization styles to indicate speaker direction:

  1. Colored text: Speakers are separated using different text colors.
  2. Directional glyphs: Arrows, dials on a circle, and colored highlights on the boxes around the text point toward each speaker's location.
  3. Minimap: A small radar-like display shows the current speaker's position.
  4. Edge indicators: Visual cues around the screen edges highlight speaker direction.
  5. Unwanted speech suppression: The user can tap the sides of the screen to suppress speech coming from those directions, for example to remove their own speech from the transcript. Filtering out irrelevant nearby conversations also enhances the privacy of bystanders.
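
As an illustration of the suppression feature in item 5, the sketch below drops caption segments whose direction falls near a user-suppressed direction. The CaptionSegment type and the 30-degree tolerance are our own assumptions for the example.

```python
from dataclasses import dataclass

@dataclass
class CaptionSegment:
    text: str
    angle_deg: float  # estimated angle of arrival, 0-360 degrees

def filter_transcript(segments, suppressed_directions, tolerance_deg=30.0):
    """Remove segments arriving from any suppressed direction, e.g.,
    the user's own seat or an irrelevant nearby conversation."""
    def circular_distance(a, b):
        d = abs(a - b) % 360.0
        return min(d, 360.0 - d)

    return [s for s in segments
            if all(circular_distance(s.angle_deg, d) > tolerance_deg
                   for d in suppressed_directions)]
```
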
SpeechCompass-3

Various visualization styles augment speech transcripts.

Technical evaluation

To evaluate the SpeechCompass software, we placed the phone case on a rotating platform with a stationary loudspeaker playing speech or noise. The platform was rotated in 10-degree increments, and the angle of arrival was measured at each position. Our evaluation shows that SpeechCompass can localize sound direction with an average error of 11°–22° at normal conversational loudness (60–65 dB). This accuracy is roughly comparable to human localization ability: when asked to point to a sound source behind them, people are typically off by as much as 20 degrees. The SpeechCompass system also performs well across different surface materials and under varying ambient noise conditions; see the paper for more details.
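
One subtlety when reproducing this kind of sweep is that angular error must be computed on the circle, so estimates on either side of 0° are not penalized by the wrap-around. A brief sketch of the error calculation (our illustration, not the paper's evaluation code):

```python
def angular_error(truth_deg, estimate_deg):
    """Circular error in degrees between ground truth and estimate."""
    d = abs(truth_deg - estimate_deg) % 360.0
    return min(d, 360.0 - d)

# Example readings from a platform rotated in 10-degree increments.
readings = [(0, 4), (10, 18), (350, 2)]  # (ground truth, estimate)
mean_error = sum(angular_error(t, e) for t, e in readings) / len(readings)
print(mean_error)  # 8.0, even though 350 vs 2 naively looks like 348 degrees
```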

SpeechCompass-6

Error in localization at different audio levels and source angles.

For diarization, we used the diarization error rate (DER), a standard metric that in our interface corresponds to the correctness of the color-coded speaker separation. Our tests showed that the four-microphone configuration consistently outperformed the three-microphone setup, with relative DER improvements of 23%–35% across different signal-to-noise ratio (SNR) conditions.
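
For reference, DER accumulates missed speech, falsely detected speech, and speaker confusion over the total duration of reference speech. The frame-based sketch below illustrates the idea; it assumes speaker labels have already been optimally mapped between reference and hypothesis, and omits the forgiveness collar used in standard scoring.

```python
def diarization_error_rate(ref, hyp, num_frames):
    """Frame-based DER sketch. `ref` and `hyp` map frame index to a
    speaker id, with missing entries meaning non-speech."""
    missed = false_alarm = confusion = speech = 0
    for t in range(num_frames):
        r, h = ref.get(t), hyp.get(t)
        if r is not None:
            speech += 1
            if h is None:
                missed += 1        # speech labeled as silence
            elif h != r:
                confusion += 1     # speech attributed to the wrong speaker
        elif h is not None:
            false_alarm += 1       # silence labeled as speech
    return (missed + false_alarm + confusion) / max(speech, 1)
```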

SpeechCompass-7

Diarization error rate (DER) with 3- and 4-microphone configurations at different signal-to-noise ratios.

User evaluation and feedback

To understand the limitations of current mobile captioning technology, we conducted an online survey with 263 frequent users of captioning technology. The results point to a significant limitation: the inability to distinguish between speakers makes current solutions challenging to use in group conversations.

SpeechCompass-8

The results of the survey with frequent users of mobile captioning.

We then demonstrated the prototype to eight frequent users of mobile speech-to-text and gathered their feedback. The prototype was used to diarize and visualize a conversation between the researchers. We found that colored text and directional arrows were the most preferred visualization methods, and all participants agreed on the value of directional guidance for group conversations.

SpeechCompass-4

Results of a user study with the working prototype. A) Preferences for the different visualization techniques. B) Value of the directional feedback to the users.

What's next?

We imagine that multi-microphone localization for mobile transcription could have numerous practical applications. In a classroom setting, for example, students could more easily follow discussions between instructors and classmates. Similarly, in business meetings, interviews, or social gatherings, users could track speaker changes in multi-person conversations.

SpeechCompass demonstrates significant improvements for mobile captioning in group conversations, and there are numerous possible directions for additional development:

  • Integration with additional wearable form factors like smart glasses and smartwatches
  • Enhanced noise robustness through machine learning approaches
  • Further customization of visualization preferences
  • Longitudinal studies to understand adoption and behavior in everyday scenarios

We hope that this research inspires continued innovation in making communication more accessible and inclusive for everyone.

Acknowledgments

We thank Artem Dementyev, Alex Olwal, Mathieu Parvaix, Chiong Lai, and Dimitri Kanevsky for their work on the SpeechCompass publication and research, and Dmitrii Votintcev for ideas on prototypes and interaction designs. We are grateful to Pascal Getreuer, Richard Lyon, Alex Huang, Shao-Fu Shih, and Chet Gnegy for their help with algorithms. We also thank Shaun Kane, James Landay, Malcolm Slaney, and Meredith Morris for their feedback on the paper. We appreciate the contributions of Carson Lau for the phone case mechanical design and Ngan Nguyen for electronics assembly. Finally, we thank Mei Lu, Don Barnett, Ryan Geraghty, and Sanjay Batra for UX research and design.