StreetReaderAI: Towards making street view accessible via context-aware multimodal AI
October 29, 2025
Jon E. Froehlich, Visiting Faculty Researcher, and Shaun Kane, Research Scientist, Google Research
We introduce StreetReaderAI, a new accessible street view prototype using context-aware, real-time AI and accessible navigation controls.
Interactive streetscape tools, available today in every major mapping service, have revolutionized how people virtually navigate and explore the world, from previewing routes and inspecting destinations to remotely visiting world-class tourist locations. But to date, screen readers have not been able to interpret street view imagery, and alt text is unavailable. With multimodal AI and image understanding, we now have an opportunity to redefine this immersive streetscape experience to be inclusive for all. This could eventually make a service like Google Street View, which has over 220 billion images spanning 110+ countries and territories, more accessible to the blind and low-vision community, offering an immersive experience and opening up new possibilities for exploration.
In “StreetReaderAI: Making Street View Accessible Using Context-Aware Multimodal AI”, presented at UIST’25, we introduce StreetReaderAI, a proof-of-concept accessible street view prototype that uses context-aware, real-time AI and accessible navigation controls. StreetReaderAI was designed iteratively by a team of blind and sighted accessibility researchers, drawing on previous work in accessible first-person gaming and navigation tools, such as Shades of Doom, BlindSquare, and SoundScape. Key capabilities include:
- Real-time AI-generated descriptions of nearby roads, intersections, and places.
- Dynamic conversation with a multimodal AI agent about scenes and local geography.
- Accessible panning and movement between panoramic images using voice commands or keyboard shortcuts.
StreetReaderAI provides a context-aware description of the street view scene by inputting geographic information sources and the user’s current field-of-view into Gemini. For the full audio-video experience, including sound, please refer to this YouTube video.
StreetReaderAI uses Gemini Live to provide a real-time, interactive conversation about the scene and local geographic features. For the full audio-video experience, including sound, please refer to this YouTube video.
Navigating in StreetReaderAI
StreetReaderAI offers an immersive, first-person exploration experience, much like a video game where audio is the primary interface.
StreetReaderAI supports seamless navigation through both keyboard and voice interaction. Users can explore their surroundings using the left and right arrow keys to shift their view. As the user pans, StreetReaderAI provides audio feedback, voicing the current heading as a cardinal or intercardinal direction (e.g., “Now facing: North” or “Now facing: Northeast”). It also announces whether the user can move forward and whether they are currently facing a nearby landmark or place.
To move, the user can take “virtual steps” using the up arrow or move backward with the down arrow. As a user moves through the virtual streetscape, StreetReaderAI describes how far the user traveled and key geographic information, such as nearby places. Users can also use “jump” or “teleport” features to quickly move to new locations.
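As a concrete illustration of this interaction loop, here is a minimal sketch (not the StreetReaderAI code; the function names and exact phrasing are hypothetical) of how a heading could be snapped to the nearest of eight compass directions and turned into the spoken panning announcement:

```python
DIRECTIONS = ["North", "Northeast", "East", "Southeast",
              "South", "Southwest", "West", "Northwest"]

def heading_to_direction(heading_deg: float) -> str:
    """Snap a heading (degrees, 0 = North, clockwise) to the nearest of 8 compass points."""
    return DIRECTIONS[round((heading_deg % 360) / 45) % 8]

def pan_announcement(heading_deg: float, can_move_forward: bool,
                     facing_place: str | None = None) -> str:
    """Compose the audio feedback voiced after a left/right arrow press."""
    parts = [f"Now facing: {heading_to_direction(heading_deg)}"]
    if facing_place:
        parts.append(f"Facing {facing_place}")
    parts.append("You can move forward" if can_move_forward else "No path forward")
    return ". ".join(parts)

print(pan_announcement(42.0, can_move_forward=True, facing_place="a bus stop"))
# "Now facing: Northeast. Facing a bus stop. You can move forward"
```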
How StreetReaderAI serves as a virtual guide
The core of StreetReaderAI is its two underlying AI subsystems backed by Gemini: AI Describer and AI Chat. Both subsystems take in a static prompt and optional user profile as well as dynamic information about the user’s current location, such as nearby places, road information, and the current field-of-view image (i.e., what’s being shown in Street View).
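As a rough sketch of this shared input structure, the snippet below bundles a static prompt, an optional user profile, dynamic geographic context, and the field-of-view image into a single request payload; the field names are ours, not StreetReaderAI's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class GeoContext:
    lat: float
    lng: float
    heading_deg: float
    nearby_places: list[str] = field(default_factory=list)
    road_info: str = ""

@dataclass
class SceneRequest:
    system_prompt: str           # static prompt (describer or chat persona)
    user_profile: str | None     # optional user preferences
    geo: GeoContext              # dynamic geographic context for the current location
    fov_image_jpeg: bytes        # rendered Street View field of view

    def to_parts(self) -> list:
        """Flatten into the text + image parts sent to the model."""
        text = "\n".join(filter(None, [
            self.system_prompt,
            f"User profile: {self.user_profile}" if self.user_profile else "",
            f"Location: {self.geo.lat}, {self.geo.lng}; heading {self.geo.heading_deg} degrees",
            f"Nearby places: {', '.join(self.geo.nearby_places)}",
            f"Roads: {self.geo.road_info}",
        ]))
        return [text, self.fov_image_jpeg]
```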
AI Describer
AI Describer functions as a context-aware scene description tool that combines dynamic geographic information about the user’s virtual location along with an analysis of the current Street View image to generate a real-time audio description.
It has two modes: a “default” prompt emphasizing navigation and safety for blind pedestrians, and a “tour guide” prompt that provides additional tourism information (e.g., historic and architectural context). We also use Gemini to predict likely follow-up questions specific to the current scene and local geography that may be of interest to blind or low-vision travelers.
A diagram of how AI Describer combines multimodal data to support context-aware scene descriptions.
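The sketch below approximates this describer flow using the public google-genai SDK; the prompts, model name, and geographic summary format are placeholders rather than the prototype's actual implementation.

```python
from google import genai
from google.genai import types

DESCRIBER_PROMPTS = {
    "default": ("Describe this street scene for a blind pedestrian. "
                "Prioritize navigation and safety details."),
    "tour_guide": ("Describe this street scene for a blind visitor. "
                   "Include historic and architectural context."),
}

def describe_scene(client: genai.Client, mode: str,
                   fov_image_jpeg: bytes, geo_summary: str) -> str:
    """Generate a context-aware description of the current field of view."""
    response = client.models.generate_content(
        model="gemini-2.0-flash",  # placeholder model name
        contents=[
            types.Part.from_bytes(data=fov_image_jpeg, mime_type="image/jpeg"),
            f"{DESCRIBER_PROMPTS[mode]}\n"
            f"Geographic context: {geo_summary}\n"
            "Also suggest two follow-up questions a blind or low-vision "
            "traveler might ask about this scene.",
        ],
    )
    return response.text

# Example: describe_scene(genai.Client(), "tour_guide", jpeg_bytes,
#                         "Facing northeast; a cafe is roughly 10 m ahead")
```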
AI Chat
AI Chat builds on AI Describer but allows users to ask questions about their current view, past views, and nearby geography. The chat agent uses Google's Multimodal Live API, which supports real-time interaction, function calling, and temporarily retains memory of all interactions within a single session. We track and send each pan or movement interaction along with the user's current view and geographic context (e.g., nearby places, current heading).
What makes AI Chat so powerful is its ability to hold a temporary “memory” of the user's session: the context window is set to a maximum of 1,048,576 input tokens, roughly equivalent to over 4k input images. Because AI Chat receives the user's view and location with every virtual step, it accumulates a record of where the user has been and what they have seen. A user can virtually walk past a bus stop, turn a corner, and then ask, “Wait, where was that bus stop?” The agent can recall its previous context, analyze the current geographic input, and answer, “The bus stop is behind you, approximately 12 meters away.”
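A rough sketch of this per-step context streaming with the public google-genai Live API follows; the model name, message format, and the use of incomplete turns for context-only updates are our assumptions, and the function-calling setup is omitted.

```python
import asyncio
from google import genai
from google.genai import types

LIVE_MODEL = "gemini-2.0-flash-live-001"  # placeholder Live API model name

async def chat_about_route(steps, user_question: str):
    """Stream each pan/move (view + geo summary) into the session, then ask a question."""
    client = genai.Client()  # reads the API key from the environment
    config = {"response_modalities": ["TEXT"]}
    async with client.aio.live.connect(model=LIVE_MODEL, config=config) as session:
        # Each navigation step adds the rendered field of view and a geographic
        # summary, so the session's context window accumulates the user's path.
        for fov_jpeg, geo_summary in steps:
            await session.send_client_content(
                turns=types.Content(role="user", parts=[
                    types.Part.from_bytes(data=fov_jpeg, mime_type="image/jpeg"),
                    types.Part.from_text(text=f"[context] {geo_summary}"),
                ]),
                turn_complete=False,  # context only; no reply requested yet
            )
        # The user's (transcribed) question is sent as a complete turn.
        await session.send_client_content(
            turns=types.Content(role="user",
                                parts=[types.Part.from_text(text=user_question)]),
            turn_complete=True,
        )
        async for message in session.receive():
            if message.text:
                print(message.text, end="")

# asyncio.run(chat_about_route(
#     steps=[(jpeg_bytes, "Bus stop roughly 12 m ahead on the right")],
#     user_question="Wait, where was that bus stop?"))
```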
Testing StreetReaderAI with blind users
To evaluate StreetReaderAI, we conducted an in-person lab study with eleven blind screen reader users. During the sessions, participants learned about StreetReaderAI and used it to explore multiple locations and evaluate potential walking routes to destinations.
A blind participant using StreetReaderAI to explore potential travel to a bus stop and inquire about bus stop features, such as the existence of benches and a shelter. For the full audio-video experience, including sound, please refer to this YouTube video.
Overall, participants reacted positively to StreetReaderAI, rating its overall usefulness 6.4 on average (median=7; SD=0.9) on a 7-point Likert scale (1 = “not at all useful”, 7 = “very useful”). They emphasized the interplay between virtual navigation and AI, the seamlessness of the interactive AI Chat interface, and the value of the information provided. Qualitative feedback consistently highlighted StreetReaderAI as a significant accessibility advance over existing street view tools, and participants described the interactive AI chat as making conversations about streets and places both engaging and helpful.
During the study, participants visited over 350 panoramas and made over 1,000 AI requests. Interestingly, AI Chat was used six times more often than AI Describer, indicating a clear preference for personalized, conversational inquiries. While participants found value in StreetReaderAI and adeptly combined virtual navigation with AI interactions, there is room for improvement: participants sometimes struggled with orienting themselves, judging the accuracy of AI responses, and determining the limits of the AI's knowledge.
In one study task, participants were given the instruction, “Find out about an unfamiliar playground to plan a trip with your two young nieces.” This video clip illustrates the diversity of questions asked and the responsiveness of StreetReaderAI. For the full audio-video experience, including sound, please refer to this YouTube video.
Results
As the first study of an accessible street view system, our research also provides the first-ever analysis of the types of questions blind people ask about streetscape imagery. We analyzed all 917 AI Chat interactions and annotated each with up to three tags drawn from an emergent list of 23 question type categories. The four most common question types included:
- Spatial orientation (27.0%): Participants were most interested in the location and distance of objects, e.g., “How far is the bus stop from where I'm standing?” and “Which side are the garbage cans next to the bench?”
- Object existence (26.5%): Participants queried the presence of key features like sidewalks, obstacles, and doors, e.g., “Is there a crosswalk here?”
- General description (18.4%): Participants often started AI Chat by requesting a summary of the current view, e.g., “What's in front of me?”
- Object/place location (14.9%): Participants asked where things were, such as, “Where is the nearest intersection?” or “Can you help me find the door?”
StreetReaderAI accuracy
Because StreetReaderAI relies so significantly on AI, a critical challenge is response accuracy. Of the 816 questions that participants asked AI Chat:
- 703 (86.3%) were correctly answered.
- 32 (3.9%) were incorrect.
- The remaining responses were either partially correct (26; 3.2%) or cases where the AI refused to answer (54; 6.6%).
Of the 32 incorrect responses:
- 20 (62.5%) were false negatives, e.g., stating that a bike rack did not exist when it did.
- 12 (37.5%) were misidentifications (e.g., a yellow speed bump interpreted as a crosswalk) or other errors caused by AI Chat not yet having seen the target in street view.
More work is necessary to explore how StreetReaderAI performs in other contexts and beyond lab settings.
What’s next?
StreetReaderAI is a promising first step toward making streetscape tools accessible to all. Our study highlights what information blind users desire from and ask about streetscape imagery and the potential for multimodal AI to answer their questions.
There are several other opportunities to expand on this work:
- Towards Geo-visual Agents: We envision a more autonomous AI Chat agent that can explore on its own. For example, a user could ask, “What’s the next bus stop down this road?” and the agent could automatically navigate the Street View network, find the stop, analyze its features (benches, shelters), and report back.
- Supporting Route Planning: Similarly, StreetReaderAI does not yet support full origin-to-destination routing. Imagine asking, “What’s the walk like from the nearest subway station to the library?” A future AI agent could “pre-walk” the route, analyzing every Street View image to generate a blind-friendly summary, noting potential obstacles, and identifying the exact location of the library’s door.
- Richer Audio Interface: The primary output of StreetReaderAI is speech. We are also exploring richer, non-verbal feedback, including spatialized audio and fully immersive 3D audio soundscapes synthesized from the images themselves.
Though a “proof-of-concept” research prototype, StreetReaderAI helps demonstrate the potential of making immersive streetscape environments accessible.
Acknowledgements
This research was conducted by Jon E. Froehlich, Alexander J. Fiannaca, Nimer Jaber, Victor Tsaran, Shaun K. Kane, and Philip Nelson. We thank Project Astra and the Google Geo teams for their feedback as well as our participants. Diagram icons are from Noun Project, including: “prompt icon” by Firdaus Faiz, “command functions” by Kawalan Icon, “dynamic geo-context” by Didik Darmanto, and “MLLM icon” by Funtasticon.