Latent State Estimation Helps UI Agents to Reason
Abstract
A common problem for agents operating in real environments is that an environment's response to their actions may be non-deterministic and observed through noise. This renders both the environmental state and progress towards completing a task latent. Despite recent impressive demonstrations of LLMs' reasoning abilities on other problems, their ability to estimate latent state has not been explicitly studied. We therefore investigate this problem and incorporate latent state estimates to improve agent performance in a real-world domain: autonomous UI agents. We establish that appropriately prompting LLMs in a zero-shot manner can be formally understood as forming point estimates of latent state in a textual space. In the context of autonomous UI agents, we then show that LLMs used in this manner are more than 76% accurate at inferring various aspects of latent state, such as performed (vs. commanded) actions and task progression. Finally, using both public and internal benchmarks and a variety of reasoning methods (zero-shot, CoT-SC, and ReAct), we show that LLM-powered agents which explicitly estimate and reason about latent state successfully complete up to 1.6x more tasks than those that do not.