Oriana Riva

Oriana Riva

Oriana joined Google Research in Zurich in 2022 where she leads the Automation ML team. Prior to Google, she worked at Microsoft Research in Redmond for 12 years.
Authored Publications
Google Publications
Other Publications
Sort By
  • Title
  • Title, descending
  • Year
  • Year, descending
    Visual Grounding for User Interfaces
    Yijun Qian
    Yujie Lu
    Alexander G. Hauptmann
    2024 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2024) - Industry Track
    Preview abstract Enabling autonomous language agents to drive application user interfaces (UIs) as humans do can significantly expand the capability of today's API-based agents. Essential to this vision is the ability of agents to ground natural language commands to on-screen UI elements. Prior UI grounding approaches work by relaying on developer-provided UI metadata (UI trees, such as web DOM, and accessibility labels) to detect on-screen elements. However, such metadata is often unavailable or incomplete. Object detection techniques applied to UI screens remove this dependency, by inferring location and types of UI elements directly from the UI's visual appearance. The extracted semantics, however, are too limited to directly enable grounding. We overcome the limitations of both approaches by introducing the task of visual UI grounding, which unifies detection and grounding. A model takes as input a UI screenshot and a free-form language expression, and must identify the referenced UI element. We propose a solution to this problem, LVG, which learns UI element detection and grounding using a new technique called layout-guided contrastive learning, where the semantics of individual UI objects are learned also from their visual organization. Due to the scarcity of UI datasets, LVG integrates synthetic data in its training using multi-context learning. LVG outperforms baselines pre-trained on much larger datasets by over 4.9 points in top-1 accuracy, thus demonstrating its effectiveness. View details
    UINav: A Practical Approach to Train On-Device Automation Agents
    Wei Li
    Fu-Lin Hsu
    Will Bishop
    Folawiyo Campbell-Ajala
    Max Lin
    2024 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2024) - Industry Track
    Preview abstract Automation systems that can autonomously drive application user interfaces to complete user tasks are of great benefit, especially when users are situationally or permanently impaired. Prior automation systems do not produce generalizable models while AI-based automation agents work reliably only in simple, hand-crafted applications or incur high computation costs. We propose UINav, a demonstration-based approach to train automation agents that fit mobile devices, yet achieving high success rates with modest numbers of demonstrations. To reduce the demonstration overhead, UINav, uses a referee model that provides users with immediate feedback on tasks where the agent fails, and automatically augments human demonstrations to increase diversity in training data. Our evaluation shows that with only 10 demonstrations UINav, can achieve 70% accuracy, and that with enough demonstrations it can surpass 90% accuracy. View details
    Android in the Wild: A Large-Scale Dataset for Android Device Control
    Chris Rawles
    Alice Li
    Timothy Lillicrap
    NeurIPS 2023 Datasets and Benchmarks Track
    Preview abstract There is a growing interest in device-control systems that can interpret human natural language instructions and execute them on a digital device by directly controlling its user interface. We present a dataset for device-control research, Android in the Wild (AitW), which is orders of magnitude larger than current datasets. The dataset contains human demonstrations of device interactions, including the screens and actions, and corresponding natural language instructions. It consists of 715k episodes spanning 30k unique instructions, four versions of Android (v10-13), and eight device types (Pixel 2 XL to Pixel 6) with varying screen resolutions. It contains multi-step tasks that require semantic understanding of language and visual context. This dataset poses a new challenge: actions available through the user interface must be inferred from their visual appearance, and, instead of simple UI element-based actions, the action space consists of precise gestures (e.g., horizontal scrolls to operate carousel widgets). We organize our dataset to encourage robustness analysis of device-control systems, i.e., how well a system performs in the presence of new task descriptions, new applications, or new platform versions. We develop two agents and report performance across the dataset. The dataset is available at https://github.com/google-research/google-research/tree/master/android_in_the_wild. View details
    No Results Found