ActionBert: Leveraging User Actions for Semantic Understanding of User Interfaces

Zecheng He
Srinivas Sunkara
Xiaoxue Zang
Ying Xu
Lijuan Liu
Gabriel Schubiner
Ruby Lee
AAAI-21 (2020)

Abstract

As mobile devices are becoming ubiquitous, regularly interacting with a variety of user interfaces (UIs) is a common
aspect of daily life for many people. To improve the accessibility of these devices and to enable their usage in a variety
of settings, building models that can assist users and accomplish tasks through the UI is vitally important. However, achieving this
poses several challenges. First, UI components of
similar appearance can have different functionalities, making
understanding their function more important than just analyzing their appearance. Second, domain-specific features such as the
Document Object Model (DOM) in web pages and the View Hierarchy (VH) in mobile applications provide important signals about the semantics of UI elements, but these features
are not in a natural language format. Third, owing to the large
diversity of UIs and the absence of standard DOM or VH representations, building a UI understanding model with high coverage requires large amounts of training data.
Inspired by the success of pre-training based approaches in
NLP for tackling a variety of problems in a data-efficient
way, we introduce a new pre-trained UI representation model
called ActionBert. Our methodology is designed to leverage
visual, linguistic and domain-specific features in user interaction traces to pre-train generic feature representations of UIs
and their components. Our key intuition is that user actions,
e.g., a sequence of clicks on different UI components, reveal
important information about their functionality. We evaluate
the proposed model on a wide variety of downstream tasks,
ranging from icon classification to UI component retrieval
based on its natural language description. Experiments show
that the proposed ActionBert model outperforms multi-modal
baselines across all downstream tasks by up to 15.5%.