Nevan Wichers
Authored Publications
ActionBert: Leveraging User Actions for Semantic Understanding of User Interfaces
Zecheng He
Srinivas Sunkara
Xiaoxue Zang
Ying Xu
Lijuan Liu
Gabriel Schubiner
Ruby Lee
AAAI-21 (2020)
Abstract
As mobile devices become ubiquitous, regularly interacting with a variety of user interfaces (UIs) is a common part of daily life for many people. To improve the accessibility of these devices and to enable their use in a variety of settings, it is vitally important to build models that can assist users and accomplish tasks through the UI. However, there are several challenges. First, UI components of similar appearance can have different functionalities, so understanding their function matters more than analyzing their appearance alone. Second, domain-specific features such as the Document Object Model (DOM) in web pages and the View Hierarchy (VH) in mobile applications provide important signals about the semantics of UI elements, but these features are not in a natural language format. Third, owing to the large diversity of UIs and the absence of standard DOM or VH representations, building a UI understanding model with high coverage requires large amounts of training data.

Inspired by the success of pre-training-based approaches in NLP for tackling a variety of problems in a data-efficient way, we introduce a new pre-trained UI representation model called ActionBert. Our methodology is designed to leverage visual, linguistic, and domain-specific features in user interaction traces to pre-train generic feature representations of UIs and their components. Our key intuition is that user actions, e.g., a sequence of clicks on different UI components, reveal important information about their functionality. We evaluate the proposed model on a wide variety of downstream tasks, ranging from icon classification to UI component retrieval based on a natural language description. Experiments show that the proposed ActionBert model outperforms multi-modal baselines across all downstream tasks by up to 15.5%.
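The key intuition above — that elements used in similar action contexts serve similar functions — can be illustrated with a toy distributional sketch. This is not the actual ActionBert model (a pre-trained transformer over interaction traces); the element IDs, traces, and co-occurrence scheme below are all illustrative assumptions:

```python
from collections import defaultdict
import math

def trace_cooccurrence(traces, window=2):
    """Count how often two UI elements appear near each other in click traces."""
    counts = defaultdict(lambda: defaultdict(int))
    for trace in traces:
        for i, a in enumerate(trace):
            for b in trace[max(0, i - window): i + window + 1]:
                if a != b:
                    counts[a][b] += 1
    return counts

def cosine(u, v, vocab):
    """Cosine similarity between two sparse co-occurrence profiles."""
    dot = sum(u.get(w, 0) * v.get(w, 0) for w in vocab)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Hypothetical click traces: element IDs from imaginary apps.
traces = [
    ["search_box", "result_item", "add_to_cart", "checkout_btn"],
    ["search_box", "result_item", "buy_now_btn", "checkout_btn"],
    ["menu", "settings", "logout"],
]
ctx = trace_cooccurrence(traces)
vocab = set(ctx)
# Buttons clicked in the same purchase context get similar profiles,
# even though nothing about their appearance is used.
sim_same = cosine(ctx["add_to_cart"], ctx["buy_now_btn"], vocab)
sim_diff = cosine(ctx["add_to_cart"], ctx["logout"], vocab)
```

Here `add_to_cart` and `buy_now_btn` end up with identical context profiles while `logout` shares none, mirroring how action traces can reveal function where appearance cannot.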
Abstract
Images may have elements containing text and a bounding box associated with them, for example, text identified via optical character recognition on a computer screen image, or a natural image with labeled objects. We present an end-to-end trainable architecture to incorporate the information from these elements and the image to segment/identify the part of the image a natural language expression is referring to. We calculate an embedding for each element and then project it onto the corresponding location (i.e., the associated bounding box) of the image feature map. We show that this architecture gives an improvement in resolving referring expressions, over only using the image, and other methods that incorporate the element information. We demonstrate experimental results on the referring expression datasets based on COCO, and on a webpage image referring expression dataset that we developed.
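The projection step described above — adding each element's embedding onto the feature-map locations covered by its bounding box — can be sketched as follows. The grid size, scalar "embeddings", and boxes are illustrative assumptions, not the paper's actual dimensions:

```python
def project_elements(grid_h, grid_w, img_w, img_h, elements):
    """Project element embeddings onto the feature-map cells their boxes cover.

    elements: list of (embedding, (x0, y0, x1, y1)) with pixel coordinates.
    Embeddings are scalars here for brevity; real models use vectors per cell.
    """
    fmap = [[0.0] * grid_w for _ in range(grid_h)]
    for emb, (x0, y0, x1, y1) in elements:
        # Map the pixel box to the range of grid cells it overlaps.
        c0 = int(x0 * grid_w / img_w)
        c1 = min(grid_w - 1, int((x1 - 1) * grid_w / img_w))
        r0 = int(y0 * grid_h / img_h)
        r1 = min(grid_h - 1, int((y1 - 1) * grid_h / img_h))
        for r in range(r0, r1 + 1):
            for c in range(c0, c1 + 1):
                fmap[r][c] += emb  # element signal lands at its image location
    return fmap

# One text element with embedding 1.0 in the top-left quarter of a 64x64 image,
# projected onto a 4x4 feature map: only the top-left 2x2 cells receive it.
fmap = project_elements(4, 4, 64, 64, [(1.0, (0, 0, 32, 32))])
```

The design point this illustrates is spatial alignment: the element's signal is fused with image features exactly where the element appears, so the segmentation head can relate a referring expression to both modalities at that location.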
Abstract
Much recent research has been devoted to video prediction and generation, yet most previous work has demonstrated only limited success in generating videos beyond short-term horizons. The hierarchical video prediction method of Villegas et al. (2017b) is an example of a state-of-the-art method for long-term video prediction, but it is limited because it requires ground-truth annotation of high-level structures (e.g., human joint landmarks) at training time. Our network encodes the input frame, predicts a high-level encoding into the future, and then a decoder with access to the first frame produces the predicted image from the predicted encoding. As a by-product, the decoder also produces a mask that outlines the predicted foreground object (e.g., a person). Unlike Villegas et al. (2017b), we develop a novel training method that jointly trains the encoder, the predictor, and the decoder without high-level supervision; we further improve upon this by using an adversarial loss in the feature space to train the predictor. Our method can predict about 20 seconds into the future and provides better results than Denton and Fergus (2018) and Finn et al. (2016) on the Human 3.6M dataset.
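The encode–predict–decode loop described above, including the mask-based compositing with the first frame, can be sketched at inference time as follows. The stand-in "networks" (one-line lambdas on 1-D "frames") are purely hypothetical placeholders for the learned encoder, predictor, and decoder:

```python
def rollout(first_frame, encode, predict, decode, steps):
    """Inference sketch: encode the input frame once, roll the high-level
    encoding forward, and decode each predicted encoding with access to
    the first frame."""
    z = encode(first_frame)
    frames = []
    for _ in range(steps):
        z = predict(z)        # high-level encoding predicted into the future
        fg, mask = decode(z)  # predicted foreground + by-product mask
        # The mask selects the predicted foreground object (e.g., the person);
        # the remaining pixels are filled in from the first frame.
        frame = [m * f + (1 - m) * b
                 for m, f, b in zip(mask, fg, first_frame)]
        frames.append(frame)
    return frames

# Toy 1-D "frames" with hypothetical stand-in networks:
encode = lambda frame: sum(frame)                     # scalar "encoding"
predict = lambda z: z + 1.0                           # latent drifts forward
decode = lambda z: ([float(z)] * 3, [1.0, 0.0, 0.0])  # (foreground, mask)

frames = rollout([0.0, 9.0, 9.0], encode, predict, decode, steps=2)
```

Only the first pixel (where the mask is 1) changes over time; the rest is copied from the first frame, which is how the mask lets the decoder reuse static background instead of regenerating it.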