Shumin Zhai

Shumin Zhai is a Human-Computer Interaction research scientist at Google, where he leads and directs research, design, and development of input methods and haptics systems on Google's and its partners' flagship products. His research career has contributed to foundational models and understanding of human-computer interaction as well as practical user interface inventions and products based on his scientific and technical insights. He originated and led the SHARK/ShapeWriter project at IBM Research and a start-up company that pioneered the touchscreen word-gesture keyboard paradigm, filing the first patents of this paradigm, publishing the first generation of scientific papers, releasing the first word-gesture keyboard in 2004, and shipping a top-ranked (6th) iPhone app called ShapeWriter WritingPad in 2008. His publications have won the ACM UIST Lasting Impact Award and an IEEE Computer Society Best Paper Award, among others. He served as the 4th Editor-in-Chief of ACM Transactions on Computer-Human Interaction and frequently contributes to other academic boards and program committees. He received his Ph.D. degree from the University of Toronto in 1995. In 2006, he was selected as one of ACM's inaugural class of Distinguished Scientists. In 2010, he was named a Member of the CHI Academy and a Fellow of the ACM.

His external web page is at www.shuminzhai.com.

Authored Publications
    Rambler: Supporting Writing With Speech via LLM-Assisted Gist Manipulation
    Susan Lin
    Jeremy Warner
    J.D. Zamfirescu-Pereira
    Matthew G Lee
    Sauhard Jain
    Michael Xuelin Huang
    Bjoern Hartmann
    Can Liu
    Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, Association for Computing Machinery, New York, NY, USA
    Dictation enables efficient text input on mobile devices. However, writing with speech can produce disfluent, wordy, and incoherent text and thus requires heavy post-processing. This paper presents Rambler, an LLM-powered graphical user interface that supports gist-level manipulation of dictated text with two main sets of functions: gist extraction and macro revision. Gist extraction generates keywords and summaries as anchors to support the review of and interaction with spoken text. LLM-assisted macro revisions allow users to respeak, split, merge, and transform dictated text without specifying precise editing locations. Together they pave the way for interactive dictation and revision that help close the gap between spontaneously spoken words and well-structured writing. In a comparative study with 12 participants performing verbal composition tasks, Rambler outperformed a baseline of a speech-to-text editor combined with ChatGPT, as it better facilitated iterative revision with enhanced user control over the content while supporting surprisingly diverse user strategies.
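The gist extraction and macro revision functions described above can be pictured as two thin wrappers around an LLM call. The sketch below is a minimal illustration under that assumption; the prompts, output format, and the generic `llm` callable are hypothetical, not Rambler's published implementation.

```python
import json
from typing import Callable

def extract_gist(dictated_text: str, llm: Callable[[str], str]) -> dict:
    """Produce keywords and a one-sentence summary as anchors for reviewing spoken text."""
    prompt = (
        "The text below was dictated and may be disfluent.\n"
        "Return JSON with keys 'keywords' (3-5 items) and 'summary' (one sentence).\n\n"
        f"Text: {dictated_text}"
    )
    return json.loads(llm(prompt))

def macro_revise(dictated_text: str, instruction: str, llm: Callable[[str], str]) -> str:
    """Apply a high-level edit (split, merge, respeak, transform) without
    requiring the user to mark precise editing locations."""
    prompt = (
        "Rewrite the dictated text according to the instruction, preserving "
        "the speaker's meaning and voice.\n"
        f"Instruction: {instruction}\n"
        f"Text: {dictated_text}"
    )
    return llm(prompt)
```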
    Large Language Models (LLMs) may offer transformative opportunities for text input, especially for physically demanding modalities like handwriting. We studied a form of abbreviated handwriting by designing, developing, and evaluating a prototype, named SkipWriter, that converts handwritten strokes of a variable-length, prefix-based abbreviation (e.g., “ho a y” as handwritten strokes) into the intended full phrase (e.g., “how are you” in digital form) based on the preceding context. SkipWriter consists of an in-production handwriting recognizer and an LLM fine-tuned on this skip-writing task. With flexible pen input, SkipWriter allows the user to add and revise prefix strokes when predictions don’t match the user’s intent. A user evaluation demonstrated a 60% reduction in motor movements with an average speed of 25.78 WPM. We also showed that this reduction is close to the ceiling of our model in an offline simulation.
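As a rough illustration of the skip-writing idea described above, the sketch below maps recognized per-word prefixes plus preceding context to a full phrase. The prompt format and the generic `llm` callable are assumptions for illustration; SkipWriter itself pairs a handwriting recognizer with a fine-tuned model rather than prompting.

```python
from typing import Callable

def expand_prefixes(context: str, recognized_prefixes: str,
                    llm: Callable[[str], str]) -> str:
    """Expand per-word prefix abbreviations (e.g. 'ho a y') into the full phrase."""
    prompt = (
        "Each word below is abbreviated to its first letters. Expand the "
        "abbreviation into the full phrase that best fits the preceding context.\n"
        f"Context: {context}\n"
        f"Abbreviation: {recognized_prefixes}\n"
        "Full phrase:"
    )
    return llm(prompt).strip()

# Example (with any text-completion callable `llm`):
#   expand_prefixes("Hi John,", "ho a y", llm)  ->  "how are you"
```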
    Capacitive touch sensors capture the two-dimensional spatial profile (referred to as a touch heatmap) of a finger's contact with a mobile touchscreen. However, the research and design of touchscreen mobile keyboards - one of the most speed- and accuracy-demanding touch interfaces - have focused on the location of the touch centroid derived from the touch heatmap as the input, discarding the rest of the raw spatial signal. In this paper, we investigate whether touch heatmaps can be leveraged to further improve the tap decoding accuracy of mobile touchscreen keyboards. Specifically, we compared machine-learning models that decode user taps using the centroids and/or the heatmaps as their input and studied the contribution of the heatmap. The results show that adding the heatmap to the input feature set led to a 21.4% relative reduction in character error rate on average, compared to using the centroid alone. Furthermore, we conducted online deployment testing of the heatmap-based decoder in a user study with 16 participants and observed a lower error rate, faster typing speed, and higher self-reported satisfaction scores with the heatmap-based decoder than with the centroid-based decoder. These findings underline the promise of touch heatmaps for improving the typing experience on mobile keyboards.
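The comparison above comes down to what the decoder is allowed to see: a two-number centroid or the full spatial profile. The sketch below shows the two feature representations under that framing; the paper's actual decoding models are not reproduced here.

```python
import numpy as np

def centroid_features(heatmap: np.ndarray) -> np.ndarray:
    """Reduce an HxW touch heatmap to its intensity-weighted (x, y) centroid."""
    ys, xs = np.indices(heatmap.shape)
    total = heatmap.sum() + 1e-9
    return np.array([(xs * heatmap).sum() / total,
                     (ys * heatmap).sum() / total])

def heatmap_features(heatmap: np.ndarray) -> np.ndarray:
    """Keep the full spatial profile (flattened) instead of discarding it."""
    return heatmap.astype(float).ravel()

# Either feature vector can feed the same downstream tap classifier; the
# study's finding is that the heatmap carries useful signal beyond the centroid.
```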
    TapNet: The Design, Training, Implementation, and Applications of a Multi-Task Learning CNN for Off-Screen Mobile Input
    Michael Xuelin Huang
    Nazneen Nazneen
    Alex Chao
    ACM CHI Conference on Human Factors in Computing Systems, ACM (2021)
    Off-screen interaction offers great potential for one-handed and eyes-free mobile interaction. While a few existing studies have explored built-in mobile phone sensors to sense off-screen signals, none has met practical requirements. This paper discusses the design, training, implementation, and applications of TapNet, a multi-task network that detects tapping on a smartphone using the built-in accelerometer and gyroscope. With sensor location as auxiliary information, TapNet can jointly learn from data across devices and simultaneously recognize multiple tap properties, including tap direction and tap location. We developed four datasets consisting of over 180K training samples, 38K testing samples, and 87 participants in total. Experimental evaluation demonstrated the effectiveness of the TapNet design and its significant improvement over the state of the art. Along with the datasets, codebase, and extensive experiments, TapNet establishes a new technical foundation for off-screen mobile input.
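A hedged sketch of the multi-task idea described above: a shared convolutional trunk over accelerometer and gyroscope windows, sensor location as an auxiliary input, and separate heads for tap direction and tap location. The layer sizes and head dimensions are illustrative assumptions, not the published TapNet architecture.

```python
import torch
import torch.nn as nn

class TapNetSketch(nn.Module):
    def __init__(self, n_directions: int = 2, n_locations: int = 5):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv1d(6, 32, kernel_size=5, padding=2),  # 6 channels: 3-axis accel + 3-axis gyro
            nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        # +2 for the (x, y) sensor location on the device, the auxiliary input
        self.direction_head = nn.Linear(64 + 2, n_directions)
        self.location_head = nn.Linear(64 + 2, n_locations)

    def forward(self, imu: torch.Tensor, sensor_xy: torch.Tensor):
        # imu: (batch, 6, window_len); sensor_xy: (batch, 2)
        features = self.trunk(imu).squeeze(-1)
        joint = torch.cat([features, sensor_xy], dim=1)
        return self.direction_head(joint), self.location_head(joint)
```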
    Active Edge: Designing Squeeze Gestures for the Google Pixel 2
    Claire Lee
    Melissa Barnhart
    Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, ACM, New York, NY, 274:1-274:13
    Active Edge is a feature of Google Pixel 2 smartphone devices that creates a force-sensitive interaction surface along their sides, allowing users to perform gestures by holding and squeezing their device. Supported by strain gauge elements adhered to the inner sidewalls of the device chassis, these gestures can be more natural and ergonomic than on-screen (touch) counterparts. Developing these interactions is an integration of several components: (1) an insight and understanding of the user experiences that benefit from squeeze gestures; (2) hardware with the sensitivity and reliability to sense a user's squeeze in any operating environment; (3) a gesture design that discriminates intentional squeezes from innocuous handling; and (4) an interaction design to promote a discoverable and satisfying user experience. This paper describes the design and evaluation of Active Edge in these areas as part of the product's development and engineering.
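Component (3) above, separating intentional squeezes from innocuous handling, is essentially a gating decision over the strain-gauge signal. The sketch below is one simple way to frame that decision; the force and duration thresholds are placeholders, not the Pixel 2 product values or algorithm.

```python
def is_intentional_squeeze(force_samples, sample_rate_hz: float,
                           min_force: float = 3.0,
                           min_duration_s: float = 0.10,
                           max_duration_s: float = 1.00) -> bool:
    """Accept a squeeze only if force stays above a threshold for a bounded
    duration, filtering out both light handling and sustained gripping."""
    longest_run = run = 0
    for force in force_samples:
        run = run + 1 if force >= min_force else 0
        longest_run = max(longest_run, run)
    duration_s = longest_run / sample_rate_hz
    return min_duration_s <= duration_s <= max_duration_s
```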
    i’sFree: Eyes-Free Gesture Typing via a Touch-Enabled Remote Control
    Suwen Zhu
    Xiaojun Bi
    Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, ACM, New York, NY, USA, 448:1-448:12
    Entering text without having to pay attention to the keyboard is compelling but challenging due to the lack of visual guidance. We propose i’sFree to enable eyes-free gesture typing on a distant display from a touch-enabled remote control. i’sFree does not display the keyboard or the gesture trace but decodes gestures drawn on the remote control into text according to an invisible and shifting Qwerty layout. i’sFree decodes gestures similarly to a general gesture typing decoder, but learns from the instantaneous and historical input gestures to dynamically adjust the keyboard location. We designed it based on an understanding of how users perform eyes-free gesture typing. Our evaluation shows that eyes-free gesture typing is feasible: reducing visual guidance on the distant display hardly affects typing speed. Results also show that the i’sFree gesture decoding algorithm is effective, enabling an input speed of 23 WPM, 46% faster than the baseline eyes-free condition built on a general gesture decoder. Finally, i’sFree is easy to learn: participants reached 22 WPM in the first ten minutes, even though 40% of them were first-time gesture typing users.
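The layout adaptation described above can be pictured as a running correction of where the invisible Qwerty sits, driven by where the user actually draws. The update rule and learning rate below are illustrative assumptions, not the published i’sFree algorithm.

```python
import numpy as np

def update_keyboard_offset(current_offset: np.ndarray,
                           gesture_points: np.ndarray,
                           template_points: np.ndarray,
                           learning_rate: float = 0.2) -> np.ndarray:
    """Nudge the invisible layout toward the discrepancy between where the
    recognized word was drawn and where its gesture template sits on the layout.

    gesture_points, template_points: arrays of shape (n, 2) in screen coordinates.
    """
    observed_center = gesture_points.mean(axis=0)
    expected_center = template_points.mean(axis=0) + current_offset
    return current_offset + learning_rate * (observed_center - expected_center)
```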
    Modeling Gesture-Typing Movements
    Human-Computer Interaction, 33 (2018), pp. 234-280
    Word-gesture keyboards allow users to enter text using continuous input strokes (also known as gesture typing or shape writing). We developed a production model of gesture typing input based on a human motor control theory of optimal control (specifically, modeling human drawing movements as a minimization of jerk, the third derivative of position). In contrast to existing models, which consider gestural input as a series of concatenated aiming movements and predict a user's time performance, this descriptive theory of human motor control predicts the shapes and trajectories that users will draw. The theory is supported by an analysis of user-produced gestures that found qualitative and quantitative agreement between the shapes users drew and the minimum-jerk theory of motor control. Furthermore, by using a small number of statistical via-points whose distributions reflect the sensorimotor noise and speed-accuracy trade-off in gesture typing, we developed a model of gesture production that can predict realistic gesture trajectories for arbitrary text input tasks. The model accurately reflects features in the figural shapes and dynamics observed from users and can be used to improve the design and evaluation of gestural input systems.
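For reference, the jerk-minimization criterion named above is usually written as the following cost functional over a movement of duration T (the classic formulation from the motor-control literature; the paper's via-point machinery is not reproduced here):

```latex
C = \frac{1}{2} \int_{0}^{T}
    \left[ \left( \frac{d^{3}x}{dt^{3}} \right)^{2}
         + \left( \frac{d^{3}y}{dt^{3}} \right)^{2} \right] dt
```

The predicted gesture trajectory (x(t), y(t)) is the one that minimizes C while passing through the constraining points (here, the via-points associated with the word's letter keys).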
    M3 Gesture Menu: Design and Experimental Analyses of Marking Menus for Touchscreen Mobile Interaction
    Kun Li
    Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, ACM, New York, NY, USA, 249:1-249:14
    Despite their learning advantages in theory, marking menus have faced adoption challenges in practice, even on today's touchscreen-based mobile devices. We address these challenges by designing, implementing, and evaluating multiple versions of M3 Gesture Menu (M3), a reimagination of marking menus targeted at mobile interfaces. M3 is defined on a grid rather than in a radial space, relies on gestural shapes rather than directional marks, and has constant and stationary space use. Our first controlled experiment on expert performance showed that M3 was faster and less error-prone than traditional marking menus by a factor of two. A second experiment on learning demonstrated for the first time that users could successfully transition to recall-based execution of a dozen commands after three ten-minute practice sessions with both M3 and the Multi-Stroke Marking Menu. Together, M3, with its demonstrated resolution, learning, and space-use benefits, contributes to the design and understanding of menu selection in the mobile-first era of end-user computing.
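As a rough sketch of the grid-based design described above, a gesture can be reduced to the ordered sequence of grid cells it passes through, and that sequence indexes a command. The 3x3 grid, cell geometry, and example command table below are assumptions for illustration, not M3's actual implementation.

```python
def cells_for_gesture(points, grid_origin=(0.0, 0.0), cell_size=100.0, n_cols=3):
    """Map (x, y) gesture samples to a deduplicated sequence of grid-cell indices."""
    sequence = []
    for x, y in points:
        col = int((x - grid_origin[0]) // cell_size)
        row = int((y - grid_origin[1]) // cell_size)
        cell = row * n_cols + col
        if not sequence or sequence[-1] != cell:
            sequence.append(cell)
    return tuple(sequence)

# Hypothetical command table keyed by cell sequence:
commands = {(0, 1, 2): "copy", (0, 3, 6): "paste"}
# commands.get(cells_for_gesture(sampled_points))
```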
    A Cost–Benefit Study of Text Entry Suggestion Interaction
    Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems, ACM, New York, NY, pp. 83-88
    Mobile keyboards often present error corrections and word completions (suggestions) as candidates for anticipated user input. However, these suggestions are not cognitively free: they require users to attend to, evaluate, and act upon them. To understand this trade-off between suggestion savings and interaction costs, we conducted a text transcription experiment that controlled interface assertiveness: the tendency of an interface to present itself. Suggestions were either always present (extraverted), never present (introverted), or gated by a probability threshold (ambiverted). Results showed that although increasing the assertiveness of suggestions reduced the number of keyboard actions needed to enter text and was subjectively preferred, the costs of attending to and using the suggestions impaired average time performance.
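The three assertiveness conditions described above amount to a simple gate on the decoder's suggestion candidates. A minimal sketch, with an assumed threshold value and an assumed (word, probability) candidate format:

```python
def suggestions_to_show(candidates, assertiveness="ambiverted", threshold=0.8):
    """candidates: list of (word, probability) pairs from the keyboard decoder."""
    if assertiveness == "introverted":
        return []                                # never present suggestions
    if assertiveness == "extraverted":
        return [word for word, _ in candidates]  # always present them
    # ambiverted: present only when the decoder is sufficiently confident
    return [word for word, prob in candidates if prob >= threshold]
```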
    Long-Short Term Memory Neural Network for Keyboard Gesture Recognition
    Thomas Breuel
    Johan Schalkwyk
    International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2015)