Improving Gboard language models via private federated analytics

April 19, 2024

Ziteng Sun, Research Scientist, Google Research, and Haicheng Sun, Software Engineer, Android

To boost Google's keyboard performance while keeping user data private, we have worked with language experts to refine its dictionaries, developed novel privacy-preserving techniques based on federated analytics and differential privacy to discover out-of-vocabulary words directly on user devices, and employed secure hardware infrastructure for confidential, externally verifiable server-side data processing.

Google’s keyboard application, Gboard, leverages language models (LMs) to improve users’ typing experience via features like next word prediction, autocorrection, smart compose, slide to type, and proofread. Our researchers prioritize developing responsible approaches that uphold the highest privacy standards while improving Gboard’s LM performance. We have made substantial progress in recent years, including providing data usage disclosures and configuration controls to our users and using federated learning to train Gboard's LMs with differential privacy (DP) to provide a quantifiable and rigorous measure of data anonymization.

Gboard’s LMs are designed to work with a predefined list of frequently-used words, referred to as the vocabulary. The performance of LMs depends on the quality of this vocabulary, which can change over time. Words that are not part of the vocabulary are referred to as out-of-vocabulary (OOV) words. OOV words can occur in Gboard for several reasons. For example, the vocabulary for some languages is still under development in Gboard, so the fraction of OOV words can be higher. For languages where Gboard has a relatively complete vocabulary, such as English in the U.S., OOV words often appear due to newly emerged trending words (e.g., “COVID-19” and “Wordle”), atypical capitalization (e.g., “tuesday”), unusual spellings reflecting user preferences (e.g., “cooool”), or simple typos. OOV discovery is a challenging task due to the sensitive nature of information that users type on their keyboard.

Today, we are excited to share a number of approaches that improve the performance of LMs by enabling the discovery of new frequently-used words, while maintaining strong data minimization and DP guarantees. These research efforts include collaboration with linguists to surface novel OOV words, employment of privacy-preserving federated analytics and other DP algorithms, and use of trusted execution environments (TEEs).

Collaboration with linguists

One way to discover OOV words is to obtain a vetted list of words through responsible collaborations with external parties. For example, we worked with Real Academia Española (RAE), a royal institution whose mission is to ensure the stability of the Spanish language, to create a more refined dictionary of the Spanish language and incorporate it into Gboard. This has enabled faster autocorrections and better word recommendations, improving the Gboard experience for users who type in Spanish from Spain. Many of the previously-missing words included common names, brand names, and location names; relatively technical words ("euribor," "dopamina," "tensiómetro"); and conjugations specific to speakers in Spain ("cuidáis," "invitáis," "tiráis").

Retraining our Spanish LM with the previous training data, augmented with federated retraining of downstream models, yielded significant quality improvements: the overall fraction of OOV words dropped by 7.3%, the rate at which typed words were modified after the initial commit decreased, and typing speed improved as a consequence of the larger vocabulary.

Privacy-preserving federated analytics

Another approach to improve vocabulary is to discover frequent OOV words from user devices. This is inherently a challenging task due to the sensitive nature of what users type on their devices. Hence we need to carefully design mechanisms that protect users’ sensitive information during both the data collection and processing phases. To achieve this, we employ federated analytics, a data minimization method for computing statistical queries on distributed datasets without sharing sensitive data, and extend it with novel algorithms for open-set domains. This enables dynamic OOV word discovery while protecting user contributions through data minimization techniques, such as secure aggregation (SecAgg), and data anonymization techniques, such as DP.

One technique we developed is SecAggIBLT, which combines invertible Bloom lookup tables (IBLTs) with SecAgg. IBLTs are linear data structures that allow for efficient insertion, deletion, and lookup of key-value pairs. Here, users insert their OOV words into zero-initialized IBLTs, which are then aggregated using SecAgg. This guarantees that an honest-but-curious server is only able to see the aggregated IBLTs (all OOV words and their frequency across all devices), not individual user contributions. This approach provides anonymity for user contributions and prevents the server from linking specific words to a single user. During the data processing phase, central DP is applied to the discovered OOV words and their counts to ensure that OOV words that are unique to a few individuals are never released. DP uses parameters (ε, δ) to quantify privacy protection (smaller values indicate stronger protection). It provides a formal guarantee that released data patterns are common enough across devices, preventing individual identification.
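To make the mechanics concrete, here is a minimal Python sketch of an additive IBLT, with hypothetical table sizes, hash choices, and a `dp_release` helper for the central-DP step; it is an illustration, not Gboard's implementation (for instance, production SecAgg sums vectors in modular arithmetic, and the DP release is more involved). Every cell field is a plain sum, so devices' tables can be combined by cell-wise addition, which is exactly the linearity that SecAgg exploits:

```python
import hashlib
import random

NUM_CELLS = 128   # hypothetical table size; production sizing differs
NUM_HASHES = 3    # cells touched per key

def _cells_for(key: bytes):
    # Derive NUM_HASHES distinct cell indices for a key.
    idxs, salt = [], 0
    while len(idxs) < NUM_HASHES:
        h = int.from_bytes(
            hashlib.sha256(bytes([salt]) + key).digest()[:4], "big") % NUM_CELLS
        if h not in idxs:
            idxs.append(h)
        salt += 1
    return idxs

def _fingerprint(key: bytes) -> int:
    # Checksum used during decoding to verify a recovered key.
    return int.from_bytes(hashlib.sha256(b"fp" + key).digest()[:8], "big")

def new_iblt():
    # Each cell: [insertion count, key sum, fingerprint sum, value sum].
    return [[0, 0, 0, 0] for _ in range(NUM_CELLS)]

def insert(iblt, word: str, count: int = 1):
    key = word.encode()
    k_int, fp = int.from_bytes(key, "big"), _fingerprint(key)
    for idx in _cells_for(key):
        cell = iblt[idx]
        cell[0] += 1
        cell[1] += k_int
        cell[2] += fp
        cell[3] += count

def merge(a, b):
    # Cell-wise addition; in production this sum happens under SecAgg,
    # so the server only ever sees the merged table.
    return [[x + y for x, y in zip(ca, cb)] for ca, cb in zip(a, b)]

def decode(iblt):
    # "Peel" pure cells (cells holding copies of a single key, verified
    # by the fingerprint) until no further progress is possible.
    iblt = [cell[:] for cell in iblt]
    recovered, progress = {}, True
    while progress:
        progress = False
        for cell in iblt:
            n, ksum, fsum, vsum = cell
            if n <= 0 or ksum % n != 0:
                continue
            k_int = ksum // n
            if k_int <= 0:
                continue
            key = k_int.to_bytes((k_int.bit_length() + 7) // 8, "big")
            if _fingerprint(key) * n != fsum:
                continue  # cell mixes several keys; try again later
            recovered[key.decode()] = recovered.get(key.decode(), 0) + vsum
            for idx in _cells_for(key):  # remove all n copies of this key
                c = iblt[idx]
                c[0] -= n; c[1] -= ksum; c[2] -= fsum; c[3] -= vsum
            progress = True
    return recovered

def dp_release(counts, eps, threshold, rng):
    # Hypothetical central-DP post-processing of the decoded counts:
    # Laplace(1/eps) noise (difference of two exponentials) plus a
    # release threshold that suppresses rare words.
    released = {}
    for word, c in counts.items():
        noisy = c + rng.expovariate(eps) - rng.expovariate(eps)
        if noisy >= threshold:
            released[word] = noisy
    return released
```

For example, merging three devices' tables with `merge` and calling `decode` on the result recovers each word with its total count across devices, after which `dp_release(decode(agg), eps, threshold, random.Random())` illustrates the noise-and-threshold step applied before anything is released.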


Discovering frequent words with SecAggIBLT.

For the Gboard use case, stronger privacy is desired because user inputs come from a large set of possibilities that potentially contain sensitive information. For example, English language users may type words or phrases of arbitrary length composed of Latin characters, digits, or other special characters. These inputs could contain personal information such as usernames and credit card numbers. Because SecAggIBLT can surface such unique strings, it relies on the server correctly applying central DP after SecAgg to protect user privacy; on its own, it does not prevent a curious server from inspecting discovered OOV words and potentially accessing sensitive information. This motivated us to develop algorithms that discover frequent OOV words with stronger data minimization and DP guarantees.

To that end, we built on an existing line of work to develop LDP-TrieHH, which learns frequent words by iteratively building a trie (prefix tree) data structure. LDP-TrieHH provides strong data minimization and strict local DP (LDP) guarantees during the data collection process. When applying the LDP-TrieHH algorithm to a specific language, such as English spoken in the U.S. and Indonesian, each layer of the trie stores a set of common prefixes corresponding to the depth of that layer. The trie is built iteratively, starting from the root up to its maximum length, which is 15 for both English and Indonesian. At each layer, we collect responses from a group of users, who only contribute by indicating one character after a common prefix from the previous layer. For example, if “CO” is a common prefix that the algorithm learned in the previous layer, and a user types a word “COVID-19”, the user will only contribute their data by submitting a vote on “COV” instead of the entire word “COVID-19”, which reduces the amount of information that is revealed from the voting process.
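The layer-by-layer voting can be sketched as follows. This is a simplified illustration with a hypothetical vote threshold and without the local DP noise or per-user participation limits used in production; each entry of `layers_of_words` stands in for the votes of a disjoint group of users at that layer:

```python
from collections import Counter

MAX_DEPTH = 16   # 15 characters plus an end-of-word marker
THRESHOLD = 3    # hypothetical vote threshold; production adds DP noise
END = "$"        # marks a completed word (assumed absent from real words)

def build_trie(layers_of_words):
    """Iteratively grow a trie of common prefixes, one layer at a time.

    At layer d, a user whose word extends a prefix learned at layer d-1
    votes for just one extra character -- never the whole word."""
    learned = {""}        # layer 0: the root (empty prefix)
    completed = set()
    for depth, words in enumerate(layers_of_words[:MAX_DEPTH], start=1):
        votes = Counter()
        for word in words:
            w = word + END
            if depth <= len(w) and w[:depth - 1] in learned:
                votes[w[:depth]] += 1   # reveal one character beyond the prefix
        # Keep only prefixes with enough support; the rest are dropped.
        learned = {p for p, c in votes.items() if c >= THRESHOLD}
        completed |= {p[:-1] for p in learned if p.endswith(END)}
        if not learned:
            break
    return completed
```

With this sketch, a word like “covid” is learned character by character ("c", then "co", then "cov", ...), and it is only released once enough distinct users have supported every prefix along the way.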


Discovering the frequent word “COVID-19” from user inputs with LDP-TrieHH.

On top of this, we further protect the privacy of user votes by minimizing users’ participation (each user participates in the voting phase of at most one layer), bounding the number of votes each user can contribute (an average of one word per day over a 60 day period), and adding local noise to users’ votes to provide a rigorous LDP guarantee (ε = 10.0 per word). For this, we use the privacy protection mechanism of Subset Selection, which offers the optimal utility-privacy tradeoff under LDP. At each layer, we collect votes from a large group of users (500K per layer), and the votes are aggregated and thresholded to filter out infrequent prefixes. With this additional data processing step, through the analysis of privacy amplification via aggregation, LDP-TrieHH offers a central DP guarantee of (ε = 0.315, δ = 1e-10) per word with each user contributing at most 60 words in 60 days (i.e., an average of one word per day). To improve the coverage of discovered words, we sequentially run LDP-TrieHH multiple times to build several tries with disjoint sets of users. In later runs, we ask users to only contribute OOV words that have not been learned from previous runs, to utilize each user’s contribution budget more efficiently. With LDP-TrieHH, we are able to discover words that account for 16.8% of the OOV words for English and 17.5% of the OOV words for Indonesian. More details are provided in this report.
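A simplified sketch of the Subset Selection mechanism and the matching debiasing step is below. The domain size, sample counts, and helper names are illustrative assumptions, and the production analysis (including amplification via aggregation) is more involved. Each user reports a random subset of the candidate votes; the true vote is included with exactly the probability that yields an ε-LDP guarantee:

```python
import math
import random

def _params(k: int, eps: float):
    # Subset size s and the probability that the true item is reported.
    s = max(1, round(k / (math.exp(eps) + 1)))
    p_in = s * math.exp(eps) / (s * math.exp(eps) + k - s)
    return s, p_in

def subset_select(true_idx: int, k: int, eps: float, rng: random.Random):
    """Report a random size-s subset of the k candidate votes."""
    s, p_in = _params(k, eps)
    others = [i for i in range(k) if i != true_idx]
    if rng.random() < p_in:
        # Include the true item plus s-1 random decoys.
        return {true_idx} | set(rng.sample(others, s - 1))
    # Otherwise report s items, none of them the true one.
    return set(rng.sample(others, s))

def estimate_counts(reports, k: int, eps: float):
    """Debias the aggregated noisy reports into frequency estimates."""
    n = len(reports)
    s, p_in = _params(k, eps)
    # Probability that a *non*-true item appears in a report.
    q = p_in * (s - 1) / (k - 1) + (1 - p_in) * s / (k - 1)
    raw = [0] * k
    for report in reports:
        for i in report:
            raw[i] += 1
    return [(raw[i] - n * q) / (p_in - q) for i in range(k)]
```

For a large ε such as 10.0, the reported subset shrinks to a single item and the mechanism behaves much like randomized response; for smaller ε, the subset grows and each report reveals correspondingly less about the true vote.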

Scaling to more languages with verifiable privacy via TEEs

The ability to privately discover OOV words using the LDP-TrieHH approach relies on the large number (millions) of active English and Indonesian language users in Gboard. For languages with smaller populations, however, LDP-TrieHH will necessarily have lower accuracy. To better scale across languages, including lower-usage languages, Gboard is now leveraging server-side processing of federated data in trusted execution environments (TEEs), beginning with experiments to validate the approach on both synthetic and real data.

TEEs are secure extensions of common processors that facilitate confidentiality, integrity, and verifiability of workloads through embedded secret cryptographic keys signed by a hardware manufacturer. Our in-development system, described in this white paper, enables devices to verify that securely uploaded data can be decrypted only within a TEE-protected process, that this process only releases privatized aggregates, and that the data cannot be accessed for any other purpose. The TEE approach is augmented with DP to provide privacy similar to LDP-TrieHH, with improved scalability, and with robustness to similar privacy threats. More updates will come over the coming months.

Acknowledgements

The authors would like to thank Adria Gascon, Peter Kairouz, Gary Sivek, and Ananda Theertha Suresh for their extensive feedback and editing on the blog post itself, Tom Small and John Guilyard for helping with the animated figures, and the teams at Google that helped with algorithm design, infrastructure implementation, and production maintenance.

In particular, we would like to thank the collaborators who directly contributed to this effort: Eugene Bagdasaryan, Adria Gascon, Peter Kairouz, Ananda Theertha Suresh, and Wennan Zhu for their extensive support in research and development of federated analytics algorithms; Carbo Kuo and Gary Sivek for their contribution to the collaboration with linguists; Badih Ghazi, Ravi Kumar, Pasin Manurangsi, Rasmus Pagh, Amer Sinha, and Ameya Velingker for their contribution to the development of SecAggIBLT; and Marco Gruteser, Brendan McMahan, Daniel Ramage, Michael Riley, Shumin Zhai, and Yuanbo Zhang for their leadership and support.