
Shanshan Wu
Currently working on federated learning and personalization. Previously received my PhD from UT Austin: https://wushanshan.github.io/
Authored Publications
Sort By
Synthesizing and Adapting Error Correction Data for Mobile Large Language Model Applications
Yanxiang Zhang
Zheng Xu
Yuanbo Zhang
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 6: Industry Track) (2025)
Preview abstract
Error correction is an important capability when applying large language models (LLMs) to facilitate user typing on mobile devices. In this paper, we use LLMs to synthesize a high-quality dataset of error correction pairs to evaluate and improve LLMs for mobile applications. We first prompt LLMs with error correction domain knowledge to build a scalable and reliable addition to the existing data synthesis pipeline. We then adapt the synthetic data distribution to match the mobile application domain by reweighting the samples. The reweighting model is learnt by predicting (a handful of) live A/B test metrics when deploying LLMs in production, given the LLM performance on offline evaluation data and scores from a small privacy-preserving on-device language model. Finally, we present best practices for mixing our synthetic data with other data sources to improve model performance on error correction in both offline evaluation and production live A/B testing.
View details
Synthesizing Privacy-Preserving Text Data via Finetuning without Finetuning Billion-Scale LLMs
Bowen Tan
Zheng Xu
Eric Xing
Zhiting Hu
International Conference on Machine Learning (ICML) (2025)
Preview abstract
Synthetic data offers a promising path to train models while preserving data privacy. Differentially private (DP) finetuning of large language models (LLMs) as data generator is effective, but is impractical when computation resources are limited. Meanwhile, prompt-based methods such as private evolution depend heavily on the manual prompts, and ineffectively use private information in their iterative data selection process. To overcome these limitations, we propose CTCL (Data Synthesis with ConTrollability and CLustering), a novel framework for generating privacy-preserving synthetic data without extensive prompt engineering or billion-scale LLM finetuning. CTCL pretrains a lightweight 140M conditional generator and a clustering-based topic model on large-scale public data. To further adapt to the private domain, the generator is DP finetuned on private data for fine-grained textual information, while the topic model extracts a DP histogram representing distributional information. The DP generator then samples according to the DP histogram to synthesize a desired number of data examples. Evaluation across five diverse domains demonstrates the effectiveness of our framework, particularly in the strong privacy regime. Systematic ablation validates the design of each framework component and highlights the scalability of our approach.
View details
Preview abstract
Personalization methods in federated learning aim to balance the benefits of federated and local training for data availability, communication cost, and robustness to client heterogeneity. Approaches that require clients to communicate all model parameters can be undesirable due to privacy and communication constraints. Other approaches require always-available or stateful clients, impractical in large-scale cross-device settings. We introduce Federated Reconstruction, the first model-agnostic framework for partially local federated learning suitable for training and inference at scale. We motivate the framework via a connection to model-agnostic meta learning, empirically demonstrate its performance over existing approaches for collaborative filtering and next word prediction, and release an open-source library for evaluating approaches in this setting. We also describe the successful deployment of this approach at scale for federated collaborative filtering in a mobile keyboard application.
View details
A Field Guide to Federated Optimization
Jianyu Wang
Zheng Xu
Gauri Joshi
Maruan Al-Shedivat
Galen Andrew
A. Salman Avestimehr
Katharine Daly
Deepesh Data
Suhas Diggavi
Hubert Eichner
Advait Gadhikar
Antonious M. Girgis
Filip Hanzely
Chaoyang He
Samuel Horvath
Martin Jaggi
Tara Javidi
Satyen Chandrakant Kale
Sai Praneeth Karimireddy
Jakub Konečný
Sanmi Koyejo
Tian Li
Peter Richtarik
Karan Singhal
Virginia Smith
Mahdi Soltanolkotabi
Weikang Song
Sebastian Stich
Ameet Talwalkar
Hongyi Wang
Blake Woodworth
Honglin Yuan
Manzil Zaheer
Mi Zhang
Tong Zhang
Chunxiang (Jake) Zheng
Chen Zhu
arxiv (2021)
Preview abstract
Federated learning and analytics are a distributed approach for collaboratively learning models (or statistics) from decentralized data, motivated by and designed for privacy protection. The distributed learning process can be formulated as solving federated optimization problems, which emphasize communication efficiency, data heterogeneity, compatibility with privacy and system requirements, and other constraints that are not primary considerations in other problem settings. This paper provides recommendations and guidelines on formulating, designing, evaluating and analyzing federated optimization algorithms through concrete examples and practical implementation, with a focus on conducting effective simulations to infer real-world performance. The goal of this work is not to survey the current literature, but to inspire researchers and practitioners to design federated learning algorithms that can be used in various practical applications.
View details