Mike Colagrosso

Authored Publications
    Improving Recommendation Quality at Google Drive
    Suming Jeremiah Chen
    Zachary Teal Wilson
    Brian Lee Calaci
    Ryan Lee Evans
    Sean Robert Abraham
    26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD) (2020)
    Quick Access is a machine-learned system in Google Drive that predicts which files a user wants to open. Adding Quick Access recommendations to the Drive homepage cut the time that users spend locating their files in half. Aggregated over the ~1 billion users of Drive, the time saved adds up to ~1000 work weeks every day. In this paper, we discuss the challenges of iteratively improving the quality of a personal recommendation system and the variety of approaches that we took to improve this feature. We explored different deep network architectures, novel modeling techniques, additional data sources, and the effects of latency and biases in the UX. We share both pitfalls and successes in our attempts to improve this product, and also discuss how we scaled and managed the complexity of the system. We believe that these insights will be especially useful to those who are working with private corpora and to those who are building a large-scale production recommendation system.
    Learning to Cluster Documents into Workspaces Using Large Scale Activity Logs
    Proceedings of the 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD ’20), ACM (2020), 2416–2424
    Google Drive is widely used for managing personal and work-related documents in the cloud. To help users organize their documents in Google Drive, we developed a new feature, called a workspace, that lets users create a set of working files for easy ongoing access. A workspace is a cluster of documents, but unlike a typical document cluster, it contains documents that are not only topically coherent but also useful in the user's ongoing tasks. To alleviate the burden of creating workspaces manually, we automatically cluster documents into suggested workspaces. We go beyond the textual similarity-based unsupervised clustering paradigm and instead learn directly from users’ activity for document clustering. More specifically, we extract co-access signals (i.e., whether a user accessed two documents around the same time) to measure document relatedness. We then use a neural document similarity model that incorporates text, metadata, and co-access features. Since human labels are often difficult or expensive to collect, we extract weak labels from co-access data at large scale for model training. Our offline and online experiments on Google Drive show that (a) co-access features are very effective for document clustering; (b) our weakly supervised clustering achieves comparable or even better performance than models trained with human labels; and (c) the weakly supervised method leads to better workspace suggestions, which users accept more often in the production system than those from baseline approaches.
    Machine Learning (ML) is a critical component of several novel applications and of intelligent features in existing applications. Recent advances in deep learning have fundamentally advanced the state-of-the-art in several areas of research and made it easier to apply ML to a wide variety of problems. However, applied ML projects in industry, where the objective is to build and improve a production feature that uses ML, continue to be complicated and are often bottlenecked by data management challenges. In this paper, we describe the design and implementation of a machine learning platform for building learned ranking services that leverages key ideas from data management. The platform allows engineers to focus on application-specific modeling and simplifies the key tasks of 1) gathering training data, 2) cleaning, validating, and monitoring data quality, 3) training and evaluating models, 4) feature lifecycle management, and 5) infrastructure for A/B tests. We describe key design choices anchored around the core idea of optimizing for experiment velocity. We describe lessons learned from applications built on this platform that have been in production serving hundreds of millions of users for over a year. Finally, we identify two key components of the platform where data management research can have a major impact. We believe such platforms have the potential to accelerate and simplify ML applications the same way data warehouses radically simplified complex reporting applications.
    Quick Access: Building a Smart Experience for Google Drive
    Alexandrin Popescul
    Julian Gibbons
    Alan Green
    Michael James Smith
    Cayden Meyer
    Reuben Kan
    Proc. of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2017), pp. 1643-1651
    Google Drive is a cloud storage and collaboration service used by hundreds of millions of users around the world. Quick Access is a new feature in Google Drive that surfaces relevant documents to the user on the home page. We describe the development of a machine-learned service behind this feature. Our metrics show that this feature cuts the time it takes for users to locate their documents in half. The development of this product feature illustrates a number of more general challenges and constraints associated with machine learning product deployment, such as dealing with private corpora and protecting user privacy; working with data services that are not designed with machine learning in mind and may be owned and operated by different teams with different constraints; and evolving product definitions, which inform the metric being optimized. We believe that the lessons learned from this experience will be useful to practitioners tackling a wide range of applied machine-learning problems.