Learning to Cluster Documents into Workspaces Using Large Scale Activity Logs

Weize Kong; Mike Bendersky; Marc Najork; Brandon Vargo; Mike Colagrosso

Learning to Cluster Documents into Workspaces Using Large Scale Activity Logs

Weize Kong

Mike Bendersky

Marc Najork

Brandon Vargo

Mike Colagrosso

Proceedings of the 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD ’20), ACM (2020), 2416–2424

Download Google Scholar

Abstract

Google Drive is widely used for managing personal and work-related documents in the cloud. To help users organize their documents in Google Drive, we develop a new feature to allow users to create a set of working files for ongoing easy access, called workspace. A workspace is a cluster of documents, but unlike a typical document cluster, it contains documents that are not only topically coherent, but are also useful in the ongoing user tasks.
To alleviate the burden of creating workspaces manually, we automatically cluster documents into suggested workspaces. We go beyond the textual similarity-based unsupervised clustering paradigm and instead directly learn from users’ activity for document clustering. More specifically, we extract co-access signals (i.e., whether a user accessed two documents around the same time) to measure document relatedness. We then use a neural document similarity model that incorporates text, metadata, as well as co-access features. Since human labels are often difficult or expensive to collect, we extract weak labels based on co-access data at large scale for model training. Our offline and online experiments based on Google Drive show that (a) co-access features are very effective for document clustering; (b) our weakly supervised clustering achieves comparable or even better performance compared to the models trained with human labels; and (c) the weakly supervised method leads to better workspace suggestions that the users accept more often in the production system than baseline approaches.

Explore our many areas of focus

Building a collaborative ecosystem

Shaping the future together

Translating discovery into real-world impact

Learning to Cluster Documents into Workspaces Using Large Scale Activity Logs

Abstract

Research Areas

Meet the teams driving innovation

Google AI

Google Cloud

Google DeepMind

Google Labs