Jump to Content

Data Mining and Modeling

The proliferation of machine learning means that learned classifiers lie at the core of many products across Google. However, questions in practice are rarely so clean as to just to use an out-of-the-box algorithm. A big challenge is in developing metrics, designing experimental methodologies, and modeling the space to create parsimonious representations that capture the fundamentals of the problem. These problems cut across Google’s products and services, from designing experiments for testing new auction algorithms to developing automated metrics to measure the quality of a road map.

Data mining lies at the heart of many of these questions, and the research done at Google is at the forefront of the field. Whether it is finding more efficient algorithms for working with massive data sets, developing privacy-preserving methods for classification, or designing new machine learning approaches, our group continues to push the boundary of what is possible.

Recent Publications

Preview abstract Many geographic information systems applications rely on the data provided by user devices in the road network. Such applications include traffic monitoring, driving navigation, detecting road closures or the construction of new roads, etc. This signal is collected by sampling locations from the user trajectories and is a critical process for all such systems. Yet, it has not been sufficiently studied in the literature. The most natural way to sample a trajectory is perhaps using a frequency based algorithm, e.g., sample every $x$ seconds. However, as we argue in this paper, such a simple strategy can be very wasteful in terms of resources (e.g., server-side processing, user battery) and in terms of the amount of user data that it maintains. In this work we conduct a horizontal study of various location sampling algorithms (including frequency-based, road geography-based, reservoir-sampling based, etc.) and extract their trade-offs in terms of various metrics of interest, such as, the size of the stored data and the induced quality of training for prediction tasks (e.g., predicting speeds) using the road network of New York City. View details
First Passage Percolation with Queried Hints
Kritkorn Karntikoon
Aaron Schild
Yiheng Shen
Ali Sinop
AISTATS (2024)
Preview abstract Optimization problems are ubiquitous throughout the modern world. In many of these applications, the input is inherently noisy and it is expensive to probe all of the noise in the input before solving the relevant optimization problem. In this work, we study how much of that noise needs to be queried in order to obtain an approximately optimal solution to the relevant problem. We focus on the shortest path problem in graphs, where one may think of the noise as coming from real-time traffic. We consider the following model: start with a weighted base graph $G$ and multiply each edge weight by an independently chosen, uniformly random number in $[1,2]$ to obtain a random graph $G'$. This model is called \emph{first passage percolation}. Mathematicians have studied this model extensively when $G$ is a $d$-dimensional grid graph, but the behavior of shortest paths in this model is still poorly understood in general graphs. We make progress in this direction for a class of graphs that resembles real-world road networks. Specifically, we prove that if the geometric realization of $G$ has constant doubling dimension, then for a given $s-t$ pair, we only need to probe the weights on $((\log n) / \epsilon)^{O(1)}$ edges in $G'$ in order to obtain a $(1 + \epsilon)$-approximation to the $s-t$ distance in $G'$. We also demonstrate experimentally that this result is pessimistic -- one can even obtain a short path in $G'$ with a small number of probes to $G'$. View details
Preview abstract Computing efficient traffic signal plans is often based on the amount of traffic in an intersection, its distribution over the various intersection movements and hours as well as on performance metrics such as traffic delay. In their simple and typical form plans are fixed in the same hour over weekdays. This allows low operation costs without the necessity for traffic detection and monitoring tools. A critical factor on the potential efficiency of such plans is the similarity of traffic patterns over the days along each of the intersection movements. We refer to such similarity as the traffic stability of the intersection and define simple metrics to measure it based on traffic volume and traffic delay. In this paper, we propose an automatic probe data based method, for city-wide estimation of traffic stability. We discuss how such measures can be used for signal planning such as in selecting plan resolution or as an indication as which intersections can benefit from dynamic but expensive traffic detection tools. We also identify events of major changes in traffic characteristics of an intersection. We demonstrate the framework by using real traffic statistics to study the traffic stability in the city of Haifa along its 162 intersections. We study the impact of the time of day on the stability, detect major changes in traffic and find intersections with high and low stability. View details
Unsupervised Event Chain Mining from Multiple Documents
Yizhu Jiao
Ming Zhong
Yunyi Zhang
Chao Zhang
Jiawei Han
Proceedings of the ACM Web Conference 2023
Preview abstract Massive and fast-evolving news articles keep emerging on the web. To effectively summarize and provide concise insights into real-world events, we propose a new event knowledge extraction task Event Chain Mining in this paper. Given multiple documents about a super event, it aims to mine a series of salient events in temporal order. For example, the event chain of super event "Mexico Earthquake in 2017" is {"earthquake hit Mexico", "destroy houses", "kill people", "block roads"}. This task can help readers capture the gist of texts quickly, thereby improving reading efficiency and deepening text comprehension. To address this task, we regard an event as a cluster of different mentions of similar meanings. In this way, we can identify the different expressions of events, enrich their semantic knowledge and replenish relation information among them. Taking events as the basic unit, we present a novel unsupervised framework, EMiner. Specifically, we extract event mentions from texts and merge them with similar meanings into a cluster as a single event. By jointly incorporating both content and commonsense, essential events are then selected and arranged chronologically to form an event chain. Meanwhile, we annotate a multi-document benchmark to build a comprehensive testbed for the proposed task. Extensive experiments are conducted to verify the effectiveness of EMiner in terms of both automatic and human evaluations. View details
Preview abstract In this study, we explore the use of a hybrid approach in online surveys, combining traditional form-based closed-ended questions with open-ended questions administered by a chatbot. We trained a chatbot using OpenAI's GPT-3 language model to produce context-dependent probes to responses given to open-ended questions. The goal was to mimic a typical professional survey interviewer scenario where the interviewer is trained to probe the respondent when answering an open-ended question. For example, assume this initial exchange: “What did you find hard to use or frustrating when using Google Maps?” “It wasn't easy to find the address we were looking for” The chatbot would follow-up with “What made it hard to find the address?” or “What about it made it difficult to find?” or “What steps did you take to find it?”. The experiment consisted of a Qualtrics survey with 1,200 participants, who were randomly assigned to one of two groups. Both groups answered closed-ended questions, but the final open-ended question differed between the groups, with one group receiving a chatbot and the other group receiving a single open-ended question. The results showed that using a chatbot resulted in higher quality and more detailed responses compared to the single open-ended question approach, and respondents indicated a preference towards using a chatbot to open-ended questions. However, respondents also noted the importance of avoiding repetitive probes and expressed dislike for the uncertainty around the number of required exchanges. This hybrid approach has the potential to provide valuable insights for survey practitioners, although there is room for improvement in the conversation flow. View details
Business Intelligence Career Master Plan
Danny Moncada
Eduardo Chavez
Packt (2023) (to appear)
Preview abstract Navigating the challenging path of a business intelligence career requires considering individual expertise, interest, and skills. This book explores key skills like data modeling, visualization, and warehousing, alongside organizational structures, technology stacks, coursework, certifications, and interview advice, thus enabling readers to make informed decisions about their BI journey. The book will begin by assessing the different roles of BI and provide an in-depth walkthrough of the roadmap while helping you match your skills and career with the tech stack in business intelligence. The book will then teach you to build taxonomy and a data story using visualization types. You would be learning the fundamentals of programming, frontend development, backend development, software development lifecycle, and project management to give you a broad-level view of the end-to-end BI process. This book will also help you to identify what subjects and areas are crucial to study and what does not add much value to your BI skill set. You would be able to find the right job fit and build a business problem and data solutions matrix. By the end of this book, you would be able to make an informed, well-thought-out decision on which of the myriad paths to choose in your business intelligence journey. View details