Jayant Madhavan
Authored Publications
Ten Years of Web Tables
Michael J. Cafarella
Alon Halevy
Cong Yu
Daisy Zhe Wang
Eugene Wu
PVLDB (2018)
In 2008, we wrote about WebTables, an effort to exploit the large and diverse set of structured databases casually published online in the form of HTML tables. The past decade has seen a flurry of research and commercial activity around the WebTables project itself, as well as the broad topic of informal online structured data. As exciting as the past decade has been, we think the next ten years hold even more promise. In this paper, we will review the WebTables project, and try to place it in the broader context of the decade of work that followed. We will also propose an agenda for the next ten exciting years of work, a project that can draw upon many unexpected corners of the data management community.
Using SSDs to scale up Google Fusion Tables, a Database-in-the-Cloud
Yingyi Bu
Changkyu Kim
32nd IEEE International Conference on Data Engineering, ICDE 2016, Helsinki, Finland, May 16-20, 2016, IEEE Computer Society, pp. 1263-1274
Flash memory solid state drives (SSDs) have increasingly been advocated and adopted as a means of speeding up and scaling up data-driven applications. SSDs are becoming more widely available as an option in the cloud. However, when an application considers SSDs in the cloud, the best option for the application may not be immediate, among a number of choices for placing SSDs in the layers of the cloud. Although there have been many studies on SSDs, they often concern a specific setting, and how different SSD options in the cloud compare with each other is less well understood. In this paper, we describe how Google Fusion Tables (GFT) used SSDs and what optimizations were implemented to scale up its in-memory processing, clearly showing opportunities and limitations of SSDs in the cloud with quantitative analyses. We first discuss various SSD placement strategies and compare them with low-level measurements, and propose SSD-placement guidelines for a variety of cloud data services. We then present internals of our column engine and optimizations to better use the performance characteristics of SSDs. We empirically demonstrate that the optimizations enable us to scale our application to much larger datasets while retaining the low-latency and simple query processing architecture.
Applying WebTables in Practice
Sreeram Balakrishnan
Alon Halevy
Boulos Harb
Warren Shen
Kenneth Wilder
Fei Wu
Cong Yu
Conference on Innovative Data Systems Research (2015)
Recent Progress Towards an Ecosystem of Structured Data on the Web
Nitin Gupta
Alon Y. Halevy
Boulos Harb
Fei Wu
Cong Yu
ICDE (2013), pp. 5-8
Efficient spatial sampling of large geographical tables
Anish Das Sarma
Hector Gonzalez
Alon Halevy
Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, ACM, New York, NY, USA, pp. 193-204
Large-scale map visualization systems play an increasingly important role in presenting geographic datasets to end users. Since these datasets can be extremely large, a map rendering system often needs to select a small fraction of the data to visualize them in a limited space. This paper addresses the fundamental challenge of thinning: determining appropriate samples of data to be shown on specific geographical regions and zoom levels. Other than the sheer scale of the data, the thinning problem is challenging because of a number of other reasons: (1) data can consist of complex geographical shapes, (2) rendering of data needs to satisfy certain constraints, such as data being preserved across zoom levels and adjacent regions, and (3) after satisfying the constraints, an optimal solution needs to be chosen based on objectives such as maximality, fairness, and importance of data.
This paper formally defines and presents a complete solution to the thinning problem. First, we express the problem as an integer programming formulation that efficiently solves thinning for desired objectives. Second, we present more efficient solutions for maximality, based on DFS traversal of a spatial tree. Third, we consider the common special case of point datasets, and present an even more efficient randomized algorithm. Finally, we have implemented all techniques from this paper in Google Maps visualizations of Fusion Tables, and we describe a set of experiments that demonstrate the tradeoffs among the algorithms.
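As a rough illustration of the point-dataset case mentioned in the abstract above, the following Python sketch shows one simple way to thin points so that anything visible at a coarse zoom level stays visible at every finer level. It is not the algorithm from the paper: the grid-cell scheme, the per-cell cap `k`, the unit-square coordinates, and the function names (`assign_priorities`, `cell_of`, `thin`) are all assumptions made for this example, and the adjacent-region constraint is not modeled.

```python
import random
from collections import defaultdict


def assign_priorities(points, seed=0):
    """Give every point one fixed random priority (hypothetical helper)."""
    rng = random.Random(seed)
    return {p: rng.random() for p in points}


def cell_of(point, zoom):
    """Map a point in the unit square [0, 1) x [0, 1) to its grid cell.

    Assumption: zoom level z splits the square into 2**z by 2**z cells,
    so each cell at level z + 1 lies inside exactly one cell at level z.
    """
    x, y = point
    n = 2 ** zoom
    return (int(x * n), int(y * n))


def thin(points, priorities, zoom, k):
    """Keep at most k points per cell: those with the highest priority."""
    cells = defaultdict(list)
    for p in points:
        cells[cell_of(p, zoom)].append(p)
    visible = set()
    for members in cells.values():
        members.sort(key=priorities.__getitem__, reverse=True)
        visible.update(members[:k])
    return visible


if __name__ == "__main__":
    rng = random.Random(1)
    pts = [(rng.random(), rng.random()) for _ in range(200)]
    prio = assign_priorities(pts, seed=2)
    for z in range(4):
        print(z, len(thin(pts, prio, z, k=3)))
```

Because priorities are fixed once and finer cells nest inside coarser ones, a point that ranks in the top `k` of a coarse cell also ranks in the top `k` of the smaller cell that contains it, so zooming in never removes a point that was already shown.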
Big Data Storytelling Through Interactive Maps
Sreeram Balakrishnan
Kathryn Hurley
Hector Gonzalez
Nitin Gupta
Alon Halevy
Karen Jacqmin-Adams
Anno Langen
Rod McChesney
Rebecca Shapley
Warren Shen
IEEE Data Engineering Bulletin, 35 (2012), pp. 46-54
Recovering Semantics of Tables on the Web
Petros Venetis
Alon Y. Halevy
Marius Pasca
Warren Shen
Fei Wu
Gengxin Miao
Proceedings of the VLDB Endowment, 4 (2011), pp. 528-538
Clustering Query Refinements by User Intent
Eldar Sadikov
Lu Wang
Alon Halevy
Proceedings of the International World Wide Web Conference (WWW) (2010)
We address the problem of clustering the refinements of a user search query. The clusters computed by our proposed algorithm can be used to improve the selection and placement of the query suggestions proposed by a search engine, and can also serve to summarize the different aspects of information relevant to the original user query. Our algorithm clusters refinements based on their likely underlying user intents by combining document click and session co-occurrence information. At its core, our algorithm operates by performing multiple random walks on a Markov graph that approximates user search behavior. A user study performed on top search engine queries shows that our clusters are rated better than corresponding clusters computed using approaches that use only document click or only session co-occurrence information.
Google Fusion Tables: Data Management, Integration, and Collaboration in the Cloud
Hector Gonzalez
Alon Halevy
Christian Jensen
Anno Langen
Rebecca Shapley
Warren Shen
Proceedings of the ACM Symposium on Cloud Computing (SOCC) (2010)
Google Fusion Tables is a cloud-based service for data management and integration. Fusion Tables enables users to upload tabular data files (spreadsheets, CSV, KML), currently of up to 100MB. The system provides several ways of visualizing the data (e.g., charts, maps, and timelines) and the ability to filter and aggregate the data. It supports the integration of data from multiple sources by performing joins across tables that may belong to different users. Users can keep the data private, share it with a select set of collaborators, or make it public and thus crawlable by search engines. The discussion feature of Fusion Tables allows collaborators to conduct detailed discussions of the data at the level of tables and individual rows, columns, and cells. This paper describes the inner workings of Fusion Tables, including the storage of data in the system and the tight integration with the Google Maps infrastructure.