Jayant Madhavan

Authored Publications
    Ten Years of Web Tables
    Michael J. Cafarella
    Alon Halevy
    Cong Yu
    Daisy Zhe Wang
    Eugene Wu
    PVLDB (2018)
    Abstract: In 2008, we wrote about WebTables, an effort to exploit the large and diverse set of structured databases casually published online in the form of HTML tables. The past decade has seen a flurry of research and commercial activity around the WebTables project itself, as well as the broad topic of informal online structured data. As exciting as the past decade has been, we think the next ten years hold even more promise. In this paper, we will review the WebTables project, and try to place it in the broader context of the decade of work that followed. We will also propose an agenda for the next ten exciting years of work, a project that can draw upon many unexpected corners of the data management community.
    Using SSDs to scale up Google Fusion Tables, a Database-in-the-Cloud
    Yingyi Bu
    Changkyu Kim
    32nd IEEE International Conference on Data Engineering, ICDE 2016, Helsinki, Finland, May 16-20, 2016, IEEE Computer Society, pp. 1263-1274
    Abstract: Flash memory solid state drives (SSDs) have increasingly been advocated and adopted as a means of speeding up and scaling up data-driven applications. SSDs are becoming more widely available as an option in the cloud. However, when an application considers SSDs in the cloud, the best option for the application may not be immediate, among a number of choices for placing SSDs in the layers of the cloud. Although there have been many studies on SSDs, they often concern a specific setting, and how different SSD options in the cloud compare with each other is less well understood. In this paper, we describe how Google Fusion Tables (GFT) used SSDs and what optimizations were implemented to scale up its in-memory processing, clearly showing opportunities and limitations of SSDs in the cloud with quantitative analyses. We first discuss various SSD placement strategies and compare them with low-level measurements, and propose SSD-placement guidelines for a variety of cloud data services. We then present internals of our column engine and optimizations to better use the performance characteristics of SSDs. We empirically demonstrate that the optimizations enable us to scale our application to much larger datasets while retaining the low-latency and simple query processing architecture.
    Applying WebTables in Practice
    Sreeram Balakrishnan
    Alon Halevy
    Boulos Harb
    Warren Shen
    Kenneth Wilder
    Fei Wu
    Cong Yu
    Conference on Innovative Data Systems Research (2015)
    Big Data Storytelling Through Interactive Maps
    Sreeram Balakrishnan
    Kathryn Hurley
    Hector Gonzalez
    Nitin Gupta
    Alon Halevy
    Karen Jacqmin-Adams
    Anno Langen
    Rod McChesney
    Rebecca Shapley
    Warren Shen
    IEEE Data Engineering Bulletin, vol. 35 (2012), pp. 46-54
    Efficient Spatial Sampling of Large Geographical Tables
    Anish Das Sarma
    Hector Gonzalez
    Alon Y. Halevy
    SIGMOD (2012)
    Abstract: Large-scale map visualization systems play an increasingly important role in presenting geographic datasets to end users. Since these datasets can be extremely large, a map rendering system often needs to select a small fraction of the data to visualize them in a limited space. This paper addresses the fundamental challenge of thinning: determining appropriate samples of data to be shown on specific geographical regions and zoom levels. Other than the sheer scale of the data, the thinning problem is challenging because of a number of other reasons: (1) data can consist of complex geographical shapes, (2) rendering of data needs to satisfy certain constraints, such as data being preserved across zoom levels and adjacent regions, and (3) after satisfying the constraints, an optimal solution needs to be chosen based on objectives such as maximality, fairness, and importance of data. This paper formally defines and presents a complete solution to the thinning problem. First, we express the problem as an integer programming formulation that efficiently solves thinning for desired objectives. Second, we present more efficient solutions for maximality, based on DFS traversal of a spatial tree. Third, we consider the common special case of point datasets, and present an even more efficient randomized algorithm. Finally, we have implemented all techniques from this paper in Google Maps visualizations of Fusion Tables, and we describe a set of experiments that demonstrate the tradeoffs among the algorithms.
    Efficient spatial sampling of large geographical tables
    Anish Das Sarma
    Hector Gonzalez
    Alon Halevy
    Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, ACM, New York, NY, USA, pp. 193-204
    Recovering Semantics of Tables on the Web
    Petros Venetis
    Alon Y. Halevy
    Warren Shen
    Fei Wu
    Gengxin Miao
    Proceedings of the VLDB Endowment, vol. 4 (2011), pp. 528-538
    Google Fusion Tables: Data Management, Integration, and Collaboration in the Cloud
    Hector Gonzalez
    Alon Halevy
    Christian Jensen
    Anno Langen
    Rebecca Shapley
    Warren Shen
    Proceedings of the ACM Symposium on Cloud Computing (SOCC) (2010)
    Abstract: Google Fusion Tables is a cloud-based service for data management and integration. Fusion Tables enables users to upload tabular data files (spreadsheets, CSV, KML), currently of up to 100MB. The system provides several ways of visualizing the data (e.g., charts, maps, and timelines) and the ability to filter and aggregate the data. It supports the integration of data from multiple sources by performing joins across tables that may belong to different users. Users can keep the data private, share it with a select set of collaborators, or make it public and thus crawlable by search engines. The discussion feature of Fusion Tables allows collaborators to conduct detailed discussions of the data at the level of tables and individual rows, columns, and cells. This paper describes the inner workings of Fusion Tables, including the storage of data in the system and the tight integration with the Google Maps infrastructure.
    Clustering Query Refinements by User Intent
    Eldar Sadikov
    Lu Wang
    Alon Halevy
    Proceedings of the International World Wide Web Conference (WWW) (2010)
    Abstract: We address the problem of clustering the refinements of a user search query. The clusters computed by our proposed algorithm can be used to improve the selection and placement of the query suggestions proposed by a search engine, and can also serve to summarize the different aspects of information relevant to the original user query. Our algorithm clusters refinements based on their likely underlying user intents by combining document click and session co-occurrence information. At its core, our algorithm operates by performing multiple random walks on a Markov graph that approximates user search behavior. A user study performed on top search engine queries shows that our clusters are rated better than corresponding clusters computed using approaches that use only document click or only session co-occurrence information.
    Google Fusion Tables: Web-Centered Data Management and Collaboration
    Hector Gonzalez
    Alon Halevy
    Christian Jensen
    Anno Langen
    Rebecca Shapley
    Warren Shen
    Jonathan Goldberg-Kidon
    Proceedings of the ACM SIGMOD conference, ACM (2010)
    Abstract: It has long been observed that database management systems focus on traditional business applications, and that few people use a database management system outside their workplace. Many have wondered what it will take to enable the use of data management technology by a broader class of users and for a much wider range of applications. Google Fusion Tables represents an initial answer to the question of how data management functionality that is focused on enabling new users and applications would look in today's computing environment. This paper characterizes such users and applications and highlights the resulting principles, such as seamless Web integration, emphasis on ease of use, and incentives for data sharing, that underlie the design of Fusion Tables. We describe key novel features, such as the support for data acquisition, collaboration, visualization, and web-publishing.
    Harvesting Relational Tables from Lists on the Web
    Hazem Elmeleegy
    Alon Halevy
    Proceedings of the VLDB Endowment (PVLDB) (2009), pp. 1078-1089
    Exploring Schema Repositories with Schemr
    Kuang Chen
    Alon Halevy
    Proceedings of the ACM SIGMOD conference (2009), pp. 1095-1098
    Harnessing the Deep Web: Present and Future
    Loredana Afanasiev
    Lyublena Antova
    Alon Halevy
    Proceedings of the Conference on Innovative Data Systems Research (CIDR) (2009)
    Web-scale extraction of structured data
    Michael Cafarella
    Alon Halevy
    SIGMOD Record, vol. 37(4) (2008), pp. 55-61
    Google's Deep-Web Crawl
    David Ko
    Lucja Kot
    Vignesh Ganapathy
    Alex Rasmussen
    Alon Halevy
    Proceedings of the International Conference on Very Large Databases (VLDB) (2008)
    Web-scale Data Integration: You can only afford to Pay As You Go
    Shawn R. Jeffery
    Shirley Cohen
    Xin (Luna) Dong
    David Ko
    Cong Yu
    Alon Halevy
    CIDR (2007)
    Abstract: The World Wide Web is witnessing an increase in the amount of structured content: vast heterogeneous collections of structured data are on the rise due to the Deep Web, annotation schemes like Flickr, and sites like Google Base. While this phenomenon is creating an opportunity for structured data management, dealing with heterogeneity on the web-scale presents many new challenges. In this paper, we highlight these challenges in two scenarios: the Deep Web and Google Base. We contend that traditional data integration techniques are no longer valid in the face of such heterogeneity and scale. We propose a new data integration architecture, PAYGO, which is inspired by the concept of dataspaces and emphasizes pay-as-you-go data management as a means of achieving web-scale data integration.
    Structured Data Meets the Web: A Few Observations
    Alon Halevy
    Shirley Cohen
    Xin (Luna) Dong
    Shawn R. Jeffery
    David Ko
    Cong Yu
    Data Engineering Bulletin (2006)
    Abstract: The World Wide Web is witnessing an increase in the amount of structured content: vast heterogeneous collections of structured data are on the rise due to the Deep Web, annotation schemes like Flickr, and sites like Google Base. While this phenomenon is creating an opportunity for structured data management, dealing with heterogeneity on the web-scale presents many new challenges. In this paper, we articulate challenges based on our experience with addressing them at Google, and offer some principles for addressing them in a general fashion.
    Personal information management with SEMEX
    Yuhan Cai
    Xin Luna Dong
    Alon Y. Halevy
    Jing Michelle Liu
    Proceedings of the ACM SIGMOD conference (2005), pp. 921-923
    Reference Reconciliation in Complex Information Spaces
    Xin Dong
    Alon Y. Halevy
    SIGMOD Conference (2005), pp. 85-96
    Corpus-based Schema Matching
    Philip A. Bernstein
    AnHai Doan
    Alon Y. Halevy
    ICDE (2005), pp. 57-68
    Ontology Matching: A Machine Learning Approach
    AnHai Doan
    Pedro Domingos
    Alon Y. Halevy
    Handbook on Ontologies (2004), pp. 385-404
    Mining structures for semantics
    Xin Dong
    Alon Y. Halevy
    SIGKDD Explorations, vol. 6 (2004), pp. 53-60
    Similarity Search for Web Services
    Xin Dong
    Alon Y. Halevy
    Ema Nemes
    Jun Zhang
    VLDB (2004), pp. 372-383
    The Piazza Peer Data Management System
    Alon Y. Halevy
    Zachary G. Ives
    Peter Mork
    Dan Suciu
    Igor Tatarinov
    IEEE Transactions on Knowledge & Data Engineering, vol. 16 (2004), pp. 787-798
    Composing Mappings Among Data Sources
    Alon Y. Halevy
    VLDB (2003), pp. 572-583
    Learning to map between ontologies on the semantic web
    AnHai Doan
    Pedro Domingos
    Alon Y. Halevy
    WWW (2002), pp. 662-673
    Representing and Reasoning about Mappings between Domain Models
    Philip A. Bernstein
    Pedro Domingos
    Alon Y. Halevy
    AAAI/IAAI (2002), pp. 80-86
    Generic Schema Matching with Cupid
    Philip A. Bernstein
    Erhard Rahm
    VLDB (2001), pp. 49-58