
Flip Korn

Authored Publications
    Search engines including Google are beginning to support local-dining queries such as "At which nearby restaurants can I order the Indonesian salad gado-gado?". Given the low coverage of online menus worldwide, with only 30% of restaurants even having a website, this remains a challenge. Here we leverage the power of the crowd: online users who are willing to answer questions about dish availability at restaurants they have visited. While motivated users are happy to contribute knowledge for free, they are much less likely to respond to "silly" or embarrassing questions (e.g., "Does Pizza Hut serve pizza?" or "Does Mike's Vegan Restaurant serve hamburgers?"). In this paper, we study the problem of Vexation-Aware Active Learning, where judiciously selected questions are targeted towards improving restaurant-dish model prediction, subject to a limit on the percentage of "unsure" answers or "dismissals" (e.g., swiping the app closed) used to measure vexation. We formalize the problem as an integer linear program and solve it efficiently using a distributed solution that scales linearly with the number of candidate questions. Since our algorithm relies on precise estimation of the unsure-dismiss rate (UDR), we give a regression model that provides accurate results compared to baselines including collaborative filtering. Finally, we demonstrate in a live system that our proposed vexation-aware strategy performs competitively against classical (margin-based) active learning approaches while not exceeding UDR bounds.
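    The selection step can be illustrated with a small sketch: choose questions that maximize expected model improvement subject to a budget on expected unsure/dismiss responses. The paper formulates this as an integer linear program solved in a distributed fashion; the greedy heuristic below is only a minimal stand-in, and the per-question `gain` and `udr` estimates are assumed inputs.

```python
# Minimal sketch of vexation-aware question selection (not the paper's
# distributed ILP solver). Assumes each candidate question i has an
# estimated model-improvement gain[i] and an estimated unsure-dismiss
# rate udr[i]; we pick at most k questions whose average UDR stays
# below a bound tau.

def select_questions(gain, udr, k, tau):
    """Greedy stand-in for the ILP: maximize total gain while keeping
    the average UDR of the chosen questions below tau."""
    # Sort candidates by gain per unit of vexation, highest first.
    order = sorted(range(len(gain)),
                   key=lambda i: gain[i] / max(udr[i], 1e-9), reverse=True)
    chosen, udr_sum = [], 0.0
    for i in order:
        if len(chosen) == k:
            break
        # Only add the question if the running average UDR stays bounded.
        if (udr_sum + udr[i]) / (len(chosen) + 1) <= tau:
            chosen.append(i)
            udr_sum += udr[i]
    return chosen

# Toy usage: 5 candidate questions.
print(select_questions(gain=[0.9, 0.4, 0.7, 0.2, 0.5],
                       udr=[0.30, 0.05, 0.10, 0.02, 0.20],
                       k=3, tau=0.12))
```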
    Successful knowledge graphs (KGs) solved the historical knowledge acquisition bottleneck by supplanting an expert focus with a simple, crowd-friendly one: KG nodes represent popular people, places, organizations, etc., and the graph arcs represent common-sense relations like affiliations, locations, etc. Techniques for more general, categorical KG curation do not seem to have made the same transition: the KG research community is still largely focused on logic-based methods that belie the common-sense characteristics of successful KGs. In this paper, we propose a simple yet novel approach to acquiring class-level attributes from the crowd that represent broad common-sense associations between categories, and can be used with the classic knowledge-base default-and-override technique (e.g., Reiter, 1978) to address the early label sparsity problem faced by machine learning systems for problems that lack training data. We demonstrate the effectiveness of our acquisition and reasoning approach on a pair of very real industrial-scale problems: how to augment an existing KG of places and offerings (e.g., stores and products, restaurants and dishes) with associations between them indicating the availability of the offerings at those places, which would enable the KG to answer questions like "Where can I buy milk nearby?" This problem has several practical challenges, but in this paper we focus mostly on label sparsity. Less than 30% of physical places worldwide (i.e., brick-and-mortar stores and restaurants) have a website, and less than half of those list their product catalog or menus, leaving a large acquisition gap to be filled by methods other than information extraction (IE). Label sparsity is a general problem, not specific to these use cases, that prevents modern AI and machine learning techniques from applying to many applications for which labeled data is not readily available. As a result, the study of how to acquire the knowledge and data needed for AI to work is as much a problem today as it was in the 1970s and 80s during the advent of expert systems (e.g., MYCIN, 1975). The class-level attributes approach presented here is based on a KG-inspired intuition that much of the knowledge people need to understand where to go to buy a product they need, or where to find the dishes they want to eat, is categorical and part of their general common sense: everyone knows grocery stores sell milk and don't sell asphalt, Chinese restaurants serve fried rice and not hamburgers, etc. We acquired a mixture of instance- and class-level pairs (e.g., ⟨Ajay Mittal Dairy, milk⟩ and ⟨GroceryStore, milk⟩, respectively) from a novel 3-tier crowdsourcing method, and demonstrate the scalability advantages of the class-level approach. Our results show that crowdsourced class-level knowledge can provide rapid scaling of knowledge acquisition in shopping and dining domains. The acquired common-sense knowledge also has long-term value in the KG. The approach was a critical part of enabling a worldwide local search capability on Google Maps, with which users can find products and dishes that are available in most places on earth.
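    A minimal sketch of the default-and-override reasoning the abstract refers to: class-level pairs act as defaults for every instance of a category, and instance-level pairs override them. The relation names and data below are illustrative, not from the paper.

```python
# Default & override over class-level and instance-level attributes.
# Class-level pairs (category, offering) act as defaults; instance-level
# pairs (place, offering) override them. Data is illustrative.

CLASS_DEFAULTS = {
    ("GroceryStore", "milk"): True,
    ("GroceryStore", "asphalt"): False,
    ("ChineseRestaurant", "fried rice"): True,
    ("ChineseRestaurant", "hamburger"): False,
}

INSTANCE_OVERRIDES = {
    # This particular store happens not to carry milk.
    ("Corner Mart", "milk"): False,
}

PLACE_CATEGORY = {
    "Ajay Mittal Dairy": "GroceryStore",
    "Corner Mart": "GroceryStore",
}

def offers(place, offering):
    """Instance-level knowledge wins; otherwise fall back to the
    class-level default for the place's category; otherwise unknown."""
    if (place, offering) in INSTANCE_OVERRIDES:
        return INSTANCE_OVERRIDES[(place, offering)]
    category = PLACE_CATEGORY.get(place)
    return CLASS_DEFAULTS.get((category, offering))  # None = unknown

print(offers("Ajay Mittal Dairy", "milk"))  # True, via class default
print(offers("Corner Mart", "milk"))        # False, instance override
```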
    Scalable Building Height Estimation from Street Scene Images
    Yunxiang Zhao
    Jianzhong Qi
    Xiangyu Wang
    IEEE Transactions on Geoscience and Remote Sensing (TGRS), vol. 60 (2022)
    Preview abstract Building height estimation plays an essential role in many applications such as 3D city rendering, urban planning, and navigation. Recently, a new building height estimation method was proposed using street scene images and 2D maps, which is more scalable than traditional methods that use high-resolution optical images, RADAR, or LiDAR data that are proprietary or expensive to obtain. The method needs to detect building rooflines to compute building height via the pinhole camera model. We observe that this method has limitations in handling complex street scene images where buildings occlude each other or are blocked by other objects such as trees, since rooflines can be difficult to locate. To address these limitations, we propose a robust building height estimation method that computes building height simultaneously from street scene images oriented along the street and from images facing the building with an upward-looking view. We first detect roofline candidates from both types of images. Then, we use a deep neural network called RoofNet to classify and filter these candidates and select the best candidate via an entropy-based ranking algorithm. When the true roofline is identified, we compute building height via the pinhole camera model. Experimental results show that the proposed RoofNet model yields higher accuracy on building corner and roofline candidate filtering compared with state-of-the-art open set classifiers. As a result, our overall building height estimation method is more accurate than the baseline by up to 11.9%.
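    Once a roofline is identified in the image, the pinhole camera model yields the building height from simple geometry. The sketch below is a hedged illustration under idealized assumptions (known camera height, pixel focal length, flat ground, and distance to the facade); it is not the paper's implementation, and all parameter names are invented.

```python
import math

def building_height(v_roof, v_principal, focal_px, cam_height_m,
                    distance_m, pitch_rad=0.0):
    """Estimate building height via the pinhole camera model.

    v_roof       -- image row (pixels) of the detected roofline
    v_principal  -- image row of the principal point (optical center)
    focal_px     -- focal length in pixels
    cam_height_m -- camera height above ground (meters)
    distance_m   -- horizontal distance from camera to facade (meters)
    pitch_rad    -- camera pitch (0 = level view)

    Assumes flat ground and a vertical facade; illustrative only.
    """
    # Angle of the roofline above the optical axis (image rows grow
    # downward, so a roofline above center has v_roof < v_principal).
    angle = math.atan2(v_principal - v_roof, focal_px) + pitch_rad
    # Height of the roof above the camera, plus the camera's own height.
    return cam_height_m + distance_m * math.tan(angle)

# Toy example: roofline 400 px above center, f = 1000 px,
# camera 2.5 m high, facade 30 m away -> 14.5 m.
print(round(building_height(100, 500, 1000.0, 2.5, 30.0), 2))
```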
    Collocating Structured Web Tables with News Articles
    Alyssa Whitlock Lees
    Cong Yu
    Levy de Souza Silva
    Luciano Barbosa
    WWW 2021 International Workshop on News Recommendation and Intelligence (2021) (to appear)
    Preview abstract With the increased frequency of critical headline updates and news information published on the web, it can be overwhelming for users to understand and verify articles in a larger context. In particular, crucial statistics or facts within a news story may be difficult to parse without comparison to other entities. Structured web tables, as found in a source such as Wikipedia, offer users an important resource for understanding the news. Displaying tables or charts alongside news documents can assist in comprehension of document content. However, automatically finding relevant and meaningful connections between web tables and a given news article is a challenging research problem. We are not aware of any research directly addressing this problem; however, one can think of the news article as a (long) keyword query and apply information retrieval, or question-answering, techniques. Previous work in this area used embeddings of KB entities and different metrics for semantic similarity for table lookup. Our experiments applying these baseline approaches in a straightforward way to this task render spurious results that are inappropriate or irrelevant for a reader. In this paper, we build on prior efforts, focusing specifically on the task of matching Wikipedia web tables to news articles. Our contribution includes a survey of existing techniques applied to the news-to-table matching problem. From these baselines, we propose a new model that leverages recent advances in bidirectional transformer language models along with entity-based table embeddings. Specifically, our technique contains three technical components. First, we construct a training data set built from news article follow-up queries to Wikipedia articles over a large aggregate of users. Second, we extract unique web table based categories from Google's Knowledge Graph that describe Wikipedia table column components. Third, we fine-tune a Bidirectional Encoder Representations from Transformers (BERT) model, pre-trained on news corpus data. Using human curators as a standard for creating an evaluation set, our approach significantly outperforms the baselines.
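    As a hedged illustration of the matching model described above, the sketch below scores an (article, table description) pair with a fine-tunable BERT cross-encoder via the Hugging Face transformers library; the model name, pairing scheme, and example texts are assumptions, not the paper's exact setup or training data.

```python
# Sketch: score a (news article, table description) pair with a BERT
# cross-encoder. Illustrative only; the paper's model, features, and
# fine-tuning data differ. Requires: pip install torch transformers
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)  # 2 classes: match / no match
model.eval()

article = "City officials reported record ridership on the new subway line."
table_desc = "Wikipedia table: annual ridership of metro systems by city."

# Encode article and table description as a sentence pair.
inputs = tokenizer(article, table_desc, truncation=True,
                   padding=True, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
match_prob = torch.softmax(logits, dim=-1)[0, 1].item()
print(f"match probability (untrained head, illustrative): {match_prob:.3f}")
```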
    Preview abstract Successful knowledge graphs (KGs) solved the historical knowledge acquisition bottleneck by supplanting an expert focus with a simple, crowd-friendly one: KG nodes represent popular people, places, organizations, etc., and the graph arcs represent common-sense relations like affiliations, locations, etc. Techniques for more general, categorical KG curation do not seem to have made the same transition: the KG research community is still largely focused on methods that belie the common-sense characteristics of successful KGs. In this paper, we propose a simple approach to acquiring and reasoning with class-level attributes from the crowd that represent broad common-sense associations between categories. We pick a very real industrial-scale data set and problem: how to augment an existing knowledge graph of places and products with associations between them indicating the availability of the products at those places, which would enable a KG to provide answers to questions like "Where can I buy milk nearby?" This problem has several practical challenges, not least of which is that only 30% of physical stores (i.e., brick-and-mortar stores) have a website, and fewer list their product inventory, leaving a large acquisition gap to be filled by methods other than information extraction (IE). Based on a KG-inspired intuition that a lot of the class-level pairs are part of people's general common sense, e.g., everyone knows grocery stores sell milk and don't sell asphalt, we acquired a mixture of instance- and class-level pairs (e.g., ⟨Ajay Mittal Dairy, milk⟩ and ⟨GroceryStore, milk⟩, respectively) from a novel 3-tier crowdsourcing method, and demonstrate the scalability advantages of the class-level approach. Our results show that crowdsourced class-level knowledge can provide rapid scaling of knowledge acquisition in this and similar domains, as well as long-term value in the KG.
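    Complementing the default-and-override sketch above, here is a minimal illustration of the query side: answering "Where can I buy milk nearby?" by filtering nearby places through the availability relation. All names, distances, and data structures are hypothetical.

```python
# Sketch: answer "Where can I buy milk nearby?" from class-level
# defaults plus instance overrides. Data and names are hypothetical.

PLACES = {  # place -> (category, distance in km)
    "Ajay Mittal Dairy": ("GroceryStore", 0.4),
    "Corner Mart": ("GroceryStore", 0.2),
    "Paver Depot": ("HardwareStore", 0.3),
}
CLASS_SELLS = {("GroceryStore", "milk"): True}
OVERRIDES = {("Corner Mart", "milk"): False}  # this store dropped milk

def nearby_sellers(offering, max_km):
    hits = []
    for place, (category, dist) in PLACES.items():
        # Instance-level knowledge wins; else the class-level default.
        sells = OVERRIDES.get((place, offering),
                              CLASS_SELLS.get((category, offering), False))
        if sells and dist <= max_km:
            hits.append((dist, place))
    return [p for _, p in sorted(hits)]

print(nearby_sellers("milk", 1.0))  # ['Ajay Mittal Dairy']
```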
    Preview abstract Modern search engines increasingly incorporate tabular content, which consists of a set of entities each augmented with a small set of facts. The facts can be obtained from multiple sources: an entity's knowledge base entry, the infobox on its Wikipedia page, or its row within a WebTable. Crucially, the informativeness of a fact depends not only on the entity but also the specific context (e.g., the query). To the best of our knowledge, this paper is the first to study the problem of contextual fact ranking: given some entities and a context (i.e., a succinct natural language description), identify the most informative facts for the entities collectively within the context. We propose to contextually rank the facts by exploiting deep learning techniques. In particular, we develop pointwise and pairwise ranking models, using textual and statistical information for the given entities and context derived from their sources. We enhance the models by incorporating entity type information from an IsA (hypernym) database. We demonstrate that our approaches achieve better performance than state-of-the-art baselines in terms of MAP, NDCG, and recall. We further conduct user studies for two specific applications of contextual fact ranking, table synthesis and table compression, and show that our models can identify more informative facts than the baselines.
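    A hedged sketch of the pairwise ranking idea: train a scorer so that, within a given context, a more informative fact outranks a less informative one via a margin ranking loss. The feature encoding and network are placeholders, not the paper's model.

```python
# Sketch: pairwise ranking of facts with a margin loss (PyTorch).
# Random feature vectors stand in for the paper's textual, statistical,
# and entity-type features. Requires: pip install torch
import torch
import torch.nn as nn

class FactScorer(nn.Module):
    """Tiny MLP mapping a fact's feature vector to a scalar score."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 32), nn.ReLU(),
                                 nn.Linear(32, 1))
    def forward(self, x):
        return self.net(x).squeeze(-1)

scorer = FactScorer(dim=8)
loss_fn = nn.MarginRankingLoss(margin=1.0)
opt = torch.optim.Adam(scorer.parameters(), lr=1e-3)

# Each training pair: features of a more informative fact (pos) and a
# less informative one (neg) for the same entities + context.
pos = torch.randn(64, 8)
neg = torch.randn(64, 8)
target = torch.ones(64)  # "pos should score higher than neg"

for _ in range(100):
    opt.zero_grad()
    loss = loss_fn(scorer(pos), scorer(neg), target)
    loss.backward()
    opt.step()
print(f"final pairwise loss: {loss.item():.3f}")
```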
    Preview abstract Modern search engines provide contextual information surrounding query entities beyond "ten blue links" in the form of knowledge cards. Among the various attributes displayed about entities, there has been recent interest in providing trivia due to observed engagement rates. Obtaining such trivia at a large scale is, however, non-trivial: hiring professional content creators is expensive, and extracting statements from the Web can result in unreliable or uninteresting facts. In this paper we show how fun facts can be mined from tables on the Web to provide a large volume of reliable and interesting content. We employ a template-based approach to generate statements that are post-processed by workers. We show how to bootstrap and streamline the process for faster and cheaper task completion. However, the content contained in these tables is dynamic; therefore, we address the problem of automatically maintaining templates when tables are updated.
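    A minimal sketch of template-based statement generation from a web table, plus the kind of regeneration needed when the underlying table changes; the template and data are invented for illustration.

```python
# Sketch: generate a "fun fact" from a table via a template, and
# regenerate when the table changes. Template and data are invented.

TABLE = [  # (building, height in meters)
    ("Burj Khalifa", 828),
    ("Merdeka 118", 679),
    ("Shanghai Tower", 632),
]
TEMPLATE = ("Did you know that {name}, at {height} m, "
            "is the tallest building in this table?")

def fun_fact(rows):
    # Instantiate the template with the row that maximizes height.
    name, height = max(rows, key=lambda r: r[1])
    return TEMPLATE.format(name=name, height=height)

print(fun_fact(TABLE))
# When the table is updated, re-instantiating the template keeps the
# statement fresh instead of letting it go stale.
TABLE.append(("Hypothetical Tower", 900))
print(fun_fact(TABLE))
```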
    Preview abstract With the support of major search platforms such as Google and Bing, fact-checking articles, which can be identified by their adoption of the schema.org ClaimReview structured markup, have gained widespread recognition for their role in the fight against digital misinformation. A claim-relevant document is an online document that addresses, and potentially expresses a stance towards, some claim. The claim-relevance discovery problem, then, is to find claim-relevant documents. Depending on the verdict from the fact check, claim-relevance discovery can help identify online misinformation. In this paper, we provide an initial approach to the claim-relevance discovery problem by leveraging various information retrieval and machine learning techniques. The system consists of three phases. First, we retrieve candidate documents based on various features in the fact-checking article. Second, we apply a relevance classifier to filter away documents that do not address the claim. Third, we apply a language-feature-based classifier to distinguish documents with different stances towards the claim. We experimentally demonstrate that our solution achieves solid results on a large-scale dataset and beats state-of-the-art baselines. Finally, we highlight a rich set of case studies to demonstrate the myriad of remaining challenges and that this problem is far from being solved.
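    The three-phase system described above can be sketched as a simple pipeline: candidate retrieval, relevance filtering, then stance classification. The keyword-overlap retriever and cue-word classifiers below are toy stand-ins for the paper's trained components, and the example data is invented.

```python
# Sketch of the three-phase claim-relevance pipeline. The retriever and
# classifiers are toy stand-ins for the paper's components.

def retrieve_candidates(claim, corpus):
    """Phase 1: keyword-overlap retrieval (stand-in for IR features)."""
    claim_terms = set(claim.lower().split())
    scored = [(len(claim_terms & set(doc.lower().split())), doc)
              for doc in corpus]
    return [doc for score, doc in sorted(scored, reverse=True) if score > 0]

def is_relevant(claim, doc):
    """Phase 2: relevance filter (stand-in for a trained classifier)."""
    overlap = set(claim.lower().split()) & set(doc.lower().split())
    return len(overlap) >= 3

def stance(doc):
    """Phase 3: stance via cue words (stand-in for a language model)."""
    text = doc.lower()
    if any(w in text for w in ("false", "hoax", "debunked")):
        return "against"
    if any(w in text for w in ("confirmed", "true", "verified")):
        return "for"
    return "discuss"

claim = "drinking bleach cures the flu"
corpus = [
    "Doctors say drinking bleach does not cure the flu and is a hoax.",
    "Flu season tips: rest and fluids.",
    "Viral post claims drinking bleach cures the flu; experts weigh in.",
]
for doc in retrieve_candidates(claim, corpus):
    if is_relevant(claim, doc):
        print(stance(doc), "->", doc)
```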
    Investigating Rumor News Using Agreement-Aware Search
    Jingbo Shang
    Jiaming Shen
    Tianhang Sun
    Xingbang Liu
    Anja Gruenheid
    Cong Yu
    Jiawei Han
    CIKM (2018)
    Preview abstract Recent years have witnessed a widespread increase of rumor news generated by humans and machines in order to attract readership, influence opinions, and increase click-through revenue. Therefore, tools for investigating rumor news have become an urgent necessity. One useful function of such tools is to show how a specific topic or event is represented across sources by presenting different points of view. In this paper, we propose Maester, a novel agreement-aware search framework for investigating rumor news. Given an investigative question, Maester retrieves articles related to that question, then assigns and displays top articles from agree, disagree, and discuss categories to users. Splitting the results into these three categories provides the user a holistic view of the investigative question. We build Maester based on the following two key observations: (1) relatedness can commonly be determined by keywords and entities occurring in both questions and articles, and (2) the level of agreement between the investigative question and a related news article can often be decided by a few key sentences. Accordingly, we use gradient boosting tree models with keyword/entity matching features for relatedness detection, and leverage a recurrent neural network to infer the level of agreement. Our experiments on the Fake News Challenge (FNC) "stance detection" dataset demonstrate up to an order of magnitude improvement of Maester over the original FNC winning solution for agreement-aware search.
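    As a hedged sketch of the relatedness component only, the snippet below derives keyword-overlap features for (question, article) pairs and feeds them to a gradient boosting classifier with scikit-learn; the features and training data are illustrative, not the paper's.

```python
# Sketch: keyword-overlap features + gradient boosted trees for
# question/article relatedness. Features and data are illustrative.
# Requires: pip install scikit-learn
from sklearn.ensemble import GradientBoostingClassifier

def overlap_features(question, article):
    """Simple overlap statistics between question and article tokens."""
    q, a = set(question.lower().split()), set(article.lower().split())
    inter = q & a
    return [len(inter), len(inter) / max(len(q), 1), len(inter) / max(len(a), 1)]

pairs = [  # (question, article, related?)
    ("did the mayor resign", "The mayor announced her resignation today.", 1),
    ("did the mayor resign", "Local bakery wins regional pastry award.", 0),
    ("is the bridge closed", "Officials confirmed the bridge is closed for repairs.", 1),
    ("is the bridge closed", "Stock markets rallied on earnings news.", 0),
]
X = [overlap_features(q, a) for q, a, _ in pairs]
y = [label for _, _, label in pairs]

clf = GradientBoostingClassifier().fit(X, y)
print(clf.predict([overlap_features("did the mayor resign",
                                    "Sources say the mayor will resign.")]))
```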
    Knowledge Exploration using Tables on the Web
    Fernando Chirigati
    Cong Yu
    Proceedings of the VLDB Endowment, vol. 10 (2017), pp. 193-204
    Preview abstract The increasing popularity of mobile device usage has ushered in many features in modern search engines that help users with various information needs. One of those needs is knowledge exploration, where related documents are returned in response to a user query, either directly through right-hand side knowledge panels or indirectly through navigable sections underneath individual search results. Existing knowledge exploration features have relied on a combination of knowledge bases and query logs. In this paper, we propose Knowledge Carousels of two modalities, namely sideways and downwards, that facilitate exploration of IS-A and HAS-A relationships, respectively, with regard to an entity-seeking query, based on leveraging the large corpus of tables on the Web. This brings many technical challenges, including associating correct carousels with the search entity, selecting the best carousel from the candidates, and finding titles that best describe the carousel. We describe how we address these challenges and experimentally demonstrate through user studies that our approach produces better result sets than baseline approaches.
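    A minimal sketch of the two carousel modalities under simplifying assumptions: sideways surfaces IS-A peers of the query entity (co-occurring values in the same table column), while downwards surfaces HAS-A components (subjects of a table about the entity). The tiny table corpus and the exact selection logic are invented, not the paper's.

```python
# Sketch: sideways (IS-A peers) and downwards (HAS-A parts) carousels
# from a toy web-table corpus. Tables and heuristics are invented.

TABLES = [
    {"title": "Planets of the Solar System",
     "subject_col": ["Mercury", "Venus", "Earth", "Mars", "Jupiter"]},
    {"title": "Moons of Mars",
     "subject_col": ["Phobos", "Deimos"]},
]

def sideways(entity):
    """IS-A peers: other values in columns where the entity appears."""
    peers = []
    for t in TABLES:
        if entity in t["subject_col"]:
            peers.extend(e for e in t["subject_col"] if e != entity)
    return peers

def downwards(entity):
    """HAS-A parts: subjects of tables whose title mentions the entity."""
    parts = []
    for t in TABLES:
        if entity.lower() in t["title"].lower():
            parts.extend(t["subject_col"])
    return parts

print(sideways("Mars"))   # peers: the other planets
print(downwards("Mars"))  # parts: Phobos, Deimos
```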
    Goods: Organizing Google's Datasets
    Alon Halevy
    Christopher Olston
    Neoklis Polyzotis
    Sudip Roy
    Steven Euijong Whang
    SIGMOD (2016)
    Preview abstract Enterprises increasingly rely on structured datasets to run their businesses. These datasets take a variety of forms, such as structured files, databases, spreadsheets, or even services that provide access to the data. The datasets often reside in different storage systems, may vary in their formats, and may change every day. In this paper, we present Goods, a project to rethink how we organize structured datasets at scale, in a setting where teams use diverse and often idiosyncratic ways to produce the datasets and where there is no centralized system for storing and querying them. Goods extracts metadata ranging from salient information about each dataset (owners, timestamps, schema) to relationships among datasets, such as similarity and provenance. It then exposes this metadata through services that allow engineers to find datasets within the company, to monitor datasets, to annotate them in order to enable others to use them, and to analyze relationships between them. We discuss the technical challenges that we had to overcome in order to crawl and infer the metadata for billions of datasets, to maintain the consistency of our metadata catalog at scale, and to expose the metadata to users. We believe that many of the lessons we learned are applicable to building large-scale enterprise-level data management systems in general.
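    A hedged sketch of the kind of per-dataset metadata record the abstract describes (owners, timestamps, schema, provenance relationships); the field names and catalog structure are illustrative, not Goods' actual schema.

```python
# Sketch: a per-dataset catalog entry with salient metadata and
# provenance edges, as described in the abstract. Field names are
# illustrative, not Goods' actual schema. Requires Python 3.9+.
from dataclasses import dataclass, field

@dataclass
class DatasetEntry:
    path: str
    owners: list[str]
    created_ts: float
    schema: dict[str, str]                    # column name -> type
    reads_from: list[str] = field(default_factory=list)  # provenance
    written_by: str = ""                      # producing job/pipeline

catalog: dict[str, DatasetEntry] = {}
catalog["/warehouse/clicks_daily"] = DatasetEntry(
    path="/warehouse/clicks_daily",
    owners=["ads-team"],
    created_ts=1.7e9,
    schema={"url": "string", "clicks": "int64"},
    reads_from=["/logs/raw_clicks"],
    written_by="clicks_rollup_pipeline",
)
# A simple catalog query: which datasets derive from the raw click logs?
print([p for p, e in catalog.items() if "/logs/raw_clicks" in e.reads_from])
```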
    Managing Google’s data lake: an overview of the Goods system
    Chris Olston
    Neoklis Polyzotis
    Steven Whang
    Sudip Roy
    IEEE Data Engineering Bulletin, vol. 39 (3) (2016), pp. 5
    Preview abstract For most large enterprises today, data constitutes their core assets, along with code and infrastructure. Indeed, for most enterprises, the amount of data that they produce internally has exploded. At the same time, in many cases, engineers and data scientists do not use centralized data-management systems and end up creating what has become known as a data lake: a collection of datasets that often are not well organized, or not organized at all, and where one needs to "fish" for useful datasets. In this paper, we describe our experience building and deploying Goods, a system to manage Google's internal data lake. Goods crawls Google's internal infrastructure and builds a catalog of discovered datasets, including structured files, databases, spreadsheets, and even services that provide access to the data. Goods extracts metadata about datasets in a post-hoc way: engineers continue to generate and organize datasets in the same way as they have before, and Goods provides value without disrupting teams' practices. The technical challenges that we had to address resulted both from the scale and heterogeneity of Google's data lake and from our decision to extract metadata in a post-hoc manner. We believe that many of the lessons we learned are applicable to building large-scale enterprise-level data-management systems in general.
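    As a hedged illustration of post-hoc metadata extraction, the sketch below infers provenance edges by joining job execution logs (which paths each job read and wrote) against crawled dataset paths; the log format is assumed for illustration and is not Goods' actual input.

```python
# Sketch: post-hoc provenance inference from job execution logs.
# Log format is assumed for illustration; Goods' real inputs differ.

JOB_LOGS = [  # (job, paths read, paths written)
    ("rollup_job", ["/logs/raw_clicks"], ["/warehouse/clicks_daily"]),
    ("report_job", ["/warehouse/clicks_daily"], ["/reports/weekly_clicks"]),
]

def provenance_edges(job_logs):
    """Every (read, written) pair becomes a derives-from edge."""
    edges = []
    for job, reads, writes in job_logs:
        for src in reads:
            for dst in writes:
                edges.append((src, dst, job))
    return edges

for src, dst, job in provenance_edges(JOB_LOGS):
    print(f"{dst} derives from {src} (via {job})")
```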
    Fractal Modeling of IP Network Traffic at Streaming Speeds
    S. Muthukrishnan
    Yihua Wu
    ICDE (2006), pp. 155
    Space- and time-efficient deterministic algorithms for biased quantiles over data streams
    Graham Cormode
    S. Muthukrishnan
    Divesh Srivastava
    PODS (2006), pp. 263-272
    Modeling skew in data streams
    S. Muthukrishnan
    Yihua Wu
    SIGMOD Conference (2006), pp. 181-192
    Effective Computation of Biased Quantiles over Data Streams
    Graham Cormode
    S. Muthukrishnan
    Divesh Srivastava
    ICDE (2005), pp. 20-31
    Diamond in the Rough: Finding Hierarchical Heavy Hitters in Multi-Dimensional Data
    Graham Cormode
    S. Muthukrishnan
    Divesh Srivastava
    SIGMOD Conference (2004), pp. 155-166
    Holistic UDAFs at streaming speeds
    Graham Cormode
    Theodore Johnson
    S. Muthukrishnan
    Oliver Spatscheck
    Divesh Srivastava
    SIGMOD Conference (2004), pp. 35-46
    Finding Hierarchical Heavy Hitters in Data Streams
    Graham Cormode
    S. Muthukrishnan
    Divesh Srivastava
    VLDB (2003), pp. 464-475
    IPSOFACTO: A Visual Correlation Tool for Aggregate Network Traffic Data
    S. Muthukrishnan
    Yunyue Zhu
    SIGMOD Conference (2003), pp. 677
    Checks and Balances: Monitoring Data Quality Problems in Network Traffic Databases
    S. Muthukrishnan
    Yunyue Zhu
    VLDB (2003), pp. 536-547
    Generalized substring selectivity estimation
    Zhiyuan Chen
    Nick Koudas
    S. Muthukrishnan
    J. Comput. Syst. Sci., vol. 66 (2003), pp. 98-132
    Efficient Approximation of Correlated Sums on Data Streams
    Rohit Ananthakrishna
    Johannes Gehrke
    S. Muthukrishnan
    Divesh Srivastava
    IEEE Trans. Knowl. Data Eng., vol. 15 (2003), pp. 569-572
    Reverse Nearest Neighbor Aggregates Over Data Streams
    S. Muthukrishnan
    Divesh Srivastava
    VLDB (2002), pp. 814-825
    Counting Twig Matches in a Tree
    Zhiyuan Chen
    H. V. Jagadish
    Nick Koudas
    S. Muthukrishnan
    Raymond T. Ng
    Divesh Srivastava
    ICDE (2001), pp. 595-604
    Selectivity Estimation for Boolean Queries
    Zhiyuan Chen
    Nick Koudas
    S. Muthukrishnan
    PODS (2000), pp. 216-225
    Influence Sets Based on Reverse Nearest Neighbor Queries
    S. Muthukrishnan
    SIGMOD Conference (2000), pp. 201-212