Natasha Noy
I am a research scientist at Google Research where I work on making structured data on the Web, in all its different forms, more accessible and useful. Our team has developed Google Dataset Search, which enables users to find datasets stored across the Web.
Prior to joining Google, I worked in the Protege group at the Stanford Center for Biomedical Informatics Research. Our team developed an ontology-editing and management platform that is used by hundreds of thousands of users. While at Stanford, I worked in the areas of the Semantic Web, ontology development and alignment, and collaborative ontology engineering.
I studied Applied Math at Moscow State University, received my MS in Computer Science from Boston University, and received my PhD from Northeastern University. For a list of my pre-Google publications, please see my profile on Google Scholar.
Authored Publications
Discovering Datasets on the Web Scale: Challenges and Recommendations for Google Dataset Search
Daniel Russell
Stella Dugall
Harvard Data Science Review (2024)
Preview abstract
With the rise of open data in the last two decades, more datasets are online and more people are using them for projects and research. But how do people find datasets? We present the first user study of Google Dataset Search, a dataset-discovery tool that uses a web crawl and open ecosystem to find datasets. Google Dataset Search contains a superset of the datasets in other dataset-discovery tools—a total of 45 million datasets from 13,000 sources. We found that the tool addresses a previously identified need: a search engine for datasets across the entire web, including datasets in other tools. However, the tool introduced new challenges due to its open approach: building a mental model of the tool, making sense of heterogeneous datasets, and learning how to search for datasets. We discuss recommendations for dataset-discovery tools and open research questions.
View details
Preview abstract
n/a (a Viewpoint article)
View details
Dataset or Not? A study on the veracity of semantic markup for dataset pages
Tarfah Alrashed
Omar Benjelloun
20th International Semantic Web Conference (ISWC 2021) (to appear)
Preview abstract
Semantic markup, such as Schema.org, allows providers on the Web to describe content using a shared controlled vocabulary. This markup is invaluable in enabling a broad range of applications, from vertical search engines, to rich snippets in search results, to actions on emails, to many others. In this paper, we focus on semantic markup for datasets, specifically in the context of developing a vertical search engine for datasets on the Web, Google’s Dataset Search. Dataset Search relies on Schema.org to identify pages that describe datasets. While Schema.org was the core enabling technology for this vertical search, we also discovered that we need to address the following problem: pages from 61% of internet hosts that provide Schema.org/Dataset markup do not actually describe datasets. We analyze the veracity of dataset markup for Dataset Search’s Web-scale corpus and categorize pages where this markup is not reliable. We then propose a way to drastically increase the quality of the dataset metadata corpus by developing a deep neural-network classifier that identifies whether or not a page with Schema.org/Dataset markup is a dataset page. Our classifier achieves 96.7% recall at the 95% precision point. This level of precision enables Dataset Search to circumvent the noise in semantic markup and to use the metadata to provide high quality results to users.
View details
Google Dataset Search by the Numbers
Omar Benjelloun
Shiyu Chen
International Semantic Web Conference (ISWC-2020), In-Use Track (to appear)
Preview abstract
Scientists, governments, and companies increasingly publish datasets on the Web. Google's Dataset Search tool extracts dataset metadata, expressed in the schema.org vocabulary, from webpages in order to make datasets discoverable. Since the tool's inception, the number of datasets described in schema.org has grown from about 500K to almost 30M, and has become a valuable snapshot of what data on the Web looks like. This paper analyzes the corpus of dataset metadata we collected. To the best of our knowledge, this corpus is the largest and most diverse of its kind. We discuss such questions as where the datasets originate from, what topics they cover, which form they take, and what people searching for datasets are interested in. We describe our methods for collecting and analyzing this data as well as our observations. We conclude with identifying the gaps and possible future work to help make data more accessible.
View details
Preview abstract
There are thousands of data repositories on the Web, providing access to millions of datasets. National and regional governments, scientific publishers and consortia, commercial data providers, and others publish data for fields ranging from social science to life science to high-energy physics to climate science and more. Access to this data is critical to facilitating reproducibility of research results, enabling scientists to build on others’ work, and providing data journalists easier access to information and its provenance. In this paper, we discuss Google Dataset Search, a dataset-discovery tool that provides search capabilities over potentially all datasets published on the Web. The approach relies on an open ecosystem, where dataset owners and providers publish semantically enhanced metadata on their own sites. We then aggregate, normalize, and reconcile this metadata, providing a search engine that lets users find datasets in the “long tail” of the Web. In this paper, we discuss both social and technical challenges in building this type of tool, and the lessons that we learned from this experience.
View details
Industry-scale Knowledge Graphs: Lessons and Challenges
Yuqing Gao
Anshu Jain
Anant Narayanan
Alan Patterson
Jamie Taylor
Communications of the ACM, 62 (8) (2019), pp. 36-43
Preview abstract
Knowledge graphs are critical to many enterprises today: they provide the structured data and factual knowledge that drives many products and makes them more intelligent and "magical." In this paper, we bring together the experience of building and using knowledge graphs in five diverse companies to compare similarities and differences and to discuss challenges that all knowledge-driven enterprises face today. The Bing knowledge graph at Microsoft and the Google knowledge graph support search and answering questions in search and during conversations. Facebook has the world's largest social graph, and is also starting to include information that Facebook users care about, such as information about music, movies, celebrities, and places. The eBay knowledge graph describes the enormous variety of products and their connections. Finally, the Knowledge Graph Framework for IBM’s Watson Discovery Offerings allows others to build their own knowledge graphs from pre-built components. We discuss the diverse requirements of these knowledge graphs and many common challenges in building knowledge graphs at this scale. This article summarizes and expands on the panel discussion that the authors conducted at the International Semantic Web Conference in Asilomar, California in October 2018.
View details
Preview abstract
Understanding the semantics of data on the Web, and thus enabling meaningful processing of it, has been at the core of Semantic Web research for more than a decade and a half. The early promise of enabling software agents on the Web to talk to one another in a meaningful way spawned research in a number of areas and has been adopted by governments, industry, and academia. Yet the nature of Semantic Web research is changing. Semantic Web research today distinguishes itself by embracing the messiness of real-world data and its often contradictory semantics. In this paper, we discuss the new research challenges and directions for Semantic Web research.
View details
Goods: Organizing Google's Datasets
Alon Halevy
Christopher Olston
Neoklis Polyzotis
Sudip Roy
Steven Euijong Whang
SIGMOD (2016)
Preview abstract
Enterprises increasingly rely on structured datasets to run their businesses. These datasets take a variety of forms, such as structured files, databases, spreadsheets, or even services that provide access to the data. The datasets often reside in different storage systems, may vary in their formats, and may change every day. In this paper, we present Goods, a project to rethink how we organize structured datasets at scale, in a setting where teams use diverse and often idiosyncratic ways to produce the datasets and where there is no centralized system for storing and querying them. Goods extracts metadata ranging from salient information about each dataset (owners, timestamps, schema) to relationships among datasets, such as similarity and provenance. It then exposes this metadata through services that allow engineers to find datasets within the company, to monitor datasets, to annotate them in order to enable others to use their datasets, and to analyze relationships between them. We discuss the technical challenges that we had to overcome in order to crawl and infer the metadata for billions of datasets, to maintain the consistency of our metadata catalog at scale, and to expose the metadata to users. We believe that many of the lessons that we learned are applicable to building large-scale enterprise-level data management systems in general.
View details
Managing Google’s data lake: an overview of the Goods system
Chris Olston
Neoklis Polyzotis
Steven Whang
Sudip Roy
IEEE Engineering Bulletin, 39 (3) (2016), pp. 5
Preview abstract
For most large enterprises today, data constitutes their core assets, along with code and infrastructure. Indeed, for most enterprises, the amount of data that they produce internally has exploded. At the same time, in many cases, engineers and data scientists do not use centralized data-management systems and end up creating what became known as a data lake: a collection of datasets that often are not well organized or not organized at all and where one needs to “fish” for the useful datasets. In this paper, we describe our experience building and deploying Goods, a system to manage Google’s internal data lake. Goods crawls Google’s internal infrastructure and builds a catalog of discovered datasets, including structured files, databases, spreadsheets, or even services that provide access to the data. Goods extracts metadata about datasets in a post-hoc way: engineers continue to generate and organize datasets in the same way as they have before, and Goods provides value without disrupting teams’ practices. The technical challenges that we had to address resulted both from the scale and heterogeneity of Google’s data lake and our decision to extract metadata in a post-hoc manner. We believe that many of the lessons that we learned are applicable to building large-scale enterprise-level data-management systems in general.
View details