Predictive Crawling for Commercial Web Content

Shuguang Han; Bernhard Brodowsky; Przemek Gajda; Sergey Novikov; Mike Bendersky; Marc Najork; Robin Dua; Alexandrin Popescul

Proceedings of the 2019 World Wide Web Conference, pp. 627-637

Abstract

Web crawlers spend significant resources to maintain the freshness of their crawled data. This paper describes the optimization of resources to ensure that product prices shown in ads in the context of a shopping sponsored search service are synchronized with current merchant prices. We use the predictability of price changes to build a machine-learned system that leads to considerable resource savings for both the merchants and the crawler. We describe our solutions to technical challenges arising from partial observability of price history, feedback loops created by applying machine-learned models, and offers in a cold-start state. Empirical evaluation over large-scale product crawl data demonstrates the effectiveness of our model and confirms its robustness to unseen data. We argue that our approach may be applicable in more general data-pull settings.
