Keith Peters

I'm a roll up your sleeves and get dirty engineer who builds and implements. Sometimes what I build and implement is interesting for others.
Authored Publications
Sort By
  • Title
  • Title, descending
  • Year
  • Year, descending
    Preview abstract Business intelligence and web log analysis workloads often use queries with top-k clauses to produce the most relevant results. Values of k range from small to rather large and sometimes the requested output exceeds the capacity of the available main memory. When the requested output fits in the available memory existing top-k algorithms are efficient, as they can eliminate almost all but the top k results before sorting them. When the requested output exceeds the main memory capacity, existing algorithms externally sort the entire input, which can be very expensive. Furthermore, the drastic difference in execution cost when the memory capacity is exceeded results in an unpleasant user experience. Every day, tens of thousands of production top-k queries executed on F1 Query resort to an external sort of the input. To address these challenges, we introduce a new top-k algorithm that is able to eliminate parts of the input before sorting or writing them to secondary storage, regardless of whether the requested output fits in the available memory. To achieve this, at execution time our algorithm creates a concise model of the input using histograms. The proposed algorithm is implemented as part of F1 Query and is used in production, where significantly accelerates top-k queries with outputs larger than the available memory. We evaluate our algorithm against existing top-k algorithms and show that it reduces I/O traffic and can be up to 11× faster. View details
    F1 Query: Declarative Querying at Scale
    Bart Samwel
    Ben Handy
    Jason Govig
    Chanjun Yang
    Daniel Tenedorio
    Felix Weigel
    David G Wilhite
    Jiacheng Yang
    Jun Xu
    Jiexing Li
    Zhan Yuan
    Qiang Zeng
    Ian Rae
    Anurag Biyani
    Andrew Harn
    Yang Xia
    Andrey Gubichev
    Amr El-Helw
    Orri Erling
    Allen Yan
    Mohan Yang
    Yiqun Wei
    Thanh Do
    Colin Zheng
    Somayeh Sardashti
    Ahmed Aly
    Divy Agrawal
    Shivakumar Venkataraman
    PVLDB (2018), pp. 1835-1848
    Preview abstract F1 Query is a stand-alone, federated query processing platform that executes SQL queries against data stored in different file-based formats as well as different storage systems (e.g., BigTable, Spanner, Google Spreadsheets, etc.). F1 Query eliminates the need to maintain the traditional distinction between different types of data processing workloads by simultaneously supporting: (i) OLTP-style point queries that affect only a few records; (ii) low-latency OLAP querying of large amounts of data; and (iii) large ETL pipelines transforming data from multiple data sources into formats more suitable for analysis and reporting. F1 Query has also significantly reduced the need for developing hard-coded data processing pipelines by enabling declarative queries integrated with custom business logic. F1 Query satisfies key requirements that are highly desirable within Google: (i) it provides a unified view over data that is fragmented and distributed over multiple data sources; (ii) it leverages datacenter resources for performant query processing with high throughput and low latency; (iii) it provides high scalability for large data sizes by increasing computational parallelism; and (iv) it is extensible and uses innovative approaches to integrate complex business logic in declarative query processing. This paper presents the end-to-end design of F1 Query. Evolved out of F1, the distributed database that Google uses to manage its advertising data, F1 Query has been in production for multiple years at Google and serves the querying needs of a large number of users and systems. View details