WarpFlow: Exploring Petabytes of Space-Time Data
Abstract
WarpFlow is a fast, interactive querying and processing system for big data, with a special treatment for petabyte-scale spatio-temporal datasets. It processes and transforms rich, hierarchical data end-to-end (e.g., Protocol Buffers, a common data format at Google). WarpFlow speeds up three key metrics for data scientists: time-to-first-result, time-to-full-scale-result, and time-to-trained-model for machine learning (e.g., using TensorFlow). In this paper, we describe the architecture and implementation of WarpFlow. We present a custom data storage format optimized for fast, index-based selection of hierarchical data. We also describe a functional, extensible, pipelined query language (with operators such as map, filter, aggregate, etc.) that greatly simplifies writing queries on big datasets with hierarchical data.
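To make the pipelined query style concrete, the sketch below shows a minimal, hypothetical pipeline with map, filter, and aggregate operators over hierarchical-style records. The Pipeline class, Record fields, and method names are illustrative assumptions only; they are not WarpFlow's actual API, query syntax, or data schema.

from collections import Counter
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class Record:
    # Stand-in for a hierarchical record (e.g., a decoded Protocol Buffer).
    region: str
    timestamp: float
    value: float


class Pipeline:
    """A minimal map/filter/aggregate pipeline over an iterable of records."""

    def __init__(self, source: Iterable[Record]):
        self._source = iter(source)

    def filter(self, pred: Callable[[Record], bool]) -> "Pipeline":
        # Keep only records satisfying the predicate.
        return Pipeline(r for r in self._source if pred(r))

    def map(self, fn: Callable[[Record], Record]) -> "Pipeline":
        # Transform each record.
        return Pipeline(fn(r) for r in self._source)

    def aggregate(self, key: Callable[[Record], str]) -> Counter:
        # Count records per key; a real system would support richer reducers.
        return Counter(key(r) for r in self._source)


# Usage: count records per region within a time window.
records = [
    Record("us-west", 1_000.0, 3.2),
    Record("us-east", 1_500.0, 1.1),
    Record("us-west", 2_000.0, 4.8),
]
counts = (
    Pipeline(records)
    .filter(lambda r: 1_000.0 <= r.timestamp < 2_500.0)
    .aggregate(lambda r: r.region)
)
print(counts)  # Counter({'us-west': 2, 'us-east': 1})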