Robert Bradshaw
Authored Publications
The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing
Tyler Akidau
Craig Chambers
Reuven Lax
Daniel Mills
Frances Perry
Eric Schmidt
Proceedings of the VLDB Endowment, 8 (2015), pp. 1792-1803
Unbounded, unordered, global-scale datasets are increasingly
common in day-to-day business (e.g. Web logs, mobile
usage statistics, and sensor networks). At the same time,
consumers of these datasets have evolved sophisticated requirements,
such as event-time ordering and windowing by
features of the data themselves, in addition to an insatiable
hunger for faster answers. Meanwhile, practicality dictates
that one can never fully optimize along all dimensions of correctness,
latency, and cost for these types of input. As a result,
data processing practitioners are left with the quandary
of how to reconcile the tensions between these seemingly
competing propositions, often resulting in disparate implementations
and systems.
We propose that a fundamental shift of approach is necessary
to deal with these evolved requirements in modern
data processing. We as a field must stop trying to groom unbounded
datasets into finite pools of information that eventually
become complete, and instead live and breathe under
the assumption that we will never know if or when we have
seen all of our data, only that new data will arrive, old data
may be retracted, and the only way to make this problem
tractable is via principled abstractions that allow the practitioner
the choice of appropriate tradeoffs along the axes of
interest: correctness, latency, and cost.
In this paper, we present one such approach, the Dataflow
Model, along with a detailed examination of the semantics
it enables, an overview of the core principles that guided its
design, and a validation of the model itself via the real-world
experiences that led to its development.
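As a rough illustration of the event-time windowing the abstract describes, the sketch below groups out-of-order events into fixed windows by their own timestamps and emits a refined total each time a new event arrives. It is not the Dataflow Model API: the names `WINDOW_SIZE`, `window_start`, and `sum_by_event_time_window` are hypothetical, and watermarks, triggers, and retractions are left out.

```python
from collections import defaultdict

WINDOW_SIZE = 60  # seconds; hypothetical fixed-window width


def window_start(event_time, size=WINDOW_SIZE):
    """Return the start of the fixed event-time window containing event_time."""
    return event_time - (event_time % size)


def sum_by_event_time_window(events, size=WINDOW_SIZE):
    """Sum values per event-time window, yielding a refined total per arrival.

    `events` is an iterable of (event_time, value) pairs in arrival order,
    which may differ from event-time order.
    """
    totals = defaultdict(int)
    for event_time, value in events:
        start = window_start(event_time, size)
        totals[start] += value
        # Emit the updated total immediately instead of waiting for the
        # window to be "complete", since completeness is never known.
        yield start, totals[start]


# Arrival order is not event-time order: the event at t=30 arrives last.
arrivals = [(10, 5), (70, 3), (65, 4), (30, 2)]
updates = list(sum_by_event_time_window(arrivals))
final_totals = dict(updates)  # the last update for each window wins
```

Here the late event at t=30 still lands in the window starting at 0, and the earlier result for that window is simply superseded by a newer one. Choosing when to emit, and how long to keep accepting late data, is where the correctness, latency, and cost tradeoff shows up.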
Cython: The Best of Both Worlds
Stefan Behnel
Craig Citro
Lisandro Dalcin
Dag Sverre Seljebotn
Kurt Smith
Computing in Science and Engineering, 13.2 (2011), pp. 31-39
Cython is an extension to the Python language that allows explicit type declarations and is compiled directly to C. This addresses Python's large overhead for numerical loops and the difficulty of efficiently making use of existing C and Fortran code, which Cython code can interact with natively. The Cython language combines the speed of C with the power and simplicity of the Python language.
FlumeJava: Easy, Efficient Data-Parallel Pipelines
Craig Chambers
Ashish Raniwala
Frances Perry
Stephen Adams
Robert Henry
Nathan
ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), ACM, New York, NY (2010), pp. 363-375
MapReduce and similar systems significantly ease the task of writing
data-parallel code. However, many real-world computations require
a pipeline of MapReduces, and programming and managing
such pipelines can be difficult. We present FlumeJava, a Java library
that makes it easy to develop, test, and run efficient data-parallel
pipelines. At the core of the FlumeJava library are a couple
of classes that represent immutable parallel collections, each
supporting a modest number of operations for processing them in
parallel. Parallel collections and their operations present a simple,
high-level, uniform abstraction over different data representations
and execution strategies. To enable parallel operations to run efficiently, FlumeJava defers their evaluation, instead internally constructing
an execution plan dataflow graph. When the final results
of the parallel operations are eventually needed, FlumeJava first optimizes
the execution plan, and then executes the optimized operations
on appropriate underlying primitives (e.g., MapReduces). The
combination of high-level abstractions for parallel data and computation,
deferred evaluation and optimization, and efficient parallel
primitives yields an easy-to-use system that approaches the efficiency of hand-optimized pipelines. FlumeJava is in active use by
hundreds of pipeline developers within Google.
Categories and Subject Descriptors: D.1.3 [Concurrent Programming]: Parallel Programming
General Terms: Algorithms, Languages, Performance
Keywords: data-parallel programming, MapReduce, Java
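The deferred-evaluation idea in the abstract can be sketched in a few lines of Python. This is not FlumeJava's Java API: the `PCollection` class, its `parallel_do`, `plan`, and `run` methods, and the one-to-one element functions are all simplifying assumptions made for illustration. Each operation only records a step in a plan, and the plan is fused into a single pass when results are finally requested.

```python
class PCollection:
    """Hypothetical deferred collection: records a plan, runs nothing."""

    def __init__(self, source, ops=()):
        self._source = source
        self._ops = tuple(ops)  # pending element-wise functions, in order

    def parallel_do(self, fn):
        # Deferred: return a new immutable collection with fn appended
        # to the plan; no element is processed here.
        return PCollection(self._source, self._ops + (fn,))

    def plan(self):
        """Return the names of the pending operations, in order."""
        return [fn.__name__ for fn in self._ops]

    def run(self):
        # "Optimize" by fusing all pending operations into one function,
        # then make a single pass over the source.
        def fused(x):
            for fn in self._ops:
                x = fn(x)
            return x

        return [fused(x) for x in self._source]


def parse(text):
    return int(text)


def double(n):
    return n * 2


words = PCollection(["1", "2", "3"])
doubled = words.parallel_do(parse).parallel_do(double)
planned = doubled.plan()  # nothing has executed yet
result = doubled.run()    # one fused pass over the input
```

A real optimizer would work on a dataflow graph with several operation kinds and would map the fused stages onto primitives such as MapReduces. The sketch only shows why deferring execution makes that kind of rewriting possible.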