Experiences Scaling Use of Google's Sawzall
Abstract
Sawzall is a procedural language developed at Google for parallel analysis of very large data sets. Given a log sharded into many separate files, its companion tool named saw runs Sawzall interpreters to perform an analysis.
Hundreds of Googlers have written thousands of saw+Sawzall programs, which form a significant minority of Google's daily data processing. Short programs grew to become longer programs, which were not easily shared nor tested. In other words, scaling naively written Sawzall led to unmaintainable programs.
The simple idea of writing programs functionally, not iteratively, yielded shareable, testable programs. The functions reflect fundamental map reduction concepts: mapping, reducing, and iterating. Each can be easily tested.
This case study demonstrates that developers of parallel processing systems should also simultaneously develop ways for users to decompose code into sharable pieces that reflect fundamental underlying concepts. As importantly, they must develop ways for users to easily write tests of their code.
Hundreds of Googlers have written thousands of saw+Sawzall programs, which form a significant minority of Google's daily data processing. Short programs grew to become longer programs, which were not easily shared nor tested. In other words, scaling naively written Sawzall led to unmaintainable programs.
The simple idea of writing programs functionally, not iteratively, yielded shareable, testable programs. The functions reflect fundamental map reduction concepts: mapping, reducing, and iterating. Each can be easily tested.
This case study demonstrates that developers of parallel processing systems should also simultaneously develop ways for users to decompose code into sharable pieces that reflect fundamental underlying concepts. As importantly, they must develop ways for users to easily write tests of their code.