Jump to Content

Distributed Data Processing for Large-Scale Simulations on Cloud

Lily Hu
TJ Lu
Yi-fan Chen
2021 IEEE INTERNATIONAL SYMPOSIUM ON ELECTROMAGNETIC COMPATIBILITY, SIGNAL & POWER INTEGRITY (2021) (to appear)

Abstract

In this work, we proposed a distributed data pipeline for large-scale simulations by using libraries and frameworks available on Cloud services. The data pipeline is designed with careful considerations for the characteristics of the simulation data. The implementation of the data pipeline is with Apache Beam and Zarr. Beam is a unified, open-source programming model for building both batch- and streaming-data parallel-processing pipelines. By using Beam, one can simply focus on the logical composition of the data processing task and bypass the low-level details of distributed computing. The orchestration of distributed processing is fully managed by the runner, in this work, Dataflow on Google Cloud. Beam separates the programming layer from the runtime layer such that the proposed pipeline can be executed across various runners. The storage format of the output tensor of the data pipeline is Zarr. Zarr allows concurrent reading and writing, storage on a file system, and data compression before the storage. The performance of the data pipeline is analyzed with an example, of which the simulation data is obtained with an in-house developed computational fluid dynamic solver running in parallel on Tensor Processing Unit (TPU) clusters. The performance analysis demonstrates good storage and computational efficiency of the proposed data pipeline.