Arif Merchant
Arif Merchant is a Research Scientist with the Storage Analytics group at Google, where he studies interactions between components of the storage stack. Prior to this, he was with HP Labs, where he worked on storage QoS, distributed storage systems, and stochastic models of storage. He holds the B.Tech. degree from IIT Bombay and the Ph.D. in Computer Science from Stanford University. He is an ACM Distinguished Scientist.
Authored Publications
Practical Design Considerations for Wide Locally Recoverable Codes (LRCs)
Shashwat Silas
Dave Clausen
File and Storage Technologies (FAST), USENIX (2023)
Preview abstract
Most of the data in large-scale storage clusters is erasure coded. At exascale, optimizing erasure codes for low storage overhead, efficient reconstruction, and easy deployment is of critical importance. Locally recoverable codes (LRCs) have deservedly gained central importance in this field, because they can balance many of these requirements. In our work we study wide LRCs: LRCs with a large number of blocks per stripe and low storage overhead. These codes are a natural next step for practitioners to unlock higher storage savings, but they come with their own challenges. Of particular interest is their reliability, since wider stripes are prone to more simultaneous failures.
We conduct a practically-minded analysis of several popular and novel LRCs. We find that wide LRC reliability is a subtle phenomenon that is sensitive to several design choices, some of which are overlooked by theoreticians, and others by practitioners. Based on these insights, we construct novel LRCs called Uniform Cauchy LRCs, which show excellent performance in simulations, and a 33% improvement in reliability on unavailability events observed by a wide LRC deployed in a Google storage cluster. We also show that these codes are easy to deploy in a manner that improves their robustness to common maintenance events. Along the way, we also give a remarkably simple and novel construction of distance optimal LRCs (other constructions are also known), which may be of interest to theory-minded readers.
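As a rough illustration of the local-recovery idea behind LRCs (not the Uniform Cauchy construction from the paper), the sketch below groups data blocks into local groups, each with a single XOR parity, so that one lost block can be repaired by reading only its own group. Global parities, the wide-stripe parameters, and the reliability analysis are all omitted; block and group sizes are arbitrary.

```python
# A minimal sketch of local recovery in an LRC-style stripe, for illustration only.
# It is NOT the Uniform Cauchy LRC construction: global parities are omitted and the
# local parity is plain XOR, which only repairs a single lost block per local group.

from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR of equal-length blocks."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

def encode_local_parities(data_blocks, group_size):
    """Split data blocks into local groups and append one XOR parity per group."""
    groups = [data_blocks[i:i + group_size] for i in range(0, len(data_blocks), group_size)]
    return [(group, xor_blocks(group)) for group in groups]

def repair_single_loss(group, parity, lost_index):
    """Recover one lost block using only its local group and local parity."""
    survivors = [b for i, b in enumerate(group) if i != lost_index]
    return xor_blocks(survivors + [parity])

if __name__ == "__main__":
    data = [bytes([i] * 8) for i in range(12)]           # 12 data blocks of 8 bytes
    stripe = encode_local_parities(data, group_size=4)   # 3 local groups, 1 parity each
    group, parity = stripe[1]
    recovered = repair_single_loss(group, parity, lost_index=2)
    assert recovered == group[2]                         # local repair reads 4 blocks, not 11
```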
CacheSack: Admission Optimization for Datacenter Flash Caches
Homer Wolfmeister
Mustafa Uysal
Seth Bradley Pollen
Tzu-Wei Yang
2022 USENIX Annual Technical Conference (2022)
Preview abstract
This paper describes the algorithm, implementation, and deployment experience of CacheSack, the admission algorithm for Google datacenter flash caches. CacheSack minimizes the dominant costs of Google’s datacenter flash caches: disk IO and flash footprint. CacheSack partitions cache traffic into disjoint categories, analyzes the observed cache benefit of each subset, and formulates a knapsack problem to assign the optimal admission policy to each subset. Prior to this work, Google datacenter flash cache admission policies were optimized manually, with most caches using the Lazy Adaptive Replacement Cache (LARC) algorithm. Production experiments showed that CacheSack significantly outperforms the prior static admission policies, yielding a 6.5% improvement in the total operational cost, as well as significant improvements in disk reads and flash wearout.
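A hedged sketch of the kind of per-category knapsack assignment the abstract describes is shown below: each traffic category receives one admission policy, chosen to maximize estimated disk reads saved under a flash-footprint budget. The categories, candidate policies, and numbers are invented, and this dynamic-programming formulation is only a stand-in for CacheSack's actual model.

```python
# A sketch of a multiple-choice knapsack: pick one admission policy per traffic
# category to maximize disk reads saved under a total flash-footprint budget.
# All categories, policies, and numbers are made up; CacheSack's real model
# (and its fractional/priority handling) differs.

def choose_policies(categories, flash_budget_gib):
    """Multiple-choice knapsack via DP over a discretized flash budget."""
    B = flash_budget_gib
    best = [0.0] * (B + 1)                 # best[b] = max disk reads saved with budget b
    pick = [dict() for _ in range(B + 1)]  # chosen policy per category at each budget
    for name, options in categories.items():
        new_best = [float("-inf")] * (B + 1)
        new_pick = [None] * (B + 1)
        for b in range(B + 1):
            for policy, (footprint, saved) in options.items():
                if footprint <= b and best[b - footprint] + saved > new_best[b]:
                    new_best[b] = best[b - footprint] + saved
                    new_pick[b] = (policy, b - footprint)
        # roll forward, remembering the choice made for this category
        rolled = []
        for b in range(B + 1):
            policy, prev_b = new_pick[b]
            chosen = dict(pick[prev_b])
            chosen[name] = policy
            rolled.append(chosen)
        best, pick = new_best, rolled
    return best[B], pick[B]

if __name__ == "__main__":
    # (flash footprint in GiB, disk reads saved per second) for each candidate policy.
    # Every category includes a zero-cost "never admit" option so the DP stays feasible.
    categories = {
        "photos_scan": {"never": (0, 0.0), "admit_on_2nd_miss": (40, 120.0)},
        "db_lookups":  {"never": (0, 0.0), "admit_on_miss": (60, 900.0)},
        "log_tail":    {"never": (0, 0.0), "admit_on_miss": (30, 10.0)},
    }
    saved, assignment = choose_policies(categories, flash_budget_gib=100)
    print(saved, assignment)
```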
Tiger: disk-adaptive redundancy without placement restrictions
Francisco Maturana
Sanjith Athlur
Rashmi KV
Gregory R. Ganger
OSDI 2022 (2022)
Preview abstract
Large-scale cluster storage systems use redundancy (via erasure coding) to ensure data durability. Disk-adaptive redundancy—dynamically tailoring the redundancy scheme to observed disk failure rates—promises significant space and cost savings. Existing disk-adaptive redundancy systems, however, pose undesirable constraints on data placement, partitioning disks into subclusters with homogeneous failure rates and forcing each erasure-coded stripe to be entirely placed on the disks within one subcluster. This design increases risk, by reducing intra-stripe diversity and being more susceptible to unanticipated changes in a make/model’s failure rate, and only works for very large storage clusters fully committed to disk-adaptive redundancy.
Tiger is a new disk-adaptive redundancy system that efficiently avoids adoption-blocking placement constraints, while also providing higher space-savings and lower risk relative to prior designs. To do so, Tiger introduces the eclectic stripe, in which disks with different failure rates can be used to store a stripe that has redundancy tailored to the set of failure rates of those disks. With eclectic stripes, pre-existing placement policies can be used while still enjoying the space-savings and robustness benefits of disk-adaptive redundancy. This paper introduces eclectic striping and Tiger’s design, including a new mean time-to-data-loss (MTTDL) approximation technique and new approaches for ensuring safe per-stripe settings given that failure rates of different devices change over time. Evaluation with logs from real-world clusters shows that Tiger provides better space-savings, less bursty IO for changing redundancy schemes, and better robustness (due to increased risk-diversity) than prior disk-adaptive redundancy designs.
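For intuition only, the sketch below computes the probability that a single stripe built from disks with heterogeneous failure probabilities exceeds its fault tolerance within a repair window, assuming independent failures. This is not Tiger's MTTDL approximation technique; the failure probabilities, stripe shape, and independence assumption are all illustrative.

```python
# Illustrative only (not Tiger's MTTDL approximation): with a per-disk failure
# probability over a repair window, the chance that a mixed-failure-rate stripe
# exceeds its fault tolerance follows from a Poisson-binomial tail, assuming
# independent disk failures.

def prob_data_loss(per_disk_failure_probs, tolerated_failures):
    """P(more than `tolerated_failures` of these disks fail in the window)."""
    # dist[j] = probability that exactly j of the disks considered so far have failed
    dist = [1.0]
    for p in per_disk_failure_probs:
        new = [0.0] * (len(dist) + 1)
        for j, q in enumerate(dist):
            new[j] += q * (1.0 - p)   # this disk survives
            new[j + 1] += q * p       # this disk fails
        dist = new
    return sum(dist[tolerated_failures + 1:])

if __name__ == "__main__":
    # A hypothetical 14-disk stripe (10 data + 4 parity, tolerating any 4 failures)
    # mixing newer and older disks with made-up per-window failure probabilities.
    newer = [1e-4] * 8
    older = [8e-4] * 6
    p_loss = prob_data_loss(newer + older, tolerated_failures=4)
    print(f"per-window data-loss probability: {p_loss:.3e}")
```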
Preview abstract
Solid-state drives (SSDs) based on NAND flash are making deep inroads into data centers as well as the consumer market. In 2016, manufacturers shipped more than 130 million units totaling around 50 Exabytes of storage capacity. As the amount of data stored on solid state drives keeps increasing, it is important to understand the reliability characteristics of these devices. For a long time, our knowledge about flash reliability was derived from controlled experiments in lab environments under synthetic workloads, often using methods for accelerated testing. However, within the last two years, three large-scale field studies have been published that report on the failure behavior of flash devices in production environments subjected to real workloads and operating conditions. The goal of this paper is to provide an overview of what we have learned about flash reliability in production, and where appropriate contrasting it with prior studies performing controlled experiments.
Slicer: Auto-Sharding for Datacenter Applications
Atul Adya
Jon Howell
Jeremy Elson
Colin Meek
Vishesh Khemani
Stefan Fulger
Pan Gu
Lakshminath Bhuvanagiri
Jason Hunter
Roberto Peon
Alexander Shraer
Kfir Lev-Ari
OSDI 2016 (2016)
Preview abstract
Sharding is a fundamental building block of large-scale applications, but most have their own custom, ad-hoc implementations. Our goal is to make sharding as easily reusable as a filesystem or lock manager. Slicer is Google's general purpose sharding service. It monitors signals such as load hotspots and server health and dynamically shards work over a set of servers. Its goals are to maintain high availability and reduce load imbalance while minimizing churn from moved work.
In this paper, we describe Slicer's design and implementation. Slicer has the consistency and global optimization of a centralized sharder while approaching the high availability, scalability, and low latency of systems that make local decisions. It achieves this by separating concerns: a reliable data plane forwards requests, and a smart control plane makes load-balancing decisions off the critical path. Slicer's small but powerful API has proven useful and easy to adopt in dozens of Google applications. It is used to allocate resources for web service front-ends, coalesce writes to increase storage bandwidth, and increase the efficiency of a web cache. It currently handles 2-6M req/s of production traffic. Production workloads using Slicer exhibit a most-loaded task at 30%-180% of the mean load, even for highly skewed and time-varying loads.
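The toy sketch below illustrates the control-plane side of this separation of concerns: given per-slice (key-range) loads, it greedily moves light slices off the hottest task to reduce imbalance while capping the number of moves (churn). It is not Slicer's assignment algorithm; the slice names, loads, and move cap are invented.

```python
# A toy control-plane rebalancer: move "slices" (key ranges) between tasks to reduce
# load imbalance with bounded churn. This only illustrates the data-plane/control-plane
# split described in the abstract, not Slicer's actual algorithm.

from collections import defaultdict

def rebalance(assignment, slice_load, max_moves):
    """Greedily move light slices off the hottest task; returns (slice, from, to) moves."""
    task_load = defaultdict(float)
    for s, task in assignment.items():
        task_load[task] += slice_load[s]
    moves = []
    for _ in range(max_moves):
        hot = max(task_load, key=task_load.get)
        cold = min(task_load, key=task_load.get)
        candidates = [s for s, t in assignment.items() if t == hot]
        s = min(candidates, key=slice_load.get)            # cheapest move = least churn
        if task_load[hot] - slice_load[s] <= task_load[cold] + slice_load[s]:
            break                                          # moving would just shift the hotspot
        assignment[s] = cold
        task_load[hot] -= slice_load[s]
        task_load[cold] += slice_load[s]
        moves.append((s, hot, cold))
    return moves

if __name__ == "__main__":
    assignment = {"a": "t1", "b": "t1", "c": "t1", "d": "t2", "e": "t3"}
    slice_load = {"a": 50.0, "b": 30.0, "c": 20.0, "d": 10.0, "e": 5.0}
    print(rebalance(assignment, slice_load, max_moves=3))
```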
Preview abstract
The use of solid state drives based on NAND flash technology is continuously growing. As more data either lives on flash or is being cached on flash, data durability and availability critically depend on flash reliability.
This paper provides a detailed field study of flash reliability based on data collected over 6 years in a large-scale data center production environment. The data spans many millions of drive days, ten different drive models, different flash technologies (MLC and SLC) and feature sizes (ranging from 24nm to 50nm). The paper analyses this data in order to derive a better understanding of flash reliability in the field, including the most prevalent types of errors and hardware failures and their frequency, and how different factors impact flash reliability.
Poster Paper: Automatic Reconfiguration of Distributed Storage
Artyom Sharov
Alexander Shraer
Murray Stokely
The 12th International Conference on Autonomic Computing, IEEE (2015), pp. 133-134
Preview abstract
The configuration of a distributed storage system with multiple data replicas typically includes the set of servers and their roles in the replication protocol. The configuration can usually be changed manually, but in most cases, system administrators have to determine a good configuration by trial and error. We describe a new workload-driven optimization framework that dynamically determines the optimal configuration at run time. Applying the framework to a large-scale distributed storage system used internally in Google resulted in halving the operation latency in 17% of the tested databases, and reducing it by more than 90% in some cases.
Take me to your leader! Online Optimization of Distributed Storage Configurations
Artyom Sharov
Alexander Shraer
Murray Stokely
Proceedings of the 41st International Conference on Very Large Data Bases, VLDB Endowment (2015), pp. 1490-1501
Preview abstract
The configuration of a distributed storage system typically includes, among other parameters, the set of servers and their roles in the replication protocol. Although mechanisms for changing the configuration at runtime exist, it is usually left to system administrators to manually determine the “best” configuration and periodically reconfigure the system, often by trial and error. This paper describes a new workload-driven optimization framework that dynamically determines the optimal configuration at runtime. We focus on optimizing leader and quorum based replication schemes and divide the framework into three optimization tiers, dynamically optimizing different configuration aspects: 1) leader placement, 2) roles of different servers in the replication protocol, and 3) replica locations. We showcase our optimization framework by applying it to a large-scale distributed storage system used internally in Google and demonstrate that most client applications significantly benefit from using our framework, reducing average operation latency by up to 94%.
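A minimal sketch of the first tier, leader placement, follows: pick the leader replica that minimizes request-weighted client latency. The regions, request rates, and RTTs are made up, and the other two tiers (replica roles and locations) are ignored here.

```python
# A toy version of workload-driven leader placement: choose the replica that
# minimizes the sum over clients of (request rate x RTT to the leader).
# Regions, rates, and RTTs are hypothetical.

def best_leader(candidate_replicas, client_request_rates, rtt_ms):
    """Return the replica minimizing request-weighted client-to-leader latency."""
    def weighted_latency(replica):
        return sum(rate * rtt_ms[(client, replica)]
                   for client, rate in client_request_rates.items())
    return min(candidate_replicas, key=weighted_latency)

if __name__ == "__main__":
    replicas = ["us-east", "us-west", "europe"]
    client_request_rates = {"us-east": 800.0, "us-west": 300.0, "europe": 150.0}
    # Crude RTT model: 5 ms within a region, 70 ms across the US, 140 ms transatlantic.
    rtt_ms = {(c, r): (5.0 if c == r else 70.0 if "us" in c and "us" in r else 140.0)
              for c in client_request_rates for r in replicas}
    print(best_leader(replicas, client_request_rates, rtt_ms))
```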
Janus: Optimal Flash Provisioning for Cloud Storage Workloads
Christoph Albrecht
Murray Stokely
Muhammad Waliji
Francois Labelle
Xudong Shi
Eric Schrock
Proceedings of the USENIX Annual Technical Conference, USENIX Association (2013), pp. 91-102
Preview abstract
Janus is a system for partitioning the flash storage tier between workloads in a cloud-scale distributed file system with two tiers, flash storage and disk. The file system stores newly created files in the flash tier and moves them to the disk tier using either a First-In-First-Out (FIFO) policy or a Least-Recently-Used (LRU) policy, subject to per-workload allocations. Janus constructs compact metrics of the cacheability of the different workloads, using sampled distributed traces because of the large scale of the system. From these metrics, we formulate and solve an optimization problem to determine the flash allocation to workloads that maximizes the total reads sent to the flash tier, subject to operator-set priorities and bounds on flash write rates. Using measurements from production workloads in multiple data centers that use these recommendations, as well as traces of other production workloads, we show that the resulting allocation improves the flash hit rate by 47–76% compared to a unified tier shared by all workloads. Based on these results and an analysis of several thousand production workloads, we conclude that flash storage is a cost-effective complement to disks in data centers.
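The sketch below shows a simplified version of the allocation problem: split a fixed flash budget across workloads to maximize reads served from flash, using greedy marginal allocation over concave cacheability curves. The curves and numbers are invented; Janus additionally derives its curves from sampled traces and respects operator priorities and flash write-rate bounds, which this sketch omits.

```python
# A simplified flash-allocation sketch: give flash, in fixed increments, to the
# workload with the highest marginal gain in flash reads. For concave cacheability
# curves this greedy is optimal; the curves below are made up for illustration.

import heapq

def allocate_flash(curves, total_gib, step_gib=1):
    """Greedy marginal allocation of `total_gib` of flash across workloads."""
    alloc = {w: 0 for w in curves}
    def gain(w):  # extra flash reads/s from the next step of flash for workload w
        return curves[w](alloc[w] + step_gib) - curves[w](alloc[w])
    heap = [(-gain(w), w) for w in curves]
    heapq.heapify(heap)
    remaining = total_gib
    while remaining >= step_gib and heap:
        neg_g, w = heapq.heappop(heap)
        if -neg_g <= 0:                      # no workload benefits from more flash
            break
        alloc[w] += step_gib
        remaining -= step_gib
        heapq.heappush(heap, (-gain(w), w))  # re-rank this workload with its new gain
    return alloc

if __name__ == "__main__":
    curves = {  # flash reads/s as a function of GiB of flash (concave, illustrative)
        "mail_serving": lambda g: 500.0 * (1.0 - 0.97 ** g),
        "video_cache":  lambda g: 200.0 * (1.0 - 0.90 ** g),
        "batch_logs":   lambda g: 20.0 * (1.0 - 0.99 ** g),
    }
    print(allocate_flash(curves, total_gib=50))
```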
Uncertainty in Aggregate Estimates from Sampled Distributed Traces
Murray Stokely
2012 Workshop on Managing Systems Automatically and Dynamically, USENIX
Preview abstract
Tracing mechanisms in distributed systems give important insight into system properties and are usually sampled to control overhead. At Google, Dapper [8] is the always-on system for distributed tracing and performance analysis, and it samples fractions of all RPC traffic. Due to implementation difficulties, excessive data volume, or a lack of perfect foresight, there are times when system quantities of interest have not been measured directly, and Dapper samples can be aggregated to estimate those quantities in the short or long term. Here we find unbiased variance estimates of linear statistics over RPCs, taking into account all layers of sampling that occur in Dapper, and allowing us to quantify the sampling uncertainty in the aggregate estimates. We apply this methodology to the problem of assigning jobs and data to Google datacenters, using estimates of the resulting cross-datacenter traffic as an optimization criterion, and also to the detection of change points in access patterns to certain data partitions.
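As a simplified illustration, the sketch below forms an inverse-probability (Horvitz-Thompson) estimate of a total over sampled RPCs together with a sampling-uncertainty estimate, assuming independent sampling where each RPC's inclusion probability is the product of the layer sampling rates it passed through. The paper derives unbiased variance estimates for Dapper's actual multi-layer design; the values here are hypothetical.

```python
# A back-of-the-envelope estimate of a total (e.g., bytes moved between datacenters)
# from sampled RPC traces, with a standard error. Assumes independent (Poisson)
# sampling with known per-RPC inclusion probabilities; this is a simplification of
# the multi-layer sampling analysis in the paper.

import math

def estimate_total(sampled_rpcs):
    """Horvitz-Thompson estimate of a total and its standard error.

    sampled_rpcs: list of (value, inclusion_probability) for RPCs that were sampled.
    """
    total = sum(x / p for x, p in sampled_rpcs)
    variance = sum(x * x * (1.0 - p) / (p * p) for x, p in sampled_rpcs)
    return total, math.sqrt(variance)

if __name__ == "__main__":
    # Hypothetical sampled RPCs: bytes moved, and combined sampling probability
    # (e.g., a 1/1024 trace-sampling rate times a 1/16 collection-stage sample).
    sampled = [(4096, 1 / 16384.0), (1200, 1 / 16384.0), (65536, 1 / 1024.0)]
    est, stderr = estimate_total(sampled)
    print(f"estimated total bytes: {est:.3e} +/- {stderr:.3e} (1 s.e.)")
```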