Wolf-Dietrich Weber
Authored Publications
Powernet for Distributed Energy Resources
Anand Ramesh
Sangsun Kim
Jim Schmalzried
Jyoti Sastry
Michael Dikovsky
Konstantin Bozhkov
Eduardo Pinheiro
Scott Collyer
Ankit Somani
Ram Rajagopal
Arun Majumdar
Junjie Qin
Gustavo Cezar
Juan Rivas
Abbas El Gamal
Dian Gruenich
Steven Chu
Sila Kiliccote
Conference: 2016 IEEE Power and Energy Society General Meeting (PESGM), IEEE Power and Energy Society, Boston, MA, USA (2016)
We propose Powernet as an end-to-end open-source technology for economically efficient, scalable, and secure coordination of grid resources. It offers integrated hardware and software solutions that are judiciously divided between local embedded sensing, computing, and control, networked with cloud-based high-level coordination for real-time optimal operation of not only centralized resources but also millions of distributed resources of various types. Our goal is to enable a penetration of 50% or more of intermittent renewables while minimizing cost and addressing security and economic scalability challenges. In this paper we describe the basic concept behind Powernet and illustrate some components of the solution.
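The abstract describes a split between embedded local control and cloud-based high-level coordination. The sketch below illustrates that general pattern in Python; the class names, the proportional-allocation rule, and the capacities are hypothetical illustrations, not Powernet's actual interfaces or protocol.

```python
# Minimal sketch of a hierarchical coordination pattern: local controllers
# report measurements, a cloud-side coordinator returns power setpoints.
# All names and the allocation rule are hypothetical, not Powernet's design.
from dataclasses import dataclass


@dataclass
class Measurement:
    resource_id: str
    available_kw: float  # power the local resource can currently supply


class LocalController:
    """Embedded sensing/control for one distributed resource (e.g. a PV inverter)."""

    def __init__(self, resource_id: str, capacity_kw: float):
        self.resource_id = resource_id
        self.capacity_kw = capacity_kw
        self.setpoint_kw = 0.0

    def measure(self) -> Measurement:
        return Measurement(self.resource_id, self.capacity_kw)

    def apply_setpoint(self, kw: float) -> None:
        # Local control enforces the physical limit regardless of what the cloud requests.
        self.setpoint_kw = min(kw, self.capacity_kw)


class CloudCoordinator:
    """High-level coordination: split a grid-level target across reporting resources."""

    def dispatch(self, target_kw: float, measurements: list[Measurement]) -> dict[str, float]:
        total = sum(m.available_kw for m in measurements) or 1.0
        # Proportional allocation: each resource receives a share of the target
        # in proportion to the capacity it reported.
        return {m.resource_id: target_kw * m.available_kw / total for m in measurements}


controllers = [LocalController(f"der-{i}", capacity_kw=5.0) for i in range(3)]
coordinator = CloudCoordinator()
setpoints = coordinator.dispatch(9.0, [c.measure() for c in controllers])
for c in controllers:
    c.apply_setpoint(setpoints[c.resource_id])
    print(c.resource_id, round(c.setpoint_kw, 2), "kW")
```

In a real deployment the dispatch step would be an optimization over prices, forecasts, and network constraints rather than a simple proportional split; the sketch only shows where the local and cloud responsibilities sit.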
Power Management of Online Data-Intensive Services
David Meisner
Christopher M. Sadler
Luiz André Barroso
Thomas F. Wenisch
Proceedings of the 38th ACM International Symposium on Computer Architecture (2011)
Much of the success of the Internet services model can be attributed to the popularity of a class of workloads that we call Online Data-Intensive (OLDI) services. These workloads perform significant computing over massive data sets per user request but, unlike their offline counterparts (such as MapReduce computations), they require responsiveness in the sub-second time scale at high request rates. Large search products, online advertising, and machine translation are examples of workloads in this class. Although the load in OLDI services can vary widely during the day, their energy consumption sees little variance due to the lack of energy proportionality of the underlying machinery. The scale and latency sensitivity of OLDI workloads also make them a challenging target for power management techniques.
We investigate what, if anything, can be done to make OLDI systems more energy-proportional. Specifically, we evaluate the applicability of active and idle low-power modes to reduce the power consumed by the primary server components (processor, memory, and disk), while maintaining tight response time constraints, particularly on 95th-percentile latency. Using Web search as a representative example of this workload class, we first characterize a production Web search workload at cluster-wide scale. We provide a fine-grain characterization and expose the opportunity for power savings using low-power modes of each primary server component. Second, we develop and validate a performance model to evaluate the impact of processor- and memory-based low-power modes on the search latency distribution and consider the benefit of current and foreseeable low-power modes. Our results highlight the challenges of power management for this class of workloads. In contrast to other server workloads, for which idle low-power modes have shown great promise, for OLDI workloads we find that energy-proportionality with acceptable query latency can only be achieved using coordinated, full-system active low-power modes.
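To make the tail-latency argument concrete, here is a small illustrative simulation (not the paper's model) of how a fixed wake-up penalty from an idle low-power mode shows up in 95th-percentile latency. The arrival rate, service time, wake-up cost, and idle threshold are assumed values chosen only for illustration.

```python
# Toy single-server simulation: compare 95th-percentile latency with and
# without an idle low-power mode that charges a fixed wake-up penalty.
# All constants are assumed illustrative values, not measurements from the paper.
import random

REQUESTS = 100_000
ARRIVAL_RATE = 100.0      # requests per second to one server (assumed)
SERVICE_MS = 4.0          # fixed per-request service time (assumed)
WAKEUP_MS = 1.0           # time to exit the idle low-power mode (assumed)
IDLE_THRESHOLD_MS = 1.0   # idle gap after which the server has entered sleep (assumed)


def p95(samples):
    return sorted(samples)[int(0.95 * len(samples))]


def simulate(use_idle_sleep):
    rng = random.Random(0)  # same arrival sequence for both runs
    latencies, t, busy_until = [], 0.0, 0.0
    for _ in range(REQUESTS):
        t += rng.expovariate(ARRIVAL_RATE) * 1000.0  # exponential inter-arrivals, in ms
        start = max(t, busy_until)                   # wait if the server is still busy
        idle_gap = start - busy_until
        penalty = WAKEUP_MS if use_idle_sleep and idle_gap > IDLE_THRESHOLD_MS else 0.0
        busy_until = start + penalty + SERVICE_MS
        latencies.append(busy_until - t)             # completion time minus arrival time
    return p95(latencies)


print(f"95th-percentile latency, always on : {simulate(False):.2f} ms")
print(f"95th-percentile latency, idle sleep: {simulate(True):.2f} ms")
```

Because OLDI request streams leave few long idle gaps per server, most sleep entries are followed almost immediately by a wake-up, so the penalty lands directly on the latency distribution rather than buying meaningful energy savings.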
DRAM Errors in the Wild: A Large-Scale Field Study
Bianca Schroeder
Eduardo Pinheiro
Proceedings of SIGMETRICS/Performance (2009)
Errors in dynamic random access memory (DRAM) are a common form of hardware failure in modern compute clusters. Failures are costly both in terms of hardware replacement costs and service disruption. While a large body of work exists on DRAM in laboratory conditions, little has been reported on real DRAM failures in large production clusters.
In this paper, we analyze measurements of memory errors in a large fleet of commodity servers over a period of 2.5 years. The collected data covers multiple vendors, DRAM capacities and technologies, and comprises many millions of DIMM days. The goal of this paper is to answer questions such as the following: How common are memory errors in practice? What are their statistical properties? How are they affected by external factors, such as temperature and utilization, and by chip-specific factors, such as chip density, memory technology, and DIMM age?
We find that DRAM error behavior in the field differs in many key aspects from commonly held assumptions. For example, we observe DRAM error rates that are orders of magnitude higher than previously reported, with 25,000 to 70,000 errors per billion device hours per Mbit and more than 8% of DIMMs affected by errors per year. We provide strong evidence that memory errors are dominated by hard errors, rather than the soft errors that previous work suspected to be the dominant error mode. We find that temperature, known to strongly affect DIMM error rates in lab conditions, has a surprisingly small effect on error behavior in the field once all other factors are taken into account.
Finally, contrary to common fears, we observe no indication that newer generations of DIMMs have worse error behavior.
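As a back-of-the-envelope check on the scale of the quoted rates, the snippet below converts "errors per billion device hours per Mbit" into expected errors per DIMM per year; the 1 GB DIMM size is an assumed example, not a figure from the study.

```python
# Convert the abstract's error rates into expected errors per DIMM per year.
# The 1 GB DIMM capacity is a hypothetical example for illustration only.
HOURS_PER_YEAR = 24 * 365
DIMM_MBIT = 1 * 1024 * 8  # hypothetical 1 GB DIMM expressed in Mbit

for rate_per_mbit in (25_000, 70_000):
    errors_per_year = rate_per_mbit * DIMM_MBIT * HOURS_PER_YEAR / 1e9
    print(f"{rate_per_mbit:>6} errors / (10^9 device hours x Mbit) "
          f"-> ~{errors_per_year:,.0f} errors per DIMM per year")
```

Under these assumptions the quoted range works out to roughly 1,800 to 5,000 correctable errors per DIMM per year, which conveys why the authors describe field rates as orders of magnitude above earlier estimates.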
Power Provisioning for a Warehouse-sized Computer
Luiz André Barroso
The 34th ACM International Symposium on Computer Architecture (2007)
Large-scale Internet services require a computing infrastructure that can be appropriately described as a warehouse-sized computing system. The cost of building datacenter facilities capable of delivering a given power capacity to such a computer can rival the recurring energy consumption costs themselves. Therefore, there are strong economic incentives to operate facilities as close as possible to maximum capacity, so that the non-recurring facility costs can be best amortized. That is difficult to achieve in practice because of uncertainties in equipment power ratings and because power consumption tends to vary significantly with the actual computing activity. Effective power provisioning strategies are needed to determine how much computing equipment can be safely and efficiently hosted within a given power budget.
In this paper we present the aggregate power usage characteristics of large collections of servers (up to 15 thousand) for different classes of applications over a period of approximately six months. Those observations allow us to evaluate opportunities for maximizing the use of the deployed power capacity of datacenters, and to assess the risks of over-subscribing it. We find that even in well-tuned applications there is a noticeable gap (7-16%) between achieved and theoretical aggregate peak power usage at the cluster level (thousands of servers). The gap grows to almost 40% in whole datacenters. This headroom can be used to deploy additional compute equipment within the same power budget with minimal risk of exceeding it. We use our modeling framework to estimate the potential of power management schemes to reduce peak power and energy usage. We find that the opportunities for power and energy savings are significant, but greater at the cluster level (thousands of servers) than at the rack level (tens). Finally, we argue that systems need to be power efficient across the activity range, and not only at peak performance levels.
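A small worked example of the oversubscription headroom described above: the roughly 40% datacenter-level gap comes from the abstract, while the facility power budget and per-server nameplate rating are assumed values for illustration only.

```python
# Illustrate how a gap between nameplate and observed aggregate peak power
# translates into extra hosting capacity under a fixed facility budget.
# Budget and per-server rating are hypothetical; the 40% gap is from the abstract.
FACILITY_BUDGET_W = 10_000_000  # hypothetical 10 MW facility power budget
SERVER_PEAK_W = 500             # hypothetical per-server nameplate peak
DATACENTER_GAP = 0.40           # observed aggregate peak ~40% below the sum of nameplates

naive_servers = FACILITY_BUDGET_W // SERVER_PEAK_W
# If the fleet's simultaneous peak only reaches (1 - gap) of the sum of individual
# peaks, the same budget can safely host proportionally more machines.
oversubscribed_servers = int(FACILITY_BUDGET_W / (SERVER_PEAK_W * (1 - DATACENTER_GAP)))

print(f"Provisioning to nameplate peaks : {naive_servers:,} servers")
print(f"Provisioning to observed peak   : {oversubscribed_servers:,} servers")
```

Under these assumed numbers the same 10 MW budget hosts about 20,000 servers when sized to nameplate ratings versus roughly 33,000 when sized to the observed datacenter-level peak, which is the economic incentive the paper quantifies.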
Failure Trends in a Large Disk Drive Population
Eduardo Pinheiro
Luiz André Barroso
5th USENIX Conference on File and Storage Technologies (FAST 2007), pp. 17-29
It is estimated that over 90% of all new information produced in the world is being stored on magnetic media, most of it in hard disk drives. Despite their importance, there is relatively little published work on the failure patterns of disk drives and the key factors that affect their lifetime. Most available data are either based on extrapolation from accelerated aging experiments or come from relatively modest-sized field studies. Moreover, larger population studies rarely have the infrastructure in place to collect health signals from components in operation, which is critical information for detailed failure analysis.
We present data collected from detailed observations of a large disk drive population in a production Internet services deployment. The population observed is many times larger than that of previous studies. In addition to presenting failure statistics, we analyze the correlation between failures and several parameters generally believed to impact longevity.
Our analysis identifies several parameters from the drive’s self monitoring facility (SMART) that correlate highly with failures. Despite this high correlation, we conclude that models based on SMART parameters alone are unlikely to be useful for predicting individual drive failures. Surprisingly, we found that temperature and activity levels were much less correlated with drive failures than previously reported.
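To illustrate how a SMART signal can correlate strongly with failure yet remain a poor predictor of individual drive failures, here is a toy calculation; all counts and rates in it are hypothetical and are not the paper's data.

```python
# Toy numbers showing high relative risk but low predictive coverage for a
# single SMART signal. Every figure below is made up for illustration.
population = 100_000    # hypothetical drives observed
flagged = 2_000         # drives that reported the signal (e.g. nonzero scan errors)
failed_flagged = 300    # flagged drives that later failed
failed_total = 2_500    # all drives that failed in the observation window

baseline_rate = failed_total / population
flagged_rate = failed_flagged / flagged
recall = failed_flagged / failed_total  # fraction of failures preceded by the signal

print(f"Baseline failure rate             : {baseline_rate:.1%}")
print(f"Failure rate among flagged drives : {flagged_rate:.1%} "
      f"({flagged_rate / baseline_rate:.0f}x baseline)")
print(f"Failures preceded by the signal   : {recall:.1%}")
```

In this made-up population the flagged drives fail at several times the baseline rate (strong correlation), yet the signal precedes only a small fraction of all failures, which is the shape of the argument the paper makes against SMART-only prediction of individual drive failures.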