
Publications

Our teams aspire to make discoveries that impact everyone, and core to our approach is sharing our research and tools to fuel progress in the field.

1 - 15 of 146 publications
    ConSmax: Hardware-Friendly Alternative Softmax with Learnable Parameters
    Shiwei Liu
    Guanchen Tao
    Yifei Zou
    Derek Chow
    Zichen Fan
    Kauna Lei
    Bangfei Pan
    Dennis Sylvester
    Mehdi Saligane
    arXiv (2024)
    Abstract: The self-attention mechanism sets transformer-based large language models (LLMs) apart from convolutional and recurrent neural networks. Despite the performance improvement, achieving real-time LLM inference on silicon is challenging due to the extensive use of Softmax in self-attention. Beyond its non-linearity, Softmax's low arithmetic intensity greatly reduces processing parallelism, which becomes the bottleneck especially when dealing with longer contexts. To address this challenge, we propose Constant Softmax (ConSmax), a software-hardware co-design that serves as an efficient Softmax alternative. ConSmax employs differentiable normalization parameters to remove the maximum search and denominator summation in Softmax, allowing massive parallelization while still performing Softmax's critical role. In addition, a scalable ConSmax hardware design using a bitwidth-split look-up table (LUT) produces lossless non-linear operation and supports mixed-precision computing, further facilitating efficient LLM inference. Experimental results show that ConSmax achieves a minuscule power consumption of 0.2 mW and an area of 0.0008 mm^2 at a 1250 MHz working frequency in a 16 nm CMOS technology. Compared to state-of-the-art Softmax hardware, ConSmax delivers 3.35x power and 2.75x area savings with comparable accuracy on a GPT-2 model and the WikiText103 dataset.
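
Based only on the description above, here is a minimal NumPy sketch contrasting standard Softmax with a ConSmax-style element-wise alternative. The constants `beta` and `gamma` stand in for the learnable normalization parameters; the paper's exact formulation and training procedure may differ.

```python
import numpy as np

def softmax(x, axis=-1):
    """Standard softmax: needs a row-wise max search and a denominator sum."""
    x_max = np.max(x, axis=axis, keepdims=True)          # sequential reduction 1
    e = np.exp(x - x_max)
    return e / np.sum(e, axis=axis, keepdims=True)       # sequential reduction 2

def consmax_like(x, beta=1.0, gamma=8.0):
    """ConSmax-style alternative (illustrative): learned constants beta and gamma
    replace the max search and the denominator sum, so every element can be
    computed independently and in parallel."""
    return np.exp(x - beta) / gamma

scores = np.random.randn(4, 16)
print(softmax(scores).sum(axis=-1))       # rows sum to exactly 1
print(consmax_like(scores).sum(axis=-1))  # rows are only approximately normalized
```
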
    CDPU: Co-designing Compression and Decompression Processing Units for Hyperscale Systems
    Ani Udipi
    JunSun Choi
    Joonho Whangbo
    Jerry Zhao
    Edwin Lim
    Vrishab Madduri
    Yakun Sophia Shao
    Borivoje Nikolic
    Krste Asanovic
    Proceedings of the 50th Annual International Symposium on Computer Architecture, Association for Computing Machinery, New York, NY, USA (2023)
    Abstract: General-purpose lossless data compression and decompression ("(de)compression") are used widely in hyperscale systems and are key "datacenter taxes". However, designing optimal hardware compression and decompression processing units ("CDPUs") is challenging due to the variety of algorithms deployed, input data characteristics, and the evolving costs of CPU cycles, network bandwidth, and memory/storage capacity. To navigate this vast design space, we present the first large-scale data-driven analysis of (de)compression usage at a major cloud provider by profiling Google's datacenter fleet. We find that (de)compression consumes 2.9% of fleet CPU cycles and 10-50% of cycles in key services. Demand is also artificially limited: 95% of bytes compressed in the fleet use less capable algorithms to reduce compute cost, motivating a CDPU that changes the cost-vs-size tradeoff. Prior work has improved the microarchitectural state of the art for CDPUs supporting various algorithms in fixed contexts. However, we find that higher-level design parameters, such as CDPU placement, hash table sizing, and history window sizes, have an equally significant impact on the viability of CDPU integration, yet are not well studied. We therefore present the first end-to-end design and evaluation framework for CDPUs, including: (1) an open-source RTL-based CDPU generator that supports many run-time and compile-time parameters; (2) integration into an open-source RISC-V SoC for rapid performance and silicon area evaluation across CDPU placements and parameters; and (3) an open-source (de)compression benchmark, HyperCompressBench, that is representative of (de)compression usage in Google's fleet. Using our framework, we perform an extensive design space exploration running HyperCompressBench. Our exploration spans a 46× range in CDPU speedup and a 3× range in silicon area (for a single pipeline), and it evaluates a variety of CDPU integration techniques to optimize CDPU designs for hyperscale contexts. Our final hyperscale-optimized CDPU instances are 10× to 16× faster than a single Xeon core while consuming a small fraction (as little as 2.4% to 4.7%) of its area.
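
The cost-vs-size tradeoff the abstract refers to can be seen with Python's standard zlib module; the data and compression levels below are purely illustrative and are not drawn from the paper's fleet measurements.

```python
import time
import zlib

# Illustrative only: compare CPU cost vs. compressed size at different zlib
# levels, the same software tradeoff the abstract says limits fleet demand.
data = b"hyperscale log line with some repeated structure 1234567890\n" * 20000

for level in (1, 6, 9):
    start = time.perf_counter()
    out = zlib.compress(data, level)
    elapsed = time.perf_counter() - start
    ratio = len(data) / len(out)
    print(f"level {level}: {elapsed * 1e3:6.1f} ms, ratio {ratio:5.1f}x")
```
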
    EMISSARY: Enhanced Miss Awareness Replacement Policy for L2 Instruction Caching
    Nayana Prasad Nagendra
    Bhargav Reddy Godala
    Ishita Chaturvedi
    Atmn Patel
    Jared Stark
    Gilles A. Pokam
    Simone Campanoni
    David I. August
    Proceedings of the 50th Annual International Symposium on Computer Architecture (ISCA) (2023)
    Abstract: For decades, architects have designed cache replacement policies to reduce cache misses. Since not all cache misses affect processor performance equally, researchers have also proposed cache replacement policies that focus on reducing the total miss cost rather than the total miss count. However, all prior cost-aware replacement policies were proposed specifically for data caching and are either inappropriate or unnecessarily complex for instruction caching. This paper presents EMISSARY, the first family of cost-aware cache replacement policies designed specifically for instruction caching. Observing that modern architectures entirely tolerate many instruction cache misses, EMISSARY resists evicting those cache lines whose misses cause costly decode starvations. In the context of a modern processor with fetch-directed instruction prefetching and other aggressive front-end features, EMISSARY applied to instructions in the L2 cache delivers a 3.24% geomean speedup (up to 23.7%) and a geomean energy savings of 2.1% (up to 17.7%) when evaluated on widely used server applications with large code footprints. This speedup is 21.6% of the total speedup obtained by an unrealizable L2 cache with zero-cycle miss latency for all capacity and conflict instruction misses.
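
A toy sketch of the kind of cost-aware replacement the abstract describes, for a single cache set: lines whose misses caused a decode starvation are preferentially retained. The starvation signal, priority handling, and structure here are illustrative assumptions, not the actual EMISSARY policy.

```python
import collections

class CostAwareSet:
    """Toy cost-aware replacement for one cache set (illustrative, not the exact
    EMISSARY policy): lines whose misses caused a decode starvation are marked
    high priority and are only evicted when no low-priority victim exists."""

    def __init__(self, ways=8):
        self.ways = ways
        self.lines = collections.OrderedDict()    # addr -> high_priority flag, LRU order

    def access(self, addr, caused_starvation):
        if addr in self.lines:                     # hit: refresh recency
            self.lines.move_to_end(addr)
            return "hit"
        if len(self.lines) >= self.ways:           # miss: choose a victim
            victims = [a for a, hp in self.lines.items() if not hp] or list(self.lines)
            del self.lines[victims[0]]             # oldest low-priority line if any
        self.lines[addr] = caused_starvation
        return "miss"

s = CostAwareSet(ways=2)
print(s.access("A", True), s.access("B", False), s.access("C", False))  # miss miss miss
print(list(s.lines))   # "B" was evicted first because its miss was cheap
```
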
    Abstract: While profile-guided optimizations (PGO) and link-time optimizations (LTO) have been widely adopted, post-link optimizations (PLO) had languished until researchers recently demonstrated that late injection of profiles can yield significant performance improvements. However, the disassembly-driven, monolithic design of post-link optimizers faces scaling challenges with large binaries and is at odds with distributed build systems. To enable post-link optimizations within a distributed build environment, we propose Propeller, a relinking optimizer for warehouse-scale workloads. To enable flexible code layout optimizations, we introduce basic block sections, a novel linker abstraction. Propeller uses basic block sections to enable a new approach to PLO without disassembly. Propeller achieves scalability by relinking the binary using precise profiles instead of rewriting it, and the overhead of relinking is lowered by caching and leveraging distributed compiler actions during code generation. Propeller has been deployed to production at Google, with tens of millions of cores executing Propeller-optimized code at any time. An evaluation of internal warehouse-scale applications shows that Propeller improves performance by 1.1% to 8% beyond PGO and ThinLTO: compiler tools such as Clang improve by 7%, while MySQL improves by 1%. Compared to the state-of-the-art binary optimizer, Propeller achieves comparable performance while lowering memory overheads by 30%-70% on large benchmarks.
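
As an illustration of the kind of layout optimization that basic block sections enable, here is a toy greedy chaining of basic blocks by profiled edge hotness. It is a generic sketch over assumed inputs, not Propeller's actual algorithm, profile format, or linker interface.

```python
def greedy_layout(edges):
    """Toy profile-guided block layout (illustrative only): greedily chain basic
    blocks along their hottest fall-through edges so hot paths become sequential
    in the final binary image."""
    blocks = {b for edge in edges for b in edge}
    chains = {b: [b] for b in blocks}              # start with one chain per block
    for (src, dst), _count in sorted(edges.items(), key=lambda kv: -kv[1]):
        a, b = chains[src], chains[dst]
        if a is b or a[-1] != src or b[0] != dst:
            continue                               # only merge at chain endpoints
        merged = a + b
        for blk in merged:
            chains[blk] = merged
    seen, order = set(), []
    for chain in chains.values():
        if id(chain) not in seen:
            seen.add(id(chain))
            order.extend(chain)
    return order

profile = {("entry", "hot"): 1000, ("hot", "ret"): 990, ("entry", "cold"): 10}
print(greedy_layout(profile))   # 'entry' -> 'hot' -> 'ret' stays contiguous; 'cold' is placed apart
```
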
    PTStore: Lightweight Architectural Support for Page Table Isolation
    Wende Tan
    Yangyu Chen
    Yuan Li
    Ying Liu
    Jianping Wu
    Chao Zhang
    2023 60th ACM/IEEE Design Automation Conference (DAC), IEEE, pp. 1-6
    Abstract: Page tables are critical data structures in kernels, serving as the trust base of most mitigation solutions. Their integrity is thus crucial but is often taken for granted. Existing page table protection solutions usually provide insufficient security guarantees, require heavyweight hardware, or introduce high overheads. In this paper, we present a novel lightweight hardware-software co-design solution, PTStore, consisting of a secure region storing page tables and tokens verifying page table pointers. Evaluation results on FPGA-based prototypes show that PTStore introduces only <0.92% hardware overhead and <0.86% performance overhead while providing strong security guarantees, showing that PTStore is both efficient and effective.
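
A toy software model of the stated idea: a secure region holds page-table pointers together with tokens, and a pointer is only followed if its token still matches. The structure and API below are illustrative assumptions, not the paper's hardware design.

```python
import secrets

class PageTablePointerStore:
    """Toy model of the PTStore idea (illustrative): page-table pointers are
    registered in a protected region alongside a random token, and a pointer is
    only trusted if its token verifies."""

    def __init__(self):
        self._tokens = {}                     # pointer -> token (the "secure region")

    def register(self, ptr):
        token = secrets.token_bytes(8)
        self._tokens[ptr] = token
        return token

    def verify(self, ptr, token):
        return self._tokens.get(ptr) == token

store = PageTablePointerStore()
pgd = 0x1000
tok = store.register(pgd)
assert store.verify(pgd, tok)                 # legitimate walk succeeds
assert not store.verify(0x2000, tok)          # forged or corrupted pointer is rejected
```
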
    Profiling Hyperscale Big Data Processing
    Aasheesh Kolli
    Abraham Gonzalez
    Samira Khan
    Sihang Liu
    Krste Asanovic
    ISCA (2023)
    Abstract: Computing demand continues to grow exponentially, largely driven by "big data" processing on hyperscale data stores. At the same time, the slowdown of Moore's law is leading the industry to embrace custom computing in large-scale systems. Taken together, these trends motivate the need to characterize live production traffic on these large data processing platforms and understand the opportunity for acceleration at scale. This paper addresses this key need. We characterize three important production distributed database and data analytics platforms at Google to identify key hardware acceleration opportunities, and we perform a comprehensive limits study to understand the trade-offs among various hardware acceleration strategies. We observe that hyperscale data processing platforms spend significant time on distributed storage and other remote work across distributed workers; therefore, optimizing storage and remote work in addition to compute acceleration is critical for these platforms. We present a detailed breakdown of the compute-intensive functions in these platforms and identify the dominant data operations related to datacenter and systems taxes. We observe that no single accelerator can provide a significant benefit, but that collectively a "sea of accelerators" can accelerate many of these smaller, platform-specific functions. We demonstrate the potential gains of the sea-of-accelerators proposal in a limits study and analytical model. We also study the trade-offs between accelerator location (on-chip/off-chip) and invocation model (synchronous/asynchronous). Finally, we propose and evaluate a chained accelerator execution model in which the identified compute-intensive functions are accelerated and pipelined to avoid invocation from the core, achieving a 3x improvement over the baseline system while nearly matching the performance of an ideal fully asynchronous execution model.
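
An Amdahl-style limit model helps convey why a sea of small accelerators can pay off collectively even though no single one does. The function fractions and speedups below are hypothetical, not the paper's measurements.

```python
def sea_of_accelerators_speedup(fractions_and_speedups):
    """Amdahl-style limit model (illustrative numbers, not the paper's data):
    many small platform-specific functions, each accelerated by a modest factor,
    can combine into a meaningful end-to-end speedup."""
    accelerated = sum(f for f, _ in fractions_and_speedups)
    remaining = (1.0 - accelerated) + sum(f / s for f, s in fractions_and_speedups)
    return 1.0 / remaining

# Hypothetical breakdown: ten functions of ~5% of cycles each, each sped up 10x.
functions = [(0.05, 10.0)] * 10
print(f"{sea_of_accelerators_speedup(functions):.2f}x")   # ~1.82x end-to-end
```
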
    Abstract: We introduce Downfall attacks, new transient execution attacks that undermine the security of computers running everywhere across the internet. We exploit the gather instruction on high-performance x86 CPUs to leak data across the boundaries of user and kernel space, processes, virtual machines, and trusted execution environments. We also develop practical end-to-end attacks to steal cryptographic keys, programs' runtime data, and even data at rest (arbitrary data). Our findings, exploitation techniques, and demonstrated attacks defeat all previous defenses, calling for critical hardware fixes and security updates for widely used client and server computers.
    Automatic Domain-Specific SoC Design for Autonomous Unmanned Aerial Vehicles
    David Brooks
    Gu-Yeon Wei
    Kshitij Bhardwaj
    Paul Whatmough
    Srivatsan Krishnan
    Vijay Janapa Reddi
    Zishen Wan
    55th IEEE/ACM International Symposium on Microarchitecture, IEEE (2022) (to appear)
    Abstract: Building domain-specific accelerators is becoming increasingly important for meeting high performance requirements under stringent power and real-time constraints. However, emerging application domains such as autonomous vehicles are complex systems in which the constraints extend beyond the computing stack. Manually selecting and navigating the design space to build custom, efficient domain-specific SoCs (DSSoCs) is tedious and expensive, so automated DSSoC design methodologies are needed. In this paper, we use agile, autonomous UAVs as a case study for understanding how to automate the design of domain-specific SoCs for autonomous vehicles. Architecting a UAV DSSoC requires considering parameters such as sensor rate, compute throughput, and other physical characteristics (e.g., payload weight, thrust-to-weight ratio) that affect overall performance. Iterating over the many component choices results in a combinatorial explosion in the number of possible designs: from tens of thousands to billions, depending on implementation details. To navigate the DSSoC design space efficiently, we introduce AutoPilot, a systematic methodology for automatically designing DSSoCs for autonomous UAVs. AutoPilot uses machine learning to navigate the large DSSoC design space and automatically select a combination of autonomy algorithm and hardware accelerator while accounting for the cross-product effects across different UAV components. AutoPilot consistently outperforms general-purpose hardware selections such as Xavier NX and Jetson TX2, as well as dedicated hardware accelerators built for autonomous UAVs. DSSoC designs generated by AutoPilot increase the number of missions completed on average by up to 2.25x, 1.62x, and 1.43x for nano-, micro-, and mini-UAVs, respectively, over the baselines. We also discuss how AutoPilot can be extended to other related autonomous vehicles using the same set of principles.
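
A toy calculation of the cross-product effect between compute and vehicle parameters: a faster perception accelerator lets the UAV fly faster but also draws more power from the same battery. The model and every number in it are hypothetical and are not taken from the paper.

```python
def missions_per_charge(fps, accel_watts, battery_wh=50.0, base_watts=200.0,
                        mission_m=1000.0):
    """Toy cross-product model (all numbers hypothetical): perception frame rate
    limits safe velocity, while accelerator power drains the shared battery."""
    velocity = min(fps * 0.1, 15.0)                  # m/s, capped by perception rate
    mission_s = mission_m / velocity
    energy_per_mission_wh = (base_watts + accel_watts) * mission_s / 3600.0
    return battery_wh / energy_per_mission_wh

print(f"{missions_per_charge(fps=30, accel_watts=5):.1f} missions")    # slow but frugal
print(f"{missions_per_charge(fps=240, accel_watts=40):.1f} missions")  # fast but power-hungry
```
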
    Vectorized and performance-portable Quicksort
    Joachim Giesen
    Mark Blacher
    Peter Sanders
    Software: Practice and Experience (2022) (to appear)
    Abstract: Recent work has shown that implementations of Quicksort using vector CPU instructions can outperform the non-vectorized algorithms in widespread use. However, these implementations are typically single-threaded, written for a particular instruction set, and restricted to a small set of key types. We lift these three restrictions: our proposed vqsort algorithm integrates into the state-of-the-art parallel sorter ips4o, speeding it up by a factor of 1.5 to 1.8. The same implementation works on seven instruction sets (including SVE and RISC-V V) across four platforms. It also supports floating-point and 16- to 128-bit integer keys. To the best of our knowledge, this is the fastest sort for non-tuple keys on CPUs, up to 20 times as fast as the sorting algorithms implemented in standard libraries. This paper focuses on the practical engineering aspects that enable this speed and portability, which we have not previously seen demonstrated for a Quicksort implementation. Furthermore, we introduce compact, transpose-free sorting networks for in-register sorting of small arrays, and a vector-friendly pivot sampling strategy that is robust against adversarial input.
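
A toy scalar illustration of robust pivot sampling: taking the median of a small random sample denies an adversary a fixed pattern to attack. vqsort's actual vectorized sampling and sorting-network machinery differ; this only shows the idea.

```python
import random
import statistics

def sample_pivot(arr, samples=9, rng=random):
    """Toy pivot sampling (illustrative; vqsort's scheme differs): use the median
    of a small random sample so an adversary who knows the code but not the
    random positions cannot reliably force quadratic splits."""
    picks = [arr[rng.randrange(len(arr))] for _ in range(samples)]
    return statistics.median_low(picks)

def quicksort(arr):
    if len(arr) <= 32:
        return sorted(arr)            # small arrays: fall back (vqsort uses sorting networks)
    pivot = sample_pivot(arr)
    left = [x for x in arr if x < pivot]
    mid = [x for x in arr if x == pivot]
    right = [x for x in arr if x > pivot]
    return quicksort(left) + mid + quicksort(right)

print(quicksort([5, 3, 8, 1] * 100)[:8])
```
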
    Data-Driven Offline Optimization for Architecting Hardware Accelerators
    Aviral Kumar
    Sergey Levine
    International Conference on Learning Representations 2022 (to appear)
    Abstract: With the goal of achieving higher efficiency, the semiconductor industry has gradually moved toward application-specific hardware accelerators. While this paradigm shift is already showing promising results, designers must spend considerable manual effort and perform a large number of time-consuming simulations to find accelerators that can accelerate multiple target applications while obeying design constraints. Moreover, such a "simulation-driven" approach must be re-run from scratch every time the target applications or constraints change. An alternative paradigm is a "data-driven", offline approach that uses logged simulation data to architect hardware accelerators without needing any further simulation. Such an approach not only alleviates the need to run time-consuming simulations but also enables data reuse, and it applies even when target applications change. In this paper, we develop such a data-driven offline optimization method for designing hardware accelerators, PRIME, that enjoys all of these properties. Our approach learns a conservative, robust estimate of the desired cost function, utilizes infeasible points, and optimizes the design against this estimate without any additional simulator queries during optimization.
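
A rough sketch of the simulator-free loop described above: fit a surrogate on logged designs, penalize candidates far from the data so the estimate stays conservative, and pick the best candidate without new simulations. The surrogate, penalty, and data here are illustrative stand-ins, not PRIME's actual method.

```python
import numpy as np

# Logged offline data: design parameters X and simulated cost y (synthetic here).
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 4))
y = 1.0 + X @ np.array([2.0, -1.0, 0.5, 3.0]) + 0.1 * rng.normal(size=200)

# Simple ridge-regression surrogate of the cost function.
A = np.hstack([X, np.ones((200, 1))])
w = np.linalg.solve(A.T @ A + 0.1 * np.eye(5), A.T @ y)

def conservative_cost(x):
    """Illustrative stand-in for a conservative objective: predicted cost plus a
    penalty that grows with distance from the logged designs, so the optimizer
    is not rewarded for wandering where the model is unreliable."""
    pred = np.append(x, 1.0) @ w
    novelty = np.min(np.linalg.norm(X - x, axis=1))
    return pred + 5.0 * novelty

candidates = rng.uniform(0, 1, size=(1000, 4))
best = min(candidates, key=conservative_cost)    # no simulator queries needed
print(best, conservative_cost(best))
```
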
    Detection and Prevention of Silent Data Corruption in an Exabyte-scale Database System
    The 18th IEEE Workshop on Silicon Errors in Logic – System Effects, IEEE (2022)
    Abstract: Google's Spanner database serves multiple exabytes of data at well over a billion queries per second, distributed over a significant fraction of Google's fleet. Silent data corruption (SDC) events due to hardware error are detected and prevented by Spanner several times per week. For every detected error there is some number of undetected errors that, in rare (but not black swan) events, cause corruption, either transiently for reads or durably for writes, potentially violating the most fundamental contract a database system makes with its users: to store and retrieve data with absolute reliability and availability. We describe the work we have done to detect and prevent silent data corruption and, equally importantly, to remove faulty machines from the fleet, both manually and automatically. We present a simplified analytic model of corruption that provides some insight into the most effective ways to prevent end-user corruption events. We have made qualitative gains in the detection and prevention of SDC events, but quantitative analysis remains difficult. We discuss various potential trajectories in hardware (un)reliability and how they will affect our ability to build reliable database systems on commodity hardware.
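
A back-of-the-envelope illustration, not the paper's analytic model, of why removing faulty machines quickly matters as much as detecting individual errors; all rates below are hypothetical.

```python
def end_user_corruptions_per_week(faulty_machines_per_week, corruptions_per_machine_week,
                                  detection_probability, weeks_until_removed):
    """Back-of-the-envelope model (not the paper's actual analysis): undetected
    corruptions scale with how long a faulty machine stays in the fleet, so fast
    removal matters as much as per-error detection."""
    escaped_rate = corruptions_per_machine_week * (1.0 - detection_probability)
    return faulty_machines_per_week * weeks_until_removed * escaped_rate

# Hypothetical numbers: halving removal time halves end-user exposure.
print(end_user_corruptions_per_week(3, 10, 0.99, weeks_until_removed=4))
print(end_user_corruptions_per_week(3, 10, 0.99, weeks_until_removed=2))
```
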
    Searching for Efficient Neural Architectures for On-Device ML on Edge TPUs
    Anton Spiridonov
    Hao Xu
    Marie Charisse White
    Ping Zhou
    Suyog Gupta
    Yun Long
    Zhuo Wang
    IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) (2022)
    Abstract: On-device ML accelerators are becoming a standard in modern mobile systems-on-chip (SoCs). Neural architecture search (NAS) helps to efficiently utilize the high compute throughput offered by these accelerators. However, existing NAS frameworks have several practical limitations when scaling to multiple tasks and different target platforms. In this work, we provide a two-pronged approach to this challenge: (i) a NAS-enabling infrastructure that decouples model cost evaluation, search space design, and the NAS algorithm to rapidly target various on-device ML tasks, and (ii) search spaces crafted from group-convolution-based inverted bottleneck (IBN) variants that provide flexible quality/performance trade-offs on ML accelerators, complementing the existing full- and depthwise-convolution-based IBNs. Using this approach, we target a state-of-the-art mobile platform, the Google Tensor SoC, and demonstrate neural architectures that improve the quality-performance Pareto frontier for various computer vision tasks (classification, detection, segmentation) as well as natural language processing tasks.
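
A quick parameter-count comparison shows why group-convolution IBNs sit between the full- and depthwise-convolution variants in cost, which is the quality/performance middle ground the abstract refers to; the channel counts below are arbitrary.

```python
def conv_params(c_in, c_out, k=3, groups=1):
    """Weight count of a k x k convolution with the given number of groups."""
    return (c_in // groups) * c_out * k * k

c = 128
full = conv_params(c, c)                       # full convolution
group4 = conv_params(c, c, groups=4)           # group convolution, 4 groups
depthwise = conv_params(c, c, groups=c)        # depthwise (one filter per channel)
print(full, group4, depthwise)                 # 147456, 36864, 1152
```
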
    Abstract: Both the transition to green energy needed to reach net-zero CO2 emissions by 2050 and the increase in energy prices suggest that we must find ways to reduce electricity consumption in all sectors, and in particular in the ICT sector. A large majority of companies operate fleets of computers for their employees, of variable but growing size. Energy consumption is one of the biggest costs of operating these fleets, yet many of the computers spend long periods turned on but idle, wasting large amounts of electricity. Dynamic Power Management (DPM) is a set of techniques and methods applied at different levels to reduce the consumption and heat dissipation of a computer. It includes techniques as varied as dynamic frequency scaling (DFS) of the microprocessor and turning off devices that are not in use. The different DPM techniques are directed by a series of energy management policies, which establish the operating guidelines of the different components. These policies are generated using different methods, adapted to the component being managed and the objectives to be achieved. This thesis presents a DPM technique applied to a complete computer fleet. The goal is to reduce fleet consumption by proactively shutting down computers while maintaining high levels of user satisfaction. The policies that direct the energy management system are generated from data collected from the fleet under study and management. Utilisation models built from these data represent and predict the behaviour of each user, making it possible to generate fully customised policies for each user. One of the main contributions of this thesis is the use of satisfaction as the central metric in solving the optimisation problem of generating energy policies. New metrics are defined that allow user satisfaction to be measured while the fleet is being optimised by the energy management system and, most importantly, to generate energy policies that guarantee a certain level of satisfaction for each user. To verify and apply the proposed energy management method, a tool has been implemented that obtains policies for a given fleet, studies variations, and, using a simulation method, generates synthetic fleet records. Finally, a validation of the presented work has been carried out, showing that it is possible to save up to 90% of the energy otherwise wasted.
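
A toy version of a satisfaction-constrained shutdown policy: choose each user's idle timeout from a high quantile of that user's historical idle gaps, so only a bounded fraction of shutdowns can interrupt them. This is an illustrative sketch, not the thesis's actual models or metrics.

```python
import statistics

def idle_timeout_minutes(idle_gaps, satisfaction=0.95):
    """Toy per-user policy (illustrative, not the thesis's actual method): pick
    the idle timeout as a high quantile of the user's historical idle gaps, so
    at most (1 - satisfaction) of shutdowns interrupt that user."""
    quantiles = statistics.quantiles(idle_gaps, n=100)
    return quantiles[int(satisfaction * 100) - 1]

# Hypothetical history of idle-gap lengths (minutes) for one user.
history = [5, 7, 8, 10, 12, 15, 20, 45, 60, 480, 600, 720]
print(idle_timeout_minutes(history))          # shut down only after long, safe gaps
```
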
    Abstract: State-of-the-art neural models typically encode document-query pairs using cross-attention for re-ranking. To this end, models generally use either an encoder-only (e.g., BERT) or an encoder-decoder (e.g., T5) architecture. These paradigms are not without flaws: running the model on all query-document pairs at inference time incurs a significant computational cost. This paper proposes a new training and inference paradigm for re-ranking. We propose to finetune a pretrained encoder-decoder model on the task of document-to-query generation. Subsequently, we show that this encoder-decoder architecture can be decomposed into a decoder-only language model during inference. This results in significant inference-time speedups, since the decoder-only architecture only needs to interpret static encoder embeddings during inference. Our experiments show that this new paradigm achieves results comparable to the more expensive cross-attention ranking approaches while being up to 6.8x faster. We believe this work paves the way for more efficient neural rankers that leverage large pretrained models.
    Multi-input Serial Adders for FPGA-like Computational Fabric
    Herman Schmit
    Matthew Denton
    International Symposium on Field-Programmable Gate Arrays (2022) (to appear)
    Abstract: In this paper, we present a new functional unit to replace the K-LUT in an FPGA-like computational fabric designed specifically to accelerate instance-specific sparse integer matrix multiplication. We use a suite of matrices, the VPR place-and-route tool [vpr], and modern architectural representations of the interconnect [global-new-local] to examine this architectural idea. The new cell, called the K-ADD, increases density by 2.5x to 4x and increases net performance by 8% to 30% in a 16nm implementation. This benefit magnifies the two-orders-of-magnitude advantage of using instance-specific matrix multipliers demonstrated in [denton2021direct]. We investigate the cluster size, N, in an experiment similar to [global-new-local], across multiple technology nodes. In that investigation, netlists that use the K-ADD show a relationship to cluster size similar to that of conventional netlists. When matrices are mapped to LUTs, however, clustering provides very little delay benefit.
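
A bit-serial, multi-input addition can be modelled in a few lines. This illustrates the general idea of a serial adder that consumes one bit of every operand per cycle, not the paper's exact K-ADD cell or its mapping to the fabric.

```python
def serial_multi_add(operands, width=8):
    """Bit-serial K-input addition (illustrative model, not the paper's exact
    K-ADD cell): each cycle consumes one bit of every operand, adds them into a
    small carry state, and emits one result bit."""
    carry, result = 0, 0
    out_bits = width + max(1, len(operands)).bit_length()   # room for the carries
    for bit in range(out_bits):
        total = carry + sum((x >> bit) & 1 for x in operands)
        result |= (total & 1) << bit
        carry = total >> 1
    return result

values = [23, 99, 41, 7]          # assumed to fit in `width` bits each
assert serial_multi_add(values) == sum(values)
print(serial_multi_add(values))   # 170
```
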