Sotiris Apostolakis
Sotiris works on performance analysis and compiler optimization of warehouse-scale systems. Before joining Google, he received his Ph.D. in Computer Science from Princeton University, where his dissertation focused on compilers, program analysis, and automatic parallelization. Earlier, he received his Diploma in Electrical and Computer Engineering from the National Technical University of Athens in Greece.
Authored Publications
PROMPT: A Fast and Extensible Memory Profiling Framework
Ziyang Xu
Yebin Chon
Yian Su
Zujun Tan
Simone Campanoni
David I. August
Proceedings of the ACM on Programming Languages, Volume 8, Issue OOPSLA (2024)
Memory profiling captures programs' dynamic memory behavior, assisting programmers in debugging and tuning, and enabling advanced compiler optimizations like speculation-based automatic parallelization. As each use case demands its unique program trace summary, various memory profiler types have been developed. Yet, designing practical memory profilers often requires extensive compiler expertise, adeptness in program optimization, and significant implementation effort. As a result, the demand for fast and robust profilers often goes unmet. To bridge this gap, this paper presents PROMPT, a framework for streamlined development of fast memory profilers. With PROMPT, developers need only specify profiling events and define the core profiling logic, bypassing the complexities of custom instrumentation and intricate memory profiling components and optimizations. Two state-of-the-art memory profilers were ported to PROMPT with all features preserved. By focusing on the core profiling logic, their code was reduced by more than 65%, and their profiling overhead improved by 5.3× and 7.1×, respectively. To further underscore PROMPT's impact, a tailored memory profiling workflow was constructed for a sophisticated compiler optimization client. In 570 lines of code, this redesigned workflow satisfies the client's memory profiling needs while achieving more than 90% reduction in profiling overhead and improved robustness compared to the original profilers.
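The framework/client split described above can be sketched as follows. This is a hedged illustration of an event-driven memory profiler, not PROMPT's actual API; all names are invented.

```python
# Illustrative sketch of an event-driven memory-profiling framework in the
# spirit of PROMPT's design; class, method, and event names are hypothetical.

class MemoryProfiler:
    """Framework core: dispatches load/store events to user-defined logic."""
    def __init__(self):
        self.handlers = {"load": [], "store": []}

    def on(self, event, handler):
        self.handlers[event].append(handler)

    def emit(self, event, addr, inst_id):
        for h in self.handlers[event]:
            h(addr, inst_id)

# User-supplied core logic: a flow-dependence profiler recording which store
# each load observed, as (store instruction, load instruction) pairs.
def make_dependence_profiler(profiler):
    last_store = {}   # addr -> instruction id of the most recent store
    deps = set()      # observed (store_inst, load_inst) dependences
    profiler.on("store", lambda addr, i: last_store.__setitem__(addr, i))
    profiler.on("load", lambda addr, i:
                deps.add((last_store[addr], i)) if addr in last_store else None)
    return deps

profiler = MemoryProfiler()
deps = make_dependence_profiler(profiler)

# Simulated instrumented trace: (event, address, instruction id).
trace = [("store", 0x10, "s1"), ("load", 0x10, "l1"),
         ("store", 0x20, "s2"), ("load", 0x20, "l2"),
         ("load", 0x30, "l3")]   # l3 reads unwritten memory: no dependence
for event, addr, inst in trace:
    profiler.emit(event, addr, inst)

print(sorted(deps))  # [('s1', 'l1'), ('s2', 'l2')]
```

The point of the split is that the framework owns instrumentation and event dispatch, while the profiler author writes only the handler logic.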
Safer at Any Speed: Automatic Context-Aware Safety Enhancement for Rust
Natalie Popescu
Ziyang Xu
David I. August
Amit Levy
Proceedings of the ACM on Programming Languages, Volume 5, Issue OOPSLA (2021)
Type-safe languages improve application safety by eliminating whole classes of vulnerabilities, such as buffer overflows, by construction. However, this safety sometimes comes with a performance cost. As a result, many modern type-safe languages provide escape hatches that allow developers to manually bypass these safety guarantees. The relative value of performance to safety and the degree of performance obtained depend upon the application context, including user goals and the hardware upon which the application is to be executed. Since libraries may be used in many different contexts, library developers cannot make performance-safety trade-off decisions appropriate for all cases. Application developers can tune libraries themselves to increase safety or performance, but this requires extra effort and makes libraries less reusable. To address this problem, we present NADER, a Rust development tool that makes applications safer by automatically transforming unsafe code into equivalent safe code according to developer preferences and application context. In an end-to-end system evaluation in a given context, NADER automatically reintroduces numerous library bounds checks, in many cases making application code that uses popular Rust libraries safer with no corresponding loss in performance.
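The context-dependent trade-off NADER navigates can be illustrated with a small sketch. The call sites, timings, and threshold below are hypothetical, and this is not NADER's implementation:

```python
# Hypothetical sketch of context-aware safety enhancement in the spirit of
# NADER: convert an unsafe call site to its safe equivalent whenever the
# measured overhead in this application's context stays under a budget.
# Call-site names and timings below are made up for illustration.

def choose_variants(call_sites, overhead_budget=0.01):
    """call_sites: list of (name, unsafe_time_s, safe_time_s) measurements."""
    decisions = {}
    for name, t_unsafe, t_safe in call_sites:
        overhead = (t_safe - t_unsafe) / t_unsafe
        # Prefer safety unless the slowdown exceeds the budget.
        decisions[name] = "safe" if overhead <= overhead_budget else "unsafe"
    return decisions

sites = [
    ("vec::get_unchecked @ parser.rs:42", 1.00, 1.003),  # 0.3% slower: go safe
    ("slice copy @ codec.rs:97",          1.00, 1.25),   # 25% slower: keep unsafe
]
print(choose_variants(sites))
# {'vec::get_unchecked @ parser.rs:42': 'safe', 'slice copy @ codec.rs:97': 'unsafe'}
```

The key design point is that the decision is made per application context, from measurements in that context, rather than once and for all by the library author.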
MemoDyn: Exploiting Weakly Consistent Data Structures for Dynamic Parallel Memoization
Prakash Prabhu
Stephen R. Beard
Ayal Zaks
David I. August
27th IEEE International Conference on Parallel Architecture and Compilation Techniques (PACT) (2018)
Several classes of algorithms for combinatorial search and optimization problems employ memoization data structures to speed up their serial convergence. However, accesses to these data structures impose dependences that obstruct program parallelization. Such programs often continue to function correctly even when queries into these data structures return a partial view of their contents. Weakening the consistency of these data structures can unleash new parallelism opportunities, potentially at the cost of additional computation. These opportunities must, therefore, be carefully exploited for overall speedup. This paper presents MEMODYN, a framework for parallelizing loops that access data structures with weakly consistent semantics. MEMODYN provides programming abstractions to express weak semantics, and consists of a parallelizing compiler and a runtime system that automatically and adaptively exploit the semantics for optimized parallel execution. Evaluation of MEMODYN shows that it achieves efficient parallelization, providing significant improvements over competing techniques in terms of both runtime performance and solution quality.
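The idea of weakly consistent memoization can be sketched as follows. This toy example is not MEMODYN's API; it shows a shared memo table whose racy accesses cost at most redundant computation, never correctness:

```python
# Sketch of a weakly consistent memoization table in the spirit of MEMODYN
# (illustrative only): parallel workers query the table without
# synchronization, so a lookup may miss an in-flight insert. A miss costs
# only redundant computation; the final result is still correct because the
# memoized function is pure and every writer stores the same value.

from concurrent.futures import ThreadPoolExecutor

memo = {}  # shared table; a read takes whatever snapshot it happens to see

def fib(n):
    cached = memo.get(n)      # weakly consistent read: may return None even
    if cached is not None:    # if another worker just finished computing n
        return cached
    result = n if n < 2 else fib(n - 1) + fib(n - 2)
    memo[n] = result          # benign race: all writers agree on the value
    return result

with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(fib, [25, 25, 25, 25]))

print(results)  # [75025, 75025, 75025, 75025], correct despite races
```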
Speculatively Exploiting Cross-Invocation Parallelism
Jialu Huang
Prakash Prabhu
Thomas B. Jablin
Soumyadeep Ghosh
Jae W. Lee
David I. August
25th IEEE International Conference on Parallel Architecture and Compilation Techniques (PACT) (2016)
Automatic parallelization has shown promise in producing scalable multi-threaded programs for multi-core architectures. Most existing automatic techniques parallelize independent loops and insert global synchronization between loop invocations. For programs with many loop invocations, frequent synchronization often becomes the performance bottleneck. Some techniques exploit cross-invocation parallelism to overcome this problem. Using static analysis, they partition iterations among threads to avoid cross-thread dependences. However, this approach may fail if dependence pattern information is not available at compile time. To address this limitation, this work proposes SpecCross, the first automatic parallelization technique to exploit cross-invocation parallelism using speculation. With speculation, iterations from different loop invocations can execute concurrently, and the program synchronizes only on misspeculation. This allows SpecCross to adapt to dependence patterns that only manifest on particular inputs at runtime. Evaluation on eight programs shows that SpecCross achieves a geomean speedup of 3.43× over parallel execution without cross-invocation parallelization.
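The speculate-then-validate pattern can be modeled with a small sequential simulation. The read/write-set representation and all names here are illustrative, not SpecCross's design:

```python
# Hedged sketch of speculative cross-invocation execution in the spirit of
# SpecCross (all details invented). Iterations of invocation k+1 proceed
# without waiting for invocation k; a check against the previous invocation's
# write set detects misspeculation, and only conflicting iterations are
# re-executed after a barrier.

def run_speculatively(invocations, data):
    """invocations: list of iteration lists; each iteration is
    (reads, writes, fn) where fn(data) mutates `data` in place."""
    prev_writes = set()
    for iters in invocations:
        redo = []
        for reads, writes, fn in iters:
            if reads & prev_writes:        # reads data the previous
                redo.append((reads, writes, fn))  # invocation wrote: misspeculate
            else:
                fn(data)                   # safe to overlap with invocation k
        # Barrier: previous invocation is done; re-run conflicting iterations.
        for reads, writes, fn in redo:
            fn(data)
        prev_writes = set().union(*(w for _, w, _ in iters)) if iters else set()
    return data

inv1 = [({"a"}, {"a"}, lambda d: d.__setitem__("a", d["a"] + 1))]
inv2 = [({"a"}, {"b"}, lambda d: d.__setitem__("b", d["a"] * 10))]  # reads "a"
print(run_speculatively([inv1, inv2], {"a": 1, "b": 1}))  # {'a': 2, 'b': 20}
```

In the real technique the overlap is concurrent and validation happens at runtime; this sequential model only shows the bookkeeping logic.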
NOELLE Offers Empowering LLVM Extensions
Angelo Matni
Enrico Armenio Deiana
Yian Su
Lukas Gross
Souradip Ghosh
Ziyang Xu
Zujun Tan
Ishita Chaturvedi
Brian Homerding
Tommy McMichen
David I. August
Simone Campanoni
Proceedings of the 2022 International Symposium on Code Generation and Optimization (CGO)
Modern and emerging architectures demand increasingly complex compiler analyses and transformations. As the emphasis on compiler infrastructure moves beyond support for peephole optimizations and the extraction of instruction-level parallelism, compilers should support custom tools designed to meet these demands with higher-level analysis-powered abstractions and functionalities of wider program scope. This paper introduces NOELLE, a robust open-source domain-independent compilation layer built upon LLVM that provides this support. NOELLE extends the abstractions and functionalities provided by LLVM, enabling advanced, program-wide code analyses and transformations. This paper shows the power of NOELLE by presenting a diverse set of 11 custom tools built upon it.
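A minimal sketch of the kind of program-wide abstraction such a layer exposes, here a toy data-dependence graph over an invented three-address IR (this is not NOELLE's C++ API):

```python
# Toy illustration of a higher-level, program-wide abstraction: a
# data-dependence graph computed from def-use chains over a tiny invented IR.
# Real layers like NOELLE provide far richer dependence information.

def build_ddg(instructions):
    """instructions: list of (dest, operands). Returns edges (i, j) meaning
    instruction j uses a value defined by instruction i."""
    last_def = {}
    edges = set()
    for j, (dest, operands) in enumerate(instructions):
        for op in operands:
            if op in last_def:
                edges.add((last_def[op], j))
        last_def[dest] = j
    return edges

ir = [
    ("x", []),          # 0: x = const
    ("y", ["x"]),       # 1: y = f(x)
    ("z", ["x", "y"]),  # 2: z = g(x, y)
]
print(sorted(build_ddg(ir)))  # [(0, 1), (0, 2), (1, 2)]
```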
SCAF: A Speculation-Aware Collaborative Dependence Analysis Framework
Ziyang Xu
Zujun Tan
Greg Chan
Simone Campanoni
David I. August
Proceedings of the 41st ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI) (2020)
Program analysis determines the potential dataflow and control flow relationships among instructions so that compiler optimizations can respect these relationships to transform code correctly. Since many of these relationships rarely or never occur, speculative optimizations assert they do not exist while optimizing the code. To preserve correctness, speculative optimizations add validation checks to activate recovery code when these assertions prove untrue. This approach results in many missed opportunities because program analysis and thus other optimizations remain unaware of the full impact of these dynamically enforced speculative assertions. To address this problem, this paper presents SCAF, a Speculation-aware Collaborative dependence Analysis Framework. SCAF learns of available speculative assertions via profiling, computes their full impact on memory dependence analysis, and makes this resulting information available for all code optimizations. SCAF is modular (adding new analysis modules is easy) and collaborative (modules cooperate to produce a result more precise than the confluence of all individual results). Relative to the best prior speculation-aware dependence analysis technique, by computing the full impact of speculation on memory dependence analysis, SCAF dramatically reduces the need for expensive-to-validate memory speculation in the hot loops of all 16 evaluated C/C++ SPEC benchmarks.
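The collaborative, most-precise-answer-wins combination can be sketched as follows. The module interface and names are invented for illustration and are not SCAF's API:

```python
# Hedged sketch of collaborative dependence analysis in the spirit of SCAF
# (module names and results are made up). Each module answers a dependence
# query with "no" or "may"; the framework keeps the most precise answer, and
# an answer backed by speculation carries the checks needed to validate it.

def query(modules, src, dst):
    """Combine module answers: any 'no' beats 'may'; among 'no' answers,
    prefer the one with the fewest validation checks."""
    best = ("may", [])
    for analyze in modules:
        answer, checks = analyze(src, dst)
        if answer == "no":
            if best[0] == "may" or len(checks) < len(best[1]):
                best = (answer, checks)
    return best

# Module 1: static alias analysis, proves independence for disjoint arrays.
static = lambda s, d: ("no", []) if s[0] != d[0] else ("may", [])
# Module 2: profile-backed speculation, asserts independence but needs a check.
speculative = lambda s, d: ("no", [f"validate {s} !alias {d}"])

modules = [static, speculative]
print(query(modules, ("A", 1), ("B", 2)))  # ('no', []): proven, free
print(query(modules, ("A", 1), ("A", 2)))  # speculative 'no' with one check
```

This mirrors the abstract's point that cooperation yields a result more precise than any single module, while tracking the validation cost speculation introduces.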
Perspective: A Sensible Approach to Speculative Automatic Parallelization
Ziyang Xu
Greg Chan
Simone Campanoni
David I. August
Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS) (2020)
The promise of automatic parallelization, freeing programmers from the error-prone and time-consuming process of making efficient use of parallel processing resources, remains unrealized. For decades, the imprecision of memory analysis limited the applicability of non-speculative automatic parallelization. The introduction of speculative automatic parallelization overcame these applicability limitations, but, even in the case of no misspeculation, these speculative techniques exhibit high communication and bookkeeping costs for validation and commit. This paper presents Perspective, a speculative-DOALL parallelization framework that maintains the applicability of speculative techniques while approaching the efficiency of non-speculative ones. Unlike current approaches, which apply speculative techniques as an afterthought to overcome the imprecision of memory analysis, Perspective combines a novel speculation-aware memory analyzer, new efficient speculative privatization methods, and a planning phase to select a minimal-cost set of parallelization-enabling transforms. By reducing speculative parallelization overheads in ways not possible with prior parallelization systems, Perspective obtains higher overall program speedup (23.0x for 12 general-purpose C/C++ programs running on a 28-core shared-memory commodity machine) than Privateer (11.5x), the prior automatic DOALL parallelization system with the highest applicability.
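The DOALL pattern Perspective targets is simple to sketch once no cross-iteration dependences remain (proven or speculated); this generic example is not Perspective's implementation:

```python
# Minimal illustration of the DOALL execution pattern: when iterations are
# independent, each may run on any worker with no communication until the join.

from concurrent.futures import ThreadPoolExecutor

def doall(body, n, workers=4):
    """Run body(i) for i in range(n), iterations fully independent."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(body, range(n)))

# Each iteration touches only its own output slot: a DOALL-able loop.
out = doall(lambda i: i * i, 8)
print(out)  # [0, 1, 4, 9, 16, 25, 36, 49]
```

The hard part, which the paper addresses, is proving or cheaply speculating that the loop really has this shape; the execution pattern itself is this simple.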
Architectural Support for Containment-based Security
Hansen Zhang
Soumyadeep Ghosh
Jordan Fix
Stephen R. Beard
Nayana P. Nagendra
Taewook Oh
David I. August
Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS) (2019)
Software security techniques rely on correct execution by the hardware. Securing hardware components has been challenging due to their complexity and the proportionate attack surface they present during their design, manufacture, deployment, and operation. Recognizing that external communication represents one of the greatest threats to a system's security, this paper introduces the TrustGuard containment architecture. TrustGuard contains malicious and erroneous behavior using a relatively simple and pluggable gatekeeping hardware component called the Sentry. The Sentry bridges a physical gap between the untrusted system and its external interfaces. TrustGuard allows only communication that results from the correct execution of trusted software, thereby preventing the ill effects of actions by malicious hardware or software from leaving the system. The simplicity and pluggability of the Sentry, which is implemented in less than half the lines of code of a simple in-order processor, enables additional measures to secure this root of trust, including formal verification, supervised manufacture, and supply chain diversification with less than a 15% impact on performance.
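The containment idea can be modeled crudely in software. Here the Sentry's check is simulated by re-executing a pure trusted function, which stands in for TrustGuard's actual hardware validation mechanism; all names are invented:

```python
# Conceptual toy model of TrustGuard-style containment (purely illustrative,
# not the paper's hardware design): a Sentry sits between the untrusted
# system and the external interface and forwards only output it can check
# against correct execution of the trusted software.

def make_sentry(trusted_fn):
    def send(inputs, claimed_output):
        # Allow the message out only if it matches correct execution.
        if trusted_fn(inputs) == claimed_output:
            return ("sent", claimed_output)
        return ("blocked", None)   # contain erroneous or malicious output
    return send

sentry = make_sentry(trusted_fn=sum)
print(sentry([1, 2, 3], 6))   # ('sent', 6)
print(sentry([1, 2, 3], 7))   # ('blocked', None): corrupted result contained
```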
Hardware MultiThreaded Transactions
Jordan Fix
Nayana P. Nagendra
Hansen Zhang
Sophie Qiu
David I. August
Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS) (2018)
Speculation with transactional memory systems helps programmers and compilers produce profitable thread-level parallel programs. Prior work shows that supporting transactions that can span multiple threads, rather than requiring transactions be contained within a single thread, enables new types of speculative parallelization techniques for both programmers and parallelizing compilers. Unfortunately, software support for multi-threaded transactions (MTXs) comes with significant additional inter-thread communication overhead for speculation validation. This overhead can make otherwise good parallelization unprofitable for programs with sizeable read and write sets. Some programs using these prior software MTXs overcame this problem through significant efforts by expert programmers to minimize these sets and optimize communication, a level of tuning that compiler technology has been unable to match. Instead, this paper makes speculative parallelization less laborious and more feasible through low-overhead speculation validation, presenting the first complete design, implementation, and evaluation of hardware MTXs. Even with maximal speculation validation of every load and store inside transactions of tens to hundreds of millions of instructions, profitable parallelization of complex programs can be achieved. Across 8 benchmarks, this system achieves a geomean speedup of 99% over sequential execution on a multicore machine with 4 cores.
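The read/write-set validation at the heart of transactional speculation can be sketched as follows. This software toy models the concept only and is not the paper's hardware design:

```python
# Illustrative model of multi-threaded transaction (MTX) validation, not the
# paper's hardware implementation. Several threads contribute to one
# transaction's read and write sets; before commit, the transaction's reads
# are validated against writes committed since it began.

class MTX:
    def __init__(self):
        self.reads, self.writes = set(), set()  # union over member threads

    def read(self, addr):
        self.reads.add(addr)

    def write(self, addr):
        self.writes.add(addr)

    def try_commit(self, committed_writes):
        # Misspeculation: someone committed a write to a location we read.
        if self.reads & committed_writes:
            return False                        # abort and re-execute
        committed_writes |= self.writes         # publish our writes
        return True

tx = MTX()
tx.read(0x10); tx.write(0x20)   # e.g. thread 1's accesses
tx.read(0x30)                   # e.g. thread 2's, same transaction
print(tx.try_commit(set()))     # True: no conflicting commits
print(tx.try_commit({0x10}))    # False: a read location was overwritten
```

The paper's contribution is doing this bookkeeping in hardware, so validation does not cost the inter-thread communication that made software MTXs expensive.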