Keun Soo YIM

K. S. Yim is an experimental computer scientist working as a software engineer (tech lead manager) at Google. His current research interests include reliability, quality, security, and productivity techniques for software development (e.g., video-on-demand, IoT client, web, big data, and machine learning applications). KS holds 30+ United States patents on intelligent management of mobile and cloud systems and has published 18+ technical papers in top-ranked journals and conferences. He obtained his Ph.D. degree in computer science from the University of Illinois at Urbana-Champaign. Dr. Yim has served as co-chair of the industry track of three international symposia (in software reliability and dependable computing) and has also served on multiple program committees of international conferences and workshops (in fault-tolerant computing, software engineering, computer systems, and parallel and distributed computing). Dr. Yim is a senior member of IEEE.
Authored Publications
Google Publications
Other Publications
    Task-oriented queries (e.g., one-shot queries to play videos, order food, or call a taxi) are crucial for assessing the quality of virtual assistants, chatbots, and other large language model (LLM)-based services. However, a standard benchmark for task-oriented queries is not yet available, as existing benchmarks in the relevant NLP (Natural Language Processing) fields have primarily focused on task-oriented dialogues. Thus, we present a new methodology for efficiently generating the Task-oriented Queries Benchmark (ToQB) using existing task-oriented dialogue datasets and an LLM service. Our methodology involves formulating the underlying NLP task to summarize the original intent of a speaker in each dialogue, detailing the key steps to perform the devised NLP task using an LLM service, and outlining a framework for automating a major part of the benchmark generation process. Through a case study encompassing three domains (i.e., two single-task domains and one multi-task domain), we demonstrate how to customize the LLM prompts (e.g., omitting system utterances or speaker labels) for those three domains and characterize the generated task-oriented queries. The generated ToQB dataset is made available to the public. We further discuss new domains that can be added to ToQB by community contributors and its practical applications.
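The intent-summarization step described above can be sketched as prompt construction. This is a minimal illustration, not the paper's actual prompts; `build_intent_summary_prompt` and its flags are hypothetical names mirroring the per-domain customizations the abstract mentions (omitting system utterances or speaker labels):

```python
def build_intent_summary_prompt(dialogue, include_system_utterances=False,
                                include_speaker_labels=False):
    """Build an LLM prompt that summarizes a speaker's original intent in a
    task-oriented dialogue as a single one-shot query.

    `dialogue` is a list of (speaker, utterance) pairs. The two flags mirror
    the per-domain prompt customizations mentioned in the abstract."""
    lines = []
    for speaker, utterance in dialogue:
        if speaker == "SYSTEM" and not include_system_utterances:
            continue  # some domains summarize better without system turns
        lines.append(f"{speaker}: {utterance}" if include_speaker_labels
                     else utterance)
    return ("Summarize the user's intent in the following dialogue as one "
            "short task-oriented query:\n\n" + "\n".join(lines))

dialogue = [
    ("USER", "I need a taxi to the airport."),
    ("SYSTEM", "What time would you like to leave?"),
    ("USER", "Around 7 am tomorrow, please."),
]
prompt = build_intent_summary_prompt(dialogue)
```

The resulting prompt would then be sent to an LLM service; only the prompt assembly is modeled here.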
    This chapter explores the possibility of building a unified assessment methodology for software reliability and security. The fault injection methodology originally designed for reliability assessment is extended to quantify and characterize the security defense aspect of native applications. Native application here refers to system software written in the C/C++ programming language. Specifically, software fault injection is used to measure the portion of injected software faults caught by the built-in error detection mechanisms of a target program (e.g., the detection coverage of assertions). To automatically activate as many injected faults as possible, a gray box fuzzing technique is used. Using dynamic analyzers during fuzzing further helps us catch the critical error propagation paths of injected (but undetected) faults and identify code fragments as targets for security hardening. Because conducting software fault injection experiments for fuzzing is an expensive process, a novel, locality-based fault selection algorithm is presented. The presented algorithm increases the fuzzing failure ratios by 3–19 times, accelerating the experiments. The case studies use all the above experimental techniques to compare the effectiveness of fuzzing and testing, and consequently assess the security defense of native benchmark programs.
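The locality-based fault selection idea can be sketched as follows. This is a toy illustration under the assumption that faults which produce failures cluster near each other in the code; `select_faults_by_locality` and the site representation (abstract line numbers) are hypothetical, not the chapter's algorithm:

```python
import random

def select_faults_by_locality(candidate_sites, failing_sites, k, window=8,
                              seed=0):
    """Toy locality-based fault selection: candidate injection sites close
    to previously failure-producing sites are tried first, on the
    assumption that failure-inducing faults cluster locally in the code."""
    rng = random.Random(seed)
    near = [s for s in candidate_sites
            if any(abs(s - f) <= window for f in failing_sites)]
    far = [s for s in candidate_sites if s not in near]
    rng.shuffle(near)  # randomize within each priority class
    rng.shuffle(far)
    return (near + far)[:k]  # near sites are exhausted before far ones

picked = select_faults_by_locality(range(100), failing_sites=[10, 50], k=5)
```

Prioritizing near sites is what raises the failure ratio per experiment in this model; the real algorithm's selection criteria are more involved.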
    TREBLE: Fast Software Updates by Creating an Equilibrium in an Active Software Ecosystem of Globally Distributed Stakeholders
    Iliyan Batanov Malchev
    Andrew Hsieh
    Dave Burke
    ACM Transactions on Embedded Computing Systems, 18(5s)(2019), 23 pages
    This paper presents our experience with Treble, a two-year initiative to build the modular base in Android, a Java-based mobile platform running on the Linux kernel. Our TREBLE architecture splits the hardware-independent core framework written in Java from the hardware-dependent vendor implementations (e.g., user-space device drivers, vendor native libraries, and kernel written in C/C++). Cross-layer communications between them are done via versioned, stable inter-process communication interfaces whose backward compatibility is tested using two API compliance suites. Based on this architecture, we repackage the key Android software components that suffered from crucial post-launch security bugs as separate images. This not only enables separate ownership but also independent updates of each image by interested ecosystem entities. We discuss our experience of delivering TREBLE architectural changes to silicon vendors and device makers using a yearly release model. Our experiments and industry rollouts support our hypothesis that giving more freedom to all ecosystem entities and creating an equilibrium are transformations necessary to further scale the world's largest open ecosystem, with over two billion active devices.
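The backward-compatibility rule for the versioned interfaces can be sketched in a few lines. This is a simplified model, assuming (major, minor) versions where minor revisions must remain backward compatible; the function name is hypothetical and the real vendor-interface compatibility rules are richer:

```python
def vendor_interface_compatible(required, provided):
    """Check a (major, minor) versioned interface: compatible when the
    majors match and the provided minor is at least the required one,
    i.e., minor revisions must stay backward compatible."""
    return provided[0] == required[0] and provided[1] >= required[1]
```

Under this model, a framework requiring interface version (1, 2) runs on a vendor implementation exposing (1, 3), but not on (2, 0) or (1, 1).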
    A Taste of Android Oreo (v8.0) Device Manufacturer
    Iliyan Batanov Malchev
    Dave Burke
    ACM Symposium on Operating Systems Principles (SOSP) - Tutorial(2017)
    In 2017, over two billion Android devices developed by more than a thousand device manufacturers (DMs) around the world are actively in use. Historically, silicon vendors (SVs), DMs, and telecom carriers extended the Android Open Source Project (AOSP) platform source code and used the customized code in final production devices. Such forking, however, makes it hard to accept upstream patches (e.g., security fixes). In order to reduce such software update costs, starting from Android v8.0, the new Vendor Test Suite (VTS) splits the hardware-independent framework from the hardware-dependent vendor implementation by using versioned, stable APIs (namely, the vendor interface). Android v8.0 thus opens the possibility of a fast upgrade of the Android framework as long as the underlying vendor implementation passes VTS. This tutorial teaches how to develop, test, and certify a compatible Android vendor interface implementation running below the framework. We use an Android Virtual Device (AVD) emulating an Android smartphone to implement a user-space device driver which uses formalized interfaces and RPCs, develop VTS tests for that component, execute the extended tests, and certify the extended vendor implementation.
    Evaluation Metrics of Service-Level Reliability Monitoring Rules of a Big Data Service
    In Proceedings of the IEEE International Symposium on Software Reliability Engineering (ISSRE)(2016), pp. 376-387
    This paper presents new metrics to evaluate the reliability monitoring rules of a large-scale big data service. Our target service uses manually tuned, service-level reliability monitoring rules. Using the measurement data, we identify two key technical challenges in operating our target monitoring system. In order to improve the operational efficiency, we characterize how those rules were manually tuned by the domain experts. The characterization results provide useful information to operators who are expected to regularly tune such rules. Using the actual production failure data, we evaluate the same monitoring rules by using both standard metrics and the presented metrics. Our evaluation results show the strengths and weaknesses of each metric and show that the presented metrics can further help operators recognize when and which rules need to be re-tuned.
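The standard metrics mentioned above can be illustrated with a small sketch: treating time as discrete windows, a rule's alerts are scored against known production failures. This is a generic precision/recall illustration, not the paper's new metrics; `evaluate_rule` is a hypothetical name:

```python
def evaluate_rule(alert_windows, failure_windows):
    """Standard precision/recall of a monitoring rule over discrete time
    windows: an alert is a true positive when a production failure
    occurred in the same window."""
    alerts, failures = set(alert_windows), set(failure_windows)
    tp = len(alerts & failures)
    precision = tp / len(alerts) if alerts else 0.0
    recall = tp / len(failures) if failures else 0.0
    return precision, recall

# The rule alerted in windows 1, 4, 7; real failures fell in 4, 7, 9.
p, r = evaluate_rule(alert_windows=[1, 4, 7], failure_windows=[4, 7, 9])
```

The paper's argument is that such standard metrics alone do not tell operators *when* a rule has drifted and needs re-tuning, which its presented metrics address.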
    The Rowhammer Attack Injection Methodology
    In Proceedings of the IEEE Symposium on Reliable Distributed Systems (SRDS)(2016), pp. 1-10
    This paper presents a systematic methodology to identify and validate security attacks that exploit user-influenceable hardware faults (i.e., rowhammer errors). We break down rowhammer attack procedures into nine generalized steps, where some steps are designed to increase the attack success probabilities. Our framework can perform those nine operations (e.g., pressuring system memory and spraying landing pages) as well as inject rowhammer errors, which are modeled as ≥3-bit errors. When one of the injected errors is activated, it can cause control or data flow divergences which can then be caught by a prepared landing page and thus lead to a successful attack. Our experiments conducted against a guest operating system of a typical cloud hypervisor identified multiple reproducible targets for privilege escalation, shell injection, memory and disk corruption, and advanced denial-of-service attacks. Because the presented rowhammer attack injection (RAI) methodology uses error injection and thus statistical sampling, RAI can quantitatively evaluate the modeled rowhammer attack success probabilities for any given target software state.
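The ≥3-bit error model above can be sketched as a word-level bit-flip injector. This is a minimal model for illustration only; `inject_rowhammer_error` is a hypothetical name, and the real methodology injects into live system memory rather than a Python integer:

```python
import random

def inject_rowhammer_error(word, width=64, n_bits=3, seed=None):
    """Model a rowhammer error as an n-bit (>=3, per the abstract's error
    model) flip at distinct positions within one memory word."""
    rng = random.Random(seed)
    for bit in rng.sample(range(width), n_bits):  # distinct bit positions
        word ^= 1 << bit
    return word

corrupted = inject_rowhammer_error(0, seed=42)
```

Seeding makes an injection campaign reproducible, which is what lets the statistical-sampling step estimate attack success probabilities.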
    Characterization of Impact of Transient Faults and Detection of Data Corruption Errors in Large-Scale N-Body Programs Using Graphics Processing Units
    IEEE International Parallel and Distributed Processing Symposium (IPDPS)(2014), pp. 458-467
    In N-body programs, trajectories of simulated particles have chaotic patterns if errors are in the initial conditions or occur during some computation steps. It was believed that the global properties (e.g., total energy) of simulated particles are unlikely to be affected by a small number of such errors. In this paper, we present a quantitative analysis of the impact of transient faults in GPU devices on a global property of simulated particles. We experimentally show that a single-bit error in non-control data can change the final total energy of a large-scale N-body program with ~2.1% probability. We also find that the corrupted total energy values have certain biases (e.g., the values do not follow a normal distribution), which can be used to reduce the expected number of re-executions. In this paper, we also present a data error detection technique for N-body programs by utilizing two types of properties that hold in simulated physical models. The presented technique and an existing redundancy-based technique together cover many data errors (e.g., >97.5%) with a small performance overhead (e.g., 2.3%).
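The idea of detecting data corruption via a conserved physical property can be sketched as an energy-drift check. This is a simplified one-dimensional illustration, assuming kinetic energy only and a hypothetical tolerance; the paper's detectors use richer physical-model properties:

```python
def total_energy(masses, velocities):
    """Kinetic part of the total energy of simulated particles; a real
    N-body code would add the pairwise potential term."""
    return sum(0.5 * m * v * v for m, v in zip(masses, velocities))

def energy_drift_detected(e_initial, e_current, tolerance=1e-3):
    """Flag a likely data corruption error when a conserved global
    property (total energy) drifts beyond the numerically expected
    relative tolerance."""
    return abs(e_current - e_initial) > tolerance * abs(e_initial)

masses, velocities = [1.0, 2.0], [3.0, 1.0]
e0 = total_energy(masses, velocities)  # 0.5*1*9 + 0.5*2*1 = 5.5
```

Small drift within the tolerance is treated as normal integration error; only larger deviations trigger a re-execution.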
    Norming to Performing: Failure Analysis and Deployment Automation of Big Data Software Developed by Highly Iterative Models
    IEEE International Symposium on Software Reliability Engineering(2014), pp. 144-155
    We observe many interesting failure characteristics from Big Data software developed and released using highly iterative development models (e.g., agile). ~16% of failures occur due to faults in software deployments (e.g., packaging and pushing to production). Our analysis shows that many such production outages are at least partially due to human errors rooted in the high frequency and complexity of software deployments. ~51% of the observed human errors (e.g., transcription, education, and communication error types) are avoidable through automation. We thus develop a fault-tolerant automation framework to make it efficient to automate end-to-end software deployment procedures. We apply the framework to two Big Data products. Our case studies show the complexity of the deployment procedures of multi-homed Big Data applications and help us study the effectiveness of the validation and verification techniques for user-provided automation programs. We analyze the production failures of the two products again after the automation. Our experimental data shows how the automation and the associated procedure improvements reduce the deployment faults and overall failure rate, and improve the feature launch velocity. Automation facilitates more formal, procedure-driven software engineering practices which not only reduce the manual work and human-oriented, avoidable production outages but also help engineers better understand the overall software engineering procedures, making them more auditable, predictable, reliable, and efficient. We discuss two novel metrics to evaluate progress in mitigating human errors, and the conditions that indicate when to start such a transition from owner-driven deployment practices.
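The fault-tolerant deployment automation described above can be sketched as a driver that retries each step and rolls back completed steps on an unrecoverable failure. This is a toy model, not the paper's framework; `deploy` and the step tuples are hypothetical:

```python
def deploy(steps, max_retries=1):
    """Toy fault-tolerant deployment driver: each (name, run, undo) step
    is retried on failure, and already-completed steps are rolled back in
    reverse order when a step is ultimately given up on."""
    done = []
    for name, run, undo in steps:
        for attempt in range(max_retries + 1):
            try:
                run()
                done.append((name, undo))
                break
            except Exception:
                if attempt == max_retries:  # exhausted retries: roll back
                    for _, u in reversed(done):
                        u()
                    return False
    return True

log = []
def record(msg):
    return lambda: log.append(msg)

def failing_push():
    raise RuntimeError("push to production failed")

ok = deploy([("package", record("package"), record("undo-package")),
             ("push", failing_push, record("undo-push"))], max_retries=0)
```

Encoding each step with an explicit undo action is what makes the procedure auditable and replayable, in the spirit of the procedure-driven practices the abstract describes.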
    HTAF: Hybrid Testing Automation Framework to Leverage Local and Global Computing Resources
    David Hreczany
    Ravishankar K. Iyer
    Lecture Notes in Computer Science, 6784(2011), pp. 479-494
    In web application development, testing forms an increasingly large portion of software engineering costs due to the growing complexity and short time-to-market of these applications. This paper presents a hybrid testing automation framework (HTAF) that can automate routine work in testing and releasing web software. Using this framework, an individual software engineer can easily describe routine software engineering tasks and schedule them efficiently across both a local machine and global cloud computers. This framework is applied to commercial web software development processes. Our industry practice shows four example cases where the hybrid and decentralized architecture of HTAF is helpful for effectively managing both the hardware resources and the manpower required for testing and releasing web applications.
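The local-versus-cloud scheduling idea can be sketched with a toy policy: short routine tasks run on the engineer's local machine up to its capacity, and everything else goes to shared cloud workers. The threshold, capacity, and `schedule_tasks` name are illustrative assumptions, not HTAF's actual policy:

```python
def schedule_tasks(tasks, local_capacity=2, local_limit_minutes=10):
    """Toy hybrid scheduler: tasks are (name, estimated minutes) pairs.
    Short tasks fill the local machine first; the rest are dispatched to
    cloud workers."""
    local, cloud = [], []
    for name, minutes in sorted(tasks, key=lambda t: t[1]):  # shortest first
        if minutes <= local_limit_minutes and len(local) < local_capacity:
            local.append(name)
        else:
            cloud.append(name)
    return local, cloud

local, cloud = schedule_tasks([("unit-tests", 5), ("lint", 2),
                               ("browser-matrix", 90), ("release-build", 30)])
```

Keeping quick feedback loops local while offloading long matrix runs to the cloud is the resource trade-off the paper's four case studies examine.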
    From Experiment to Design - Fault Characterization and Detection in Parallel Computer Systems Using Computational Accelerators
    Ph.D. Thesis, University of Illinois at Urbana-Champaign(2013)
    This dissertation summarizes experimental validation and co-design studies conducted to optimize the fault detection capabilities and overheads of hybrid computer systems (e.g., using CPUs and Graphics Processing Units, or GPUs), and consequently to improve the scalability of parallel computer systems using computational accelerators. The experimental validation studies were conducted to help us understand the failure characteristics of CPU-GPU hybrid computer systems under various types of hardware faults. The main characterization targets were faults that are difficult to detect and/or recover from, e.g., faults that cause long-latency failures (Ch. 3), faults in dynamically allocated resources (Ch. 4), faults in GPUs (Ch. 5), faults in MPI programs (Ch. 6), and microarchitecture-level faults with specific timing features (Ch. 7). The co-design studies were based on the characterization results. One of the co-designed systems has a set of source-to-source translators that customize and strategically place error detectors in the source code of target GPU programs (Ch. 5). Another co-designed system uses an extension card to learn the normal behavioral and semantic execution patterns of message-passing processes executing on CPUs, and to detect abnormal behaviors of those parallel processes (Ch. 6). The third co-designed system is a co-processor that has a set of new instructions in order to support software-implemented fault detection techniques (Ch. 7). The work described in this dissertation gains more importance because heterogeneous processors have become an essential component of state-of-the-art supercomputers. GPUs were used in three of the five fastest supercomputers that were operating in 2011. Our work included comprehensive fault characterization studies in CPU-GPU hybrid computers.
In CPUs, we monitored the target systems for a long period of time after injecting faults (a temporally comprehensive experiment), and injected faults into various types of program states that included dynamically allocated memory (to be spatially comprehensive). In GPUs, we used fault injection studies to demonstrate the importance of detecting silent data corruption (SDC) errors that are mainly due to the lack of fine-grained protections and the massive use of fault-insensitive data. This dissertation also presents transparent fault tolerance frameworks and techniques that are directly applicable to hybrid computers built using only commercial off-the-shelf hardware components. This dissertation shows that by developing an understanding of the failure characteristics and error propagation paths of target programs, we were able to create fault tolerance frameworks and techniques that can quickly detect and recover from hardware faults with low performance and hardware overheads.
    Pluggable Watchdog: Transparent Failure Detection for MPI Programs
    Zbigniew Kalbarczyk
    Ravishankar K. Iyer
    In Proceedings of the IEEE International Parallel and Distributed Processing Symposium, IEEE(2013), pp. 489-500
    This paper presents a framework and its techniques that can detect various types of runtime errors and failures in MPI programs. The presented framework offloads its detection techniques to an external device (e.g., an extension card). By developing intelligence on the normal behavioral and semantic execution patterns of monitored parallel threads, the presented external error detectors can accurately and quickly detect errors and failures. This architecture allows us to use powerful detectors without directly consuming the computing power of the monitored system. The separation of the hardware of the monitored and monitoring systems offers an extra advantage in terms of system reliability. We prototyped our system on a parallel computer by using an FPGA-based PCI extension card as the monitoring device. We conducted a fault injection experiment to evaluate the presented techniques using eight MPI-based parallel programs. The techniques cover ~98.5% of faults, on average. The average performance overhead is 1.8% for techniques that detect crash and hang failures and 6.6% for techniques that detect SDC failures.
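One piece of the external detection logic, hang detection via heartbeats, can be sketched as follows. This is a toy model under the assumption that monitored ranks emit periodic heartbeats to the monitoring device; the class name and interface are hypothetical, and the paper's detectors also learn behavioral and semantic patterns:

```python
class PluggableWatchdog:
    """Toy external hang detector: monitored MPI ranks send periodic
    heartbeats to the monitoring device; a rank whose last heartbeat is
    older than `timeout` is declared hung. Times are plain numbers so the
    logic is easy to test."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_beat = {}

    def heartbeat(self, rank, now):
        self.last_beat[rank] = now

    def hung_ranks(self, now):
        return sorted(r for r, t in self.last_beat.items()
                      if now - t > self.timeout)

wd = PluggableWatchdog(timeout=3)
wd.heartbeat(0, now=0)
wd.heartbeat(1, now=2)
```

Because this logic runs on the extension card rather than the host, detection costs the monitored system almost nothing, which is the architectural point the abstract makes.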
    A Fault-Tolerant, Programmable Voter for N-Modular Redundancy
    V. Sidea
    Z. Kalbarczyk
    Deming Chen
    Ravishankar K. Iyer
    In Proceedings of the IEEE Aerospace Conference, IEEE(2012)
    This paper presents a fault-tolerant, programmable voter architecture for software-implemented N-tuple modular redundant (NMR) computer systems. Software NMR is a cost-efficient solution for high-performance, mission-critical computer systems because it can be built on top of commercial off-the-shelf (COTS) devices. Due to the large volume and randomness of voting data, a software NMR system requires a programmable voter. Our experiment shows that voting software that executes on a processor has time-of-check-to-time-of-use (TOCTTOU) vulnerabilities and is unable to tolerate long-duration faults. In order to address these two problems, we present a special-purpose voter processor and its embedded software architecture. The processor has a set of new instructions and hardware modules that are used by the software in order to accelerate the voting software execution and address the two identified reliability problems. We have implemented the presented system on an FPGA platform. Our evaluation results show that using the presented system reduces the execution time of error detection codes (commonly used in voting software) by 14% and their code size by 56%. Our fault injection experiments validate that the presented system removes the TOCTTOU vulnerabilities and recovers under both transient and long-duration faults. This is achieved by using 0.7% extra hardware in a baseline processor.
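The core voting operation in any NMR system can be sketched in a few lines. This is a minimal majority voter for illustration; the paper's contribution is the hardened processor that runs such voting software, not this logic itself:

```python
from collections import Counter

def nmr_vote(replica_outputs):
    """Majority voter for N-modular redundancy: returns the most common
    replica output and whether a strict majority agreed on it."""
    value, count = Counter(replica_outputs).most_common(1)[0]
    return value, count > len(replica_outputs) // 2

value, agreed = nmr_vote([7, 7, 9])  # one faulty replica is outvoted
```

A software voter like this is exactly where the TOCTTOU window arises: the voted value can be corrupted between the check and its use, which motivates the special-purpose voter processor.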
    A Codesigned Fault Tolerance System for Heterogeneous Many-Core Processors
    Ravishankar K. Iyer
    IPDPS Workshops(2011), pp. 2053-2056
    Hauberk: Lightweight Silent Data Corruption Error Detector for GPGPU
    Cuong Pham
    Mushfiq Saleheen
    Zbigniew Kalbarczyk
    Ravishankar K. Iyer
    IPDPS(2011), pp. 287-300
    Measurement-based analysis of fault and error sensitivities of dynamic memory
    Zbigniew Kalbarczyk
    Ravishankar K. Iyer
    DSN(2010), pp. 431-436
    Quantitative Analysis of Long-Latency Failures in System Software
    Zbigniew Kalbarczyk
    Ravishankar K. Iyer
    PRDC(2009), pp. 23-30
    A Software Reproduction of Virtual Memory for Deeply Embedded Systems
    Jae Don Lee
    Jungkeun Park
    Jeong-Joon Yoo
    Chaeseok Im
    Yeonseung Ryu
    Lecture Notes in Computer Science(2006), pp. 1000-1009
    Operating System Support for Procedural Abstraction in Embedded Systems
    Jeong-Joon Yoo
    Jae Don Lee
    Jihong Kim
    RTCSA(2006), pp. 378-384
    CATA: A Garbage Collection Scheme for Flash Memory File Systems
    Long-zhe Han
    Yeonseung Ryu
    Lecture Notes in Computer Science, 4159(2006), pp. 103-112
    A Novel Memory Hierarchy for Flash Memory Based Storage Systems
    Journal of Semiconductor Technology and Science, 5(2005), pp. 69-76
    Semiconductor scientists and engineers ideally want non-volatile memory devices that are both fast and cheap. In practice, no single device satisfies this desire because faster devices are expensive and cheaper ones are slow. Therefore, in this paper, we use heterogeneous non-volatile memories and construct an efficient hierarchy for them. First, a small RAM device (e.g., MRAM, FRAM, or PRAM) is used as a write buffer for flash memory devices. Since the buffer is faster and has no erase operation, writes can be completed quickly in the buffer, keeping the write latency short. Also, if a write is requested for data already stored in the buffer, the write is processed directly in the buffer, saving one write operation to flash storage. Second, we use multiple types of flash memories (e.g., SLC and MLC flash memories) in order to reduce the overall storage cost. Specifically, write requests are classified into two types, hot and cold, where hot data is likely to be modified in the near future. Only hot data is stored in the faster SLC flash, while cold data is kept in slower MLC flash or NOR flash. The evaluation results show that the proposed hierarchy is effective at improving the access time of flash memory storage in a cost-effective manner, thanks to the locality in memory accesses.
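The hierarchy described above can be sketched as a write buffer with hot/cold tiering on eviction. This is a toy model: the class name, the write-count hot/cold classifier, and the FIFO-style eviction are illustrative assumptions, and device behavior is abstracted away entirely:

```python
class HybridFlashStore:
    """Toy model of the proposed hierarchy: a small RAM write buffer in
    front of two flash tiers. On eviction, pages written more often than
    `hot_threshold` are treated as hot and go to fast SLC flash; the rest
    go to cheaper MLC flash."""

    def __init__(self, buffer_size=4, hot_threshold=2):
        self.buffer, self.writes = {}, {}
        self.buffer_size, self.hot_threshold = buffer_size, hot_threshold
        self.slc, self.mlc = {}, {}

    def write(self, page, data):
        self.writes[page] = self.writes.get(page, 0) + 1
        self.buffer[page] = data  # absorb the write in fast RAM
        if len(self.buffer) > self.buffer_size:
            self._evict()

    def _evict(self):
        page, data = next(iter(self.buffer.items()))  # oldest-inserted page
        del self.buffer[page]
        tier = self.slc if self.writes[page] > self.hot_threshold else self.mlc
        tier[page] = data  # one flash write per eviction, not per update

store = HybridFlashStore(buffer_size=1)
for d in (1, 2, 3):
    store.write("a", d)   # three updates absorbed by the buffer
store.write("b", 9)       # forces eviction of the hot page "a" to SLC
```

Absorbing the three updates to page "a" in the buffer is what saves the two intermediate flash writes, mirroring the latency argument in the abstract.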
    A fast start-up technique for flash memory based computing systems
    Jihong Kim
    Kern Koh
    SAC(2005), pp. 843-849
    An Energy-Efficient Reliable Transport for Wireless Sensor Networks
    Jihong Kim
    Kern Koh
    Lecture Notes in Computer Science, 3090(2004), pp. 54-64
    An Energy-Efficient Routing and Reporting Scheme to Exploit Data Similarities in Wireless Sensor Networks
    Jihong Kim
    Kern Koh
    Lecture Notes in Computer Science, 3207(2004), pp. 515-527
    Performance Analysis of On-Chip Cache and Main Memory Compression Systems for High-End Parallel Computers
    Jihong Kim
    Kern Koh
    PDPTA(2004), pp. 469-475
    A flash compression layer for SmartMedia card systems
    Hyokyung Bahn
    Kern Koh
    IEEE Trans. Consumer Electronics, 50(2004), pp. 192-197
    NIC-NET: A Host-Independent Network Solution for High-End Network Servers
    Hojung Cha
    Kern Koh
    Lecture Notes in Computer Science, 3320(2004), pp. 401-405
    A Space-Efficient On-Chip Compressed Cache Organization for High Performance Computing
    Jang-Soo Lee
    Jihong Kim
    Shin-Dug Kim
    Kern Koh
    Lecture Notes in Computer Science, 3358(2004), pp. 952-964
    A Compressed Page Management Scheme for NAND-Type Flash Memory
    Kern Koh
    Hyokyung Bahn
    VLSI(2003), pp. 266-271