Hongil Yoon
Authored Publications
Applying deep neural networks to 3D point cloud processing has advanced at a rapid pace in domains where 3D geometry information can greatly boost task performance, such as AR/VR, robotics, and autonomous driving. However, along with the rapid growth in the size of neural network models and 3D point clouds, reducing the entailed computation and memory access overhead is a primary challenge in meeting the strict latency and energy requirements of practical applications. This paper proposes a new technique called spatial point distribution aware pruning, which leverages the sparse nature of 3D point cloud processing. We observe that particular groups of neighborhood voxels in 3D point clouds contribute to actual output features more frequently than others. We propose to selectively prune the less contributing groups of neighborhood voxels first, reducing the computation overhead while minimizing the impact on model accuracy. We applied our technique to three representative Sparse 3D Convolution libraries and showed that it reduces inference latency by 1.48× and energy consumption by 1.61× on an NVIDIA GV100 without any accuracy loss.
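The core idea lends itself to a small illustration. Below is a minimal sketch (not the paper's implementation) of spatial-distribution-aware offset selection for a sparse 3D convolution: it profiles how often each kernel offset finds an occupied neighbor voxel and keeps only the most frequently contributing offsets. The voxel coordinates, the `keep_ratio` parameter, and the function names are illustrative assumptions.

```python
# Minimal sketch (illustrative, not the paper's implementation): prune kernel
# offsets whose neighborhood voxels rarely contribute to output features in a
# sparse 3D convolution.
from collections import Counter
from itertools import product

def contribution_profile(active_voxels, kernel_size=3):
    """Count, for each kernel offset, how often a neighboring voxel is occupied."""
    occupied = set(active_voxels)
    half = kernel_size // 2
    offsets = list(product(range(-half, half + 1), repeat=3))
    counts = Counter()
    for (x, y, z) in active_voxels:
        for (dx, dy, dz) in offsets:
            if (x + dx, y + dy, z + dz) in occupied:
                counts[(dx, dy, dz)] += 1
    return counts

def select_offsets(active_voxels, kernel_size=3, keep_ratio=0.5):
    """Keep only the most frequently contributing offsets; prune the rest."""
    counts = contribution_profile(active_voxels, kernel_size)
    ranked = [off for off, _ in counts.most_common()]
    keep = max(1, int(len(ranked) * keep_ratio))
    return set(ranked[:keep])

# Example: a tiny sparse point cloud given as integer voxel coordinates.
voxels = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (5, 5, 5)]
print(select_offsets(voxels, keep_ratio=0.3))
```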
ASAP: Fast Mobile Application Switch via Adaptive Prepaging
Sam Son
Seung Yul Lee
Jonghyun Bae
Yunho Jin
Jinkyu Jeong
Tae Jun Ham
Jae W. Lee
USENIX Association, pp. 365-380
With ever-increasing demands for memory capacity from mobile applications, along with a steady increase in the number of applications running concurrently, memory capacity is becoming a scarce resource on mobile devices. When memory pressure is high, current mobile OSes often kill application processes that have not recently been used in order to reclaim memory space. This leads to a long delay when the user relaunches the killed application, which degrades the user experience. Even if this mechanism is disabled in favor of a compression-based in-memory swap mechanism, relaunching the application still incurs a substantial latency penalty, as it requires decompressing compressed anonymous pages and issuing a stream of I/O accesses to retrieve file-backed pages into memory. This paper identifies conventional demand paging as the primary source of this inefficiency and proposes ASAP, a mechanism for fast application switch via adaptive prepaging on mobile devices. Specifically, ASAP performs prepaging effectively by combining i) high-precision switch footprint estimators for both file-backed and anonymous pages, and ii) an efficient implementation of the prepaging mechanism that minimizes wasted CPU cycles and disk bandwidth during an application switch. Our evaluation of ASAP using eight real-world applications on a Google Pixel 4 demonstrates that ASAP reduces the switch time by 22.2% on average (and by up to 33.3%) over vanilla Android 10.
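To make the prepaging idea concrete, here is a minimal sketch assuming a simple per-application footprint estimator: it remembers which pages were touched during recent switches and predicts the set to prefetch before the next switch. The class name, page identifiers, and history-depth policy are hypothetical and are not ASAP's actual estimator.

```python
# Minimal sketch (illustrative only, not the ASAP implementation): keep a
# per-application "switch footprint" of pages touched during previous
# switches, and prepage that footprint on the next switch instead of
# faulting pages in one by one.
from collections import defaultdict

class FootprintEstimator:
    def __init__(self, keep_if_seen_in_last=2):
        self.history = defaultdict(list)            # app -> list of page sets
        self.keep_if_seen_in_last = keep_if_seen_in_last

    def record_switch(self, app, touched_pages):
        """Remember which pages were actually used during this switch."""
        self.history[app].append(set(touched_pages))

    def predict(self, app):
        """Predict pages to prepage: any page seen in the last few switches."""
        footprint = set()
        for pages in self.history[app][-self.keep_if_seen_in_last:]:
            footprint |= pages
        return footprint

est = FootprintEstimator()
est.record_switch("camera", {0x10, 0x11, 0x20})
est.record_switch("camera", {0x10, 0x21})
print(sorted(est.predict("camera")))  # pages to prefetch before the switch completes
```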
2 Billion Devices and Counting: An Industry Perspective on the State of Mobile Computer Architecture
Mobile computing has grown drastically over the past decade since the arrival of the smartphone. Despite the rapid pace of these advancements, mobile device benchmarking and evaluation are still in their infancy in both industry and academia. The authors address this issue head-on and present an industrial perspective on the challenges facing mobile architecture research, with the hope of fostering new research to solve many of the pending problems. This paper presents “ten commandments” that focus on raising awareness around mobile workloads, metrics, and experimental methodology. These issues, as perceived from an industry perspective, are real challenges that, if addressed, can elevate the entire mobile ecosystem to the next level.
Filtering Translation Bandwidth with Virtual Caching
Jason Lowe-Power
Gurindar S. Sohi
Architectural Support for Programming Languages and Operating Systems (ASPLOS), ACM, Williamsburg, Virginia, USA (2018)
Heterogeneous computing with GPUs integrated on the same chip as CPUs is ubiquitous, and to increase programmability many of these systems support virtual address accesses from GPU hardware. However, this entails an address translation on every memory access. We observe that future GPUs and workloads place very high bandwidth demands (up to 4 accesses per cycle in some cases) on shared address translation hardware due to frequent private TLB misses. This greatly impacts performance (a 1.47× slowdown over an ideal MMU).
To mitigate this overhead, we propose a software-agnostic, practical GPU virtual cache hierarchy. We use the virtual cache hierarchy as an effective address translation bandwidth filter. We observe that many requests that miss in private TLBs find corresponding valid data in the GPU cache hierarchy. With a GPU virtual cache hierarchy, these TLB misses can be filtered (i.e., they become virtual cache hits), significantly reducing bandwidth demands on the shared address translation hardware. In addition, accelerator-specific attributes of GPUs (e.g., a lower likelihood of synonyms) reduce the design complexity of virtual caches, making a whole virtual cache hierarchy (including a shared L2 cache) practical for GPUs. Our evaluation shows that the entire GPU virtual cache hierarchy effectively filters the high address translation bandwidth, achieving the same performance as an ideal MMU. We also evaluate L1-only virtual cache designs and show that using a whole virtual cache hierarchy obtains additional performance benefits (a 1.31× speedup on average).
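To illustrate the filtering effect, the following is a minimal software sketch (not the paper's hardware design) of a virtually indexed two-level cache: a hit at any level avoids translation entirely, so only misses in the whole virtual hierarchy consume shared translation bandwidth. The cache geometry, eviction policy, and `translate` stub are assumptions made for the example.

```python
# Minimal sketch (illustrative, not the proposed hardware): index the GPU
# cache hierarchy with virtual addresses so a hit skips translation; only
# misses in the whole virtual hierarchy reach shared translation hardware.
class VirtualCacheLevel:
    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = {}                      # virtual line address -> data

    def lookup(self, vline):
        return self.lines.get(vline)

    def fill(self, vline, data):
        if len(self.lines) >= self.capacity: # simplistic eviction
            self.lines.pop(next(iter(self.lines)))
        self.lines[vline] = data

def gpu_load(vaddr, l1, l2, translate, memory, line_bits=6):
    vline = vaddr >> line_bits
    for level in (l1, l2):                   # virtual hit: no translation needed
        data = level.lookup(vline)
        if data is not None:
            return data
    paddr = translate(vaddr)                 # only virtual misses consume translation bandwidth
    data = memory[paddr >> line_bits]
    l2.fill(vline, data)
    l1.fill(vline, data)
    return data

# Toy usage: identity translation and a tiny "memory" keyed by physical line.
l1, l2 = VirtualCacheLevel(4), VirtualCacheLevel(16)
memory = {addr >> 6: f"line-{addr >> 6}" for addr in range(0, 1024, 64)}
print(gpu_load(0x80, l1, l2, translate=lambda va: va, memory=memory))
print(gpu_load(0x80, l1, l2, translate=lambda va: va, memory=memory))  # virtual hit, no translation
```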