Deniz Altınbüken
I’m a Research Engineer at Google DeepMind. My work is focused on advancing the state-of-the-art in systems using AI and ML.
Prior to joining Google, I received a Ph.D. in Computer Science from Cornell University. During my Ph.D., I focused on building distributed systems that rest on strong foundations in distributed systems theory. My primary research interests are machine learning for systems, distributed systems, database systems, and operating systems.
Authored Publications
Snowcat: Efficient Kernel Concurrency Testing using a Learned Coverage Predictor
Sishuai Gong
Dinglan Peng
Pedro Fonseca
Symposium on Operating Systems Principles (SOSP) (2023)
Random-based approaches and heuristics are commonly used in kernel concurrency testing due to the massive scale of modern kernels and the corresponding interleaving space. The lack of accurate and scalable approaches to analyzing concurrent kernel executions forces existing testing approaches to rely heavily on expensive dynamic executions to measure the effectiveness of a new test. Unfortunately, the high cost incurred by dynamic executions limits the breadth of the exploration and puts latency pressure on finding effective concurrent test inputs and schedules, hindering the overall testing effectiveness.

This paper proposes Snowcat, a kernel concurrency testing framework that generates effective test inputs and schedules using a learned kernel block-coverage predictor. Using a graph neural network, the coverage predictor takes a concurrent test input and scheduling hints and outputs a prediction of whether certain important code blocks will be executed. Using this predictor, Snowcat can skip concurrent tests that are likely to be fruitless and prioritize the promising ones for actual dynamic execution.

After testing the Linux kernel for over a week, Snowcat finds ∼17% more potential data races by prioritizing tests of more fruitful schedules than existing work would have chosen. Snowcat can also find effective test inputs that expose new concurrency bugs with higher probability (1.4×∼2.6×), or reproduce known bugs more quickly (15×), than state-of-the-art testing tools. More importantly, Snowcat is shown to be more efficient at reaching a desirable level of race coverage in the continuous setting, as the Linux kernel evolves from version to version. In total, Snowcat discovered 17 new concurrency bugs in Linux kernel 6.1, of which 13 are confirmed and 6 are fixed.
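The skip-or-prioritize loop the abstract describes can be sketched as follows. This is a minimal illustration, not the paper's system: `predict_block_coverage` is a hypothetical stand-in for the trained graph neural network, and the scoring heuristic inside it is invented purely so the sketch runs.

```python
# Sketch of predictor-guided concurrency test prioritization in the spirit
# of Snowcat. The real predictor is a graph neural network over a concurrent
# test input and scheduling hints; this stand-in is purely illustrative.

def predict_block_coverage(test_input, schedule_hint):
    """Hypothetical model: estimated probability that the important
    code blocks are reached under this input/schedule pair."""
    # Invented stand-in heuristic: longer scheduling hints score higher.
    return min(1.0, 0.1 * len(schedule_hint))

def prioritize_tests(candidates, threshold=0.5):
    """Skip likely-fruitless (test, schedule) pairs; rank the rest so the
    dynamic-execution budget is spent on the most promising ones first."""
    promising = [(t, s) for t, s in candidates
                 if predict_block_coverage(t, s) >= threshold]
    promising.sort(key=lambda ts: -predict_block_coverage(*ts))
    return promising
```

Only the pairs that survive the threshold are handed to (expensive) dynamic execution, which is the cost saving the abstract describes.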
HALP: Heuristic Aided Learned Preference Eviction Policy for YouTube Content Delivery Network
Zhenyu Song
Kevin Chen
Nikhil Sarda
Eugene Brevdo
Jimmy Coleman
Xiao Ju
Pawel Jurczyk
Richard Schooler
20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23), USENIX Association, Boston, MA (2023), pp. 1149-1163
Video streaming services are among the largest web applications in production, and a large source of downstream internet traffic. A large-scale video streaming service at Google, YouTube, leverages a Content Delivery Network (CDN) to serve its users. A key consideration in providing a seamless service is cache efficiency. In this work, we demonstrate machine learning techniques to improve the efficiency of YouTube's CDN DRAM cache. While many recently proposed learning-based caching algorithms show promising results, we identify and address three challenges blocking deployment of such techniques in a large-scale production environment: computation overhead for learning, robust byte miss ratio improvement, and measuring impact under production noise. We propose a novel caching algorithm, HALP, which achieves low CPU overhead and robust byte miss ratio improvement by augmenting a heuristic policy with machine learning. We also propose a production measurement method, impact distribution analysis, that can accurately measure the impact distribution of a new caching algorithm deployment in a noisy production environment.
HALP has been running in YouTube CDN production as a DRAM-level eviction algorithm since early 2022 and has reliably reduced the byte miss ratio during peak by an average of 9.1% while expending a modest CPU overhead of 1.8%.
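The idea of augmenting a heuristic policy with machine learning can be sketched as below. This is not the production HALP algorithm: here an LRU heuristic proposes a few eviction candidates and a learned preference function picks the final victim, and `learned_preference` is a hypothetical stand-in for a trained model.

```python
from collections import OrderedDict

# Sketch of a heuristic-aided learned eviction policy in the spirit of HALP.
# The heuristic (LRU) narrows eviction down to a few candidates; a learned
# preference model (hypothetical stand-in below) chooses among them, keeping
# the model's CPU cost low because it only scores a handful of items.

def learned_preference(meta):
    """Hypothetical model score: lower means more evictable."""
    return meta["hits"] / max(meta["size"], 1)

class HeuristicAidedCache:
    def __init__(self, capacity, candidates=3):
        self.capacity = capacity
        self.k = candidates
        self.store = OrderedDict()  # key -> metadata, least recent first

    def access(self, key, size=1):
        """Returns True on a hit, False on a miss (with insertion)."""
        if key in self.store:
            self.store[key]["hits"] += 1
            self.store.move_to_end(key)  # LRU update
            return True
        while len(self.store) >= self.capacity:
            self._evict()
        self.store[key] = {"hits": 1, "size": size}
        return False

    def _evict(self):
        # Heuristic step: LRU proposes the k least-recently-used keys.
        cands = list(self.store)[: self.k]
        # Learned step: the preference model chooses the final victim.
        victim = min(cands, key=lambda c: learned_preference(self.store[c]))
        del self.store[victim]
```

Scoring only a small heuristic-chosen candidate set, rather than the whole cache, is one way to keep the computation overhead the abstract highlights as a deployment blocker.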
Snowboard: Finding Kernel Concurrency Bugs through Systematic Inter-thread Communication Analysis
Sishuai Gong
Pedro Fonseca
Proceedings of the 28th ACM Symposium on Operating Systems Principles (2021) (to appear)
Kernel concurrency bugs are challenging to find because they depend on very specific thread interleavings and test inputs. While separately exploring kernel thread interleavings or test inputs has been closely examined, jointly exploring interleavings and test inputs has received little attention, in part due to the resulting vast search space. Using precious, limited testing resources to explore this search space and execute just the right concurrent tests in the proper order is critical.
This paper proposes Snowboard, a testing framework that generates and executes concurrent tests by intelligently exploring thread interleavings and test inputs jointly. The design of Snowboard is based on a concept called potential memory communication (PMC), a guess about pairs of tests that, when executed concurrently, are likely to perform memory accesses to shared addresses, which in turn may trigger concurrency bugs. To identify PMCs, Snowboard runs tests sequentially from a fixed initial kernel state, collecting their memory accesses. It then pairs up tests that write and read the same region into candidate concurrent tests. It executes those tests using the associated PMC as a scheduling hint to focus interleaving search only on those schedules that directly affect the relevant memory accesses. By clustering candidate tests on various features of their PMCs, Snowboard avoids testing similar behaviors, which would be inefficient. Finally, by executing tests from small clusters first, it prioritizes uncommon suspicious behaviors that may have received less scrutiny.
Snowboard discovered 14 new concurrency bugs in Linux kernels 5.3.10 and 5.12-rc3, of which 12 have been confirmed by developers. Six of these bugs cause kernel panics and filesystem errors, and at least two have existed in the kernel for many years, showing that this approach can uncover hard-to-find, critical bugs. Furthermore, we show that covering as many distinct pairs of uncommon read/write instructions as possible is the test-prioritization strategy with the highest bug yield for a given test-time budget.
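The PMC-identification step, pairing a test that writes an address with a test that reads it, can be sketched as follows. The trace format is invented for illustration; the real system collects memory accesses from sequential kernel executions.

```python
from collections import defaultdict

# Sketch of potential memory communication (PMC) identification in the
# spirit of Snowboard: profile tests sequentially, then pair up a test
# that writes an address with a test that reads the same address. The
# trace dictionary format here is a hypothetical simplification.

def find_pmcs(traces):
    """traces: {test_name: {"reads": set_of_addrs, "writes": set_of_addrs}}
    Returns {(writer, reader): shared_addresses} candidate pairs."""
    writers = defaultdict(set)  # addr -> tests that write it
    readers = defaultdict(set)  # addr -> tests that read it
    for test, t in traces.items():
        for addr in t["writes"]:
            writers[addr].add(test)
        for addr in t["reads"]:
            readers[addr].add(test)
    pmcs = defaultdict(set)
    for addr, ws in writers.items():
        for w in ws:
            for r in readers.get(addr, ()):
                if r != w:  # a test paired with itself is not a candidate
                    pmcs[(w, r)].add(addr)
    return dict(pmcs)
```

Each resulting pair, with its shared addresses, would then serve as the scheduling hint that focuses the interleaving search on the relevant memory accesses.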
Learned Indexes for a Google-scale Disk-based Database
Hussam Abu-Libdeh
Alex Beutel
Lyric Pankaj Doshi
Tim Klas Kraska
Chris Olston
(2020)
There is great excitement about learned index structures, but understandable skepticism about the practicality of a new method uprooting decades of research on B-Trees. In this paper, we work to remove some of that uncertainty by demonstrating how a learned index can be integrated into a distributed, disk-based database system: Google’s Bigtable. We detail several design decisions we made to integrate learned indexes in Bigtable. Our results show that integrating learned indexes significantly improves the end-to-end read latency and throughput for Bigtable.
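The core learned-index idea, replacing a B-Tree's search with a model that predicts a key's position plus a bounded local correction, can be sketched minimally as below. This is only an illustration of the concept with a single linear model; Bigtable's disk-based integration is far more involved.

```python
import bisect

# Minimal sketch of a learned index over a sorted in-memory key array:
# a least-squares linear fit predicts a key's position, and the model's
# maximum training error bounds a small local search that corrects it.

class LinearLearnedIndex:
    def __init__(self, keys):
        self.keys = keys  # must be sorted
        n = len(keys)
        mean_k = sum(keys) / n
        mean_p = (n - 1) / 2
        var = sum((k - mean_k) ** 2 for k in keys)
        self.slope = (sum((k - mean_k) * (i - mean_p)
                          for i, k in enumerate(keys)) / var) if var else 0.0
        self.intercept = mean_p - self.slope * mean_k
        # Worst-case prediction error bounds the correction window.
        self.err = max(abs(self._predict(k) - i) for i, k in enumerate(keys))

    def _predict(self, key):
        return round(self.slope * key + self.intercept)

    def lookup(self, key):
        """Return the key's position, or None if absent."""
        guess = self._predict(key)
        lo = max(0, guess - self.err)
        hi = min(len(self.keys), guess + self.err + 1)
        i = bisect.bisect_left(self.keys, key, lo, hi)
        return i if i < len(self.keys) and self.keys[i] == key else None
```

The appeal over a B-Tree is that the "index" is two floats plus an error bound rather than a tree of nodes, trading traversal for a model prediction and a search window whose width is the model's worst-case error.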