Chi-yao Hong

Chi-yao Hong

Chi-yao Hong is a Senior Staff Software Engineer in Google Global Networking (GGN). His team develops the traffic engineering systems for Google's global and regional networks, including B2 and B4. He received a PhD in Computer Science from the University of Illinois at Urbana-Champaign.
Authored Publications
Google Publications
Other Publications
Sort By
  • Title
  • Title, descending
  • Year
  • Year, descending
    Orion: Google’s Software-Defined Networking Control Plane
    Amin Vahdat
    Amr Sabaa
    Arjun Singh
    Henrik Muehe
    Joon Suan Ong
    Karthik Swaminathan Nagaraj
    KondapaNaidu Bollineni
    Lorenzo Vicisano
    Mike Conley
    Min Zhu
    Rich Alimi
    Shawn Chen
    Shidong Zhang
    Waqar Mohsin
    (2021)
    Preview abstract We present Orion, a distributed Software-Defined Networking platform deployed globally in Google’s datacenter (Jupiter) as well as Wide Area (B4) networks. Orion was designed around a modular, micro-service architecture with a central publish-subscribe database to enable a distributed, yet tightly-coupled, software-defined network control system. Orion enables intent-based management and control, is highly scalable and amenable to global control hierarchies. Over the years, Orion has matured with continuously improving performance in convergence (up to 40x faster), throughput (handling up to 1.16 million network updates per second), system scalability (supporting 16x larger networks), and data plane availability (50x, 100x reduction in unavailable time in Jupiter and B4, respectively) while maintaining high development velocity with bi-weekly release cadence. Today, Orion robustly enables all of Google’s Software-Defined Networks defending against failure modes that are both generic to large scale production networks as well as unique to SDN systems. View details
    B4 and After: Managing Hierarchy, Partitioning, and Asymmetry for Availability and Scale in Google's Software-Defined WAN
    Mohammad A. Alfares
    Min Zhu
    Rich Alimi
    Kondapa Naidu Bollineni
    Chandan Bhagat
    Sourabh Jain
    Jay Kaimal
    Jeffrey Liang
    Kirill Mendelev
    Faro Thomas Rabe
    Saikat Ray
    Malveeka Tewari
    Monika Zahn
    Joon Ong
    Amin Vahdat
    SIGCOMM'18(2018)
    Preview abstract Private WANs are increasingly important to the operation of enterprises, telecoms, and cloud providers. For example, B4, Google’s private software-defined WAN, is larger and growing faster than our connectivity to the public Internet. In this paper, we present the five-year evolution of B4. We describe the techniques we employed to incrementally move from offering best-effort content-copy services to carrier-grade availability, while concurrently scaling B4 to accommodate 100x more traffic. Our key challenge is balancing the tension introduced by hierarchy required for scalability, the partitioning required for availability, and the capacity asymmetry inherent to the construction and operation of any large-scale network. We discuss our approach to managing this tension: i) we design a custom hierarchical network topology for both horizontal and vertical software scaling, ii) we manage inherent capacity asymmetry in hierarchical topologies using a novel traffic engineering algorithm without packet encapsulation, and iii) we re-architect switch forwarding rules via two-stage matching/hashing to deal with asymmetric network failures at scale. View details
    Achieving High Utilization with Software-Driven WAN
    Srikanth Kandula
    Ratul Mahajan
    Ming Zhang
    Vijay Gill
    Mohan Nanduri
    Roger Wattenhofer
    SIGCOMM(2013)
    Preview abstract We present SWAN, a system that boosts the utilization of inter-datacenter networks by centrally controlling when and how much traffic each service sends and frequently re-configuring the network's data plane to match current traffic demand. But done simplistically, these re-configurations can also cause severe, transient congestion because different switches may apply updates at different times. We develop a novel technique that leverages a small amount of scratch capacity on links to apply updates in a provably congestion-free manner, without making any assumptions about the order and timing of updates at individual switches. Further, to scale to large networks in the face of limited forwarding table capacity, SWAN greedily selects a small set of entries that can best satisfy current demand. It updates this set without disrupting traffic by leveraging a small amount of scratch capacity in forwarding tables. Experiments using a testbed prototype and data-driven simulations of two production networks show that SWAN carries 60% more traffic than the current practice. View details
    Jellyfish: Networking Data Centers Randomly
    Ankit Singla
    Lucian Popa
    P. Brighten Godfrey
    NSDI(2012)
    Preview abstract Industry experience indicates that the ability to incrementally expand data centers is essential. However, existing high-bandwidth network designs have rigid structure that interferes with incremental expansion. We present Jellyfish, a high-capacity network interconnect which, by adopting a random graph topology, yields itself naturally to incremental expansion. Somewhat surprisingly, Jellyfish is more cost-efficient than a fat-tree, supporting as many as 25% more servers at full capacity using the same equipment at the scale of a few thousand nodes, and this advantage improves with scale. Jellyfish also allows great flexibility in building networks with different degrees of oversubscription. However, Jellyfish’s unstructured design brings new challenges in routing, physical layout, and wiring. We describe approaches to resolve these challenges, and our evaluation suggests that Jellyfish could be deployed in today’s data centers. View details
    Finishing flows quickly with preemptive scheduling
    Matthew Caesar
    P. Brighten Godfrey
    SIGCOMM(2012)
    Preview abstract Today's data centers face extreme challenges in providing low latency. However, fair sharing, a principle commonly adopted in current congestion control protocols, is far from optimal for satisfying latency requirements. We propose Preemptive Distributed Quick (PDQ) flow scheduling, a protocol designed to complete flows quickly and meet flow deadlines. PDQ enables flow preemption to approximate a range of scheduling disciplines. For example, PDQ can emulate a shortest job first algorithm to give priority to the short flows by pausing the contending flows. PDQ borrows ideas from centralized scheduling disciplines and implements them in a fully distributed manner, making it scalable to today's data centers. Further, we develop a multipath version of PDQ to exploit path diversity. View details
    Populated IP addresses: classification and applications
    Fang Yu
    Yinglian Xie
    ACM CCS(2012)
    Preview abstract Populated IP addresses (PIP) -- IP addresses that are associated with a large number of user requests are important for online service providers to efficiently allocate resources and to detect attacks. While some PIPs serve legitimate users, many others are heavily abused by attackers to conduct malicious activities such as scams, phishing, and malware distribution. Unfortunately, commercial proxy lists like Quova have a low coverage of PIP addresses and offer little support for distinguishing good PIPs from abused ones. In this study, we propose PIPMiner, a fully automated method to extract and classify PIPs through analyzing service logs. Our methods combine machine learning and time series analysis to distinguish good PIPs from abused ones with over 99.6% accuracy. When applying the derived PIP list to several applications, we can identify millions of malicious Windows Live accounts right on the day of their sign-ups, and detect millions of malicious Hotmail accounts well before the current detection system captures them. View details
    Tiresias: Online anomaly detection for hierarchical operational network data
    Matthew Caesar
    Nick Duffield
    Jia Wang
    IEEE ICDCS(2012)
    Preview abstract Operational network data, management data such as customer care call logs and equipment system logs, is a very important source of information for network operators to detect problems in their networks. Unfortunately, there is lack of efficient tools to automatically track and detect anomalous events on operational data, causing ISP operators to rely on manual inspection of this data. While anomaly detection has been widely studied in the context of network data, operational data presents several new challenges, including the volatility and sparseness of data, and the need to perform fast detection (complicating application of schemes that require offline processing or large/stable data sets to converge). To address these challenges, we propose Tiresias, an automated approach to locating anomalous events on hierarchical operational data. Tiresias leverages the hierarchical structure of operational data to identify high-impact aggregates (e.g., locations in the network, failure modes) likely to be associated with anomalous events. To accommodate different kinds of operational network data, Tiresias consists of an online detection algorithm with low time and space complexity, while preserving high detection accuracy. We present results from two case studies using operational data collected at a large commercial IP network operated by a Tier-1 ISP: customer care call logs and set-top box crash logs. By comparing with a reference set verified by the ISP's operational group, we validate that Tiresias can achieve >;94% accuracy in locating anomalies. Tiresias also discovered several previously unknown anomalies in the ISP's customer care cases, demonstrating its effectiveness. View details
    Clockscalpel: Understanding root causes of Internet clock synchronization inaccuracy
    Chia-Chi Lin
    Matthew Caesar
    International Conference on Passive and Active Network Measurement (PAM)(2011)
    Preview abstract Synchronizing clocks is an integral part of modern network and security architectures. However, the ability to synchronize clocks in modern networks is not well-understood. In this work, we use testbeds equipped with a high-accuracy GPS receiver to acquire ground truth, to study the accuracy of probe-based synchronization techniques to over 1861 public time servers. We find that existing synchronization protocols provide a median error of 2–5 ms, but suffer from a long-tail. We analyze sources of inaccuracy by decoupling and quantifying different network factors. We found that most inaccuracies stem from asymmetry of propagation delay and queueing delay. We discuss possible schemes to compensate these errors to improve synchronization accuracy. View details
    Approximation algorithms for a link scheduling problem in wireless relay networks with QoS guarantee
    Ai-Chun Pang
    Pi-Cheng Hsiu
    IEEE Transactions on Mobile Computing(2010)
    Preview abstract The emerging wireless relay networks (WRNs) are expected to provide significant improvement on throughput and extension of coverage area for next-generation wireless systems. We study an optimization problem for multihop link scheduling with bandwidth and delay guarantees over WRNs. Our optimization problem is investigated under a more general interference model with a generic objective. The objective can be based on various kinds of performance indexes (eg, throughput, fairness, and capacity), which can be determined by service providers. Through our theoretical analysis, the intractability and inapproximability of the optimization problem are shown. Due to the intractable computational complexity, we present efficient algorithms to provide a reasonable small approximation factor against any optimal solution even for a worst-case input. Furthermore, some experimental results indicate that our algorithms yield near-optimal performance in the average case. View details
    BotGrep: Finding Bots with Structured Graph Analysis
    Shishir Nagaraja
    Prateek Mittal
    Matthew Caesar
    Nikita Borisov
    USENIX Security(2010)
    Preview abstract A key feature that distinguishes modern botnets from earlier counterparts is their increasing use of structured overlay topologies. This lets them carry out sophisticated coordinated activities while being resilient to churn, but it can also be used as a point of detection. In this work, we devise techniques to localize botnet members based on the unique communication patterns arising from their overlay topologies used for command and control. Experimental results on synthetic topologies embedded within Internet traffic traces from an ISP's backbone network indicate that our techniques (i) can localize the majority of bots with low false positive rate, and (ii) are resilient to incomplete visibility arising from partial deployment of monitoring systems and measurement inaccuracies from dynamics of background traffic. View details
    Link scheduling with QoS guarantee for wireless relay networks
    Ai-Chun Pang
    INFOCOM(2009)
    Preview abstract The emerging wireless relay networks (WRNs) are expected to provide significant improvement on throughput and extension of coverage area for next-generation wireless systems. We study an optimization problem for multi-hop link scheduling with bandwidth and delay guarantees over WRNs. Our optimization problem is investigated under a general interference model with a generic objective. The objective can be based on various kinds of performance indexes (eg, throughput, fairness and capacity) determined by service providers. We present efficient algorithms for maximizing the objective function. The experimental results indicate that our presented algorithms yield near-optimal performance. View details
    3-approximation algorithm for joint routing and link scheduling in wireless relay networks
    Ai-Chun Pang
    IEEE Transactions on Wireless Communications(2009)
    Preview abstract In emerging wireless relay networks (WRNs) such as IEEE 802.16 j, efficient resource allocation is becoming a substantial issue for throughput optimization. In this paper, we propose an algorithm for joint routing and link scheduling in WRNs. The developed theoretical analysis indicates that the performance of the proposed algorithm is within a factor of three of that of any optimal algorithm in the worst case. Through simulation experiments, the numerical results show that our algorithm outperforms the previously proposed routing and link-scheduling algorithms. Furthermore, the proposed algorithm can effectively achieve near-optimal performance, and provide much better throughput than the theoretical worst-case bound in the average case. View details