Chi-yao Hong
Chi-yao Hong is a Senior Staff Software Engineer in Google Global Networking (GGN). His team develops the traffic engineering systems for Google's global and regional networks, including B2 and B4. He received a PhD in Computer Science from the University of Illinois at Urbana-Champaign.
Authored Publications
Google Publications
Other Publications
Sort By
Orion: Google’s Software-Defined Networking Control Plane
Amin Vahdat
Amr Sabaa
Arjun Singh
Henrik Muehe
Joon Suan Ong
Karthik Swaminathan Nagaraj
KondapaNaidu Bollineni
Lorenzo Vicisano
Mike Conley
Min Zhu
Rich Alimi
Shawn Chen
Shidong Zhang
Waqar Mohsin
(2021)
Preview abstract
We present Orion, a distributed Software-Defined Networking platform deployed globally in Google’s datacenter (Jupiter) as well as Wide Area (B4) networks. Orion was designed around a modular, micro-service architecture with a central publish-subscribe database to enable a distributed, yet tightly-coupled, software-defined network control system. Orion enables intent-based management and control, is highly scalable and amenable to global control hierarchies.
Over the years, Orion has matured with continuously improving performance in convergence (up to 40x faster), throughput (handling up to 1.16 million network updates per second), system scalability (supporting 16x larger networks), and data plane availability (50x, 100x reduction in unavailable time in Jupiter and B4, respectively) while maintaining high development velocity with bi-weekly release cadence. Today, Orion robustly enables all of Google’s Software-Defined Networks defending against failure modes that are both generic to large scale production networks as well as unique to SDN systems.
View details
B4 and After: Managing Hierarchy, Partitioning, and Asymmetry for Availability and Scale in Google's Software-Defined WAN
Mohammad A. Alfares
Min Zhu
Rich Alimi
Kondapa Naidu Bollineni
Chandan Bhagat
Sourabh Jain
Jay Kaimal
Jeffrey Liang
Kirill Mendelev
Faro Thomas Rabe
Saikat Ray
Malveeka Tewari
Monika Zahn
Joon Ong
Amin Vahdat
SIGCOMM'18(2018)
Preview abstract
Private WANs are increasingly important to the operation of enterprises, telecoms, and cloud providers. For example, B4, Google’s private software-defined WAN, is larger and growing faster than our connectivity to the public Internet. In this paper, we present the five-year evolution of B4. We describe the techniques we employed to incrementally move from offering best-effort content-copy services to carrier-grade availability, while concurrently scaling B4 to accommodate 100x more traffic. Our key challenge is balancing the tension introduced by hierarchy required for scalability, the partitioning required for availability, and the capacity asymmetry inherent to the construction and operation of any large-scale network. We discuss our approach to managing this tension: i) we design a custom hierarchical network topology for both horizontal and vertical software scaling, ii) we manage inherent capacity asymmetry in hierarchical topologies using a novel traffic engineering algorithm without packet encapsulation, and iii) we re-architect switch forwarding rules via two-stage matching/hashing to deal with asymmetric network failures at scale.
View details
Achieving High Utilization with Software-Driven WAN
Srikanth Kandula
Ratul Mahajan
Ming Zhang
Vijay Gill
Mohan Nanduri
Roger Wattenhofer
SIGCOMM(2013)
Preview abstract
We present SWAN, a system that boosts the utilization of inter-datacenter networks by centrally controlling when and how much traffic each service sends and frequently re-configuring the network's data plane to match current traffic demand. But done simplistically, these re-configurations can also cause severe, transient congestion because different switches may apply updates at different times. We develop a novel technique that leverages a small amount of scratch capacity on links to apply updates in a provably congestion-free manner, without making any assumptions about the order and timing of updates at individual switches. Further, to scale to large networks in the face of limited forwarding table capacity, SWAN greedily selects a small set of entries that can best satisfy current demand. It updates this set without disrupting traffic by leveraging a small amount of scratch capacity in forwarding tables. Experiments using a testbed prototype and data-driven simulations of two production networks show that SWAN carries 60% more traffic than the current practice.
View details
Jellyfish: Networking Data Centers Randomly
Preview abstract
Industry experience indicates that the ability to incrementally expand data centers is essential. However, existing high-bandwidth network designs have rigid structure that interferes with incremental expansion. We present Jellyfish, a high-capacity network interconnect which, by adopting a random graph topology, yields itself naturally to incremental expansion. Somewhat surprisingly, Jellyfish is more cost-efficient than a fat-tree, supporting as many as 25% more servers at full capacity using the same equipment at the scale of a few thousand nodes, and this advantage improves with scale. Jellyfish also allows great flexibility in building networks with different degrees of oversubscription. However, Jellyfish’s unstructured design brings new challenges in routing, physical layout, and wiring. We describe approaches to resolve these challenges, and our evaluation suggests that Jellyfish could be deployed in today’s data centers.
View details
Finishing flows quickly with preemptive scheduling
Preview abstract
Today's data centers face extreme challenges in providing low latency. However, fair sharing, a principle commonly adopted in current congestion control protocols, is far from optimal for satisfying latency requirements. We propose Preemptive Distributed Quick (PDQ) flow scheduling, a protocol designed to complete flows quickly and meet flow deadlines. PDQ enables flow preemption to approximate a range of scheduling disciplines. For example, PDQ can emulate a shortest job first algorithm to give priority to the short flows by pausing the contending flows. PDQ borrows ideas from centralized scheduling disciplines and implements them in a fully distributed manner, making it scalable to today's data centers. Further, we develop a multipath version of PDQ to exploit path diversity.
View details
Populated IP addresses: classification and applications
Preview abstract
Populated IP addresses (PIP) -- IP addresses that are associated with a large number of user requests are important for online service providers to efficiently allocate resources and to detect attacks. While some PIPs serve legitimate users, many others are heavily abused by attackers to conduct malicious activities such as scams, phishing, and malware distribution. Unfortunately, commercial proxy lists like Quova have a low coverage of PIP addresses and offer little support for distinguishing good PIPs from abused ones. In this study, we propose PIPMiner, a fully automated method to extract and classify PIPs through analyzing service logs. Our methods combine machine learning and time series analysis to distinguish good PIPs from abused ones with over 99.6% accuracy. When applying the derived PIP list to several applications, we can identify millions of malicious Windows Live accounts right on the day of their sign-ups, and detect millions of malicious Hotmail accounts well before the current detection system captures them.
View details
Tiresias: Online anomaly detection for hierarchical operational network data
Preview abstract
Operational network data, management data such as customer care call logs and equipment system logs, is a very important source of information for network operators to detect problems in their networks. Unfortunately, there is lack of efficient tools to automatically track and detect anomalous events on operational data, causing ISP operators to rely on manual inspection of this data. While anomaly detection has been widely studied in the context of network data, operational data presents several new challenges, including the volatility and sparseness of data, and the need to perform fast detection (complicating application of schemes that require offline processing or large/stable data sets to converge). To address these challenges, we propose Tiresias, an automated approach to locating anomalous events on hierarchical operational data. Tiresias leverages the hierarchical structure of operational data to identify high-impact aggregates (e.g., locations in the network, failure modes) likely to be associated with anomalous events. To accommodate different kinds of operational network data, Tiresias consists of an online detection algorithm with low time and space complexity, while preserving high detection accuracy. We present results from two case studies using operational data collected at a large commercial IP network operated by a Tier-1 ISP: customer care call logs and set-top box crash logs. By comparing with a reference set verified by the ISP's operational group, we validate that Tiresias can achieve >;94% accuracy in locating anomalies. Tiresias also discovered several previously unknown anomalies in the ISP's customer care cases, demonstrating its effectiveness.
View details
Clockscalpel: Understanding root causes of Internet clock synchronization inaccuracy
Chia-Chi Lin
Matthew Caesar
International Conference on Passive and Active Network Measurement (PAM)(2011)
Preview abstract
Synchronizing clocks is an integral part of modern network and security architectures. However, the ability to synchronize clocks in modern networks is not well-understood. In this work, we use testbeds equipped with a high-accuracy GPS receiver to acquire ground truth, to study the accuracy of probe-based synchronization techniques to over 1861 public time servers. We find that existing synchronization protocols provide a median error of 2–5 ms, but suffer from a long-tail. We analyze sources of inaccuracy by decoupling and quantifying different network factors. We found that most inaccuracies stem from asymmetry of propagation delay and queueing delay. We discuss possible schemes to compensate these errors to improve synchronization accuracy.
View details
Approximation algorithms for a link scheduling problem in wireless relay networks with QoS guarantee
Preview abstract
The emerging wireless relay networks (WRNs) are expected to provide significant improvement on throughput and extension of coverage area for next-generation wireless systems. We study an optimization problem for multihop link scheduling with bandwidth and delay guarantees over WRNs. Our optimization problem is investigated under a more general interference model with a generic objective. The objective can be based on various kinds of performance indexes (eg, throughput, fairness, and capacity), which can be determined by service providers. Through our theoretical analysis, the intractability and inapproximability of the optimization problem are shown. Due to the intractable computational complexity, we present efficient algorithms to provide a reasonable small approximation factor against any optimal solution even for a worst-case input. Furthermore, some experimental results indicate that our algorithms yield near-optimal performance in the average case.
View details
BotGrep: Finding Bots with Structured Graph Analysis
Preview abstract
A key feature that distinguishes modern botnets from earlier counterparts is their increasing use of structured overlay topologies. This lets them carry out sophisticated coordinated activities while being resilient to churn, but it can also be used as a point of detection. In this work, we devise techniques to localize botnet members based on the unique communication patterns arising from their overlay topologies used for command and control. Experimental results on synthetic topologies embedded within Internet traffic traces from an ISP's backbone network indicate that our techniques (i) can localize the majority of bots with low false positive rate, and (ii) are resilient to incomplete visibility arising from partial deployment of monitoring systems and measurement inaccuracies from dynamics of background traffic.
View details
Link scheduling with QoS guarantee for wireless relay networks
Preview abstract
The emerging wireless relay networks (WRNs) are expected to provide significant improvement on throughput and extension of coverage area for next-generation wireless systems. We study an optimization problem for multi-hop link scheduling with bandwidth and delay guarantees over WRNs. Our optimization problem is investigated under a general interference model with a generic objective. The objective can be based on various kinds of performance indexes (eg, throughput, fairness and capacity) determined by service providers. We present efficient algorithms for maximizing the objective function. The experimental results indicate that our presented algorithms yield near-optimal performance.
View details
3-approximation algorithm for joint routing and link scheduling in wireless relay networks
Preview abstract
In emerging wireless relay networks (WRNs) such as IEEE 802.16 j, efficient resource allocation is becoming a substantial issue for throughput optimization. In this paper, we propose an algorithm for joint routing and link scheduling in WRNs. The developed theoretical analysis indicates that the performance of the proposed algorithm is within a factor of three of that of any optimal algorithm in the worst case. Through simulation experiments, the numerical results show that our algorithm outperforms the previously proposed routing and link-scheduling algorithms. Furthermore, the proposed algorithm can effectively achieve near-optimal performance, and provide much better throughput than the theoretical worst-case bound in the average case.
View details