The scale and expense of modern data centers motivate running them as efficiently as possible. This paper explores how virtualized data center performance can be improved when network traffic and topology data inform VM placement. Our practical heuristics, tested on network-heavy, scale-out workloads in an 80-server cluster, improve overall performance by up to 70% compared to random placement in a multi-tenant configuration.
@article{mckeown2008openflow,
title={OpenFlow: Enabling Innovation in Campus Networks},
author={McKeown, Nick and Anderson, Tom and Balakrishnan, Hari and Parulkar, Guru and Peterson, Larry and Rexford, Jennifer and Shenker, Scott and Turner, Jonathan},
journal={ACM SIGCOMM Computer Communication Review},
volume={38},
number={2},
pages={69--74},
year={2008}
}
Pat Bosshart, Dan Daly, Glen Gibb, Martin Izzard, Nick McKeown, Jennifer Rexford, Cole Schlesinger, Dan Talayco, Amin Vahdat, George Varghese, David Walker
ACM SIGCOMM Computer Communication Review, Vol. 44, Issue 3, July 2014
P4 is a high-level language for programming protocol-independent packet processors. P4 works in conjunction with SDN control protocols like OpenFlow. In its current form, OpenFlow explicitly specifies protocol headers on which it operates. This set has grown from 12 to 41 fields in a few years, increasing the complexity of the specification while still not providing the flexibility to add new headers. In this paper we propose P4 as a strawman proposal for how OpenFlow should evolve in the future. We have three goals: (1) Reconfigurability in the field: Programmers should be able to change the way switches process packets once they are deployed. (2) Protocol independence: Switches should not be tied to any specific network protocols. (3) Target independence: Programmers should be able to describe packet-processing functionality independently of the specifics of the underlying hardware. As an example, we describe how to use P4 to configure a switch to add a new hierarchical label.
@article{bosshart2014p4,
title={P4: Programming Protocol-Independent Packet Processors},
author={Bosshart, Pat and Daly, Dan and Gibb, Glen and Izzard, Martin and McKeown, Nick and Rexford, Jennifer and Schlesinger, Cole and Talayco, Dan and Vahdat, Amin and Varghese, George and Walker, David},
journal={ACM SIGCOMM Computer Communication Review},
volume={44},
number={3},
pages={87--95},
year={2014}
}
Mininet is a system for rapidly prototyping large networks on the limited resources of a single computer. It creates virtual networks using lightweight OS-level virtualization, providing hosts, switches, controllers, and links that behave like their hardware counterparts. Mininet enables interactive development, testing, and debugging of Software-Defined Networks, and its networks can be easily transferred to hardware for line-rate operation. We describe Mininet's design and evaluate its fidelity and performance.
@inproceedings{lantz2010network,
title={A Network in a Laptop: Rapid Prototyping for Software-Defined Networks},
author={Lantz, Bob and Heller, Brandon and McKeown, Nick},
booktitle={ACM SIGCOMM Workshop on Hot Topics in Networks (HotNets)},
year={2010}
}
We describe and analyze iSLIP, a low-complexity scheduling algorithm for input-queued switches. iSLIP uses multiple iterations of a parallel round-robin matching algorithm to achieve high throughput and low latency. We prove that for uniform i.i.d. Bernoulli arrivals, iSLIP achieves 100% throughput. Under non-uniform traffic, iSLIP performs well in practice, converging quickly to a stable scheduling pattern. The algorithm is simple enough to implement in hardware at high line rates and has been used in several commercial switch designs.
@article{mckeown1999islip,
title={The iSLIP Scheduling Algorithm for Input-Queued Switches},
author={McKeown, Nick},
journal={IEEE/ACM Transactions on Networking},
volume={7},
number={2},
pages={188--201},
year={1999}
}
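The request-grant-accept round at the heart of iSLIP can be sketched in a few lines of Python. This is an illustrative single-iteration model with hypothetical data structures, not the hardware algorithm; real iSLIP runs multiple iterations per cell time and updates pointers only for grants accepted in the first iteration.

```python
# Sketch of one iSLIP request-grant-accept round (illustrative model).

def islip_iteration(requests, grant_ptr, accept_ptr):
    """requests[i] = set of outputs that input i has cells for.
    grant_ptr[j], accept_ptr[i] are round-robin pointers.
    Returns a matching as a dict {input: output}, updating pointers
    only for accepted grants, as in the paper."""
    n = len(requests)
    # Grant phase: each output grants the requesting input at or after
    # its round-robin pointer, wrapping around.
    grants = {}  # output -> granted input
    for j in range(n):
        requesters = [i for i in range(n) if j in requests[i]]
        if requesters:
            grants[j] = min(requesters, key=lambda i: (i - grant_ptr[j]) % n)
    # Accept phase: each input accepts the granting output at or after
    # its pointer; pointers advance one past the match. This conditional
    # pointer update is what desynchronizes the round-robin schedules.
    matching = {}
    for i in range(n):
        granting = [j for j, g in grants.items() if g == i]
        if granting:
            j = min(granting, key=lambda j: (j - accept_ptr[i]) % n)
            matching[i] = j
            grant_ptr[j] = (i + 1) % n
            accept_ptr[i] = (j + 1) % n
    return matching
```

With two inputs both requesting output 0 and pointers at zero, input 0 wins the first round and the pointers then rotate past it, giving input 1 its turn next.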
In Software Defined Networking (SDN) the control plane is physically separate from the forwarding plane. Control software programs the forwarding plane (e.g., switches and routers) using an open interface, such as OpenFlow. This paper aims to overcome two limitations in current switching chips and the OpenFlow protocol: (i) current hardware switches are quite rigid, allowing "Match-Action" processing on only a fixed set of fields, and (ii) the OpenFlow specification only defines a limited repertoire of packet processing actions. We propose the RMT (Reconfigurable Match Tables) model, a new RISC-inspired pipelined architecture for switching chips, and we identify the essential minimal set of action primitives to specify how headers are processed in hardware. RMT allows the forwarding plane to be changed in the field without modifying hardware. As in OpenFlow, the programmer can specify multiple match tables of arbitrary width and depth, subject only to an overall resource limit, with each table configurable for matching on arbitrary fields. However, RMT allows the programmer to modify all header fields much more comprehensively than in OpenFlow. Our paper describes the design of a 64 port by 10 Gb/s switch chip implementing the RMT model.
@inproceedings{bosshart2013forwarding,
title={Forwarding Metamorphosis: Fast Programmable Match-Action Processing in Hardware for SDN},
author={Bosshart, Pat and Gibb, Glen and Kim, Hun-Seok and Varghese, George and McKeown, Nick and Izzard, Martin and Mujica, Fernando and Horowitz, Mark},
booktitle={ACM SIGCOMM},
year={2013}
}
Energy proportionality is a key goal for data center networks. Networks are a shared resource connecting critical IT infrastructure, and the general practice is to keep them always on, even during off-peak hours when they are lightly utilized. ElasticTree is a network-wide energy optimizer that dynamically adjusts the set of active network elements to satisfy changing data center traffic loads. It combines a formal model of the network with traffic information to find the minimum power network subset and to turn off unneeded links and switches. We demonstrate that ElasticTree can save 25-40% of network energy in a data center.
@inproceedings{heller2010elastictree,
title={ElasticTree: Saving Energy in Data Center Networks},
author={Heller, Brandon and Seetharaman, Srini and Mahadevan, Priya and Yiakoumis, Yiannis and Sharma, Puneet and Banerjee, Sujata and McKeown, Nick},
booktitle={USENIX NSDI},
year={2010}
}
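ElasticTree's "minimum power network subset" idea can be illustrated with a toy calculation: keep just enough uplinks active to carry current demand plus headroom, and power down the rest. The function, numbers, and headroom parameter below are hypothetical; the paper uses a formal optimizer over the entire topology and measured traffic matrix.

```python
import math

# Toy sketch of ElasticTree's idea (hypothetical parameters): keep the
# minimum number of uplinks active for the current demand, with headroom.

def active_uplinks(demand_gbps, link_capacity_gbps, total_links, headroom=0.2):
    """Return how many uplinks must stay on to carry `demand_gbps`,
    reserving `headroom` spare capacity; clamped to [1, total_links]."""
    needed = math.ceil(demand_gbps * (1 + headroom) / link_capacity_gbps)
    return min(max(needed, 1), total_links)

# Off-peak, 3 Gb/s of demand over 10 Gb/s links needs only 1 of 4
# uplinks, so the other 3 can be powered down.
```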
This paper presents Ethane, a new network architecture for the enterprise. Ethane allows managers to define a single network-wide fine-grain policy, and then enforces it directly. Ethane couples extremely simple flow-based Ethernet switches with a centralized controller that manages the admittance and routing of flows. While radical, this design is backwards-compatible with existing hosts and switches. We have built and deployed Ethane in our university department, and we describe our experience over a seven-month period.
@inproceedings{casado2007ethane,
title={Ethane: Taking Control of the Enterprise},
author={Casado, Martin and Freedman, Michael J. and Pettit, Justin and Luo, Jianying and McKeown, Nick and Shenker, Scott},
booktitle={ACM SIGCOMM},
year={2007}
}
It is well known that head-of-line (HOL) blocking limits the throughput of an input-queued switch with FIFO queues. Under certain conditions, the throughput can be shown to be limited to approximately 58%. It is also known that if non-FIFO queueing policies are used, the throughput can be increased. However, it has not been previously shown that if a suitable queueing policy and scheduling algorithm are used then it is possible to achieve 100% throughput for all independent arrival processes. In this paper we prove this to be the case using a simple linear programming argument and quadratic Lyapunov function. In particular, we assume that each input maintains a separate FIFO queue for each output and that the switch is scheduled using a maximum weight bipartite matching algorithm.
@inproceedings{mckeown1996achieving,
title={Achieving 100\% Throughput in an Input-Queued Switch},
author={McKeown, Nick and Anantharam, Venkat and Walrand, Jean},
booktitle={IEEE INFOCOM},
year={1996}
}
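The scheduling policy the paper analyzes — serve the maximum weight bipartite matching, weighting each input-output pair by its virtual output queue (VOQ) occupancy — can be sketched by brute force over permutations. This is fine for illustration on tiny switches but nothing like a line-rate implementation.

```python
from itertools import permutations

# Sketch of maximum weight matching over VOQ occupancies (brute force).

def max_weight_matching(q):
    """q[i][j] = cells queued at input i destined for output j.
    Returns (best_weight, matching) where matching[i] is the output
    assigned to input i."""
    n = len(q)
    best_weight, best = -1, None
    for perm in permutations(range(n)):
        w = sum(q[i][perm[i]] for i in range(n))
        if w > best_weight:
            best_weight, best = w, perm
    return best_weight, list(best)
```

For occupancies [[3, 1], [2, 4]], the identity matching wins with weight 7, draining the longest queues first — the property the Lyapunov argument exploits.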
We present pFabric, a minimalistic datacenter transport design that achieves near-optimal flow completion times for short flows while maintaining high throughput for large flows. pFabric has a very simple design: switches implement a priority queue that schedules packets based on a single priority number, and the endpoints implement a minimal rate control mechanism. Despite its simplicity, we show through analysis and simulation that pFabric achieves flow completion times within 10-15% of the optimal for a wide range of workloads.
@inproceedings{alizadeh2013pfabric,
title={pFabric: Minimal Near-Optimal Datacenter Transport},
author={Alizadeh, Mohammad and Yang, Shuang and Sharif, Milad and Katti, Sachin and McKeown, Nick and Prabhakar, Balaji and Shenker, Scott},
booktitle={ACM SIGCOMM},
year={2013}
}
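The switch behavior the abstract describes — a single priority number per packet, highest priority dequeued first — is easy to model. In the sketch below (class and names are illustrative, not from the paper), priority is the flow's remaining size, matching pFabric's shortest-remaining-first flavor.

```python
import heapq

# Toy model of a pFabric switch queue: dequeue strictly by priority
# number (lower = more urgent), FIFO among equal priorities.

class PFabricQueue:
    def __init__(self):
        self._heap = []
        self._seq = 0  # tie-breaker: preserves FIFO within a priority

    def enqueue(self, priority, packet):
        heapq.heappush(self._heap, (priority, self._seq, packet))
        self._seq += 1

    def dequeue(self):
        return heapq.heappop(self._heap)[2]
```

A short flow's packets (small remaining size, hence small priority number) jump ahead of a large flow's backlog, which is how pFabric drives short-flow completion times toward optimal.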
All Internet routers contain buffers to hold packets during times of congestion. Today, the size of these buffers is determined by the rule-of-thumb that each link needs a buffer of size equal to the bandwidth-delay product. The widespread use of this rule has led to backbone routers with millions of packets of buffering. We question whether this much buffering is actually needed. Through a combination of analysis and simulation, we argue that buffers can be much smaller than the bandwidth-delay product, reducing router cost and latency.
@inproceedings{appenzeller2004sizing,
title={Sizing Router Buffers},
author={Appenzeller, Guido and Keslassy, Isaac and McKeown, Nick},
booktitle={ACM SIGCOMM},
year={2004}
}
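The paper's central result is quantitative: with n long-lived TCP flows sharing a link, a buffer of (RTT × C) / √n suffices, versus the classic RTT × C rule of thumb. A quick comparison (toy numbers in the comment):

```python
import math

# Buffer sizing: rule of thumb vs. the small-buffer result from the paper.

def rule_of_thumb_bits(rtt_s, capacity_bps):
    """Classic bandwidth-delay-product buffer: RTT x C."""
    return rtt_s * capacity_bps

def small_buffer_bits(rtt_s, capacity_bps, n_flows):
    """Small-buffer result for n desynchronized long-lived flows."""
    return rtt_s * capacity_bps / math.sqrt(n_flows)

# A 10 Gb/s link with 250 ms RTT and 10,000 flows: 2.5e9 bits under the
# rule of thumb vs. 2.5e7 bits -- a 100x reduction in buffering.
```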
P4 has emerged as the de facto standard language for describing how network packets should be processed, and is becoming widely used by network owners, systems developers, researchers and in the classroom. The goal of the work presented here is to make it easier for engineers, researchers and students to learn how to program using P4, and to build prototypes running on real hardware. Our target is the NetFPGA SUME platform, a 4x10 Gb/s PCIe card designed for use in universities for teaching and research. Until now, NetFPGA users have needed to learn an HDL such as Verilog or VHDL, making it off limits to many software developers and students. Therefore, we developed the P4->NetFPGA workflow, allowing developers to describe how packets are to be processed in the high-level P4 language, then compile their P4 programs to run at line rate on the NetFPGA SUME board. The P4->NetFPGA workflow is built upon the Xilinx P4-SDNet compiler and the NetFPGA SUME open source code base. In this paper, we provide an overview of the P4 programming language and describe the P4->NetFPGA workflow. We also describe how the workflow is being used by the P4 community to build research prototypes, and to teach how network systems are built by providing students with hands-on experience working with real hardware.
@inproceedings{ibanez2019p4netfpga,
title={The P4$\rightarrow$NetFPGA Workflow for Line-Rate Packet Processing},
author={Ibanez, Stephen and Brebner, Gordon and McKeown, Nick and Zilberman, Noa},
booktitle={ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA)},
year={2019}
}
We present the design and implementation of p4v, a practical tool for verifying data planes described using the P4 programming language. The design of p4v is based on classic verification techniques but adds several key innovations including a novel mechanism for incorporating assumptions about the control plane and domain-specific optimizations which are needed to scale to large programs. We present case studies showing that p4v verifies important properties and finds bugs in real-world programs. We conduct experiments to quantify the scalability of p4v on a wide range of additional examples. We show that with just a few hundred lines of control-plane annotations, p4v is able to verify critical safety properties for switch.p4, a program that implements the functionality of a modern data center switch, in under three minutes.
@inproceedings{liu2018p4v,
title={p4v: Practical Verification for Programmable Data Planes},
author={Liu, Jed and Hallahan, William and Schlesinger, Cole and Sharif, Milad and Lee, Jeongkeun and Soule, Robert and Wang, Han and Cascaval, Calin and McKeown, Nick and Foster, Nate},
booktitle={ACM SIGCOMM},
year={2018}
}
We present AppSwitch, an application-layer approach to load balancing within a software switch. AppSwitch uses system call interposition to transparently redirect connection setup to backend servers, without modifying applications or the network stack. It supports application-aware routing decisions and integrates with container orchestration platforms. AppSwitch achieves performance comparable to kernel-level load balancers while providing greater flexibility and simpler deployment.
@inproceedings{cidon2017appswitch,
title={AppSwitch: Application-layer Load Balancing within a Software Switch},
author={Cidon, Eyal and Choi, Sean and Katti, Sachin and McKeown, Nick},
booktitle={APNet},
year={2017}
}
Many algorithms for congestion control, scheduling, network measurement, active queue management, and traffic engineering require custom processing of packets in the data plane of a network switch. To run at line rate, these data-plane algorithms must be implemented in hardware. With today's switch hardware, algorithms cannot be changed, nor new algorithms installed, after a switch has been built. This paper shows how to program data-plane algorithms in a high-level language and compile those programs into low-level microcode that can run on emerging programmable line-rate switching chips. The key challenge is that many data-plane algorithms create and modify algorithmic state. To achieve line-rate programmability for stateful algorithms, we introduce the notion of a packet transaction: a sequential packet-processing code block that is atomic and isolated from other such code blocks. We have developed this idea in Domino, a C-like imperative language to express data-plane algorithms. We show with many examples that Domino provides a convenient way to express sophisticated data-plane algorithms, and show that these algorithms can be run at line rate with modest estimated chip-area overhead.
@inproceedings{sivaraman2016packet,
title={Packet Transactions: High-Level Programming for Line-Rate Switches},
author={Sivaraman, Anirudh and Cheung, Alvin and Budiu, Mihai and Kim, Changhoon and Alizadeh, Mohammad and Balakrishnan, Hari and Varghese, George and McKeown, Nick and Licking, Steve},
booktitle={ACM SIGCOMM},
year={2016}
}
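The packet-transaction abstraction can be modeled in ordinary Python: a sequential code block over mutable switch state that runs to completion per packet, with no interleaving from other packets. The flowlet-style counter below is a simplified illustration with made-up state and thresholds, not code from the paper; Domino's contribution is compiling such blocks onto pipeline stages at line rate, which this model ignores.

```python
# Model of a packet transaction: atomic, isolated, per-packet execution
# over switch state (illustrative flowlet-style example).

state = {"count": {}, "last_seen": {}}
FLOWLET_GAP = 5  # time units between packets that starts a new flowlet

def transaction(pkt, now):
    """Runs to completion per packet: reads and writes shared state with
    no interleaving -- the isolation a packet transaction guarantees."""
    flow = pkt["flow"]
    # A large inter-packet gap starts a new flowlet: reset its counter.
    if now - state["last_seen"].get(flow, -FLOWLET_GAP - 1) > FLOWLET_GAP:
        state["count"][flow] = 0
    state["count"][flow] = state["count"].get(flow, 0) + 1
    state["last_seen"][flow] = now
    return state["count"][flow]
```

Because the read-modify-write on `count` and `last_seen` is atomic per packet, the programmer never reasons about two packets racing on the same state — the property the compiler must then preserve in hardware.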
Anirudh Sivaraman, Suvinay Subramanian, Mohammad Alizadeh, Sharad Chole, Shang-Tse Chuang, Anurag Agrawal, Hari Balakrishnan, Tom Edsall, Sachin Katti, Nick McKeown
Switches today provide a small menu of scheduling algorithms. While we can tweak scheduling parameters, we cannot modify algorithmic logic, or add a completely new algorithm, after the switch has been designed. This paper presents a design for a {\em programmable} packet scheduler, which allows scheduling algorithms---potentially algorithms that are unknown today---to be programmed into a switch without requiring hardware redesign. Our design uses the property that scheduling algorithms make two decisions: in what order to schedule packets and when to schedule them. Further, we observe that in many scheduling algorithms, definitive decisions on these two questions can be made when packets are enqueued. We use these observations to build a programmable scheduler using a single abstraction: the push-in first-out queue (PIFO), a priority queue that maintains the scheduling order or time. We show that a PIFO-based scheduler lets us program a wide variety of scheduling algorithms. We present a hardware design for this scheduler for a 64-port 10 Gbit/s shared-memory (output-queued) switch. Our design costs an additional 4% in chip area. In return, it lets us program many sophisticated algorithms, such as a 5-level hierarchical scheduler with programmable decisions at each level.
@inproceedings{sivaraman2016programmable,
title={Programmable Packet Scheduling at Line Rate},
author={Sivaraman, Anirudh and Subramanian, Suvinay and Alizadeh, Mohammad and Chole, Sharad and Chuang, Shang-Tse and Agrawal, Anurag and Balakrishnan, Hari and Edsall, Tom and Katti, Sachin and McKeown, Nick},
booktitle={ACM SIGCOMM},
year={2016}
}
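The PIFO abstraction itself is small enough to model directly: packets are pushed in at a position determined by a programmable rank and always popped from the head, so a scheduling algorithm reduces to a rank computation at enqueue time. A minimal sketch (class name and structure are illustrative; the paper's contribution is a hardware design for this):

```python
# Minimal model of a push-in first-out queue (PIFO).

class PIFO:
    def __init__(self):
        self._q = []  # head at index 0, kept sorted by rank

    def push(self, rank, pkt):
        # Insert after all entries with rank <= this one, so packets of
        # equal rank stay in FIFO order.
        i = len(self._q)
        while i > 0 and self._q[i - 1][0] > rank:
            i -= 1
        self._q.insert(i, (rank, pkt))

    def pop(self):
        return self._q.pop(0)[1]
```

Programming, say, strict priorities means ranking by priority class; weighted fair queueing means ranking by virtual finish time — the queue itself never changes.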
We present PISCES, a protocol-independent software switch derived from Open vSwitch. Like a hardware P4 switch, PISCES can be configured to parse and process any set of protocols using a P4 program, but it runs entirely in software. By automatically generating the switch's parser and match-action pipeline from a P4 specification, PISCES eliminates the need to manually modify switch source code when adding new protocols. We show that PISCES achieves competitive performance with Open vSwitch while offering significantly greater flexibility.
@inproceedings{shahbaz2016pisces,
title={PISCES: A Programmable, Protocol-Independent Software Switch},
author={Shahbaz, Muhammad and Choi, Sean and Pfaff, Ben and Kim, Changhoon and Feamster, Nick and McKeown, Nick and Rexford, Jennifer},
booktitle={ACM SIGCOMM},
year={2016}
}
Anirudh Sivaraman, Suvinay Subramanian, Anurag Agrawal, Sharad Chole, Shang-Tse Chuang, Tom Edsall, Mohammad Alizadeh, Sachin Katti, Nick McKeown, Hari Balakrishnan
We present a compiler that translates programs written in a high-level packet-processing language to run on reconfigurable switch hardware. The compiler maps packet-processing pipelines onto the available hardware resources of reconfigurable match-action table architectures, handling the constraints of limited table memory, match types, and pipeline depth. Our compiler enables network operators to rapidly deploy new protocols and features on existing hardware, without waiting for vendors to add support in firmware.
@inproceedings{sivaraman2015towards,
title={Towards Programmable Packet Scheduling},
author={Sivaraman, Anirudh and Subramanian, Suvinay and Agrawal, Anurag and Chole, Sharad and Chuang, Shang-Tse and Edsall, Tom and Alizadeh, Mohammad and Katti, Sachin and McKeown, Nick and Balakrishnan, Hari},
booktitle={ACM HotNets},
year={2015}
}
We present a compiler that maps packet-processing programs to the pipeline architecture of reconfigurable match-action switches. These switches have a fixed number of match-action stages with limited resources, and the compiler must map a logical program to these physical constraints. We study the interplay between the expressiveness of the programming model and the efficiency of the resulting pipeline, and show that our compiler can successfully map a variety of packet-processing programs to existing hardware targets.
@inproceedings{jose2015compiling,
title={Compiling Packet Programs to Reconfigurable Switches},
author={Jose, Lavanya and Yan, Lisa and Varghese, George and McKeown, Nick},
booktitle={USENIX NSDI},
year={2015}
}
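The core placement problem this compiler solves can be sketched as greedy levelization: logical match-action tables form a dependency DAG, and each table must land in a pipeline stage no earlier than the stages of the tables it depends on, subject to per-stage limits. The sketch below (table names and the single "tables per stage" limit are hypothetical) is a simplification of the paper's techniques, which also model memory types, widths, and sizes.

```python
# Greedy assignment of dependent match-action tables to pipeline stages.

def assign_stages(tables, deps, tables_per_stage):
    """deps[t] = set of tables that must sit in a strictly earlier stage
    than t. Returns {table: stage_index}; raises on a dependency cycle."""
    stage_of = {}
    remaining = list(tables)
    load = {}  # stage index -> number of tables placed there
    while remaining:
        progress = False
        for t in list(remaining):
            if all(d in stage_of for d in deps.get(t, ())):
                earliest = 1 + max(
                    (stage_of[d] for d in deps.get(t, ())), default=-1)
                s = earliest
                while load.get(s, 0) >= tables_per_stage:
                    s += 1  # spill to a later stage when this one is full
                stage_of[t] = s
                load[s] = load.get(s, 0) + 1
                remaining.remove(t)
                progress = True
        if not progress:
            raise ValueError("cyclic table dependencies")
    return stage_of
```

With one table per stage and both `acl` and `nexthop` depending on `ipv4`, the resource limit forces one of the two independent tables into a later stage — exactly the expressiveness-versus-efficiency interplay the paper studies.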
This article provides a Chinese-language introduction to the P4 programming language for programming network data planes. P4 enables network engineers to define how packets are processed by programmable switches, routers, and NICs. The article covers key P4 concepts including header definitions, parser specifications, match-action tables, and control flow.
@article{kim2015programming,
title={Programming the Network Dataplane in P4},
author={Kim, Changhoon and Kao, Ron and McKeown, Nick},
journal={Communications of the CCF},
year={2015}
}
It became clear in the early 2000s that the Internet faced challenges with ossification, slow innovation, and complex network management due to vendor-controlled hardware and software. Software-defined networking (SDN) emerged as a transformative solution, introducing an open interface for packet forwarding and logically centralized control. Its success stemmed from a virtuous cycle between early U.S. National Science Foundation (NSF)-funded academic research (for example, 100x100, GENI, FIND) and the pressing needs of cloud hyperscalers for flexible, scalable networks. SDN revolutionized network design and operation across public and private sectors, driving significant commercial adoption, fostering a broad research community, and enabling new capabilities like multi-tenant virtualization and efficient wide-area traffic engineering. The NSF investments of the past two decades were instrumental in this transformation.
@article{mckeown2025nsf,
title={How the U.S. National Science Foundation Enabled Software-Defined Networking},
author={McKeown, Nick and Rexford, Jennifer},
journal={Communications of the ACM},
volume={68},
number={12},
year={2025}
}
Revitalizing the Public Internet by Making It Extensible
Hari Balakrishnan, Sujata Banerjee, Israel Cidon, David Culler, Deborah Estrin, Ethan Katz-Bassett, Arvind Krishnamurthy, James McCauley, Nick McKeown, Aurojit Panda, Sylvia Ratnasamy, Jennifer Rexford, Michael Schapira, Scott Shenker, Ion Stoica, David Tennenhouse, Amin Vahdat, Ellen Zegura
ACM SIGCOMM Computer Communication Review, Vol. 51, No. 2, 2021
There is now a significant and growing functional gap between the public Internet, whose basic architecture has remained unchanged for several decades, and a new generation of more sophisticated private networks. To address this increasing divergence of functionality and overcome the Internet's architectural stagnation, we argue for the creation of an Extensible Internet (EI) that supports in-network services that go beyond best-effort packet delivery. To gain experience with this approach, we hope to soon deploy both an experimental version (for researchers) and a prototype version (for early adopters) of EI. In the longer term, making the Internet extensible will require a community to initiate and oversee the effort; this paper is the first step in creating such a community.
@article{balakrishnan2021revitalizing,
title={Revitalizing the Public Internet by Making It Extensible},
author={Balakrishnan, Hari and Banerjee, Sujata and Cidon, Israel and Culler, David and Estrin, Deborah and Katz-Bassett, Ethan and Krishnamurthy, Arvind and McCauley, James and McKeown, Nick and Panda, Aurojit and Ratnasamy, Sylvia and Rexford, Jennifer and Schapira, Michael and Shenker, Scott and Stoica, Ion and Tennenhouse, David and Vahdat, Amin and Zegura, Ellen},
journal={ACM SIGCOMM Computer Communication Review},
volume={51},
number={2},
year={2021}
}
Using Deep Programmability to Put Network Owners in Control
Nate Foster, Nick McKeown, Jennifer Rexford, Guru Parulkar, Larry Peterson, Oguz Sunay
ACM SIGCOMM Computer Communication Review, Vol. 50, No. 4, 2020
Controlling an opaque system by reading some "dials" and setting some "knobs," without really knowing what they do, is a hazardous and fruitless endeavor, particularly at scale. What we need are transparent networks, that start at the top with a high-level intent and map all the way down, through the control plane to the data plane. If we can specify the behavior we want in software, then we can check that the system behaves as we expect. This is impossible if the implementation is opaque. We therefore need to use open-source software or write it ourselves (or both), and have mechanisms for checking actual behavior against the specified intent. With fine-grain checking (e.g., every packet, every state variable), we can build networks that are more reliable, secure, and performant. In the limit, we can build networks that run autonomously under verifiable, closed-loop control. We believe this vision, while ambitious, is finally within our reach, due to deep programmability across the stack, both vertically (control and data plane) and horizontally (end to end). It will emerge naturally in some networks, as network owners take control of their software and engage in open-source efforts; whereas in enterprise networks it may take longer. In 5G access networks, there is a pressing need for our community to engage, so these networks, too, can operate autonomously under verifiable, closed-loop control.
@article{foster2020deep,
title={Using Deep Programmability to Put Network Owners in Control},
author={Foster, Nate and McKeown, Nick and Rexford, Jennifer and Parulkar, Guru and Peterson, Larry and Sunay, Oguz},
journal={ACM SIGCOMM Computer Communication Review},
volume={50},
number={4},
year={2020}
}
Democratizing the Network Edge
Larry Peterson, Tom Anderson, Sachin Katti, Nick McKeown, Guru Parulkar, Jennifer Rexford, Mahadev Satyanarayanan, Oguz Sunay, Amin Vahdat
ACM SIGCOMM Computer Communication Review, Vol. 49, No. 2, 2019
The cloud and telecommunications industry is in the midst of a transition towards the edge. There is a tremendous opportunity for the research community to influence this transformation, but doing so requires understanding industry momentum, and making a concerted effort to align with that momentum. We believe there are three keys to doing this: (1) focus on the intersection of the cloud and access networks, (2) contribute to the relevant open source projects, and (3) address the challenge of operationalizing the results. The paper puts forward a concrete proposal for all three, and discusses the opportunity to influence how the Internet evolves at the edge and enable innovative edge applications.
@article{peterson2019democratizing,
title={Democratizing the Network Edge},
author={Peterson, Larry and Anderson, Tom and Katti, Sachin and McKeown, Nick and Parulkar, Guru and Rexford, Jennifer and Satyanarayanan, Mahadev and Sunay, Oguz and Vahdat, Amin},
journal={ACM SIGCOMM Computer Communication Review},
volume={49},
number={2},
year={2019}
}
From Ethane to SDN and Beyond
Martin Casado, Nick McKeown, Scott Shenker
ACM SIGCOMM Computer Communication Review, Vol. 49, No. 5, 2019
We briefly describe the history behind the Ethane paper and its ultimate evolution into SDN and beyond.
@article{casado2019ethane,
title={From Ethane to SDN and Beyond},
author={Casado, Martin and McKeown, Nick and Shenker, Scott},
journal={ACM SIGCOMM Computer Communication Review},
volume={49},
number={5},
year={2019}
}
Event-Driven Packet Processing
Stephen Ibanez, Gianni Antichi, Gordon Brebner, Nick McKeown
The rise of programmable network devices and the P4 programming language has sparked an interest in developing new applications for packet processing data planes. Current data-plane programming models allow developers to express packet processing on a synchronous packet-by-packet basis, motivated by the goal of line rate processing in feed-forward pipelines. But some important data-plane operations do not naturally fit into this programming model. Sometimes we want to perform periodic tasks, or update the same state variables multiple times, or base a decision on state sitting at a different pipeline stage. While a P4-programmable device might contain special features to handle these tasks, such as packet generators and recirculation paths, there is currently no clean and consistent way to expose them to P4 programmers. We therefore propose a common, general way to express event processing using the P4 language, beyond just processing packet arrival and departure events. We believe that this more general notion of event processing can be supported without sacrificing line rate packet processing and we have developed a prototype event-driven architecture on the NetFPGA SUME platform to serve as an initial proof of concept.
@inproceedings{ibanez2019event,
title={Event-Driven Packet Processing},
author={Ibanez, Stephen and Antichi, Gianni and Brebner, Gordon and McKeown, Nick},
booktitle={ACM HotNets},
year={2019}
}
For the past two decades, the communication channel between the NIC and CPU has largely remained the same---issuing memory requests across a slow PCIe peripheral interconnect. Today, with application service times and network fabric delays measuring hundreds of nanoseconds, the NIC--CPU interface can account for most of the overhead when programming modern warehouse-scale computers. In this paper, we tackle this issue head-on by proposing a design for a fast path between the NIC and CPU, called Lightning NIC (L-NIC), which deviates from the established norms of offloading computation onto the NIC (inflating latency), or using centralized dispatcher cores for packet scheduling (limiting throughput). L-NIC adds support for a fast path from the network to the core of the CPU by writing and reading packets directly to/from the CPU register file. This approach minimizes network IO latency, providing significant performance improvements over traditional NIC--CPU interfaces.
@inproceedings{ibanez2019case,
title={The Case for a Network Fast Path to the CPU},
author={Ibanez, Stephen and Shahbaz, Muhammad and McKeown, Nick},
booktitle={ACM HotNets},
year={2019}
}
Dense WiFi deployments, where many access points operate in close proximity, suffer from poor channel selection, coverage gaps, and unsatisfying user experience. BeHop is a testbed that enables researchers to experiment with new approaches to managing dense WiFi networks. Built on commodity hardware, BeHop provides a programmable platform for experimenting with channel assignment, power control, and association policies in a realistic multi-AP environment.
@inproceedings{yiakoumis2014behop,
title={BeHop: A Testbed for Dense WiFi Networks},
author={Yiakoumis, Yiannis and Bansal, Manu and Covington, Adam and van Reijendam, Johan and Katti, Sachin and McKeown, Nick},
booktitle={WiNTECH},
year={2014}
}
Despite network monitoring and testing being critical for computer networks, current solutions are both extremely expensive and inflexible. Into this lacuna we launch the Open Source Network Tester, a fully open source traffic generator and capture system. Our prototype implementation on the NetFPGA-10G supports 4 ×
@article{antichi2014osnt,
title={OSNT: Open Source Network Tester},
author={Antichi, Gianni and Shahbaz, Muhammad and Geng, Yilong and Zilberman, Noa and Covington, Adam and Bruyere, Marc and McKeown, Nick and Feamster, Nick},
journal={IEEE Network},
year={2014}
}
Packet classification on general purpose CPUs remains expensive regardless of advances in classification algorithms. Unless the packet forwarding pipeline is both simple and static in function, fine-tuning the system for optimal forwarding is a time-consuming and brittle process. Network virtualization and network function virtualization value general purpose CPUs exactly for their flexibility: in such systems, a single x86 forwarding element does not implement a single, static classification step but a sequence of dynamically reconfigurable and potentially complex forwarding operations. This leaves a software developer looking for maximal packet forwarding throughput with few options besides flow caching. In this paper, we consider the problem of flow caching and more specifically the challenge of flow caching for packet header fields with high entropy.
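The flow-caching idea discussed above can be illustrated with a toy cache (our sketch, not the paper's system): run the expensive classifier once, then cache its verdict keyed only on the header fields the classifier actually consulted, so packets that differ only in high-entropy fields it ignored (e.g. source port) still hit the cache. All names here are illustrative.

```python
# Minimal flow-cache sketch (illustrative, not the paper's implementation).
def classify_slow(pkt):
    """Stand-in for a full classification pipeline; reports the fields it read."""
    fields_read = {"dst_ip"}  # this toy classifier only looks at dst_ip
    action = "fwd:1" if pkt["dst_ip"].startswith("10.") else "drop"
    return action, fields_read

cache = {}  # tuple of (field, value) pairs -> cached action

def lookup(pkt):
    # Fast path: reuse a cached verdict if the packet agrees on every
    # field the original classification actually depended on.
    for key, action in cache.items():
        if all(pkt.get(f) == v for f, v in key):
            return action
    # Slow path: classify, then cache on only the fields that were read,
    # so high-entropy fields the classifier ignored don't fragment the cache.
    action, fields_read = classify_slow(pkt)
    cache[tuple(sorted((f, pkt[f]) for f in fields_read))] = action
    return action
```

Because the cache key omits the ignored source port, the two packets below share one cache entry despite differing in a high-entropy field.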
@inproceedings{shelly2014flow,
title={Flow Caching for High Entropy Packet Fields},
author={Shelly, Nick and Jackson, Ethan J. and Koponen, Teemu and McKeown, Nick and Rajahalme, Jarno},
booktitle={HotSDN},
year={2014}
}
Since the publication of the original OpenFlow paper in 2008, software-defined networking (SDN) and OpenFlow have evolved from a research experiment into a widely-deployed production technology. We describe the key technical and operational lessons learned through deployment, including the evolution of the OpenFlow specification, the development of production SDN controllers, and the challenges of migrating from research prototypes to production systems.
@article{kobayashi2014maturing,
title={Maturing of OpenFlow and Software-Defined Networking through Deployments},
author={Kobayashi, Masayoshi and Seetharaman, Srini and Parulkar, Guru and Appenzeller, Guido and Little, Joseph and van Reijendam, Johan and Weissmann, Paul and McKeown, Nick},
journal={Computer Networks},
volume={61},
year={2014}
}
The Internet core today is completely based on IP routers. Circuits are only used to provide static point-to-point optical links between routers. We argue that circuits should play a bigger role in the core, because they are simpler, use less power, have lower latency, and are inherently more reliable than routers. In this paper we describe how a combination of circuit and packet switching could lead to a simpler, lower cost, and more reliable Internet backbone.
@article{das2013rethinking,
title={Rethinking IP Core Networks},
author={Das, Saurav and Parulkar, Guru and McKeown, Nick},
journal={Journal of Optical Communications and Networking},
year={2013}
}
All network devices must parse packet headers to decide how packets should be processed. A 64 × 10Gb/s Ethernet switch must parse one billion packets per second to extract fields used in forwarding decisions. Although a necessary part of all switch hardware, very little has been written on parser design and the trade-offs between different designs. Is it better to design one fast parser, or several slow parsers? What is the cost of making the parser reconfigurable in the field? What design decisions most impact power and area? In this paper, we describe trade-offs in parser design, identify design principles for switch and router designers, and describe a parser generator that outputs synthesizable Verilog and is available for download. We show that (i) packet parsers today occupy about 1-2% of the chip, and (ii) while future packet parsers will need to be programmable, this only doubles the (already small) area needed.
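As a rough software illustration of the parsing task the paper studies in hardware, a fixed Ethernet/IPv4 parse graph might be walked as below. The field offsets are the standard ones; the code itself is our sketch, not the paper's generator.

```python
# Walk a tiny fixed parse graph: Ethernet -> (if ethertype 0x0800) IPv4.
def parse(pkt: bytes):
    """Extract the header fields a forwarding decision might use."""
    fields = {}
    ethertype = int.from_bytes(pkt[12:14], "big")  # bytes 12-13 of Ethernet
    fields["eth.type"] = ethertype
    if ethertype == 0x0800 and len(pkt) >= 34:     # IPv4 follows
        ihl = pkt[14] & 0x0F                       # header length, 32-bit words
        fields["ipv4.proto"] = pkt[23]             # protocol byte
        fields["ipv4.dst"] = pkt[30:34].hex()      # destination address
        fields["l4.offset"] = 14 + ihl * 4         # where L4 parsing would resume
    return fields
```

A hardware parser makes the same graph walk, but the design questions above (one fast parser vs. several slow ones, fixed vs. programmable) concern how this state machine is laid out in silicon.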
@inproceedings{gibb2013design,
title={Design Principles for Packet Parsers},
author={Gibb, Glen and Varghese, George and Horowitz, Mark and McKeown, Nick},
booktitle={ACM/IEEE ANCS},
year={2013}
}
Network architectures such as Software-Defined Networks (SDNs) move the control logic off packet processing devices and onto external controllers. These network architectures with decoupled control planes open many unanswered questions regarding reliability, scalability, and performance when compared to more traditional purely distributed systems. This paper opens the investigation by focusing on two specific questions: given a topology, how many controllers are needed, and where should they go? To answer these questions, we examine fundamental limits to control plane propagation latency on an upcoming Internet2 production deployment, then expand our scope to over 100 publicly available WAN topologies. As expected, the answers depend on the topology. More surprisingly, one controller location is often sufficient to meet existing reaction-time requirements (though certainly not fault tolerance requirements).
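The single-controller version of the placement question above can be sketched in a few lines: pick the node that minimizes the worst-case shortest-path latency to every switch. The graph and latencies below are a toy example, not one of the paper's topologies.

```python
import heapq

def shortest_paths(graph, src):
    """Dijkstra over graph: node -> {neighbor: link_latency}."""
    dist = {src: 0.0}
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale entry
        for v, w in graph[u].items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

def best_single_controller(graph):
    """Return (node, latency) minimizing the worst-case latency to any switch."""
    best = None
    for cand in graph:
        worst = max(shortest_paths(graph, cand)[n] for n in graph)
        if best is None or worst < best[1]:
            best = (cand, worst)
    return best

# Toy 4-node WAN; per-link latencies in ms.
wan = {
    "A": {"B": 5, "C": 10},
    "B": {"A": 5, "C": 4, "D": 8},
    "C": {"A": 10, "B": 4, "D": 3},
    "D": {"B": 8, "C": 3},
}
```

Swapping `max` for an average gives the average-latency variant; the paper evaluates both objectives across real WAN topologies.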
@inproceedings{heller2012controller,
title={The Controller Placement Problem},
author={Heller, Brandon and Sherwood, Rob and McKeown, Nick},
booktitle={HotSDN},
year={2012}
}
This paper presents an architecture for adding functionality to networks via outsourcing. In this model, the enterprise network only forwards data; any additional processing is performed by external Feature Providers (FPs). FPs provide and manage features, scaling and moving them in response to customer demand, and providing automated recovery in case of failure. Benefits to the enterprise include reduced cost and management complexity, improved features through FP specialization, and increased choice in services. Central to the model are a policy component and a Feature API (FAPI). Policy is specified with features not locations, enabling features to be located anywhere. FAPI enables communication between enterprise and FP control planes to share policy and configuration.
@inproceedings{gibb2012outsourcing,
title={Outsourcing Network Functionality},
author={Gibb, Glen and Zeng, Hongyi and McKeown, Nick},
booktitle={HotSDN},
year={2012}
}
@inproceedings{das2012why,
title={Why OpenFlow/SDN Can Succeed Where GMPLS Failed},
author={Das, Saurav and Parulkar, Guru and McKeown, Nick},
booktitle={ECOC},
year={2012}
}
We demonstrate MPLS Traffic Engineering (MPLS-TE) and MPLS-based Virtual Private Networks (MPLS VPNs) using OpenFlow and NOX. The demonstration is the outcome of an engineering experiment to answer the following questions: How hard is it to implement a complex control plane on top of a network controller such as NOX? Does the global vantage point in NOX make the implementation easier than the traditional method of implementing it on every switch, embedded in the data plane? We implemented every major feature of MPLS-TE and MPLS-VPN in just 2,000 lines of code, far less than more traditional implementations such as Quagga-MPLS require. Because NOX maintains a consistent, up-to-date topology map, the MPLS control plane features are no longer tied to multiple protocols that would normally have to be changed.
@inproceedings{sharafat2011mpls,
title={MPLS-TE and MPLS VPNS with OpenFlow},
author={Sharafat, Ali Reza and Das, Saurav and Parulkar, Guru and McKeown, Nick},
booktitle={ACM SIGCOMM},
year={2011}
}
Despite the popularity of home networks, they face a number of systemic problems: (i) Broadband networks are expensive to deploy, and it is not clear how the cost can be shared by several service providers; (ii) Home networks are getting harder to manage as we connect more devices, use new applications, and rely on them for entertainment, communication and work—it is common for home networks to be poorly managed, insecure or just plain broken; and (iii) It is not clear how home networks will steadily improve, after they have been deployed, to provide steadily better service to home users. In this paper we propose slicing home networks as a way to overcome these problems. As a mechanism, slicing allows multiple service providers to share a common infrastructure and supports the independent evolution of services.
@inproceedings{yiakoumis2011slicing,
title={Slicing Home Networks},
author={Yiakoumis, Yiannis and Yap, Kok-Kiong and Katti, Sachin and Parulkar, Guru and McKeown, Nick},
booktitle={SIGCOMM Workshop on Home Networks},
year={2011}
}
We demonstrate a converged OpenFlow-enabled packet-circuit network, where circuit flow properties (guaranteed bandwidth, low latency, recovery) provide differential treatment to dynamically aggregated packet flows for voice, video, and web traffic.
@inproceedings{das2011applicationaware,
title = {Application-Aware Aggregation and Traffic Engineering in a Converged Packet-Circuit Network},
author = {Saurav Das and Yiannis Yiakoumis and Guru M. Parulkar and Preeti Singh and Daniel Getachew and Premal Dinesh Desai and Nick McKeown},
booktitle = {OFC/NFOEC},
year = {2011}
}
We propose a new approach to MPLS that keeps the standard MPLS data plane but pairs it with a simpler, extensible OpenFlow-based control plane. We demonstrate this approach using a prototype system for MPLS Traffic Engineering.
@inproceedings{das2011mpls,
title = {MPLS with a Simple OPEN Control Plane},
author = {Saurav Das and Ali Reza Sharafat and Guru M. Parulkar and Nick McKeown},
booktitle = {OFC/NFOEC},
year = {2011}
}
Network operators want additional functionality from the networks they manage. The current approach to adding functionality is to deploy middleboxes. Unfortunately, middleboxes raise concerns regarding robustness, correctness, and efficiency due to their need to be deployed at chokepoints. This paper provides some initial thoughts on solving the middlebox problem in an architectural way. We believe that waypoint services are the correct way to add functionality to a network. Network processing can be modeled as classification followed by action. Additional functionality should be added to the network through a service model exposed as new actions. Services would be implemented at waypoints which reside off the normal packet path; routers can send traffic to those services for additional processing. The waypoint service model allows services to be hosted anywhere within the network, allows services to be shared by multiple routers, and is accessible via a simple action API. Abstracting custom packet processing as waypoint services provides a systematic way to bring new functionality to the network.
@inproceedings{gibb2011initial,
title = {Initial thoughts on custom network processing via waypoint services},
author = {Glen Gibb and Hongyi Zeng and Nick McKeown},
booktitle = {Workshop on Infrastructures for Software/Hardware Co-Design (WISH), CGO},
year = {2011}
}
@article{ghodsi2011architecting,
title={Architecting for Innovation},
author={Ghodsi, Ali and Godfrey, P. Brighten and McKeown, Nick and Parulkar, Guru and Raghavan, Barath},
journal={ACM Computer Communication Review},
year={2011}
}
Distributing a hardware design across multiple physical devices is difficult: splitting a design across two chips requires considerable effort to partition the design and to build the communication mechanism between the chips. Designers and researchers would benefit enormously if this were easier, as it would, for example, allow multiple FPGAs to be used when building prototypes. To this end we propose OpenPipes, a platform that allows hardware designs to be distributed across physical resources. OpenPipes follows the model of many system-building platforms: systems are built by composing modules together. What makes it unique is that it uses an OpenFlow network as the interconnect between modules, providing OpenPipes with complete control over all traffic flows within the interconnect. Any device that can attach to the network can host modules, allowing software modules to be used alongside hardware modules. The control provided by OpenFlow allows running systems to be modified dynamically, and as we show in the paper, OpenPipes provides a mechanism for migrating from software to hardware modules that simplifies testing.
@inproceedings{gibb2010openpipes,
title = {OpenPipes: Making Distributed Hardware Systems Easier},
author = {Glen Gibb and Nick McKeown},
booktitle = {International Conference on Field-Programmable Technology (FPT)},
pages = {381--384},
year = {2010}
}
A persistent problem in computer network research is validation. When deciding how to evaluate a new feature or bug fix, researchers face an uncomfortable trade-off between the realism of the test environment and the effort required to deploy it. We argue that the production network itself should serve as the testbed. We describe FlowVisor, a special purpose OpenFlow controller that creates rich slices of network resources and delegates control of each slice to a different controller. This allows multiple isolated experimental networks to run simultaneously on the same production hardware, enabling realistic testing without disrupting production traffic.
@inproceedings{sherwood2010can,
title={Can the Production Network Be the Testbed?},
author={Sherwood, Rob and Gibb, Glen and Yap, Kok-Kiong and Appenzeller, Guido and Casado, Martin and McKeown, Nick and Parulkar, Guru},
booktitle={USENIX OSDI},
year={2010}
}
IP and transport networks are controlled and operated independently today, leading to significant capex and opex inefficiencies for providers. We discuss a unified approach based on OpenFlow, and present a recent demonstration of a unified control plane for OpenFlow-enabled IP/Ethernet and TDM switched networks.
@inproceedings{das2010packet,
title = {Packet and Circuit Network Convergence with OpenFlow},
author = {Saurav Das and Guru Parulkar and Preeti Singh and Daniel Getachew and Lyndon Ong and Nick McKeown},
booktitle = {Optical Fiber Communication Conference (OFC)},
year = {2010}
}
OpenRoads is a mobile wireless testbed built on the OpenFlow platform that empowers researchers to deploy and evaluate new protocols and services in a realistic environment. OpenRoads provides a programmatic interface for managing wireless networks, including WiMAX and WiFi, allowing researchers to control network behavior through software. We describe the architecture and deployment of OpenRoads on the Stanford campus, and demonstrate how it enables practical research in mobile networking.
@inproceedings{yap2009openroads,
title={OpenRoads: Empowering Research in Mobile Networks},
author={Yap, Kok-Kiong and Kobayashi, Masayoshi and Sherwood, Rob and Handigol, Nikhil and Huang, Te-Yuan and Chan, Michael and McKeown, Nick},
booktitle={ACM SIGCOMM},
year={2009}
}
Current approaches to network security try to prevent attackers from gaining access. We propose a complementary approach: delegate the decision of whether to trust a communication to the end-host by providing it with information about the provenance of traffic. We describe a system that allows end-hosts to learn more about where their traffic is coming from, enabling them to make informed decisions about which network communications to accept.
@inproceedings{naous2009delegating,
title = {Delegating Network Security Through More Information},
author = {Jad Naous and Ryan Stutsman and David Mazieres and Nick McKeown and Nickolai Zeldovich},
booktitle = {Workshop on Research on Enterprise Networking (WREN)},
year = {2009}
}
We demonstrate a lossless handover mechanism between WiFi and WiMAX networks on the OpenRoads testbed. Using OpenFlow-based n-casting, the system sends duplicate packets over multiple wireless networks simultaneously, enabling seamless transitions between access technologies without packet loss during handoff.
@inproceedings{yap2009demo,
title = {Demo: Lossless Handover with n-casting between WiFi-WiMAX on OpenRoads},
author = {Kok-Kiong Yap and Te-Yuan Huang and Masayoshi Kobayashi and Michael Chan and Rob Sherwood and Guru Parulkar and Nick McKeown},
booktitle = {ACM MobiCom},
year = {2009}
}
We have built and deployed OpenRoads, a testbed that allows multiple network experiments to be conducted concurrently in a production network. For example, multiple routing protocols, mobility managers, and network access controllers can run simultaneously in the same network. In this paper, we describe and discuss our deployment of the testbed at Stanford University. We focus on the challenges we faced deploying in a production network, and the tools we built to overcome these challenges. Our goal is to gain enough experience for other groups to deploy OpenRoads in their campus networks.
@inproceedings{yap2009stanford,
title = {The Stanford OpenRoads Deployment},
author = {Kok-Kiong Yap and Masayoshi Kobayashi and David Underhill and Srinivasan Seetharaman and Peyman Kazemian and Nick McKeown},
booktitle = {WiNTECH},
year = {2009}
}
There have been many attempts to unify the control and management of circuit and packet switched networks, but none have taken hold. In this paper we propose a simple way to unify both types of network using OpenFlow. The basic idea is that a simple flow abstraction fits well with both types of network, provides a common paradigm for control, and makes it easy to insert new functionality into the network. OpenFlow provides a common API to the underlying hardware, and allows all of the routing, control and management to be defined in software outside the datapath.
@inproceedings{das2009unifying,
title = {Unifying Packet and Circuit Switched Networks},
author = {Saurav Das and Guru Parulkar and Nick McKeown},
booktitle = {Workshop on Below IP Networking (GLOBECOM)},
year = {2009}
}
There has usually been a clean separation between networks and the applications that use them. Applications send packets over a simple socket API; the network delivers them. However, there are many occasions when applications can benefit from more direct interaction with the network: to observe more of the current network state and to obtain more control over the network behavior. This paper explores some of the potential benefits of closer interaction between applications and the network. Exploiting the emergence of so-called "software-defined networks" (SDN) built above network-wide control planes, we explore how to build a more "software-friendly network". We present results from a preliminary exploration that aims to provide network services to applications via an explicit communication channel.
@inproceedings{yap2010towards,
title = {Towards Software-Friendly Networks},
author = {Kok-Kiong Yap and Te-Yuan Huang and Ben Dodson and Monica S. Lam and Nick McKeown},
booktitle = {ACM APSys},
year = {2010}
}
This paper proposes PhoneNet, an application framework to support direct group communication among phones without relay nodes. PhoneNet presents the familiar abstraction of a multi-user chat service to application writers. It performs two main functions: inviting participants to the chat room and routing data between participants directly, without going through any intermediaries. Made possible by a generic chat-room service embedded in the network itself, all application-specific code in PhoneNet applications runs on the phones themselves. Unlike the conventional server-client model, this design does not require scalable central servers that can handle all simultaneous interactions. As a first step, we have created a prototype of PhoneNet that works within an administrative domain. The multicast functionality among phones is implemented on top of a software-defined network (SDN). We have developed two applications using PhoneNet: teleconferencing and photo sharing. Our experience suggests that it is easy to develop PhoneNet applications, and PhoneNet appears to be effective in reducing network traffic.
@inproceedings{huang2010phonenet,
title = {PhoneNet: a Phone-to-Phone Network for Group Communication within an Administrative Domain},
author = {Te-Yuan Huang and Kok-Kiong Yap and Ben Dodson and Monica S. Lam and Nick McKeown},
booktitle = {ACM MobiHeld},
year = {2010}
}
@inproceedings{gudla2010experimental,
title = {Experimental Demonstration of OpenFlow Control of Packet and Circuit Switches},
author = {Vinesh Gudla and Saurav Das and Anujit Shastri and Guru Parulkar and Nick McKeown and Leonid Kazovsky},
booktitle = {OFC/NFOEC},
year = {2010}
}
We demonstrate FlowVisor, a special purpose OpenFlow controller that slices the network, allowing multiple researchers to run independent experiments on the same production OpenFlow network simultaneously. FlowVisor provides isolation between slices, ensuring that research experiments cannot interfere with production traffic or with each other.
@inproceedings{sherwood2009demo,
title = {Demo: Carving Research Slices Out of Your Production Networks with OpenFlow},
author = {Rob Sherwood and Michael Chan and Adam Covington and Glen Gibb and Mario Flajslik and Nikhil Handigol and Te-Yuan Huang and Peyman Kazemian and Masayoshi Kobayashi and Jad Naous and Srinivasan Seetharaman and David Underhill and Tatsuya Yabe and Kok-Kiong Yap and Yiannis Yiakoumis and Hongyi Zeng and Guido Appenzeller and Ramesh Johari and Nick McKeown and Guru Parulkar},
booktitle = {ACM SIGCOMM},
year = {2009}
}
Enterprise networks are complex and difficult to manage. Current networks are built from a variety of devices from multiple vendors, each with their own configuration interfaces, making it hard to implement and enforce consistent network-wide policies. We have implemented Ethane in both hardware and software and deployed it in the Stanford Computer Science department. Ethane allows managers to define a single network-wide policy, and then enforces it directly using simple flow-based switches and a centralized controller.
@article{casado2009rethinking,
title={Rethinking Enterprise Network Control},
author={Casado, Martin and Freedman, Michael J. and Pettit, Justin and Luo, Jianying and Gude, Natasha and McKeown, Nick and Shenker, Scott},
journal={IEEE/ACM Transactions on Networking},
volume={17},
number={4},
year={2009}
}
We describe an implementation of an OpenFlow switch on the NetFPGA platform, an open-source hardware platform for networking research. Our implementation supports the OpenFlow specification, allowing researchers to experiment with OpenFlow-based networks using commodity hardware. We describe the architecture, implementation, and performance of the switch.
@inproceedings{naous2008implementing,
title = {Implementing an OpenFlow Switch on the NetFPGA platform},
author = {Jad Naous and David Erickson and Adam Covington and Guido Appenzeller and Nick McKeown},
booktitle = {ACM/IEEE ANCS},
year = {2008}
}
Networks are now an integral part of every aspect of our daily lives. Managing and controlling the network infrastructure has become increasingly complex. The programming model made available by current networking equipment is primitive, much like the computing environment before the development of operating systems. NOX is a network operating system that provides a programmatic interface to the entire network, giving applications a centralized network view and enabling high-level management policies. NOX manages a network the same way that a traditional operating system manages a computer: it provides useful abstractions for the underlying physical resources and a uniform programmatic interface for managing them.
@article{gude2008nox,
title={NOX: Towards an Operating System for Networks},
author={Gude, Natasha and Koponen, Teemu and Pettit, Justin and Pfaff, Ben and Casado, Martin and McKeown, Nick and Shenker, Scott},
journal={ACM Computer Communication Review},
year={2008}
}
This whitepaper proposes OpenFlow: a way for researchers to run experimental protocols in the networks they use every day. OpenFlow is based on an Ethernet switch, with an internal flow-table, and a standardized interface to add and remove flow entries. Our goal is to encourage networking vendors to add OpenFlow to their switch products for deployment in college campus backbones and wiring closets. We believe that OpenFlow is a pragmatic compromise: on one hand, it allows researchers to run experiments on heterogeneous switches in a uniform way at line-rate and with high port-density; while on the other hand, vendors do not need to expose the internal workings of their switches. In addition to allowing researchers to evaluate their ideas in real-world traffic settings, OpenFlow could serve as a useful campus component in testbeds like GENI.
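The flow-table abstraction at the heart of OpenFlow can be sketched in a few lines: priority-ordered entries match on header fields and yield an action, with a table-miss entry punting to the controller. The field names and actions below are illustrative, not the OpenFlow wire format.

```python
# Toy priority-ordered flow table (illustrative field names and actions).
FLOW_TABLE = [
    # (priority, match fields, action); absent fields are wildcarded
    (200, {"ip_dst": "10.0.0.2", "tcp_dst": 80}, "output:3"),
    (100, {"ip_dst": "10.0.0.2"},                "output:2"),
    (0,   {},                                    "controller"),  # table miss
]

def flow_lookup(pkt):
    """Return the action of the highest-priority matching entry."""
    for _, match, action in sorted(FLOW_TABLE, key=lambda e: e[0], reverse=True):
        if all(pkt.get(k) == v for k, v in match.items()):
            return action
    return "drop"
```

The standardized interface the whitepaper proposes is essentially remote add/remove of entries like these, leaving the vendor's internal switch implementation opaque.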
@article{mckeown2008openflow,
title={OpenFlow: Enabling Innovation in Campus Networks},
author={McKeown, Nick and Anderson, Tom and Balakrishnan, Hari and Parulkar, Guru and Peterson, Larry and Rexford, Jennifer and Shenker, Scott and Turner, Jonathan},
journal={ACM SIGCOMM Computer Communication Review},
volume={38},
number={2},
pages={69--74},
year={2008}
}
This paper presents Ethane, a new network architecture for the enterprise. Ethane allows managers to define a single network-wide fine-grain policy, and then enforces it directly. Ethane couples extremely simple flow-based Ethernet switches with a centralized controller that manages the admittance and routing of flows. While radical, this design is backwards-compatible with existing hosts and switches. We have built and deployed Ethane in our university department, and we describe our experience over a seven-month period.
@inproceedings{casado2007ethane,
title={Ethane: Taking Control of the Enterprise},
author={Casado, Martin and Freedman, Michael J. and Pettit, Justin and Luo, Jianying and McKeown, Nick and Shenker, Scott},
booktitle={ACM SIGCOMM},
year={2007}
}
Ethane is a new network architecture that uses a centralized controller to manage the admittance and routing of flows through the network. This paper describes the design and implementation of Ethane flow switches -- the simple forwarding elements that make up an Ethane network. Through the introduction of middleboxes, Ethane provides fine-grained access control and network management without requiring complex configuration of traditional enterprise switches and routers.
@inproceedings{luo2007prototyping,
title={Prototyping Fast, Simple, Secure Switches for Ethane},
author={Luo, Jianying and Pettit, Justin and Casado, Martin and Lockwood, John and McKeown, Nick},
booktitle={Hot Interconnects},
year={2007}
}
Despite the critical importance of enterprise network security, today's defenses are difficult to configure and manage. Enterprise networks are protected by a combination of complex routing and bridging policies, firewalls, NATs, and other middleboxes. SANE is a protection architecture for enterprise networks that replaces these mechanisms with a single, clean design. In SANE, all routing and access control decisions are made by a logically centralized server that authenticates all communication and grants access according to a network-wide policy.
@inproceedings{casado2006sane,
title={SANE: A Protection Architecture for Enterprise Networks},
author={Casado, Martin and Garfinkel, Tal and Akella, Aditya and Freedman, Michael and Boneh, Dan and McKeown, Nick and Shenker, Scott},
booktitle={15th USENIX Security Symposium},
year={2006}
}
Network Verification & Debugging
Hydra: Effective Runtime Network Verification
Sundararajan Renganathan, Benny Rubin, Hyojoon Kim, Pier Luigi Ventre, Carmelo Cascone, Daniele Moro, Charles Chan, Nick McKeown, Nate Foster
It is notoriously difficult to verify that a network is behaving as intended, especially at scale. This paper presents Hydra, a system that uses ideas from runtime verification to check that every packet is correctly processed with respect to a specification in real time. We propose a domain-specific language for writing properties, called Indus, and we develop a compiler that turns properties thus specified into executable P4 code that runs alongside the forwarding code at line rate. To evaluate our approach, we used Indus to model a range of properties, showing that it is expressive enough to capture examples studied in prior work. We also deployed Hydra checkers for validating paths in source routing and for enforcing slice isolation in Aether, an open-source cellular platform. We confirmed a subtle bug in Aether's 5G mobile core that would have been hard to detect using static techniques. We also evaluated the overheads of Hydra on hardware, finding that it does not significantly increase latency and often does not require additional pipeline stages.
@inproceedings{renganathan2023hydra,
title={Hydra: Effective Runtime Network Verification},
author={Renganathan, Sundararajan and Rubin, Benny and Kim, Hyojoon and Ventre, Pier Luigi and Cascone, Carmelo and Moro, Daniele and Chan, Charles and McKeown, Nick and Foster, Nate},
booktitle={ACM SIGCOMM},
year={2023}
}
@inproceedings{handigol2014know,
title={I Know What Your Packet Did Last Hop: Using Packet Histories to Troubleshoot Networks},
author={Handigol, Nikhil and Heller, Brandon and Jeyakumar, Vimalkumar and Mazieres, David and McKeown, Nick},
booktitle={USENIX NSDI},
year={2014}
}
@inproceedings{zeng2014libra,
title={Libra: Divide and Conquer to Verify Forwarding Tables in Huge Networks},
author={Zeng, Hongyi and Zhang, Shidong and Ye, Fei and Jeyakumar, Vimalkumar and Ju, Mickey and Liu, Junda and McKeown, Nick and Vahdat, Amin},
booktitle={USENIX NSDI},
year={2014}
}
Network state may change rapidly in response to customer demands, load conditions or configuration changes. But the network must also ensure correctness conditions such as isolating tenants from each other and from critical services. Existing policy checkers cannot verify compliance in real time because of the need to collect state from the entire network and the time it takes to analyze this state. SDNs provide an opportunity in this respect as they provide a logically centralized view from which every proposed change can be checked for compliance with policy. But there remains the need for a fast compliance checker. Our paper introduces a real time policy checking tool called NetPlumber based on Header Space Analysis (HSA). Unlike HSA, however, NetPlumber incrementally checks for policy violations as new rules are inserted, rather than checking the entire network from scratch.
@inproceedings{kazemian2013real,
title={Real Time Network Policy Checking using Header Space Analysis},
author={Kazemian, Peyman and Chang, Michael and Zeng, Hongyi and Varghese, George and McKeown, Nick and Whyte, Scott},
booktitle={USENIX NSDI},
year={2013}
}
Brandon Heller, Colin Scott, Nick McKeown, Scott Shenker, Andreas Wundsam, Hongyi Zeng, Sam Whitlock, Vimalkumar Jeyakumar, Nikhil Handigol, James McCauley, Kyriakos Zarifis, Peyman Kazemian
We present a systematic approach to troubleshooting software-defined networks by leveraging the SDN layering architecture. When a network problem occurs, our system automatically identifies which layer of the SDN stack is responsible for the issue, using existing debugging tools such as NetPlumber and VeriFlow at the appropriate layer. By decomposing the debugging problem across SDN layers, we can pinpoint bugs faster and more accurately than approaches that treat the network as a monolithic system.
@inproceedings{heller2013leveraging,
title={Leveraging SDN Layering to Systematically Troubleshoot Networks},
author={Heller, Brandon and Scott, Colin and McKeown, Nick and Shenker, Scott and Wundsam, Andreas and Zeng, Hongyi and Whitlock, Sam and Jeyakumar, Vimalkumar and Handigol, Nikhil and McCauley, James and Zarifis, Kyriakos and Kazemian, Peyman},
booktitle={ACM HotSDN},
year={2013}
}
Networks are getting larger and more complex, yet administrators rely on rudimentary tools such as ping and traceroute to troubleshoot problems. We present ATPG, a system that automatically generates a minimum set of test packets to test the liveness and correctness of all rules in a network. ATPG reads router configurations and generates a minimum set of test packets that traverse every link and test every forwarding rule in the network, greatly simplifying the testing and debugging of networks.
@article{zeng2014automatic,
title={Automatic Test Packet Generation},
author={Zeng, Hongyi and Kazemian, Peyman and Varghese, George and McKeown, Nick},
journal={IEEE/ACM Transactions on Networking},
year={2014}
}
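Selecting a minimum set of test packets that exercises every rule is an instance of set cover, which is NP-hard, so a greedy approximation is the natural approach. A minimal sketch of that greedy step, with hypothetical probe and rule names (this is an illustration of the idea, not ATPG's implementation):

```python
def greedy_test_packets(candidates):
    """Pick a small set of test packets covering every forwarding rule.

    `candidates` maps each candidate packet to the set of rules it
    exercises. Greedy set cover: repeatedly keep the packet that
    covers the most not-yet-covered rules.
    """
    uncovered = set().union(*candidates.values())
    chosen = []
    while uncovered:
        best = max(candidates, key=lambda p: len(candidates[p] & uncovered))
        if not candidates[best] & uncovered:
            break  # remaining rules are not exercised by any candidate
        chosen.append(best)
        uncovered -= candidates[best]
    return chosen

# Example: three candidate probes over four rules.
packets = {
    "probe_a": {"r1", "r2"},
    "probe_b": {"r2", "r3", "r4"},
    "probe_c": {"r4"},
}
print(greedy_test_packets(packets))  # ['probe_b', 'probe_a']
```

Periodically replaying the chosen probes then monitors every rule; a failing probe narrows fault localization to the rules only it exercises.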
The behavior of a Software-Defined Network is controlled by programs, which like all software, will have bugs—but this programmatic control also enables new ways to debug networks. This paper introduces ndb, a prototype network debugger inspired by gdb, which implements two primitives useful for debugging an SDN: breakpoints and packet backtraces. We show how ndb modifies forwarding state and logs packet digests to rebuild the sequence of events leading to an errant packet, providing SDN programmers and operators with a valuable tool for tracking down the root cause of a bug.
@inproceedings{handigol2012where,
title={Where is the Debugger for my Software-Defined Network?},
author={Handigol, Nikhil and Heller, Brandon and Jeyakumar, Vimalkumar and Mazieres, David and McKeown, Nick},
booktitle={HotSDN},
year={2012}
}
We present Header Space Analysis, a framework for statically checking network specifications and configurations. Our framework uses a geometric model of packet headers to check for network-wide invariants, such as reachability failures, forwarding loops, and traffic isolation properties. By treating the header as a point in a geometric space, we can use set operations to compute the set of all possible packet headers that can reach from any port to any other port in the network. We implement our framework and apply it to the Stanford University backbone network, where it finds several configuration errors.
@inproceedings{kazemian2012header,
title={Header Space Analysis: Static Checking for Networks},
author={Kazemian, Peyman and Varghese, George and McKeown, Nick},
booktitle={USENIX NSDI},
year={2012}
}
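Header Space Analysis models each header bit as 0, 1, or wildcard "x", so a match pattern denotes a set of concrete headers and reachability reduces to set operations. A toy sketch of wildcard intersection under that model (an illustration, not the paper's implementation):

```python
def intersect(h1, h2):
    """Intersect two wildcard headers over {'0','1','x'}, bit by bit.

    Returns the pattern matching both, or None if the sets are disjoint.
    """
    out = []
    for a, b in zip(h1, h2):
        if a == "x":
            out.append(b)
        elif b == "x" or a == b:
            out.append(a)
        else:
            return None  # conflicting concrete bits: empty intersection
    return "".join(out)

# Headers matching "1x0x" that also match a rule pattern "110x":
print(intersect("1x0x", "110x"))  # -> "110x"
print(intersect("1x0x", "0xxx"))  # -> None (disjoint sets)
```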
Certain types of errors, such as hardware failure, link failure, or congestion, can only be detected at runtime, as these are failures that happen when the network is in operation. Today, there exist only preliminary tools to detect and pinpoint such kinds of errors, typified by ping and traceroute. In this paper we introduce a formal and comprehensive way to fully test networks at runtime which enables us to pinpoint the exact rule that is causing the problem. Our test packet generation algorithm first picks a relatively small number of test packets and corresponding input ports so that these packets exercise all the rules in the network. By periodically sending these packets through the network, we can constantly monitor the health of the network. Whenever an error happens, the fault localization algorithm uses the results of all test packets to find the exact rule that is at fault.
@techreport{zengformal,
title = {Formal Network Testing},
author = {Zeng, Hongyi and Kazemian, Peyman and Varghese, George and McKeown, Nick},
institution = {Technical Report, Stanford University, Stanford, CA / UCSD, San Diego / Yahoo! Labs, Santa Clara, CA},
year = {}
}
Data Center Networks
Chronos: Prescheduled Circuit Switching for LLM Training (Best Paper)
Sundararajan Renganathan, Nick McKeown
2nd Workshop on Networks for AI Computing (NAIC), ACM SIGCOMM 2025
The all-to-all collective communications used in training large language models create a predictable, periodic traffic pattern. We exploit this predictability with Chronos, a network that preschedules optical circuit switches to provide each server-pair with a guaranteed rate. Because the traffic matrix is known in advance, Chronos computes a recurring schedule of circuit configurations before training begins, eliminating run-time scheduling overhead. We show that Chronos can meet the bandwidth requirements of common LLM parallelism strategies while reducing network cost by replacing expensive packet switches with simpler optical circuit switches.
@inproceedings{renganathan2025chronos,
title={Chronos: Prescheduled Circuit Switching for LLM Training},
author={Renganathan, Sundararajan and McKeown, Nick},
booktitle={2nd Workshop on Networks for AI Computing (NAIC), ACM SIGCOMM},
year={2025}
}
The nanoPU: A Nanosecond Network Stack for Datacenters
Stephen Ibanez, Alex Mallery, Serhat Arslan, Theo Jepsen, Muhammad Shahbaz, Changhoon Kim, Nick McKeown
We present the nanoPU, a new NIC-CPU co-design to accelerate an increasingly pervasive class of datacenter applications: those that utilize many small Remote Procedure Calls (RPCs) with very short microsecond-scale processing times. The novel aspect of the nanoPU is the design of a fast path between the network and the CPU that bypasses the cache and memory hierarchy, placing arriving messages directly into the CPU register file. We implement a prototype nanoPU in an FPGA and show it can provide a thread-to-thread RPC latency of about 69 nanoseconds, more than ten times faster than existing systems.
@inproceedings{ibanez2021nanopu,
title={The nanoPU: A Nanosecond Network Stack for Datacenters},
author={Ibanez, Stephen and Mallery, Alex and Arslan, Serhat and Jepsen, Theo and Shahbaz, Muhammad and Kim, Changhoon and McKeown, Nick},
booktitle={USENIX OSDI},
year={2021}
}
NanoTransport: A Low-Latency, Programmable Transport Layer for NICs
Serhat Arslan, Stephen Ibanez, Alex Mallery, Changhoon Kim, Nick McKeown
Transport protocols can be implemented in NIC hardware to increase throughput and reduce latency, but hardware implementations are inflexible and difficult to modify. NanoTransport is a programmable transport layer implemented on a NIC that supports microsecond-scale latency while allowing transport protocols to be changed in software. NanoTransport provides a simple programming interface for defining custom transport protocols, enabling rapid experimentation with new transport designs without sacrificing performance.
@inproceedings{arslan2021nanotransport,
title={NanoTransport: A Low-Latency, Programmable Transport Layer for NICs},
author={Arslan, Serhat and Ibanez, Stephen and Mallery, Alex and Kim, Changhoon and McKeown, Nick},
booktitle={ACM SOSR},
year={2021}
}
A Distributed Algorithm to Calculate Max-Min Fair Rates Without Per-Flow State
Lavanya Jose, Stephen Ibanez, Mohammad Alizadeh, Nick McKeown
Most congestion control algorithms, like TCP, rely on a reactive control system that detects congestion, then marches carefully towards a desired operating point (e.g. by modifying the window size or adjusting a rate). In an effort to balance stability and convergence speed, they often take hundreds of RTTs to converge; an increasing problem as networks get faster, with less time to react. This paper is about an alternative class of congestion control algorithms based on proactive scheduling: switches and NICs "proactively" exchange control messages to run a distributed algorithm to pick "explicit rates" for each flow. We call these Proactive Explicit Rate Control (PERC) algorithms. They take as input the routing matrix and link speeds, but not a congestion signal. By exploiting information such as the number of flows at a link, they can converge an order of magnitude faster than reactive algorithms.
@inproceedings{jose2019distributed,
title={A Distributed Algorithm to Calculate Max-Min Fair Rates Without Per-Flow State},
author={Jose, Lavanya and Ibanez, Stephen and Alizadeh, Mohammad and McKeown, Nick},
booktitle={ACM SIGMETRICS},
year={2019}
}
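The fixed point such algorithms converge to is the classic max-min fair allocation, which a centralized water-filling computation finds by repeatedly freezing the most constrained link. A sketch of that baseline (the paper's contribution is computing the same allocation distributedly, without per-flow state; names here are illustrative):

```python
def max_min_rates(links, flows):
    """Centralized water-filling for max-min fair rates.

    links: link id -> capacity. flows: flow id -> set of links crossed.
    """
    rates = {}
    cap = dict(links)
    active = {f: set(path) for f, path in flows.items()}
    while active:
        # Fair share on each link still carrying unfrozen flows.
        share = {}
        for link in cap:
            n = sum(1 for path in active.values() if link in path)
            if n:
                share[link] = cap[link] / n
        # The bottleneck is the link with the smallest fair share;
        # freeze every flow crossing it at that rate.
        bottleneck = min(share, key=share.get)
        r = share[bottleneck]
        for f in [f for f, path in active.items() if bottleneck in path]:
            rates[f] = r
            for link in active[f]:
                cap[link] -= r
            del active[f]
    return rates

# Flow "f" uses link A (cap 10); flow "g" uses A and B (cap 4).
# "g" is bottlenecked at B (rate 4), leaving 6 on A for "f".
print(max_min_rates({"A": 10, "B": 4}, {"f": {"A"}, "g": {"A", "B"}}))
```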
We present pFabric, a minimalistic datacenter transport design that achieves near-optimal flow completion times for short flows while maintaining high throughput for large flows. pFabric has a very simple design: switches implement a priority queue that schedules packets based on a single priority number, and the endpoints implement a minimal rate control mechanism. Despite its simplicity, we show through analysis and simulation that pFabric achieves flow completion times within 10-15% of the optimal for a wide range of workloads.
@inproceedings{alizadeh2013pfabric,
title={pFabric: Minimal Near-Optimal Datacenter Transport},
author={Alizadeh, Mohammad and Yang, Shuang and Sharif, Milad and Katti, Sachin and McKeown, Nick and Prabhakar, Balaji and Shenker, Scott},
booktitle={ACM SIGCOMM},
year={2013}
}
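The switch-side mechanism in pFabric is deliberately minimal: dequeue whichever buffered packet carries the best (lowest) priority number, typically the flow's remaining size. A toy sketch of such an output port using a heap (illustrative, not the paper's hardware design):

```python
import heapq

class PriorityQueuePort:
    """Toy pFabric-style port: serve the lowest priority number first."""

    def __init__(self):
        self._heap = []
        self._seq = 0  # FIFO tie-break among equal priorities

    def enqueue(self, priority, packet):
        heapq.heappush(self._heap, (priority, self._seq, packet))
        self._seq += 1

    def dequeue(self):
        return heapq.heappop(self._heap)[2]

port = PriorityQueuePort()
port.enqueue(50_000, "pkt of a 50 KB flow")  # priority = remaining bytes
port.enqueue(1_200, "pkt of a short flow")
print(port.dequeue())  # the short flow's packet is served first
```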
We observe that the requirements for datacenter transport are fundamentally different from the wide-area Internet. In the datacenter, the network fabric is administered by a single entity, latencies are small, and the switching elements can be enhanced with new functionality. This suggests that datacenter transport should be deconstructed: instead of implementing all transport functions at the end-host, some functions are better performed inside the network fabric where switches have more timely and complete information about congestion. We describe a transport architecture that decouples flow scheduling from rate control, placing scheduling decisions in the network and rate control at the end-hosts.
@inproceedings{alizadeh2012deconstructing,
title={Deconstructing Datacenter Packet Transport},
author={Alizadeh, Mohammad and Yang, Shuang and Katti, Sachin and McKeown, Nick and Prabhakar, Balaji and Shenker, Scott},
booktitle={HotNets},
year={2012}
}
Martin Casado, David Erickson, Igor Anatolyevich Ganichev, Rean Griffith, Brandon Heller, Nick Mckeown, Daekyeong Moon, Teemu Koponen, Scott Shenker, Kyriakos Zarifis
EECS Department
University of California, Berkeley
Technical Report No. UCB/EECS-2010-93
In this demo, we present Ripcord, a modular platform for rapidly prototyping scale-out data center networks. Ripcord enables researchers to build and evaluate new network features and topologies, using only commercially available hardware and open-source software. The Ripcord demo will show three examples of custom network functions, operating together, on top of a 160-node cluster. The first is a routing engine that isolates classes of traffic. The second is a dynamic network manager that adjusts links and switch power states to reduce energy. The third is a statistics aggregator that supports network health monitoring and automatic alerts. The demo will be interactive, with a visualization of live parameters for each link and switch, such as bandwidth, drops, and power status, as well as a control panel to modify the traffic load. We feel that an interactive demo is the best way to introduce the research community to Ripcord and get their feedback.
@techreport{casado2010ripcord,
title = {Ripcord: A Modular Platform For Data Center Networking},
author = {Casado, Martin and Erickson, David and Ganichev, Igor Anatolyevich and Griffith, Rean and Heller, Brandon and McKeown, Nick and Moon, Daekyeong and Koponen, Teemu and Shenker, Scott and Zarifis, Kyriakos},
institution = {EECS Department, University of California, Berkeley, Technical Report No. UCB/EECS-2010-93},
year = {2010}
}
David Erickson, Brandon Heller, Shuang Yang, Jonathan Chu, Jonathan D. Ellithorpe, Scott Whyte, Stephen Stuart, Nick McKeown, Guru M. Parulkar, Mendel Rosenblum
Proceedings of the ACM SIGCOMM 2011 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications, Toronto, ON, Canada, August 2011
Many data centers extensively use virtual machines (VMs), which provide the flexibility to move workload among physical servers. VMs can be placed to maximize application performance, power efficiency, or even fault tolerance. However, VMs are typically repositioned without considering network topology, congestion, or traffic routes. In this demo, we show a system, Virtue, which enables the comparison of different algorithms for VM placement and network routing at the scale of an entire data center. Our goal is to understand how placement and routing affect overall application performance by varying the types and mix of workloads, network topologies, and compute resources; these parameters will be available for demo attendees to explore.
@inproceedings{erickson2011optimizing,
title = {Optimizing a Virtualized Data Center},
author = {Erickson, David and Heller, Brandon and Yang, Shuang and Chu, Jonathan and Ellithorpe, Jonathan D. and Whyte, Scott and Stuart, Stephen and McKeown, Nick and Parulkar, Guru M. and Rosenblum, Mendel},
booktitle = {ACM SIGCOMM (Demo)},
year = {2011}
}
Energy proportionality is a key goal for data center networks. Networks are a shared resource connecting critical IT infrastructure, and the general practice is to keep them always on, even during off-peak hours when they are lightly utilized. ElasticTree is a network-wide energy optimizer that dynamically adjusts the set of active network elements to satisfy changing data center traffic loads. It combines a formal model of the network with traffic information to find the minimum power network subset and to turn off unneeded links and switches. We demonstrate that ElasticTree can save 25-40% of network energy in a data center.
@inproceedings{heller2010elastictree,
title={ElasticTree: Saving Energy in Data Center Networks},
author={Heller, Brandon and Seetharaman, Srini and Mahadevan, Priya and Yiakoumis, Yiannis and Sharma, Puneet and Banerjee, Sujata and McKeown, Nick},
booktitle={USENIX NSDI},
year={2010}
}
David Erickson, Glen Gibb, Brandon Heller, David Underhill, Jad Naous, Guido Appenzeller, Guru Parulkar, Nick McKeown, Mendel Rosenblum, Monica Lam, Sailesh Kumar, Valentina Alaria, Pere Monclus, Flavio Bonomi, Jean Tourrilhes, Praveen Yalagandula, Sujata Banerjee, Charles Clark, Rick McGeer
ACM SIGCOMM, (Best Demo Award), Seattle, WA, August 2008
We demonstrate seamless virtual machine mobility in an OpenFlow network. OpenFlow is an architecture that allows researchers to run experimental protocols in campus networks. Using OpenFlow-enabled switches at Stanford, we show how virtual machines can be migrated between physical hosts while maintaining active network connections, without disrupting ongoing communication.
@inproceedings{erickson2008demo,
title = {Demo: A Demonstration of Virtual Machine Mobility in an OpenFlow network},
author = {Erickson, David and Gibb, Glen and Heller, Brandon and Underhill, David and Naous, Jad and Appenzeller, Guido and Parulkar, Guru and McKeown, Nick and Rosenblum, Mendel and Lam, Monica and Kumar, Sailesh and Alaria, Valentina and Monclus, Pere and Bonomi, Flavio and Tourrilhes, Jean and Yalagandula, Praveen and Banerjee, Sujata and Clark, Charles and McGeer, Rick},
booktitle = {ACM SIGCOMM (Best Demo Award)},
year = {2008}
}
The scale and expense of modern data centers motivates running them as efficiently as possible. This paper explores how virtualized data center performance can be improved when network traffic and topology data informs VM placement. Our practical heuristics, tested on network-heavy, scale-out workloads in an 80 server cluster, improve overall performance by up to 70% compared to random placement in a multi-tenant configuration.
@inproceedings{erickson2014using,
title={Using Network Knowledge to Improve Workload Performance in Virtualized Data Centers},
author={Erickson, David and Heller, Brandon and McKeown, Nick and Rosenblum, Mendel},
booktitle={IC2E},
year={2014}
}
Video Streaming
Sammy: Smoothing Video Traffic to Be a Friendly Internet Neighbor
Bruce Spang, Shravya Kunamalla, Renata Teixeira, Te-Yuan Huang, Grenville Armitage, Ramesh Johari, Nick McKeown
@inproceedings{spang2023sammy,
title={Sammy: Smoothing Video Traffic to Be a Friendly Internet Neighbor},
author={Spang, Bruce and Kunamalla, Shravya and Teixeira, Renata and Huang, Te-Yuan and Armitage, Grenville and Johari, Ramesh and McKeown, Nick},
booktitle={ACM SIGCOMM},
year={2023}
}
In this paper we report on measurements of Netflix video traffic, to understand how the size of router buffers affects the quality of video seen by the end-user (i.e. video QoE). In certain locations, Netflix streams video over congested links, and the resulting packet loss leads to video impairments that can cause the viewer to abandon the stream. Intuitively, larger router buffers lead to fewer drops, better video, and fewer abandonments. But how much buffer is enough? In this paper we describe a set of measurement and analysis techniques that give video service providers a framework for determining this. We describe experiments performed on a congested transoceanic link, where we measured the impact of buffer size on video quality as measured by Netflix's Video Quality of Experience (QoE) metrics. We found that in our setup, reducing the buffer size from 210ms to 30ms of buffering had little impact on the video quality experienced by subscribers.
@inproceedings{spang2019buffer,
title={Buffer Sizing and Video QoE Measurements at Netflix},
author={Spang, Bruce and Walsh, Brady and Huang, Te-Yuan and Rusnock, Tom and Lawrence, J.C. and McKeown, Nick},
booktitle={Workshop on Buffer Sizing},
year={2019}
}
Existing ABR algorithms face a significant challenge in estimating future capacity: capacity can vary widely over time, a phenomenon commonly observed in commercial services. In this work, we suggest an alternative approach to rate adaptation that avoids this problem. We design an ABR algorithm that chooses the video rate based on the current buffer occupancy. We show using analysis that a carefully designed buffer-based approach can provide near-optimal results: it can achieve a video rate close to the maximum average throughput, while maintaining a small probability of rebuffering. We also describe a deployment by Netflix over a period of several weeks with millions of users. Our deployments confirm the theoretical findings: our buffer-based algorithm provides a comparable or improved user experience.
@inproceedings{huang2014buffer,
title={A Buffer-Based Approach to Rate Adaptation: Evidence from a Large Video Streaming Service},
author={Huang, Te-Yuan and Johari, Ramesh and McKeown, Nick and Trunnell, Matthew and Watson, Mark},
booktitle={ACM SIGCOMM},
year={2014}
}
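The heart of a buffer-based algorithm is a rate map f(B) from current buffer occupancy to video rate: the lowest rate inside a safety "reservoir", the highest once the buffer reaches a "cushion", and a rising function in between. A sketch with illustrative bitrates and thresholds (the deployed parameters are not given in this abstract):

```python
# Available bitrates in kbps and illustrative buffer thresholds (seconds).
RATES = [235, 375, 560, 750, 1050, 1750, 3000]
RESERVOIR, CUSHION = 10.0, 110.0  # assumed values, not Netflix's actual ones

def rate_map(buffer_s):
    """Piecewise-linear map from buffer occupancy to video rate."""
    if buffer_s <= RESERVOIR:
        return RATES[0]       # buffer nearly empty: play it safe
    if buffer_s >= CUSHION:
        return RATES[-1]      # buffer comfortable: top rate
    # Linear ramp between the reservoir and the cushion...
    frac = (buffer_s - RESERVOIR) / (CUSHION - RESERVOIR)
    target = RATES[0] + frac * (RATES[-1] - RATES[0])
    # ...quantized down to the highest available rate not above the target.
    return max(r for r in RATES if r <= target)

print(rate_map(5))    # 235
print(rate_map(60))   # 1050
print(rate_map(120))  # 3000
```

Because the decision depends only on the buffer, no throughput estimate is needed; the buffer itself already integrates the history of capacity versus consumption.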
Recent work has shown how hard it is to pick a video streaming rate. Video service providers use heuristics to estimate the network capacity leading to unnecessary rebuffering events and suboptimal video quality. This paper begins with the observation that video streaming is, in essence, a scheduling problem: the video player must schedule the download of each chunk over a shared bottleneck link. We describe a principled approach based on Model Predictive Control that optimizes the bitrate selection and schedules the chunk downloads over a finite time horizon. We implement the algorithm in a client-side video player, and compare it to existing commercial clients in a wide range of network conditions. We find that it reduces the rebuffer rate by 10-20% while achieving similar or better average bitrate.
@inproceedings{huang2013downton,
title={Downton Abbey Without the Hiccups: Buffer-Based Rate Adaptation for HTTP Video Streaming},
author={Huang, Te-Yuan and Johari, Ramesh and McKeown, Nick},
booktitle={ACM SIGCOMM FhMN Workshop},
year={2013}
}
Today's commercial video streaming services use dynamic rate selection to provide a high-quality user experience. Most services host content on standard HTTP servers in CDNs, so rate selection must occur at the client. Picking the right video rate is hard because the capacity of the path between server and client varies over time and is hard to predict. If the selected rate is too high, the video player's buffer will run empty and the video will rebuffer; if the rate is too low, video quality will suffer unnecessarily. In this paper, we examine several state-of-the-art rate-selection algorithms used by commercial services and find that they fall into common traps, leading to behavior we describe as confused, timid, and unstable. We explain the root causes of these behaviors and their effects on user experience.
@inproceedings{huang2012confused,
title={Confused, Timid, and Unstable: Picking a Video Streaming Rate is Hard},
author={Huang, Te-Yuan and Handigol, Nikhil and Heller, Brandon and McKeown, Nick and Johari, Ramesh},
booktitle={ACM IMC},
year={2012}
}
Congestion Control
Unbiased Experiments in Congested Networks
Bruce Spang, Veronica Hannan, Shravya Kunamalla, Te-Yuan Huang, Nick McKeown, Ramesh Johari
When developing a new networking algorithm, it is established practice to run a randomized experiment, or A/B test, to evaluate its performance. In an A/B test, traffic is randomly allocated between a treatment group, which uses the new algorithm, and a control group, which uses the existing algorithm. However, because networks are congested, both treatment and control traffic compete against each other for resources in a way that biases the outcome of these tests. This bias can have a surprisingly large effect; for example, in lab A/B tests with two widely used congestion control algorithms, the treatment appeared to deliver 150% higher throughput when used by a few flows, and 75% lower throughput when used by most flows---despite the fact that the two algorithms have identical throughput when used by all traffic. Beyond the lab, we show that A/B tests can also be biased at scale. In an experiment run in cooperation with Netflix, estimates from A/B tests mistake the direction of change of some metrics, miss changes in other metrics, and overestimate the size of effects. We propose alternative experiment designs, previously used in online platforms, to more accurately evaluate new algorithms and allow experimenters to better understand the impact of congestion on their tests.
@inproceedings{spang2021unbiased,
title={Unbiased Experiments in Congested Networks},
author={Spang, Bruce and Hannan, Veronica and Kunamalla, Shravya and Huang, Te-Yuan and McKeown, Nick and Johari, Ramesh},
booktitle={ACM Internet Measurement Conference (IMC)},
year={2021}
}
Deployed congestion control algorithms rely on end hosts to figure out how congested the network is. Initially, for TCP, the congestion signal was packet drops; a binary signal that kicks in only after the congestion is well underway. More recently, the trend has been towards using RTT as a congestion signal instead (e.g. Timely and BBR). But RTT is a noisy surrogate for congestion; it contains a valuable signal about congestion at the bottleneck but also includes noise from the queuing delay at the non-bottlenecked switches. Taking a step back, it is worth asking: Why don't the switches and routers simply tell us the actual congestion they are experiencing? After all, they must keep track of the precise occupancy of their own queues anyway; they can directly tell the end-hosts. Conventional wisdom said this is too expensive (in terms of additional bits in headers, or complexity and power consumption). We argue that even if this was once the case, it no longer is. Today, it is quite feasible, with negligible increase in power or lost capacity, to report the precise queuing delay at the switches, allowing the end hosts to make more accurate decisions when minimizing required buffering. We explore how this might work using modern programmable switches and NICs that stamp each packet with the queue occupancy (or the maximum queue occupancy along the path), which can be thought of as a multi-bit ECN signal. We provide evidence that the resulting signal is a more accurate indication of congestion at the flow's bottleneck and can lead to higher link utilization and shorter flow completion times than RTT-based algorithms. Consequently, it becomes easier to control required buffer sizes. Our goal here is not to argue for a particular multi-bit ECN algorithm, but to point out that in the future, there is no longer a need to rely on noisy RTT measurements at the edges.
@inproceedings{arslan2019switches,
title={Switches Know the Exact Amount of Congestion},
author={Arslan, Serhat and McKeown, Nick},
booktitle={Workshop on Buffer Sizing},
year={2019}
}
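The proposal amounts to a multi-bit ECN field that each hop overwrites with the worst queueing seen so far, so the receiver learns the bottleneck queue directly instead of inferring it from RTT. A toy sketch of the per-hop update (field names are illustrative, not a real protocol):

```python
def stamp_queue_delay(packet, hop_queue_delay_us):
    """Per-hop update: keep the maximum queueing delay along the path.

    `packet` is a dict standing in for a header with a multi-bit
    congestion field.
    """
    packet["max_queue_delay_us"] = max(
        packet.get("max_queue_delay_us", 0), hop_queue_delay_us
    )
    return packet

pkt = {}
for delay in [3, 41, 7]:   # queueing at three successive switches
    stamp_queue_delay(pkt, delay)
print(pkt["max_queue_delay_us"])  # 41: the bottleneck hop's queue
```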
New congestion control algorithms are rapidly improving datacenter performance by reducing latency and overcoming incast collapse. However, deploying a new algorithm requires updating every endpoint in the datacenter, which is slow and error-prone. We present Virtualized Congestion Control, which decouples congestion control from the endpoint operating system and runs it in a virtual congestion control layer. This allows operators to deploy, update, and experiment with congestion control algorithms without modifying the host OS kernel.
@inproceedings{cronkite2016virtualized,
title={Virtualized Congestion Control},
author={Cronkite-Ratcliff, Bryce and Bergman, Aran and Vargaftik, Shay and Ravi, Madhusudhan and McKeown, Nick and Abraham, Ittai and Keslassy, Isaac},
booktitle={ACM SIGCOMM},
year={2016}
}
As datacenter speeds scale to 100 Gb/s and beyond, traditional congestion control algorithms like TCP and RCP converge slowly to steady sending rates, which leads to poorer and less predictable performance for short flows. We argue that at today's datacenter speeds, link bandwidth should be allocated proactively by the network, rather than reactively by end-hosts. We describe an architecture called Homa, which proactively allocates bandwidth using in-network priority queues and a receiver-driven protocol. We show that Homa can deliver messages with tail latencies within a small factor of optimal, while achieving high overall throughput.
@inproceedings{jose2015high,
title={High Speed Networks Need Proactive Congestion Control},
author={Jose, Lavanya and Yan, Lisa and Alizadeh, Mohammad and Varghese, George and McKeown, Nick and Katti, Sachin},
booktitle={ACM HotNets},
year={2015}
}
Buffer sizing has received a lot of attention recently since it is becoming increasingly difficult to use large buffers in high-speed routers. Much of the prior work has concentrated on analyzing the amount of buffering required in core routers assuming that TCP carries all the data traffic. In this paper, we evaluate the amount of buffering required for RCP on a single congested link, while explicitly modeling flow arrivals and departures. Our theoretical analysis and simulations indicate that buffer sizes of about 10% of the bandwidth-delay product are sufficient for RCP to deliver good performance to end-users.
@article{lakshmikantha2009buffer,
title = {Buffer Sizing Results for RCP Congestion Control under Connection Arrivals and Departures},
author = {Lakshmikantha, Ashvin and Srikant, R. and Dukkipati, Nandita and McKeown, Nick and Beck, Carolyn},
journal = {ACM SIGCOMM Computer Communication Review},
volume = {39},
number = {1},
pages = {5--15},
year = {2009}
}
In the context of explicit congestion control protocols like XCP and RCP where the equilibrium queue lengths are zero, we show that the stability region derived from traditional Nyquist analysis is not an accurate predictor of protocol stability for networks carrying a finite number of flows. This is because when there is no queueing, the flow rates do not change smoothly; they instead change abruptly creating limit cycles. We characterize the stability region analytically and find that it is a function of the number of flows. Our analysis provides a simple rule for choosing protocol parameters to ensure stability for a given network scenario.
@article{balakrishnan2007stability,
title={Stability Analysis of Explicit Congestion Control Protocols},
author={Balakrishnan, Hamsa and Dukkipati, Nandita and McKeown, Nick and Tomlin, Claire},
journal={IEEE Communications Letters},
volume={11},
number={10},
year={2007}
}
Most congestion control algorithms try to emulate processor sharing (PS) by giving each competing flow an equal share of a bottleneck link. This approach leads to fairness, and prevents long flows from hogging resources. For example, if a set of flows with the same round-trip time share a bottleneck link, TCP's congestion control mechanism tries to achieve PS; so do most of the proposed alternatives, such as eXplicit Control Protocol (XCP). But although they emulate PS well in a static scenario, they do not come close to PS when new flows arrive randomly and have a finite amount of data to send, as is the case in today's Internet. Typically, flows take an order of magnitude longer to complete with TCP or XCP than with PS, suggesting large room for improvement. And so in this paper, we explore how a new congestion control algorithm, Rate Control Protocol (RCP), comes much closer to emulating PS over a broad range of operating conditions. In RCP, a router assigns a single rate to all flows that pass through it. The router does not keep flow-state, and does no per-packet calculations. Yet we are able to show that under a wide range of traffic characteristics and network conditions, RCP's performance is very close to ideal processor sharing.
@techreport{dukkipatiprocessor,
title = {Processor Sharing Flows in the Internet},
author = {Dukkipati, Nandita and McKeown, Nick},
institution = {Stanford University High Performance Networking Group},
number = {TR04-HPNG-061604},
year = {2004}
}
@inproceedings{dukkipati2005processor,
title = {Processor Sharing Flows in the Internet},
author = {Dukkipati, Nandita and Kobayashi, Masayoshi and Zhang-Shen, Rui and McKeown, Nick},
booktitle = {IEEE IWQoS},
year = {2005}
}
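The RCP control law summarized in the abstracts above (one rate per router, no per-flow state, feedback from aggregate input rate and queue) can be sketched as follows. This is our illustration of an RCP-style update, not the authors' exact equation or tuning; the gain values alpha and beta are placeholders.

```python
def rcp_update(R_prev, C, y, q, d0, T, alpha=0.4, beta=0.2):
    """One control interval of an RCP-style rate update (a sketch; the
    precise equation and constants are in the papers above).

    R_prev : rate offered to every flow in the previous interval (b/s)
    C      : link capacity (b/s)
    y      : measured aggregate input rate over the last interval (b/s)
    q      : queue occupancy (bits)
    d0     : moving average of the flows' round-trip times (s)
    T      : update interval (s)
    """
    n_hat = C / R_prev            # estimated number of flows: no per-flow state kept
    spare = alpha * (C - y)       # give out spare capacity...
    drain = beta * (q / d0)       # ...and drain any standing queue
    R = R_prev + (T / d0) * (spare - drain) / n_hat
    return max(min(R, C), 1e-6)   # clamp to a sane range
```

Starting below capacity with an empty queue, the offered rate ramps up; at full utilization with a standing queue, it backs off.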
We recently proposed the Rate Control Protocol (RCP) as a way to minimize download times (or flow-completion times). Simulations suggest that if RCP were widely deployed, downloads would frequently finish ten times faster than with TCP. This is because RCP involves explicit feedback from the routers along the path, allowing a sender to pick a fast starting rate, and adapt quickly to network conditions. RCP is particularly appealing because it can be shown to be stable under broad operating conditions, and its performance is independent of the flow-size distribution and the RTT. Although it requires changes to the routers, the changes are small: the routers keep no per-flow state or per-flow queues, and the per-packet processing is minimal. However, the bar is high for a new congestion control mechanism: introducing a new scheme requires enormous change, and the argument needs to be compelling. And so, to enable some scientific and repeatable experiments with RCP, we have built and tested an open and public implementation of RCP; we have made available both the end-host software, and the router hardware. In this paper we describe our end-host implementation of RCP in Linux, and our router implementation in Verilog (on the NetFPGA platform). We hope that others will be able to use these implementations to experiment with RCP and further our understanding of congestion control.
@inproceedings{dukkipati2007building,
title = {Building a RCP (Rate Control Protocol) Test Network},
author = {Dukkipati, Nandita and Gibb, Glen and McKeown, Nick and Zhu, Jiang},
booktitle = {Hot Interconnects},
year = {2007}
}
Rate Control Protocol with Acceleration Control (RCP-AC) is a congestion control protocol designed to minimize flow completion time. RCP-AC operates at the router, which assigns rates to flows based on aggregate feedback. Unlike traditional TCP-based approaches that optimize for throughput or fairness, RCP-AC explicitly targets flow completion time, which enables typical Internet flows to complete several times faster than with TCP.
@inproceedings{dukkipati2006rcp,
title={RCP-AC: Congestion Control to Make Flows Complete Quickly in Any Environment},
author={Dukkipati, Nandita and McKeown, Nick and Fraser, Alexander G.},
booktitle={IEEE INFOCOM High-Speed Networking Workshop},
year={2006}
}
We argue that flow completion time is the right metric for evaluating congestion control in datacenter networks. Most datacenter applications care about how quickly their flows complete, not about the throughput or fairness achieved in steady state. We show that optimizing for flow completion time leads to different design choices than optimizing for throughput or fairness, and present a framework for reasoning about the flow completion time performance of different congestion control schemes.
@article{dukkipati2006why,
title={Why Flow-Completion Time is the Right Metric for Congestion Control},
author={Dukkipati, Nandita and McKeown, Nick},
journal={ACM SIGCOMM Computer Communication Review},
volume={36},
number={1},
year={2006}
}
Shortest path routing protocols may suffer from congestion due to the use of a single shortest path between a source and a destination. The goal of our work is to first understand how links become overloaded in an IP backbone, and then to explore whether the routing protocol, either in its existing form or in some enhanced form, could be made to respond immediately to overload and reduce the likelihood of its occurrence. Our method is to use extensive measurements of Sprint's backbone network, measuring 138 links between September 2000 and June 2001. We find that since the backbone is designed to be overprovisioned, link overload is rare, and when it occurs, 80% of the time it is caused by link failures. Furthermore, we find that when a link is overloaded, few (if any) other links in the network are also overloaded.
@inproceedings{iyer2003approach,
title={An Approach to Alleviate Link Overload as Observed on an IP Backbone},
author={Iyer, Sundar and Bhattacharyya, Supratik and Taft, Nina and Diot, Christophe and McKeown, Nick},
booktitle={IEEE INFOCOM},
year={2003}
}
Feedback based congestion control — used by TCP and other transport protocols — causes the transmission rate of a long-lived flow to oscillate. This paper is about the tendency of multiplexed flows on a bottleneck link to oscillate causing “aggregate periodic behavior”. As a consequence, bottleneck links can become underutilized, wasting capacity. One simple way to prevent this is to use larger buffers in the bottleneck router. We explore how large these buffers need to be to absorb oscillations. Another proposed solution is to use a randomized active queue management algorithm, such as RED. A goal of RED is to reduce “global synchronization” of TCP flows, and hence avoid oscillations. Our results suggest that RED does not reduce aggregate periodic behavior when compared to Drop Tail. Perhaps surprisingly, our results and a simple analytical model suggest that the tendency for aggregate traffic to oscillate is robust against many forms of disturbances added to the timing of feedback and reaction, even if each constituent flow shows little or no sign of periodicity. Different flows can be in different states and yet conspire to make their sum have periodic behavior.
@inproceedings{gilbert2001congestion,
title = {Congestion Control and Periodic Behavior},
author = {Gilbert, Anna and Joo, Youngmi and McKeown, Nick},
booktitle = {IEEE LANMAN Workshop},
year = {2001}
}
Should applications receive special treatment from the network? And if so, who decides which applications are preferred? This discussion, known as net neutrality, goes beyond technology and is a hot political topic. In this paper we approach net neutrality from a user's perspective. Through user studies, we demonstrate that users do indeed want some services to receive preferential treatment; and their preferences have a heavy tail: a one-size-fits-all approach will not work. We prototype Boost, a mechanism that gives users the ability to prioritize their own traffic, putting the net neutrality decision in the hands of the end user.
@inproceedings{yiakoumis2016neutral,
title={Neutral Net Neutrality},
author={Yiakoumis, Yiannis and Katti, Sachin and McKeown, Nick},
booktitle={ACM SIGCOMM},
year={2016}
}
There is an active policy debate about who can control the Internet traffic that flows over the last-mile connection, often causing congestion and bad user experience. Policy-makers, ISPs, and content providers each want a say, but the user is rarely consulted. We argue that users should be able to decide how their traffic is treated. We present a system that allows home users to prioritize their own traffic, giving them direct control over their network experience.
@inproceedings{yiakoumis2012putting,
title={Putting Home Users in Charge of their Network},
author={Yiakoumis, Yiannis and Huang, Te-Yuan and Yap, Kok-Kiong and Katti, Sachin and Johari, Ramesh and McKeown, Nick},
booktitle={ACM HomeSys Workshop},
year={2012}
}
Mobile devices today have access to multiple network interfaces simultaneously, including WiFi, cellular, and Bluetooth. We present a system that schedules packets across available interfaces to optimize performance while respecting user preferences about cost, energy consumption, and quality. Users express preferences as simple policies, which the scheduler translates into optimal scheduling decisions. We implement and evaluate our system on Android devices.
@inproceedings{yap2013scheduling,
title={Scheduling Packets Over Multiple Interfaces While Respecting User Preferences},
author={Yap, Kok-Kiong and Huang, Te-Yuan and Yiakoumis, Yiannis and Chinchali, Sandeep and McKeown, Nick and Katti, Sachin},
booktitle={CoNEXT},
year={2013}
}
Current networking stacks were designed for a single wired network interface. Today, it is common for a mobile device to connect to many networks that come and go, and whose rates are constantly changing. Current practice is to pick one network, hand off to it, and use it until a better one comes along, leading to packet loss and delay during the handoff. We describe a late-binding architecture, in which the sender does not choose the interface ahead of time. Instead, we use a simple multi-path protocol that delays the binding of a packet to a particular path. We find that late-binding can reduce median handoff packet loss by 5-10x.
@inproceedings{yap2013late,
title={Late-binding: How to Lose Fewer Packets During Handoff},
author={Yap, Kok-Kiong and Huang, Te-Yuan and Yiakoumis, Yiannis and McKeown, Nick and Katti, Sachin},
booktitle={CellNet},
year={2013}
}
End-hosts are increasingly equipped with multiple network interfaces, ranging from smartphones with multiple radios to servers with multi-homing. These interfaces are diverse; some are expensive to use (e.g. 4G), some are free (e.g. WiFi), and they have different rates and reliability. On the other hand, end-hosts now run diverse applications with different priorities, from relatively less important web browsing to higher-priority VoIP and video calls. Finally, users may have policies that constrain interface use (e.g. use 4G only for high-priority flows). This paper tackles the question of how different applications can use different subsets of the available network interfaces, while ensuring a fair resource allocation among flows and satisfying policy constraints. We generalize prior classical work on generalized processor sharing (GPS) to the case of flows sharing different subsets of the available interfaces. We show a simple scheduling scheme for packet-by-packet GPS over multiple interfaces, and prove that it can provide bounded delay and rate guarantees.
@inproceedings{yap2012multiserver,
title = {Multi-Server Generalized Processor Sharing},
author = {Yap, Kok-Kiong and McKeown, Nick and Katti, Sachin},
booktitle = {International Teletraffic Congress (ITC)},
address = {Krak{\'o}w, Poland},
year = {2012}
}
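For background, the classical single-link, packet-by-packet approximation of GPS that this paper generalizes can be sketched as follows. This sketch uses self-clocked fair queueing, a common simplification of WFQ, over one interface; the function and variable names are ours, and extending fairness across several heterogeneous interfaces is exactly the multi-server problem the paper addresses.

```python
import heapq

def scfq_schedule(flows: dict[str, tuple[float, list[int]]]) -> list[str]:
    """flows: name -> (weight, packet lengths). Returns the dequeue order
    under self-clocked fair queueing on a single link."""
    v = 0.0                                   # system virtual time
    last_finish = {name: 0.0 for name in flows}
    heap = []                                 # (finish tag, flow name, packet index)
    for name, (w, pkts) in flows.items():
        if pkts:
            tag = max(last_finish[name], v) + pkts[0] / w
            heapq.heappush(heap, (tag, name, 0))
    order = []
    while heap:
        tag, name, i = heapq.heappop(heap)
        v = tag                               # self-clocking: V tracks the served tag
        last_finish[name] = tag
        order.append(name)
        w, pkts = flows[name]
        if i + 1 < len(pkts):
            nxt = max(last_finish[name], v) + pkts[i + 1] / w
            heapq.heappush(heap, (nxt, name, i + 1))
    return order

# Two flows with equal-size packets; 'a' has twice the weight of 'b',
# so it is served roughly twice as often.
print(scfq_schedule({"a": (2.0, [100] * 4), "b": (1.0, [100] * 2)}))
# -> ['a', 'a', 'b', 'a', 'a', 'b']
```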
Poor connectivity is common when we use wireless networks on the go. A natural way to tackle the problem is to take advantage of the multiple network interfaces on our mobile devices, and use all the networks around us simultaneously. We describe Hercules, a system that uses all available network interfaces to increase throughput and reliability for mobile devices. We implement Hercules in Android and evaluate it through extensive measurements. We find that Hercules can increase throughput by up to 2x compared to using a single interface, while also reducing the duration of connectivity gaps.
@inproceedings{yap2012making,
title={Making Use of All the Networks Around Us: A Case Study in Android},
author={Yap, Kok-Kiong and Huang, Te-Yuan and Yiakoumis, Yiannis and Kobayashi, Masayoshi and Katti, Sachin and Parulkar, Guru and McKeown, Nick},
booktitle={ACM CellNet Workshop},
year={2012}
}
Guest WiFi systems have become an integral part of our lives, yet they continue to be plagued by largely unresolved issues. In this technical report, we argue for the separation of authentication, access and accounting. To understand our proposal, we prototyped the OpenWiFi system. We describe OpenWiFi, its uses, implementation, and the numerous challenges we faced, along with trials we have undertaken to share wireless access more broadly.
@techreport{yap2011separating,
title = {Separating Authentication, Access and Accounting: A Case Study with OpenWiFi},
author = {Yap, Kok-Kiong and Yiakoumis, Yiannis and Kobayashi, Masayoshi and Katti, Sachin and Parulkar, Guru and McKeown, Nick},
institution = {Stanford University and NEC},
number = {OPENFLOW-TR-2011-1},
year = {2011}
}
In the past couple of years we have seen quite a change in the mobile industry, driven largely by user demand for new applications and services. But the underlying wireless infrastructure is slow to evolve. We argue that the network itself must become programmable and open to innovation, just as the Internet was opened up through software-defined networking. We present a blueprint for making wireless and mobile networks more open to experimentation, enabling rapid deployment of new services and protocols.
@inproceedings{chan2010blueprint,
title={Blueprint for Introducing Innovation into Wireless Mobile Networks},
author={Chan, Michael and Handigol, Nikhil and McKeown, Nick and Parulkar, Guru},
booktitle={ACM SIGCOMM VISA Workshop},
year={2010}
}
Despite findings that only 5.2% of the spectrum from 30 MHz to 3 GHz is utilized, there is much talk of an impending spectrum crisis. This is no contradiction: spectrum is inefficiently utilized, with some parts being heavily used while others are barely used. In this paper, we explore a simple high-level approach to the problem: enable user mobility across networks and exploit all the available capacity. By doing so, we can create a better mobile network by stitching together existing ones spanning multiple wireless technologies. We briefly outline our exploratory foray into radio-agnostic handover and discuss the various challenges ahead.
@inproceedings{yap2010delivering,
title = {Delivering Capacity for the Mobile Internet by Stitching Together Networks},
author = {Yap, Kok-Kiong and Katti, Sachin and Parulkar, Guru and McKeown, Nick},
booktitle = {ACM S3 Workshop (Wireless of the Students, by the Students, for the Students)},
year = {2010}
}
In an ideal world, all research papers would be runnable: simply click to replicate all results, using the same setup as the authors. Unfortunately, reproducibility in networking research is rare. We present a system for creating reproducible network experiments using lightweight container-based emulation. Our approach combines Mininet with version-controlled experiment specifications, making it possible to package, share, and re-run network experiments with minimal effort. We demonstrate the system by reproducing several published research results.
@inproceedings{handigol2012reproducible,
title={Reproducible Network Experiments using Container-Based Emulation},
author={Handigol, Nikhil and Heller, Brandon and Jeyakumar, Vimalkumar and Lantz, Bob and McKeown, Nick},
booktitle={CoNEXT},
year={2012}
}
Mininet is a system for rapidly prototyping large networks on the limited resources of a single computer. It creates virtual networks using lightweight OS-level virtualization, providing hosts, switches, controllers, and links that behave like their hardware counterparts. Mininet enables interactive development, testing, and debugging of Software-Defined Networks, and its networks can be easily transferred to hardware for line-rate operation. We describe Mininet's design and evaluate its fidelity and performance.
@inproceedings{lantz2010network,
title={A Network in a Laptop: Rapid Prototyping for Software-Defined Networks},
author={Lantz, Bob and Heller, Brandon and McKeown, Nick},
booktitle={ACM SIGCOMM HotNets},
year={2010}
}
Network experiments often require precise timing and repeatability, which are difficult to achieve in real networks. We describe a system for performing time-sensitive network experiments that enables researchers to control buffer sizes, measure queue occupancies, and replay traffic patterns with high precision. Our system is built on the NetFPGA platform and provides the accuracy needed for buffer sizing experiments.
@inproceedings{beheshti2008performing,
title = {Performing Time-Sensitive Network Experiments},
author = {Beheshti, Neda and Ganjali, Yashar and Ghobadi, Monia and McKeown, Nick and Naous, Jad and Salmon, Geoff},
booktitle = {ANCS},
year = {2008}
}
If optical routers are to become reality, we will need several new optical technologies, one of which is to build sufficiently large optical buffers. Building optical buffers for routers is daunting: today's electronic routers often hold millions of packets, which is well beyond the capabilities of optical technology. In this paper, we argue that two new results offer a solution. First, we show that the size of buffers in backbone routers can be made very small, just about 20 packets per line card, at the expense of a small loss in throughput. Second, we show that integrated delay line optical buffers can store a few dozen packets on a photonic chip. With the combination of these two results, we conclude that future Internet routers could use optical buffers.
@article{beheshti2010optical,
title = {Optical Packet Buffers for Backbone Internet Routers},
author = {Beheshti, Neda and Burmeister, Emily F. and Ganjali, Yashar and Bowers, John E. and Blumenthal, Daniel J. and McKeown, Nick},
journal = {IEEE/ACM Transactions on Networking},
volume = {18},
number = {5},
year = {2010}
}
We describe how integrated photonic technologies can enable low-power packet networking. As network speeds increase, the power consumption of electronic packet processing becomes a growing concern. We discuss how photonic integration can reduce power in key networking functions including buffering, switching, and signal processing, and present results from several prototype demonstrations.
@article{blumenthal2011integrated,
title = {Integrated Photonics for Low-Power Packet Networking},
author = {Blumenthal, Daniel J. and Barton, John and Beheshti, Neda and Bowers, John E. and Burmeister, Emily and Coldren, Larry A. and Dummer, Matt and Epps, Garry and Fang, Alexander and Ganjali, Yashar and Garcia, John and Koch, Brian and Lal, Vikrant and Lively, Erica and Mack, John and Masanovic, Milan and McKeown, Nick and Nguyen, Kim and Nicholes, Steven C. and Park, Hyundai and Stamenic, Biljana and Tauke-Pedretti, Anna and Poulsen, Henrik and Sysak, Matt},
journal = {IEEE Journal of Selected Topics in Quantum Electronics},
volume = {17},
number = {2},
year = {2011}
}
Internet routers have packet buffers which reduce packet loss during times of congestion. Sizing the router buffer correctly is important: if a router buffer is too small, it can cause high packet loss and link under-utilization. If a buffer is too large, packets may have to wait an unnecessarily long time in the buffer during congested periods, often up to hundreds of milliseconds. While an operator can reduce the operational size of a router buffer, the maximum size of a router buffer is decided by the router manufacturer, and the operator typically configures the router to use all the available buffers. Without clear guidance about how big a buffer needs to be, manufacturers tend to oversize buffers and operators tend to configure larger buffers than necessary, leading to increased cost and delay.
@article{spang2021updating,
title={Updating the Theory of Buffer Sizing},
author={Spang, Bruce and Arslan, Serhat and McKeown, Nick},
journal={Performance Evaluation},
volume={151},
year={2021}
}
The queueing delay faced by a packet is arguably the largest source of uncertainty during its journey. It therefore seems crucial that we understand how big the buffers should be in Internet routers. Our 2004 Sigcomm paper revisited the existing rule of thumb that a buffer should hold one bandwidth-delay product of packets. We claimed that for long-lived TCP flows, it could be reduced by √N, where N is the number of active flows, potentially reducing the required buffers by well over 90% in Internet backbone routers. One might reasonably expect that such a result, which supports cheaper routers with smaller buffers, would be embraced by the ISP community. In this paper we revisit the result 15 years later, and explain where it has succeeded and failed to affect how buffers are sized.
@article{mckeown2019sizing,
title={Sizing Router Buffers (Redux)},
author={McKeown, Nick and Appenzeller, Guido and Keslassy, Isaac},
journal={ACM SIGCOMM Computer Communication Review},
volume={49},
number={5},
year={2019}
}
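The square-root-of-N rule discussed above is easy to illustrate with back-of-the-envelope numbers (the link parameters below are ours, for illustration, not the paper's):

```python
# Compare the bandwidth-delay-product rule of thumb with the
# BDP/sqrt(N) result for long-lived TCP flows.
from math import sqrt

def bdp_buffer_bits(rtt_s: float, capacity_bps: float) -> float:
    """Classic rule of thumb: one bandwidth-delay product of buffering."""
    return rtt_s * capacity_bps

def small_buffer_bits(rtt_s: float, capacity_bps: float, n_flows: int) -> float:
    """The 2004 Sigcomm result: BDP / sqrt(N) for N long-lived TCP flows."""
    return rtt_s * capacity_bps / sqrt(n_flows)

# A 10 Gb/s link with a 250 ms RTT and 10,000 active flows:
rtt, cap, n = 0.25, 10e9, 10_000
print(bdp_buffer_bits(rtt, cap) / 8e6)       # -> 312.5 MB, rule of thumb
print(small_buffer_bits(rtt, cap, n) / 8e6)  # -> 3.125 MB, BDP/sqrt(N)
```

With 10,000 flows the required buffer shrinks by a factor of 100, consistent with the "well over 90%" reduction claimed for backbone routers.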
All packet switches contain packet buffers to hold packets during times of congestion. The capacity of a high performance router is often dictated by the speed of its packet buffers. This is particularly true for a shared memory switch where the memory needs to operate at N times the line rate, where N is the number of ports in the system. Even input queued switches must be able to buffer packets at the rate at which they arrive. And so as link rates increase memory bandwidth requirements grow. With today’s DRAM technology and for an OC192c (10Gb/s) link, it is barely possible to write packets to (read packets from) memory at the rate at which they arrive (depart). As link rates increase, the problem will get harder. There are several techniques for building faster packet buffers, based on ideas from computer architecture such as memory interleaving and banking. While not directly applicable to packet switches, they form the basis of several techniques in use today. In this paper we consider one particular packet buffer architecture consisting of large, slow, low cost, DRAMs coupled with a small, fast SRAM “buffer”. We describe and analyze a memory management algorithm (ECQF-MMA) for replenishing the cache and find a bound on the size of the SRAM.
@inproceedings{iyer2001analysis,
title = {Analysis of a Memory Architecture for Fast Packet Buffers},
author = {Iyer, Sundar and Kompella, Ramana Rao and McKeown, Nick},
booktitle = {IEEE High Performance Switching and Routing (HPSR)},
pages = {368--373},
year = {2001}
}
Packet switches (e.g., IP routers, ATM switches and Ethernet switches) maintain statistics for a variety of reasons: performance monitoring, network management, security, network tracing, and traffic engineering. The statistics are usually collected by counters which might, for example, count the number of arrivals of a specific type of packet, or count particular events, such as when a packet is dropped. The arrival of a packet may lead to several different statistics counters being updated. The number of statistics counters and the rate at which they are updated is often limited by memory technology. A small number of counters may be held in on-chip registers or in (onor off-chip) SRAM. But often, the number of counters is very large, and hence they need to be stored in off-chip DRAM. However, the large random access times of DRAMs make it difficult to support high bandwidth links. The time taken to read, update and write a single counter would be too large, and worse still multiple counters may need to be updated for each arriving packet. In this paper we consider a specific architecture for storing and updating statistics counters. Smaller sized counters are maintained in fast (potentially on-chip) SRAM, while a large, slower DRAM maintains the full-sized counters. The problem is to ensure that the counter values are always correctly maintained at line-rate. We describe and analyze an optimal counter management algorithm (LCF-CMA), which minimizes the size of the SRAM required while ensuring correct line-rate operation of a large number of counters.
@inproceedings{shah2001analysis,
title = {Analysis of a Statistics Counter Architecture},
author = {Shah, Devavrat and Iyer, Sundar and Prabhakar, Balaji and McKeown, Nick},
booktitle = {Hot Interconnects},
note = {Also appeared as "Maintaining Statistics Counters in Router Line Cards," IEEE Micro, Jan.--Feb. 2002, pp. 76--81},
year = {2001}
}
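The SRAM/DRAM split described above can be sketched in a few lines. This toy model keeps small residues in "SRAM" and flushes the largest residue to the full-sized "DRAM" counter one at a time, in the spirit of the largest-counter-first policy; the class and method names are ours, and the real LCF-CMA analysis bounds the SRAM counter width, which this sketch does not attempt.

```python
class CounterCache:
    """Toy two-level statistics counter store: fast small residues
    backed by slow full-sized counters."""

    def __init__(self, n_counters: int):
        self.sram = [0] * n_counters      # small, fast residues
        self.dram = [0] * n_counters      # full-sized counters

    def increment(self, i: int):
        self.sram[i] += 1                 # per-packet work touches only SRAM

    def flush_largest(self):
        # One slow DRAM update per call: pick the largest SRAM residue
        # (largest-counter-first) and fold it into the full counter.
        i = max(range(len(self.sram)), key=self.sram.__getitem__)
        self.dram[i] += self.sram[i]
        self.sram[i] = 0

    def read(self, i: int) -> int:
        return self.dram[i] + self.sram[i]
```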
Shared memory switches provide the best possible throughput and delay characteristics, but building them at high speeds is challenging because the memory must operate at N times the line rate, where N is the number of ports. We describe techniques to build shared memory switches at high speeds while reducing the memory speed requirement, making practical high-performance shared memory switch implementations feasible.
@techreport{iyertechniques,
title = {Techniques for Fast Shared Memory Switches},
author = {Iyer, Sundar and McKeown, Nick},
institution = {Stanford University High Performance Networking Group},
number = {TR01-HPNG-081501},
year = {2001}
}
During the past four years, several papers have proposed rules for sizing buffers in Internet core routers. We describe a set of experiments conducted on a commercial router to measure the relationship between buffer size and link utilization, queueing delay, and drop rate. Our results confirm the theoretical predictions that buffers can be significantly smaller than the bandwidth-delay product, and we characterize the trade-offs between buffer size, link utilization, and packet loss.
@inproceedings{beheshti2008experimental,
title={Experimental Study of Router Buffer Sizing},
author={Beheshti, Neda and Ganjali, Yashar and Ghobadi, Monia and McKeown, Nick and Salmon, Geoff},
booktitle={ACM IMC},
year={2008}
}
We have deployed a network of fully programmable routers in the Internet2 backbone. Each router, built on the NetFPGA platform [1], can process four ports of line-rate Gigabit Ethernet traffic. The routers are interconnected by a full mesh of dedicated links over the Internet2 backbone. Our goal is to demonstrate how such a network of routers can enable real-time experiments that are not possible with commercial routers. In particular, we will demonstrate the tradeoff between a router’s buffer size and its forwarding performance, as seen by both the end-user and the network operator. Attendees will be able to interactively set the buffer size while monitoring the real-time occupancy of the buffers, drop rate, and throughput with a resolution of 8ns.
@inproceedings{beheshti2008demo,
title = {Demo: Experimenting with Programmable Routers in Real Networks},
author = {Beheshti, N. and Underhill, D. and Heller, B. and Bolouki, S. and McKeown, N. and Ganjali, Y.},
booktitle = {ACM SIGCOMM (demo)},
note = {2nd-Best Demo Award},
year = {2008}
}
Internet routers need packet buffers to absorb bursts of traffic and avoid packet loss during congestion. On router linecards, these buffers must be arranged to serve multiple logical queues while meeting strict timing constraints. We analyze the design of packet buffers for router linecards, considering the trade-offs between memory technology, buffer organization, and performance. We describe practical designs that achieve high throughput with reasonable memory costs.
@article{iyer2008designing,
title={Designing Packet Buffers for Router Linecards},
author={Iyer, Sundar and Kompella, Ramana and McKeown, Nick},
journal={IEEE/ACM Transactions on Networking},
volume={16},
number={3},
year={2008}
}
Recent advances in optical technology show the possibility of building all-optical buffers in the near future. These buffers are usually composed of a number of fiber delay lines (FDLs) and optical switches. Incoming packets are stored for a limited time by going through optical delay lines. Optical switches transfer these packets among different delay lines, or send them towards the output line if a packet is to leave the system. As a direct consequence of using optical technology, one of the major constraints in this setting is that the size of the switch needs to be small. In this paper, we show the feasibility of constructing a FIFO queue of size N by using only O(log N) 2x2 switches. A simple scheduling algorithm that achieves this bound is developed. The proposed structure provides an efficient way of storing optical packets using a minimal number of delay lines and switches.
@inproceedings{beheshti2007packet,
title = {Packet Scheduling in Optical FIFO Buffers},
author = {Beheshti, N. and Ganjali, Y. and McKeown, N.},
booktitle = {IEEE INFOCOM High-Speed Networking Workshop},
year = {2007}
}
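The O(log N) result above rests on a counting argument that is easy to illustrate: any required delay d < N can be composed from delay lines of lengths 1, 2, 4, ..., i.e. from the binary representation of d. The sketch below (ours) shows only this composition; the paper's contribution is the 2x2-switch construction and the scheduling algorithm around it.

```python
def lines_for_delay(d: int) -> list[int]:
    """Delay-line lengths (powers of two) whose sum is exactly d packet-times."""
    lines, power = [], 1
    while d:
        if d & 1:              # this bit of d is set...
            lines.append(power)  # ...so route the packet through this line
        d >>= 1
        power <<= 1
    return lines

print(lines_for_delay(13))   # -> [1, 4, 8]: 13 packet-times from 3 lines
```

A FIFO of size N = 2^k thus needs only k delay lines, one per bit, rather than N storage slots.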
Recent theoretical results in buffer sizing research suggest that core Internet routers can achieve high link utilization, if they are capable of storing only a handful of packets. The underlying assumption is that the traffic is non-bursty, and that the system is operated below 85-90% utilization. In this paper, we present a test-bed for buffer sizing experiments using NetFPGA [2], a PCI-form factor board that contains reprogrammable FPGA elements, and four Gigabit Ethernet interfaces. We have designed and implemented a NetFPGA-based Ethernet switch with finely tunable buffer sizes, and an event capturing system to monitor buffer occupancies inside the switch. We show that reducing buffer sizes down to 20-50 packets does not necessarily degrade system performance.
@inproceedings{beheshti2007experimenting,
title = {Experimenting with Buffer Sizing in Routers},
author = {Neda Beheshti and Yashar Ganjali and Jad Naous and Nick McKeown},
booktitle = {ANCS, Orlando, FL},
year = {2007}
}
In this paper we explore whether a general-topology network built up of routers with very small buffers can maintain high throughput under TCP's congestion control mechanism. Recent results on buffer sizing challenged the widely used assumption that routers should buffer millions of packets. These new results suggest that when smooth TCP traffic goes through a single tiny buffer of size O(log W), then close-to-peak throughput can be achieved; W is the maximum window size of TCP flows. In this work, we want to know if a network of many routers can perform well when all buffers in the network are made very small, independent of the structure of the network. This scenario represents a real network where packets go through several buffering stages on their routes. Assuming the ingress TCP traffic to a network is paced, we first prove that all routers can get by with very small buffers if the network has a tree structure. For networks with general topology, we propose a simple active queue management policy called Bounded Jitter Policy (BJP), and show that under the proposed policy each flow will preserve its smooth pattern across the network. Logarithmic-size buffers would therefore be enough in every router of the network.
@inproceedings{beheshti2008obtaining,
title = {Obtaining High Throughput Networks with Tiny Buffers},
author = {Neda Beheshti and Yashar Ganjali and Ashish Goel and Nick McKeown},
booktitle = {16th International Workshop on Quality of Service (IWQoS), Enschede, The Netherlands},
year = {2008}
}
In the past two years, several papers have proposed rules that suggest two to five orders of magnitude reduction in Internet core router buffers. Others present scenarios where buffer sizes need to be significantly larger. Depending on the rule, a 40 Gb/s router might need a few bytes, or a few gigabytes of buffer. The choice of buffer size affects the design and cost of routers, network congestion, quality of service, and the fairness and stability of TCP. In this paper we discuss the state of the debate on buffer sizing, describe the areas of agreement, and try to identify areas where more work is needed.
@article{ganjali2006update,
title={Update on Buffer Sizing in Internet Routers},
author={Ganjali, Yashar and McKeown, Nick},
journal={ACM SIGCOMM Computer Communication Review},
volume={36},
number={5},
year={2006}
}
Internet routers require buffers to hold packets during times of congestion. The buffers need to be fast, and so ideally they should be small enough to use fast memory technologies such as SRAM or all-optical packet buffers. In this paper, we present results that suggest routers need only a very small amount of buffering. Specifically, we show that if the number of flows through the bottleneck link is large enough, the amount of buffering can be reduced by a factor equal to the square root of the number of flows, from the commonly used rule-of-thumb of B = RTT x C, to B = (RTT x C) / sqrt(n), where B is the buffer size, C is the link capacity, and n is the number of flows.
@inproceedings{enachescu2006routers,
title={Routers with Very Small Buffers},
author={Enachescu, Mihaela and Ganjali, Yashar and Goel, Ashish and McKeown, Nick and Roughgarden, Tim},
booktitle={IEEE INFOCOM},
year={2006}
}
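The square-root reduction quoted in the abstract above is easy to sanity-check numerically. The sketch below is ours, with illustrative link parameters rather than values from the paper:

```python
from math import sqrt

def buffer_bits(rtt_s, capacity_bps, n_flows=1):
    """Buffer size in bits: the classic rule B = RTT * C,
    reduced by sqrt(n) when n flows share the bottleneck."""
    return rtt_s * capacity_bps / sqrt(n_flows)

# Illustrative example: 250 ms RTT, 10 Gb/s link, 10,000 concurrent flows.
rule_of_thumb = buffer_bits(0.25, 10e9)           # B = RTT * C
small_buffer = buffer_bits(0.25, 10e9, 10_000)    # B = RTT * C / sqrt(n)
print(rule_of_thumb / small_buffer)               # reduction factor: sqrt(10000) = 100
```

With these (assumed) numbers the buffer shrinks from 2.5 Gbit to 25 Mbit, a factor of 100, which is the point of the paper's result.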
High-end routers need to store a large amount of data. Dynamic random access memories (DRAMs) are typically used for this purpose. However, DRAM devices don't match the bandwidth requirements, especially in terms of random access speeds. In this paper, we analyze a generalized memory interleaving scheme that implements memory using multiple, slower DRAMs. In the presence of a small amount of speed-up, we show that reasonable statistical guarantees (i.e., low drop probabilities) can be provided by using small SRAM buffers that queue read/write requests to the DRAMs. We then relate drop probabilities to SRAM buffer size for a wide range of statistical arrival patterns.
@inproceedings{shrimali2005building,
title = {Building Packet Buffers with Interleaved Memories},
author = {Gireesh Shrimali and Nick McKeown},
booktitle = {Workshop on High Performance Switching and Routing, Hong Kong},
year = {2005}
}
In this article we describe recent work on buffer sizing for core Internet routers. This work suggests that the widely used rule of thumb leads to buffers which are much larger than they need to be. For example, the buffer in a backbone router could be reduced from 1,000,000 packets to 10,000 without loss in performance. It could be reduced even further, perhaps to 10-20 packets, at the cost of a small amount of bandwidth utilization. This tradeoff is worth considering, for example for a possible future all-optical router.
@article{wischik2005part,
title = {Part I: Buffer Sizes for Core Routers},
author = {Damon Wischik and Nick McKeown},
journal = {ACM SIGCOMM Computer Communication Review},
volume = {35},
number = {3},
year = {2005}
}
@inproceedings{beheshti2006buffer,
title = {Buffer sizing in all-optical packet switches},
author = {Neda Beheshti and Yashar Ganjali and Ramesh Rajaduray and Daniel Blumenthal and Nick McKeown},
booktitle = {OFC/NFOEC, Anaheim, CA},
year = {2006}
}
All Internet routers contain buffers to hold packets during times of congestion. Today, the size of these buffers is determined by the rule-of-thumb that each link needs a buffer of size equal to the bandwidth-delay product. The widespread use of this rule has led to backbone routers with millions of packets of buffering. We question whether this much buffering is actually needed. Through a combination of analysis and simulation, we argue that buffers can be much smaller than the bandwidth-delay product, reducing router cost and latency.
@inproceedings{appenzeller2004sizing,
title={Sizing Router Buffers},
author={Appenzeller, Guido and Keslassy, Isaac and McKeown, Nick},
booktitle={ACM SIGCOMM},
year={2004}
}
Packet buffers are an essential part of routers. In high-end routers these buffers need to store a large amount of data at very high speeds. To satisfy these requirements, we need a memory with the speed of SRAM and the density of DRAM. A typical solution is to use hybrid packet buffers built from a combination of SRAM and DRAM, where the SRAM holds the heads and tails of per-flow packet FIFOs and the DRAM is used for bulk storage. The main challenge then is to minimize the size of the SRAM while providing reasonable performance guarantees. In this paper, we analyze a commonly used hybrid architecture from a statistical perspective, and ask the following question: if the packet buffer designer is willing to tolerate a certain drop probability, then how small can the SRAM get? To do so, we introduce an analytical model for representing the SRAM buffer occupancy, and derive drop probabilities as a function of SRAM size under a wide range of statistical traffic patterns. As a consequence of our analysis we show that, for low drop probability, the required SRAM size is proportional to the number of flows.
@inproceedings{shrimali2004designing,
title = {Designing Packet Buffers with Statistical Guarantees},
author = {Gireesh Shrimali and Isaac Keslassy and Nick McKeown},
booktitle = {Hot Interconnects, Stanford, CA},
year = {2004}
}
Memory bandwidth is frequently a limiting factor in the design of high-speed switches and routers. In this paper, we introduce a buffering scheme called ping-pong buffering, that increases memory bandwidth utilization by a factor of two, while adding less than one percent to total memory requirements. In particular, we show how ping-pong buffering can be used to double the throughput of a shared-memory switch, or reduce the memory speed requirement of an output-queued switch by a factor of two.
@inproceedings{joo1998doubling,
title={Doubling Memory Bandwidths for Network Buffers},
author={Joo, Youngmi and McKeown, Nick},
booktitle={IEEE INFOCOM},
year={1998}
}
Network operators connect their backbone networks together at peering points. It is well known that the peering points are the most congested parts of the backbone network. Network operators have little incentive to provision them well, and have few tools to decide how best to route traffic over them. In this paper we propose how peering networks can be congestion free, so long as we know the total amount of traffic between them. In particular, we propose the use of Valiant Load-Balancing (VLB), which has been previously studied for individual backbone networks. In our approach, the backbone networks do not need to use VLB internally; they simply load-balance traffic over their peering links. Our analysis shows how the load-balancing should be done, and we conclude that no other method is more efficient than VLB in achieving a congestion-free network.
@inproceedings{zhang-shen2008guaranteeing,
title = {Guaranteeing Quality of Service to Peering Traffic},
author = {Rui Zhang-Shen and Nick McKeown},
booktitle = {IEEE INFOCOM},
year = {2008}
}
Commercial backbone networks must continue to operate even when links and routers fail. Routing schemes such as OSPF, IS-IS, and MPLS reroute traffic, but they cannot guarantee that the resulting network will be congestion-free. As a result, backbone networks are grossly over-provisioned, sometimes running at a utilization below 10%, so they can remain uncongested under failure. Yet even with such large over-provisioning, they still cannot guarantee to be uncongested, sometimes even with just a single failure. With our proposed approach, a network can be designed to tolerate an almost arbitrary number of failures, and guarantee no congestion, usually with an extremely small amount of over-provisioning. In a typical case, a 50 node network can continue to run congestion-free when any 5 links or routers fail, with only 10% over-provisioning. The key to the approach is Valiant Load-Balancing (VLB). VLB's path diversity allows it to tolerate k arbitrary failures in an N-node network, with an over-provisioning ratio of approximately k/N.
@inproceedings{zhang-shen2008designing,
title = {Designing a Fault-Tolerant Network Using Valiant Load-Balancing},
author = {Rui Zhang-Shen and Nick McKeown},
booktitle = {IEEE INFOCOM},
year = {2008}
}
Routers built around a single-stage crossbar and a centralized scheduler do not scale, and (in practice) do not provide the throughput guarantees that network operators need to make efficient use of their network. We describe a scalable router architecture based on a three-stage Clos network topology, using an optical switch in the middle stage. The architecture allows the router to scale to very high aggregate bandwidths while maintaining deterministic performance guarantees. We address two main questions: what performance guarantees can a three-stage Clos-network router provide, and how optical switches can be used to build a practical implementation.
@inproceedings{keslassy2003scaling,
title={Scaling Internet Routers Using Optics},
author={Keslassy, Isaac and Chuang, Shang-Tse and Yu, Kyoungsik and Miller, David and Horowitz, Mark and Solgaard, Olav and McKeown, Nick},
booktitle={ACM SIGCOMM},
year={2003}
}
A parallel packet switch (PPS) is a switch in which the memories run slower than the line rate. Arriving packets are load-balanced packet-by-packet over multiple lower speed center stage packet switches. It is known that, for unicast traffic, a PPS can precisely emulate a FCFS output-queued (OQ) switch with a speedup of two and an OQ switch with delay guarantees with a speedup of three. In this paper we ask: is it possible for a PPS to emulate the behavior of an OQ multicast switch? The main result is that for multicast traffic an N-port PPS can precisely emulate a FIFO OQ switch with a speedup of S > 2√(N+1), and a switch that provides delay guarantees with a speedup of S > 2√(2N+2).
@article{iyer2001speedup,
title = {On the Speedup Required for a Multicast Parallel Packet Switch},
author = {Sundar Iyer and Nick McKeown},
journal = {IEEE Communications Letters},
volume = {5},
number = {6},
year = {2001}
}
We present a load-balanced switch architecture that can scale to an arbitrary number of linecards. The architecture uses a two-stage switching fabric that first load-balances traffic across the fabric, then delivers it to the correct output. We show that this architecture can provide 100% throughput without centralized scheduling, making it suitable for building very large routers.
@techreport{keslassyloadbalanced,
title = {A Load-Balanced Switch with an Arbitrary Number of Linecards},
author = {Isaac Keslassy and Shang-Tse Chuang and Nick McKeown},
institution = {Stanford University},
number = {TR03-HPNG-080102},
year = {}
}
Our work is motivated by the desire to design packet switches with large aggregate capacity and fast line rates. In this paper, we consider building a packet switch from multiple lower speed packet switches operating independently and in parallel. In particular, we consider a (perhaps obvious) parallel packet switch (PPS) architecture in which arriving traffic is demultiplexed over identical lower speed packet switches, switched to the correct output port, then recombined (multiplexed) before departing from the system. Essentially, the packet switch performs packet-by-packet load balancing, or inverse multiplexing, over multiple independent packet switches. Each lower speed packet switch operates at a fraction of the line rate. It is a goal of our work that all memory buffers in the PPS run slower than the line rate. Ideally, a PPS would share the benefits of an output-queued switch.
@article{iyer2003analysis,
title = {Analysis of the Parallel Packet Switch Architecture},
author = {Sundar Iyer and Nick McKeown},
journal = {IEEE/ACM Transactions on Networking},
pages = {314--324},
year = {2003}
}
Most high performance routers today use combined input and output queueing (CIOQ). The CIOQ router is also frequently used as an abstract model for routers: at one extreme is input queueing, at the other extreme is output queueing, and in between are CIOQ routers with various amounts of speedup. It is well known that a CIOQ router with a speedup of two can precisely emulate an output-queued router. In this paper, we ask whether it is possible to achieve similar performance guarantees with a CIOQ router that has a speedup of just one; i.e., with a router that has only a single stage of buffering. We find that it is possible, using a technique called buffered crossbar switching.
@inproceedings{iyer2002routers,
title={Routers with a Single Stage of Buffering},
author={Iyer, Sundar and Zhang, Rui and McKeown, Nick},
booktitle={ACM SIGCOMM},
year={2002}
}
High performance packet switches frequently use a centralized scheduler (also known as an arbiter) to determine the configuration of a non-blocking crossbar. The scheduler often limits the scalability of the system because of the frequency and complexity of its decisions. A recent paper by C.-S. Chang et al. introduces an interesting two-stage switch, in which each stage uses a trivial deterministic sequence of configurations. The switch is simple to implement at high speed and has been proved to provide 100% throughput for a broad class of traffic. Furthermore, there is a bound between the average delay of the two-stage switch and that of an ideal output-queued switch. However, in its simplest form, the switch mis-sequences packets by an arbitrary amount. In this paper, building on the two-stage switch, we present an algorithm called Full Frames First (FFF), that prevents mis-sequencing while maintaining the performance benefits (in terms of throughput and delay) of the basic two-stage switch. FFF comes at some additional cost, which we evaluate in this paper.
@inproceedings{keslassy2002maintaining,
title = {Maintaining Packet Order in Two-Stage Switches},
author = {Isaac Keslassy and Nick McKeown},
booktitle = {IEEE INFOCOM, New York, NY},
year = {2002}
}
This paper is about load-balancing packets across multiple paths inside a switch, or across a network. It is motivated by the recent interest in load-balanced switches. Load-balanced switches provide an appealing alternative to crossbars with centralized schedulers. A load-balanced switch has no scheduler, is particularly amenable to optics, and, most relevant here, guarantees 100% throughput. A uniform mesh is used to load-balance packets uniformly across all 2-hop paths in the switch. In this paper we explore whether this particular method of load-balancing is optimal in the sense that it achieves the highest throughput for a given capacity of interconnect. The method we use allows the load-balanced switch to be compared with ring, torus and hypercube interconnects, too. We prove that for a given interconnect capacity, the load-balancing mesh has the maximum throughput. Perhaps surprisingly, we find that the best mesh is slightly non-uniform, or biased, and has a throughput of N/(2N-1), where N is the number of nodes.
@inproceedings{keslassy2005optimal,
title={Optimal Load-Balancing},
author={Keslassy, Isaac and Chang, Cheng-Shang and McKeown, Nick and Lee, Duan-Shin},
booktitle={IEEE INFOCOM},
year={2005}
}
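The closed-form throughput N/(2N-1) quoted above can be checked with a few lines; the sketch and its function name are ours:

```python
def biased_mesh_throughput(n):
    """Throughput of the optimal (slightly biased) load-balancing mesh,
    N/(2N-1), where n is the number of nodes."""
    return n / (2 * n - 1)

# The guarantee starts at 1 for a single node and decreases toward 1/2
# as the number of nodes grows.
print(biased_mesh_throughput(2))    # 2/3
print(biased_mesh_throughput(64))   # slightly above 0.5
```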
Designing a backbone network is hard. On one hand, users expect the network to have very high availability, little or no congestion, and hence little or no queueing delay. On the other hand, traffic conditions change frequently, making it hard to predict future traffic demands. Network operators try to build enough capacity for the worst case, but this can be very expensive. In this paper, we describe an approach to designing backbone networks that provide predictable performance in the face of unpredictable demands. The approach is based on the idea of designing the network so that its performance degrades gracefully under a wide range of traffic conditions, and in particular under link failures.
@inproceedings{zhangshen2004designing,
title={Designing a Predictable Internet Backbone Network},
author={Zhang-Shen, Rui and McKeown, Nick},
booktitle={HotNets III},
year={2004}
}
We describe recent developments in how optics can be used inside Internet routers to scale capacity and reduce power. At Stanford University we are designing a 100 Tb/s Internet router with an optical switch fabric that guarantees 100% throughput for all traffic. We discuss how optical switches, combined with load-balancing techniques, can overcome the scaling limitations of electronic crossbar switches.
@inproceedings{mckeown2003optics,
title = {Optics inside Routers},
author = {Nick McKeown},
booktitle = {ECOC 2003, Rimini, Italy},
year = {2003}
}
We present a load-balanced switch architecture that can scale to an arbitrary number of linecards. The architecture uses a two-stage switching fabric that first load-balances traffic across the fabric, then delivers it to the correct output. We show that this architecture can provide 100% throughput without centralized scheduling, making it suitable for building very large routers.
@inproceedings{keslassy2004loadbalanced,
title = {A Load-Balanced Switch with an Arbitrary Number of Linecards},
author = {Isaac Keslassy and Shang-Tse Chuang and Nick McKeown},
booktitle = {IEEE INFOCOM, Hong Kong},
year = {2004}
}
The load-balanced switch architecture is a promising way to scale router capacity. We explained in [1] how it can be used to build a 100Tb/s router with no centralized scheduler, no memory operating faster than the line-rate, no packet mis-sequencing, a 100% throughput guarantee for all traffic patterns, and an optical switch fabric that simply spreads traffic evenly among linecards. This switch fabric uses optical MEMS switches that are reconfigured only when linecards are added and deleted, allowing the router to function when any subset of linecards is present and working. In [2] we introduced a configuration algorithm that will find a correct configuration of the MEMS switches in polynomial time. However, we found that our algorithm takes over 50 seconds to run in software for a 100Tb/s router. Our goal is to restore the router to operation within 50ms upon failure. So we modify our algorithm for implementation in dedicated hardware. In particular, to simplify the Ford-Fulkerson algorithm in bipartite matching, we reduce memory accesses and use bit manipulation schemes. Then, we decompose permutations using the Slepian-Duguid algorithm and reduce the configuration time with a simplified memory scheme. Our experimental results show that it is possible to achieve the 50ms target.
@inproceedings{arekapudi2004configuring,
title = {Configuring a Load-Balanced Switch in Hardware},
author = {Srikanth Arekapudi and Shang-Tse Chuang and Isaac Keslassy and Nick McKeown},
booktitle = {Hot Interconnects 12, Stanford, CA},
year = {2004}
}
A parallel packet switch (PPS) is a switch in which the memories run slower than the line rate. Arriving packets are spread (or load-balanced) packet-by-packet over multiple slower-speed packet switches. It is already known that with a speedup of S >= 2, a PPS can theoretically mimic a FCFS output-queued (OQ) switch. However, the theory relies on a centralized packet scheduling algorithm that is essentially impractical because of high communication complexity. In this paper, we attempt to make a high performance PPS practical by introducing two results. First, we show that small co-ordination buffers can eliminate the need for a centralized packet scheduling algorithm, allowing a fully distributed implementation with low computational and communication complexity. Second, we show that without speedup, the resulting PPS can mimic an FCFS OQ switch within a delay bound.
Our work is motivated by the desire to build a very high speed packet switch with extremely high line rates. In this paper, we consider building a packet switch from multiple, lower speed packet switches operating independently and in parallel. In particular, we consider a (perhaps obvious) parallel packet switch (PPS) architecture in which arriving traffic is demultiplexed over k identical, lower speed packet switches, switched to the correct output port, then recombined (multiplexed) before departing from the system. Essentially, the packet switch performs packet-by-packet load balancing, or "inverse multiplexing", over multiple independent packet switches. Each lower speed packet switch operates at a fraction of the line rate R; for example, if each packet switch operates at rate R/k, no memory buffers are required to operate at the full line rate of the system. Ideally, a PPS would share the benefits of an output-queued switch; i.e., the delay of individual packets could be precisely controlled, allowing the provision of guaranteed qualities of service. In this paper, we ask the question: is it possible for a PPS to precisely emulate the behavior of an output-queued packet switch with the same capacity and with the same number of ports? The main result of this paper is that it is theoretically possible for a PPS to emulate a FCFS output-queued packet switch if each layer operates at a rate of approximately 2R/k. This simple result is analogous to Clos' theorem for a three-stage circuit switch to be strictly non-blocking. We further show that the PPS can emulate any QoS queueing discipline if each layer operates at a rate of approximately 3R/k.
@inproceedings{iyer2000analysis,
title = {Analysis of a Packet Switch with Memories Running Slower than the Line Rate},
author = {Sundar Iyer, Amr A. Awadallah, and Nick McKeown},
booktitle = {IEEE INFOCOM, Tel Aviv, Israel},
pages = {529--537},
year = {2000}
}
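The layer-rate conditions from the abstract above (approximately 2R/k for FCFS emulation, 3R/k for QoS emulation) can be captured in a tiny helper; the function name and example numbers are ours:

```python
def pps_layer_rate(line_rate_bps, k, qos=False):
    """Per-layer rate sufficient for a k-layer PPS to emulate an OQ switch:
    ~2R/k for FCFS emulation, ~3R/k when QoS (delay) guarantees are needed."""
    return (3 if qos else 2) * line_rate_bps / k

# Illustrative: a 40 Gb/s line split over k = 8 slower switches.
fcfs = pps_layer_rate(40e9, 8)            # each layer runs at 10 Gb/s
qos = pps_layer_rate(40e9, 8, qos=True)   # each layer runs at 15 Gb/s
print(fcfs, qos)
```

The point of the result is visible here: each layer's memory runs at a small fraction of the 40 Gb/s line rate, at the cost of a constant-factor speedup.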
Input-queued packet switches use a matching algorithm to configure a nonblocking switch fabric (e.g., a crossbar). Ideally, the matching algorithm will guarantee 100% throughput for a broad class of traffic, so long as the switch is not oversubscribed. An intuitive choice is the maximum size matching (MSM) algorithm, which maximizes the instantaneous throughput. It was shown by McKeown et al. in 1999 that with MSM the throughput can be less than 100% when N = 3, even with benign Bernoulli i.i.d. arrivals. In this letter, we extend this result to any N >= 2, and hence show it to be true for switches of any size.
@techreport{keslassymaximum,
title = {Maximum Size Matching is Unstable for Any Packet Switch},
author = {Isaac Keslassy and Rui Zhang-Shen and Nick McKeown},
institution = {Stanford University},
number = {TR03-HPNG-030100},
year = {}
}
This paper is about high capacity switches and routers that give throughput, rate and delay guarantees. Many routers are built using input queueing or combined input and output queueing (CIOQ), using crossbar switching fabrics. But such routers require impractically complex scheduling algorithms to provide the desired guarantees. We explore how a buffered crossbar, a crossbar switch with a packet buffer at each crosspoint, can provide guaranteed performance (throughput, rate, and delay) with less complex, practical scheduling algorithms. We describe scheduling algorithms that operate in parallel on each input and output port, and hence are scalable. With these algorithms, buffered crossbars with a speedup of two can provide 100% throughput, rate, and delay guarantees.
@techreport{chuangpractical,
title = {Practical Algorithms for Performance Guarantees in Buffered Crossbars},
author = {Shang-Tse Chuang and Sundar Iyer and Nick McKeown},
institution = {Stanford University},
number = {TR03-HPNG-061501},
year = {}
}
Internet routers frequently use a crossbar switch to interconnect linecards. The crossbar switch is scheduled using an algorithm that picks a new crossbar configuration every cycle. Several scheduling algorithms have been shown to guarantee 100% throughput under a variety of traffic patterns. The first such algorithm was the maximum weight matching (MWM) algorithm, in which the weight is the sum of the occupancies of the queues. We explore whether alternative weight functions, such as the sum of the squares of the occupancies, lead to stronger or weaker stability. The first result of this paper is that a broad class of weight functions gives rise to strong stability, including the sum of the squares, the sum of the cubes, and so on. A counter-intuitive result, indicating a limitation of the Lyapunov method, is that the sum of the square roots of the occupancies is not included in this class, even though simulation suggests that the resulting average delay is lower than for the other functions. We also consider the simpler randomized scheduling algorithm (TASS) proposed by Tassiulas. We show similar results for different weight functions as for MWM. We finally show that TASS gives 100% throughput when the weights are noisy, or out-of-date.
@inproceedings{keslassy2001analysis,
title = {Analysis of Scheduling Algorithms That Provide 100% Throughput in Input-Queued Switches},
author = {Isaac Keslassy and Nick McKeown},
booktitle = {39th Annual Allerton Conference on Communication, Control, and Computing, Monticello, IL},
year = {2001}
}
Simulation results suggest that a maximum size matching (MSM) algorithm will lead to 100% throughput for uniform Bernoulli i.i.d. arrival traffic. In this paper we explore the throughput of MSM algorithms under Bernoulli arrivals. We show that when the arrival traffic is admissible and cells are scheduled in batches, a sub-class of MSMs called Critical Maximum Size Matching achieves 100% throughput under (non-uniform) Bernoulli i.i.d. arrivals. Further, we show that, with batch scheduling, all MSMs achieve 100% throughput under Bernoulli i.i.d. uniform load.
@inproceedings{iyermaximum,
title = {Maximum Size Matching and Input Queued Switches},
author = {Sundar Iyer and Nick McKeown},
booktitle = {40th Annual Allerton Conference on Communication, Control and Computing},
year = {2002}
}
We recently proposed Constraint Sets as a simple technique to analyze routers with a single stage of buffering. In this letter, we extend the technique to analyze combined input and output queued (CIOQ) routers with two stages of buffering.
@article{iyer2003using,
title = {Using Constraint Sets to Achieve Delay Bounds in CIOQ Switches},
author = {Sundar Iyer and Nick McKeown},
journal = {IEEE Communications Letters},
pages = {275--277},
year = {2003}
}
Input-queued packet switches use a matching algorithm to configure a nonblocking switch fabric (e.g., a crossbar). Ideally, the matching algorithm will guarantee 100% throughput for a broad class of traffic, so long as the switch is not oversubscribed. An intuitive choice is the maximum size matching (MSM) algorithm, which maximizes the instantaneous throughput. It was shown by McKeown et al. in 1999 that with MSM the throughput can be less than 100% when N = 3, even with benign Bernoulli i.i.d. arrivals. In this letter, we extend this result to any N >= 2, and hence show it to be true for switches of any size.
@article{keslassy2003maximum,
title = {Maximum Size Matching is Unstable for Any Packet Switch},
author = {Isaac Keslassy and Rui Zhang-Shen and Nick McKeown},
journal = {IEEE Communications Letters},
volume = {7},
number = {10},
pages = {496--498},
year = {2003}
}
The throughput of an input-queued crossbar switch, with a single FIFO queue at each input, is limited to 2-√2 (approximately 58.6%) for uniformly distributed, Bernoulli i.i.d. arrivals of fixed length packets. In this letter we prove that if the crossbar switch can buffer one packet at each crosspoint, then the throughput increases to 100% asymptotically as N grows, where N is the number of switch ports.
@article{lin2004throughput,
title = {The Throughput of a Buffered Crossbar Switch},
author = {Mingjie Lin and Nick McKeown},
journal = {IEEE Communications Letters},
year = {2004}
}
This paper is about high capacity switches and routers that give throughput, rate and delay guarantees. Many routers are built using input queueing or combined input and output queueing (CIOQ), using crossbar switching fabrics. But such routers require impractically complex scheduling algorithms to provide the desired guarantees. We explore how a buffered crossbar, a crossbar switch with a packet buffer at each crosspoint, can provide guaranteed performance (throughput, rate, and delay) with less complex, practical scheduling algorithms. We describe scheduling algorithms that operate in parallel on each input and output port, and hence are scalable. With these algorithms, buffered crossbars with a speedup of two can provide 100% throughput, rate, and delay guarantees.
@inproceedings{chuang2005practical,
title = {Practical Algorithms for Performance Guarantees in Buffered Crossbars},
author = {Shang-Tse Chuang and Sundar Iyer and Nick McKeown},
booktitle = {IEEE INFOCOM, Miami, FL},
year = {2005}
}
It is well known that head-of-line (HOL) blocking limits the throughput of an input-queued switch with FIFO queues. Under certain conditions, the throughput can be shown to be limited to approximately 58%. It is also known that if non-FIFO queueing policies are used, the throughput can be increased. However, it has not been previously shown that if a suitable queueing policy and scheduling algorithm are used then it is possible to achieve 100% throughput for all independent arrival processes. In this paper we prove this to be the case using a simple linear programming argument and quadratic Lyapunov function. In particular, we assume that each input maintains a separate FIFO queue for each output and that the switch is scheduled using a maximum weight bipartite matching algorithm.
@article{mckeown1999achieving,
title={Achieving 100\% Throughput in an Input-Queued Switch},
author={McKeown, Nick and Mekkittikul, Adisak and Anantharam, Venkat and Walrand, Jean},
journal={IEEE Transactions on Communications},
volume={47},
number={8},
year={1999}
}
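The maximum weight bipartite matching schedule analyzed in this paper can be illustrated with a brute-force sketch (exhaustive search over permutations, purely didactic; the paper analyzes the matching itself, not this way of computing it):

```python
from itertools import permutations

def max_weight_match(q):
    """Return the input-output permutation maximizing total queue weight.

    q[i][j] is the VOQ length at input i for output j. Exhaustive search,
    so only usable for tiny N; real schedulers use far cheaper algorithms.
    """
    n = len(q)
    best, best_w = None, -1
    for perm in permutations(range(n)):
        w = sum(q[i][perm[i]] for i in range(n))
        if w > best_w:
            best, best_w = perm, w
    return list(best), best_w

# Inputs are matched to the outputs holding the longest queues.
schedule, weight = max_weight_match([[3, 0], [0, 5]])
```

With the queue matrix above, input 0 is matched to output 0 and input 1 to output 1, for a total weight of 8.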
We describe and analyze iSLIP, a low-complexity scheduling algorithm for input-queued switches. iSLIP uses multiple iterations of a parallel round-robin matching algorithm to achieve high throughput and low latency. We prove that for uniform i.i.d. Bernoulli arrivals, iSLIP achieves 100% throughput. Under non-uniform traffic, iSLIP performs well in practice, converging quickly to a stable scheduling pattern. The algorithm is simple enough to implement in hardware at high line rates and has been used in several commercial switch designs.
@article{mckeown1999islip,
title={iSLIP: A Scheduling Algorithm for Input-Queued Switches},
author={McKeown, Nick},
journal={IEEE/ACM Transactions on Networking},
volume={7},
number={2},
year={1999}
}
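One iteration of the request-grant-accept exchange behind iSLIP can be sketched as follows (a simplified single-iteration model; the pointer discipline shown, advancing a pointer only past an accepted match, follows the published algorithm, but the queue bookkeeping is pared down for illustration):

```python
def islip_iteration(queues, grant_ptr, accept_ptr):
    """One request-grant-accept round over an N x N switch.

    queues[i][j] > 0 means input i holds cells for output j; grant_ptr
    and accept_ptr are the per-output and per-input round-robin pointers.
    Cells are not dequeued here; the function only computes the match.
    """
    N = len(queues)
    # Request: input i requests every output for which it has a queued cell.
    requests = [[j for j in range(N) if queues[i][j] > 0] for i in range(N)]
    # Grant: each output grants the requesting input nearest its pointer.
    grants = {}
    for j in range(N):
        reqs = [i for i in range(N) if j in requests[i]]
        if reqs:
            grants[j] = min(reqs, key=lambda i: (i - grant_ptr[j]) % N)
    # Accept: each input accepts the granting output nearest its pointer.
    matches = {}
    for i in range(N):
        offers = [j for j, gi in grants.items() if gi == i]
        if offers:
            j = min(offers, key=lambda j: (j - accept_ptr[i]) % N)
            matches[i] = j
            # Pointers move one beyond the match, and only on acceptance:
            # this is what desynchronizes the round-robin schedules.
            grant_ptr[j] = (i + 1) % N
            accept_ptr[i] = (j + 1) % N
    return matches
```

Starting from all pointers at zero with all queues occupied, the first iteration matches only input 0; because the pointers desynchronize, the very next iteration produces a full match.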
The Internet is facing two problems simultaneously: there is a need for a faster switching/routing infrastructure, and a need to introduce guaranteed qualities of service (QoS). Each problem can be solved independently: switches and routers can be made faster by using input-queued crossbars, instead of shared memory systems; and QoS can be provided using WFQ-based packet scheduling. However, until now, the two solutions have been mutually exclusive; all of the work on WFQ-based scheduling algorithms has required that switches/routers use output-queueing, or centralized shared memory. This paper demonstrates that a Combined Input Output Queueing (CIOQ) switch running twice as fast as an input-queued switch can provide precise emulation of a broad class of packet scheduling algorithms, including WFQ and strict priorities. More precisely, we show that for an N x N switch, a "speedup" of 2 - 1/N is necessary and a speedup of two is sufficient for this exact emulation. Perhaps most interestingly, this result holds for all traffic arrival patterns. On its own, the result is primarily a theoretical observation; it shows that it is possible to emulate purely OQ switches with CIOQ switches running at approximately twice the line-rate. To make the result more practical, we introduce several scheduling algorithms that, with a speedup of two, can emulate an OQ switch. We focus our attention on the simplest of these algorithms, Critical Cells First (CCF), and consider its running-time and implementation complexity. We conclude that additional techniques are required to make the scheduling algorithms implementable at high speed, and propose two specific strategies.
@inproceedings{chuang1999matching,
title={Matching Output Queueing with a Combined Input Output Queued Switch},
author={Chuang, Shang-Tse and Goel, Ashish and McKeown, Nick and Prabhakar, Balaji},
booktitle={IEEE INFOCOM},
year={1999}
}
At very high aggregate bandwidths, output queueing is impractical because of insufficient memory bandwidth. This problem is getting worse: memory bandwidth is improving slowly, whereas the demand for network bandwidth continues to grow exponentially. The difficulty is that output-queued switches require memories that run at a speedup of N, where N is equal to the number of switch ports. This paper addresses the following question: Is it possible for a switch to exactly match output queueing with a reduced speedup? We prove that if virtual output queueing is used, a combined input-output queued switch is always work-conserving if its speedup is greater than N/2. This result is proved using a novel scheduling algorithm, the Home Territory Algorithm (HTA).
@inproceedings{mckeown1997matching,
title = {Matching Output Queueing with Combined Input and Output Queueing},
author = {Nick McKeown and Balaji Prabhakar and Mingyan Zhu},
booktitle = {Proceedings of the 35th Annual Allerton Conference on
Communication, Control, and Computing. Monticello, Illinois, October 1997},
year = {1997}
}
We consider the speedup needed for a combined input and output queued (CIOQ) switch to emulate an output-queued switch. It is known that a speedup of two suffices. We examine this question from a theoretical perspective, deriving conditions on the speedup factor that allow the CIOQ switch to precisely mimic the behavior of an ideal output-queued switch, including providing identical departure times for all packets.
@techreport{prabhakar1997speedup,
title = {On the Speedup Required for Combined Input and Output Queued Switching},
author = {Balaji Prabhakar and Nick McKeown},
institution = {Stanford University},
number = {CSL-TR-97-738},
note = {Also published in Automatica, Vol. 35, No. 12, December 1999},
year = {1997}
}
Input queueing is becoming increasingly used for high-bandwidth switches and routers. In previous work, it was proved that it is possible to achieve 100% throughput for input-queued switches using a combination of virtual output queueing and a scheduling algorithm called LQF. However, this is only a theoretical result: LQF is too complex to implement in hardware. In this paper we introduce a new algorithm called Longest Port First (LPF), which is designed to overcome the complexity problems of LQF, and can be implemented in hardware at high speed. By giving preferential service based on queue lengths, we prove that LPF can achieve 100% throughput.
@inproceedings{mekkittikul1998practical,
title={A Practical Scheduling Algorithm to Achieve 100\% Throughput in Input-Queued Switches},
author={Mekkittikul, Adisak and McKeown, Nick},
booktitle={IEEE INFOCOM},
year={1998}
}
In this paper we quantitatively evaluate three iterative algorithms for scheduling cells in a high-bandwidth input-queued ATM switch. In particular, we compare the performance of an algorithm described in this paper (iSLIP) with two recently proposed algorithms: PIM and iRRM. These algorithms find a maximal match between input and output ports using multiple iterations of request-grant-accept, and can be readily implemented in hardware at high speed. We compare their throughput and delay performance under various traffic conditions, and identify the trade-offs between complexity and performance.
@article{mckeown1998quantitative,
title={A Quantitative Comparison of Scheduling Algorithms for Input-Queued Switches},
author={McKeown, Nick and Anderson, Thomas E.},
journal={Computer Networks and ISDN Systems},
volume={30},
number={24},
year={1998}
}
We consider the problem of scheduling cells in an input-queued switch. Input-queued switches are attractive for high-speed networking because they require memory that operates only at the line rate rather than N times the line rate. However, input queuing suffers from head-of-line blocking, which limits throughput to approximately 58.6%. We describe scheduling algorithms that can overcome this limitation.
@article{mckeown1993scheduling,
title = {Scheduling Cells in an Input-Queued Switch},
author = {Nick McKeown and Pravin Varaiya and Jean Walrand},
journal = {IEE Electronics Letters},
pages = {2174--2175},
year = {1993}
}
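The roughly 58.6% head-of-line blocking limit cited above can be reproduced with a small saturation simulation (a Monte Carlo sketch of the standard HOL model; the port count, slot count and seed are arbitrary choices for the example):

```python
import random

def hol_throughput(n_ports, n_slots, seed=1):
    """Saturated FIFO input-queued switch: every input always has a cell,
    and only the head-of-line (HOL) cell can contend for its output."""
    rng = random.Random(seed)
    # Destination of each input's current HOL cell, uniform over outputs.
    hol = [rng.randrange(n_ports) for _ in range(n_ports)]
    served = 0
    for _ in range(n_slots):
        winners = {}
        # Each output serves one contender; ties broken in random order.
        for i in rng.sample(range(n_ports), n_ports):
            winners.setdefault(hol[i], i)
        for _out, i in winners.items():
            served += 1
            hol[i] = rng.randrange(n_ports)  # winner reveals its next cell
    return served / (n_ports * n_slots)
```

For a 16-port switch the simulated throughput settles around 0.60, approaching the 2 - sqrt(2) ≈ 0.586 limit as the port count grows.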
We describe a fast scheduling algorithm for input-queued switches. Input-queued switches are attractive at high line rates because they require memory bandwidth proportional to only the line rate rather than the aggregate switch rate. We present an iterative algorithm that converges quickly to a high-throughput matching between inputs and outputs, enabling practical implementation at high speeds.
@inproceedings{mckeown1995fast,
title = {A Fast Scheduling Algorithm for Input-Queued Switches},
author = {Nick McKeown and Jean Walrand},
booktitle = {Proceedings of 7th IEEE LAN/MAN Workshop, Florida, March 1995},
year = {1995}
}
We consider input-buffered ATM switches that provide guaranteed-rate service. We describe a class of scheduling algorithms that can guarantee a minimum rate to each input-output pair while achieving high overall switch throughput. The algorithms operate by computing matchings in a bipartite graph and can be implemented efficiently in hardware.
@inproceedings{hung1998atm,
title = {ATM Input-Buffered Switches with Guaranteed-Rate Property},
author = {Anthony Hung and George Kesidis and Nick McKeown},
booktitle = {Proc. IEEE ISCC '98, Athens, July 1998},
pages = {331--335},
year = {1998}
}
We describe the design and implementation of a fast crossbar scheduler for input-queued switches. The scheduler implements the iSLIP algorithm, which achieves high throughput with low implementation complexity. We present a hardware implementation that operates at speeds sufficient for gigabit-rate switches and describe the design tradeoffs involved in building practical crossbar schedulers.
@article{gupta1998design,
title = {Design and Implementation of a Fast Crossbar Scheduler},
author = {Pankaj Gupta and Nick McKeown},
journal = {IEEE Micro, Jan-Feb 1999},
note = {Presented at Hot Interconnects VI, Stanford University, August 1998},
year = {1998}
}
In this paper we consider policies for scheduling cells in an input-queued multicast switch. It is assumed that each input maintains a single queue for arriving multicast cells and that only the cell at the head of line (HOL) can be observed and scheduled at one time. The policies are assumed to be work-conserving, which means that cells may be copied to the outputs that they request over several cell times. When a scheduling policy decides which cells to schedule, contention may require that it leave a residue of cells to be scheduled in the next cell time. The selection of where to place the residue uniquely defines the scheduling policy. We prove that for a 2 x N switch, a policy that always concentrates the residue, subject to a natural fairness constraint, always outperforms all other policies. Simulation results indicate that this policy also performs well for more general M x N switches. We present a heuristic round-robin policy called mRRM that is simple to implement in hardware, fair, and performs almost as well as the concentrating policy.
@article{ahuja1997multicast,
title={Multicast Scheduling for Input-Queued Switches},
author={Ahuja, Ritesh and Prabhakar, Balaji and McKeown, Nick},
journal={IEEE Journal on Selected Areas in Communications},
volume={15},
number={5},
year={1997}
}
We analyze the performance of scheduling algorithms for virtual output queued (VOQ) switches under non-uniform traffic patterns. While many scheduling algorithms are known to achieve 100% throughput under uniform traffic, their performance under non-uniform traffic can vary significantly. We characterize the throughput regions of several common scheduling algorithms and provide guidelines for algorithm selection.
@techreport{mekkittikul1997scheduling,
title = {Scheduling VOQ Switches under Non-Uniform Traffic},
author = {Adisak Mekkittikul and Nick McKeown},
institution = {Stanford University},
number = {CSL-TR-97-747},
year = {1997}
}
It is well known that head-of-line (HOL) blocking limits the throughput of an input-queued switch with FIFO queues. Under certain conditions, the throughput can be shown to be limited to approximately 58%. It is also known that if non-FIFO queueing policies are used, the throughput can be increased. However, it has not been previously shown that if a suitable queueing policy and scheduling algorithm are used then it is possible to achieve 100% throughput for all independent arrival processes. In this paper we prove this to be the case using a simple linear programming argument and quadratic Lyapunov function. In particular, we assume that each input maintains a separate FIFO queue for each output and that the switch is scheduled using a maximum weight bipartite matching algorithm.
@inproceedings{mckeown1996achieving,
title={Achieving 100\% Throughput in an Input-Queued Switch},
author={McKeown, Nick and Anantharam, Venkat and Walrand, Jean},
booktitle={IEEE INFOCOM},
year={1996}
}
This paper presents the design of a scheduler for an input-queued multicast switch. It is assumed that each input maintains a single queue for arriving multicast cells and that only the cell at the head of line (HOL) can be observed and scheduled at one time. The scheduler is required to be work-conserving, which means that no output port may be idle as long as there is an input cell destined to it. The scheduler is also required to be fair, which means that no input cell may be held at HOL for more than M cell times (M is the number of input ports). The aim is to find a scheduling policy that achieves high throughput and minimizes input queue latency. When a scheduling policy decides which cells to schedule, contention may require that it leave a residue of cells to be scheduled in the next cell time. The selection of where to place the residue uniquely defines the scheduling policy. We prove that a policy that always concentrates the residue, subject to our fairness constraint, always outperforms all other policies. We present one such policy, called TATRA, and analyze it geometrically. We also present a heuristic round-robin policy called mRRM that is simple to implement in hardware, fair, and performs quite well when compared to a concentrating algorithm.
@inproceedings{prabhakar1995designing,
title = {Designing a Multicast Switch Scheduler},
author = {Balaji Prabhakar and Nick McKeown},
booktitle = {Proceedings of the 33rd Annual Allerton Conference on
Communication, Control, and Computing. pp. 984-993.
Monticello, Illinois, October 1995},
year = {1995}
}
We describe an architecture for an 8 Tb/s ATM interconnection network using optical wavelength division multiplexing (WDM). The architecture combines high-speed ATM switching with WDM optical networking to achieve very high aggregate bandwidths. We analyze the performance of the proposed architecture and discuss the feasibility of implementation using available optical and electronic technologies.
@inproceedings{mekkittikul1996,
title = {8 Tb/s ATM Interconnection through optical WDM networks},
author = {Adisak Mekkittikul and D. Sadot and L. G. Kazovsky and Nick McKeown},
booktitle = {High-Speed Semiconductor Laser Sources, San Jose, CA, Vol. 2684, pp. 186-98, 1-2 February, 1996},
year = {1996}
}
In this paper we consider policies for scheduling cells in an input-queued multicast switch. It is assumed that each input maintains a single queue for arriving multicast cells and that only the cell at the head of line (HOL) can be observed and scheduled at one time. The policies are assumed to be work-conserving, which means that cells may be copied to the outputs that they request over several cell times. When a scheduling policy decides which cells to schedule, contention may require that it leave a residue of cells to be scheduled in the next cell time. The selection of where to place the residue uniquely defines the scheduling policy. We prove that for a 2 x N switch, a policy that always concentrates the residue, subject to a natural fairness constraint, always outperforms all other policies. Simulation results indicate that this policy also performs well for more general M x N switches. We present a heuristic round-robin policy called mRRM that is simple to implement in hardware, fair, and performs almost as well as the concentrating policy.
@inproceedings{mckeown1996scheduling,
title = {Scheduling Multicast Cells in an Input-Queued Switch},
author = {Nick McKeown and Balaji Prabhakar},
booktitle = {Proceedings of IEEE Infocom '96, San Francisco, Vol 1, pp. 271-278, March 1996},
year = {1996}
}
We introduce Tetris models for analyzing multicast switches. In a multicast switch, a single input packet may need to be delivered to multiple outputs simultaneously. We develop an analytical framework based on the Tetris game metaphor that captures the combinatorial nature of multicast scheduling and derive performance bounds for various scheduling policies.
@inproceedings{prabhakar1996tetris,
title = {Tetris Models for Multicast Switches},
author = {Balaji Prabhakar and Nick McKeown and Jean Mairesse},
booktitle = {Proceedings of the Princeton Conference, Princeton, March 1996},
year = {1996}
}
This thesis addresses the scheduling problem for input-queued cell switches. We describe and analyze scheduling algorithms that determine which cells are transferred across the switch fabric in each time slot. We present new algorithms that are simple enough to implement in hardware at high speed, yet provide high throughput and low delay. The main contributions include the iSLIP algorithm, which uses iterated round-robin matching to achieve 100% throughput for uniform traffic, and new analysis techniques for understanding the behavior of input-queued switches.
@phdthesis{mckeown1995scheduling,
title={Scheduling Algorithms for Input-Queued Cell Switches},
author={McKeown, Nick},
school={University of California at Berkeley},
year={1995}
}
The process of categorizing packets into “flows” in an Internet router is called packet classification. All packets belonging to the same flow obey a pre-defined rule and are processed in a similar manner by the router. For example, all packets with the same source and destination IP addresses may be defined to form a flow. Packet classification is needed for non “best-effort” services, such as firewalls and quality of service; services that require the capability to distinguish and isolate traffic in different flows for suitable processing. In general, packet classification on multiple fields is a difficult problem. Hence, researchers have proposed a variety of algorithms which, broadly speaking, can be categorized as “basic search algorithms,” geometric algorithms, heuristic algorithms, or hardware-specific search algorithms. In this tutorial we describe algorithms that are representative of each category, and discuss which type of algorithm might be suitable for different applications.
@article{gupta2001algorithms,
title={Algorithms for Packet Classification},
author={Gupta, Pankaj and McKeown, Nick},
journal={IEEE Network},
year={2001}
}
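The "basic search" category in this tutorial is essentially a priority-ordered linear scan over the rule list. A minimal sketch (the rules, field choices and default action are invented for the example):

```python
from ipaddress import ip_address, ip_network

# Hypothetical two-field classifier: (priority, source prefix,
# destination prefix, action); the lowest priority number wins.
RULES = [
    (1, ip_network("10.0.0.0/8"), ip_network("192.168.1.0/24"), "deny"),
    (2, ip_network("10.0.0.0/8"), ip_network("0.0.0.0/0"), "allow"),
    (3, ip_network("0.0.0.0/0"), ip_network("0.0.0.0/0"), "deny"),
]

def classify(src, dst):
    """Linear scan: correct but O(number of rules) per packet, which is
    exactly why the faster algorithm families in the tutorial exist."""
    s, d = ip_address(src), ip_address(dst)
    for _prio, src_net, dst_net, action in sorted(RULES):
        if s in src_net and d in dst_net:
            return action
    return "deny"  # default action when no rule matches
```

A packet from 10.1.2.3 to 192.168.1.5 hits the first rule and is denied; the same source to any other destination falls through to the second rule and is allowed.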
We present dynamic algorithms for packet classification that provide worst-case performance guarantees. Packet classification is the process of categorizing packets based on multiple header fields, and is a fundamental operation in routers and firewalls. Our algorithms support efficient insertion and deletion of classification rules while maintaining fast lookup times with provable worst-case bounds.
@inproceedings{gupta2000dynamic,
title = {Dynamic Algorithms with Worst-case Performance for Packet Classification},
author = {Pankaj Gupta and Nick McKeown},
booktitle = {Proceedings IFIP Networking, May 2000, Paris, France},
year = {2000}
}
Routers classify packets to determine which flow they belong to, and to decide what service they should receive. Classification may, in general, be based on an arbitrary number of fields in the packet header. Performing classification quickly on an arbitrary number of fields is known to be difficult, and has poor worst-case performance. In this paper, we consider a number of classifiers taken from real networks. We find that the classifiers contain considerable structure and redundancy that can be exploited by the classification algorithm. In particular, we find that a simple multi-stage classification algorithm, called RFC (recursive flow classification), can classify 30 million packets per second in pipelined hardware, or one million packets per second in software.
@inproceedings{gupta1999packet,
title={Packet Classification on Multiple Fields},
author={Gupta, Pankaj and McKeown, Nick},
booktitle={ACM SIGCOMM},
year={1999}
}
Internet routers that operate as firewalls, or provide a variety of service classes, perform different operations on different flows. A flow is defined to be all the packets sharing common header characteristics; for example a flow may be defined as all the packets between two specific IP addresses. In order to classify a packet, a router consults a table (or classifier) using one or more fields from the packet header to search for the corresponding flow; the rules could be based on several fields of the packet, including layer 2, 3, 4 and maybe 5 addressing and protocol information. The classifier is a list of rules that identify each flow and the actions to be performed on each. The simplest, and most well-known, form of packet classification is used in routing IP datagrams, where each rule specifies a destination prefix. The associated action is the IP address of the next-hop where the packet needs to be routed to. The classification process requires determining the longest prefix which matches the destination address of the packet. With the increasing demands on router performance, there is a need for algorithms that can classify packets quickly with minimal storage requirements and allow new flows to be frequently added and deleted. In the worst case, packet classification is hard, requiring routers to use heuristics that exploit the structure present in the classifiers.
@inproceedings{gupta1999hierarchical,
title={Packet Classification using Hierarchical Intelligent Cuttings},
author={Gupta, Pankaj and McKeown, Nick},
booktitle={Hot Interconnects VII},
year={1999}
}
The increased bandwidth in the Internet puts great demands on network routers; for example, to route minimum sized Gigabit Ethernet packets, an IP router must process about 1.5 million packets per second per port. Using the "rule-of-thumb" that it takes roughly 1000 packets per second for every 10^6 bits per second of line rate, an OC-192 line requires 10 million routing lookups per second; well above current router capabilities. One limitation of router performance is the route lookup mechanism. IP routing requires that a router perform a longest-prefix-match address lookup for each incoming datagram in order to determine the datagram's next hop. We present a route lookup mechanism that, when implemented in a pipelined fashion in hardware, can achieve one route lookup every memory access. With current 50 ns DRAM, this corresponds to approximately 20 million packets per second; much faster than current commercially available routing lookup schemes. We also present novel schemes for performing quick updates to the forwarding table in hardware. We demonstrate using real routing update patterns that the routing tables can be updated with negligible overhead to the central processor.
@inproceedings{gupta1998routing,
title={Routing Lookups in Hardware at Memory Access Speeds},
author={Gupta, Pankaj and Lin, Steven and McKeown, Nick},
booktitle={IEEE INFOCOM},
year={1998}
}
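The longest-prefix-match operation at the heart of this lookup scheme can be sketched with a binary trie (a software illustration only; the paper's contribution is a pipelined hardware mechanism, not this data structure):

```python
def add_route(trie, prefix, plen, next_hop):
    """Insert an IPv4 prefix (as a 32-bit int) of length plen into a trie
    built from nested dicts keyed by bit value."""
    node = trie
    for b in range(plen):
        bit = (prefix >> (31 - b)) & 1
        node = node.setdefault(bit, {})
    node["hop"] = next_hop

def lookup(trie, addr):
    """Walk the trie bit by bit, remembering the last next-hop seen,
    so the longest matching prefix wins."""
    node, best = trie, None
    for b in range(32):
        if "hop" in node:
            best = node["hop"]
        nxt = node.get((addr >> (31 - b)) & 1)
        if nxt is None:
            return best
        node = nxt
    return node.get("hop", best)

table = {}
add_route(table, 0x0A000000, 8, "A")   # 10.0.0.0/8  -> next hop A
add_route(table, 0x0A010000, 16, "B")  # 10.1.0.0/16 -> next hop B
```

A lookup of 10.1.2.3 returns "B" (the /16 is the longer match), while 10.2.0.1 falls back to "A" and an address outside both prefixes matches nothing.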
In this paper, we present the Tiny Tera: a small packet switch with an aggregate bandwidth of 320 Gb/s. The Tiny Tera is a CMOS-based input-queued, fixed-size packet switch suitable for a wide range of applications such as a high-performance ATM switch, the core of an Internet router or as a fast multiprocessor interconnect. Using off-the-shelf technology, we plan to demonstrate that a very high-bandwidth switch can be built without the need for esoteric optical switching technology. By employing novel scheduling algorithms for both unicast and multicast traffic, the switch will have a maximum throughput close to 100%. Using novel high-speed chip-to-chip serial link technology, we plan to reduce the physical size and complexity of the switch, as well as the system pin-count.
@inproceedings{mckeown1996tinytera,
title={The Tiny Tera: A Packet Switch Core},
author={McKeown, Nick and Izzard, Martin and Mekkittikul, Adisak and Ellersick, Bill and Horowitz, Mark},
booktitle={Hot Interconnects V},
year={1996}
}
In this paper, we present the Tiny Tera: a small packet switch with an aggregate bandwidth of 320 Gb/s. The Tiny Tera is a CMOS-based input-queued, fixed-size packet switch suitable for a wide range of applications such as a high-performance ATM switch, the core of an Internet router or as a fast multiprocessor interconnect. Using off-the-shelf technology, we plan to demonstrate that a very high-bandwidth switch can be built without the need for esoteric optical switching technology. By employing novel scheduling algorithms for both unicast and multicast traffic, the switch will have a maximum throughput close to 100%. Using novel high-speed chip-to-chip serial link technology, we plan to reduce the physical size and complexity of the switch, as well as the system pin-count.
@inproceedings{mckeown1996tiny,
title = {The Tiny Tera: A Small High-Bandwidth Packet Switch Core},
author = {Nick McKeown and Martin Izzard and Adisak Mekkittikul and Bill Ellersick and Mark Horowitz},
booktitle = {Proceedings of Hot Interconnects IV, pp. 161-173, Stanford, August 1996},
year = {1996}
}
A 32 x 32 synchronous crossbar chip was designed in a 0.27 um CMOS technology for use in a high-speed network switch [1]. The crossbar chip uses 32 Asymmetric Serial Links [2][3] to achieve high speed at the interfaces and to reduce both power and area. The crossbar switch core is implemented with static CMOS multi-stage multiplexors with multicast capability. The chip operates successfully with links running at 1.6 Gb/s when all channels and the switch core are operating. The crossbar chip consumes 5 W and provides a total bandwidth above 50 Gb/s. During normal operation, all 32 serial link blocks work as Dumb Ends of the Asymmetric Serial Links, which do not adjust their own transmitter and receiver clocks; for testing purposes, two serial link blocks were designed as Smart Ends with PLLs to adjust their clocks, but may also be operated as Dumb Ends. The crossbar controller manages the initial calibration process, disabling links before they are properly synchronized. As a result, when a serial link is disconnected it will never be synchronized, so it is always disabled by the controller without any explicit logic. The Asymmetric Serial Links use differential-pair open-drain transmitters and current-integrating receivers [6].
@inproceedings{chang1999crossbar,
title={A 50 Gb/s 32x32 CMOS Crossbar Chip Using Asymmetric Serial Links},
author={Chang, Kun-Yung Ken and Chuang, Shang-Tse and McKeown, Nick and Horowitz, Mark},
booktitle={Symposium on VLSI Circuits},
year={1999}
}
In this paper, we present the Tiny Tera: a small packet switch with an aggregate bandwidth of 320 Gb/s. The Tiny Tera is a CMOS-based input-queued, fixed-size packet switch suitable for a wide range of applications such as a high-performance ATM switch, the core of an Internet router or as a fast multiprocessor interconnect. Using off-the-shelf technology, we plan to demonstrate that a very high-bandwidth switch can be built without the need for esoteric optical switching technology. By employing novel scheduling algorithms for both unicast and multicast traffic, the switch will have a maximum throughput close to 100%. Using novel high-speed chip-to-chip serial link technology, we plan to reduce the physical size and complexity of the switch, as well as the system pin-count.
@inproceedings{mckeown1996tinyatm,
title = {The Tiny Tera: A Small High-Bandwidth ATM Switch},
author = {Nick McKeown and Martin Izzard and Adisak Mekkittikul},
booktitle = {Proceedings of SPIE 96, Vol. 2917, pp. 387-397, Boston, November 1996},
year = {1996}
}
This paper describes the design of a novel CMOS 2 Gb/s asymmetric serial link. The serial link is designed for systems that use high speed chip-to-chip communications. In such designs, power dissipation is a common problem, particularly when multiple serial links are required on one chip. The power arises primarily from the phase adjustment circuitry used to align data with the clock. This circuitry is usually placed at the receiver, but in our asymmetric link design we take a different approach. We first assume that a link consists of two unidirectional connections, one for each direction of the link. We move the phase adjustment circuitry from one end of the link to the other, adjusting the phase of the transmitter rather than the receiver. Although this does not reduce overall system power, it allows us to choose the location of the phase adjustment circuitry, moving it from chips with a large number of links to chips with a smaller number. The link was designed for use in the Tiny Tera packet switch. Simulations suggest that the power dissipation of serial links on the crossbar switch can be reduced by a factor of 4.
@inproceedings{chang1997,
title = {A 2 Gb/s Asymmetric Serial Link for High-Bandwidth Packet Switches},
author = {Ken K.-Y. Chang, William Ellersick, Shang-Tse Chuang, Stefanos Sidiropoulos, Mark Horowitz, Nick McKeown},
booktitle = {Hot Interconnects VI, pp. 171-179, Stanford University, August 1997},
year = {1997}
}
We analyze the performance of circuit switching as an alternative to packet switching in the Internet. While packet switching has traditionally been considered superior for bursty data traffic, we show that circuit switching can be surprisingly efficient for certain Internet traffic patterns. We compare the throughput and delay characteristics of circuit and packet switching under realistic traffic conditions.
@article{molinero-fernandez2003performance,
title = {The performance of circuit switching in the Internet},
author = {Pablo Molinero-Fernandez and Nick McKeown},
journal = {OSA Journal of Optical Networking, Vol. 2, No. 4, March 2003},
year = {2003}
}
The Internet Protocol (IP) has rapidly become the dominant protocol for data networks and is increasingly being used for voice, video, and other communications. We examine whether IP will eventually replace all other communications technologies, or whether some applications will continue to use specialized networks and protocols. We consider the technical, economic, and operational factors that will determine the extent to which IP takes over the world of communications.
@inproceedings{molinero2002ip,
title={Is IP Going to Take Over the World (of Communications)?},
author={Molinero-Fernandez, Pablo and McKeown, Nick and Zhang, Hui},
booktitle={HotNets-I},
year={2002}
}
We propose TCP Switching, a mechanism that exposes circuit-like behavior to IP networks. The key idea is to identify long-lived TCP flows and switch them onto dedicated circuits, while short flows continue to be packet-switched. We show that this hybrid approach can reduce the load on packet routers while maintaining the flexibility of IP networking.
@inproceedings{molinero-fernandez2001tcp,
title = {TCP Switching: Exposing circuits to IP},
author = {Pablo Molinero-Fernandez and Nick McKeown},
booktitle = {Hot Interconnects IX, Stanford University, August 2001},
year = {2001}
}
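The long-lived-flow test described in this abstract can be sketched as a byte-count threshold kept at the edge router (the threshold value, flow keying and function name are assumptions for illustration, not details from the paper):

```python
def place_flow(byte_counts, flow_id, nbytes, threshold=1_000_000):
    """Decide whether a flow should ride a dedicated 'circuit'.

    byte_counts is per-flow state kept by the edge device; once a flow's
    cumulative byte count crosses the (invented) threshold it is treated
    as long-lived and moved onto a circuit, while short flows stay
    packet-switched.
    """
    byte_counts[flow_id] = byte_counts.get(flow_id, 0) + nbytes
    return "circuit" if byte_counts[flow_id] >= threshold else "packet"
```

A flow stays packet-switched until its cumulative bytes cross the threshold, after which its traffic is placed on the circuit path.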
We describe a 2.5 Tb/s switch core with an LCS (Linecard to Switch) interface. The switch core is designed for high-bandwidth packet switching and uses a single-stage crossbar architecture with the iSLIP scheduling algorithm. We describe the chip design, the interface protocol, and the performance characteristics of the switch core.
@inproceedings{mckeown2001tbs,
title = {A 2.5Tb/s Switch Core with LCS Interface},
author = {Nick McKeown and Costas Calamvokis and Shang-tse Chuang},
booktitle = {Hot Chips 13, Stanford, CA, August 2001},
year = {2001}
}
We propose TCP Switching, a mechanism that exposes circuit-like behavior to IP networks. The key idea is to identify long-lived TCP flows and switch them onto dedicated circuits, while short flows continue to be packet-switched. We show that this hybrid approach can reduce the load on packet routers while maintaining the flexibility of IP networking.
@article{molinero-fernandez2002tcp,
title = {TCP Switching: Exposing circuits to IP},
author = {Pablo Molinero-Fernandez and Nick McKeown},
journal = {IEEE Micro},
volume = {22},
number = {1},
year = {2002}
}
Recently there has been much interest in combining the speed of layer-2 switching with the features of layer-3 routing. This has been prompted by numerous proposals, including: IP Switching, Tag Switching, ARIS, CSR, and IP over ATM. In this paper, we study IP Switching using a detailed simulation model. We simulate the handling of IP traffic by an IP Switch, comparing it with a conventional store-and-forward IP router. We find that IP Switching can significantly reduce the amount of IP route processing needed in the router, and that most data packets can be switched directly at layer 2.
@inproceedings{lin1997simulation,
title={A Simulation Study of IP Switching},
author={Lin, Steven and McKeown, Nick},
booktitle={ACM SIGCOMM},
year={1997}
}
We describe the Bay Bridge, a high-speed bridge/router designed to interconnect FDDI and SMDS networks. The Bay Bridge operates at rates significantly higher than existing bridge/routers, using a custom hardware architecture optimized for high-throughput packet forwarding. We describe the design, implementation, and performance of the system.
@inproceedings{mckeown1992bay,
title = {The Bay Bridge: A High Speed Bridge/Router},
author = {Nick McKeown and Richard Edell and My T. Le},
booktitle = {IFIP Workshop},
address = {Stockholm, Sweden},
year = {1992}
}
@techreport{mckeownarchitecture,
title = {Architecture and Performance of The BayBridge: A High Speed Bridge/Router between FDDI and SMDS},
author = {Nick McKeown and Richard Edell and My T. Le},
institution = {Technical Report},
year = {}
}
We describe a high-performance SMDS (Switched Multi-megabit Data Service) interface operating at the STS-3c rate of 155 Mb/s. The interface is designed for use in high-speed bridge/routers and provides efficient packet processing for SMDS traffic. We describe the architecture, implementation, and performance characteristics of the interface.
@techreport{lehigh,
title = {A High Performance SMDS Interface at STS-3c Rate},
author = {My T. Le and Nick McKeown and Richard Edell},
institution = {Technical Report},
year = {}
}
We describe an output-buffered ATM packet switch architecture for integrated-services communication networks. Output-buffered switches provide ideal performance characteristics, including deterministic delay bounds and lossless operation. We analyze the design requirements and present an architecture that makes output buffering practical for ATM networks carrying mixed traffic types.
@inproceedings{kesidis1997outputbuffer,
title = {Output-buffer ATM Packet Switching for Integrated-Services Communication Networks},
author = {George Kesidis and Nick McKeown},
booktitle = {IEEE ICC '97},
address = {Montreal, Canada},
month = {August},
year = {1997}
}
In the ideal CS1 classroom, we should understand programming process---how student code evolves over time. However, for graphics-based programming assignments, the task of understanding and grading final solutions, let alone thousands of intermediate steps, is incredibly labor-intensive. In this work, we present a challenge, a dataset, and a promising first solution to autonomously use image output to identify functional, intermediate stages of a student solution. By using computer vision techniques to associate visual output of intermediate student code with functional progress, we supplement a lot of the teacher labor associated with understanding graphics-based, open-ended assignments. We hope our publication of the dataset used in our case study sparks discussion in the community on how to analyze programs with visual program output.
@inproceedings{yan2019pyramid,
title={The PyramidSnapshot Challenge: Understanding Student Process from Visual Output of Programs},
author={Yan, Lisa and McKeown, Nick and Piech, Chris},
booktitle={ACM SIGCSE},
year={2019}
}
Co-Teaching Computer Science Across Borders: Human-Centric Learning at Scale (Best Paper, 2020)
Programming is fast becoming a required skill set for students in every country. We present CS Bridge, a model for cross-border co-teaching of CS1, along with a corresponding open-source course-in-a-box curriculum made for easy localization. In the CS Bridge model, instructors and student-teachers from different countries come together to teach a short, stand-alone CS1 course to hundreds of local high school students. The corresponding open-source curriculum has been specifically designed to be easily adapted to a wide variety of local teaching practices, languages, and cultures. Over the past six years, the curriculum has been used to teach CS1 material to over 1,000 high school students in Colombia, the Czech Republic, Turkey, and Guinea. A large majority of our students continue on to study CS or CS-related fields in university. More importantly, many of our undergraduate student-teachers stay involved with teaching beyond the program. Joint teaching creates a positive, high-quality learning experience for students around the world and a powerful, high-impact professional development experience for the teaching team---instructors and student-teachers alike.
@inproceedings{piech2020coteaching,
title={Co-Teaching Computer Science Across Borders: Human-Centric Learning at Scale},
author={Piech, Chris and Yan, Lisa and Einstein, Lisa and Saavedra, Ana and Bozkurt, Baris and Sestakova, Eliska and Guth, Ondrej and McKeown, Nick},
booktitle={ACM L@S},
year={2020}
}
We present TMOSS, a tool that uses intermediate snapshots of student programming assignments to identify excessive collaboration. Unlike traditional plagiarism detection tools such as MOSS that compare only final submissions, TMOSS analyzes the evolution of student code over time, comparing intermediate versions to detect patterns of code sharing that would otherwise be missed. We evaluate TMOSS on a large introductory programming course at Stanford and show that it can more accurately identify cases where students have shared code.
@inproceedings{yan2018tmoss,
title={TMOSS: Using Intermediate Assignment Work to Understand Excessive Collaboration in Large Classes},
author={Yan, Lisa and McKeown, Nick and Sahami, Mehran and Piech, Chris},
booktitle={ACM SIGCSE},
year={2018}
}
In the past five years, the graduate networking course at Stanford has assigned over 200 students the task of reproducing results from over 40 networking papers. We began the project as a means of teaching both engineering rigor and critical thinking, qualities that are necessary for careers in networking research and industry. We have observed that reproducing research can simultaneously be a tool for education and a means for students to contribute to the networking community. Through this editorial we describe our project in reproducing network research and show through anecdotal evidence that this project is important for both the classroom and the networking community at large, and we hope to encourage other institutions to host similar class projects.
@article{yan2017learning,
title={Learning Networking by Reproducing Research Results},
author={Yan, Lisa and McKeown, Nick},
journal={ACM SIGCOMM Computer Communication Review},
year={2017}
}
Our goal is to enable fast prototyping of networking hardware (e.g. modified Ethernet switches and IP routers) for teaching and research. To this end, we built and made available the NetFPGA platform. Starting from open-source reference designs, students and researchers create their designs in Verilog, and then download them to the NetFPGA board where they can process packets at line-rate for 4 ports of 1GE. The board is becoming widely used for teaching and research, and so it has become important to make it easy to re-use modules and designs. We have created a standard interface between modules, making it easier to plug modules together in pipelines, and to create new re-usable designs. In this paper we describe our modular design, and how we have used it to build several systems, including our IP router reference design and some extensions to it.
@inproceedings{naous2008netfpga,
title = {NetFPGA: Reusable Router Architecture for Experimental Research},
author = {Jad Naous and Glen Gibb and Sara Bolouki and Nick McKeown},
booktitle = {ACM SIGCOMM PRESTO Workshop},
address = {Seattle, WA},
month = {August},
year = {2008}
}
We describe how the NetFPGA platform, an FPGA-based research design platform, has fueled advances in networking research. NetFPGA provides researchers with a hardware platform for building and testing networking systems at gigabit rates. We survey the research projects that have used NetFPGA and describe how the platform has enabled rapid prototyping of new networking ideas.
@article{blott2010fpga,
title = {FPGA Research Design Platform Fuels Network Advances},
author = {Michaela Blott and Jonathan Ellithorpe and Nick McKeown and Kees Vissers and Hongyi Zeng},
journal = {Xcell Journal},
note = {Fourth Quarter 2010},
year = {2010}
}
The NetFPGA platform enables students and researchers to build high-performance networking systems using Field Programmable Gate Array (FPGA) hardware. A new version of the NetFPGA platform has been developed and is available for use by the academic community. The NetFPGA platform has modular interfaces that enable development of complex hardware designs by integration of simple building blocks. FPGA logic is used to implement the core data processing functions while software running on an attached host computer or embedded cores within the device implement control functions. Reference designs and component libraries have been developed for the CS344 course at Stanford University and an open-source Verilog reference design is available for download from the project website.
@article{gibb2008netfpga,
title={NetFPGA: An Open Platform for Teaching How to Build Gigabit-Rate Network Switches and Routers},
author={Gibb, Glen and Lockwood, John W. and Naous, Jad and Hartke, Paul and McKeown, Nick},
journal={IEEE Transactions on Education},
year={2008}
}
We describe our experience using reconfigurable networking hardware to teach networking concepts. Students build complete networking systems, including routers and switches, using the NetFPGA platform. The hands-on approach gives students a deeper understanding of how networking hardware works and enables them to experiment with new ideas in a realistic setting.
@inproceedings{casado2005teaching,
title = {Teaching Networking Hardware},
author = {Martin Casado and Gregory Watson and Nick McKeown},
booktitle = {ACM ITiCSE},
address = {Monte de Caparica, Portugal},
month = {June},
year = {2005}
}
We present an educational platform for teaching the design, debugging and deployment of real networking equipment in the operational Internet. The emphasis of our work is on teaching and, therefore, on providing an environment that is flexible, robust, low cost and easy to use. The platform is built around 'NetFPGAs': custom boards containing eight Ethernet ports and two FPGAs. The boards can be programmed and debugged over the network, and therefore allow students to build sophisticated hardware, deploy and test it from the other side of the country, without ever having to see the physical board. NetFPGA boards, when used with VNS (Virtual Network System, another tool we have developed), can be integrated into dynamically configurable network topologies reachable from the Internet. VNS enables a user-space process running on any remote computer to function as a system controller for the NetFPGA boards, and a small network of commodity PCs running Linux can emulate thousands of different isolated virtual networks running real services. NetFPGA and VNS are used at Stanford in a graduate-level networking course to teach router implementation in hardware and software.
@inproceedings{casado2005reconfigurable,
title = {Reconfigurable Networking Hardware: A Classroom Tool},
author = {Martin Casado and Gregory Watson and Nick McKeown},
booktitle = {Hot Interconnects 13},
address = {Stanford, CA},
month = {August},
year = {2005}
}
We present Clack, a graphical environment for teaching students how Internet routers work and other core networking concepts. Clack is a router written as a Java applet, and routes live network traffic in real-time. Students can look inside the router to see how packets are processed, and watch the dynamics of the queues. They can modify and enhance the router, making it process packets as they wish. Clack provides multiple views of the operational router including the full network topology, the router's software components, and a packet-level view of the traffic as it passes through the router. Clack's detailed visual interface to the software internals of a functioning router, as well as its ability to modify and observe live Internet traffic, provide a unique environment to aid in networking education. Over the last two years, Clack has been used in the classroom at six universities. Feedback from the students through anonymous, formal evaluations has been positive. In this paper we describe the goals and design of Clack as well as our experiences using it in the classroom.
@inproceedings{wendlandt2006clack,
title = {The Clack Graphical Router: Visualizing Network Software},
author = {Dan Wendlandt and Martin Casado and Paul Tarjan and Nick McKeown},
booktitle = {ACM Symposium on Software Visualization},
month = {September},
year = {2006}
}
The NetFPGA platform enables students and researchers to build high-performance networking systems in hardware. A new version of the NetFPGA platform has been developed and is available for use by the academic community. The NetFPGA 2.1 platform now has interfaces that can be parameterized, therefore enabling development of modular hardware designs with varied word sizes. It also includes more logic and faster memory than the previous platform. Field Programmable Gate Array (FPGA) logic is used to implement the core data processing functions while software running on embedded cores within the FPGA and/or programs running on an attached host computer implement only control functions. Reference designs and component libraries have been developed for the CS344 course at Stanford University. Open-source Verilog code is available for download from the project website.
@inproceedings{lockwood2007netfpga,
title = {NetFPGA - An Open Platform for Gigabit-rate Network Switching and Routing},
author = {John W. Lockwood and Nick McKeown and Greg Watson and Glen Gibb and Paul Hartke and Jad Naous and Ramanan Raghuraman and Jianying Luo},
booktitle = {MSE 2007},
address = {San Diego, CA},
month = {June},
year = {2007}
}
We describe the Stanford Virtual Router, a software tool designed for teaching networking concepts and simulating network behavior. The virtual router allows students to build and configure network topologies, implement routing protocols, and observe packet-level behavior without requiring physical networking hardware. We describe the tool's architecture and our experience using it in networking courses.
@article{casado2002stanford,
title = {The Stanford Virtual Router: a teaching tool and network simulator},
author = {Martin Casado and Vikram Vijayaraghavan and Guido Appenzeller and Nick McKeown},
journal = {ACM SIGCOMM Computer Communication Review},
volume = {32},
number = {3},
year = {2002}
}
Machine learning is increasingly being used in the criminal justice system, particularly for recidivism prediction and bail decisions. We propose a new direction: instead of using machine learning to make predictions about people, we should use it to improve the quality and consistency of human decision-making in criminal law. We identify applications where ML can reduce bias and improve accuracy, including analyzing police reports, improving forensic evidence analysis, and helping to identify potential wrongful convictions.
@article{bell2021new,
title={A New Direction for Machine Learning in Criminal Law},
author={Bell, Kristen and Hong, Jenny and McKeown, Nick and Voss, Catalin},
year={2021}
}
We explore whether overlay hosting services can make IP ossification irrelevant. IP ossification refers to the difficulty of deploying new protocols and services in the Internet due to the large installed base of IP routers. We argue that overlay networks and hosting services can provide a practical path for deploying new network services without requiring changes to the underlying IP infrastructure.
@inproceedings{turner2007can,
title = {Can Overlay Hosting Services Make IP Ossification Irrelevant?},
author = {J. Turner and N. McKeown},
booktitle = {cabernet.cs.princeton.edu, 2007},
year = {2007}
}
We believe this paper is the first extensive user-study of whitelisting email addresses. While whitelists are common in social networking and instant messaging (e.g., buddy lists), they have not caught on in email systems. Instead, most of us use spam filters that try to identify all the senders we do not want to accept email from. With whitelists we only need to identify the much smaller set of users who legitimately send us email - generally most of these users are known to us. In our study we built and deployed a whitelisting email service - named DOEmail - along with a Thunderbird add-on to try and make whitelists easy to manage. During the past two years, over 120 users have used DOEmail. As expected, we find that almost no spam makes it to users' inboxes, and less than 1% of legitimate email is mis-classified. We measure how hard it is for users to manage their whitelists, and conclude that after initial setup, the burden is very low. Our system uses challenge-responses to identify whether unknown senders are legitimate. The jury is still out as to how effective they are, and we continue to study them.
@inproceedings{erickson2008effectiveness,
title = {The Effectiveness of Whitelisting: a User-Study},
author = {David Erickson and Martin Casado and Nick McKeown},
booktitle = {Conference on Email and Anti-Spam (CEAS)},
address = {Mountain View, CA},
month = {August},
year = {2008}
}
Our goal is to build passive monitoring equipment for use at 10 Gb/s (e.g. 10GE and OC-192) and above. We already have in place an OC-48 passive monitoring system for capturing and storing a detailed record for every packet. But because of constraints on storage and bus bandwidth this will not be feasible at 10 Gb/s and above. Therefore, taking advantage of the fact that packets can be considered as belonging to flows, our system will store per-flow records that are created at the time of capture, and stored alongside small per-packet records. This way storage requirements can be reduced several-fold. Results indicate that it will be possible to capture and store detailed flow information at OC-192 without losing much information compared to our OC-48 packet traces.
@inproceedings{iannaccone2001monitoring,
title={Monitoring Very High Speed Links},
author={Iannaccone, Gianluca and Diot, Christophe and Graham, Ian and McKeown, Nick},
booktitle={ACM SIGCOMM Internet Measurement Workshop},
year={2001}
}
We describe Xdistribute, a system for distributing processes across multiple machines. Xdistribute provides a simple programming model that allows applications to take advantage of distributed computing resources transparently. We describe the system architecture, the distribution mechanisms, and the performance characteristics of the system.
@techreport{pettyxdistribute,
title = {Xdistribute: A Process Distribution System},
author = {Karl Petty and Nick McKeown},
institution = {Technical Report},
year = {}
}
We address the problem of how to bill users for TCP traffic and how to design pricing schemes that promote efficient use of network resources. We analyze the interaction between TCP's congestion control mechanism and usage-based pricing, and show that naive pricing approaches can lead to undesirable outcomes. We propose pricing mechanisms that are compatible with TCP behavior and provide proper incentives for users to choose appropriate service levels.
@article{edell1995billing,
title={Billing Users and Pricing for TCP},
author={Edell, Richard and McKeown, Nick and Varaiya, Pravin},
journal={IEEE JSAC},
year={1995}
}
This paper describes the accomplishments and directions of the California PATH program in automatic vehicle control. The research and development work progresses from individual vehicle control, through coordination of closely spaced vehicles in platoons, to the full automation of mixed traffic on a highway. We describe the key technology developments and field demonstrations, including longitudinal and lateral vehicle control, vehicle-to-vehicle communication, and automated highway systems.
@article{shladover1991automatic,
title={Automatic Vehicle Control Developments in the PATH Program},
author={Shladover, Steven E. and Desoer, Charles A. and Hedrick, J. Karl and Tomizuka, Masayoshi and Walrand, Jean and Zhang, Wei-Bin and McMahon, Donn H. and Peng, Huei and Sheikholeslam, Shahab and McKeown, Nick},
journal={IEEE Transactions on Vehicular Technology},
volume={40},
number={1},
month={February},
year={1991}
}