The distributed edge: How MidoNet enables hyper-scale networks for cloud service providers

Introduction

Before joining Midokura, the engineers and software architects who designed and implemented our network virtualization overlay software earned their stripes as systems engineers building large-scale information systems: websites and the data backends behind them, handling millions of concurrent customers every day.  To operate such high-traffic systems successfully, you have to adopt the distributed-systems mindset at the very core of every function your software provides.

Today, you can build a multi-tiered LAMP, Apache Tomcat and NoSQL application running on 10-20 servers (on premises or in the cloud) and serving ten thousand customers, or even a hundred thousand, without thinking much about parallelism, single points of failure, scale-out paths, or the choke points that only become visible later, when load and customer visits increase.

When building a network virtualization solution at Midokura, however, we have to ensure that every network abstraction we provide (load balancing, firewalling, L3 routing, L2 switching) scales out to deployments of 1,000 hypervisors and more, handling tens of millions of concurrent inbound TCP connections.  This article describes how we avoid single choke points and how we keep architectural limitations out of the scale-out paths of our software.

Standing on the shoulders of giants

MidoNet relies on netlink, openvswitch-datapath, tunneling protocols and a peer-to-peer architecture formed by semi-autonomous, topology-aware, intelligent agents.  No, we have not recreated the HAL 9000 brain functions (yet).  But our agents form the distributed cortex that lets the overlay grow organically along two dimensions: hypervisor scale-out and edge gateway scale-out.

During a LinuxTag 2014 presentation in Berlin, one of the architects of our solution, Pino de Candia, explained to the audience how the success of our product is built on the success of Linux and its many networking advancements, which, put together, allow us to build a robust, integrated network platform that is easy for system operators to maintain.  All of this is achieved without reinventing the wheel, since many of the building blocks are already provided by Linux.  On top of them we add the distributed intelligence that comes from the experience of our engineers, who know how to build such systems.

Where no agent has gone before

An important, open and widely known building block of our solution is Apache ZooKeeper [referred to simply as ZooKeeper from now on].  By using ZooKeeper for agent discovery and for storing the virtual topology, we build our network platform on reliable, proven and itself highly scalable software.

Cattle, not pets

The basic building block of a MidoNet network overlay is the agent.
Whether it runs on a network gateway or on a hypervisor makes no difference to the agent software.  Agents are cattle in a MidoNet system: they carry no intelligent configuration of their own and require no special local configuration for whatever role they play.  This allows maximum flexibility and resilience in configuration management.  All agents learn their job description from ZooKeeper.

  • Am I running on a hypervisor, with virtual machines whose tap devices are tied to vports inside the overlay?  No problem, I can handle that.
  • Am I a gateway agent?  Fine, let’s do L2 bridging.  Or VLAN translation with L2 bridging.  Great!
  • Am I supposed to act as part of a distributed layer 3 router, because I am running on a server close to the network edge and have a virtual port of a logical router bound to one of my local network interface cards?  Good, let’s do this.

 

Distributed edge Layer-3 (L3) routing, firewall and denial of service protection

How does our distributed L3 router work?
As mentioned in the previous section, the agents contain a lot of intelligence and can make many decisions based on dynamic configuration information retrieved from ZooKeeper.

Let us look at a typical customer scenario with 5 dedicated gateway nodes and 10 hypervisor nodes.  On each gateway node, the MidoNet agent is installed and a dedicated network card is available for the agent to use.

When the main logical router is created (let us call it the MidoNet Provider Router for the rest of this article), it does not exist as a Unix process on a single machine somewhere.

Perhaps the following sentence helps to clarify this.

The Provider Router (and, in fact, every virtual router and virtual switch in MidoNet) should be understood as the collaboration of multiple equal, cooperating agents.

It is an object in a configuration namespace, visible only as plain data in ZooKeeper, that defines the name, the virtual ports (sometimes called interfaces), the security rules on these virtual ports, the static routes and the BGP sessions it will use to talk to neighbouring routers.  Without an agent taking this information and doing something useful with it, the Provider Router does not exist.

Normally, no single MidoNet agent is responsible for being the Provider Router.  Of course, you can install a MidoNet system in which one agent does all the work of the Provider Router: this is the case when you run only one gateway node (which we do not recommend, for obvious reasons).

Let us get back to the example with the 5 gateways.

Each of those gateways has a network card that has been dedicated to MidoNet.

For the agents, this means that ZooKeeper holds the information describing how to bind each network card on each gateway node to a virtual port of the Provider Router.

When the agent starts, it finds this information in the ZooKeeper configuration for the host it runs on, and it binds one or more physical interfaces to logical ports defined inside the network overlay.  This is the only situation in which a gateway node has to become aware that it is a gateway node.  Even this configuration is not stored on the node itself but in the central configuration store, which makes it very easy for system administrators to add and remove gateway nodes in a very slim provisioning process.  Another edge case is running a hypervisor as a gateway node as well: by adding an additional network card to a hypervisor, you can make it part of the network edge.
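To make this a bit more concrete, here is a minimal sketch of how an agent could look up such a binding in ZooKeeper using the kazoo client.  The znode path and the JSON layout are assumptions made purely for this illustration; they are not MidoNet's actual schema.

# Minimal sketch: an agent reading its interface-to-vport bindings from
# ZooKeeper at startup. The znode path and JSON layout are invented for
# illustration and are not MidoNet's real schema.
import json
from kazoo.client import KazooClient

ZK_HOSTS = "zk1:2181,zk2:2181,zk3:2181"    # assumed ZooKeeper ensemble
HOST_ID = "f3b1c2d4-example-host-uuid"     # assumed host identifier

zk = KazooClient(hosts=ZK_HOSTS)
zk.start()

bindings_path = f"/midonet/hosts/{HOST_ID}/bindings"   # hypothetical path
if zk.exists(bindings_path):
    for child in zk.get_children(bindings_path):
        data, _stat = zk.get(f"{bindings_path}/{child}")
        binding = json.loads(data.decode("utf-8"))
        # e.g. {"interface": "eth1", "vport": "provider-router-port-3"}
        print(f"bind {binding['interface']} -> virtual port {binding['vport']}")
else:
    print("no edge bindings for this host; acting as a plain hypervisor agent")

zk.stop()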

Let us now recap where we currently stand.
We have distributed, intelligent agents, each of which has added its local network card as a virtual port of the logical MidoNet Provider Router.

What happens now is that we create BGP sessions.
Each of those virtual ports gets an IP address.  To keep it simple, we use an arbitrary network range: 200.200.199.0/24.  Our agents get the following IP addresses: 200.200.199.11, 200.200.199.12, 200.200.199.13, 200.200.199.14 and 200.200.199.15.  These will be the IP addresses used for talking to their BGP peers, available at 200.200.199.21, 22, 23, 24 and 25.
Let us assume the BGP peers speak equal cost multipath (ECMP) BGP.  Each of our five agents will now advertise the following network: 200.200.200.0/24.  This is what we have configured as the floating IP range in our OpenStack installation.  It means that when we associate a floating IP with a virtual machine, that VM becomes reachable via our five gateways, because the five peers all have a route for this network going through our five MidoNet gateways.
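As an aside, ECMP means the upstream routers spread flows across all five equal-cost next hops, typically by hashing each flow's 5-tuple.  The following toy sketch is not MidoNet code; it only illustrates what that selection behaviour looks like for our five gateway addresses.

# Toy illustration of how an upstream ECMP router might pick one of the
# five equal-cost next hops (our gateway IPs) for a given flow.
import hashlib

GATEWAY_NEXT_HOPS = [
    "200.200.199.11", "200.200.199.12", "200.200.199.13",
    "200.200.199.14", "200.200.199.15",
]

def pick_next_hop(src_ip, dst_ip, proto, src_port, dst_port):
    """Hash the flow 5-tuple so every packet of a flow takes the same hop."""
    key = f"{src_ip}|{dst_ip}|{proto}|{src_port}|{dst_port}".encode()
    digest = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
    return GATEWAY_NEXT_HOPS[digest % len(GATEWAY_NEXT_HOPS)]

# A flow towards a floating IP in 200.200.200.0/24 always lands on the
# same gateway, while different flows spread across all five.
print(pick_next_hop("198.51.100.7", "200.200.200.42", "tcp", 51512, 443))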

Each gateway whose physical network card is attached to the MidoNet Provider Router by means of a virtual port can now act as an equal part of this router.  It can simulate incoming traffic and make sure that IP packets destined for a VM are forwarded to the hypervisor that VM is running on, regardless of which of the five gateways the packet arrived at.  It can also receive IP packets from a VM with a floating IP and send them out to the internet through the BGP session it has open with its peer.

If a BGP peer or a MidoNet gateway fails, it is no big deal: the other gateways are still there.
Running into performance issues?  Add 5 more gateways and you now have 10 edge gateways instead of 5.  All you have to configure is the plumbing that makes the 5 additional network cards part of the Provider Router, plus the BGP sessions for them.  Keeping it simple is one of the design patterns for building distributed systems that never fail.

This is the typical layer 3 distributed gateway case.

So, how does firewalling work?
We evaluate security groups at the edge of the network.  Because they get the topology from ZooKeeper, our gateway agents have all the information they need to decide whether to drop packets that are not allowed by the firewall rules, and they do this already at the perimeter of your network infrastructure.

How does that work?
The gateway agent is a sophisticated flow simulator: it simulates the path an IP packet would take through the virtual topology to the destination virtual machine sitting on any hypervisor in one of your racks.  Every agent can run this flow simulation; it is also how IP packets travel from virtual machine to virtual machine, with the simulation running on the hypervisor agents.  The result of a flow simulation is either a decision to forward the packets through a tunnel (VXLAN or GRE) or to drop unwanted packets when a firewall rule says so.
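A highly simplified sketch of that decision logic follows.  The data structures stand in for the real virtual topology and rule set, and are invented for this example; the actual MidoNet simulation is far more involved.

# Highly simplified sketch of a flow simulation: look up the packet's
# destination in the (cached) virtual topology, apply the firewall rules,
# and return either a tunnel action or a drop. All data here is invented.
from dataclasses import dataclass

@dataclass
class Packet:
    src_ip: str
    dst_ip: str
    dst_port: int

VPORT_OF_VM = {"200.200.200.42": "vport-a1"}      # floating IP -> vport
HYPERVISOR_OF_VPORT = {"vport-a1": "10.0.16.7"}   # vport -> tunnel endpoint
ALLOWED_DST_PORTS = {80, 443}                     # toy security group

def simulate(pkt: Packet):
    """Return a flow action: ('tunnel', remote_vtep, vport) or ('drop',)."""
    vport = VPORT_OF_VM.get(pkt.dst_ip)
    if vport is None:
        return ("drop",)                          # no such VM in the overlay
    if pkt.dst_port not in ALLOWED_DST_PORTS:
        return ("drop",)                          # blocked by security group
    return ("tunnel", HYPERVISOR_OF_VPORT[vport], vport)

print(simulate(Packet("198.51.100.7", "200.200.200.42", 443)))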

With the gateway agents acting as flow simulators and all of them enforcing the firewall rules in the same way, it is easy to see why this scales.  By adding more gateway nodes you add more CPU and memory to the distributed firewall, which can perhaps be described more precisely as a ‘parallel flow calculator for evaluating firewall rules’.

You can also add blackhole routes for unwanted traffic.  Our MidoNet gateways use those routes to inspect incoming packets and drop packets from the listed CIDR blocks directly at the edge, as soon as the packet enters the gateway acting as an L3 router.  This saves precious I/O resources on the hypervisors and relieves your IP fabric of unnecessary traffic and dangerous congestion.
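As a rough illustration of the idea (again, not the actual implementation), a source-CIDR blackhole check at the edge could look like this:

# Rough illustration of a blackhole check at the edge: packets whose source
# address falls inside a blackholed CIDR are dropped before any further
# flow simulation. CIDRs are example values only.
import ipaddress

BLACKHOLE_CIDRS = [
    ipaddress.ip_network("203.0.113.0/24"),   # example unwanted range
    ipaddress.ip_network("198.18.0.0/15"),
]

def is_blackholed(src_ip: str) -> bool:
    addr = ipaddress.ip_address(src_ip)
    return any(addr in cidr for cidr in BLACKHOLE_CIDRS)

print(is_blackholed("203.0.113.99"))   # True  -> drop at the gateway
print(is_blackholed("198.51.100.7"))   # False -> continue flow simulation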

Distributed layer 4 load balancing

In the last section we covered the distributed firewall and how the flow simulator enforces security groups in parallel on every MidoNet gateway.  The distributed layer 4 load balancer works quite similarly, except that its primary purpose is not dropping unwanted packets but finding a good member of the backend pool to send incoming TCP packets to.

Before going into the details, let us discuss two common ways of load balancing.

A typical method, often found in SSL offloaders, is for the load balancer to accept a TCP connection, handle the SSL, then open a second TCP connection to a backend web server and relay the upper OSI layer protocol across these two connections.

This is the way Varnish works in default mode if you do not enable piping (https://www.varnish-software.com/blog/using-pipe-varnish).

Another possible way is to receive the TCP connection without looking at its content, rewrite the destination IP (and sometimes the destination TCP port), and then forward the IP packet to a healthy backend.
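A tiny sketch of this second, NAT-style approach: the client's connection is never terminated; only the destination fields are rewritten.  The VIP, pool and field names are invented for this illustration.

# NAT-style forwarding: keep the client's connection intact and only
# rewrite the destination address/port before forwarding the packet.
VIP = ("200.200.200.80", 443)
POOL = [("10.0.32.11", 8443), ("10.0.32.12", 8443)]

def dnat(packet: dict, backend: tuple) -> dict:
    """Rewrite the destination of a packet headed for the VIP."""
    rewritten = dict(packet)
    rewritten["dst_ip"], rewritten["dst_port"] = backend
    return rewritten

pkt = {"src_ip": "198.51.100.7", "src_port": 51512,
       "dst_ip": VIP[0], "dst_port": VIP[1]}
print(dnat(pkt, POOL[0]))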

In MidoNet, a distributed layer 4 load balancer consists of a public IP (the VIP) and a set of backends organized in so-called pools.  We use the health check feature of HAProxy to determine whether a backend is eligible to receive connections that arrive at our gateways.

When you configure a layer 4 load balancer, all MidoNet gateways accept IP packets carrying both initial and non-initial TCP segments for the load balancer VIP, because they can all simulate the path the packet has to take inside the overlay to reach one of the working backends.  If a TCP session already exists for a particular connection, every agent can find out which backend the packet must go to.  We use Cassandra to store this stateful information about which TCP sessions belong to which virtual machine backend.
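Put together, the per-packet logic on any gateway looks roughly like the following sketch.  The in-memory dictionary merely stands in for the shared session state (which, as described above, MidoNet keeps in Cassandra); the pool members and names are invented for illustration.

# Rough sketch of per-packet L4 load balancing on a gateway: reuse the
# backend recorded for an existing TCP session, otherwise pick a healthy
# pool member and record the choice. A plain dict stands in for the shared
# session store (Cassandra in MidoNet); all values here are illustrative.
import random

HEALTHY_BACKENDS = [("10.0.32.11", 8443), ("10.0.32.12", 8443),
                    ("10.0.32.13", 8443)]      # as reported by health checks
SESSIONS = {}                                  # (src_ip, src_port) -> backend

def pick_backend(src_ip, src_port):
    key = (src_ip, src_port)
    if key in SESSIONS:                        # non-initial segment: stick
        return SESSIONS[key]
    backend = random.choice(HEALTHY_BACKENDS)  # initial SYN: choose a member
    SESSIONS[key] = backend
    return backend

# Any of the five gateways resolves the same session to the same backend,
# because the real session table is shared rather than local.
print(pick_backend("198.51.100.7", 51512))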

What is also interesting is that every IP packet going from a load balancer to a backend takes only a single virtual hop: regardless of how many physical or logical routers and switches sit between the gateway and the hypervisor running the VM, the result of the flow simulation on the gateway is always a point-to-point GRE or VXLAN tunnel between the gateway and the hypervisor, which gives maximum efficiency for forwarding packets into the overlay.

What this means for a solution architect is that you can achieve full 1:N scale-out by adding more MidoNet gateways once the load on the existing load balancers gets too high.

MidoNet has supported layer 4 load balancers since the 1.5 release.

Additionally, starting with MidoNet 1.6, all of this will be available inside Horizon, where you will be able to create a load balancer using the LBaaS self-service feature of OpenStack.

A distributed system as a platform for network services

Having seen how our firewall and load balancer work, an experienced reader may notice an interesting pattern.  We use intelligent, independent agents that rely on a global topology to make local decisions about how they treat IP traffic.  Think global, act local.  All these agents work together, yet they do not depend on each other, and they will take over work from failing agents.  Loose coupling and automatic redundancy: another hallmark of distributed, robust systems.

This pattern of agents working together in a decentralized fashion allows us to parallelize all the computational effort that delivers the customer value of the network overlay.

For the computational effort of flow simulation in the overlay, it does not matter whether the server is a network edge component (a gateway) or a machine hosting virtual machines (a hypervisor).

In fact, these simulations are the only real hardware requirement you have to consider when deploying our solution: the agents need CPU and memory on the hypervisors and gateways to calculate the flows.  For the traffic itself, we (like several other NVO products) use openvswitch-datapath so that IP packets stay on the fast path in kernel space.

Our solution runs on any kind of IP fabric, together with any kind of Linux server hardware, and we now give you a perfect reason to start using network cards with VXLAN offloading in your servers.

As MidoNet grows and develops, the generic distributed agent architecture of our platform allows us to add more services and features to our solution without compromising our promise of building highly distributed, highly scalable systems that empower the next generation of cloud service providers.

Conclusion

In this blog article, we looked at the scalability that a truly distributed, peer-to-peer software defined networking solution provides, and at how MidoNet can deliver high-performance, hyper-scale firewall and load balancing capabilities at the edge of a typical cloud service provider's network infrastructure.

Ideas like this start small.  The fun starts when they grow bigger and bigger.

Alexander Gabert

About Alexander Gabert

Systems Engineer - Midokura. Alexander Gabert spent the last six years doing systems engineering and systems operations projects in large companies throughout Germany, working for inovex in Karlsruhe. During that time he was responsible for running database clusters, big data applications (Hadoop) and high-volume web servers with Varnish caches and CDN infrastructure in front of them. Coming from this background as an infrastructure engineer, he looks at our solution from the viewpoint of our customers. Having worked with the networking departments of several big companies as someone responsible for applications and (virtual) servers, he knows from first-hand practical experience what the actual benefits will be when we introduce network virtualization to face today's data center networking challenges.
