Troubleshooting an SDN-based cloud environment is challenging. The network operator interacts with the logical overlay, but lacks direct visibility into the physical underlay where many problems occur. A tenant may complain about poor application performance, but the traditional tools for debugging networking and performance problems, widely documented online, are ill-suited to software running on virtualized compute and virtualized networks.
When debugging application performance, we have some immediate suspects related to the infrastructure: lack of storage, slow storage access, heavy load on the hypervisor CPU, noisy neighbors, overlay network issues, underlay networking issues. In this blog post, I focus on troubleshooting network-related causes.
To appreciate the challenge, let’s examine the operator’s process for finding the root cause of such a performance problem.
Note: while my examples use OpenStack, the analysis is applicable to any other cloud framework.
Let’s look at two scenarios:
Scenario 1 – The client does not provide any additional information
To measure network performance, the operator may check connectivity issues and latency between compute nodes (hosting the client VMs), e.g. by checking ping round trips.
The operator will:
1. Identify the involved machines
2. Find their IP addresses
3. ssh into one compute node and run ping
For (1), the operator may use the Horizon dashboard or the Nova CLI.
For example, to identify the host for a VM called ‘alon-1’, we use the ‘nova list’ command to find the VM UUID, and then pass that UUID to the ‘nova show’ command.
For (2), the operator can consult the datacenter network diagram, or use the Nova CLI again with ‘nova hypervisor-show’ (this information is not available in the Horizon dashboard).
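The two lookups above can be scripted. The following is a minimal sketch, assuming the Nova CLI prints its usual two-column "Property / Value" table and that the operator has admin credentials loaded in the environment. The sample table, the host name ‘compute-7’ and the helper names are illustrative assumptions, not output captured from a real deployment, and the exact field names may vary between OpenStack releases.

```python
import subprocess


def parse_property_table(text):
    """Turn a '| Property | Value |' ASCII table into a dict.

    Border rows (+----+----+), the header row and wrapped continuation
    rows (empty property cell) are skipped. Values that themselves
    contain '|' are not handled by this sketch.
    """
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if len(cells) != 2 or cells[0] in ("", "Property"):
            continue
        props[cells[0]] = cells[1]
    return props


def nova(*args):
    """Run a nova CLI subcommand and return its parsed property table."""
    result = subprocess.run(
        ["nova", *args], capture_output=True, text=True, check=True
    )
    return parse_property_table(result.stdout)


def find_host_ip(vm_name_or_uuid):
    """Step 1: VM -> hypervisor name. Step 2: hypervisor -> IP address."""
    server = nova("show", vm_name_or_uuid)
    # Assumed field names; they are only visible to admin users.
    hypervisor = server["OS-EXT-SRV-ATTR:hypervisor_hostname"]
    return nova("hypervisor-show", hypervisor)["host_ip"]


# Hypothetical table in the shape the CLI prints, for illustration only.
SAMPLE = """
+-------------------------------------+-----------+
| Property                            | Value     |
+-------------------------------------+-----------+
| OS-EXT-SRV-ATTR:host                | compute-7 |
| OS-EXT-SRV-ATTR:hypervisor_hostname | compute-7 |
| name                                | alon-1    |
+-------------------------------------+-----------+
"""
```

With credentials in place, `find_host_ip("alon-1")` would return the address to ssh into. Parsing the table is fragile by nature, so a production script would more likely use the OpenStack SDK or a machine-readable output format.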
Now, having the IP addresses, one can ssh into the machine and run the ping test. If our operator is lucky, he spots an unusual ping round trip. I say “lucky” because by the time the operator gets to run ping (after receiving the customer complaint, digging into Neutron and finding the host’s IP addresses), the condition behind the complaint, perhaps a temporary CPU load or heavy network traffic, may already have cleared for some unrelated reason.
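Because the condition may be short-lived, it helps to capture the round-trip figures in a form that can be logged and compared over time. Below is a minimal sketch, assuming the Linux iputils `ping`, whose summary line has the form `rtt min/avg/max/mdev = ... ms`. The function names and the sample line are illustrative assumptions.

```python
import re
import subprocess

# Matches the four numbers in e.g. "rtt min/avg/max/mdev = 0.2/0.3/0.4/0.1 ms"
RTT_RE = re.compile(r"= ([\d.]+)/([\d.]+)/([\d.]+)/([\d.]+) ms")


def parse_ping_rtt(output):
    """Return (min, avg, max, mdev) in milliseconds, or None if absent.

    None typically means every packet was lost, so ping printed no
    round-trip summary.
    """
    match = RTT_RE.search(output)
    if match is None:
        return None
    return tuple(float(group) for group in match.groups())


def ping_rtt(host, count=5):
    """Ping a host `count` times and return its parsed RTT summary."""
    result = subprocess.run(
        ["ping", "-c", str(count), host], capture_output=True, text=True
    )
    return parse_ping_rtt(result.stdout)


# Illustrative summary line, not a measurement from the article.
SAMPLE_SUMMARY = "rtt min/avg/max/mdev = 0.211/0.254/0.302/0.034 ms"
```

Run from one compute node against the other, `ping_rtt(peer_ip)` gives a single sample. Calling it on a schedule and storing the results is what turns a lucky spot check into a record the operator can search later.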
Scenario 2 – The client provides some hints
Let’s assume that our tenant ran netperf between the involved VMs to measure UDP’s request/response rate (UDPRR below), and did find a low rate. This already points to a networking problem between these VMs.
At this stage, even our sophisticated tenant cannot debug any further without visibility into the physical location of the VMs and without access to the involved hypervisors.
To illustrate this, I ran a netperf test under different conditions:
1. This one is the benchmark:
2. This one shows poor performance when running ‘stress’ (loading the CPU to 99%) on the netperf server machine:
3. This one shows even worse performance when the network is congested:
As we see, both (2) ‘CPU overloading’ and (3) ‘network congestion’ can cause huge decreases in network performance.
[Note: TCP is not affected when the CPU is loaded, because of the use of TCP offload mechanisms.]
In my previous blog post on this topic, I explained that overlay networking is all about software. Software is running on CPUs, of course. The intermingling of software and hardware and networking means that in this case (troubleshooting network performance in the cloud) we don’t really know whether to blame the physical network (switches/routers/cable) or the virtual one (software/CPU).
So, how does our operator continue from here?
Checking current CPU usage is simple using Linux tools. However, for the case described, a one-off check is not enough: we need tools that constantly monitor and log CPU usage, and perhaps alert when it crosses a threshold. Such tools would allow the operator to search for historical events that might have affected past performance.
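To make the monitor, log and alert idea concrete, here is a minimal sketch that samples the aggregate `cpu` line of `/proc/stat` on a Linux hypervisor and writes timestamped usage to a log file, raising the log level above a threshold. The threshold, interval and log file name are assumptions chosen for illustration; a real deployment would more likely feed an existing monitoring system.

```python
import logging
import time


def parse_cpu_line(line):
    """Return (idle, total) jiffies from the aggregate 'cpu' line."""
    fields = [int(value) for value in line.split()[1:]]
    # Fields: user nice system idle iowait irq softirq steal [guest ...]
    idle = fields[3] + (fields[4] if len(fields) > 4 else 0)
    # Guest time is already counted in user/nice, so sum only the first 8.
    total = sum(fields[:8])
    return idle, total


def cpu_utilization(prev_line, curr_line):
    """Percentage of non-idle CPU time between two /proc/stat samples."""
    idle0, total0 = parse_cpu_line(prev_line)
    idle1, total1 = parse_cpu_line(curr_line)
    elapsed = total1 - total0
    if elapsed <= 0:
        return 0.0
    return 100.0 * (1.0 - (idle1 - idle0) / elapsed)


def read_cpu_line(path="/proc/stat"):
    """Read the first line of /proc/stat, which aggregates all CPUs."""
    with open(path) as stat_file:
        return stat_file.readline()


def monitor(threshold=90.0, interval=5.0, logfile="cpu_usage.log"):
    """Log CPU usage forever; samples at or above threshold log as WARNING."""
    logging.basicConfig(
        filename=logfile,
        level=logging.INFO,
        format="%(asctime)s %(levelname)s %(message)s",
    )
    prev = read_cpu_line()
    while True:
        time.sleep(interval)
        curr = read_cpu_line()
        usage = cpu_utilization(prev, curr)
        level = logging.WARNING if usage >= threshold else logging.INFO
        logging.log(level, "cpu_usage=%.1f%%", usage)
        prev = curr
```

Because every sample carries a timestamp, the operator can later search the log for WARNING entries around the time of a tenant complaint and either confirm or exclude CPU load as a contributor.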
Let us assume the operator has this information available, and can exclude CPU load as a contributor to the problem. Did the network cause the problem? That is a harder question to answer, this time because of network-information overload.
Independent and non-integrated network monitoring tools provide so much scattered, irrelevant or even “blinding” information that they frequently leave an operator with more questions than answers. To all this, the overlay adds another dimension of complexity.
This gives us some background to appreciate the following chart, taken from a report in which over 740 respondents revealed their biggest challenges, key initiatives, and top reasons for moving to the cloud:
In my next post I discuss how two new tools from Midokura, MEM Insights™ and MEM Fabric™, facilitate network troubleshooting and alleviate this major source of operator headache and anxiety.