An OpenStack Crime Story solved by tcpdump, sysdig, and iostat

By Loris Degioanni - OCTOBER 16, 2014



Editor’s note: This post from Lukas Pustina, Performance Engineer at codecentric, originally appeared on the codecentric blog.

This is the first in a new series of guest blog posts, written by real sysdig users telling their own stories in their own words. If you’re interested in posting here, please reach out – we’d love to hear from you!


This is the story of how the tiniest things can sometimes be the biggest culprits. First and foremost, this is a detective story. So come and follow me on a little crime scene investigation that uncovered an incredibly counterintuitive, almost criminal default setting in Ubuntu 14.04 that cripples virtual network performance and, with it, network latency.

And just in case you don’t like detective stories – and this goes out to all you nerds out there – I promise it will be worth your while, because this is also an introduction to keepalived, tcpdump, and sysdig. So sit back and enjoy!

The problems came out of the blue. We were running an OpenStack installation for the next-generation CenterDevice cloud.

CenterDevice is a cloud-based document solution and our clients rely on its 24/7 availability. In order to achieve this availability, all our components are multiply redundant. For example, our load balancers are two separate virtual machines running HAProxy. Both instances manage a highly available IP address via keepalived.

This system worked very well, until both load balancers became the victims of an evil crime. Both virtual machines started flapping the highly available virtual IP address, and we were receiving alert e-mails by the dozen, but there was no obvious change to the system that could have explained this behavior.

And this is how the story begins.

When I was called to investigate this weird behavior and answer the question of what was happening to the poor little load balancers, I started by taking a close look at the crime scene: keepalived. The following listing shows our master configuration of keepalived on virtual host loadbalancer01 [1]:

% sudo cat /etc/keepalived/keepalived.conf
global_defs {
        notification_email {
                [email protected]
        }
        notification_email_from [email protected]
        smtp_server 10.10.9.8
        smtp_connect_timeout 30
        router_id loadbalancer01
}
 
vrrp_script chk_haproxy {
        script "killall -0 haproxy"
        interval 2
        weight 2
}
 
vrrp_instance loadbalancer {
        state MASTER
        interface eth0
        virtual_router_id 205
        priority 101
        smtp_alert
        advert_int 1
        authentication {
                auth_type PASS
                auth_pass totally_secret
        }
        virtual_ipaddress {
                192.168.205.7
        }
        track_script {
                chk_haproxy
        }
}

The configuration on our second load balancer, loadbalancer02, looks exactly the same, with the notification email and router_id changed accordingly and a lower priority. This all looked fine to me, and it was immediately clear that keepalived.conf was not to blame. I needed to find a different reason why the two keepalived instances kept flapping the virtual IP address.

Now, it is important to understand how VRRP (the protocol keepalived uses to check the availability of its partners) works. All partners continuously exchange keepalive packets via the multicast address vrrp.mcast.net, which resolves to 224.0.0.18. These packets use IP protocol number 112. If the backup does not receive these keepalives from the current master, it assumes the partner is dead, murdered, or otherwise gone AWOL, takes over the virtual IP address, and acts as the new master. If the old master decides to check back in, the virtual IP address moves back.
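
If you want to double-check those numbers on your own machine, both live in standard lookup databases. A simple sanity check could look like this (the output formatting varies between systems, and the DNS name only resolves if your resolver knows it):

$ getent protocols vrrp          # should report IP protocol number 112
$ getent hosts vrrp.mcast.net    # should resolve to 224.0.0.18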

Since we were observing this back and forth exchange, I suspected the virtual network had unlawfully disregarded its responsibility. I immediately rushed to the terminal and started to monitor the traffic. This might seem easy, but it is far more complex than you think. The figure below, taken from openstack.redhat.com, shows an overview of how Neutron creates virtual networks.

[Figure: OpenStack Neutron architecture overview, from openstack.redhat.com]

Since both load balancers were running on different nodes, the VRRP packets traveled from A to J and back to Q. So where were the missing packets?

I decided to tail the packets at A, E, and J. Maybe that was where the culprit was hiding. I started tcpdump on loadbalancer01, the compute node node01, and the network controller control01 looking for missing packets. And there were packets. Lots of packets. But I could not see missing packets until I stopped tcpdump:

$ tcpdump -l host vrrp.mcast.net
...
45389 packets captured
699 packets received by filter
127 packets dropped by kernel

Dropped packets by the kernel? Were these our victims? Unfortunately not.

tcpdump uses a little buffer in the kernel to store captured packets. If too many new packets arrive before the user process tcpdump can decode them, the kernel drops them to make room for freshly arriving packets.

Not very helpful when you are trying to find out where packets get, well, dropped. But tcpdump has the handy parameter -B, which sets this buffer size (in units of KiB). I tried again:

$ tcpdump -l -B 10000 host vrrp.mcast.net

Much better: no more packets dropped by the kernel. And now I saw something. While the VRRP packets scrolled across my screen, I noticed that sometimes there was a lag of more than one second between keepalives. This struck me as odd; it should not happen, since the advertisement interval had been set to one second in the configuration. I felt I was on to something, but needed to dig a bit deeper.

I let tcpdump show me the time difference between successive packets and filtered out the expected one-second intervals, so only the anomalies remained.

$ tcpdump -i any -l -B 10000 -ttt host vrrp.mcast.net  | grep -v '^00:00:01'
[Screenshot: tcpdump -ttt output with one inter-packet gap of more than 3.5 seconds between VRRP keepalives]

Oh my god. Look at the screenshot above [2]. There is a delay of more than 3.5 seconds. Of course loadbalancer02 assumed its partner had gone MIA.

But wait, the delay already shows up on loadbalancer01? I had been fooled! It was not the virtual network. It had to be the keepalived master host itself! But why? Why would the virtual machine hold packets back? There must be some evil hiding in this machine and I will find and face it…


After discovering the culprit might be hiding inside the virtual machine, I fetched my assistant sysdig and we ssh'ed into loadbalancer01. Together, we started to ask questions – very intense, uncomfortable questions – to finally get to the bottom of this.

Sysdig is a relatively new tool that traces all syscalls and uses Lua scripts – called chisels – to draw conclusions from the captured sequence of syscalls. In essence, it is like tcpdump for syscalls. Since every process that wants to do anything useful has to use syscalls, this yields a very detailed view of your system's behavior. Sysdig is developing into a serious Swiss Army knife and you should give it a try – read these three blog posts to learn about the capabilities that convinced me.
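
If you are curious which chisels ship with your installation, sysdig can list them and describe each one in more detail. Something along these lines should work on current releases (the exact chisel list varies by version):

$ sysdig -cl              # list all available chisels with a one-line description
$ sysdig -i spy_file      # show the detailed description and arguments of one chisel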

Sysdig, being the good detective's assistant, went to work immediately and kept a transcript of all questions and answers in a file.

$ sysdig -w transcript

I was especially interested in the sendmsg syscall issued by the keepalived process, since that is what actually sends the network packets.

After a process invokes a syscall, it is paused until the kernel has finished the syscall; only then does the execution of the process resume.

I was interested in the invocation, which is the event direction from user process to kernel.

$ sysdig -r transcript proc.name=keepalived and evt.type=sendmsg and evt.dir='>'

That stirred up quite a bit of evidence. But I wanted to find the time gaps larger than one second. I drew Lua from the concealed holster inside my jacket and fired a quick chisel right between two sendmsg syscalls:

$ sysdig -r transcript -c find_time_gap proc.name=keepalived and evt.type=sendmsg and evt.dir='>'
370964 17:01:26.375374738 (5.03543187) keepalived > sendmsg fd=13(192.168.205.8->224.0.0.18) size=40 tuple=0.0.0.0:112->224.0.0.18:0

Got you! A time gap of more than five seconds turned up. I took a closer look at the events around event number 370964.

$ sysdig -r transcript proc.name=keepalived and evt.type=sendmsg and evt.dir='>' | grep -C 1 370964
368250 17:01:21.339942868 0 keepalived (10400) > sendmsg fd=13(192.168.205.8->224.0.0.18) size=40 tuple=0.0.0.0:112->224.0.0.18:0
370964 17:01:26.375374738 1 keepalived (10400) > sendmsg fd=13(192.168.205.8->224.0.0.18) size=40 tuple=0.0.0.0:112->224.0.0.18:0
371446 17:01:26.377770247 1 keepalived (10400) > sendmsg fd=13(192.168.205.8->224.0.0.18) size=40 tuple=0.0.0.0:112->224.0.0.18:0

There, in lines 1 and 2: a gap much larger than one second. Now it was clear. The perpetrator was hiding inside the virtual machine, causing lags in the processing of keepalived's sendmsg. But who was it?

After some ruling out of innocent bystanders, I found this:

$ sysdig -r transcript | tail -n +368250 | grep -vE "keepalived|haproxy|zabbix"
369051 17:01:23.621071175 3  (0) > switch next=7 pgft_maj=0 pgft_min=0 vm_size=0 vm_rss=0 vm_swap=0
369052 17:01:23.621077045 3  (7) > switch next=0 pgft_maj=0 pgft_min=0 vm_size=0 vm_rss=0 vm_swap=0
369053 17:01:23.625105578 3  (0) > switch next=7 pgft_maj=0 pgft_min=0 vm_size=0 vm_rss=0 vm_swap=0
369054 17:01:23.625112785 3  (7) > switch next=0 pgft_maj=0 pgft_min=0 vm_size=0 vm_rss=0 vm_swap=0
369055 17:01:23.628568892 3  (0) > switch next=25978(sysdig) pgft_maj=0 pgft_min=0 vm_size=0 vm_rss=0 vm_swap=0
369056 17:01:23.628597921 3 sysdig (25978) > switch next=0 pgft_maj=0 pgft_min=2143 vm_size=99684 vm_rss=6772 vm_swap=0
369057 17:01:23.629068124 3  (0) > switch next=7 pgft_maj=0 pgft_min=0 vm_size=0 vm_rss=0 vm_swap=0
369058 17:01:23.629073516 3  (7) > switch next=0 pgft_maj=0 pgft_min=0 vm_size=0 vm_rss=0 vm_swap=0
369059 17:01:23.633104746 3  (0) > switch next=7 pgft_maj=0 pgft_min=0 vm_size=0 vm_rss=0 vm_swap=0
369060 17:01:23.633110185 3  (7) > switch next=0 pgft_maj=0 pgft_min=0 vm_size=0 vm_rss=0 vm_swap=0
369061 17:01:23.637061129 3  (0) > switch next=7 pgft_maj=0 pgft_min=0 vm_size=0 vm_rss=0 vm_swap=0
369062 17:01:23.637065648 3  (7) > switch next=0 pgft_maj=0 pgft_min=0 vm_size=0 vm_rss=0 vm_swap=0
369063 17:01:23.641104396 3  (0) > switch next=7 pgft_maj=0 pgft_min=0 vm_size=0 vm_rss=0 vm_swap=0
369064 17:01:23.641109883 3  (7) > switch next=0 pgft_maj=0 pgft_min=0 vm_size=0 vm_rss=0 vm_swap=0
369065 17:01:23.645069607 3  (0) > switch next=7 pgft_maj=0 pgft_min=0 vm_size=0 vm_rss=0 vm_swap=0
369066 17:01:23.645075551 3  (7) > switch next=0 pgft_maj=0 pgft_min=0 vm_size=0 vm_rss=0 vm_swap=0
369067 17:01:23.649071836 3  (0) > switch next=7 pgft_maj=0 pgft_min=0 vm_size=0 vm_rss=0 vm_swap=0
369068 17:01:23.649077700 3  (7) > switch next=0 pgft_maj=0 pgft_min=0 vm_size=0 vm_rss=0 vm_swap=0
369069 17:01:23.653073066 3  (0) > switch next=7 pgft_maj=0 pgft_min=0 vm_size=0 vm_rss=0 vm_swap=0
369070 17:01:23.653078791 3  (7) > switch next=0 pgft_maj=0 pgft_min=0 vm_size=0 vm_rss=0 vm_swap=0
369071 17:01:23.657069807 3  (0) > switch next=7 pgft_maj=0 pgft_min=0 vm_size=0 vm_rss=0 vm_swap=0
369072 17:01:23.657075525 3  (7) > switch next=0 pgft_maj=0 pgft_min=0 vm_size=0 vm_rss=0 vm_swap=0
369073 17:01:23.658678681 3  (0) > switch next=25978(sysdig) pgft_maj=0 pgft_min=0 vm_size=0 vm_rss=0 vm_swap=0
369074 17:01:23.658698453 3 sysdig (25978) > switch next=0 pgft_maj=0 pgft_min=2143 vm_size=99684 vm_rss=6772 vm_swap=0

The processes masked behind pids 0 and 7 were consuming the virtual CPU. 0 and 7? These low process ids indicate kernel threads! I double checked with sysdig’s supervisor, just to be sure. I was right.
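
Pid 0 is the idle task and never shows up in a process listing, but for pid 7 you can verify the kernel-thread theory yourself: kernel threads are children of kthreadd, which has pid 2. One way to check (the thread name will differ from system to system):

$ ps -o pid,ppid,comm -p 7    # a PPID of 2 (kthreadd) means pid 7 is a kernel thread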

The kernel, the perpetrator? This hard-working, ever-reliable guy? No, I could not believe it. There had to be more. And by the way, the virtual machine was not under load. So what could the virtual machine's kernel be doing … except … not doing anything at all? Maybe it was waiting for IO! Could that be? Since all hardware of a virtual machine is, well, virtual, the IO wait could only be happening at the bare-metal level.
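
If you ever want to check this from inside a guest yourself, the CPU summary of iostat gives a quick first indication: %iowait is time the CPU spent waiting for IO, and %steal is time the hypervisor simply did not schedule the guest's virtual CPU. Something like this is enough (an illustration, not output from our machines):

$ iostat -c 1 3    # watch the %iowait and %steal columns of the avg-cpu lines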

I left loadbalancer01 for its bare-metal host node01. On my way, I was trying to wrap my head around what was happening. What was this all about? So many proxies, so many things happening in the shadows. So many false suspects. This was not a regular case. There was something much more vicious happening behind the scenes, and I had to find it…


When I arrived at bare-metal host node01, which hosts virtual machine loadbalancer01, I was anxious to see the IO statistics. The machine had to be under heavy IO load if the virtual machine's packets were waiting for up to five seconds.

I switched on my iostat flashlight and saw this:

$ iostat
Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda              12.00         0.00       144.00          0        144
sdc               0.00         0.00         0.00          0          0
sdb               6.00         0.00        24.00          0         24
sdd               0.00         0.00         0.00          0          0
sde               0.00         0.00         0.00          0          0
sdf              20.00         0.00       118.00          0        118
sdg               0.00         0.00         0.00          0          0
sdi              22.00         0.00       112.00          0        112
sdh               0.00         0.00         0.00          0          0
sdj               0.00         0.00         0.00          0          0
sdk              21.00         0.00        96.50          0         96
sdl               0.00         0.00         0.00          0          0
sdm               9.00         0.00        64.00          0         64

Nothing? Nothing at all? No IO on the disks? Maybe my bigger flashlight iotop could help:

$ iotop

Unfortunately, what I saw was too ghastly to show here, and therefore I decided to omit the screenshots of iotop [3]. It was pure horror: six qemu processes eating the physical CPUs alive, stuck in IO.

So, no disk IO, but extremely high IO load caused by qemu. It had to be network IO then. But all performance counters showed almost no network activity. What if this IO wasn't real, but virtual? It could be the virtual network driver! It had to be the virtual network driver.
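
By performance counters I mean the usual per-interface packet and byte rates. A sketch of how to look at them, using sar from the same sysstat package that provides iostat:

$ sar -n DEV 1 3      # per-interface receive/transmit rates, sampled every second
$ cat /proc/net/dev   # the raw interface counters behind those numbers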

I checked the OpenStack configuration. It was set to use the para-virtualized network driver vhost_net.

I checked the running qemu processes. They were also configured to use the para-virtualized network driver.

$ ps aux | grep qemu
libvirt+  6875 66.4  8.3 63752992 11063572 ?   Sl   Sep05 4781:47 /usr/bin/qemu-system-x86_64
 -name instance-000000dd -S ... -netdev tap,fd=25,id=hostnet0,vhost=on,vhostfd=27 ...

I was getting closer! I checked the kernel modules. Kernel module vhost_net was loaded and active.

$ lsmod | grep net
vhost_net              18104  2
vhost                  29009  1 vhost_net
macvtap                18255  1 vhost_net
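
As a side note: when vhost is actually carrying a guest's traffic, the kernel runs worker threads named after the owning qemu process. One way to spot them (6875 being the qemu PID from the ps output above) might be:

$ ps -eo pid,comm | grep vhost-    # vhost worker threads show up as vhost-<qemu-pid>, e.g. vhost-6875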

I checked the qemu-kvm configuration and froze.

$ cat /etc/default/qemu-kvm
# To disable qemu-kvm's page merging feature, set KSM_ENABLED=0 and
# sudo restart qemu-kvm
KSM_ENABLED=1
SLEEP_MILLISECS=200
# To load the vhost_net module, which in some cases can speed up
# network performance, set VHOST_NET_ENABLED to 1.
VHOST_NET_ENABLED=0
 
# Set this to 1 if you want hugepages to be available to kvm under
# /run/hugepages/kvm
KVM_HUGEPAGES=0

vhost_net was disabled by default for qemu-kvm. We were pushing every packet through userspace and qemu instead of letting the kernel handle the virtio queues directly, as vhost_net does! That's where the lag was coming from!

I acted immediately to rescue the victims. I made the huge, extremely complicated, full one-byte change on all our compute nodes, changing VHOST_NET_ENABLED=0 to VHOST_NET_ENABLED=1, restarted all virtual machines, and finally, after days of constant screaming in pain, the flapping between the two load balancers stopped.
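
For the record, the whole fix boils down to something like this on every compute node (a sketch, assuming the stock Ubuntu 14.04 file shown above; the guests have to be restarted so that their network devices are re-created with vhost enabled):

$ sudo sed -i 's/^VHOST_NET_ENABLED=0/VHOST_NET_ENABLED=1/' /etc/default/qemu-kvm
$ sudo restart qemu-kvm    # as the comment in the config file itself suggests
# then stop and start (or hard-reboot) the affected instances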

I did it! I saved them!

But I couldn’t stop here. I wanted to find out who had done this to the poor little load balancers. Who was behind this conspiracy of crippled network latency?

I knew there was only one way to finally catch the guy. I set a trap. I installed a fresh, clean, virgin Ubuntu 14.04 in a virtual machine and then, well, then I waited — for apt-get install qemu-kvm to finish:

$ sudo apt-get install qemu-kvm
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following extra packages will be installed:
  acl cpu-checker ipxe-qemu libaio1 libasound2 libasound2-data libasyncns0
  libbluetooth3 libboost-system1.54.0 libboost-thread1.54.0 libbrlapi0.6
  libcaca0 libfdt1 libflac8 libjpeg-turbo8 libjpeg8 libnspr4 libnss3
  libnss3-nssdb libogg0 libpulse0 librados2 librbd1 libsdl1.2debian
  libseccomp2 libsndfile1 libspice-server1 libusbredirparser1 libvorbis0a
  libvorbisenc2 libxen-4.4 libxenstore3.0 libyajl2 msr-tools qemu-keymaps
  qemu-system-common qemu-system-x86 qemu-utils seabios sharutils
Suggested packages:
  libasound2-plugins alsa-utils pulseaudio samba vde2 sgabios debootstrap
  bsd-mailx mailx
The following NEW packages will be installed:
  acl cpu-checker ipxe-qemu libaio1 libasound2 libasound2-data libasyncns0
  libbluetooth3 libboost-system1.54.0 libboost-thread1.54.0 libbrlapi0.6
  libcaca0 libfdt1 libflac8 libjpeg-turbo8 libjpeg8 libnspr4 libnss3
  libnss3-nssdb libogg0 libpulse0 librados2 librbd1 libsdl1.2debian
  libseccomp2 libsndfile1 libspice-server1 libusbredirparser1 libvorbis0a
  libvorbisenc2 libxen-4.4 libxenstore3.0 libyajl2 msr-tools qemu-keymaps
  qemu-kvm qemu-system-common qemu-system-x86 qemu-utils seabios sharutils
0 upgraded, 41 newly installed, 0 to remove and 2 not upgraded.
Need to get 3631 kB/8671 kB of archives.
After this operation, 42.0 MB of additional disk space will be used.
Do you want to continue? [Y/n] <
...
Setting up qemu-system-x86 (2.0.0+dfsg-2ubuntu1.3) ...
qemu-kvm start/running
Setting up qemu-utils (2.0.0+dfsg-2ubuntu1.3) ...
Processing triggers for ureadahead (0.100.0-16) ...
Setting up qemu-kvm (2.0.0+dfsg-2ubuntu1.3) ...
Processing triggers for libc-bin (2.19-0ubuntu6.3) ...

And then, I let the trap snap:

$ cat /etc/default/qemu-kvm
# To disable qemu-kvm's page merging feature, set KSM_ENABLED=0 and
# sudo restart qemu-kvm
KSM_ENABLED=1
SLEEP_MILLISECS=200
# To load the vhost_net module, which in some cases can speed up
# network performance, set VHOST_NET_ENABLED to 1.
VHOST_NET_ENABLED=0
 
# Set this to 1 if you want hugepages to be available to kvm under
# /run/hugepages/kvm
KVM_HUGEPAGES=0

I could not believe it! It was Ubuntu’s own default setting. Ubuntu, the very foundation of our cloud, decided to turn vhost_net off by default, despite all modern hardware supporting it. Ubuntu was convicted and I could finally rest.

This is the end of my detective story. I found and arrested the criminal Ubuntu default setting and was able to prevent him from further crippling our virtual network latencies.

Please feel free to contact me and ask questions about the details of my journey. I’m already negotiating to sell the movie rights. But maybe there will be another season of OpenStack Crime Investigation in the future. So stay tuned!

Footnotes

1. All personal details like IP addresses, e-mail addresses, etc. have been changed to protect the innocent.

2. Yes, you see correctly: I am running screen inside tmux. This allows me to keep my screen tabs alive even across logouts.

3. Eh, and because I lost them.
