[one-users] Very high unavailable service

Steven C Timm timm at fnal.gov
Sat Aug 25 18:27:10 PDT 2012


I run high-availability Squid servers on virtual machines, although not yet in OpenNebula.
It can be done with very high availability.
I am not familiar with Ubuntu Server 12.04, but if it has libvirt 0.9.7 or better, and you are
using the KVM hypervisor, you should be able to use the CPU-pinning and NUMA-aware features of libvirt to pin
each virtual machine to a given physical CPU.   That should address the migration issue you are seeing now.
With the Xen hypervisor you can (and should) also pin.
I think if you solve the CPU and memory pinning problem you will be OK.
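
For example, with libvirt/KVM, something along these lines should do it (just a sketch; the domain name "squid-vm" and the CPU numbers are placeholders for your setup):

    # Pin each vCPU of the guest to its own physical core, at runtime:
    virsh vcpupin squid-vm 0 0
    virsh vcpupin squid-vm 1 1

    # Or make it persistent in the domain XML (virsh edit squid-vm):
    <vcpu cpuset='0-1'>2</vcpu>
    <cputune>
      <vcpupin vcpu='0' cpuset='0'/>
      <vcpupin vcpu='1' cpuset='1'/>
    </cputune>

    # And keep guest memory on the same NUMA node as the pinned CPUs:
    <numatune>
      <memory mode='strict' nodeset='0'/>
    </numatune>

    # Under Xen the rough equivalent is:
    xm vcpu-pin <domain> 0 0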

However, you did not say what network topology you are using for your virtual machines, or what kind of virtual network drivers;
that is important too.    Also: is your Squid cache mostly disk-resident or mostly RAM-resident?  If the former, then the virtual disk drivers matter too, a lot.
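
For a KVM guest I would at least make sure both the NIC and the disk use the paravirtualized virtio drivers instead of emulated hardware. In the libvirt domain XML that looks roughly like this (a sketch; the bridge name and image path are placeholders):

    <!-- Paravirtualized NIC instead of emulated e1000/rtl8139: -->
    <interface type='bridge'>
      <source bridge='br0'/>
      <model type='virtio'/>
    </interface>

    <!-- Paravirtualized disk instead of emulated IDE
         (matters a lot for a disk-resident cache): -->
    <disk type='file' device='disk'>
      <driver name='qemu' type='raw' cache='none'/>
      <source file='/path/to/squid-vm.img'/>
      <target dev='vda' bus='virtio'/>
    </disk>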

Steve Timm



From: users-bounces at lists.opennebula.org [mailto:users-bounces at lists.opennebula.org] On Behalf Of Erico Augusto Cavalcanti Guedes
Sent: Saturday, August 25, 2012 6:33 PM
To: users at lists.opennebula.org
Subject: [one-users] Very high unavailable service

Dear all,

I'm running the Squid web cache proxy server on Ubuntu Server 12.04 VMs (kernel 3.2.0-23-generic-pae), OpenNebula 3.4.
My private cloud is composed of one frontend and three nodes. The VMs are running on those 3 nodes, initially one per node.
Outside the cloud there are 2 hosts, one acting as the web clients and the other as the web server, using the Web Polygraph benchmarking tool.

The goal of the tests is to stress the Squid cache running on the VMs.
When the same test is executed outside the cloud, using the three nodes as physical machines, there is 100% cache service availability.
Nevertheless, when the cache service is provided by the VMs, nothing better than 45% service availability is reached.
Web clients do not receive responses from Squid 55% of the time when it is running on the VMs.

I have monitored the load average of the VMs and of the PMs where the VMs are being executed. The first load-average field reaches 15 after some hours of tests on the VMs, versus 3 on the physical machines.
Furthermore, there is a set of processes, called migration/X, that are the champions in CPU time when the VMs are in execution. A sample:

top - 20:01:38 up 1 day,  3:36,  1 user,  load average: 5.50, 5.47, 4.20

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+    TIME COMMAND
   13 root      RT   0     0    0    0 S    0  0.0 408:27.25 408:27 migration/2
    8 root      RT   0     0    0    0 S    0  0.0 404:13.63 404:13 migration/1
    6 root      RT   0     0    0    0 S    0  0.0 401:36.78 401:36 migration/0
   17 root      RT   0     0    0    0 S    0  0.0 400:59.10 400:59 migration/3
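
(For completeness, I believe this migration activity can also be counted directly with perf; a sketch of the kind of command I mean, where <qemu-pid> is a placeholder for the hypervisor process of one VM:)

    # Count CPU migrations and context switches of one qemu/kvm process
    # over a 60-second window:
    perf stat -e cpu-migrations,context-switches -p <qemu-pid> sleep 60

    # The same counters system-wide:
    perf stat -e cpu-migrations,context-switches -a sleep 60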


It isn't possible to offer a web cache service via VMs with the service behaving this way, with such low availability.

So, my questions:

1. Has anybody experienced a similar problem of an unresponsive service? (Whatever the service.)
2. How can I identify the bottleneck that is overloading the system, so that it can be minimized?

Thanks a lot,

Erico.

