[one-users] Very high unavailable service

Erico Augusto Cavalcanti Guedes eacg at cin.ufpe.br
Mon Aug 27 13:34:39 PDT 2012


Hi again,

CPU pinning does not result in improved availability.
Executing the test with just VCPU=1 and CPU=0.5 results in 55% availability.
Before CPU pinning, availability was 51%.

The facts point to disk access, as said in previous messages.
There is a huge overload when Squid is reading the disks from inside the VMs.
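
One way to confirm this while the benchmark runs, as a sketch assuming the sysstat package is available inside the VMs:

  iostat -x 2   # %util near 100 and a large await column indicate a saturated disk
  vmstat 2      # the "wa" column is the share of CPU time spent waiting for I/O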

I'll be grateful for any guidance.

Thank you in advance,

Erico.


2012/8/27 Erico Augusto Cavalcanti Guedes <eacg at cin.ufpe.br>

> Hi,
>
> 2012/8/26 Ruben S. Montero <rsmontero at opennebula.org>
>
> Hi
>>
>> If you want to try the cpu pinning suggested by Steven, simply add a
>> RAW attribute in your VM template. Something similar to:
>>
>> RAW=[
>>   TYPE="kvm",
>>   DATA="<cputune><vcpupin vcpu=\"0\" cpuset=\"1\"/></cputune>" ]
>>
>
> Added to the VM template. Running an experiment right now to observe
> availability.
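>
> To double-check that the pinning took effect, a possible verification on the
> host (using the libvirt domain name reported by virsh, e.g. one-609 below) is:
>
>   virsh vcpupin one-609                      # lists the current CPU affinity of each vCPU
>   virsh dumpxml one-609 | grep -A2 cputune   # shows the <cputune> section injected via RAW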
>
>
>>
>> Cheers
>>
>> Ruben
>>
>>
>> On Sun, Aug 26, 2012 at 3:27 AM, Steven C Timm <timm at fnal.gov> wrote:
>> > I run high-availability squid servers on virtual machines, although not
>> > yet in OpenNebula.
>> >
>> > It can be done with very high availability.
>> >
>> > I am not familiar with Ubuntu Server 12.04, but if it has libvirt 0.9.7
>> > or better and you are using the KVM hypervisor, you should be able to
>> > use the cpu-pinning and numa-aware features of libvirt to pin each
>> > virtual machine to a given physical cpu. That will beat the migration
>> > issue you are seeing now.
>>
>
> Done. The migration processes were beaten, as you said. Thank you.
>
>
>>  >
>> > With Xen hypervisor you can (and should) also pin.
>> >
>> > I think if you beat the cpu and memory pinning problem you will be OK.
>> >
>> >
>> >
>> > However, you did not say what network topology you are using for your
>> > virtual machine,
>>
>
> That is my vnet template:
>
> TYPE = RANGED
> BRIDGE = br0
> NETWORK_SIZE    = C
> NETWORK_ADDRESS = 192.168.15.0
> NETMASK         = 255.255.255.0
> GATEWAY         = 192.168.15.10
> IP_START        = 192.168.15.110
> IP_END          = 192.168.15.190
>
> Furthermore, there is one Squid instance running on each VM, connected
> in a mesh architecture with the LRU replacement policy. The Web Polygraph clients
> are configured to send requests round-robin to each of the VMs (only three
> currently: 192.168.15.110-112). The Web Polygraph server, which represents the
> Internet, is running on 192.168.15.10, closing the topology.
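>
> For reference, the mesh relations are declared with sibling cache_peer lines;
> roughly like this on each node (a sketch, assuming the default HTTP/ICP ports
> 3128 and 3130; the real files may differ):
>
> # on 192.168.15.110, pointing to the other two nodes of the mesh
> cache_peer 192.168.15.111 sibling 3128 3130 proxy-only
> cache_peer 192.168.15.112 sibling 3128 3130 proxy-only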
>
> Remember that when the same test is performed on physical machines,
> availability is invariably 100%.
>
>> > and what kind of virtual network drivers,
>>
>
> The VMs are configured on VirtualBox and based on Ubuntu Server 12.04.
> Executing:
> ps ax | grep kvm
>
> on the 192.168.15.110 cloud node gives:
>
> /usr/bin/kvm -S -M pc-0.14 -enable-kvm -m 1024 -smp
> 1,sockets=1,cores=1,threads=1 -name one-609 -uuid
> a7115950-3f0e-0fab-ea20-3f829d835889 -nodefconfig -nodefaults -chardev
> socket,id=charmonitor,path=/var/lib/libvirt/qemu/one-609.monitor,server,nowait
> -mon chardev=charmonitor,id=monitor,mode=readline -rtc base=utc -no-acpi
> -boot c -drive
> file=/srv3/cloud/one/var/datastores/0/609/disk.0,if=none,id=drive-ide0-0-0,format=qcow2
> -device ide-drive,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0 -drive
> file=/srv3/cloud/one/var/datastores/0/609/disk.1,if=none,id=drive-ide0-1-1,format=raw
> -device ide-drive,bus=ide.1,unit=1,drive=drive-ide0-1-1,id=ide0-1-1 -netdev
> tap,fd=20,id=hostnet0 -device
> rtl8139,netdev=hostnet0,id=net0,mac=02:00:c0:a8:0f:70,bus=pci.0,addr=0x3
> -usb -vnc 0.0.0.0:609 -vga cirrus -device
> virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x4
>
> The network driver used in the VMs is rtl8139.
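>
> Since the command line shows the disks attached on the IDE bus and an emulated
> rtl8139 NIC, one thing I may try is switching both to virtio (Ubuntu's 3.2 kernel
> ships the guest drivers). A sketch of the template changes, assuming OpenNebula
> maps a "vd" target to the virtio bus; the image and network names are hypothetical:
>
> DISK = [ IMAGE = "ubuntu-squid", TARGET = "vda" ]    # hypothetical image name; vda -> virtio disk
> NIC  = [ NETWORK = "squid-net",  MODEL  = "virtio" ] # hypothetical network name; virtio NIC model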
>
>
>>
>>
>  >
>> > That is important too. Also, is your squid cache mostly disk-resident
>> > or mostly RAM-resident?
>>
>
> disk-resident (from squid.conf):
> cache_mem 128 MB
> cache_dir aufs /var/spool/squid3 1024 16 256
>
> Note that the cache disk space is reduced to 1024 MB for testing purposes.
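>
> To answer the disk- versus RAM-resident question more directly, one variation I
> could run is a mostly RAM-resident configuration, e.g. (a sketch, with sizes
> chosen only for comparison purposes):
>
> cache_mem 768 MB
> cache_dir aufs /var/spool/squid3 256 16 256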
>
>
>> > If the former, then the virtual disk drivers matter too, a lot.
>>
>
> Any reading suggestions? I would appreciate them a lot.
>
> One additional piece of information that may be relevant. I am monitoring VM
> processes in D state with the command:
> ps -eo pid,tid,class,rtprio,psr,pcpu,stat,wchan:25,comm | grep D
>   PID   TID CLS RTPRIO PSR %CPU STAT WCHAN                     COMMAND
>   170   170 TS       -   0  0.1 D    sleep_on_page             jbd2/sda1-8
>  1645  1645 TS       -   0 11.4 Dsl  get_write_access          squid3
>  1647  1647 TS       -   0  0.2 D    get_write_access          autExperiment.sh
>  1679  1679 TS       -   0  0.8 D    get_write_access          autExperiment.sh
>  1680  1680 TS       -   0  0.5 D    get_write_access          autExperiment.sh
>
> It reveals that the ext4 journaling process (jbd2/sda1-8), Squid, and my
> monitoring scripts are in D state, always waiting on the sleep_on_page and
> get_write_access kernel functions. I am researching the relevance/influence
> of these WCHANs. I performed tests without my monitoring scripts and
> availability was just as bad.
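>
> If the ext4 journal really is part of the problem, two low-risk checks I can make
> inside the VM (a sketch, assuming /var/spool/squid3 lives on the root filesystem,
> sda1, as the jbd2/sda1-8 thread suggests):
>
>   mount -o remount,noatime /                        # stop access-time updates from adding extra journal writes
>   sysctl vm.dirty_ratio vm.dirty_background_ratio   # how much dirty data piles up before writeback kicks in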
>
> Thanks to Steve and Ruben for your time and answers.
>
> Erico.
>
>
>
>> >
>> >
>> >
>> > Steve Timm
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> > From: users-bounces at lists.opennebula.org
>> > [mailto:users-bounces at lists.opennebula.org] On Behalf Of Erico Augusto
>> > Cavalcanti Guedes
>> > Sent: Saturday, August 25, 2012 6:33 PM
>> > To: users at lists.opennebula.org
>> > Subject: [one-users] Very high unavailable service
>> >
>> >
>> >
>> > Dears,
>> >
>> > I'm running the Squid web cache proxy server on Ubuntu Server 12.04 VMs
>> > (kernel 3.2.0-23-generic-pae), with OpenNebula 3.4.
>> > My private cloud is composed of one frontend and three nodes. VMs are
>> > running on those 3 nodes, initially one per node.
>> > Outside the cloud, there are 2 hosts, one working as the web clients and
>> > the other as the web server, using the Web Polygraph benchmarking tool.
>> >
>> > The goal of the tests is to stress the Squid cache running on the VMs.
>> > When the same test is executed outside the cloud, using the three nodes as
>> > physical machines, there is 100% cache service availability.
>> > Nevertheless, when the cache service is provided by VMs, nothing better
>> > than 45% service availability is reached.
>> > Web clients do not receive responses from Squid 55% of the time when it is
>> > running on VMs.
>> >
>> > I have monitored the load average of the VMs and of the PMs where the VMs
>> > are being executed.
>> > The first load average field reaches 15 after some hours of tests on the
>> > VMs, and 3 on the physical machines.
>> > Furthermore, there is a set of processes, called migration/X, that are the
>> > champions in CPU TIME when the VMs are running. A sample:
>> >
>> > top - 20:01:38 up 1 day,  3:36,  1 user,  load average: 5.50, 5.47, 4.20
>> >
>> >   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM     TIME+   TIME COMMAND
>> >    13 root      RT   0     0    0    0 S    0  0.0 408:27.25 408:27 migration/2
>> >     8 root      RT   0     0    0    0 S    0  0.0 404:13.63 404:13 migration/1
>> >     6 root      RT   0     0    0    0 S    0  0.0 401:36.78 401:36 migration/0
>> >    17 root      RT   0     0    0    0 S    0  0.0 400:59.10 400:59 migration/3
>> >
>> >
>> > It isn't possible to offer a web cache service via VMs with the service
>> > behaving this way, with such low availability.
>> >
>> > So, my questions:
>> >
>> > 1. Has anybody experienced a similar problem of an unresponsive service?
>> > (Whatever the service.)
>> > 2. How can I identify the bottleneck that is overloading the system, so
>> > that it can be minimized?
>> >
>> > Thanks a lot,
>> >
>> > Erico.
>> >
>> >
>> > _______________________________________________
>> > Users mailing list
>> > Users at lists.opennebula.org
>> > http://lists.opennebula.org/listinfo.cgi/users-opennebula.org
>> >
>>
>>
>>
>> --
>> Ruben S. Montero, PhD
>> Project co-Lead and Chief Architect
>> OpenNebula - The Open Source Solution for Data Center Virtualization
>> www.OpenNebula.org | rsmontero at opennebula.org | @OpenNebula
>>
>
>

