[one-users] Monitoring/Deployment issues: Virsh fails often

Floris Sluiter Floris.Sluiter at sara.nl
Fri Jul 30 03:49:16 PDT 2010


Hi Tino and List,

For us these methods are not very acceptable in a production environment. We feel that we do need to monitor the status of the Cloud, both the hosts and the VMs, at least once every minute for each component. Stopping or drastically reducing the monitoring of VMs is not the way to go for us. If the method of monitoring causes the Cloud to fail, then the method needs changing, not the frequency of it...

I'll see what we can come up with to improve on this. We already donated the SNMP driver for the hosts; maybe something similar can be done for the VMs (we are testing the read-only method for virsh).

Kind regards,

Floris


-----Original Message-----
From: tinova79 at gmail.com [mailto:tinova79 at gmail.com] On Behalf Of Tino Vazquez
Sent: Thursday 29 July 2010 19:29
To: Floris Sluiter
Cc: DuDu; users at lists.opennebula.org
Subject: Re: Monitoring/Deployment issues: Virsh fails often

Dear Floris,

We noticed this behavior in the scalability tests we put OpenNebula
through; a ticket was opened for 2.0 regarding this [1]. It happens
with libvirt and also with the Xen hypervisor. It is by no means an
OpenNebula scalability issue, since libvirt is not exposing its
intended behavior.

Nevertheless, in the 2.0 version we introduced means to avoid this
unpredictable behavior. We have limited the number of simultaneous
deployments to the same host to one, to avoid this blocked-socket
issue. This can be configured through the scheduler configuration
file; more information on this can be found in [2]. We have also
introduced a limit on the number of simultaneous VM polling requests,
with the same purpose.
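
For example, the relevant knobs in the scheduler configuration look
something along these lines (the values here are purely illustrative;
see [2] for the exact names and syntax):

  MAX_VM       = 300   # max pending VMs considered per scheduling action
  MAX_DISPATCH = 30    # max VMs dispatched per scheduling action
  MAX_HOST     = 1     # max VMs dispatched to a single host per action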

With the changes detailed above, OpenNebula is able to deploy tens of
thousands of virtual machines on a pool of hundreds of physical
servers at the same time.

The read-only method for performing the polling is a very neat and
interesting proposal; we will evaluate its inclusion in the 2.0
version. Thanks a lot for this valuable feedback.

Best regards,

-Tino

[1] http://dev.opennebula.org/issues/261
[2] http://www.opennebula.org/documentation:rel2.0:schg

--
Constantino Vázquez Blanco | dsa-research.org/tinova
Virtualization Technology Engineer / Researcher
OpenNebula Toolkit | opennebula.org



On Thu, Jul 29, 2010 at 6:28 PM, Floris Sluiter <Floris.Sluiter at sara.nl> wrote:
> Hi,
>
>
>
> We again had issues when having many VMs deployed on many hosts at the same
> time (log excerpts below) and deploying more.
>
> We found over 25 runaway VMs left behind from the last two weeks, still
> running even though OpenNebula had marked them as DONE; deploy, copy and
> stop also failed randomly quite often.
>
>
>
> It starts to be a major problem when we can't run OpenNebula in a stable
> and predictable manner on larger Clouds.
>
> We have the following intervals configured; we feel we do need to monitor
> more often than every 10 minutes:
>
> HOST_MONITORING_INTERVAL = 20
>
> VM_POLLING_INTERVAL      = 30
>
> So we first used our SNMP driver again, which solved a large part of the
> problems; but our cloud is still growing, so we reached the next limit.
>
>
>
> What seems to be happening is that "virsh --connect qemu:///system
> dominfo" interferes with other virsh commands: virsh locks libvirt-sock,
> so multiple processes cannot connect at the same time.
>
> The solution we are now trying is to do the monitoring of VMs in read-only
> mode:
>
> "virsh --readonly --connect qemu:///system dominfo"
>
> which we added/changed in the file /usr/lib/one/mads/one_vmm_kvm.rb.
>
> Now virsh doesn't lock the libvirt-sock, as far as we can see.
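>
> To illustrate, the difference at the shell level is just the extra flag
> (a read-only session connects to the separate libvirt-sock-ro socket, so
> it does not contend for the read/write one); a quick sanity check is to
> fire several read-only polls in parallel:
>
>   # read/write connection -- contends for /var/run/libvirt/libvirt-sock:
>   virsh --connect qemu:///system dominfo one-428
>
>   # read-only connection -- uses the read-only socket instead:
>   virsh --readonly --connect qemu:///system dominfo one-428
>
>   # several concurrent read-only polls should all return cleanly:
>   for i in $(seq 1 10); do
>     virsh --readonly --connect qemu:///system dominfo one-428 &
>   done; wait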
>
>
>
> Currently we do not see the error messages we had before, but some kind of
> robust, scalable and fail-safe monitoring solution for OpenNebula is
> needed.
>
>
>
> Hope this helps
>
> Kind regards,
>
>
>
> Floris
>
>
>
>
>
> Thu Jul 22 16:16:21 2010 [VMM][I]: Command execution fail: virsh --connect
> qemu:///system dominfo one-428
>
> Thu Jul 22 16:16:21 2010 [VMM][I]: STDERR follows.
>
> Thu Jul 22 16:16:21 2010 [VMM][I]: error: unable to connect to
> '/var/run/libvirt/libvirt-sock', libvirtd may need to be started: Permission
> denied
>
> Thu Jul 22 16:16:21 2010 [VMM][I]: error: failed to connect to the
> hypervisor
>
> Thu Jul 22 16:16:21 2010 [VMM][I]: ExitCode: 1
>
> Thu Jul 22 16:16:21 2010 [VMM][E]: Error monitoring VM, -
>
>
>
> And sometimes destroy would fail:
>
> Wed Jul 28 13:05:34 2010 [LCM][I]: New VM state is SAVE_STOP
>
> Wed Jul 28 13:05:34 2010 [VMM][I]: Command execution fail: 'touch
> /var/lib/one/585/images/checkpoint;virsh --connect qemu:///system save
> one-585 /var/lib/one/585/images/checkpoint'
>
> Wed Jul 28 13:05:34 2010 [VMM][I]: STDERR follows.
>
> Wed Jul 28 13:05:34 2010 [VMM][I]: error: unable to connect to
> '/var/run/libvirt/libvirt-sock', libvirtd may need to be started: Permission
> denied
>
> Wed Jul 28 13:05:34 2010 [VMM][I]: error: failed to connect to the
> hypervisor
>
> Wed Jul 28 13:05:34 2010 [VMM][I]: ExitCode: 1
>
> Wed Jul 28 13:05:34 2010 [VMM][E]: Error saving VM state, -
>
> Wed Jul 28 13:05:35 2010 [LCM][I]: Fail to save VM state. Assuming that the
> VM is still RUNNING (will poll VM).
>
> Wed Jul 28 13:05:38 2010 [VMM][I]: Command execution fail: virsh --connect
> qemu:///system dominfo one-585
>
> Wed Jul 28 13:05:38 2010 [VMM][I]: STDERR follows.
>
> Wed Jul 28 13:05:38 2010 [VMM][I]: error: unable to connect to
> '/var/run/libvirt/libvirt-sock', libvirtd may need to be started: Permission
> denied
>
> Wed Jul 28 13:05:38 2010 [VMM][I]: error: failed to connect to the
> hypervisor
>
> Wed Jul 28 13:05:38 2010 [VMM][I]: ExitCode: 1
>
> Wed Jul 28 13:05:38 2010 [VMM][E]: Error monitoring VM, -
>
> ... trying like 10 times ...
>
> Wed Jul 28 13:09:14 2010 [VMM][E]: Error monitoring VM, -
>
> Wed Jul 28 13:09:56 2010 [LCM][I]: New VM state is SAVE_STOP
>
> Wed Jul 28 13:09:56 2010 [VMM][I]: Command execution fail: 'touch
> /var/lib/one/585/images/checkpoint;virsh --connect qemu:///system save
> one-585 /var/lib/one/585/images/checkpoint'
>
> Wed Jul 28 13:09:56 2010 [VMM][I]: STDERR follows.
>
> Wed Jul 28 13:09:56 2010 [VMM][I]: error: unable to connect to
> '/var/run/libvirt/libvirt-sock', libvirtd may need to be started: Permission
> denied
>
> Wed Jul 28 13:09:56 2010 [VMM][I]: error: failed to connect to the
> hypervisor
>
> Wed Jul 28 13:09:56 2010 [VMM][I]: ExitCode: 1
>
> Wed Jul 28 13:09:56 2010 [VMM][E]: Error saving VM state, -
>
> Wed Jul 28 13:09:56 2010 [LCM][I]: Fail to save VM state. Assuming that the
> VM is still RUNNING (will poll VM).
>
> Wed Jul 28 13:09:56 2010 [VMM][I]: Command execution fail: virsh --connect
> qemu:///system dominfo one-585
>
> Wed Jul 28 13:09:56 2010 [VMM][I]: STDERR follows.
>
> Wed Jul 28 13:09:56 2010 [VMM][I]: error: unable to connect to
> '/var/run/libvirt/libvirt-sock', libvirtd may need to be started: Permission
> denied
>
> Wed Jul 28 13:09:56 2010 [VMM][I]: error: failed to connect to the
> hypervisor
>
> Wed Jul 28 13:09:56 2010 [VMM][I]: ExitCode: 1
>
> Wed Jul 28 13:09:56 2010 [VMM][E]: Error monitoring VM, -
>
> Wed Jul 28 13:10:24 2010 [VMM][I]: Command execution fail: virsh --connect
> qemu:///system dominfo one-585
>
> Wed Jul 28 13:10:24 2010 [VMM][I]: STDERR follows.
>
> Wed Jul 28 13:10:24 2010 [VMM][I]: error: unable to connect to
> '/var/run/libvirt/libvirt-sock', libvirtd may need to be started: Permission
> denied
>
> Wed Jul 28 13:10:24 2010 [VMM][I]: error: failed to connect to the
> hypervisor
>
> Wed Jul 28 13:10:24 2010 [VMM][I]: ExitCode: 1
>
> Wed Jul 28 13:10:24 2010 [VMM][E]: Error monitoring VM, -
>
> Wed Jul 28 13:10:45 2010 [DiM][I]: New VM state is DONE
>
> Wed Jul 28 13:10:45 2010 [VMM][W]: Ignored: LOG - 585 Driver command for 585
> cancelled
>
> Wed Jul 28 13:10:45 2010 [TM][W]: Ignored: LOG - 585 tm_delete.sh: Deleting
> /var/lib/one/585/images
>
> Wed Jul 28 13:10:45 2010 [TM][W]: Ignored: LOG - 585 tm_delete.sh: Executed
> "ssh node13-one rm -rf /var/lib/one/585/images".
>
> Wed Jul 28 13:10:45 2010 [TM][W]: Ignored: TRANSFER SUCCESS 585 -
>
> Wed Jul 28 13:10:45 2010 [VMM][W]: Ignored: LOG - 585 Command execution
> fail: virsh --connect qemu:///system destroy one-585
>
> Wed Jul 28 13:10:45 2010 [VMM][W]: Ignored: LOG - 585 STDERR follows.
>
> Wed Jul 28 13:10:45 2010 [VMM][W]: Ignored: LOG - 585 error: unable to
> connect to '/var/run/libvirt/libvirt-sock', libvirtd may need to be started:
> Permission denied
>
> Wed Jul 28 13:10:45 2010 [VMM][W]: Ignored: LOG - 585 error: failed to
> connect to the hypervisor
>
> Wed Jul 28 13:10:45 2010 [VMM][W]: Ignored: LOG - 585 ExitCode: 1
>
> Wed Jul 28 13:10:45 2010 [VMM][W]: Ignored: CANCEL FAILURE 585 -
>
>
>
>
>
>
>
> From: users-bounces at lists.opennebula.org
> [mailto:users-bounces at lists.opennebula.org] On Behalf Of Floris Sluiter
> Sent: Monday 19 July 2010 18:18
> To: 'Tino Vazquez'; DuDu
> Cc: users at lists.opennebula.org
> Subject: Re: [one-users] oned hang
>
>
>
> Hi Dudu, Tino and all,
>
>
>
> We have seen the exact same message ("Command execution fail" and "bad
> interpreter: Text file busy") on our cluster last week, when we expanded it
> from 12 to 16 hosts (with add host) while deploying 10 virtual machines at
> the same time. We did not have multiple instances of OpenNebula running, we
> only added hosts to a running one, so it is unlikely that was the issue
> (the cluster had already been running stable for a while). We investigated
> and thought it was a timing issue, with the monitoring (ssh) driver set to
> 60 seconds and having many hosts and many VMs.
>
> We started using the ssh monitoring driver again after the latest update to
> OpenNebula; before that we used our in-house developed SNMP monitoring
> driver.
>
> When we deployed our SNMP driver the error messages stopped, and for the
> last week we have had a stable cloud again, now with 16 hosts.
>
> For people who see the same timing issues as we did, the snmp_driver is
> available in the ecosystem (but make sure you know what SNMP is before you
> try ;-)): http://opennebula.org/software:ecosystem:snmp_im_driver
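>
> (For anyone trying it: an information driver is plugged into oned.conf
> through an IM_MAD section. A hypothetical entry for the SNMP driver could
> look like the lines below; the name, executable and arguments here are
> illustrative, the ecosystem page above has the real ones.)
>
>   IM_MAD = [
>       name       = "im_snmp",
>       executable = "one_im_snmp",
>       arguments  = "im_snmp/im_snmp.conf" ]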
>
> Regards,
>
>
>
> Floris
>
> HPC project leader
>
> Sara
>
>
>
>
>
> From: users-bounces at lists.opennebula.org
> [mailto:users-bounces at lists.opennebula.org] On Behalf Of Tino Vazquez
> Sent: Monday 19 July 2010 16:15
> To: DuDu
> Cc: users at lists.opennebula.org
> Subject: Re: [one-users] oned hang
>
>
>
> Dear DuDu,
>
>
>
> This happens when two monitoring actions take place at the same time.
>
>
>
> First thing, which OpenNebula version are you using?
>
>
>
> Are you by any chance running two OpenNebula instances? Did you change the
> host polling time?
>
>
>
> Regards,
>
>
>
> -Tino
>
> --
> Constantino Vázquez Blanco | dsa-research.org/tinova
> Virtualization Technology Engineer / Researcher
> OpenNebula Toolkit | opennebula.org
>
> On Wed, Jul 14, 2010 at 3:13 PM, DuDu <blackass at gmail.com> wrote:
>
>
>
> Hi,
>
>
>
> We deployed a small OpenNebula cluster with 8 hosts. It is the default
> OpenNebula installation; however, we found that after several days of
> running, oned hung. All CLI commands hung too, no new logs were generated
> in one_xmlrpc.log, and there were quite a few error messages like the
> following in oned.log:
>
>
>
> [root at vm-container-31-0 logdir]# tail oned.log
> Wed Jul 14 14:51:02 2010 [InM][I]: Warning: untrusted X11 forwarding setup
> failed: xauth key data not generated
> Wed Jul 14 14:51:02 2010 [InM][I]: Warning: No xauth data; using fake
> authentication data for X11 forwarding.
> Wed Jul 14 14:51:02 2010 [InM][I]: bash:
> /tmp/one-im//one_im-c4718299a313d89398ea693104dcce5f: /bin/sh: bad
> interpreter: Text file busy
> Wed Jul 14 14:51:02 2010 [InM][I]: ExitCode: 126
> Wed Jul 14 14:51:02 2010 [InM][I]: Command execution fail: 'mkdir -p
> /tmp/one-im/; cat > /tmp/one-im//one_im-f3817715aa24450225bafb4c19b23822; if
> [ "x$?" != "x0" ]; then exit -1; fi; chmod +x
> /tmp/one-im//one_im-f3817715aa24450225bafb4c19b23822;
> /tmp/one-im//one_im-f3817715aa24450225bafb4c19b23822'
> Wed Jul 14 14:51:02 2010 [InM][I]: STDERR follows.
> Wed Jul 14 14:51:02 2010 [InM][I]: Warning: untrusted X11 forwarding setup
> failed: xauth key data not generated
> Wed Jul 14 14:51:02 2010 [InM][I]: Warning: No xauth data; using fake
> authentication data for X11 forwarding.
> Wed Jul 14 14:51:02 2010 [InM][I]: bash:
> /tmp/one-im//one_im-f3817715aa24450225bafb4c19b23822: /bin/sh: bad
> interpreter: Text file busy
> Wed Jul 14 14:51:02 2010 [InM][I]: ExitCode: 126
>
>
>
> We have to SIGKILL oned and restart it, and that solves all the problems.
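>
> Concretely we do something like the following (assuming a self-contained
> install under $ONE_LOCATION):
>
>   pkill -9 oned                 # force-kill the hung daemon
>   $ONE_LOCATION/bin/one start   # restart OpenNebula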
>
>
>
> Any ideas about this?
>
>
>
> Thanks!
>
> _______________________________________________
> Users mailing list
> Users at lists.opennebula.org
> http://lists.opennebula.org/listinfo.cgi/users-opennebula.org
>
>


