[one-users] Monitoring/Deployment issues: Virsh fails often
Floris Sluiter
Floris.Sluiter at sara.nl
Thu Jul 29 09:28:46 PDT 2010
Hi,
We again ran into issues when many VMs were already deployed on many hosts and we were deploying more at the same time (log excerpts below).
Over the last two weeks we found more than 25 runaway VMs left behind running that OpenNebula had already marked as DONE; deploy, copy and stop operations also failed randomly quite often.
This is becoming a major problem: we cannot run OpenNebula in a stable and predictable manner on larger clouds...
We feel we need to monitor more often than every 10 minutes, so we have the following intervals configured (values in seconds):
HOST_MONITORING_INTERVAL = 20
VM_POLLING_INTERVAL = 30
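(Both values go in oned.conf, and oned has to be restarted to pick up changes; roughly like this, assuming a standard self-contained install under $ONE_LOCATION:)

  # $ONE_LOCATION/etc/oned.conf (or /etc/one/oned.conf for a system-wide install)
  HOST_MONITORING_INTERVAL = 20   # seconds between host monitoring runs
  VM_POLLING_INTERVAL      = 30   # seconds between VM polls
  # restart oned as the oneadmin user so the new intervals take effect
  one stop
  one start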
So we first switched back to our SNMP driver, which solved a large part of the problems, but our cloud keeps growing, so we hit the next limit...
What seems to be happening is that "virsh --connect qemu:///system dominfo" interferes with other virsh commands: virsh locks libvirt-sock, so multiple processes cannot connect at the same time.
The solution we are now trying: monitor the VMs over a read-only connection, i.e. "virsh --readonly --connect qemu:///system dominfo".
We added/changed this in the file /usr/lib/one/mads/one_vmm_kvm.rb.
As far as we can see, virsh no longer locks libvirt-sock.
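For reference, the change boils down to the poll command below (the exact line in one_vmm_kvm.rb differs between OpenNebula versions, so take this as a sketch rather than a patch):

  # before: read-write connection, goes through /var/run/libvirt/libvirt-sock
  virsh --connect qemu:///system dominfo one-428
  # after: read-only connection, which is enough for dominfo and does not
  # compete with deploy/save/destroy for the read-write socket
  virsh --readonly --connect qemu:///system dominfo one-428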
We currently do not see the error messages we had before, but OpenNebula still needs some kind of robust, scalable and fail-safe monitoring solution.
Hope this helps
Kind regards,
Floris
Thu Jul 22 16:16:21 2010 [VMM][I]: Command execution fail: virsh --connect qemu:///system dominfo one-428
Thu Jul 22 16:16:21 2010 [VMM][I]: STDERR follows.
Thu Jul 22 16:16:21 2010 [VMM][I]: error: unable to connect to '/var/run/libvirt/libvirt-sock', libvirtd may need to be started: Permission denied
Thu Jul 22 16:16:21 2010 [VMM][I]: error: failed to connect to the hypervisor
Thu Jul 22 16:16:21 2010 [VMM][I]: ExitCode: 1
Thu Jul 22 16:16:21 2010 [VMM][E]: Error monitoring VM, -
And sometimes destroy would fail:
Wed Jul 28 13:05:34 2010 [LCM][I]: New VM state is SAVE_STOP
Wed Jul 28 13:05:34 2010 [VMM][I]: Command execution fail: 'touch /var/lib/one/585/images/checkpoint;virsh --connect qemu:///system save one-585 /var/lib/one/585/images/checkpoint'
Wed Jul 28 13:05:34 2010 [VMM][I]: STDERR follows.
Wed Jul 28 13:05:34 2010 [VMM][I]: error: unable to connect to '/var/run/libvirt/libvirt-sock', libvirtd may need to be started: Permission denied
Wed Jul 28 13:05:34 2010 [VMM][I]: error: failed to connect to the hypervisor
Wed Jul 28 13:05:34 2010 [VMM][I]: ExitCode: 1
Wed Jul 28 13:05:34 2010 [VMM][E]: Error saving VM state, -
Wed Jul 28 13:05:35 2010 [LCM][I]: Fail to save VM state. Assuming that the VM is still RUNNING (will poll VM).
Wed Jul 28 13:05:38 2010 [VMM][I]: Command execution fail: virsh --connect qemu:///system dominfo one-585
Wed Jul 28 13:05:38 2010 [VMM][I]: STDERR follows.
Wed Jul 28 13:05:38 2010 [VMM][I]: error: unable to connect to '/var/run/libvirt/libvirt-sock', libvirtd may need to be started: Permission denied
Wed Jul 28 13:05:38 2010 [VMM][I]: error: failed to connect to the hypervisor
Wed Jul 28 13:05:38 2010 [VMM][I]: ExitCode: 1
Wed Jul 28 13:05:38 2010 [VMM][E]: Error monitoring VM, -
... it keeps retrying like this about 10 times ...
Wed Jul 28 13:09:14 2010 [VMM][E]: Error monitoring VM, -
Wed Jul 28 13:09:56 2010 [LCM][I]: New VM state is SAVE_STOP
Wed Jul 28 13:09:56 2010 [VMM][I]: Command execution fail: 'touch /var/lib/one/585/images/checkpoint;virsh --connect qemu:///system save one-585 /var/lib/one/585/images/checkpoint'
Wed Jul 28 13:09:56 2010 [VMM][I]: STDERR follows.
Wed Jul 28 13:09:56 2010 [VMM][I]: error: unable to connect to '/var/run/libvirt/libvirt-sock', libvirtd may need to be started: Permission denied
Wed Jul 28 13:09:56 2010 [VMM][I]: error: failed to connect to the hypervisor
Wed Jul 28 13:09:56 2010 [VMM][I]: ExitCode: 1
Wed Jul 28 13:09:56 2010 [VMM][E]: Error saving VM state, -
Wed Jul 28 13:09:56 2010 [LCM][I]: Fail to save VM state. Assuming that the VM is still RUNNING (will poll VM).
Wed Jul 28 13:09:56 2010 [VMM][I]: Command execution fail: virsh --connect qemu:///system dominfo one-585
Wed Jul 28 13:09:56 2010 [VMM][I]: STDERR follows.
Wed Jul 28 13:09:56 2010 [VMM][I]: error: unable to connect to '/var/run/libvirt/libvirt-sock', libvirtd may need to be started: Permission denied
Wed Jul 28 13:09:56 2010 [VMM][I]: error: failed to connect to the hypervisor
Wed Jul 28 13:09:56 2010 [VMM][I]: ExitCode: 1
Wed Jul 28 13:09:56 2010 [VMM][E]: Error monitoring VM, -
Wed Jul 28 13:10:24 2010 [VMM][I]: Command execution fail: virsh --connect qemu:///system dominfo one-585
Wed Jul 28 13:10:24 2010 [VMM][I]: STDERR follows.
Wed Jul 28 13:10:24 2010 [VMM][I]: error: unable to connect to '/var/run/libvirt/libvirt-sock', libvirtd may need to be started: Permission denied
Wed Jul 28 13:10:24 2010 [VMM][I]: error: failed to connect to the hypervisor
Wed Jul 28 13:10:24 2010 [VMM][I]: ExitCode: 1
Wed Jul 28 13:10:24 2010 [VMM][E]: Error monitoring VM, -
Wed Jul 28 13:10:45 2010 [DiM][I]: New VM state is DONE
Wed Jul 28 13:10:45 2010 [VMM][W]: Ignored: LOG - 585 Driver command for 585 cancelled
Wed Jul 28 13:10:45 2010 [TM][W]: Ignored: LOG - 585 tm_delete.sh: Deleting /var/lib/one/585/images
Wed Jul 28 13:10:45 2010 [TM][W]: Ignored: LOG - 585 tm_delete.sh: Executed "ssh node13-one rm -rf /var/lib/one/585/images".
Wed Jul 28 13:10:45 2010 [TM][W]: Ignored: TRANSFER SUCCESS 585 -
Wed Jul 28 13:10:45 2010 [VMM][W]: Ignored: LOG - 585 Command execution fail: virsh --connect qemu:///system destroy one-585
Wed Jul 28 13:10:45 2010 [VMM][W]: Ignored: LOG - 585 STDERR follows.
Wed Jul 28 13:10:45 2010 [VMM][W]: Ignored: LOG - 585 error: unable to connect to '/var/run/libvirt/libvirt-sock', libvirtd may need to be started: Permission denied
Wed Jul 28 13:10:45 2010 [VMM][W]: Ignored: LOG - 585 error: failed to connect to the hypervisor
Wed Jul 28 13:10:45 2010 [VMM][W]: Ignored: LOG - 585 ExitCode: 1
Wed Jul 28 13:10:45 2010 [VMM][W]: Ignored: CANCEL FAILURE 585 -
From: users-bounces at lists.opennebula.org On Behalf Of Floris Sluiter
Sent: Monday, 19 July 2010 18:18
To: 'Tino Vazquez'; DuDu
Cc: users at lists.opennebula.org
Subject: Re: [one-users] oned hang
Hi Dudu, Tino and all,
We have seen the exact same messages (Command execution fail and bad interpreter: Text file busy) on our cluster last week when we expanded it from 12 to 16 hosts (with add host) and deployed 10 virtual machines at the same time. We did not have multiple instances of OpenNebula running; we only added hosts to a running one, so it is unlikely that was the issue (the cluster had already been running stable for a while). We investigated and thought it was a timing issue between the monitoring (ssh) driver, set to 60 seconds, and having many hosts and many VMs.
We started using the ssh monitoring driver again after the latest update to OpenNebula; before that we used our in-house developed SNMP monitoring driver.
When we redeployed our SNMP driver the error messages stopped, and for the last week we have had a stable cloud again, now with 16 hosts...
For people who see the same timing issues as we did, the SNMP driver is available in the ecosystem (but make sure you know what SNMP is before you try ;-)): http://opennebula.org/software:ecosystem:snmp_im_driver
Regards,
Floris
HPC project leader
Sara
From: users-bounces at lists.opennebula.org On Behalf Of Tino Vazquez
Sent: Monday, 19 July 2010 16:15
To: DuDu
Cc: users at lists.opennebula.org
Subject: Re: [one-users] oned hang
Dear DuDu,
This happens when two monitoring actions take place at the same time.
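(The IM driver copies the probe script to the host and runs it in one shot, essentially the command you can see further down in your oned.log. Simplified, and with SCRIPT as a hypothetical stand-in for the hashed file name under /tmp/one-im/, it looks like the lines below; my guess is that when two polls overlap on the same file, one process still has the script open for writing while the other tries to execute it, which is what produces the "bad interpreter: Text file busy" error:)

  SCRIPT=/tmp/one-im/one_im-XXXX   # placeholder for the hashed script name
  mkdir -p /tmp/one-im
  cat > $SCRIPT                    # script body arrives on stdin over ssh
  chmod +x $SCRIPT
  $SCRIPT                          # exec fails with ETXTBSY ("Text file busy")
                                   # if another poll still has it open for writing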
First thing, which OpenNebula version are you using?
Are you per chance running two OpenNebula instances? Did you change the host polling time?
Regards,
-Tino
--
Constantino Vázquez Blanco | dsa-research.org/tinova
Virtualization Technology Engineer / Researcher
OpenNebula Toolkit | opennebula.org
On Wed, Jul 14, 2010 at 3:13 PM, DuDu <blackass at gmail.com> wrote:
Hi,
We deployed a small OpenNebula cluster with 8 hosts. It is a default OpenNebula installation; however, we found that after several days of running, oned hung. All CLI commands hang too, and no new logs are generated in one_xmlrpc.log. There are quite a few error messages like the following in oned.log:
[root at vm-container-31-0 logdir]# tail oned.log
Wed Jul 14 14:51:02 2010 [InM][I]: Warning: untrusted X11 forwarding setup failed: xauth key data not generated
Wed Jul 14 14:51:02 2010 [InM][I]: Warning: No xauth data; using fake authentication data for X11 forwarding.
Wed Jul 14 14:51:02 2010 [InM][I]: bash: /tmp/one-im//one_im-c4718299a313d89398ea693104dcce5f: /bin/sh: bad interpreter: Text file busy
Wed Jul 14 14:51:02 2010 [InM][I]: ExitCode: 126
Wed Jul 14 14:51:02 2010 [InM][I]: Command execution fail: 'mkdir -p /tmp/one-im/; cat > /tmp/one-im//one_im-f3817715aa24450225bafb4c19b23822; if [ "x$?" != "x0" ]; then exit -1; fi; chmod +x /tmp/one-im//one_im-f3817715aa24450225bafb4c19b23822; /tmp/one-im//one_im-f3817715aa24450225bafb4c19b23822'
Wed Jul 14 14:51:02 2010 [InM][I]: STDERR follows.
Wed Jul 14 14:51:02 2010 [InM][I]: Warning: untrusted X11 forwarding setup failed: xauth key data not generated
Wed Jul 14 14:51:02 2010 [InM][I]: Warning: No xauth data; using fake authentication data for X11 forwarding.
Wed Jul 14 14:51:02 2010 [InM][I]: bash: /tmp/one-im//one_im-f3817715aa24450225bafb4c19b23822: /bin/sh: bad interpreter: Text file busy
Wed Jul 14 14:51:02 2010 [InM][I]: ExitCode: 126
We have to SIGKILL oned and restart it, and that solves all problems.
Any idea of this?
Thanks!