[one-users] Monitoring/Deployment issues: Virsh fails often

Tino Vazquez tinova at fdi.ucm.es
Sun Aug 1 12:11:19 PDT 2010


Hi Floris,

It is a concurrency issue indeed, you are right. OpenNebula assumes
exclusive use of the physical resources. Furthermore, it uses a single
Unix account, oneadmin, to perform all OpenNebula users' operations, so
situations like the one you describe should not arise if everything is
set up and used correctly. The two configuration variables we introduced,
together with the readonly flag that we will evaluate (thanks again), are
the means to deal with this concurrency issue.

Regards,

-Tino

--
Constantino Vázquez Blanco | dsa-research.org/tinova
Virtualization Technology Engineer / Researcher
OpenNebula Toolkit | opennebula.org



On Fri, Jul 30, 2010 at 6:44 PM, Floris Sluiter <Floris.Sluiter at sara.nl> wrote:
> Hi Tino,
>
> I do not think that spacing alone will solve it. Maybe a solution would be to detect whether virsh is locked, then wait and retry until the lock is freed (for example with a mutex mechanism)? Deploying to a host where another user is busy with a VM will probably still result in a failure, as will multiple users deploying/deleting on the same host. It is not a scheduling issue, it is a concurrency issue...
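>
> A minimal sketch of that idea, assuming every virsh invocation on a host is
> routed through a small wrapper (the file name and helper below are made up
> for illustration; this is not existing OpenNebula code):
>
>     # virsh_lock.rb -- hypothetical wrapper, not part of OpenNebula
>     LOCK_FILE = "/var/lock/one-virsh.lock"
>
>     def run_virsh(args)
>       File.open(LOCK_FILE, File::RDWR | File::CREAT, 0644) do |f|
>         f.flock(File::LOCK_EX)   # queue up instead of failing on a busy socket
>         system("virsh --connect qemu:///system #{args}")
>       end                        # lock is released when the file is closed
>     end
>
>     run_virsh("dominfo one-428")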
>
> Kind regards,
>
> Floris
>
>
>
> -----Original Message-----
> From: tinova79 at gmail.com [mailto:tinova79 at gmail.com] On Behalf Of Tino Vazquez
> Sent: Friday 30 July 2010 18:07
> To: Floris Sluiter
> Cc: users at lists.opennebula.org
> Subject: Re: Monitoring/Deployment issues: Virsh fails often
>
> Hi Floris,
>
> We are not really proposing that you decrease the VM polling time;
> that's just a new default in the new version, and it can be changed at
> will.
>
> Furthermore, I think we are mixing two issues here:
>
> Monitoring Issue
> ------------------------
>
> OpenNebula performs a dominfo request to extract VM information. The
> way it does this now blocks the libvirt socket, which causes issues
> with other operations being performed at the same time. This can be
> solved by making the request readonly, as you proposed. Another
> possible solution, perfectly aligned with and contemplated in the
> OpenNebula design, is to develop new probes for the Information
> Manager, as other large deployments do, that use SNMP, Ganglia,
> Nagios or similar tools inside the VMs and/or the physical hosts to
> avoid saturating a particular hypervisor (as happens in this case).
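>
> As a purely illustrative sketch of the probe idea: Information Manager
> probes only have to print ATTRIBUTE=VALUE pairs on stdout, so anything
> that can gather the numbers without touching the libvirt socket will do
> (the attribute names below are examples, and /proc stands in for SNMP,
> Ganglia or Nagios):
>
>     #!/usr/bin/env ruby
>     # Hypothetical IM probe: report host memory without calling virsh.
>     meminfo  = File.read("/proc/meminfo")
>     total_kb = meminfo[/MemTotal:\s+(\d+)/, 1].to_i
>     free_kb  = meminfo[/MemFree:\s+(\d+)/, 1].to_i
>     puts "TOTALMEMORY=#{total_kb / 1024}"   # MB, printed as KEY=VALUE pairs
>     puts "FREEMEMORY=#{free_kb / 1024}"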
>
> Deployment Issue
> ---------------------------
>
> OpenNebula performs a domain create operation, which also blocks the
> socket. This causes essentially the same behavior as the monitoring
> issue, but it cannot be solved with the readonly flag, since the
> operation is not possible in readonly mode. What OpenNebula provides
> here is a means to circumvent the limitations shown by libvirt by
> spacing the domain create operations.
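>
> ("Spacing" here simply means not firing several domain create operations
> at a host back to back. As a rough illustration of the same effect --
> hypothetical, since in 2.0 the built-in limits take care of this on the
> server side, and the deployment file names are only placeholders:)
>
>     # sketch: create several domains on one host sequentially, with a gap,
>     # instead of issuing all the 'virsh create' calls at once
>     %w[deployment.0 deployment.1 deployment.2].each do |dfile|
>       system("virsh --connect qemu:///system create #{dfile}")
>       sleep 5   # give libvirt room so consecutive creates do not collide
>     end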
>
>
> Summarizing, libvirt doesn't like more than one simultaneous operation
> on VMs, since each one blocks the socket. The monitoring issue can be
> solved with the readonly flag or by creating new monitoring probes,
> and the VM polling frequency can then be raised at will if you feel
> that the default frequency doesn't meet your infrastructure's needs.
> The deployment issue cannot be solved by any SNMP-like mechanism and
> needs to be handled with care; we know that the current OpenNebula
> approach works for large deployments simply by spacing the deployments.
>
> Best regards,
>
> -Tino
>
> --
> Constantino Vázquez Blanco | dsa-research.org/tinova
> Virtualization Technology Engineer / Researcher
> OpenNebula Toolkit | opennebula.org
>
>
>
> On Fri, Jul 30, 2010 at 12:49 PM, Floris Sluiter <Floris.Sluiter at sara.nl> wrote:
>> Hi Tino and List,
>>
>> For us these methods are not acceptable in a production environment. We feel that we need to monitor the status of the cloud, both the hosts and the VMs, at least once every minute for each component. Stopping or drastically reducing the monitoring of VMs is not the way to go for us. If the method of monitoring causes the cloud to fail, then the method needs changing, not the frequency of it...
>>
>> I'll see what we can come up with to improve on this. We already donated the SNMP driver for the hosts; maybe something similar can be done for the VMs (we are testing the read-only method for virsh).
>>
>> Kind regards,
>>
>> Floris
>>
>>
>> -----Original Message-----
>> From: tinova79 at gmail.com [mailto:tinova79 at gmail.com] On Behalf Of Tino Vazquez
>> Sent: Thursday 29 July 2010 19:29
>> To: Floris Sluiter
>> Cc: DuDu; users at lists.opennebula.org
>> Subject: Re: Monitoring/Deployment issues: Virsh fails often
>>
>> Dear Floris,
>>
>> We noticed this behavior in the scalability tests we put OpenNebula
>> through; a ticket was opened about it for 2.0 [1]. It happens with
>> libvirt and also with the xen hypervisor. It is by no means an
>> OpenNebula scalability issue, since the behavior libvirt exposes is
>> not the intended one.
>>
>> Notwithstanding, in the 2.0 version we introduced means to avoid this
>> unpredictable behavior. We have limited the number of simultaneous
>> deployments to the same host to one, to avoid the blocked-socket issue.
>> This can be configured through the scheduler configuration file; more
>> information can be found in [2]. We have also introduced a limit on
>> simultaneous VM polling requests, for the same purpose.
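>>
>> (For reference, in later OpenNebula releases these knobs appear in the
>> scheduler configuration as MAX_VM, MAX_DISPATCH and MAX_HOST, the last
>> being the per-host dispatch limit mentioned above; the exact names and
>> location for 2.0 are in [2]. Illustrative values only:)
>>
>>     MAX_VM       = 300   # VMs considered in each scheduling action
>>     MAX_DISPATCH = 30    # VMs dispatched in each scheduling action
>>     MAX_HOST     = 1     # VMs dispatched to any single host per action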
>>
>> With the changes detailed above, OpenNebula is able to deploy tens of
>> thousands of virtual machines on a pool of hundreds of physical
>> servers at the same time.
>>
>> The readonly method to perform the polling is a very neat and
>> interesting proposal; we will evaluate its inclusion in the 2.0
>> version. Thanks a lot for this valuable feedback.
>>
>> Best regards,
>>
>> -Tino
>>
>> [1] http://dev.opennebula.org/issues/261
>> [2] http://www.opennebula.org/documentation:rel2.0:schg
>>
>> --
>> Constantino Vázquez Blanco | dsa-research.org/tinova
>> Virtualization Technology Engineer / Researcher
>> OpenNebula Toolkit | opennebula.org
>>
>>
>>
>> On Thu, Jul 29, 2010 at 6:28 PM, Floris Sluiter <Floris.Sluiter at sara.nl> wrote:
>>> Hi,
>>>
>>>
>>>
>>> We again had issues when many VMs were deployed on many hosts at the same
>>> time and we were deploying more (log excerpts below).
>>>
>>> Over the last two weeks we saw more than 25 runaway VMs left running that
>>> OpenNebula had marked as DONE; deploy, copy and stop also failed randomly
>>> quite often.
>>>
>>>
>>>
>>> It is becoming a major problem when we can't run OpenNebula in a stable
>>> and predictable manner on larger clouds.
>>>
>>> We have the following intervals configured; we feel we need to monitor
>>> more often than every 10 minutes.
>>>
>>> HOST_MONITORING_INTERVAL = 20
>>>
>>> VM_POLLING_INTERVAL      = 30
>>>
>>> So we first used our SNMP driver again, which solved a large part of the
>>> problems, but our cloud is still growing, so we have reached the next limit.
>>>
>>>
>>>
>>> What seems to be happening is that the "virsh --connect qemu:///system
>>> dominfo" command interferes with other virsh commands. Virsh locks
>>> libvirt-sock, so multiple processes cannot connect at the same time.
>>>
>>> The solution we are now trying is to do the monitoring of VMs in read-only
>>> mode:
>>> "virsh -readonly --connect qemu:///system dominfo"
>>>
>>> which we added/changed in the file /usr/lib/one/mads/one_vmm_kvm.rb.
>>>
>>> Now virsh doesn't lock libvirt-sock, as far as we can see.
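>>>
>>> (For the record, a rough sketch of what that change amounts to -- the
>>> method below is illustrative and not a verbatim excerpt of
>>> one_vmm_kvm.rb; virsh documents the flag as --readonly / -r:)
>>>
>>>     # illustrative only: the poll command gains a read-only connection,
>>>     # so dominfo no longer competes with deploy/save/destroy for the socket
>>>     def poll_cmd(deploy_id)
>>>       "virsh --readonly --connect qemu:///system dominfo #{deploy_id}"
>>>     end
>>>
>>>     system(poll_cmd("one-428"))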
>>>
>>>
>>>
>>> Currently we do not see the error messages we had before, but some kind of
>>> robust, scalable and fail-safe monitoring solution for OpenNebula is
>>> needed.
>>>
>>>
>>>
>>> Hope this helps
>>>
>>> Kind regards,
>>>
>>>
>>>
>>> Floris
>>>
>>>
>>>
>>>
>>>
>>> Thu Jul 22 16:16:21 2010 [VMM][I]: Command execution fail: virsh --connect
>>> qemu:///system dominfo one-428
>>>
>>> Thu Jul 22 16:16:21 2010 [VMM][I]: STDERR follows.
>>>
>>> Thu Jul 22 16:16:21 2010 [VMM][I]: error: unable to connect to
>>> '/var/run/libvirt/libvirt-sock', libvirtd may need to be started: Permission
>>> denied
>>>
>>> Thu Jul 22 16:16:21 2010 [VMM][I]: error: failed to connect to the
>>> hypervisor
>>>
>>> Thu Jul 22 16:16:21 2010 [VMM][I]: ExitCode: 1
>>>
>>> Thu Jul 22 16:16:21 2010 [VMM][E]: Error monitoring VM, -
>>>
>>>
>>>
>>> And sometimes destroy would fail:
>>>
>>> Wed Jul 28 13:05:34 2010 [LCM][I]: New VM state is SAVE_STOP
>>>
>>> Wed Jul 28 13:05:34 2010 [VMM][I]: Command execution fail: 'touch
>>> /var/lib/one/585/images/checkpoint;virsh --connect qemu:///system save
>>> one-585 /var/lib/one/585/images/checkpoint'
>>>
>>> Wed Jul 28 13:05:34 2010 [VMM][I]: STDERR follows.
>>>
>>> Wed Jul 28 13:05:34 2010 [VMM][I]: error: unable to connect to
>>> '/var/run/libvirt/libvirt-sock', libvirtd may need to be started: Permission
>>> denied
>>>
>>> Wed Jul 28 13:05:34 2010 [VMM][I]: error: failed to connect to the
>>> hypervisor
>>>
>>> Wed Jul 28 13:05:34 2010 [VMM][I]: ExitCode: 1
>>>
>>> Wed Jul 28 13:05:34 2010 [VMM][E]: Error saving VM state, -
>>>
>>> Wed Jul 28 13:05:35 2010 [LCM][I]: Fail to save VM state. Assuming that the
>>> VM is still RUNNING (will poll VM).
>>>
>>> Wed Jul 28 13:05:38 2010 [VMM][I]: Command execution fail: virsh --connect
>>> qemu:///system dominfo one-585
>>>
>>> Wed Jul 28 13:05:38 2010 [VMM][I]: STDERR follows.
>>>
>>> Wed Jul 28 13:05:38 2010 [VMM][I]: error: unable to connect to
>>> '/var/run/libvirt/libvirt-sock', libvirtd may need to be started: Permission
>>> denied
>>>
>>> Wed Jul 28 13:05:38 2010 [VMM][I]: error: failed to connect to the
>>> hypervisor
>>>
>>> Wed Jul 28 13:05:38 2010 [VMM][I]: ExitCode: 1
>>>
>>> Wed Jul 28 13:05:38 2010 [VMM][E]: Error monitoring VM, -
>>>
>>> ... trying like 10 times ...
>>>
>>> Wed Jul 28 13:09:14 2010 [VMM][E]: Error monitoring VM, -
>>>
>>> Wed Jul 28 13:09:56 2010 [LCM][I]: New VM state is SAVE_STOP
>>>
>>> Wed Jul 28 13:09:56 2010 [VMM][I]: Command execution fail: 'touch
>>> /var/lib/one/585/images/checkpoint;virsh --connect qemu:///system save
>>> one-585 /var/lib/one/585/images/checkpoint'
>>>
>>> Wed Jul 28 13:09:56 2010 [VMM][I]: STDERR follows.
>>>
>>> Wed Jul 28 13:09:56 2010 [VMM][I]: error: unable to connect to
>>> '/var/run/libvirt/libvirt-sock', libvirtd may need to be started: Permission
>>> denied
>>>
>>> Wed Jul 28 13:09:56 2010 [VMM][I]: error: failed to connect to the
>>> hypervisor
>>>
>>> Wed Jul 28 13:09:56 2010 [VMM][I]: ExitCode: 1
>>>
>>> Wed Jul 28 13:09:56 2010 [VMM][E]: Error saving VM state, -
>>>
>>> Wed Jul 28 13:09:56 2010 [LCM][I]: Fail to save VM state. Assuming that the
>>> VM is still RUNNING (will poll VM).
>>>
>>> Wed Jul 28 13:09:56 2010 [VMM][I]: Command execution fail: virsh --connect
>>> qemu:///system dominfo one-585
>>>
>>> Wed Jul 28 13:09:56 2010 [VMM][I]: STDERR follows.
>>>
>>> Wed Jul 28 13:09:56 2010 [VMM][I]: error: unable to connect to
>>> '/var/run/libvirt/libvirt-sock', libvirtd may need to be started: Permission
>>> denied
>>>
>>> Wed Jul 28 13:09:56 2010 [VMM][I]: error: failed to connect to the
>>> hypervisor
>>>
>>> Wed Jul 28 13:09:56 2010 [VMM][I]: ExitCode: 1
>>>
>>> Wed Jul 28 13:09:56 2010 [VMM][E]: Error monitoring VM, -
>>>
>>> Wed Jul 28 13:10:24 2010 [VMM][I]: Command execution fail: virsh --connect
>>> qemu:///system dominfo one-585
>>>
>>> Wed Jul 28 13:10:24 2010 [VMM][I]: STDERR follows.
>>>
>>> Wed Jul 28 13:10:24 2010 [VMM][I]: error: unable to connect to
>>> '/var/run/libvirt/libvirt-sock', libvirtd may need to be started: Permission
>>> denied
>>>
>>> Wed Jul 28 13:10:24 2010 [VMM][I]: error: failed to connect to the
>>> hypervisor
>>>
>>> Wed Jul 28 13:10:24 2010 [VMM][I]: ExitCode: 1
>>>
>>> Wed Jul 28 13:10:24 2010 [VMM][E]: Error monitoring VM, -
>>>
>>> Wed Jul 28 13:10:45 2010 [DiM][I]: New VM state is DONE
>>>
>>> Wed Jul 28 13:10:45 2010 [VMM][W]: Ignored: LOG - 585 Driver command for 585
>>> cancelled
>>>
>>> Wed Jul 28 13:10:45 2010 [TM][W]: Ignored: LOG - 585 tm_delete.sh: Deleting
>>> /var/lib/one/585/images
>>>
>>> Wed Jul 28 13:10:45 2010 [TM][W]: Ignored: LOG - 585 tm_delete.sh: Executed
>>> "ssh node13-one rm -rf /var/lib/one/585/images".
>>>
>>> Wed Jul 28 13:10:45 2010 [TM][W]: Ignored: TRANSFER SUCCESS 585 -
>>>
>>> Wed Jul 28 13:10:45 2010 [VMM][W]: Ignored: LOG - 585 Command execution
>>> fail: virsh --connect qemu:///system destroy one-585
>>>
>>> Wed Jul 28 13:10:45 2010 [VMM][W]: Ignored: LOG - 585 STDERR follows.
>>>
>>> Wed Jul 28 13:10:45 2010 [VMM][W]: Ignored: LOG - 585 error: unable to
>>> connect to '/var/run/libvirt/libvirt-sock', libvirtd may need to be started:
>>> Permission denied
>>>
>>> Wed Jul 28 13:10:45 2010 [VMM][W]: Ignored: LOG - 585 error: failed to
>>> connect to the hypervisor
>>>
>>> Wed Jul 28 13:10:45 2010 [VMM][W]: Ignored: LOG - 585 ExitCode: 1
>>>
>>> Wed Jul 28 13:10:45 2010 [VMM][W]: Ignored: CANCEL FAILURE 585 -
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> From: users-bounces at lists.opennebula.org
>>> [mailto:users-bounces at lists.opennebula.org] On Behalf Of Floris Sluiter
>>> Sent: Monday 19 July 2010 18:18
>>> To: 'Tino Vazquez'; DuDu
>>> Cc: users at lists.opennebula.org
>>> Subject: Re: [one-users] oned hang
>>>
>>>
>>>
>>> Hi Dudu, Tino and all,
>>>
>>>
>>>
>>> We saw the exact same messages (Command execution fail and bad
>>> interpreter: Text file busy) on our cluster last week when we expanded it
>>> from 12 to 16 hosts (with add host) while deploying 10 virtual machines at
>>> the same time. We did not have multiple instances of OpenNebula running, we
>>> only added hosts to a running one, so it is unlikely that was the issue (the
>>> cluster had already been running stably for a while). We investigated and
>>> thought it was a timing issue, with the monitoring (ssh) driver set to 60
>>> seconds and many hosts and many VMs.
>>>
>>> We started using the ssh monitoring driver again after the latest update
>>> to OpenNebula; before that we used our in-house developed SNMP monitoring
>>> driver.
>>>
>>> When we deployed our SNMP driver, the error messages stopped, and for the
>>> last week we have had a stable cloud again, now with 16 hosts.
>>>
>>> For people who see the same timing issues as we did, the snmp_driver
>>> is available in the ecosystem (but make sure you know what SNMP is before
>>> you try ;-)): http://opennebula.org/software:ecosystem:snmp_im_driver
>>>
>>> Regards,
>>>
>>>
>>>
>>> Floris
>>>
>>> HPC project leader
>>>
>>> Sara
>>>
>>>
>>>
>>>
>>>
>>> From: users-bounces at lists.opennebula.org
>>> [mailto:users-bounces at lists.opennebula.org] On Behalf Of Tino Vazquez
>>> Sent: Monday 19 July 2010 16:15
>>> To: DuDu
>>> Cc: users at lists.opennebula.org
>>> Subject: Re: [one-users] oned hang
>>>
>>>
>>>
>>> Dear DuDu,
>>>
>>>
>>>
>>> This happens when two monitoring actions take place at the same time.
>>>
>>>
>>>
>>> First thing, which OpenNebula version are you using?
>>>
>>>
>>>
>>> Are you by any chance running two OpenNebula instances? Did you change the
>>> host polling time?
>>>
>>>
>>>
>>> Regards,
>>>
>>>
>>>
>>> -Tino
>>>
>>> --
>>> Constantino Vázquez Blanco | dsa-research.org/tinova
>>> Virtualization Technology Engineer / Researcher
>>> OpenNebula Toolkit | opennebula.org
>>>
>>> On Wed, Jul 14, 2010 at 3:13 PM, DuDu <blackass at gmail.com> wrote:
>>>
>>>
>>>
>>> Hi,
>>>
>>>
>>>
>>> We deployed a small OpenNebula cluster with 8 hosts. It is the default
>>> OpenNebula installation; however, we found that after several days of
>>> running, oned hung. All CLI commands hung too. No new logs were generated
>>> in one_xmlrpc.log, and there were quite a few error messages like the
>>> following in oned.log:
>>>
>>>
>>>
>>> [root at vm-container-31-0 logdir]# tail oned.log
>>> Wed Jul 14 14:51:02 2010 [InM][I]: Warning: untrusted X11 forwarding setup
>>> failed: xauth key data not generated
>>> Wed Jul 14 14:51:02 2010 [InM][I]: Warning: No xauth data; using fake
>>> authentication data for X11 forwarding.
>>> Wed Jul 14 14:51:02 2010 [InM][I]: bash:
>>> /tmp/one-im//one_im-c4718299a313d89398ea693104dcce5f: /bin/sh: bad
>>> interpreter: Text file busy
>>> Wed Jul 14 14:51:02 2010 [InM][I]: ExitCode: 126
>>> Wed Jul 14 14:51:02 2010 [InM][I]: Command execution fail: 'mkdir -p
>>> /tmp/one-im/; cat > /tmp/one-im//one_im-f3817715aa24450225bafb4c19b23822; if
>>> [ "x$?" != "x0" ]; then exit -1; fi; chmod +x
>>> /tmp/one-im//one_im-f3817715aa24450225bafb4c19b23822;
>>> /tmp/one-im//one_im-f3817715aa24450225bafb4c19b23822'
>>> Wed Jul 14 14:51:02 2010 [InM][I]: STDERR follows.
>>> Wed Jul 14 14:51:02 2010 [InM][I]: Warning: untrusted X11 forwarding setup
>>> failed: xauth key data not generated
>>> Wed Jul 14 14:51:02 2010 [InM][I]: Warning: No xauth data; using fake
>>> authentication data for X11 forwarding.
>>> Wed Jul 14 14:51:02 2010 [InM][I]: bash:
>>> /tmp/one-im//one_im-f3817715aa24450225bafb4c19b23822: /bin/sh: bad
>>> interpreter: Text file busy
>>> Wed Jul 14 14:51:02 2010 [InM][I]: ExitCode: 126
>>>
>>>
>>>
>>> We had to SIGKILL oned and restart it, and that solved all the problems.
>>>
>>>
>>>
>>> Any idea what could be causing this?
>>>
>>>
>>>
>>> Thanks!
>>>
>>> _______________________________________________
>>> Users mailing list
>>> Users at lists.opennebula.org
>>> http://lists.opennebula.org/listinfo.cgi/users-opennebula.org
>>>
>>>
>>
>

