[one-users] Recovering from hung KVM virtual machine

Steven Timm timm at fnal.gov
Thu Jun 7 07:15:46 PDT 2012


On Thu, 7 Jun 2012, Tino Vazquez wrote:

> Hi Steven,
>
> comments inline,
>
> On Tue, Jun 5, 2012 at 9:10 PM, Steven Timm <timm at fnal.gov> wrote:
>>
>> My production cloud is still running OpenNebula 2.0 with Sci. Linux
>> 5 VM hosts using KVM hypervisor.
>>
>> We have seen two fairly frequent failure modes:
>>
>> 1) The virtual machine hangs in such a way that it is pingable
>> and you get a login prompt on the VNC console, but once you try
>> to log in, nothing happens; usually you don't even get a password: prompt.
>> The only fix we have found is to log into the VM host and do virsh destroy;
>> then onevm restart will bring the VM back.  VMs that have hung
>> once this way tend to hang again and again.
>
> This issue is being addressed in the current release; a "onevm reset"
> operation will be available in ONE v3.6. Please see
> http://dev.opennebula.org/issues/1055

Yes, this would be very helpful.
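Until such an operation exists, the manual recovery described in (1) amounts to two commands: virsh destroy on the VM host, then onevm restart on the front end. A minimal Python sketch of that sequence follows. It assumes the libvirt domain is named "one-<id>" (check virsh list on your host), that the VM host is reachable over ssh, and Python 3.6 or later; the function names and the dry_run flag are hypothetical, not part of OpenNebula.

```python
import subprocess


def recovery_commands(vm_id, host):
    """Return the two commands of the manual procedure, in order."""
    # Assumption: the libvirt domain for an OpenNebula VM is "one-<id>".
    domain = f"one-{vm_id}"
    return [
        ["ssh", host, "virsh", "destroy", domain],  # hard-stop the hung guest
        ["onevm", "restart", str(vm_id)],           # let OpenNebula boot it again
    ]


def recover(vm_id, host, dry_run=True):
    """Print the commands (dry_run) or execute them, stopping on the first failure."""
    for cmd in recovery_commands(vm_id, host):
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)
```

Calling recover(42, "vmhost01.example.org") only prints the two commands; pass dry_run=False to actually run them.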
>
>>
>> 2) A onevm stop or onevm suspend command makes the VM attempt to
>> generate a checkpoint file, and the KVM process hangs in the middle
>> of this.  This leaves libvirt on the VM host unresponsive: any
>> virsh list fails.  You have to kill the parent
>> KVM process and sometimes restart libvirt, and then a onevm restart
>> will bring the VM back.
>
> Point 1) was about the guest OS freezing, but this is more about
> processes running on the physical server. IMHO this has more to do
> with sysadmin work than with end users' concerns, i.e., users shouldn't
> have to concern themselves with libvirt (it can potentially even disrupt
> other users' VMs).
>
> How frequent is this? I would be interested in replicating it, to find
> a workaround or filing a bug in libvirt.
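
For anyone trying to replicate or detect the state described in (2), one rough approach is to run virsh list under a timeout and treat a timeout as "libvirt is not answering". The sketch below assumes Python 3.5 or later on the VM host; the function name and the 10-second default are hypothetical choices, and a virsh process stuck in uninterruptible sleep may not be killable when the timeout fires.

```python
import subprocess


def responds_in_time(cmd=("virsh", "list"), timeout_s=10.0):
    """Return True if cmd exits with status 0 within timeout_s seconds."""
    try:
        result = subprocess.run(
            list(cmd),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False  # command hung: libvirt is likely unresponsive
    except FileNotFoundError:
        return False  # command is not installed on this machine
    return result.returncode == 0
```

A cron job or monitoring probe could call responds_in_time() on each VM host and alert when it returns False.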

I see this happen 2-3 times a month on a cloud that has been
averaging about 130 VMs.
We are in the process of migrating our cloud to SLF6 hosts, which run
much newer versions of libvirt and KVM; that might fix it on its own,
but I will probably file a bug against libvirt (and against KVM for the
other problem above, too).

A related question--for those running KVM hypervisors in their clouds,
are you using swap space or not?  One possible explanation for the
hung VMs in #1 above is that part of their memory gets swapped out
and the system can't get it back in time.
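
One way to test the swap hypothesis is to check how much of each KVM process's memory is currently swapped out. A rough sketch, assuming the host kernel reports a VmSwap field in /proc/<pid>/status (newer kernels do; the Sci. Linux 5 kernel may not), that you already know the PID of the KVM process, and Python 3.6 or later. The function names are hypothetical.

```python
def swapped_kib(status_text):
    """Return the VmSwap value in kB from /proc/<pid>/status text, or None."""
    for line in status_text.splitlines():
        if line.startswith("VmSwap:"):
            # Line format: "VmSwap:     512 kB"
            return int(line.split()[1])
    return None  # field not reported by this kernel


def process_swap_kib(pid):
    """Read /proc/<pid>/status and return the swapped-out size in kB, or None."""
    with open(f"/proc/{pid}/status") as f:
        return swapped_kib(f.read())
```

A nonzero value for the KVM process of a VM that later hangs would be consistent with the swap explanation, though it would not prove it.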

>
> Kind regards,
>
> -Tino
>
>>
>> How this relates to OpenNebula--it does not seem that there
>> is any way using onevm commands to recover from either of these states.
>> In both cases, onevm stop and onevm shutdown will
>> fail.  It would be nice to give the user
>> a way to recover his/her own VM without the operator intervention
>> that is currently required.
>>
>> Steve
>>
>>
>> ------------------------------------------------------------------
>> Steven C. Timm, Ph.D  (630) 840-8525
>> timm at fnal.gov  http://home.fnal.gov/~timm/
>> Fermilab Computing Division, Scientific Computing Facilities,
>> Grid Facilities Department, FermiGrid Services Group, Group Leader.
>> Lead of FermiCloud project.
>> _______________________________________________
>> Users mailing list
>> Users at lists.opennebula.org
>> http://lists.opennebula.org/listinfo.cgi/users-opennebula.org
>
> --
> Constantino Vázquez Blanco, MSc
> Project Engineer
> OpenNebula - The Open-Source Solution for Data Center Virtualization
> www.OpenNebula.org | @tinova79 | @OpenNebula
>

------------------------------------------------------------------
Steven C. Timm, Ph.D  (630) 840-8525
timm at fnal.gov  http://home.fnal.gov/~timm/
Fermilab Computing Division, Scientific Computing Facilities,
Grid Facilities Department, FermiGrid Services Group, Group Leader.
Lead of FermiCloud project.

