[one-users] Recovering from hung KVM virtual machine

Tino Vazquez tinova at opennebula.org
Thu Jun 7 05:30:33 PDT 2012


Hi Steven,

comments inline,

On Tue, Jun 5, 2012 at 9:10 PM, Steven Timm <timm at fnal.gov> wrote:
>
> My production cloud is still running OpenNebula 2.0 with Sci. Linux
> 5 VM hosts using KVM hypervisor.
>
> We have seen two fairly frequent failure modes:
>
> 1) The virtual machine gets hung in such a way that it is pingable
> and you get a login prompt on the VNC console, but once you try
> to log in, nothing happens, you don't even get a password: prompt usually.
> The only thing we have found is to log into the VM host, do virsh destroy,
> and then onevm restart will bring the VM back.  VMs that have hung
> once this way tend to hang again and again.

This issue is being addressed in the current release cycle; a "onevm reset"
operation will be available in ONE v3.6. Please see
http://dev.opennebula.org/issues/1055
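
Until then, the manual procedure described in point 1 can at least be
scripted for operators. Below is a minimal sketch in Python, not an
official OpenNebula tool. It assumes that the libvirt domain is named
one-<VM ID>, that the front-end can reach the VM host over passwordless
ssh, and that onevm is on the PATH. The function names and the "run"
parameter are hypothetical.

```python
import subprocess


def recovery_commands(vm_id, host):
    """Return the two manual recovery commands, in the order they are run."""
    # Assumption: OpenNebula names the libvirt domain "one-<VM ID>".
    domain = "one-{}".format(vm_id)
    return [
        # Step 1: hard-stop the hung guest on the VM host (assumes ssh access).
        ["ssh", host, "virsh", "destroy", domain],
        # Step 2: ask OpenNebula to bring the VM back.
        ["onevm", "restart", str(vm_id)],
    ]


def recover(vm_id, host, run=subprocess.check_call):
    """Run both commands; check_call raises if either one fails."""
    for cmd in recovery_commands(vm_id, host):
        run(cmd)
```

Calling recover(42, "kvmhost01") would run both steps for VM 42 on a host
called kvmhost01 (both values are placeholders). Passing a different
"run" callable, such as print, shows the commands without executing them.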

>
> 2) A onevm stop or onevm suspend command makes the VM attempt to
> generate a checkpoint file, and the KVM process hangs in the middle
> of this.  This leaves libvirt on the VM host in a state where it is
> unresponsive: any virsh list fails. You have to kill the parent
> KVM process and sometimes restart libvirt, and then a onevm restart
> will bring the VM back.

Point 1) was about the guest OS freezing, but this one is more about
processes running on the physical server. IMHO this is a matter for
sysadmins rather than end users, i.e., users shouldn't have to
concern themselves with libvirt (doing so could even disrupt
other users' VMs).

How frequent is this? I would be interested in replicating it, to find
a workaround or file a bug in libvirt.
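
To get a number for how often it happens, one option would be a small
probe run periodically on each VM host. This is only a sketch: it treats
libvirt as unresponsive when "virsh list" does not return successfully
within a timeout. The 10-second default and the function name are
assumptions, not something libvirt or OpenNebula prescribes.

```python
import subprocess


def libvirt_responsive(timeout=10, cmd=("virsh", "list")):
    """Return True if cmd exits with status 0 within timeout seconds."""
    try:
        result = subprocess.run(
            list(cmd),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        # The virsh client hung; subprocess.run has already killed it.
        # This only detects the problem, it does not repair libvirtd.
        return False
    except OSError:
        # virsh is missing or could not be started.
        return False
    return result.returncode == 0
```

Logging the result with a timestamp each time the probe runs would show
whether the hangs line up with onevm stop or onevm suspend operations.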

Kind regards,

-Tino

>
> How this relates to OpenNebula--it does not seem that there
> is any way using onevm commands to recover from either of these states.
> In the case of both of these, onevm stop and onevm shutdown will
> fail.  It would be nice to let the user
> have a way to recover his/her own VM without operator intervention
> as is currently required.
>
> Steve
>
>
> ------------------------------------------------------------------
> Steven C. Timm, Ph.D  (630) 840-8525
> timm at fnal.gov  http://home.fnal.gov/~timm/
> Fermilab Computing Division, Scientific Computing Facilities,
> Grid Facilities Department, FermiGrid Services Group, Group Leader.
> Lead of FermiCloud project.
> _______________________________________________
> Users mailing list
> Users at lists.opennebula.org
> http://lists.opennebula.org/listinfo.cgi/users-opennebula.org

--
Constantino Vázquez Blanco, MSc
Project Engineer
OpenNebula - The Open-Source Solution for Data Center Virtualization
www.OpenNebula.org | @tinova79 | @OpenNebula
