[one-users] Recovering from hung KVM virtual machine

Steven Timm timm at fnal.gov
Tue Jun 5 12:10:56 PDT 2012


My production cloud is still running OpenNebula 2.0, with Scientific
Linux 5 VM hosts using the KVM hypervisor.

We have seen two fairly frequent failure modes:

1) The virtual machine hangs in such a way that it is still pingable
and shows a login prompt on the VNC console, but when you try to log
in nothing happens; usually you don't even get a password: prompt.
The only fix we have found is to log into the VM host and run
virsh destroy, after which onevm restart brings the VM back.  VMs that
have hung this way once tend to hang again and again.

2) A onevm stop or onevm suspend command makes the VM attempt to
generate a checkpoint file, and the KVM process hangs in the middle
of this.  This leaves libvirt on the VM host unresponsive: any
virsh list fails.  You have to kill the parent KVM process and
sometimes restart libvirt, and then a onevm restart will bring
the VM back.
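
The two manual recovery sequences above can be written down as a rough
sketch. This is only an illustration, not something OpenNebula provides.
The function names, the use of kill -9, the "service libvirtd restart"
command, and the timeout are all assumptions; only virsh destroy,
virsh list, and onevm restart come from the description above. The
sketch builds the command lists without running them, since the host
commands and the onevm command may need to run on different machines.

```python
import subprocess


def recovery_plan(mode, domain, vm_id, kvm_pid=None, restart_libvirt=False):
    """Return the manual recovery steps for one failure mode as argv lists.

    mode is a hypothetical label: "guest_hung" for case 1 and
    "checkpoint_hung" for case 2. Nothing is executed here.
    """
    if mode == "guest_hung":
        # Case 1: libvirt still answers, so force the domain off via virsh.
        steps = [["virsh", "destroy", domain]]
    elif mode == "checkpoint_hung":
        # Case 2: libvirt is unresponsive, so kill the KVM process directly.
        if kvm_pid is None:
            raise ValueError("checkpoint_hung needs the PID of the KVM process")
        # Assumption: SIGKILL; the original text only says "kill".
        steps = [["kill", "-9", str(kvm_pid)]]
        if restart_libvirt:
            # Assumption: init-script name; adjust for the host distribution.
            steps.append(["service", "libvirtd", "restart"])
    else:
        raise ValueError("unknown mode: %r" % (mode,))
    # Both cases finish with the same OpenNebula command.
    steps.append(["onevm", "restart", str(vm_id)])
    return steps


def responds_within(argv, timeout_s):
    """Return True if the command exits with status 0 before the timeout.

    A hung or missing command counts as "not responding".
    """
    try:
        result = subprocess.run(
            argv,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except (subprocess.TimeoutExpired, OSError):
        return False
    return result.returncode == 0
```

For a VM that is already known to be stuck, something like
responds_within(["virsh", "list"], 10) on the VM host gives a hint as to
which case applies: if it returns False, libvirt itself is probably
wedged and the second sequence is the one to try. The domain name and
VM id passed to recovery_plan are whatever the site uses; the sketch
does not look them up.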

How this relates to OpenNebula: there does not seem to be any way
to recover from either of these states using onevm commands.
In both cases, onevm stop and onevm shutdown fail.  It would be
nice to give users a way to recover their own VMs without the
operator intervention that is currently required.

Steve


------------------------------------------------------------------
Steven C. Timm, Ph.D  (630) 840-8525
timm at fnal.gov  http://home.fnal.gov/~timm/
Fermilab Computing Division, Scientific Computing Facilities,
Grid Facilities Department, FermiGrid Services Group, Group Leader.
Lead of FermiCloud project.


