[one-users] VMs freezing after livemigrating

Harder, Stefan Stefan.Harder at fokus.fraunhofer.de
Wed Mar 24 06:55:33 PDT 2010


Hi Javier,

thanks for your answer.

On the node we live-migrate the VM to, virsh reports the state as "running", and on the old node the VM disappears. No logs inside the VM show any unusual behavior.

If we suspend the VM via OpenNebula, it goes into the susp state, but the log shows an error:

*****BEGIN*****
Wed Mar 24 14:46:30 2010 [DiM][D]: Suspending VM 111
Wed Mar 24 14:46:30 2010 [VMM][D]: Message received: LOG - 111 Command execution fail: 'touch /srv/cloud/one/var/111/images/checkpoint;virsh --connect qemu:///system save one-111 /srv/cloud/one/var/111/images/checkpoint'

Wed Mar 24 14:46:30 2010 [VMM][D]: Message received: LOG - 111 STDERR follows.

Wed Mar 24 14:46:30 2010 [VMM][D]: Message received: LOG - 111 /usr/lib/ruby/1.8/open3.rb:67: warning: Insecure world writable dir /srv/cloud/one in PATH, mode 040777

Wed Mar 24 14:46:30 2010 [VMM][D]: Message received: LOG - 111 Connecting to uri: qemu:///system

Wed Mar 24 14:46:30 2010 [VMM][D]: Message received: LOG - 111 error: Failed to save domain one-111 to /srv/cloud/one/var/111/images/checkpoint

Wed Mar 24 14:46:30 2010 [VMM][D]: Message received: LOG - 111 error: operation failed: failed to create '/srv/cloud/one/var/111/images/checkpoint'

Wed Mar 24 14:46:30 2010 [VMM][D]: Message received: LOG - 111 ExitCode: 1

Wed Mar 24 14:46:30 2010 [VMM][D]: Message received: SAVE FAILURE 111 -
*****END*****




If we then try to resume the VM, the state changes to fail and the log shows:

*****BEGIN*****
Wed Mar 24 14:49:43 2010 [DiM][D]: Resuming VM 111
Wed Mar 24 14:49:44 2010 [VMM][D]: Message received: LOG - 111 Command execution fail: virsh --connect qemu:///system restore /srv/cloud/one/var/111/images/checkpoint

Wed Mar 24 14:49:44 2010 [VMM][D]: Message received: LOG - 111 STDERR follows.

Wed Mar 24 14:49:44 2010 [VMM][D]: Message received: LOG - 111 /usr/lib/ruby/1.8/open3.rb:67: warning: Insecure world writable dir /srv/cloud/one in PATH, mode 040777

Wed Mar 24 14:49:44 2010 [VMM][D]: Message received: LOG - 111 Connecting to uri: qemu:///system

Wed Mar 24 14:49:44 2010 [VMM][D]: Message received: LOG - 111 error: Failed to restore domain from /srv/cloud/one/var/111/images/checkpoint

Wed Mar 24 14:49:44 2010 [VMM][D]: Message received: LOG - 111 error: operation failed: cannot read domain image

Wed Mar 24 14:49:44 2010 [VMM][D]: Message received: LOG - 111 ExitCode: 1

Wed Mar 24 14:49:44 2010 [VMM][D]: Message received: RESTORE FAILURE 111 -

Wed Mar 24 14:49:44 2010 [TM][D]: Message received: LOG - 111 tm_delete.sh: Deleting /srv/cloud/one/var/111/images

Wed Mar 24 14:49:44 2010 [TM][D]: Message received: LOG - 111 tm_delete.sh: Executed "rm -rf /srv/cloud/one/var/111/images".
*****END*****
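
The save error above says the checkpoint file could not be created, so one thing that can be checked is whether the images directory is writable for the user that actually performs the save (with qemu:///system this may not be the same user that ran the touch). The following is only a minimal sketch: check_writable is a hypothetical helper, not part of OpenNebula or libvirt, and the path in the usage comment is the one from the log.

```shell
#!/bin/sh
# check_writable DIR: hypothetical helper that tries to create and remove
# a probe file in DIR and reports whether that worked.
check_writable() {
    dir=$1
    probe="$dir/.write_probe.$$"
    if touch "$probe" 2>/dev/null; then
        rm -f "$probe"
        echo "writable: $dir"
    else
        echo "NOT writable: $dir"
        return 1
    fi
}

# Usage (path taken from the log above), to be run once per user of interest:
#   check_writable /srv/cloud/one/var/111/images
```

Running it once as the OpenNebula user and once as root on the node would show whether the two see different permissions on that directory.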



If we suspend and resume directly via virsh, the VM resumes and runs like before. This is not a VNC issue: we ping the machine the whole time, and it does not answer until we suspend and resume it via virsh on the physical node.
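
For reference, the manual workaround on the destination node looks roughly like the sketch below. The domain name one-111 is the one from the logs above; the run helper and the DRY_RUN and VIRSH_URI variables are only illustrative assumptions and not part of OpenNebula or libvirt.

```shell
#!/bin/sh
# Sketch of the manual workaround: suspend, then resume, a domain on the
# node it was migrated to. VIRSH_URI and DRY_RUN are illustrative names.
VIRSH_URI=${VIRSH_URI:-qemu:///system}

# Run a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Suspend the domain and, only if that succeeded, resume it again.
unfreeze_domain() {
    domain=$1
    run virsh --connect "$VIRSH_URI" suspend "$domain" &&
    run virsh --connect "$VIRSH_URI" resume "$domain"
}

# Usage on the physical node the VM was migrated to:
#   unfreeze_domain one-111
```

Setting DRY_RUN=1 before the call only prints the two virsh commands, which is handy for checking the domain name first.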

We also ran into other problems when compiling a newer version of libvirt from source (we thought the Ubuntu package might be too old). Which system configuration and package versions do you use? We are considering a clean installation on Ubuntu 9.04, since we currently use 9.10.

Best,

Stefan


> -----Original Message-----
> From: Javier Fontan [mailto:jfontan at gmail.com]
> Sent: Wednesday, 24 March 2010 12:33
> To: Harder, Stefan
> Cc: users at lists.opennebula.org
> Subject: Re: [one-users] VMs freezing after livemigrating
> 
> Hello,
> 
> I never had that problem myself. Can you check that the state in virsh
> is running? I suppose you checked that the VM is frozen by connecting
> with VNC. Can you also check the logs of your unfrozen machine for any
> strange message about the CPU or anything else that could be stopping
> it from waking up again?
> 
> Bye
> 
> 
> On Thu, Mar 18, 2010 at 11:47 AM, Harder, Stefan
> <Stefan.Harder at fokus.fraunhofer.de> wrote:
> > Hi,
> >
> > after solving some issues, live migration works in my test environment
> > (3 servers; one of them is both the cloud controller and a node, and
> > the other two are only nodes). The problem I have now is that the VMs
> > freeze after live migration. The only way to bring them back to life is
> > to do a "virsh suspend <name>" and "virsh resume <name>" on the physical
> > node the VM was migrated to. Is this issue, or even a solution, known
> > to you?
> >
> > Best regards,
> >
> > Stefan
> > _______________________________________________
> > Users mailing list
> > Users at lists.opennebula.org
> > http://lists.opennebula.org/listinfo.cgi/users-opennebula.org
> >
> 
> 
> 
> --
> Javier Fontan, Grid & Virtualization Technology Engineer/Researcher
> DSA Research Group: http://dsa-research.org
> Globus GridWay Metascheduler: http://www.GridWay.org
> OpenNebula Virtual Infrastructure Engine: http://www.OpenNebula.org

