[one-users] VMs freezing after livemigrating

Rangababu Chakravarthula rbabu at hexagrid.com
Fri Apr 2 10:15:47 PDT 2010


Hello Stefan

We are also having the same issue. However, when I suspend and resume the VM
through OpenNebula, I am able to access the console and log on to the VM again.

Here are our setup details:

Host information using facter
---------------------------------------------------
kernel => Linux
kernelrelease => 2.6.31-16-server
lsbdistcodename => karmic
lsbdistdescription => Ubuntu 9.10
--------------------------------------------------
Libvirt version

  libvirt-bin                       0.7.0-1ubuntu13.1
  libvirt0                          0.7.0-1ubuntu13.1

OpenNebula 1.2

VM_DIR ="/nfs/path/to/storage"
Transfer Manager=NFS

This is actually a bug in libvirt for which Red Hat released a fix a while
ago, but Ubuntu only ships the fix in Lucid. Lucid is scheduled for release
on April 29, 2010.

https://bugs.launchpad.net/ubuntu/+source/libvirt/+bug/448674

Since it takes a while to test Lucid thoroughly before using it in
production, we are going with the workaround for now (a sketch of what we run
is below).
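
For reference, this is roughly the wrapper we use. It is only a minimal
sketch: it assumes the numeric VM ID is passed as the first argument, that
onevm is run on the front-end as the cloud user, and that onevm list has the
column layout shown in the logs below (ID NAME STAT ...). The suspend/resume
cycle is exactly the one shown in those logs.

---------------------------------------------------
#!/bin/bash
# Workaround sketch: suspend the VM through OpenNebula (which does a
# "virsh save" to a checkpoint file) and then resume it again
# (which does a "virsh restore").
# Assumption: $1 is the OpenNebula VM ID, e.g. 686.
VMID=$1

onevm suspend "${VMID}"

# Wait until onevm list reports the "susp" state for this VM
# (third column of the onevm list output).
while ! onevm list | awk -v id="${VMID}" '$1 == id { print $3 }' | grep -q susp; do
    sleep 5
done

onevm resume "${VMID}"
---------------------------------------------------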

Here are the logs from OpenNebula:

user at managementnode:~$ onevm list
  ID     NAME STAT CPU     MEM        HOSTNAME        TIME
 686 migratev runn   0  262144    10.10.20.159 01 22:25:50
user at managementnode:~$ onevm suspend 686

Fri Apr  2 11:56:53 2010 [LCM][I]: New VM state is SAVE_SUSPEND
Fri Apr  2 11:58:35 2010 [VMM][I]: Connecting to uri: qemu:///system
Fri Apr  2 11:58:35 2010 [VMM][I]: ExitCode: 0
Fri Apr  2 11:58:35 2010 [DiM][I]: New VM state is SUSPENDED


user at host:/nfs/path/to/storage/686$ ls -al images/
total 84556
drwxrwxrwx  2 oneadmin nogroup        5 2010-04-02 16:53 .
drwxr-xr-x+ 3 oneadmin nogroup        3 2010-03-31 18:27 ..
-rw-------+ 1 root     root    92243033 2010-04-02 16:54 checkpoint
-rw-r--r--+ 1 oneadmin nogroup      549 2010-03-31 18:27 deployment.0
lrwxrwxrwx  1 oneadmin nogroup       34 2010-03-31 18:27 disk.0 -> /nfs/path/to/storage/images/migratevm0


user at managementnode:~$ onevm list
 686 migratev susp   0  262144    10.10.20.159 01 22:29:53

At this point, connecting to the VM console fails: unable to connect to host, connection refused (111).

user at managementnode:~$ onevm resume 686

Fri Apr  2 12:02:00 2010 [DiM][I]: New VM state is ACTIVE.
Fri Apr  2 12:02:00 2010 [LCM][I]: Restoring VM
Fri Apr  2 12:02:00 2010 [LCM][I]: New state is BOOT
Fri Apr  2 12:02:01 2010 [VMM][I]: Connecting to uri: qemu:///system
Fri Apr  2 12:02:01 2010 [VMM][I]: ExitCode: 0
Fri Apr  2 12:02:01 2010 [LCM][I]: New VM state is RUNNING
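
To confirm on the node itself that the domain is really running again (and
not just that OpenNebula reports RUNNING), you can check directly with virsh,
assuming the one-<VMID> domain naming OpenNebula uses (so one-686 here):

user at host:~$ virsh --connect qemu:///system domstate one-686
user at host:~$ virsh --connect qemu:///system list

domstate should report "running" and one-686 should appear in the list; the
VM console is reachable again at this point.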


Ranga

On Wed, Mar 24, 2010 at 7:55 AM, Harder, Stefan <
Stefan.Harder at fokus.fraunhofer.de> wrote:

> Hi Javier,
>
> thanks for your answer.
>
> The state in virsh on the node we live-migrate the VM to is "running", and
> on the old node the VM disappears. There are no logs that show any unusual
> behavior inside the VM.
>
> If we suspend the VM via OpenNebula, it goes into the susp state, but the
> log shows an error:
>
> *****BEGIN*****
> Wed Mar 24 14:46:30 2010 [DiM][D]: Suspending VM 111
> Wed Mar 24 14:46:30 2010 [VMM][D]: Message received: LOG - 111 Command
> execution fail: 'touch /srv/cloud/one/var/111/images/checkpoint;virsh
> --connect qemu:///system save one-111
> /srv/cloud/one/var/111/images/checkpoint'
>
> Wed Mar 24 14:46:30 2010 [VMM][D]: Message received: LOG - 111 STDERR
> follows.
>
> Wed Mar 24 14:46:30 2010 [VMM][D]: Message received: LOG - 111
> /usr/lib/ruby/1.8/open3.rb:67: warning: Insecure world writable dir
> /srv/cloud/one in PATH, mode 040777
>
> Wed Mar 24 14:46:30 2010 [VMM][D]: Message received: LOG - 111 Connecting
> to uri: qemu:///system
>
> Wed Mar 24 14:46:30 2010 [VMM][D]: Message received: LOG - 111 error:
> Failed to save domain one-111 to /srv/cloud/one/var/111/images/checkpoint
>
> Wed Mar 24 14:46:30 2010 [VMM][D]: Message received: LOG - 111 error:
> operation failed: failed to create
> '/srv/cloud/one/var/111/images/checkpoint'
>
> Wed Mar 24 14:46:30 2010 [VMM][D]: Message received: LOG - 111 ExitCode: 1
>
> Wed Mar 24 14:46:30 2010 [VMM][D]: Message received: SAVE FAILURE 111 -
> *****END*****
>
>
>
>
> If we then try to resume the VM, the state changes to fail and the log
> shows:
>
> *****BEGIN*****
> Wed Mar 24 14:49:43 2010 [DiM][D]: Resuming VM 111
> Wed Mar 24 14:49:44 2010 [VMM][D]: Message received: LOG - 111 Command
> execution fail: virsh --connect qemu:///system restore
> /srv/cloud/one/var/111/images/checkpoint
>
> Wed Mar 24 14:49:44 2010 [VMM][D]: Message received: LOG - 111 STDERR
> follows.
>
> Wed Mar 24 14:49:44 2010 [VMM][D]: Message received: LOG - 111
> /usr/lib/ruby/1.8/open3.rb:67: warning: Insecure world writable dir
> /srv/cloud/one in PATH, mode 040777
>
> Wed Mar 24 14:49:44 2010 [VMM][D]: Message received: LOG - 111 Connecting
> to uri: qemu:///system
>
> Wed Mar 24 14:49:44 2010 [VMM][D]: Message received: LOG - 111 error:
> Failed to restore domain from /srv/cloud/one/var/111/images/checkpoint
>
> Wed Mar 24 14:49:44 2010 [VMM][D]: Message received: LOG - 111 error:
> operation failed: cannot read domain image
>
> Wed Mar 24 14:49:44 2010 [VMM][D]: Message received: LOG - 111 ExitCode: 1
>
> Wed Mar 24 14:49:44 2010 [VMM][D]: Message received: RESTORE FAILURE 111 -
>
> Wed Mar 24 14:49:44 2010 [TM][D]: Message received: LOG - 111 tm_delete.sh:
> Deleting /srv/cloud/one/var/111/images
>
> Wed Mar 24 14:49:44 2010 [TM][D]: Message received: LOG - 111 tm_delete.sh:
> Executed "rm -rf /srv/cloud/one/var/111/images".
> *****END*****
>
>
>
> If we do it directly via virsh, the VM resumes and runs like before. This
> is not a VNC issue: if we ping the machine the whole time, it does not
> answer until we suspend and resume it via virsh on the physical node.
>
> We faced some other problems compiling a newer version of libvirt from
> source (since we thought the Ubuntu-packaged version might be too old). Which
> system configuration and package versions do you use? We are considering a
> clean new installation with Ubuntu 9.04, since we use 9.10 now.
>
> Best,
>
> Stefan
>
>
> > -----Original Message-----
> > From: Javier Fontan [mailto:jfontan at gmail.com]
> > Sent: Wednesday, 24 March 2010 12:33
> > To: Harder, Stefan
> > Cc: users at lists.opennebula.org
> > Subject: Re: [one-users] VMs freezing after livemigrating
> >
> > Hello,
> >
> > I have never had that problem myself. Can you check that the state in
> > virsh is "running"? I suppose you check that the VM is frozen by
> > connecting via VNC. Can you also check the logs of the unfrozen machine
> > for any strange message about the CPU or anything else that could be
> > stopping it from waking up again?
> >
> > Bye
> >
> >
> > On Thu, Mar 18, 2010 at 11:47 AM, Harder, Stefan
> > <Stefan.Harder at fokus.fraunhofer.de> wrote:
> > > Hi,
> > >
> > > after solving some issues, live migration works in my test environment
> > > (3 servers, one of them is both the cloud controller and a node, and
> > > the other two are only nodes). The problem I have now is that the
> > > VMs freeze after live migration. The only way to get them back alive
> > > is to do a "virsh suspend <name>" and "virsh resume <name>" on the
> > > physical node where the VM was migrated to. Is this issue or even a
> > > solution known to you?
> > >
> > > Best regards,
> > >
> > > Stefan
> > > _______________________________________________
> > > Users mailing list
> > > Users at lists.opennebula.org
> > > http://lists.opennebula.org/listinfo.cgi/users-opennebula.org
> > >
> >
> >
> >
> > --
> > Javier Fontan, Grid & Virtualization Technology Engineer/Researcher
> > DSA Research Group: http://dsa-research.org
> > Globus GridWay Metascheduler: http://www.GridWay.org
> > OpenNebula Virtual Infrastructure Engine: http://www.OpenNebula.org
> _______________________________________________
> Users mailing list
> Users at lists.opennebula.org
> http://lists.opennebula.org/listinfo.cgi/users-opennebula.org
>