[one-users] VMs freezing after livemigrating

Harder, Stefan stefan.harder at fokus.fraunhofer.de
Tue Apr 6 03:52:03 PDT 2010


Hi Ranga,

 

we set up our environment again with Ubuntu Server 9.04 64-bit, since we read
that these freezing issues only occur on Ubuntu 9.10
(https://bugs.launchpad.net/ubuntu/+source/libvirt/+bug/448674). We then
installed all required packages, followed by OpenNebula from source
(one-1.4.0.tar.gz), and now it works. We have two machines: one is both the
cloud controller and a node, the other is only a node.

 

Here are some details:

 

Kernel: 2.6.28-11-server 

libvirt-bin: 0.6.1-0ubuntu5.1

libvirt-ruby1.8: 0.0.7-1

 

Maybe this will work for you too. 

 

Regards,

 

Stefan

 

From: Rangababu Chakravarthula [mailto:rbabu at hexagrid.com]
Sent: Friday, April 2, 2010 19:16
To: Harder, Stefan
Cc: Javier Fontan; users at lists.opennebula.org
Subject: Re: [one-users] VMs freezing after livemigrating

 

Hello Stefan

We are also having the same issue, but when I use OpenNebula to suspend and
resume, I am able to access the console and log on to the VM.

Here are our setup details:

Host information using facter
---------------------------------------------------
kernel => Linux
kernelrelease => 2.6.31-16-server
lsbdistcodename => karmic
lsbdistdescription => Ubuntu 9.10
--------------------------------------------------
Libvirt version

  libvirt-bin                       0.7.0-1ubuntu13.1

  libvirt0                          0.7.0-1ubuntu13.1 

OpenNebula 1.2

VM_DIR ="/nfs/path/to/storage"
Transfer Manager=NFS

This is actually a bug in libvirt for which Red Hat released a fix long ago,
but Ubuntu ships the fix only in Lucid. Lucid is scheduled for release on
April 29, 2010.

https://bugs.launchpad.net/ubuntu/+source/libvirt/+bug/448674

Since it takes a while to test Lucid completely before using it in
production, we are going with the workaround.

Here are the logs from OpenNebula:

user at managementnode:~$ onevm list
  ID     NAME STAT CPU     MEM        HOSTNAME        TIME
 686 migratev runn   0  262144    10.10.20.159 01 22:25:50
user at managementnode:~$ onevm suspend 686

Fri Apr  2 11:56:53 2010 [LCM][I]: New VM state is SAVE_SUSPEND
Fri Apr  2 11:58:35 2010 [VMM][I]: Connecting to uri: qemu:///system
Fri Apr  2 11:58:35 2010 [VMM][I]: ExitCode: 0
Fri Apr  2 11:58:35 2010 [DiM][I]: New VM state is SUSPENDED


user at host:/nfs/path/to/storage/686$ ls -al images/
total 84556
drwxrwxrwx  2 oneadmin nogroup        5 2010-04-02 16:53 .
drwxr-xr-x+ 3 oneadmin nogroup        3 2010-03-31 18:27 ..
-rw-------+ 1 root     root    92243033 2010-04-02 16:54 checkpoint
-rw-r--r--+ 1 oneadmin nogroup      549 2010-03-31 18:27 deployment.0
lrwxrwxrwx  1 oneadmin nogroup       34 2010-03-31 18:27 disk.0 ->
/nfs/path/to/storage/images/migratevm0


user at managementnode:~$ onevm list
 686 migratev susp   0  262144    10.10.20.159 01 22:29:53

unable to connect to host. connection refused 111

user at managementnode:~$ onevm resume 686

Fri Apr  2 12:02:00 2010 [DiM][I]: New VM state is ACTIVE.
Fri Apr  2 12:02:00 2010 [LCM][I]: Restoring VM
Fri Apr  2 12:02:00 2010 [LCM][I]: New state is BOOT
Fri Apr  2 12:02:01 2010 [VMM][I]: Connecting to uri: qemu:///system
Fri Apr  2 12:02:01 2010 [VMM][I]: ExitCode: 0
Fri Apr  2 12:02:01 2010 [LCM][I]: New VM state is RUNNING


Ranga

On Wed, Mar 24, 2010 at 7:55 AM, Harder, Stefan
<Stefan.Harder at fokus.fraunhofer.de> wrote:

Hi Javier,

thanks for your answer.

The state in virsh on the node we live-migrate the VM to is "running", and on
the old node the VM disappears. There are no logs showing any unusual
behavior inside the VM.

If we suspend via OpenNebula, the VM goes into the susp state, but the log
shows an error:

*****BEGIN*****
Wed Mar 24 14:46:30 2010 [DiM][D]: Suspending VM 111
Wed Mar 24 14:46:30 2010 [VMM][D]: Message received: LOG - 111 Command
execution fail: 'touch /srv/cloud/one/var/111/images/checkpoint;virsh
--connect qemu:///system save one-111
/srv/cloud/one/var/111/images/checkpoint'

Wed Mar 24 14:46:30 2010 [VMM][D]: Message received: LOG - 111 STDERR
follows.

Wed Mar 24 14:46:30 2010 [VMM][D]: Message received: LOG - 111
/usr/lib/ruby/1.8/open3.rb:67: warning: Insecure world writable dir
/srv/cloud/one in PATH, mode 040777

Wed Mar 24 14:46:30 2010 [VMM][D]: Message received: LOG - 111 Connecting to
uri: qemu:///system

Wed Mar 24 14:46:30 2010 [VMM][D]: Message received: LOG - 111 error: Failed
to save domain one-111 to /srv/cloud/one/var/111/images/checkpoint

Wed Mar 24 14:46:30 2010 [VMM][D]: Message received: LOG - 111 error:
operation failed: failed to create
'/srv/cloud/one/var/111/images/checkpoint'

Wed Mar 24 14:46:30 2010 [VMM][D]: Message received: LOG - 111 ExitCode: 1

Wed Mar 24 14:46:30 2010 [VMM][D]: Message received: SAVE FAILURE 111 -
*****END*****
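For what it's worth, one common cause of the "failed to create ... checkpoint" error above is a plain filesystem-permission problem: the user that qemu/libvirt runs as cannot write the images directory. A minimal sketch of the kind of check involved (it uses a temporary directory as a stand-in for /srv/cloud/one/var/111/images; on a real node, inspect the actual path instead):

```shell
# Stand-in for the real images directory; on an actual node, point ckdir
# at /srv/cloud/one/var/<VMID>/images and run this as the qemu/libvirt user.
ckdir=$(mktemp -d)
owner=$(stat -c '%U' "$ckdir")   # GNU coreutils stat; prints the owner name
mode=$(stat -c '%a' "$ckdir")    # octal permission bits
echo "owner=$owner mode=$mode"
# "virsh save" will fail if the qemu/libvirt user cannot create files here.
rm -rf "$ckdir"
```

If the owner or mode does not let the qemu/libvirt user create files there, that would explain the SAVE FAILURE independently of the migration bug.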




If we then try to resume the VM, the state changes to fail and the log shows:





*****BEGIN*****
Wed Mar 24 14:49:43 2010 [DiM][D]: Resuming VM 111
Wed Mar 24 14:49:44 2010 [VMM][D]: Message received: LOG - 111 Command
execution fail: virsh --connect qemu:///system restore
/srv/cloud/one/var/111/images/checkpoint

Wed Mar 24 14:49:44 2010 [VMM][D]: Message received: LOG - 111 STDERR
follows.

Wed Mar 24 14:49:44 2010 [VMM][D]: Message received: LOG - 111
/usr/lib/ruby/1.8/open3.rb:67: warning: Insecure world writable dir
/srv/cloud/one in PATH, mode 040777

Wed Mar 24 14:49:44 2010 [VMM][D]: Message received: LOG - 111 Connecting to
uri: qemu:///system

Wed Mar 24 14:49:44 2010 [VMM][D]: Message received: LOG - 111 error: Failed
to restore domain from /srv/cloud/one/var/111/images/checkpoint

Wed Mar 24 14:49:44 2010 [VMM][D]: Message received: LOG - 111 error:
operation failed: cannot read domain image

Wed Mar 24 14:49:44 2010 [VMM][D]: Message received: LOG - 111 ExitCode: 1

Wed Mar 24 14:49:44 2010 [VMM][D]: Message received: RESTORE FAILURE 111 -

Wed Mar 24 14:49:44 2010 [TM][D]: Message received: LOG - 111 tm_delete.sh:
Deleting /srv/cloud/one/var/111/images

Wed Mar 24 14:49:44 2010 [TM][D]: Message received: LOG - 111 tm_delete.sh:
Executed "rm -rf /srv/cloud/one/var/111/images".
*****END*****



If we do it directly via virsh, the VM resumes and runs like before. This is
not a VNC issue: if we ping the machine the whole time, it does not answer
until we suspend and resume it via virsh on the physical node.
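For reference, the workaround described in this thread boils down to two virsh calls on the destination node. A dry-run sketch (the domain name one-111 is taken from the logs above; the echo prefix only prints the commands, remove it to actually run them):

```shell
# Print the two virsh commands that un-freeze a migrated domain.
# Remove the "echo" prefix to execute them on the destination node.
unfreeze() {
    dom="$1"
    echo "virsh --connect qemu:///system suspend $dom"
    echo "virsh --connect qemu:///system resume $dom"
}
unfreeze one-111
```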

We faced some other problems compiling a newer libvirt from source (since we
thought the Ubuntu-packaged version might be too old). Which system
configuration and package versions do you use? We are considering a clean
installation on Ubuntu 9.04, since we currently use 9.10.

Best,

Stefan


> -----Original Message-----
> From: Javier Fontan [mailto:jfontan at gmail.com]
> Sent: Wednesday, March 24, 2010 12:33
> To: Harder, Stefan
> Cc: users at lists.opennebula.org
> Subject: Re: [one-users] VMs freezing after livemigrating

>
> Hello,
>
> I never had that problem myself. Can you check that the state in virsh
> is "running"? I suppose you check that the VM is frozen by connecting via
> VNC. Can you also check your unfrozen machine's logs for any strange
> message about the CPU, or anything that could stop it from waking up
> again?
>
> Bye
>
>
> On Thu, Mar 18, 2010 at 11:47 AM, Harder, Stefan
> <Stefan.Harder at fokus.fraunhofer.de> wrote:
> > Hi,
> >
> > after solving some issues, live migration works in my test environment
> > (3 servers: one of them is both the cloud controller and a node, and
> > the other two are only nodes). The problem I have now is that the VMs
> > freeze after live migration. The only way to get them back alive is to
> > do a "virsh suspend <name>" and "virsh resume <name>" on the physical
> > node the VM was migrated to. Is this issue, or even a solution, known
> > to you?
> >
> > Best regards,
> >
> > Stefan
> > _______________________________________________
> > Users mailing list
> > Users at lists.opennebula.org
> > http://lists.opennebula.org/listinfo.cgi/users-opennebula.org
> >
>
>
>
> --
> Javier Fontan, Grid & Virtualization Technology Engineer/Researcher
> DSA Research Group: http://dsa-research.org
> Globus GridWay Metascheduler: http://www.GridWay.org
> OpenNebula Virtual Infrastructure Engine: http://www.OpenNebula.org
_______________________________________________
Users mailing list
Users at lists.opennebula.org
http://lists.opennebula.org/listinfo.cgi/users-opennebula.org

 
