[one-users] VMs freezing after livemigrating
Harder, Stefan
stefan.harder at fokus.fraunhofer.de
Tue Apr 6 03:52:03 PDT 2010
Hi Ranga,
we set up our environment again with Ubuntu Server 9.04 64-bit, since we've
read that these freezing issues only come up with Ubuntu 9.10
(https://bugs.launchpad.net/ubuntu/+source/libvirt/+bug/448674). We then
installed all the needed packages and OpenNebula from source
(one-1.4.0.tar.gz), and now it works. We have two machines: one is the cloud
controller and a node at the same time, the other is only a node.
Here are some details:
Kernel: 2.6.28-11-server
libvirt-bin: 0.6.1-0ubuntu5.1
libvirt-ruby1.8: 0.0.7-1
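For comparison on your own nodes, the versions above can be read with
standard Ubuntu tooling:

uname -r                                    # kernel release
dpkg-query -W libvirt-bin libvirt-ruby1.8   # installed package versions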
Maybe this will work for you too.
Regards,
Stefan
From: Rangababu Chakravarthula [mailto:rbabu at hexagrid.com]
Sent: Friday, April 2, 2010 7:16 PM
To: Harder, Stefan
Cc: Javier Fontan; users at lists.opennebula.org
Subject: Re: [one-users] VMs freezing after livemigrating
Hello Stefan
We are also having the same issue, but when I use OpenNebula to suspend and
resume, I am able to access the console and log on to the VM.
Here are our setup details:
Host information using facter
---------------------------------------------------
kernel => Linux
kernelrelease => 2.6.31-16-server
lsbdistcodename => karmic
lsbdistdescription => Ubuntu 9.10
--------------------------------------------------
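For reference, facter prints these directly if you pass the fact names:

facter kernel kernelrelease lsbdistcodename lsbdistdescription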
Libvirt version
libvirt-bin 0.7.0-1ubuntu13.1
libvirt0 0.7.0-1ubuntu13.1
OpenNebula 1.2
VM_DIR ="/nfs/path/to/storage"
Transfer Manager=NFS
This is actually a bug in libvirt for which Red Hat released a fix long
ago, but Ubuntu ships the fix only in Lucid, which is scheduled for release
on April 29, 2010.
https://bugs.launchpad.net/ubuntu/+source/libvirt/+bug/448674
Since it takes a while to test Lucid thoroughly before using it in
production, we are going with the workaround.
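Roughly, the workaround is a suspend/resume cycle right after the
migration. A sketch with the onevm CLI (the VM and host IDs are examples):

onevm livemigrate 686 3   # live-migrate VM 686 to the host with ID 3
onevm suspend 686         # checkpoint the guest, which arrives frozen
                          # (wait until "onevm list" shows state "susp")
onevm resume 686          # restore from the checkpoint; the guest wakes up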
Here are the logs from OpenNebula:
user at managementnode:~$ onevm list
ID NAME STAT CPU MEM HOSTNAME TIME
686 migratev runn 0 262144 10.10.20.159 01 22:25:50
user at managementnode:~$ onevm suspend 686
Fri Apr 2 11:56:53 2010 [LCM][I]: New VM state is SAVE_SUSPEND
Fri Apr 2 11:58:35 2010 [VMM][I]: Connecting to uri: qemu:///system
Fri Apr 2 11:58:35 2010 [VMM][I]: ExitCode: 0
Fri Apr 2 11:58:35 2010 [DiM][I]: New VM state is SUSPENDED
user at host:/nfs/path/to/storage/686$ ls -al images/
total 84556
drwxrwxrwx 2 oneadmin nogroup 5 2010-04-02 16:53 .
drwxr-xr-x+ 3 oneadmin nogroup 3 2010-03-31 18:27 ..
-rw-------+ 1 root root 92243033 2010-04-02 16:54 checkpoint
-rw-r--r--+ 1 oneadmin nogroup 549 2010-03-31 18:27 deployment.0
lrwxrwxrwx 1 oneadmin nogroup 34 2010-03-31 18:27 disk.0 -> /nfs/path/to/storage/images/migratevm0
user at managementnode:~$ onevm list
686 migratev susp 0 262144 10.10.20.159 01 22:29:53
Trying the console while the VM is suspended gives:
unable to connect to host. connection refused 111
user at managementnode:~$ onevm resume 686
Fri Apr 2 12:02:00 2010 [DiM][I]: New VM state is ACTIVE.
Fri Apr 2 12:02:00 2010 [LCM][I]: Restoring VM
Fri Apr 2 12:02:00 2010 [LCM][I]: New state is BOOT
Fri Apr 2 12:02:01 2010 [VMM][I]: Connecting to uri: qemu:///system
Fri Apr 2 12:02:01 2010 [VMM][I]: ExitCode: 0
Fri Apr 2 12:02:01 2010 [LCM][I]: New VM state is RUNNING
Ranga
On Wed, Mar 24, 2010 at 7:55 AM, Harder, Stefan
<Stefan.Harder at fokus.fraunhofer.de> wrote:
Hi Javier,
thanks for your answer.
The state in virsh on the node we live-migrate the VM to is "running", and
on the old node the VM disappears. There are no logs showing any unusual
behavior inside the VM.
If we suspend via OpenNebula, the VM goes into the susp state, but the log
shows an error:
*****BEGIN*****
Wed Mar 24 14:46:30 2010 [DiM][D]: Suspending VM 111
Wed Mar 24 14:46:30 2010 [VMM][D]: Message received: LOG - 111 Command
execution fail: 'touch /srv/cloud/one/var/111/images/checkpoint;virsh
--connect qemu:///system save one-111
/srv/cloud/one/var/111/images/checkpoint'
Wed Mar 24 14:46:30 2010 [VMM][D]: Message received: LOG - 111 STDERR
follows.
Wed Mar 24 14:46:30 2010 [VMM][D]: Message received: LOG - 111
/usr/lib/ruby/1.8/open3.rb:67: warning: Insecure world writable dir
/srv/cloud/one in PATH, mode 040777
Wed Mar 24 14:46:30 2010 [VMM][D]: Message received: LOG - 111 Connecting to
uri: qemu:///system
Wed Mar 24 14:46:30 2010 [VMM][D]: Message received: LOG - 111 error: Failed
to save domain one-111 to /srv/cloud/one/var/111/images/checkpoint
Wed Mar 24 14:46:30 2010 [VMM][D]: Message received: LOG - 111 error:
operation failed: failed to create
'/srv/cloud/one/var/111/images/checkpoint'
Wed Mar 24 14:46:30 2010 [VMM][D]: Message received: LOG - 111 ExitCode: 1
Wed Mar 24 14:46:30 2010 [VMM][D]: Message received: SAVE FAILURE 111 -
*****END*****
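One possible cause worth checking here, though nothing in this thread
confirms it: if /srv/cloud/one lives on NFS and the server export uses
root_squash (the default), root-owned writes like this checkpoint will
fail. A quick check on the node:

grep /srv/cloud /proc/mounts                  # is the path NFS-mounted, and with which options?
touch /srv/cloud/one/var/111/images/roottest  # run as root: can files be created here at all?
# Also inspect /etc/exports on the NFS server for root_squash, which remaps
# root to nobody and would break exactly this kind of checkpoint write.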
If we then try to resume the VM, its state changes to fail and the log shows:
*****BEGIN*****
Wed Mar 24 14:49:43 2010 [DiM][D]: Resuming VM 111
Wed Mar 24 14:49:44 2010 [VMM][D]: Message received: LOG - 111 Command
execution fail: virsh --connect qemu:///system restore
/srv/cloud/one/var/111/images/checkpoint
Wed Mar 24 14:49:44 2010 [VMM][D]: Message received: LOG - 111 STDERR
follows.
Wed Mar 24 14:49:44 2010 [VMM][D]: Message received: LOG - 111
/usr/lib/ruby/1.8/open3.rb:67: warning: Insecure world writable dir
/srv/cloud/one in PATH, mode 040777
Wed Mar 24 14:49:44 2010 [VMM][D]: Message received: LOG - 111 Connecting to
uri: qemu:///system
Wed Mar 24 14:49:44 2010 [VMM][D]: Message received: LOG - 111 error: Failed
to restore domain from /srv/cloud/one/var/111/images/checkpoint
Wed Mar 24 14:49:44 2010 [VMM][D]: Message received: LOG - 111 error:
operation failed: cannot read domain image
Wed Mar 24 14:49:44 2010 [VMM][D]: Message received: LOG - 111 ExitCode: 1
Wed Mar 24 14:49:44 2010 [VMM][D]: Message received: RESTORE FAILURE 111 -
Wed Mar 24 14:49:44 2010 [TM][D]: Message received: LOG - 111 tm_delete.sh:
Deleting /srv/cloud/one/var/111/images
Wed Mar 24 14:49:44 2010 [TM][D]: Message received: LOG - 111 tm_delete.sh:
Executed "rm -rf /srv/cloud/one/var/111/images".
*****END*****
If we do it directly via virsh, the VM resumes and runs like before. This
is not a VNC issue: if we ping the machine continuously, it does not answer
until we suspend and resume it via virsh on the physical node.
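Concretely, what we run on the physical node is just (one-111 being the
name libvirt gives VM 111):

virsh --connect qemu:///system suspend one-111   # pause the frozen guest
virsh --connect qemu:///system resume one-111    # un-pause it; pings are answered again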
We faced some other problems compiling a newer version of libvirt from
source (since we thought the Ubuntu-packaged version might be too old).
Which system configuration and package versions do you use? We are
considering a clean new installation with Ubuntu 9.04, since we use 9.10 now.
Best,
Stefan
> -----Original Message-----
> From: Javier Fontan [mailto:jfontan at gmail.com]
> Sent: Wednesday, March 24, 2010 12:33 PM
> To: Harder, Stefan
> Cc: users at lists.opennebula.org
> Subject: Re: [one-users] VMs freezing after livemigrating
>
> Hello,
>
> I never had that problem myself. Can you check that the state in virsh
> is "running"? I suppose you check that the VM is frozen by connecting
> via VNC. Can you also check the unfrozen machine's logs for any strange
> message about the CPU or anything else that could be stopping it from
> waking up again?
>
> Bye
>
>
> On Thu, Mar 18, 2010 at 11:47 AM, Harder, Stefan
> <Stefan.Harder at fokus.fraunhofer.de> wrote:
> > Hi,
> >
> > after solving some issues, live migration works in my test environment
> > (3 servers; one of them is the cloud controller and a node at the same
> > time, and the other two are only nodes). The problem I have now is that
> > the VMs freeze after live-migrating. The only way to get them back alive
> > is to do a "virsh suspend <name>" and "virsh resume <name>" on the
> > physical node the VM was migrated to. Is this issue or even a solution
> > known to you?
> >
> > Best regards,
> >
> > Stefan
>
>
>
> --
> Javier Fontan, Grid & Virtualization Technology Engineer/Researcher
> DSA Research Group: http://dsa-research.org
> Globus GridWay Metascheduler: http://www.GridWay.org
> OpenNebula Virtual Infrastructure Engine: http://www.OpenNebula.org