[one-users] Problem (live)migrating VMs...
Gonçalo Borges
goncalo at lip.pt
Wed Aug 12 16:42:35 PDT 2009
Hi All...
I've set up a testbed with 2 cluster nodes and a frontend to test
opennebula 1.4. By the way, when I ask to download version 1.4, it pulls
one-1.3.80. Is this right?
Nevertheless, I think I've configured and installed everything as
described in the docs, but since I'm new to opennebula, most likely I'm
doing something wrong. I have set up a self-contained opennebula
installation and a storage area for the images, both shared via iSCSI
between the frontend and the cluster nodes. At this point everything
seems OK, since my cluster nodes are properly monitored and I can start
xen virtual machines.
---*---
-bash-3.2$ onehost list
 HID NAME     RVM  TCPU  FCPU  ACPU     TMEM     FMEM STAT
   1 core19     0   800   799   799  2516480  2371686   on
   2 core05     2   800   800   800  2516480  2161971   on
---*---
-bash-3.2$ onevm list
  ID USER     NAME     STAT CPU     MEM HOSTNAME        TIME
   7 oneadmin sge02.nc runn   0 1048328   core05 00 00:13:55
   8 oneadmin sge03.nc runn   0 1048412   core05 00 00:12:29
---*---
However, it seems that starting VMs is the only relevant thing I'm able
to do. I'm interested in the live migration feature, and this was the
first thing I tested. The result was no migration at all, and the
following logs:
### $ONE_LOCATION/var/oned.log ###
Wed Aug 12 23:46:43 2009 [ReM][D]: VirtualMachineMigrate invoked
Wed Aug 12 23:46:43 2009 [DiM][D]: Live-migrating VM 7
Wed Aug 12 23:46:44 2009 [VMM][D]: Message received: LOG - 7 Command
execution fail: sudo /usr/sbin/xm migrate -l one-7 core19
Wed Aug 12 23:46:44 2009 [VMM][D]: Message received: LOG - 7 STDERR follows.
Wed Aug 12 23:46:44 2009 [VMM][D]: Message received: LOG - 7 Error:
can't connect: Connection refused
Wed Aug 12 23:46:44 2009 [VMM][D]: Message received: LOG - 7 ExitCode: 1
Wed Aug 12 23:46:44 2009 [VMM][D]: Message received: MIGRATE FAILURE 7 -
Wed Aug 12 23:46:47 2009 [VMM][D]: Message received: POLL SUCCESS 7
USEDMEMORY=1048384 USEDCPU=0.0 NETTX=7 NETRX=165 STATE=a
### $ONE_LOCATION/var/7/vm.log ###
Wed Aug 12 23:46:43 2009 [LCM][I]: New VM state is MIGRATE
Wed Aug 12 23:46:44 2009 [VMM][I]: Command execution fail: sudo
/usr/sbin/xm migrate -l one-7 core19
Wed Aug 12 23:46:44 2009 [VMM][I]: STDERR follows.
Wed Aug 12 23:46:44 2009 [VMM][I]: Error: can't connect: Connection refused
Wed Aug 12 23:46:44 2009 [VMM][I]: ExitCode: 1
Wed Aug 12 23:46:44 2009 [VMM][E]: Error live-migrating VM, -
Wed Aug 12 23:46:44 2009 [LCM][I]: Fail to life migrate VM. Assuming
that the VM is still RUNNING (will poll VM).
There are no firewalls around to block connections, so I do not
understand where the message "Error: can't connect: Connection refused"
is coming from.
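One thing I will check when I'm back (just a guess at this point): as far
as I know, "xm migrate -l" connects to the xend relocation server on the
destination node, which listens on TCP port 8002 by default and is
disabled in many stock Xen installs. If it is disabled on core19, the
destination would refuse the connection exactly like this. The relevant
lines in /etc/xen/xend-config.sxp on every node would be something like
(the empty hosts-allow value is permissive; it should be tightened in
production, and xend has to be restarted after the change):

```
(xend-relocation-server yes)
(xend-relocation-port 8002)
(xend-relocation-hosts-allow '')
```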
Afterwards I decided to try a simple (non-live) migrate. Here, it
complains that it cannot restore the machine.
### $ONE_LOCATION/var/oned.log ###
Wed Aug 12 23:56:58 2009 [DiM][D]: Migrating VM 7
Wed Aug 12 23:57:19 2009 [VMM][I]: Monitoring VM 8.
Wed Aug 12 23:57:22 2009 [VMM][D]: Message received: POLL SUCCESS 8
USEDMEMORY=1048320 USEDCPU=0.0 NETTX=8 NETRX=160 STATE=a
Wed Aug 12 23:57:29 2009 [VMM][D]: Message received: SAVE SUCCESS 7 -
Wed Aug 12 23:57:29 2009 [TM][D]: Message received: LOG - 7 tm_mv.sh:
Will not move, source and destination are equal
Wed Aug 12 23:57:29 2009 [TM][D]: Message received: TRANSFER SUCCESS 7 -
Wed Aug 12 23:57:29 2009 [VMM][D]: Message received: LOG - 7 Command
execution fail: sudo /usr/sbin/xm restore
/srv01/cloud/images/7/images/checkpoint
Wed Aug 12 23:57:29 2009 [VMM][D]: Message received: LOG - 7 STDERR follows.
Wed Aug 12 23:57:29 2009 [VMM][D]: Message received: LOG - 7 Error:
Restore failed
Wed Aug 12 23:57:29 2009 [VMM][D]: Message received: LOG - 7 ExitCode: 1
Wed Aug 12 23:57:29 2009 [VMM][D]: Message received: RESTORE FAILURE 7 -
Wed Aug 12 23:57:30 2009 [TM][D]: Message received: LOG - 7
tm_delete.sh: Deleting /srv01/cloud/images/7/images
Wed Aug 12 23:57:30 2009 [TM][D]: Message received: LOG - 7
tm_delete.sh: Executed "rm -rf /srv01/cloud/images/7/images".
Wed Aug 12 23:57:30 2009 [TM][D]: Message received: TRANSFER SUCCESS 7 -
### $ONE_LOCATION/var/7/vm.log ###
Wed Aug 12 23:56:58 2009 [LCM][I]: New VM state is SAVE_MIGRATE
Wed Aug 12 23:57:29 2009 [LCM][I]: New VM state is PROLOG_MIGRATE
Wed Aug 12 23:57:29 2009 [TM][I]: tm_mv.sh: Will not move, source and
destination are equal
Wed Aug 12 23:57:29 2009 [LCM][I]: New VM state is BOOT
Wed Aug 12 23:57:29 2009 [VMM][I]: Command execution fail: sudo
/usr/sbin/xm restore /srv01/cloud/images/7/images/checkpoint
Wed Aug 12 23:57:29 2009 [VMM][I]: STDERR follows.
Wed Aug 12 23:57:29 2009 [VMM][I]: Error: Restore failed
Wed Aug 12 23:57:29 2009 [VMM][I]: ExitCode: 1
Wed Aug 12 23:57:29 2009 [VMM][E]: Error restoring VM, -
Wed Aug 12 23:57:29 2009 [DiM][I]: New VM state is FAILED
Wed Aug 12 23:57:30 2009 [TM][W]: Ignored: LOG - 7 tm_delete.sh:
Deleting /srv01/cloud/images/7/images
Wed Aug 12 23:57:30 2009 [TM][W]: Ignored: LOG - 7 tm_delete.sh:
Executed "rm -rf /srv01/cloud/images/7/images".
Wed Aug 12 23:57:30 2009 [TM][W]: Ignored: TRANSFER SUCCESS 7 -
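On Monday, the first thing I plan to verify (a diagnostic sketch, not a
known fix) is whether the checkpoint file that "xm restore" was pointed
at actually exists, is non-empty, and is readable on the node where the
restore runs; the path below is taken verbatim from the log above:

```shell
#!/bin/sh
# Diagnostic sketch: verify the checkpoint that "xm restore" complained
# about. Path taken from the log; run this on the node doing the restore.
CKPT=/srv01/cloud/images/7/images/checkpoint

if [ -s "$CKPT" ]; then
    # file exists and is non-empty; print its size for a sanity check
    echo "checkpoint present ($(wc -c < "$CKPT") bytes)"
else
    echo "checkpoint missing or empty at $CKPT"
fi
```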
Even the stop and resume commands fail, with the following logs:
### $ONE_LOCATION/var/oned.log ###
Thu Aug 13 00:25:01 2009 [InM][I]: Monitoring host core19 (1)
Thu Aug 13 00:25:02 2009 [VMM][D]: Message received: SAVE SUCCESS 10 -
Thu Aug 13 00:25:03 2009 [TM][D]: Message received: LOG - 10 tm_mv.sh:
Will not move, is not saving image
Thu Aug 13 00:25:03 2009 [TM][D]: Message received: TRANSFER SUCCESS 10 -
Thu Aug 13 00:25:05 2009 [InM][D]: Host 1 successfully monitored.
Thu Aug 13 00:25:12 2009 [ReM][D]: VirtualMachineDeploy invoked
Thu Aug 13 00:25:31 2009 [InM][I]: Monitoring host core05 (2)
Thu Aug 13 00:25:34 2009 [InM][D]: Host 2 successfully monitored.
Thu Aug 13 00:25:36 2009 [ReM][D]: VirtualMachineAction invoked
Thu Aug 13 00:25:36 2009 [DiM][D]: Restarting VM 10
Thu Aug 13 00:25:36 2009 [DiM][E]: Could not restart VM 10, wrong state.
Thu Aug 13 00:25:52 2009 [ReM][D]: VirtualMachineAction invoked
Thu Aug 13 00:25:52 2009 [DiM][D]: Resuming VM 10
Thu Aug 13 00:26:01 2009 [InM][I]: Monitoring host core19 (1)
Thu Aug 13 00:26:02 2009 [ReM][D]: VirtualMachineDeploy invoked
Thu Aug 13 00:26:02 2009 [DiM][D]: Deploying VM 10
Thu Aug 13 00:26:02 2009 [TM][D]: Message received: LOG - 10 Command
execution fail: /srv01/cloud/one/lib/tm_commands/nfs/tm_mv.sh
one01.ncg.ingrid.pt:/srv01/cloud/one/var/10/images
core19:/srv01/cloud/images/10/images
Thu Aug 13 00:26:02 2009 [TM][D]: Message received: LOG - 10 STDERR follows.
Thu Aug 13 00:26:02 2009 [TM][D]: Message received: LOG - 10 ERROR
MESSAGE --8<------
Thu Aug 13 00:26:02 2009 [TM][D]: Message received: LOG - 10 mv: cannot
stat `/srv01/cloud/one/var/10/images': No such file or directory
Thu Aug 13 00:26:02 2009 [TM][D]: Message received: LOG - 10 ERROR
MESSAGE ------>8--
Thu Aug 13 00:26:02 2009 [TM][D]: Message received: LOG - 10 ExitCode: 255
Thu Aug 13 00:26:02 2009 [TM][D]: Message received: LOG - 10 tm_mv.sh:
Moving /srv01/cloud/one/var/10/images
Thu Aug 13 00:26:02 2009 [TM][D]: Message received: LOG - 10 tm_mv.sh:
ERROR: Command "mv /srv01/cloud/one/var/10/images
/srv01/cloud/images/10/images" failed.
Thu Aug 13 00:26:02 2009 [TM][D]: Message received: LOG - 10 tm_mv.sh:
ERROR: mv: cannot stat `/srv01/cloud/one/var/10/images': No such file or
directory
Thu Aug 13 00:26:02 2009 [TM][D]: Message received: TRANSFER FAILURE 10
mv: cannot stat `/srv01/cloud/one/var/10/images': No such file or directory
Thu Aug 13 00:26:03 2009 [TM][D]: Message received: LOG - 10
tm_delete.sh: Deleting /srv01/cloud/images/10/images
Thu Aug 13 00:26:03 2009 [TM][D]: Message received: LOG - 10
tm_delete.sh: Executed "rm -rf /srv01/cloud/images/10/images".
Thu Aug 13 00:26:03 2009 [TM][D]: Message received: TRANSFER SUCCESS 10 -
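As another quick sanity check (again just a sketch, using the two paths
that appear in the log above), I will look at where VM 10's image
directory actually ended up after the stop: the resume-time tm_mv.sh
expects it back under $ONE_LOCATION/var, but the earlier "Will not move,
is not saving image" line suggests the save step left it elsewhere:

```shell
#!/bin/sh
# Diagnostic sketch: locate VM 10's image directory after "onevm stop".
# Both candidate paths are taken verbatim from the log above.
for d in /srv01/cloud/one/var/10/images /srv01/cloud/images/10/images; do
    if [ -d "$d" ]; then
        echo "found:   $d"
        ls -l "$d"
    else
        echo "missing: $d"
    fi
done
```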
So, any feedback on these issues is most welcome.
A different issue I'd like to ask about is whether this opennebula
version supports recovery of virtual machines. A colleague of mine saw,
in previous ONE versions, that if a cluster node goes down, the VMs
running there are marked as failed in the DB and are never restarted,
even if the physical host recovers completely. What I (and, I suspect,
most site admins) would like to see is those VMs being started again. I
do not care about checkpointing; I just would like to see the VMs start.
If the VMs come up in some inconsistent state, that is a completely
separate question. Nevertheless, 90% of the time a simple file system
check is sufficient to recover a machine.
Thanks for any feedback. Probably, I can only react on Monday.
Cheers
Goncalo