[one-users] Problem (live)migrating VMs...
Gonçalo Borges
goncalo at lip.pt
Wed Aug 12 16:42:35 PDT 2009
Hi All...
I've set up a testbed with 2 cluster nodes and a frontend to test
opennebula 1.4. By the way, when I ask to download version 1.4, it pulls
one-1.3.80. Is this right?
Nevertheless, I think I've configured and installed everything as
described in the docs, but since I'm new to opennebula, most likely I'm
doing something wrong. I have set up a self-contained opennebula
installation and a storage area for the images, both shared via iSCSI
between the frontend and the cluster nodes. At this point everything
seems OK, since my cluster nodes are properly monitored and I can start
xen virtual machines.
---*---
-bash-3.2$ onehost list
 HID NAME     RVM  TCPU  FCPU  ACPU     TMEM     FMEM STAT
   1 core19     0   800   799   799  2516480  2371686   on
   2 core05     2   800   800   800  2516480  2161971   on
---*---
-bash-3.2$ onevm list
  ID USER     NAME     STAT CPU     MEM HOSTNAME        TIME
   7 oneadmin sge02.nc runn   0 1048328   core05 00 00:13:55
   8 oneadmin sge03.nc runn   0 1048412   core05 00 00:12:29
---*---
However, it seems that starting VMs is the only relevant thing I'm able
to do. I'm interested in the live migration feature, and this was the
first thing I tested. The result was no migration at all, and the
following logs:
### $ONE_LOCATION/var/oned.log ###
Wed Aug 12 23:46:43 2009 [ReM][D]: VirtualMachineMigrate invoked
Wed Aug 12 23:46:43 2009 [DiM][D]: Live-migrating VM 7
Wed Aug 12 23:46:44 2009 [VMM][D]: Message received: LOG - 7 Command
execution fail: sudo /usr/sbin/xm migrate -l one-7 core19
Wed Aug 12 23:46:44 2009 [VMM][D]: Message received: LOG - 7 STDERR follows.
Wed Aug 12 23:46:44 2009 [VMM][D]: Message received: LOG - 7 Error:
can't connect: Connection refused
Wed Aug 12 23:46:44 2009 [VMM][D]: Message received: LOG - 7 ExitCode: 1
Wed Aug 12 23:46:44 2009 [VMM][D]: Message received: MIGRATE FAILURE 7 -
Wed Aug 12 23:46:47 2009 [VMM][D]: Message received: POLL SUCCESS 7
USEDMEMORY=1048384 USEDCPU=0.0 NETTX=7 NETRX=165 STATE=a
### $ONE_LOCATION/var/7/vm.log ###
Wed Aug 12 23:46:43 2009 [LCM][I]: New VM state is MIGRATE
Wed Aug 12 23:46:44 2009 [VMM][I]: Command execution fail: sudo
/usr/sbin/xm migrate -l one-7 core19
Wed Aug 12 23:46:44 2009 [VMM][I]: STDERR follows.
Wed Aug 12 23:46:44 2009 [VMM][I]: Error: can't connect: Connection refused
Wed Aug 12 23:46:44 2009 [VMM][I]: ExitCode: 1
Wed Aug 12 23:46:44 2009 [VMM][E]: Error live-migrating VM, -
Wed Aug 12 23:46:44 2009 [LCM][I]: Fail to life migrate VM. Assuming
that the VM is still RUNNING (will poll VM).
There are no firewalls around to block connections, so I do not
understand where the message "Error: can't connect: Connection refused"
is coming from.
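One thing I will check when I'm back (just a guess at this point): as far
as I know, "xm migrate -l" connects to the xend relocation server on the
destination node, which listens on TCP port 8002 by default and is
disabled in many stock Xen installs. If it is disabled on core19, the
destination would refuse the connection exactly like this. The relevant
lines in /etc/xen/xend-config.sxp on every node would be something like
(the empty hosts-allow value is permissive; it should be tightened in
production, and xend has to be restarted after the change):

```
(xend-relocation-server yes)
(xend-relocation-port 8002)
(xend-relocation-hosts-allow '')
```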
Afterwards I decided to try a simple (non-live) migrate. Here, it
complains that it cannot restore the machine.
### $ONE_LOCATION/var/oned.log ###
Wed Aug 12 23:56:58 2009 [DiM][D]: Migrating VM 7
Wed Aug 12 23:57:19 2009 [VMM][I]: Monitoring VM 8.
Wed Aug 12 23:57:22 2009 [VMM][D]: Message received: POLL SUCCESS 8
USEDMEMORY=1048320 USEDCPU=0.0 NETTX=8 NETRX=160 STATE=a
Wed Aug 12 23:57:29 2009 [VMM][D]: Message received: SAVE SUCCESS 7 -
Wed Aug 12 23:57:29 2009 [TM][D]: Message received: LOG - 7 tm_mv.sh:
Will not move, source and destination are equal
Wed Aug 12 23:57:29 2009 [TM][D]: Message received: TRANSFER SUCCESS 7 -
Wed Aug 12 23:57:29 2009 [VMM][D]: Message received: LOG - 7 Command
execution fail: sudo /usr/sbin/xm restore
/srv01/cloud/images/7/images/checkpoint
Wed Aug 12 23:57:29 2009 [VMM][D]: Message received: LOG - 7 STDERR follows.
Wed Aug 12 23:57:29 2009 [VMM][D]: Message received: LOG - 7 Error:
Restore failed
Wed Aug 12 23:57:29 2009 [VMM][D]: Message received: LOG - 7 ExitCode: 1
Wed Aug 12 23:57:29 2009 [VMM][D]: Message received: RESTORE FAILURE 7 -
Wed Aug 12 23:57:30 2009 [TM][D]: Message received: LOG - 7
tm_delete.sh: Deleting /srv01/cloud/images/7/images
Wed Aug 12 23:57:30 2009 [TM][D]: Message received: LOG - 7
tm_delete.sh: Executed "rm -rf /srv01/cloud/images/7/images".
Wed Aug 12 23:57:30 2009 [TM][D]: Message received: TRANSFER SUCCESS 7 -
### $ONE_LOCATION/var/7/vm.log ###
Wed Aug 12 23:56:58 2009 [LCM][I]: New VM state is SAVE_MIGRATE
Wed Aug 12 23:57:29 2009 [LCM][I]: New VM state is PROLOG_MIGRATE
Wed Aug 12 23:57:29 2009 [TM][I]: tm_mv.sh: Will not move, source and
destination are equal
Wed Aug 12 23:57:29 2009 [LCM][I]: New VM state is BOOT
Wed Aug 12 23:57:29 2009 [VMM][I]: Command execution fail: sudo
/usr/sbin/xm restore /srv01/cloud/images/7/images/checkpoint
Wed Aug 12 23:57:29 2009 [VMM][I]: STDERR follows.
Wed Aug 12 23:57:29 2009 [VMM][I]: Error: Restore failed
Wed Aug 12 23:57:29 2009 [VMM][I]: ExitCode: 1
Wed Aug 12 23:57:29 2009 [VMM][E]: Error restoring VM, -
Wed Aug 12 23:57:29 2009 [DiM][I]: New VM state is FAILED
Wed Aug 12 23:57:30 2009 [TM][W]: Ignored: LOG - 7 tm_delete.sh:
Deleting /srv01/cloud/images/7/images
Wed Aug 12 23:57:30 2009 [TM][W]: Ignored: LOG - 7 tm_delete.sh:
Executed "rm -rf /srv01/cloud/images/7/images".
Wed Aug 12 23:57:30 2009 [TM][W]: Ignored: TRANSFER SUCCESS 7 -
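On Monday, the first thing I plan to verify (a diagnostic sketch, not a
known fix) is whether the checkpoint file that "xm restore" was pointed
at actually exists, is non-empty, and is readable on the node where the
restore runs; the path below is taken verbatim from the log above:

```shell
#!/bin/sh
# Diagnostic sketch: verify the checkpoint that "xm restore" complained
# about. Path taken from the log; run this on the node doing the restore.
CKPT=/srv01/cloud/images/7/images/checkpoint

if [ -s "$CKPT" ]; then
    # file exists and is non-empty; print its size for a sanity check
    echo "checkpoint present ($(wc -c < "$CKPT") bytes)"
else
    echo "checkpoint missing or empty at $CKPT"
fi
```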
Even the stop and resume commands fail, with the following logs:
### $ONE_LOCATION/var/oned.log ###
Thu Aug 13 00:25:01 2009 [InM][I]: Monitoring host core19 (1)
Thu Aug 13 00:25:02 2009 [VMM][D]: Message received: SAVE SUCCESS 10 -
Thu Aug 13 00:25:03 2009 [TM][D]: Message received: LOG - 10 tm_mv.sh:
Will not move, is not saving image
Thu Aug 13 00:25:03 2009 [TM][D]: Message received: TRANSFER SUCCESS 10 -
Thu Aug 13 00:25:05 2009 [InM][D]: Host 1 successfully monitored.
Thu Aug 13 00:25:12 2009 [ReM][D]: VirtualMachineDeploy invoked
Thu Aug 13 00:25:31 2009 [InM][I]: Monitoring host core05 (2)
Thu Aug 13 00:25:34 2009 [InM][D]: Host 2 successfully monitored.
Thu Aug 13 00:25:36 2009 [ReM][D]: VirtualMachineAction invoked
Thu Aug 13 00:25:36 2009 [DiM][D]: Restarting VM 10
Thu Aug 13 00:25:36 2009 [DiM][E]: Could not restart VM 10, wrong state.
Thu Aug 13 00:25:52 2009 [ReM][D]: VirtualMachineAction invoked
Thu Aug 13 00:25:52 2009 [DiM][D]: Resuming VM 10
Thu Aug 13 00:26:01 2009 [InM][I]: Monitoring host core19 (1)
Thu Aug 13 00:26:02 2009 [ReM][D]: VirtualMachineDeploy invoked
Thu Aug 13 00:26:02 2009 [DiM][D]: Deploying VM 10
Thu Aug 13 00:26:02 2009 [TM][D]: Message received: LOG - 10 Command
execution fail: /srv01/cloud/one/lib/tm_commands/nfs/tm_mv.sh
one01.ncg.ingrid.pt:/srv01/cloud/one/var/10/images
core19:/srv01/cloud/images/10/images
Thu Aug 13 00:26:02 2009 [TM][D]: Message received: LOG - 10 STDERR follows.
Thu Aug 13 00:26:02 2009 [TM][D]: Message received: LOG - 10 ERROR
MESSAGE --8<------
Thu Aug 13 00:26:02 2009 [TM][D]: Message received: LOG - 10 mv: cannot
stat `/srv01/cloud/one/var/10/images': No such file or directory
Thu Aug 13 00:26:02 2009 [TM][D]: Message received: LOG - 10 ERROR
MESSAGE ------>8--
Thu Aug 13 00:26:02 2009 [TM][D]: Message received: LOG - 10 ExitCode: 255
Thu Aug 13 00:26:02 2009 [TM][D]: Message received: LOG - 10 tm_mv.sh:
Moving /srv01/cloud/one/var/10/images
Thu Aug 13 00:26:02 2009 [TM][D]: Message received: LOG - 10 tm_mv.sh:
ERROR: Command "mv /srv01/cloud/one/var/10/images
/srv01/cloud/images/10/images" failed.
Thu Aug 13 00:26:02 2009 [TM][D]: Message received: LOG - 10 tm_mv.sh:
ERROR: mv: cannot stat `/srv01/cloud/one/var/10/images': No such file or
directory
Thu Aug 13 00:26:02 2009 [TM][D]: Message received: TRANSFER FAILURE 10
mv: cannot stat `/srv01/cloud/one/var/10/images': No such file or directory
Thu Aug 13 00:26:03 2009 [TM][D]: Message received: LOG - 10
tm_delete.sh: Deleting /srv01/cloud/images/10/images
Thu Aug 13 00:26:03 2009 [TM][D]: Message received: LOG - 10
tm_delete.sh: Executed "rm -rf /srv01/cloud/images/10/images".
Thu Aug 13 00:26:03 2009 [TM][D]: Message received: TRANSFER SUCCESS 10 -
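As another quick sanity check (again just a sketch, using the two paths
that appear in the log above), I will look at where VM 10's image
directory actually ended up after the stop: the resume-time tm_mv.sh
expects it back under $ONE_LOCATION/var, but the earlier "Will not move,
is not saving image" line suggests the save step left it elsewhere:

```shell
#!/bin/sh
# Diagnostic sketch: locate VM 10's image directory after "onevm stop".
# Both candidate paths are taken verbatim from the log above.
for d in /srv01/cloud/one/var/10/images /srv01/cloud/images/10/images; do
    if [ -d "$d" ]; then
        echo "found:   $d"
        ls -l "$d"
    else
        echo "missing: $d"
    fi
done
```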
So, any feedback on these issues is most welcome.
A different issue I'd like to ask about is whether this opennebula
version supports recovery of virtual machines. A colleague of mine saw,
in previous ONE versions, that if a cluster node goes down, the VMs
running there are marked as failed in the DB and are never restarted,
even if the physical host recovers completely. What I (and, I suspect,
most site admins) would like to see is those VMs being started again. I
do not care about checkpointing; I just would like to see the VMs start.
If the VMs come up in some inconsistent state, that is a completely
separate question. Nevertheless, 90% of the time a simple file system
check is sufficient to recover a machine.
Thanks for any feedback. Probably, I can only react on Monday.
Cheers
Goncalo