[one-users] Problem (live)migrating VMs...

Ruben S. Montero rubensm at dacya.ucm.es
Thu Aug 13 02:19:47 PDT 2009


Hi

* Live Migration

The "Error: can't connect: Connection refused" message most likely
means that xm cannot connect to xend on core19. Check that you have
configured Xen to allow live migrations (the relocation server, port
and allowed hosts in xend-config.sxp...)
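For reference, the relocation settings usually live in
/etc/xen/xend-config.sxp and look something like the fragment below
(a sketch only; the exact port and allowed-hosts list depend on your
setup, and xend must be restarted after changing them):

```
# /etc/xen/xend-config.sxp -- typical live-migration settings
(xend-relocation-server yes)
(xend-relocation-port 8002)
# listen on all interfaces; restrict in production
(xend-relocation-address '')
# empty allows any host; restrict to your cluster nodes in production
(xend-relocation-hosts-allow '')
```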

* Migration / Save - Restore.

It seems there is a problem with Xen restoring the images. Could you
check the following:

1.- Start a VM through OpenNebula
2.- Go to the node where the VM is running and execute:
sudo /usr/sbin/xm save one-<VM_ID> /srv01/cloud/images/<VM_ID>/images/checkpoint
3.- Check that the checkpoint file is created and its ownership
4.- Restore the VM
sudo /usr/sbin/xm restore /srv01/cloud/images/<VM_ID>/images/checkpoint

Check that you can do the save/restore with the oneadmin account; if
you run into problems, the Xen log files can sometimes be useful...
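The steps above can be sketched as a small shell helper. This is only
a sketch: it assumes the one-<VM_ID> domain naming OpenNebula uses and
the /srv01/cloud/images layout from your logs, and it should be run on
the node hosting the VM, as oneadmin:

```shell
# Sketch of the save/restore check (assumes Xen's xm and the paths above).
save_restore_check() {
    vm_id="$1"
    ckpt="/srv01/cloud/images/${vm_id}/images/checkpoint"

    # Step 2: save the running domain to a checkpoint file
    sudo /usr/sbin/xm save "one-${vm_id}" "$ckpt" \
        || { echo "save failed" >&2; return 1; }

    # Step 3: the checkpoint should exist and be readable by oneadmin
    ls -l "$ckpt"

    # Step 4: restore the domain from the checkpoint
    sudo /usr/sbin/xm restore "$ckpt" \
        || { echo "restore failed" >&2; return 1; }
}
```

If the restore fails here too, the failure is at the Xen level rather
than in OpenNebula.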

* Stop / Resume

I am not totally sure, but from your log files it seems that VM_DIR is
set to /srv01/cloud/images/. Try using the default location
($ONE_LOCATION/var == /srv01/cloud/one/var) by commenting out the
VM_DIR variable in oned.conf.
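That is, something along these lines in oned.conf (the path shown is
from your logs; comment the line out rather than deleting it so you
can switch back):

```
# oned.conf -- fall back to the default $ONE_LOCATION/var
# (/srv01/cloud/one/var) by disabling the custom VM directory:
#VM_DIR=/srv01/cloud/images/
```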

Let's see if it works with that setting...

Cheers!

PS: you are working with the right version, 1.3.80 == 1.4 Beta1

2009/8/13 Gonçalo Borges <goncalo at lip.pt>:
> Hi All...
>
> I've mounted a testbed with 2 cluster nodes and a frontend to test
> opennebula 1.4. By the way, when I ask to download version 1.4, it pulls
> one-1.3.80. Is this right?
>
> Nevertheless, I think I've configured and installed everything as described
> in the docs, but since I'm a newbie in opennebula, most likely I'm doing
> something wrong. I have set up a self contained opennebula installation and
> a storage area for the images, both shared via iSCSI between frontend and
> cluster nodes. At this point everything seems ok since my cluster nodes are
> properly monitored, and I can start xen virtual machines.
>
> ---*---
>
> -bash-3.2$ onehost list
>  HID NAME                      RVM   TCPU   FCPU   ACPU    TMEM    FMEM STAT
>    1 core19                      0    800    799    799 2516480 2371686   on
>    2 core05                      2    800    800    800 2516480 2161971   on
>
>
> ---*---
>
> -bash-3.2$ onevm list
>   ID     USER     NAME STAT CPU     MEM        HOSTNAME        TIME
>    7 oneadmin sge02.nc runn   0 1048328          core05 00 00:13:55
>    8 oneadmin sge03.nc runn   0 1048412          core05 00 00:12:29
>
> ---*---
>
> However, it seems that the only relevant thing I'm able to do is to start
> VMs. I'm interested in the live migration feature, and this was the first
> thing I tested. The result was no migration at all, and the following logs:
>
> ### $ONE_LOCATION/var/oned.log ###
>
> Wed Aug 12 23:46:43 2009 [ReM][D]: VirtualMachineMigrate invoked
> Wed Aug 12 23:46:43 2009 [DiM][D]: Live-migrating VM 7
> Wed Aug 12 23:46:44 2009 [VMM][D]: Message received: LOG - 7 Command
> execution fail: sudo /usr/sbin/xm migrate -l one-7 core19
> Wed Aug 12 23:46:44 2009 [VMM][D]: Message received: LOG - 7 STDERR follows.
> Wed Aug 12 23:46:44 2009 [VMM][D]: Message received: LOG - 7 Error: can't
> connect: Connection refused
> Wed Aug 12 23:46:44 2009 [VMM][D]: Message received: LOG - 7 ExitCode: 1
> Wed Aug 12 23:46:44 2009 [VMM][D]: Message received: MIGRATE FAILURE 7 -
> Wed Aug 12 23:46:47 2009 [VMM][D]: Message received: POLL SUCCESS 7
> USEDMEMORY=1048384 USEDCPU=0.0 NETTX=7 NETRX=165  STATE=a
>
> ### $ONE_LOCATION/var/7/vm.log ###
>
> Wed Aug 12 23:46:43 2009 [LCM][I]: New VM state is MIGRATE
> Wed Aug 12 23:46:44 2009 [VMM][I]: Command execution fail: sudo /usr/sbin/xm
> migrate -l one-7 core19
> Wed Aug 12 23:46:44 2009 [VMM][I]: STDERR follows.
> Wed Aug 12 23:46:44 2009 [VMM][I]: Error: can't connect: Connection refused
> Wed Aug 12 23:46:44 2009 [VMM][I]: ExitCode: 1
> Wed Aug 12 23:46:44 2009 [VMM][E]: Error live-migrating VM, -
> Wed Aug 12 23:46:44 2009 [LCM][I]: Fail to life migrate VM. Assuming that
> the VM is still RUNNING (will poll VM).
>
> There are no FWs around to block connections, so I do not understand where
> the message "Error: can't connect: Connection refused" is coming from.
>
> Afterwards I decided to go to a simple migrate. Here, it complains it can
> not restore the machines.
>
> ### $ONE_LOCATION/var/oned.log ###
>
> Wed Aug 12 23:56:58 2009 [DiM][D]: Migrating VM 7
> Wed Aug 12 23:57:19 2009 [VMM][I]: Monitoring VM 8.
> Wed Aug 12 23:57:22 2009 [VMM][D]: Message received: POLL SUCCESS 8
> USEDMEMORY=1048320 USEDCPU=0.0 NETTX=8 NETRX=160  STATE=a
> Wed Aug 12 23:57:29 2009 [VMM][D]: Message received: SAVE SUCCESS 7 -
> Wed Aug 12 23:57:29 2009 [TM][D]: Message received: LOG - 7 tm_mv.sh: Will
> not move, source and destination are equal
> Wed Aug 12 23:57:29 2009 [TM][D]: Message received: TRANSFER SUCCESS 7 -
> Wed Aug 12 23:57:29 2009 [VMM][D]: Message received: LOG - 7 Command
> execution fail: sudo /usr/sbin/xm restore
> /srv01/cloud/images/7/images/checkpoint
> Wed Aug 12 23:57:29 2009 [VMM][D]: Message received: LOG - 7 STDERR follows.
> Wed Aug 12 23:57:29 2009 [VMM][D]: Message received: LOG - 7 Error: Restore
> failed
> Wed Aug 12 23:57:29 2009 [VMM][D]: Message received: LOG - 7 ExitCode: 1
> Wed Aug 12 23:57:29 2009 [VMM][D]: Message received: RESTORE FAILURE 7 -
> Wed Aug 12 23:57:30 2009 [TM][D]: Message received: LOG - 7 tm_delete.sh:
> Deleting /srv01/cloud/images/7/images
> Wed Aug 12 23:57:30 2009 [TM][D]: Message received: LOG - 7 tm_delete.sh:
> Executed "rm -rf /srv01/cloud/images/7/images".
> Wed Aug 12 23:57:30 2009 [TM][D]: Message received: TRANSFER SUCCESS 7 -
>
> ### $ONE_LOCATION/var/7/vm.log ###
>
> Wed Aug 12 23:56:58 2009 [LCM][I]: New VM state is SAVE_MIGRATE
> Wed Aug 12 23:57:29 2009 [LCM][I]: New VM state is PROLOG_MIGRATE
> Wed Aug 12 23:57:29 2009 [TM][I]: tm_mv.sh: Will not move, source and
> destination are equal
> Wed Aug 12 23:57:29 2009 [LCM][I]: New VM state is BOOT
> Wed Aug 12 23:57:29 2009 [VMM][I]: Command execution fail: sudo /usr/sbin/xm
> restore /srv01/cloud/images/7/images/checkpoint
> Wed Aug 12 23:57:29 2009 [VMM][I]: STDERR follows.
> Wed Aug 12 23:57:29 2009 [VMM][I]: Error: Restore failed
> Wed Aug 12 23:57:29 2009 [VMM][I]: ExitCode: 1
> Wed Aug 12 23:57:29 2009 [VMM][E]: Error restoring VM, -
> Wed Aug 12 23:57:29 2009 [DiM][I]: New VM state is FAILED
> Wed Aug 12 23:57:30 2009 [TM][W]: Ignored: LOG - 7 tm_delete.sh: Deleting
> /srv01/cloud/images/7/images
> Wed Aug 12 23:57:30 2009 [TM][W]: Ignored: LOG - 7 tm_delete.sh: Executed
> "rm -rf /srv01/cloud/images/7/images".
> Wed Aug 12 23:57:30 2009 [TM][W]: Ignored: TRANSFER SUCCESS 7 -
>
> Even a stop and resume command fail with the following logs:
>
> ### $ONE_LOCATION/var/oned.log ###
>
> Thu Aug 13 00:25:01 2009 [InM][I]: Monitoring host core19 (1)
> Thu Aug 13 00:25:02 2009 [VMM][D]: Message received: SAVE SUCCESS 10 -
> Thu Aug 13 00:25:03 2009 [TM][D]: Message received: LOG - 10 tm_mv.sh: Will
> not move, is not saving image
> Thu Aug 13 00:25:03 2009 [TM][D]: Message received: TRANSFER SUCCESS 10 -
> Thu Aug 13 00:25:05 2009 [InM][D]: Host 1 successfully monitored.
> Thu Aug 13 00:25:12 2009 [ReM][D]: VirtualMachineDeploy invoked
> Thu Aug 13 00:25:31 2009 [InM][I]: Monitoring host core05 (2)
> Thu Aug 13 00:25:34 2009 [InM][D]: Host 2 successfully monitored.
> Thu Aug 13 00:25:36 2009 [ReM][D]: VirtualMachineAction invoked
> Thu Aug 13 00:25:36 2009 [DiM][D]: Restarting VM 10
> Thu Aug 13 00:25:36 2009 [DiM][E]: Could not restart VM 10, wrong state.
> Thu Aug 13 00:25:52 2009 [ReM][D]: VirtualMachineAction invoked
> Thu Aug 13 00:25:52 2009 [DiM][D]: Resuming VM 10
> Thu Aug 13 00:26:01 2009 [InM][I]: Monitoring host core19 (1)
> Thu Aug 13 00:26:02 2009 [ReM][D]: VirtualMachineDeploy invoked
> Thu Aug 13 00:26:02 2009 [DiM][D]: Deploying VM 10
> Thu Aug 13 00:26:02 2009 [TM][D]: Message received: LOG - 10 Command
> execution fail: /srv01/cloud/one/lib/tm_commands/nfs/tm_mv.sh
> one01.ncg.ingrid.pt:/srv01/cloud/one/var/10/images
> core19:/srv01/cloud/images/10/images
> Thu Aug 13 00:26:02 2009 [TM][D]: Message received: LOG - 10 STDERR follows.
> Thu Aug 13 00:26:02 2009 [TM][D]: Message received: LOG - 10 ERROR MESSAGE
> --8<------
> Thu Aug 13 00:26:02 2009 [TM][D]: Message received: LOG - 10 mv: cannot stat
> `/srv01/cloud/one/var/10/images': No such file or directory
> Thu Aug 13 00:26:02 2009 [TM][D]: Message received: LOG - 10 ERROR MESSAGE
> ------>8--
> Thu Aug 13 00:26:02 2009 [TM][D]: Message received: LOG - 10 ExitCode: 255
> Thu Aug 13 00:26:02 2009 [TM][D]: Message received: LOG - 10 tm_mv.sh:
> Moving /srv01/cloud/one/var/10/images
> Thu Aug 13 00:26:02 2009 [TM][D]: Message received: LOG - 10 tm_mv.sh:
> ERROR: Command "mv /srv01/cloud/one/var/10/images
> /srv01/cloud/images/10/images" failed.
> Thu Aug 13 00:26:02 2009 [TM][D]: Message received: LOG - 10 tm_mv.sh:
> ERROR: mv: cannot stat `/srv01/cloud/one/var/10/images': No such file or
> directory
> Thu Aug 13 00:26:02 2009 [TM][D]: Message received: TRANSFER FAILURE 10 mv:
> cannot stat `/srv01/cloud/one/var/10/images': No such file or directory
> Thu Aug 13 00:26:03 2009 [TM][D]: Message received: LOG - 10 tm_delete.sh:
> Deleting /srv01/cloud/images/10/images
> Thu Aug 13 00:26:03 2009 [TM][D]: Message received: LOG - 10 tm_delete.sh:
> Executed "rm -rf /srv01/cloud/images/10/images".
> Thu Aug 13 00:26:03 2009 [TM][D]: Message received: TRANSFER SUCCESS 10 -
>
> So, any feedback on these issues is most welcome.
>
> Another issue I'd like to raise is whether this opennebula version
> supports recovery of virtual machines. Some colleagues of mine saw in
> previous one versions that, if a cluster node goes down, the VMs running
> there were marked as failed in the DB and were never restarted, even if
> that physical host recovered completely. What I (and most site admins)
> would like to see is the restart of those VMs. I do not care about
> checkpointing; I just would like to see the VMs starting. If the VMs
> start in some inconsistent state, that is a completely separate question.
> Nevertheless, 90% of the time, a simple file system check is sufficient
> to recover a machine.
>
> Thanks for any feedback. I will probably only be able to react on Monday.
>
> Cheers
> Goncalo
> _______________________________________________
> Users mailing list
> Users at lists.opennebula.org
> http://lists.opennebula.org/listinfo.cgi/users-opennebula.org
>
>



-- 
+---------------------------------------------------------------+
 Dr. Ruben Santiago Montero
 Associate Professor
 Distributed System Architecture Group (http://dsa-research.org)

 URL:    http://dsa-research.org/doku.php?id=people:ruben
 Weblog: http://blog.dsa-research.org/?author=7

 GridWay, http://www.gridway.org
 OpenNebula, http://www.opennebula.org
+---------------------------------------------------------------+

