[one-users] Problem (live)migrating VMs...

Gonçalo Borges goncalo at lip.pt
Tue Aug 18 08:58:50 PDT 2009


Hi Ruben...

> * Live Migration
>
> The "Error: can't connect: Connection refused" probably means that it
> cannot connect to xend on core19. Check that you have configured Xen to
> perform live migrations (ports and the like in xend.conf...)
>
>    

Your suggestion was very helpful. I had to modify /etc/xen/xend-config.sxp
on all cluster nodes and restart xend. After that, live migration worked.
The following variables were set as:

(xend-http-server yes)
(xend-relocation-server yes)
(xend-address '')
(xend-relocation-hosts-allow '')

This enables communication via ports 8000 and 8002:

[root@core05 ~]# netstat -nlp | grep 800
tcp        0      0 0.0.0.0:8000                0.0.0.0:*                   LISTEN      21656/python
tcp        0      0 0.0.0.0:8002                0.0.0.0:*                   LISTEN      21656/python
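
For anyone repeating this on several nodes, the four settings above can be checked with a few lines of shell. This is only a minimal sketch: the helper name check_xend_conf is hypothetical (it is not part of Xen), and it assumes each setting sits on its own line with no leading or trailing whitespace, exactly as listed above.

```shell
#!/bin/sh
# Hypothetical helper (not part of Xen): report which of the four settings
# listed above are missing from a xend configuration file.
# Usage: check_xend_conf [path]   (defaults to /etc/xen/xend-config.sxp)
check_xend_conf() {
    conf="${1:-/etc/xen/xend-config.sxp}"
    missing=0
    for setting in \
        "(xend-http-server yes)" \
        "(xend-relocation-server yes)" \
        "(xend-address '')" \
        "(xend-relocation-hosts-allow '')"
    do
        # -F: fixed string, -x: the whole line must match, so commented-out
        # lines such as "#(xend-relocation-server yes)" are not counted.
        if grep -qFx -e "$setting" "$conf"; then
            echo "ok:      $setting"
        else
            echo "missing: $setting"
            missing=1
        fi
    done
    # Exit status 0 only if all four settings were found.
    return "$missing"
}
```

Calling check_xend_conf with no argument on each node prints one line per setting and returns non-zero if any is missing. It only inspects the file; xend still has to be restarted for changes to take effect.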

Regarding the other issues, your suggestions weren't successful and I'm 
investigating why... I'll get back when I have further details.

Cheers
Goncalo

> * Migration / Save - Restore.
>
> It seems there is a problem with Xen restoring the images. Could you
> check the following:
>
> 1.- Start a VM through OpenNebula
> 2.- Go to the node where the VM is running and execute:
> sudo /usr/sbin/xm save one-<VM_ID>  /srv01/cloud/images/<VM_ID>/images/checkpoint
> 3.- Check that the checkpoint file is created and its ownership
> 4.- Restore the VM
> sudo /usr/sbin/xm restore /srv01/cloud/images/<VM_ID>/images/checkpoint
>
> Check that you can do the save/restore with the oneadmin account; if
> you have problems, the Xen log files can sometimes be useful...
>
> * Stop / Resume
>
> I am not totally sure, but from your log files it seems that VM_DIR is
> set to /srv01/cloud/images/. Try using the default location
> ($ONE_LOCATION/var == /srv01/cloud/one/var): just comment out the
> VM_DIR variable in oned.conf.
>
> Let's see if it works with that setting...
>
> Cheers!
>
> PS: you are working with the right version, 1.3.80 == 1.4 Beta1
>
> 2009/8/13 Gonçalo Borges<goncalo at lip.pt>:
>    
>> Hi All...
>>
>> I've set up a testbed with 2 cluster nodes and a frontend to test
>> OpenNebula 1.4. By the way, when I ask to download version 1.4, it pulls
>> one-1.3.80. Is this right?
>>
>> Nevertheless, I think I've configured and installed everything as described
>> in the docs, but since I'm a newbie in opennebula, most likely I'm doing
>> something wrong. I have set up a self contained opennebula installation and
>> a storage area for the images, both shared via iSCSI between frontend and
>> cluster nodes. At this point everything seems ok since my cluster nodes are
>> properly monitored, and I can start xen virtual machines.
>>
>> ---*---
>>
>> -bash-3.2$ onehost list
>>   HID NAME                      RVM   TCPU   FCPU   ACPU    TMEM    FMEM STAT
>>     1 core19                      0    800    799    799 2516480 2371686   on
>>     2 core05                      2    800    800    800 2516480 2161971   on
>>
>>
>> ---*---
>>
>> -bash-3.2$ onevm list
>>    ID     USER     NAME STAT CPU     MEM        HOSTNAME        TIME
>>     7 oneadmin sge02.nc runn   0 1048328          core05 00 00:13:55
>>     8 oneadmin sge03.nc runn   0 1048412          core05 00 00:12:29
>>
>> ---*---
>>
>> However, the one relevant thing it seems I'm able to do is to start VMs. I'm
>> interested in the live migration feature, and this was the first thing I
>> started to test. The result was no migration at all, and the following logs:
>>
>> ### $ONE_LOCATION/var/oned.log ###
>>
>> Wed Aug 12 23:46:43 2009 [ReM][D]: VirtualMachineMigrate invoked
>> Wed Aug 12 23:46:43 2009 [DiM][D]: Live-migrating VM 7
>> Wed Aug 12 23:46:44 2009 [VMM][D]: Message received: LOG - 7 Command
>> execution fail: sudo /usr/sbin/xm migrate -l one-7 core19
>> Wed Aug 12 23:46:44 2009 [VMM][D]: Message received: LOG - 7 STDERR follows.
>> Wed Aug 12 23:46:44 2009 [VMM][D]: Message received: LOG - 7 Error: can't
>> connect: Connection refused
>> Wed Aug 12 23:46:44 2009 [VMM][D]: Message received: LOG - 7 ExitCode: 1
>> Wed Aug 12 23:46:44 2009 [VMM][D]: Message received: MIGRATE FAILURE 7 -
>> Wed Aug 12 23:46:47 2009 [VMM][D]: Message received: POLL SUCCESS 7
>> USEDMEMORY=1048384 USEDCPU=0.0 NETTX=7 NETRX=165  STATE=a
>>
>> ### $ONE_LOCATION/var/7/vm.log ###
>>
>> Wed Aug 12 23:46:43 2009 [LCM][I]: New VM state is MIGRATE
>> Wed Aug 12 23:46:44 2009 [VMM][I]: Command execution fail: sudo /usr/sbin/xm
>> migrate -l one-7 core19
>> Wed Aug 12 23:46:44 2009 [VMM][I]: STDERR follows.
>> Wed Aug 12 23:46:44 2009 [VMM][I]: Error: can't connect: Connection refused
>> Wed Aug 12 23:46:44 2009 [VMM][I]: ExitCode: 1
>> Wed Aug 12 23:46:44 2009 [VMM][E]: Error live-migrating VM, -
>> Wed Aug 12 23:46:44 2009 [LCM][I]: Fail to life migrate VM. Assuming that
>> the VM is still RUNNING (will poll VM).
>>
>> There are no FWs around to block connections, so I do not understand where
>> the message "Error: can't connect: Connection refused" is coming from.
>>
>> Afterwards I decided to try a simple migrate. Here, it complains that it
>> cannot restore the machines.
>>
>> ### $ONE_LOCATION/var/oned.log ###
>>
>> Wed Aug 12 23:56:58 2009 [DiM][D]: Migrating VM 7
>> Wed Aug 12 23:57:19 2009 [VMM][I]: Monitoring VM 8.
>> Wed Aug 12 23:57:22 2009 [VMM][D]: Message received: POLL SUCCESS 8
>> USEDMEMORY=1048320 USEDCPU=0.0 NETTX=8 NETRX=160  STATE=a
>> Wed Aug 12 23:57:29 2009 [VMM][D]: Message received: SAVE SUCCESS 7 -
>> Wed Aug 12 23:57:29 2009 [TM][D]: Message received: LOG - 7 tm_mv.sh: Will
>> not move, source and destination are equal
>> Wed Aug 12 23:57:29 2009 [TM][D]: Message received: TRANSFER SUCCESS 7 -
>> Wed Aug 12 23:57:29 2009 [VMM][D]: Message received: LOG - 7 Command
>> execution fail: sudo /usr/sbin/xm restore
>> /srv01/cloud/images/7/images/checkpoint
>> Wed Aug 12 23:57:29 2009 [VMM][D]: Message received: LOG - 7 STDERR follows.
>> Wed Aug 12 23:57:29 2009 [VMM][D]: Message received: LOG - 7 Error: Restore
>> failed
>> Wed Aug 12 23:57:29 2009 [VMM][D]: Message received: LOG - 7 ExitCode: 1
>> Wed Aug 12 23:57:29 2009 [VMM][D]: Message received: RESTORE FAILURE 7 -
>> Wed Aug 12 23:57:30 2009 [TM][D]: Message received: LOG - 7 tm_delete.sh:
>> Deleting /srv01/cloud/images/7/images
>> Wed Aug 12 23:57:30 2009 [TM][D]: Message received: LOG - 7 tm_delete.sh:
>> Executed "rm -rf /srv01/cloud/images/7/images".
>> Wed Aug 12 23:57:30 2009 [TM][D]: Message received: TRANSFER SUCCESS 7 -
>>
>> ### $ONE_LOCATION/var/7/vm.log ###
>>
>> Wed Aug 12 23:56:58 2009 [LCM][I]: New VM state is SAVE_MIGRATE
>> Wed Aug 12 23:57:29 2009 [LCM][I]: New VM state is PROLOG_MIGRATE
>> Wed Aug 12 23:57:29 2009 [TM][I]: tm_mv.sh: Will not move, source and
>> destination are equal
>> Wed Aug 12 23:57:29 2009 [LCM][I]: New VM state is BOOT
>> Wed Aug 12 23:57:29 2009 [VMM][I]: Command execution fail: sudo /usr/sbin/xm
>> restore /srv01/cloud/images/7/images/checkpoint
>> Wed Aug 12 23:57:29 2009 [VMM][I]: STDERR follows.
>> Wed Aug 12 23:57:29 2009 [VMM][I]: Error: Restore failed
>> Wed Aug 12 23:57:29 2009 [VMM][I]: ExitCode: 1
>> Wed Aug 12 23:57:29 2009 [VMM][E]: Error restoring VM, -
>> Wed Aug 12 23:57:29 2009 [DiM][I]: New VM state is FAILED
>> Wed Aug 12 23:57:30 2009 [TM][W]: Ignored: LOG - 7 tm_delete.sh: Deleting
>> /srv01/cloud/images/7/images
>> Wed Aug 12 23:57:30 2009 [TM][W]: Ignored: LOG - 7 tm_delete.sh: Executed
>> "rm -rf /srv01/cloud/images/7/images".
>> Wed Aug 12 23:57:30 2009 [TM][W]: Ignored: TRANSFER SUCCESS 7 -
>>
>> Even the stop and resume commands fail, with the following logs:
>>
>> ### $ONE_LOCATION/var/oned.log ###
>>
>> Thu Aug 13 00:25:01 2009 [InM][I]: Monitoring host core19 (1)
>> Thu Aug 13 00:25:02 2009 [VMM][D]: Message received: SAVE SUCCESS 10 -
>> Thu Aug 13 00:25:03 2009 [TM][D]: Message received: LOG - 10 tm_mv.sh: Will
>> not move, is not saving image
>> Thu Aug 13 00:25:03 2009 [TM][D]: Message received: TRANSFER SUCCESS 10 -
>> Thu Aug 13 00:25:05 2009 [InM][D]: Host 1 successfully monitored.
>> Thu Aug 13 00:25:12 2009 [ReM][D]: VirtualMachineDeploy invoked
>> Thu Aug 13 00:25:31 2009 [InM][I]: Monitoring host core05 (2)
>> Thu Aug 13 00:25:34 2009 [InM][D]: Host 2 successfully monitored.
>> Thu Aug 13 00:25:36 2009 [ReM][D]: VirtualMachineAction invoked
>> Thu Aug 13 00:25:36 2009 [DiM][D]: Restarting VM 10
>> Thu Aug 13 00:25:36 2009 [DiM][E]: Could not restart VM 10, wrong state.
>> Thu Aug 13 00:25:52 2009 [ReM][D]: VirtualMachineAction invoked
>> Thu Aug 13 00:25:52 2009 [DiM][D]: Resuming VM 10
>> Thu Aug 13 00:26:01 2009 [InM][I]: Monitoring host core19 (1)
>> Thu Aug 13 00:26:02 2009 [ReM][D]: VirtualMachineDeploy invoked
>> Thu Aug 13 00:26:02 2009 [DiM][D]: Deploying VM 10
>> Thu Aug 13 00:26:02 2009 [TM][D]: Message received: LOG - 10 Command
>> execution fail: /srv01/cloud/one/lib/tm_commands/nfs/tm_mv.sh
>> one01.ncg.ingrid.pt:/srv01/cloud/one/var/10/images
>> core19:/srv01/cloud/images/10/images
>> Thu Aug 13 00:26:02 2009 [TM][D]: Message received: LOG - 10 STDERR follows.
>> Thu Aug 13 00:26:02 2009 [TM][D]: Message received: LOG - 10 ERROR MESSAGE
>> --8<------
>> Thu Aug 13 00:26:02 2009 [TM][D]: Message received: LOG - 10 mv: cannot stat
>> `/srv01/cloud/one/var/10/images': No such file or directory
>> Thu Aug 13 00:26:02 2009 [TM][D]: Message received: LOG - 10 ERROR MESSAGE
>> ------>8--
>> Thu Aug 13 00:26:02 2009 [TM][D]: Message received: LOG - 10 ExitCode: 255
>> Thu Aug 13 00:26:02 2009 [TM][D]: Message received: LOG - 10 tm_mv.sh:
>> Moving /srv01/cloud/one/var/10/images
>> Thu Aug 13 00:26:02 2009 [TM][D]: Message received: LOG - 10 tm_mv.sh:
>> ERROR: Command "mv /srv01/cloud/one/var/10/images
>> /srv01/cloud/images/10/images" failed.
>> Thu Aug 13 00:26:02 2009 [TM][D]: Message received: LOG - 10 tm_mv.sh:
>> ERROR: mv: cannot stat `/srv01/cloud/one/var/10/images': No such file or
>> directory
>> Thu Aug 13 00:26:02 2009 [TM][D]: Message received: TRANSFER FAILURE 10 mv:
>> cannot stat `/srv01/cloud/one/var/10/images': No such file or directory
>> Thu Aug 13 00:26:03 2009 [TM][D]: Message received: LOG - 10 tm_delete.sh:
>> Deleting /srv01/cloud/images/10/images
>> Thu Aug 13 00:26:03 2009 [TM][D]: Message received: LOG - 10 tm_delete.sh:
>> Executed "rm -rf /srv01/cloud/images/10/images".
>> Thu Aug 13 00:26:03 2009 [TM][D]: Message received: TRANSFER SUCCESS 10 -
>>
>> So, any feedback on these issues is most welcome.
>>
>> A separate question: does this OpenNebula version support recovery of
>> virtual machines? A colleague of mine saw in previous ONE versions that,
>> if a cluster node went down, the VMs running there were marked as failed
>> in the DB and were never restarted, even if the physical host recovered
>> completely. What I (and most site admins) would like to see is those VMs
>> being started again. I do not care about checkpointing; I just want the
>> VMs to start. If the VMs start in some inconsistent state, that is a
>> completely separate question. Nevertheless, 90% of the time a simple
>> file system check is sufficient to recover a machine.
>>
>> Thanks for any feedback. I will probably only be able to respond on Monday.
>>
>> Cheers
>> Goncalo
>> _______________________________________________
>> Users mailing list
>> Users at lists.opennebula.org
>> http://lists.opennebula.org/listinfo.cgi/users-opennebula.org
>>
>>
>>      
>
>    


