[one-users] fault tolerance

Tino Vazquez tinova at opennebula.org
Fri Sep 14 03:18:38 PDT 2012


Hi,

Regarding the problem with "onehost sync", does oneadmin have writting
permissions over /var/lib/one/remotes?

Regards,

-Tino

--
Constantino Vázquez Blanco, MSc
Project Engineer
OpenNebula - The Open-Source Solution for Data Center Virtualization
www.OpenNebula.org | @tinova79 | @OpenNebula


On Fri, Sep 14, 2012 at 12:14 PM, Tino Vazquez <tinova at opennebula.org> wrote:
> Dear Gareth,
>
> It looks like there was a problem with the VM after it was put in
> PENDING. Could you please send us the log of the individual VM
> (/var/lib/one/<vid>/vm.log) to take a look?
>
> It should be possible to resubmit a VM in failed state, what is the
> error message you get?
>
> Regards,
>
> -Tino
>
> --
> Constantino Vázquez Blanco, MSc
> Project Engineer
> OpenNebula - The Open-Source Solution for Data Center Virtualization
> www.OpenNebula.org | @tinova79 | @OpenNebula
>
>
> On Wed, Sep 12, 2012 at 7:02 PM, Gareth de Vaux <opennebula at lordcow.org> wrote:
>> Hi all, how is host fault tolerance supposed to work? If I use the default hook:
>>
>> HOST_HOOK = [
>>     name      = "error",
>>     on        = "ERROR",
>>     command   = "ft/host_error.rb",
>>     arguments = "$HID -r n",
>>     remote    = "no" ]
>>
>> and block network access to a host with a VM running, the host goes into
>> the error state but the VM just goes to pending and then fails:
>>
>> $ onevm list
>>     ID USER     GROUP    NAME         STAT CPU     MEM        HOSTNAME        TIME
>>     10 oneadmin oneadmin BSD          fail   6    512M                    0d 00:46
>>
>>
>> There seems to be some failed attempts at moving the VM, logs included at the
>> end of the mail.
>>
>> How would you recover from this situation manually? You can't resubmit the
>> VM from a failed state, it seems you have to delete and recreate from scratch,
>> even though the image itself is happy:
>>
>> $ oneimage list
>>   ID USER     GROUP    NAME         DATASTORE     SIZE TYPE PER STAT  RVMS
>>    4 oneadmin oneadmin BSD          default         8G   OS Yes used     1
>>
>>
>> Also, how does the host itself recover if the failure involved a reboot?
>> The probes in /var/tmp/one have all been deleted:
>>
>> Wed Sep 12 18:46:49 2012 [InM][I]: Command execution fail: 'if [ -x "/var/tmp/one/im/run_probes" ]; then /var/tmp/one/im/run_probes kvm 23 cumulus; else
>>
>> and 'onehost sync' has permission problems as oneadmin:
>>
>> $ onehost sync
>> /usr/lib/ruby/1.9.1/fileutils.rb:1137:in `utime': Permission denied - /var/lib/one/remotes (Errno::EACCES)
>>         from /usr/lib/ruby/1.9.1/fileutils.rb:1137:in `block in touch'
>>         from /usr/lib/ruby/1.9.1/fileutils.rb:1134:in `each'
>>         from /usr/lib/ruby/1.9.1/fileutils.rb:1134:in `touch'
>>         from /usr/bin/onehost:162:in `block (2 levels) in <main>'
>>         from /usr/lib/one/ruby/cli/command_parser.rb:173:in `call'
>>         from /usr/lib/one/ruby/cli/command_parser.rb:173:in `run'
>>         from /usr/lib/one/ruby/cli/command_parser.rb:79:in `initialize'
>>         from /usr/bin/onehost:34:in `new'
>>         from /usr/bin/onehost:34:in `<main>'
>>
>> though 'onehost sync' addresses /var/lib and not /var/tmp?
>>
>>
>> I'm running opennebula 3.4.1-3.1 on debian wheezy. The oned.log (cirrus is
>> the opennebula controller and arcus is the host being simulated to fail
>> in the first instance):
>>
>> Wed Sep 12 17:56:08 2012 [ReM][D]: AclInfo method invoked
>> Wed Sep 12 17:56:25 2012 [InM][I]: Monitoring host cirrus (21)
>> Wed Sep 12 17:56:25 2012 [InM][I]: Monitoring host stratus (22)
>> Wed Sep 12 17:56:25 2012 [InM][I]: Monitoring host cumulus (23)
>> Wed Sep 12 17:56:25 2012 [InM][I]: Monitoring host nimbus (24)
>> Wed Sep 12 17:56:25 2012 [InM][I]: Monitoring host arcus (25)
>> Wed Sep 12 17:56:27 2012 [VMM][D]: Message received: LOG I 10 Command execution fail: 'if [ -x "/var/tmp/one/vmm/kvm/poll" ]; then /var/tmp/one/vmm/kvm/poll one-10 arcus 10 arcus; else                              exit 42; fi'
>>
>> Wed Sep 12 17:56:27 2012 [VMM][D]: Message received: LOG I 10 ssh: connect to host arcus port 22: Connection timed out
>>
>> Wed Sep 12 17:56:27 2012 [VMM][D]: Message received: LOG I 10 ExitCode: 255
>>
>> Wed Sep 12 17:56:27 2012 [VMM][D]: Message received: POLL FAILURE 10 -
>>
>> Wed Sep 12 17:56:28 2012 [InM][I]: ExitCode: 0
>> Wed Sep 12 17:56:28 2012 [InM][D]: Host 21 successfully monitored.
>> Wed Sep 12 17:56:28 2012 [InM][I]: ExitCode: 0
>> Wed Sep 12 17:56:28 2012 [InM][D]: Host 22 successfully monitored.
>> Wed Sep 12 17:56:28 2012 [InM][I]: ExitCode: 0
>> Wed Sep 12 17:56:28 2012 [InM][D]: Host 23 successfully monitored.
>> Wed Sep 12 17:56:28 2012 [InM][I]: ExitCode: 0
>> Wed Sep 12 17:56:28 2012 [InM][D]: Host 24 successfully monitored.
>> Wed Sep 12 17:56:35 2012 [VMM][I]: Monitoring VM 10.
>> Wed Sep 12 17:56:38 2012 [ReM][D]: HostPoolInfo method invoked
>> Wed Sep 12 17:56:38 2012 [ReM][D]: VirtualMachinePoolInfo method invoked
>> Wed Sep 12 17:56:38 2012 [ReM][D]: AclInfo method invoked
>> Wed Sep 12 17:57:07 2012 [ReM][D]: HostPoolInfo method invoked
>> Wed Sep 12 17:57:07 2012 [ReM][D]: VirtualMachinePoolInfo method invoked
>> Wed Sep 12 17:57:07 2012 [ReM][D]: AclInfo method invoked
>> Wed Sep 12 17:57:20 2012 [ReM][D]: HostPoolInfo method invoked
>> Wed Sep 12 17:57:23 2012 [ReM][D]: VirtualMachinePoolInfo method invoked
>> Wed Sep 12 17:57:28 2012 [InM][I]: Command execution fail: 'if [ -x "/var/tmp/one/im/run_probes" ]; then /var/tmp/one/im/run_probes kvm 25 arcus; else                              exit 42; fi'
>> Wed Sep 12 17:57:28 2012 [InM][I]: ssh: connect to host arcus port 22: Connection timed out
>> Wed Sep 12 17:57:28 2012 [InM][I]: ExitCode: 255
>> Wed Sep 12 17:57:28 2012 [InM][E]: Error monitoring host 25 : MONITOR FAILURE 25 -
>>
>> Wed Sep 12 17:57:28 2012 [ReM][D]: HostInfo method invoked
>> Wed Sep 12 17:57:28 2012 [ReM][D]: VirtualMachinePoolInfo method invoked
>> Wed Sep 12 17:57:28 2012 [ReM][D]: VirtualMachineInfo method invoked
>> Wed Sep 12 17:57:28 2012 [ReM][D]: VirtualMachineAction method invoked
>> Wed Sep 12 17:57:28 2012 [HKM][D]: Message received: LOG I 25 ExitCode: 0
>>
>> Wed Sep 12 17:57:28 2012 [HKM][D]: Message received: EXECUTE SUCCESS 25 error:
>>
>> Wed Sep 12 17:57:29 2012 [TM][D]: Message received: LOG I 10 ExitCode: 0
>>
>> Wed Sep 12 17:57:29 2012 [VMM][D]: Message received: LOG I 10 Driver command for 10 cancelled
>>
>> Wed Sep 12 17:57:36 2012 [ReM][D]: HostPoolInfo method invoked
>> Wed Sep 12 17:57:36 2012 [InM][I]: Monitoring host cirrus (21)
>> Wed Sep 12 17:57:36 2012 [ReM][D]: VirtualMachinePoolInfo method invoked
>> Wed Sep 12 17:57:37 2012 [ReM][D]: VirtualMachinePoolInfo method invoked
>> Wed Sep 12 17:57:37 2012 [InM][I]: Monitoring host stratus (22)
>> Wed Sep 12 17:57:37 2012 [ReM][D]: AclInfo method invoked
>> Wed Sep 12 17:57:37 2012 [ReM][D]: VirtualMachineDeploy method invoked
>> Wed Sep 12 17:57:37 2012 [InM][I]: Monitoring host cumulus (23)
>> Wed Sep 12 17:57:37 2012 [InM][I]: Monitoring host nimbus (24)
>> Wed Sep 12 17:57:37 2012 [DiM][D]: Deploying VM 10
>> Wed Sep 12 17:57:38 2012 [TM][D]: Message received: LOG I 10 Command execution fail: /var/lib/one/remotes/tm/shared/ln cirrus:/var/lib/one/datastores/1/aab3c5409d45f015626af354c827a776 nimbus:/var/lib/one//datastores/0/10/disk.0
>>
>> Wed Sep 12 17:57:38 2012 [TM][D]: Message received: LOG I 10 ln: Linking ../../1/aab3c5409d45f015626af354c827a776 in nimbus:/var/lib/one//datastores/0/10/disk.0
>>
>> Wed Sep 12 17:57:38 2012 [TM][D]: Message received: LOG E 10 ln: Command "cd /var/lib/one/datastores/0/10; ln -s ../../1/aab3c5409d45f015626af354c827a776 /var/lib/one/datastores/0/10/disk.0" failed: ln: failed to create symbolic link `/var/lib/one/datastores/0/10/disk.0': File exists
>>
>> Wed Sep 12 17:57:38 2012 [TM][D]: Message received: LOG E 10 Error linking cirrus:/var/lib/one/datastores/1/aab3c5409d45f015626af354c827a776 to nimbus:/var/lib/one//datastores/0/10/disk.0
>>
>> Wed Sep 12 17:57:38 2012 [TM][D]: Message received: LOG I 10 ExitCode: 1
>>
>> Wed Sep 12 17:57:38 2012 [TM][D]: Message received: TRANSFER FAILURE 10 Error linking cirrus:/var/lib/one/datastores/1/aab3c5409d45f015626af354c827a776 to nimbus:/var/lib/one//datastores/0/10/disk.0
>>
>> Wed Sep 12 17:57:39 2012 [ReM][D]: HostPoolInfo method invoked
>> Wed Sep 12 17:57:40 2012 [InM][I]: ExitCode: 0
>> Wed Sep 12 17:57:40 2012 [InM][D]: Host 21 successfully monitored.
>> Wed Sep 12 17:57:40 2012 [InM][I]: ExitCode: 0
>> Wed Sep 12 17:57:40 2012 [InM][D]: Host 22 successfully monitored.
>> Wed Sep 12 17:57:40 2012 [InM][I]: ExitCode: 0
>> Wed Sep 12 17:57:40 2012 [InM][D]: Host 23 successfully monitored.
>> Wed Sep 12 17:57:40 2012 [InM][I]: ExitCode: 0
>> Wed Sep 12 17:57:40 2012 [InM][D]: Host 24 successfully monitored.
>> Wed Sep 12 17:57:51 2012 [ReM][D]: VirtualMachinePoolInfo method invoked
>> Wed Sep 12 17:58:06 2012 [ReM][D]: HostPoolInfo method invoked
>> Wed Sep 12 17:58:06 2012 [ReM][D]: VirtualMachinePoolInfo method invoked
>> Wed Sep 12 17:58:06 2012 [ReM][D]: AclInfo method invoked
>> Wed Sep 12 17:58:32 2012 [TM][D]: Message received: LOG I 10 Command execution fail: /var/lib/one/remotes/tm/shared/delete arcus:/var/lib/one//datastores/0/10
>>
>> Wed Sep 12 17:58:32 2012 [TM][D]: Message received: LOG I 10 delete: Deleting /var/lib/one/datastores/0/10
>>
>> Wed Sep 12 17:58:32 2012 [TM][D]: Message received: LOG E 10 delete: Command "rm -rf /var/lib/one/datastores/0/10" failed: ssh: connect to host arcus port 22: Connection timed out
>>
>> Wed Sep 12 17:58:32 2012 [TM][D]: Message received: LOG E 10 Error deleting /var/lib/one/datastores/0/10
>>
>> Wed Sep 12 17:58:32 2012 [TM][D]: Message received: LOG I 10 ExitCode: 255
>>
>> Wed Sep 12 17:58:32 2012 [TM][D]: Message received: TRANSFER FAILURE 10 Error deleting /var/lib/one/datastores/0/10
>>
>> Wed Sep 12 17:58:32 2012 [VMM][D]: Message received: LOG I 10 Command execution fail: /var/tmp/one/vmm/kvm/cancel one-10 arcus 10 arcus
>>
>> Wed Sep 12 17:58:32 2012 [VMM][D]: Message received: LOG I 10 ssh: connect to host arcus port 22: Connection timed out
>>
>> Wed Sep 12 17:58:32 2012 [VMM][D]: Message received: LOG I 10 ExitSSHCode: 255
>>
>> Wed Sep 12 17:58:32 2012 [VMM][D]: Message received: LOG E 10 Error connecting to arcus
>>
>> Wed Sep 12 17:58:32 2012 [VMM][D]: Message received: LOG I 10 Failed to execute virtualization driver operation: cancel.
>>
>> Wed Sep 12 17:58:32 2012 [VMM][D]: Message received: CANCEL FAILURE 10 Error connecting to arcus
>>
>> Wed Sep 12 17:58:33 2012 [InM][I]: Monitoring host arcus (25)
>> Wed Sep 12 17:58:35 2012 [ReM][D]: HostPoolInfo method invoked
>> Wed Sep 12 17:58:35 2012 [ReM][D]: VirtualMachinePoolInfo method invoked
>> Wed Sep 12 17:58:35 2012 [ReM][D]: AclInfo method invoked
>> Wed Sep 12 17:58:48 2012 [InM][I]: Monitoring host cirrus (21)
>> Wed Sep 12 17:58:49 2012 [InM][I]: Monitoring host stratus (22)
>> Wed Sep 12 17:58:49 2012 [InM][I]: Monitoring host cumulus (23)
>> Wed Sep 12 17:58:49 2012 [InM][I]: Monitoring host nimbus (24)
>> Wed Sep 12 17:58:52 2012 [InM][I]: ExitCode: 0
>> Wed Sep 12 17:58:52 2012 [InM][D]: Host 21 successfully monitored.
>> Wed Sep 12 17:58:52 2012 [InM][I]: ExitCode: 0
>> Wed Sep 12 17:58:52 2012 [InM][D]: Host 22 successfully monitored.
>> Wed Sep 12 17:58:52 2012 [InM][I]: ExitCode: 0
>> Wed Sep 12 17:58:52 2012 [InM][D]: Host 23 successfully monitored.
>> Wed Sep 12 17:58:52 2012 [InM][I]: ExitCode: 0
>> Wed Sep 12 17:58:52 2012 [InM][D]: Host 24 successfully monitored.
>> Wed Sep 12 17:59:04 2012 [ReM][D]: HostPoolInfo method invoked
>> Wed Sep 12 17:59:04 2012 [ReM][D]: VirtualMachinePoolInfo method invoked
>> Wed Sep 12 17:59:04 2012 [ReM][D]: AclInfo method invoked
>> Wed Sep 12 17:59:32 2012 [ReM][D]: VirtualMachinePoolInfo method invoked
>> Wed Sep 12 17:59:33 2012 [ReM][D]: HostPoolInfo method invoked
>> Wed Sep 12 17:59:33 2012 [ReM][D]: VirtualMachinePoolInfo method invoked
>> Wed Sep 12 17:59:33 2012 [ReM][D]: AclInfo method invoked
>> Wed Sep 12 17:59:34 2012 [ReM][D]: HostPoolInfo method invoked
>> Wed Sep 12 17:59:37 2012 [InM][I]: Command execution fail: 'if [ -x "/var/tmp/one/im/run_probes" ]; then /var/tmp/one/im/run_probes kvm 25 arcus; else                              exit 42; fi'
>> Wed Sep 12 17:59:37 2012 [InM][I]: ssh: connect to host arcus port 22: Connection timed out
>> Wed Sep 12 17:59:37 2012 [InM][I]: ExitCode: 255
>> Wed Sep 12 17:59:37 2012 [InM][E]: Error monitoring host 25 : MONITOR FAILURE 25 -
>>
>> Wed Sep 12 17:59:37 2012 [ReM][D]: HostInfo method invoked
>> Wed Sep 12 17:59:37 2012 [ReM][D]: VirtualMachinePoolInfo method invoked
>> Wed Sep 12 17:59:37 2012 [HKM][D]: Message received: LOG I 25 ExitCode: 0
>>
>> Wed Sep 12 17:59:37 2012 [HKM][D]: Message received: EXECUTE SUCCESS 25 error:
>>
>> Wed Sep 12 17:59:37 2012 [ReM][D]: HostPoolInfo method invoked
>> Wed Sep 12 17:59:59 2012 [InM][I]: Monitoring host cirrus (21)
>> _______________________________________________
>> Users mailing list
>> Users at lists.opennebula.org
>> http://lists.opennebula.org/listinfo.cgi/users-opennebula.org



More information about the Users mailing list