[one-users] fault tolerance

Tino Vazquez tinova at opennebula.org
Fri Sep 14 03:14:37 PDT 2012


Dear Gareth,

It looks like there was a problem with the VM after it was put in
PENDING. Could you please send us the log of the individual VM
(/var/lib/one/<vid>/vm.log) to take a look?

It should be possible to resubmit a VM in failed state, what is the
error message you get?

Regards,

-Tino

--
Constantino Vázquez Blanco, MSc
Project Engineer
OpenNebula - The Open-Source Solution for Data Center Virtualization
www.OpenNebula.org | @tinova79 | @OpenNebula


On Wed, Sep 12, 2012 at 7:02 PM, Gareth de Vaux <opennebula at lordcow.org> wrote:
> Hi all, how is host fault tolerance supposed to work? If I use the default hook:
>
> HOST_HOOK = [
>     name      = "error",
>     on        = "ERROR",
>     command   = "ft/host_error.rb",
>     arguments = "$HID -r n",
>     remote    = "no" ]
>
> and block network access to a host with a VM running, the host goes into
> the error state but the VM just goes to pending and then fails:
>
> $ onevm list
>     ID USER     GROUP    NAME         STAT CPU     MEM        HOSTNAME        TIME
>     10 oneadmin oneadmin BSD          fail   6    512M                    0d 00:46
>
>
> There seems to be some failed attempts at moving the VM, logs included at the
> end of the mail.
>
> How would you recover from this situation manually? You can't resubmit the
> VM from a failed state, it seems you have to delete and recreate from scratch,
> even though the image itself is happy:
>
> $ oneimage list
>   ID USER     GROUP    NAME         DATASTORE     SIZE TYPE PER STAT  RVMS
>    4 oneadmin oneadmin BSD          default         8G   OS Yes used     1
>
>
> Also, how does the host itself recover if the failure involved a reboot?
> The probes in /var/tmp/one have all been deleted:
>
> Wed Sep 12 18:46:49 2012 [InM][I]: Command execution fail: 'if [ -x "/var/tmp/one/im/run_probes" ]; then /var/tmp/one/im/run_probes kvm 23 cumulus; else
>
> and 'onehost sync' has permission problems as oneadmin:
>
> $ onehost sync
> /usr/lib/ruby/1.9.1/fileutils.rb:1137:in `utime': Permission denied - /var/lib/one/remotes (Errno::EACCES)
>         from /usr/lib/ruby/1.9.1/fileutils.rb:1137:in `block in touch'
>         from /usr/lib/ruby/1.9.1/fileutils.rb:1134:in `each'
>         from /usr/lib/ruby/1.9.1/fileutils.rb:1134:in `touch'
>         from /usr/bin/onehost:162:in `block (2 levels) in <main>'
>         from /usr/lib/one/ruby/cli/command_parser.rb:173:in `call'
>         from /usr/lib/one/ruby/cli/command_parser.rb:173:in `run'
>         from /usr/lib/one/ruby/cli/command_parser.rb:79:in `initialize'
>         from /usr/bin/onehost:34:in `new'
>         from /usr/bin/onehost:34:in `<main>'
>
> though 'onehost sync' addresses /var/lib and not /var/tmp?
>
>
> I'm running opennebula 3.4.1-3.1 on debian wheezy. The oned.log (cirrus is
> the opennebula controller and arcus is the host being simulated to fail
> in the first instance):
>
> Wed Sep 12 17:56:08 2012 [ReM][D]: AclInfo method invoked
> Wed Sep 12 17:56:25 2012 [InM][I]: Monitoring host cirrus (21)
> Wed Sep 12 17:56:25 2012 [InM][I]: Monitoring host stratus (22)
> Wed Sep 12 17:56:25 2012 [InM][I]: Monitoring host cumulus (23)
> Wed Sep 12 17:56:25 2012 [InM][I]: Monitoring host nimbus (24)
> Wed Sep 12 17:56:25 2012 [InM][I]: Monitoring host arcus (25)
> Wed Sep 12 17:56:27 2012 [VMM][D]: Message received: LOG I 10 Command execution fail: 'if [ -x "/var/tmp/one/vmm/kvm/poll" ]; then /var/tmp/one/vmm/kvm/poll one-10 arcus 10 arcus; else                              exit 42; fi'
>
> Wed Sep 12 17:56:27 2012 [VMM][D]: Message received: LOG I 10 ssh: connect to host arcus port 22: Connection timed out
>
> Wed Sep 12 17:56:27 2012 [VMM][D]: Message received: LOG I 10 ExitCode: 255
>
> Wed Sep 12 17:56:27 2012 [VMM][D]: Message received: POLL FAILURE 10 -
>
> Wed Sep 12 17:56:28 2012 [InM][I]: ExitCode: 0
> Wed Sep 12 17:56:28 2012 [InM][D]: Host 21 successfully monitored.
> Wed Sep 12 17:56:28 2012 [InM][I]: ExitCode: 0
> Wed Sep 12 17:56:28 2012 [InM][D]: Host 22 successfully monitored.
> Wed Sep 12 17:56:28 2012 [InM][I]: ExitCode: 0
> Wed Sep 12 17:56:28 2012 [InM][D]: Host 23 successfully monitored.
> Wed Sep 12 17:56:28 2012 [InM][I]: ExitCode: 0
> Wed Sep 12 17:56:28 2012 [InM][D]: Host 24 successfully monitored.
> Wed Sep 12 17:56:35 2012 [VMM][I]: Monitoring VM 10.
> Wed Sep 12 17:56:38 2012 [ReM][D]: HostPoolInfo method invoked
> Wed Sep 12 17:56:38 2012 [ReM][D]: VirtualMachinePoolInfo method invoked
> Wed Sep 12 17:56:38 2012 [ReM][D]: AclInfo method invoked
> Wed Sep 12 17:57:07 2012 [ReM][D]: HostPoolInfo method invoked
> Wed Sep 12 17:57:07 2012 [ReM][D]: VirtualMachinePoolInfo method invoked
> Wed Sep 12 17:57:07 2012 [ReM][D]: AclInfo method invoked
> Wed Sep 12 17:57:20 2012 [ReM][D]: HostPoolInfo method invoked
> Wed Sep 12 17:57:23 2012 [ReM][D]: VirtualMachinePoolInfo method invoked
> Wed Sep 12 17:57:28 2012 [InM][I]: Command execution fail: 'if [ -x "/var/tmp/one/im/run_probes" ]; then /var/tmp/one/im/run_probes kvm 25 arcus; else                              exit 42; fi'
> Wed Sep 12 17:57:28 2012 [InM][I]: ssh: connect to host arcus port 22: Connection timed out
> Wed Sep 12 17:57:28 2012 [InM][I]: ExitCode: 255
> Wed Sep 12 17:57:28 2012 [InM][E]: Error monitoring host 25 : MONITOR FAILURE 25 -
>
> Wed Sep 12 17:57:28 2012 [ReM][D]: HostInfo method invoked
> Wed Sep 12 17:57:28 2012 [ReM][D]: VirtualMachinePoolInfo method invoked
> Wed Sep 12 17:57:28 2012 [ReM][D]: VirtualMachineInfo method invoked
> Wed Sep 12 17:57:28 2012 [ReM][D]: VirtualMachineAction method invoked
> Wed Sep 12 17:57:28 2012 [HKM][D]: Message received: LOG I 25 ExitCode: 0
>
> Wed Sep 12 17:57:28 2012 [HKM][D]: Message received: EXECUTE SUCCESS 25 error:
>
> Wed Sep 12 17:57:29 2012 [TM][D]: Message received: LOG I 10 ExitCode: 0
>
> Wed Sep 12 17:57:29 2012 [VMM][D]: Message received: LOG I 10 Driver command for 10 cancelled
>
> Wed Sep 12 17:57:36 2012 [ReM][D]: HostPoolInfo method invoked
> Wed Sep 12 17:57:36 2012 [InM][I]: Monitoring host cirrus (21)
> Wed Sep 12 17:57:36 2012 [ReM][D]: VirtualMachinePoolInfo method invoked
> Wed Sep 12 17:57:37 2012 [ReM][D]: VirtualMachinePoolInfo method invoked
> Wed Sep 12 17:57:37 2012 [InM][I]: Monitoring host stratus (22)
> Wed Sep 12 17:57:37 2012 [ReM][D]: AclInfo method invoked
> Wed Sep 12 17:57:37 2012 [ReM][D]: VirtualMachineDeploy method invoked
> Wed Sep 12 17:57:37 2012 [InM][I]: Monitoring host cumulus (23)
> Wed Sep 12 17:57:37 2012 [InM][I]: Monitoring host nimbus (24)
> Wed Sep 12 17:57:37 2012 [DiM][D]: Deploying VM 10
> Wed Sep 12 17:57:38 2012 [TM][D]: Message received: LOG I 10 Command execution fail: /var/lib/one/remotes/tm/shared/ln cirrus:/var/lib/one/datastores/1/aab3c5409d45f015626af354c827a776 nimbus:/var/lib/one//datastores/0/10/disk.0
>
> Wed Sep 12 17:57:38 2012 [TM][D]: Message received: LOG I 10 ln: Linking ../../1/aab3c5409d45f015626af354c827a776 in nimbus:/var/lib/one//datastores/0/10/disk.0
>
> Wed Sep 12 17:57:38 2012 [TM][D]: Message received: LOG E 10 ln: Command "cd /var/lib/one/datastores/0/10; ln -s ../../1/aab3c5409d45f015626af354c827a776 /var/lib/one/datastores/0/10/disk.0" failed: ln: failed to create symbolic link `/var/lib/one/datastores/0/10/disk.0': File exists
>
> Wed Sep 12 17:57:38 2012 [TM][D]: Message received: LOG E 10 Error linking cirrus:/var/lib/one/datastores/1/aab3c5409d45f015626af354c827a776 to nimbus:/var/lib/one//datastores/0/10/disk.0
>
> Wed Sep 12 17:57:38 2012 [TM][D]: Message received: LOG I 10 ExitCode: 1
>
> Wed Sep 12 17:57:38 2012 [TM][D]: Message received: TRANSFER FAILURE 10 Error linking cirrus:/var/lib/one/datastores/1/aab3c5409d45f015626af354c827a776 to nimbus:/var/lib/one//datastores/0/10/disk.0
>
> Wed Sep 12 17:57:39 2012 [ReM][D]: HostPoolInfo method invoked
> Wed Sep 12 17:57:40 2012 [InM][I]: ExitCode: 0
> Wed Sep 12 17:57:40 2012 [InM][D]: Host 21 successfully monitored.
> Wed Sep 12 17:57:40 2012 [InM][I]: ExitCode: 0
> Wed Sep 12 17:57:40 2012 [InM][D]: Host 22 successfully monitored.
> Wed Sep 12 17:57:40 2012 [InM][I]: ExitCode: 0
> Wed Sep 12 17:57:40 2012 [InM][D]: Host 23 successfully monitored.
> Wed Sep 12 17:57:40 2012 [InM][I]: ExitCode: 0
> Wed Sep 12 17:57:40 2012 [InM][D]: Host 24 successfully monitored.
> Wed Sep 12 17:57:51 2012 [ReM][D]: VirtualMachinePoolInfo method invoked
> Wed Sep 12 17:58:06 2012 [ReM][D]: HostPoolInfo method invoked
> Wed Sep 12 17:58:06 2012 [ReM][D]: VirtualMachinePoolInfo method invoked
> Wed Sep 12 17:58:06 2012 [ReM][D]: AclInfo method invoked
> Wed Sep 12 17:58:32 2012 [TM][D]: Message received: LOG I 10 Command execution fail: /var/lib/one/remotes/tm/shared/delete arcus:/var/lib/one//datastores/0/10
>
> Wed Sep 12 17:58:32 2012 [TM][D]: Message received: LOG I 10 delete: Deleting /var/lib/one/datastores/0/10
>
> Wed Sep 12 17:58:32 2012 [TM][D]: Message received: LOG E 10 delete: Command "rm -rf /var/lib/one/datastores/0/10" failed: ssh: connect to host arcus port 22: Connection timed out
>
> Wed Sep 12 17:58:32 2012 [TM][D]: Message received: LOG E 10 Error deleting /var/lib/one/datastores/0/10
>
> Wed Sep 12 17:58:32 2012 [TM][D]: Message received: LOG I 10 ExitCode: 255
>
> Wed Sep 12 17:58:32 2012 [TM][D]: Message received: TRANSFER FAILURE 10 Error deleting /var/lib/one/datastores/0/10
>
> Wed Sep 12 17:58:32 2012 [VMM][D]: Message received: LOG I 10 Command execution fail: /var/tmp/one/vmm/kvm/cancel one-10 arcus 10 arcus
>
> Wed Sep 12 17:58:32 2012 [VMM][D]: Message received: LOG I 10 ssh: connect to host arcus port 22: Connection timed out
>
> Wed Sep 12 17:58:32 2012 [VMM][D]: Message received: LOG I 10 ExitSSHCode: 255
>
> Wed Sep 12 17:58:32 2012 [VMM][D]: Message received: LOG E 10 Error connecting to arcus
>
> Wed Sep 12 17:58:32 2012 [VMM][D]: Message received: LOG I 10 Failed to execute virtualization driver operation: cancel.
>
> Wed Sep 12 17:58:32 2012 [VMM][D]: Message received: CANCEL FAILURE 10 Error connecting to arcus
>
> Wed Sep 12 17:58:33 2012 [InM][I]: Monitoring host arcus (25)
> Wed Sep 12 17:58:35 2012 [ReM][D]: HostPoolInfo method invoked
> Wed Sep 12 17:58:35 2012 [ReM][D]: VirtualMachinePoolInfo method invoked
> Wed Sep 12 17:58:35 2012 [ReM][D]: AclInfo method invoked
> Wed Sep 12 17:58:48 2012 [InM][I]: Monitoring host cirrus (21)
> Wed Sep 12 17:58:49 2012 [InM][I]: Monitoring host stratus (22)
> Wed Sep 12 17:58:49 2012 [InM][I]: Monitoring host cumulus (23)
> Wed Sep 12 17:58:49 2012 [InM][I]: Monitoring host nimbus (24)
> Wed Sep 12 17:58:52 2012 [InM][I]: ExitCode: 0
> Wed Sep 12 17:58:52 2012 [InM][D]: Host 21 successfully monitored.
> Wed Sep 12 17:58:52 2012 [InM][I]: ExitCode: 0
> Wed Sep 12 17:58:52 2012 [InM][D]: Host 22 successfully monitored.
> Wed Sep 12 17:58:52 2012 [InM][I]: ExitCode: 0
> Wed Sep 12 17:58:52 2012 [InM][D]: Host 23 successfully monitored.
> Wed Sep 12 17:58:52 2012 [InM][I]: ExitCode: 0
> Wed Sep 12 17:58:52 2012 [InM][D]: Host 24 successfully monitored.
> Wed Sep 12 17:59:04 2012 [ReM][D]: HostPoolInfo method invoked
> Wed Sep 12 17:59:04 2012 [ReM][D]: VirtualMachinePoolInfo method invoked
> Wed Sep 12 17:59:04 2012 [ReM][D]: AclInfo method invoked
> Wed Sep 12 17:59:32 2012 [ReM][D]: VirtualMachinePoolInfo method invoked
> Wed Sep 12 17:59:33 2012 [ReM][D]: HostPoolInfo method invoked
> Wed Sep 12 17:59:33 2012 [ReM][D]: VirtualMachinePoolInfo method invoked
> Wed Sep 12 17:59:33 2012 [ReM][D]: AclInfo method invoked
> Wed Sep 12 17:59:34 2012 [ReM][D]: HostPoolInfo method invoked
> Wed Sep 12 17:59:37 2012 [InM][I]: Command execution fail: 'if [ -x "/var/tmp/one/im/run_probes" ]; then /var/tmp/one/im/run_probes kvm 25 arcus; else                              exit 42; fi'
> Wed Sep 12 17:59:37 2012 [InM][I]: ssh: connect to host arcus port 22: Connection timed out
> Wed Sep 12 17:59:37 2012 [InM][I]: ExitCode: 255
> Wed Sep 12 17:59:37 2012 [InM][E]: Error monitoring host 25 : MONITOR FAILURE 25 -
>
> Wed Sep 12 17:59:37 2012 [ReM][D]: HostInfo method invoked
> Wed Sep 12 17:59:37 2012 [ReM][D]: VirtualMachinePoolInfo method invoked
> Wed Sep 12 17:59:37 2012 [HKM][D]: Message received: LOG I 25 ExitCode: 0
>
> Wed Sep 12 17:59:37 2012 [HKM][D]: Message received: EXECUTE SUCCESS 25 error:
>
> Wed Sep 12 17:59:37 2012 [ReM][D]: HostPoolInfo method invoked
> Wed Sep 12 17:59:59 2012 [InM][I]: Monitoring host cirrus (21)
> _______________________________________________
> Users mailing list
> Users at lists.opennebula.org
> http://lists.opennebula.org/listinfo.cgi/users-opennebula.org



More information about the Users mailing list