[one-users] fault tolerance
Gareth de Vaux
opennebula at lordcow.org
Wed Sep 12 10:02:57 PDT 2012
Hi all, how is host fault tolerance supposed to work? If I use the default hook:
HOST_HOOK = [
    name      = "error",
    on        = "ERROR",
    command   = "ft/host_error.rb",
    arguments = "$HID -r n",
    remote    = "no" ]
and block network access to a host with a VM running, the host goes into
the error state but the VM just goes to pending and then fails:
$ onevm list
ID USER     GROUP    NAME STAT CPU MEM  HOSTNAME TIME
10 oneadmin oneadmin BSD  fail 6   512M          0d 00:46
There seem to be some failed attempts at moving the VM; the logs are included
at the end of the mail.
How would you recover from this situation manually? You can't resubmit the
VM from a failed state; it seems you have to delete it and recreate it from
scratch, even though the image itself is happy:
$ oneimage list
ID USER     GROUP    NAME DATASTORE SIZE TYPE PER STAT RVMS
4  oneadmin oneadmin BSD  default   8G   OS   Yes used 1
Also, how does the host itself recover if the failure involved a reboot?
The probes in /var/tmp/one have all been deleted:
Wed Sep 12 18:46:49 2012 [InM][I]: Command execution fail: 'if [ -x "/var/tmp/one/im/run_probes" ]; then /var/tmp/one/im/run_probes kvm 23 cumulus; else
and 'onehost sync' has permission problems as oneadmin:
$ onehost sync
/usr/lib/ruby/1.9.1/fileutils.rb:1137:in `utime': Permission denied - /var/lib/one/remotes (Errno::EACCES)
from /usr/lib/ruby/1.9.1/fileutils.rb:1137:in `block in touch'
from /usr/lib/ruby/1.9.1/fileutils.rb:1134:in `each'
from /usr/lib/ruby/1.9.1/fileutils.rb:1134:in `touch'
from /usr/bin/onehost:162:in `block (2 levels) in <main>'
from /usr/lib/one/ruby/cli/command_parser.rb:173:in `call'
from /usr/lib/one/ruby/cli/command_parser.rb:173:in `run'
from /usr/lib/one/ruby/cli/command_parser.rb:79:in `initialize'
from /usr/bin/onehost:34:in `new'
from /usr/bin/onehost:34:in `<main>'
though 'onehost sync' addresses /var/lib and not /var/tmp?
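A "Permission denied" from utime typically means the calling user neither owns
the path nor has write access to it. The following is only a minimal sketch for
checking that on the front-end, not part of OpenNebula; the function name
describe_access is hypothetical, and the path is the one from the traceback.

```python
import os


def describe_access(path):
    """Report the owner uid of a path and whether the current user can write to it."""
    st = os.stat(path)
    return {
        "path": path,
        "owner_uid": st.st_uid,
        "current_uid": os.getuid(),
        "owned_by_current_user": st.st_uid == os.getuid(),
        "writable": os.access(path, os.W_OK),
    }


if __name__ == "__main__":
    # Path taken from the traceback above; run this as oneadmin.
    target = "/var/lib/one/remotes"
    if os.path.exists(target):
        print(describe_access(target))
    else:
        print(target + " does not exist on this machine")
```

If owned_by_current_user and writable both come back False when run as
oneadmin, that would be consistent with the EACCES above.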
I'm running OpenNebula 3.4.1-3.1 on Debian wheezy. The oned.log follows (cirrus
is the OpenNebula controller, and arcus is the host whose failure is being
simulated in the first instance):
Wed Sep 12 17:56:08 2012 [ReM][D]: AclInfo method invoked
Wed Sep 12 17:56:25 2012 [InM][I]: Monitoring host cirrus (21)
Wed Sep 12 17:56:25 2012 [InM][I]: Monitoring host stratus (22)
Wed Sep 12 17:56:25 2012 [InM][I]: Monitoring host cumulus (23)
Wed Sep 12 17:56:25 2012 [InM][I]: Monitoring host nimbus (24)
Wed Sep 12 17:56:25 2012 [InM][I]: Monitoring host arcus (25)
Wed Sep 12 17:56:27 2012 [VMM][D]: Message received: LOG I 10 Command execution fail: 'if [ -x "/var/tmp/one/vmm/kvm/poll" ]; then /var/tmp/one/vmm/kvm/poll one-10 arcus 10 arcus; else exit 42; fi'
Wed Sep 12 17:56:27 2012 [VMM][D]: Message received: LOG I 10 ssh: connect to host arcus port 22: Connection timed out
Wed Sep 12 17:56:27 2012 [VMM][D]: Message received: LOG I 10 ExitCode: 255
Wed Sep 12 17:56:27 2012 [VMM][D]: Message received: POLL FAILURE 10 -
Wed Sep 12 17:56:28 2012 [InM][I]: ExitCode: 0
Wed Sep 12 17:56:28 2012 [InM][D]: Host 21 successfully monitored.
Wed Sep 12 17:56:28 2012 [InM][I]: ExitCode: 0
Wed Sep 12 17:56:28 2012 [InM][D]: Host 22 successfully monitored.
Wed Sep 12 17:56:28 2012 [InM][I]: ExitCode: 0
Wed Sep 12 17:56:28 2012 [InM][D]: Host 23 successfully monitored.
Wed Sep 12 17:56:28 2012 [InM][I]: ExitCode: 0
Wed Sep 12 17:56:28 2012 [InM][D]: Host 24 successfully monitored.
Wed Sep 12 17:56:35 2012 [VMM][I]: Monitoring VM 10.
Wed Sep 12 17:56:38 2012 [ReM][D]: HostPoolInfo method invoked
Wed Sep 12 17:56:38 2012 [ReM][D]: VirtualMachinePoolInfo method invoked
Wed Sep 12 17:56:38 2012 [ReM][D]: AclInfo method invoked
Wed Sep 12 17:57:07 2012 [ReM][D]: HostPoolInfo method invoked
Wed Sep 12 17:57:07 2012 [ReM][D]: VirtualMachinePoolInfo method invoked
Wed Sep 12 17:57:07 2012 [ReM][D]: AclInfo method invoked
Wed Sep 12 17:57:20 2012 [ReM][D]: HostPoolInfo method invoked
Wed Sep 12 17:57:23 2012 [ReM][D]: VirtualMachinePoolInfo method invoked
Wed Sep 12 17:57:28 2012 [InM][I]: Command execution fail: 'if [ -x "/var/tmp/one/im/run_probes" ]; then /var/tmp/one/im/run_probes kvm 25 arcus; else exit 42; fi'
Wed Sep 12 17:57:28 2012 [InM][I]: ssh: connect to host arcus port 22: Connection timed out
Wed Sep 12 17:57:28 2012 [InM][I]: ExitCode: 255
Wed Sep 12 17:57:28 2012 [InM][E]: Error monitoring host 25 : MONITOR FAILURE 25 -
Wed Sep 12 17:57:28 2012 [ReM][D]: HostInfo method invoked
Wed Sep 12 17:57:28 2012 [ReM][D]: VirtualMachinePoolInfo method invoked
Wed Sep 12 17:57:28 2012 [ReM][D]: VirtualMachineInfo method invoked
Wed Sep 12 17:57:28 2012 [ReM][D]: VirtualMachineAction method invoked
Wed Sep 12 17:57:28 2012 [HKM][D]: Message received: LOG I 25 ExitCode: 0
Wed Sep 12 17:57:28 2012 [HKM][D]: Message received: EXECUTE SUCCESS 25 error:
Wed Sep 12 17:57:29 2012 [TM][D]: Message received: LOG I 10 ExitCode: 0
Wed Sep 12 17:57:29 2012 [VMM][D]: Message received: LOG I 10 Driver command for 10 cancelled
Wed Sep 12 17:57:36 2012 [ReM][D]: HostPoolInfo method invoked
Wed Sep 12 17:57:36 2012 [InM][I]: Monitoring host cirrus (21)
Wed Sep 12 17:57:36 2012 [ReM][D]: VirtualMachinePoolInfo method invoked
Wed Sep 12 17:57:37 2012 [ReM][D]: VirtualMachinePoolInfo method invoked
Wed Sep 12 17:57:37 2012 [InM][I]: Monitoring host stratus (22)
Wed Sep 12 17:57:37 2012 [ReM][D]: AclInfo method invoked
Wed Sep 12 17:57:37 2012 [ReM][D]: VirtualMachineDeploy method invoked
Wed Sep 12 17:57:37 2012 [InM][I]: Monitoring host cumulus (23)
Wed Sep 12 17:57:37 2012 [InM][I]: Monitoring host nimbus (24)
Wed Sep 12 17:57:37 2012 [DiM][D]: Deploying VM 10
Wed Sep 12 17:57:38 2012 [TM][D]: Message received: LOG I 10 Command execution fail: /var/lib/one/remotes/tm/shared/ln cirrus:/var/lib/one/datastores/1/aab3c5409d45f015626af354c827a776 nimbus:/var/lib/one//datastores/0/10/disk.0
Wed Sep 12 17:57:38 2012 [TM][D]: Message received: LOG I 10 ln: Linking ../../1/aab3c5409d45f015626af354c827a776 in nimbus:/var/lib/one//datastores/0/10/disk.0
Wed Sep 12 17:57:38 2012 [TM][D]: Message received: LOG E 10 ln: Command "cd /var/lib/one/datastores/0/10; ln -s ../../1/aab3c5409d45f015626af354c827a776 /var/lib/one/datastores/0/10/disk.0" failed: ln: failed to create symbolic link `/var/lib/one/datastores/0/10/disk.0': File exists
Wed Sep 12 17:57:38 2012 [TM][D]: Message received: LOG E 10 Error linking cirrus:/var/lib/one/datastores/1/aab3c5409d45f015626af354c827a776 to nimbus:/var/lib/one//datastores/0/10/disk.0
Wed Sep 12 17:57:38 2012 [TM][D]: Message received: LOG I 10 ExitCode: 1
Wed Sep 12 17:57:38 2012 [TM][D]: Message received: TRANSFER FAILURE 10 Error linking cirrus:/var/lib/one/datastores/1/aab3c5409d45f015626af354c827a776 to nimbus:/var/lib/one//datastores/0/10/disk.0
Wed Sep 12 17:57:39 2012 [ReM][D]: HostPoolInfo method invoked
Wed Sep 12 17:57:40 2012 [InM][I]: ExitCode: 0
Wed Sep 12 17:57:40 2012 [InM][D]: Host 21 successfully monitored.
Wed Sep 12 17:57:40 2012 [InM][I]: ExitCode: 0
Wed Sep 12 17:57:40 2012 [InM][D]: Host 22 successfully monitored.
Wed Sep 12 17:57:40 2012 [InM][I]: ExitCode: 0
Wed Sep 12 17:57:40 2012 [InM][D]: Host 23 successfully monitored.
Wed Sep 12 17:57:40 2012 [InM][I]: ExitCode: 0
Wed Sep 12 17:57:40 2012 [InM][D]: Host 24 successfully monitored.
Wed Sep 12 17:57:51 2012 [ReM][D]: VirtualMachinePoolInfo method invoked
Wed Sep 12 17:58:06 2012 [ReM][D]: HostPoolInfo method invoked
Wed Sep 12 17:58:06 2012 [ReM][D]: VirtualMachinePoolInfo method invoked
Wed Sep 12 17:58:06 2012 [ReM][D]: AclInfo method invoked
Wed Sep 12 17:58:32 2012 [TM][D]: Message received: LOG I 10 Command execution fail: /var/lib/one/remotes/tm/shared/delete arcus:/var/lib/one//datastores/0/10
Wed Sep 12 17:58:32 2012 [TM][D]: Message received: LOG I 10 delete: Deleting /var/lib/one/datastores/0/10
Wed Sep 12 17:58:32 2012 [TM][D]: Message received: LOG E 10 delete: Command "rm -rf /var/lib/one/datastores/0/10" failed: ssh: connect to host arcus port 22: Connection timed out
Wed Sep 12 17:58:32 2012 [TM][D]: Message received: LOG E 10 Error deleting /var/lib/one/datastores/0/10
Wed Sep 12 17:58:32 2012 [TM][D]: Message received: LOG I 10 ExitCode: 255
Wed Sep 12 17:58:32 2012 [TM][D]: Message received: TRANSFER FAILURE 10 Error deleting /var/lib/one/datastores/0/10
Wed Sep 12 17:58:32 2012 [VMM][D]: Message received: LOG I 10 Command execution fail: /var/tmp/one/vmm/kvm/cancel one-10 arcus 10 arcus
Wed Sep 12 17:58:32 2012 [VMM][D]: Message received: LOG I 10 ssh: connect to host arcus port 22: Connection timed out
Wed Sep 12 17:58:32 2012 [VMM][D]: Message received: LOG I 10 ExitSSHCode: 255
Wed Sep 12 17:58:32 2012 [VMM][D]: Message received: LOG E 10 Error connecting to arcus
Wed Sep 12 17:58:32 2012 [VMM][D]: Message received: LOG I 10 Failed to execute virtualization driver operation: cancel.
Wed Sep 12 17:58:32 2012 [VMM][D]: Message received: CANCEL FAILURE 10 Error connecting to arcus
Wed Sep 12 17:58:33 2012 [InM][I]: Monitoring host arcus (25)
Wed Sep 12 17:58:35 2012 [ReM][D]: HostPoolInfo method invoked
Wed Sep 12 17:58:35 2012 [ReM][D]: VirtualMachinePoolInfo method invoked
Wed Sep 12 17:58:35 2012 [ReM][D]: AclInfo method invoked
Wed Sep 12 17:58:48 2012 [InM][I]: Monitoring host cirrus (21)
Wed Sep 12 17:58:49 2012 [InM][I]: Monitoring host stratus (22)
Wed Sep 12 17:58:49 2012 [InM][I]: Monitoring host cumulus (23)
Wed Sep 12 17:58:49 2012 [InM][I]: Monitoring host nimbus (24)
Wed Sep 12 17:58:52 2012 [InM][I]: ExitCode: 0
Wed Sep 12 17:58:52 2012 [InM][D]: Host 21 successfully monitored.
Wed Sep 12 17:58:52 2012 [InM][I]: ExitCode: 0
Wed Sep 12 17:58:52 2012 [InM][D]: Host 22 successfully monitored.
Wed Sep 12 17:58:52 2012 [InM][I]: ExitCode: 0
Wed Sep 12 17:58:52 2012 [InM][D]: Host 23 successfully monitored.
Wed Sep 12 17:58:52 2012 [InM][I]: ExitCode: 0
Wed Sep 12 17:58:52 2012 [InM][D]: Host 24 successfully monitored.
Wed Sep 12 17:59:04 2012 [ReM][D]: HostPoolInfo method invoked
Wed Sep 12 17:59:04 2012 [ReM][D]: VirtualMachinePoolInfo method invoked
Wed Sep 12 17:59:04 2012 [ReM][D]: AclInfo method invoked
Wed Sep 12 17:59:32 2012 [ReM][D]: VirtualMachinePoolInfo method invoked
Wed Sep 12 17:59:33 2012 [ReM][D]: HostPoolInfo method invoked
Wed Sep 12 17:59:33 2012 [ReM][D]: VirtualMachinePoolInfo method invoked
Wed Sep 12 17:59:33 2012 [ReM][D]: AclInfo method invoked
Wed Sep 12 17:59:34 2012 [ReM][D]: HostPoolInfo method invoked
Wed Sep 12 17:59:37 2012 [InM][I]: Command execution fail: 'if [ -x "/var/tmp/one/im/run_probes" ]; then /var/tmp/one/im/run_probes kvm 25 arcus; else exit 42; fi'
Wed Sep 12 17:59:37 2012 [InM][I]: ssh: connect to host arcus port 22: Connection timed out
Wed Sep 12 17:59:37 2012 [InM][I]: ExitCode: 255
Wed Sep 12 17:59:37 2012 [InM][E]: Error monitoring host 25 : MONITOR FAILURE 25 -
Wed Sep 12 17:59:37 2012 [ReM][D]: HostInfo method invoked
Wed Sep 12 17:59:37 2012 [ReM][D]: VirtualMachinePoolInfo method invoked
Wed Sep 12 17:59:37 2012 [HKM][D]: Message received: LOG I 25 ExitCode: 0
Wed Sep 12 17:59:37 2012 [HKM][D]: Message received: EXECUTE SUCCESS 25 error:
Wed Sep 12 17:59:37 2012 [ReM][D]: HostPoolInfo method invoked
Wed Sep 12 17:59:59 2012 [InM][I]: Monitoring host cirrus (21)
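Most of the log above is routine monitoring, so the failures are easy to miss.
A minimal sketch for pulling out only the "<ACTION> FAILURE <id>" messages is
below. It assumes the line layout seen in this log (timestamp, [module][level],
message); the function name find_failures is hypothetical. In this log the id
appears to be the host ID for InM lines and the VM ID for VMM and TM lines.

```python
import re

# Layout assumed from the log above: "<timestamp> [MODULE][LEVEL]: <message>"
LINE_RE = re.compile(
    r"^(?P<ts>\w{3} \w{3} +\d{1,2} \d{2}:\d{2}:\d{2} \d{4}) "
    r"\[(?P<module>\w+)\]\[(?P<level>\w)\]: (?P<msg>.*)$"
)
# Driver results such as "POLL FAILURE 10" or "MONITOR FAILURE 25"
FAILURE_RE = re.compile(r"\b(?P<action>[A-Z]+) FAILURE (?P<id>\d+)\b")


def find_failures(lines):
    """Return (timestamp, module, action, id) for every FAILURE message found."""
    failures = []
    for line in lines:
        parsed = LINE_RE.match(line.strip())
        if not parsed:
            continue  # skip anything that does not look like an oned.log line
        hit = FAILURE_RE.search(parsed.group("msg"))
        if hit:
            failures.append(
                (
                    parsed.group("ts"),
                    parsed.group("module"),
                    hit.group("action"),
                    int(hit.group("id")),
                )
            )
    return failures


if __name__ == "__main__":
    sample = [
        "Wed Sep 12 17:56:27 2012 [VMM][D]: Message received: POLL FAILURE 10 -",
        "Wed Sep 12 17:56:28 2012 [InM][D]: Host 21 successfully monitored.",
        "Wed Sep 12 17:57:28 2012 [InM][E]: Error monitoring host 25 : MONITOR FAILURE 25 -",
    ]
    for failure in find_failures(sample):
        print(failure)
```

Run against this log, such a filter would leave the POLL, MONITOR, TRANSFER and
CANCEL failures and drop the ReM and successful InM lines.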