[one-users] VM HA with RBD
Tobias Brunner
tobias at tobru.ch
Mon Sep 9 08:27:21 PDT 2013
Hi,
While testing the provided HOST_HOOK "host_error.rb" to get VM High
Availability with RADOS/RBD as the block device backend, several questions
came up that I was not able to answer.
The configuration is close to the defaults:
HOST_HOOK = [
    name      = "error",
    on        = "ERROR",
    command   = "ft/host_error.rb",
    arguments = "$ID -r",
    remote    = "no" ]
And this was my test scenario:
Starting position:
* A VM running on host1, using RBD as block storage
* There is another host in the cluster: host2
* The VM is also able to run on host2 (tested with live migration)
1. Kill host1 (power off)
2. After some minutes, oned discovers that the host is down:
Mon Sep 9 17:14:10 2013 [InM][I]: Command execution fail: 'if [ -x
"/var/tmp/one/im/run_probes" ]; then /var/tmp/one/im/run_probes kvm 2
host1; else exit 42; fi'
Mon Sep 9 17:14:10 2013 [InM][I]: ssh: connect to host host1 port 22:
Connection timed out
Mon Sep 9 17:14:10 2013 [InM][I]: ExitCode: 255
Mon Sep 9 17:14:10 2013 [ONE][E]: Error monitoring Host host1 (2): -
3. ONE tries to remove the VM from host1:
Mon Sep 9 17:14:15 2013 [VMM][D]: Message received: LOG I 64 Command
execution fail: /var/tmp/one/vmm/kvm/cancel 'one-64' 'host1' 64 host1
Mon Sep 9 17:14:15 2013 [VMM][D]: Message received: LOG I 64 ssh:
connect to host host1 port 22: Connection timed out
Mon Sep 9 17:14:15 2013 [VMM][D]: Message received: LOG I 64
ExitSSHCode: 255
Mon Sep 9 17:14:15 2013 [VMM][D]: Message received: LOG E 64 Error
connecting to host1
Mon Sep 9 17:14:15 2013 [VMM][D]: Message received: LOG I 64 Failed to
execute virtualization driver operation: cancel.
Mon Sep 9 17:14:15 2013 [VMM][D]: Message received: LOG I 64 Command
execution fail: /var/tmp/one/vnm/ovswitch/clean [...]
Mon Sep 9 17:14:15 2013 [VMM][D]: Message received: LOG I 64 ssh:
connect to host host1 port 22: Connection timed out
Mon Sep 9 17:14:15 2013 [VMM][D]: Message received: LOG I 64
ExitSSHCode: 255
Mon Sep 9 17:14:15 2013 [VMM][D]: Message received: LOG E 64 Error
connecting to host1
Mon Sep 9 17:14:15 2013 [VMM][D]: Message received: LOG I 64 Failed to
execute network driver operation: clean.
[...]
Mon Sep 9 17:14:21 2013 [VMM][D]: Message received: LOG I 64 Command
execution fail: /var/lib/one/remotes/tm/ceph/delete
host1:/var/lib/one//datastores/0/64/disk.0 64 101
Mon Sep 9 17:14:21 2013 [VMM][D]: Message received: LOG I 64 delete:
Deleting /var/lib/one/datastores/0/64/disk.0
Mon Sep 9 17:14:21 2013 [VMM][D]: Message received: LOG E 64 delete:
Command "rbd rm one/one-5-64-0" failed: ssh: connect to host host1 port
22: Connection timed out
Mon Sep 9 17:14:21 2013 [VMM][D]: Message received: LOG E 64 Error
deleting one/one-5-64-0 in host1
Mon Sep 9 17:14:21 2013 [VMM][D]: Message received: LOG I 64 ExitCode:
255
Mon Sep 9 17:14:21 2013 [VMM][D]: Message received: LOG I 64 Failed to
execute transfer manager driver operation: tm_delete.
Mon Sep 9 17:14:26 2013 [VMM][D]: Message received: LOG I 64 Command
execution fail: /var/lib/one/remotes/tm/shared/delete
host1:/var/lib/one//datastores/0/64 64 0
Mon Sep 9 17:14:26 2013 [VMM][D]: Message received: LOG I 64 delete:
Deleting /var/lib/one/datastores/0/64
Mon Sep 9 17:14:26 2013 [VMM][D]: Message received: LOG E 64 delete:
Command "rm -rf /var/lib/one/datastores/0/64" failed: ssh: connect to
host host1 port 22: Connection timed out
Mon Sep 9 17:14:26 2013 [VMM][D]: Message received: LOG E 64 Error
deleting /var/lib/one/datastores/0/64
Mon Sep 9 17:14:26 2013 [VMM][D]: Message received: LOG I 64 ExitCode:
255
Mon Sep 9 17:14:26 2013 [VMM][D]: Message received: LOG I 64 Failed to
execute transfer manager driver operation: tm_delete.
Mon Sep 9 17:14:26 2013 [VMM][D]: Message received: CLEANUP SUCCESS 64
4. ONE tries to deploy the VM that was running on host1 on host2, but this
fails because the RBD volume already exists:
Mon Sep 9 17:14:33 2013 [DiM][D]: Deploying VM 64
Mon Sep 9 17:14:35 2013 [TM][D]: Message received: LOG I 64 Command
execution fail: /var/lib/one/remotes/tm/ceph/clone quimby:one/one-5
uetli2:/var/lib/one//datastores/0/64/disk.0 64 101
Mon Sep 9 17:14:35 2013 [TM][D]: Message received: LOG E 64 clone:
Command "rbd copy one/one-5 one/one-5-64-0" faileImage copy: 0%
complete...failed.
Mon Sep 9 17:14:35 2013 [TM][D]: Message received: LOG I 64 rbd: copy
failed: (17) File exists
Mon Sep 9 17:14:35 2013 [TM][D]: Message received: LOG I 64 2013-09-09
17:14:35.466472 7f81463a8780 -1 librbd: rbd image one-5-64-0 already
exists
Mon Sep 9 17:14:35 2013 [TM][D]: Message received: LOG I 64 2013-09-09
17:14:35.466500 7f81463a8780 -1 librbd: header creation failed
Mon Sep 9 17:14:35 2013 [TM][D]: Message received: LOG E 64 Error
cloning one/one-5 to one/one-5-64-0 in quimby
Mon Sep 9 17:14:35 2013 [TM][D]: Message received: LOG I 64 ExitCode: 1
Mon Sep 9 17:14:35 2013 [TM][D]: Message received: TRANSFER FAILURE 64
Error cloning one/one-5 to one/one-5-64-0 in frontend
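To make the failure in step 4 concrete: the clone runs "rbd copy" against a destination image that the failed delete in step 3 left behind. The following is only a minimal sketch of a guard that checks for the destination first. It is not how the ceph transfer manager driver is implemented; the function names and the fake runner are hypothetical and exist only for illustration. It assumes "rbd info" exits non-zero when the image does not exist.

```python
import subprocess


def rbd_image_exists(image, run=subprocess.call):
    """Return True if `rbd info <image>` exits with status 0."""
    return run(["rbd", "info", image]) == 0


def clone_unless_present(src, dst, run=subprocess.call):
    """Copy src to dst only when dst is missing; report what was done."""
    if rbd_image_exists(dst, run):
        # The volume survived the failed cleanup, so no copy is attempted.
        return "reused"
    if run(["rbd", "copy", src, dst]) != 0:
        raise RuntimeError("rbd copy %s %s failed" % (src, dst))
    return "copied"


def fake_rbd(existing):
    """Hypothetical stand-in for the rbd CLI, backed by a set of image names."""
    def run(cmd):
        if cmd[1] == "info":
            return 0 if cmd[2] in existing else 2
        if cmd[1] == "copy":
            existing.add(cmd[3])
            return 0
        return 1
    return run


# Same situation as in the log: the per-VM image already exists.
images = {"one/one-5", "one/one-5-64-0"}
print(clone_unless_present("one/one-5", "one/one-5-64-0", fake_rbd(images)))
```

With the default runner the same functions would call the real rbd binary. Reusing a leftover volume is only safe if host1 is really down and no longer writing to it.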
Now some questions:
* Why does ONE try to remove the VM from the failed host? That makes no
sense to me, because the host is down and no longer reachable.
* Why does the VM have to be recreated? The disk image resides on shared
storage (RBD), so the VM should only need to be started on another host,
not recreated.
* The VM is now in the state "FAILED". How is it supposed to be
recovered?
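As background for the first question, the cleanup attempts from step 3 can be tallied straight from oned.log. This is a rough sketch, assuming the log keeps the "Failed to execute ... driver operation: ..." wording shown above; the helper name is hypothetical, and the sample is a shortened copy of the lines from step 3.

```python
import re

# Matches the summary line each failed driver call leaves in the log, e.g.
# "Failed to execute transfer manager driver operation: tm_delete."
FAILED_OP = re.compile(r"Failed to execute (.+?) driver operation: (\w+)\.")


def failed_operations(log_text):
    """Return (driver, operation) pairs for every failed driver call."""
    # Long log lines were wrapped in this mail, so collapse whitespace first.
    flat = " ".join(log_text.split())
    return FAILED_OP.findall(flat)


sample = """
Mon Sep 9 17:14:15 2013 [VMM][D]: Message received: LOG I 64 Failed to
execute virtualization driver operation: cancel.
Mon Sep 9 17:14:15 2013 [VMM][D]: Message received: LOG I 64 Failed to
execute network driver operation: clean.
Mon Sep 9 17:14:21 2013 [VMM][D]: Message received: LOG I 64 Failed to
execute transfer manager driver operation: tm_delete.
Mon Sep 9 17:14:26 2013 [VMM][D]: Message received: CLEANUP SUCCESS 64
"""

ops = failed_operations(sample)
cleanup_reported_ok = "CLEANUP SUCCESS" in sample
print(ops)
print(cleanup_reported_ok)
```

For the excerpt above this lists the cancel, clean and tm_delete operations as failed, although the log still ends with CLEANUP SUCCESS for VM 64.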
Thanks for any clarification on this topic.
Cheers,
Tobias