[one-users] VM HA with RBD

Tobias Brunner tobias at tobru.ch
Mon Sep 9 08:27:21 PDT 2013


Hi,

While testing the provided HOST_HOOK "host_error.rb" to get VM High 
Availability with RADOS/RBD as the block device backend, several 
questions came up that I was not able to answer:

The configuration is close to the default:
HOST_HOOK = [
     name      = "error",
     on        = "ERROR",
     command   = "ft/host_error.rb",
     arguments = "$ID -r",
     remote    = "no" ]

And this was my test scenario:

Starting position:
* A VM running on host1, using RBD as block storage
* There is another host in the cluster: host2
* The VM is also able to run on host2 (tested with live migration)

1. Kill host1 (power off)

2. After some minutes, oned discovers that the host is down:

Mon Sep  9 17:14:10 2013 [InM][I]: Command execution fail: 'if [ -x 
"/var/tmp/one/im/run_probes" ]; then /var/tmp/one/im/run_probes kvm 2 
host1; else                              exit 42; fi'
Mon Sep  9 17:14:10 2013 [InM][I]: ssh: connect to host host1 port 22: 
Connection timed out
Mon Sep  9 17:14:10 2013 [InM][I]: ExitCode: 255
Mon Sep  9 17:14:10 2013 [ONE][E]: Error monitoring Host host1 (2): -


3. ONE tries to remove the VM from host1:

Mon Sep  9 17:14:15 2013 [VMM][D]: Message received: LOG I 64 Command 
execution fail: /var/tmp/one/vmm/kvm/cancel 'one-64' 'host1' 64 host1
Mon Sep  9 17:14:15 2013 [VMM][D]: Message received: LOG I 64 ssh: 
connect to host host1 port 22: Connection timed out
Mon Sep  9 17:14:15 2013 [VMM][D]: Message received: LOG I 64 
ExitSSHCode: 255
Mon Sep  9 17:14:15 2013 [VMM][D]: Message received: LOG E 64 Error 
connecting to host1
Mon Sep  9 17:14:15 2013 [VMM][D]: Message received: LOG I 64 Failed to 
execute virtualization driver operation: cancel.
Mon Sep  9 17:14:15 2013 [VMM][D]: Message received: LOG I 64 Command 
execution fail: /var/tmp/one/vnm/ovswitch/clean [...]
Mon Sep  9 17:14:15 2013 [VMM][D]: Message received: LOG I 64 ssh: 
connect to host host1 port 22: Connection timed out
Mon Sep  9 17:14:15 2013 [VMM][D]: Message received: LOG I 64 
ExitSSHCode: 255
Mon Sep  9 17:14:15 2013 [VMM][D]: Message received: LOG E 64 Error 
connecting to host1
Mon Sep  9 17:14:15 2013 [VMM][D]: Message received: LOG I 64 Failed to 
execute network driver operation: clean.

[...]

Mon Sep  9 17:14:21 2013 [VMM][D]: Message received: LOG I 64 Command 
execution fail: /var/lib/one/remotes/tm/ceph/delete 
host1:/var/lib/one//datastores/0/64/disk.0 64 101
Mon Sep  9 17:14:21 2013 [VMM][D]: Message received: LOG I 64 delete: 
Deleting /var/lib/one/datastores/0/64/disk.0
Mon Sep  9 17:14:21 2013 [VMM][D]: Message received: LOG E 64 delete: 
Command "rbd rm one/one-5-64-0" failed: ssh: connect to host host1 port 
22: Connection timed out
Mon Sep  9 17:14:21 2013 [VMM][D]: Message received: LOG E 64 Error 
deleting one/one-5-64-0 in host1
Mon Sep  9 17:14:21 2013 [VMM][D]: Message received: LOG I 64 ExitCode: 
255
Mon Sep  9 17:14:21 2013 [VMM][D]: Message received: LOG I 64 Failed to 
execute transfer manager driver operation: tm_delete.
Mon Sep  9 17:14:26 2013 [VMM][D]: Message received: LOG I 64 Command 
execution fail: /var/lib/one/remotes/tm/shared/delete 
host1:/var/lib/one//datastores/0/64 64 0
Mon Sep  9 17:14:26 2013 [VMM][D]: Message received: LOG I 64 delete: 
Deleting /var/lib/one/datastores/0/64
Mon Sep  9 17:14:26 2013 [VMM][D]: Message received: LOG E 64 delete: 
Command "rm -rf /var/lib/one/datastores/0/64" failed: ssh: connect to 
host host1 port 22: Connection timed out
Mon Sep  9 17:14:26 2013 [VMM][D]: Message received: LOG E 64 Error 
deleting /var/lib/one/datastores/0/64
Mon Sep  9 17:14:26 2013 [VMM][D]: Message received: LOG I 64 ExitCode: 
255
Mon Sep  9 17:14:26 2013 [VMM][D]: Message received: LOG I 64 Failed to 
execute transfer manager driver operation: tm_delete.
Mon Sep  9 17:14:26 2013 [VMM][D]: Message received: CLEANUP SUCCESS 64
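
Because the log excerpts above are wrapped by the mail client, the failed driver operations are hard to pick out. The following is a minimal sketch, not part of OpenNebula, that rejoins the wrapped records and lists the failed operations. The names `unwrap` and `failed_operations` are hypothetical helpers, and the regular expressions only assume the record format visible in the excerpts above.

```python
import re

# Timestamp prefix of a log record, e.g. "Mon Sep  9 17:14:26 2013 [VMM][D]:".
RECORD_START = re.compile(
    r"^[A-Z][a-z]{2} [A-Z][a-z]{2} [ \d]\d \d{2}:\d{2}:\d{2} \d{4} \["
)
# "Failed to execute <driver> driver operation: <op>." as quoted above.
FAILED_OP = re.compile(r"Failed to execute (.+?) driver operation: (\w+)")


def unwrap(lines):
    """Rejoin records that were wrapped onto continuation lines."""
    records = []
    for raw in lines:
        line = raw.strip()
        if not line or line == "[...]":
            continue  # skip blank lines and elision markers
        if RECORD_START.match(line) or not records:
            records.append(line)
        else:
            records[-1] += " " + line
    return records


def failed_operations(records):
    """Return (driver, operation) pairs for every failed driver operation."""
    found = []
    for record in records:
        match = FAILED_OP.search(record)
        if match:
            found.append((match.group(1), match.group(2)))
    return found


sample = [
    "Mon Sep  9 17:14:21 2013 [VMM][D]: Message received: LOG I 64 Failed to ",
    "execute transfer manager driver operation: tm_delete.",
    "Mon Sep  9 17:14:26 2013 [VMM][D]: Message received: CLEANUP SUCCESS 64",
]
records = unwrap(sample)
print(failed_operations(records))  # [('transfer manager', 'tm_delete')]
```

Run over the excerpt from step 3, this would show that cancel, clean and both tm_delete operations failed before CLEANUP SUCCESS was reported.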

4. ONE tries to deploy the VM that was running on host1 to host2, but 
this fails because the RBD volume already exists:

Mon Sep  9 17:14:33 2013 [DiM][D]: Deploying VM 64
Mon Sep  9 17:14:35 2013 [TM][D]: Message received: LOG I 64 Command 
execution fail: /var/lib/one/remotes/tm/ceph/clone quimby:one/one-5 
uetli2:/var/lib/one//datastores/0/64/disk.0 64 101
Mon Sep  9 17:14:35 2013 [TM][D]: Message received: LOG E 64 clone: 
Command "rbd copy one/one-5 one/one-5-64-0" faileImage copy: 0% 
complete...failed.
Mon Sep  9 17:14:35 2013 [TM][D]: Message received: LOG I 64 rbd: copy 
failed: (17) File exists
Mon Sep  9 17:14:35 2013 [TM][D]: Message received: LOG I 64 2013-09-09 
17:14:35.466472 7f81463a8780 -1 librbd: rbd image one-5-64-0 already 
exists
Mon Sep  9 17:14:35 2013 [TM][D]: Message received: LOG I 64 2013-09-09 
17:14:35.466500 7f81463a8780 -1 librbd: header creation failed
Mon Sep  9 17:14:35 2013 [TM][D]: Message received: LOG E 64 Error 
cloning one/one-5 to one/one-5-64-0 in quimby
Mon Sep  9 17:14:35 2013 [TM][D]: Message received: LOG I 64 ExitCode: 1
Mon Sep  9 17:14:35 2013 [TM][D]: Message received: TRANSFER FAILURE 64 
Error cloning one/one-5 to one/one-5-64-0 in frontend
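
To make the failure in step 4 concrete: "rbd rm one/one-5-64-0" could not run in step 3, so the later "rbd copy" hits an existing image. The sketch below shows the kind of existence check that would avoid this. It is an illustration only, not how the shipped tm/ceph/clone script works; `plan_clone` is a hypothetical name, and the set of existing images is assumed to come from something like the output of "rbd ls <pool>".

```python
def plan_clone(src, dst, existing_images):
    """Decide how to provide the RBD image `dst` (given as "pool/image").

    `existing_images` is the set of image names already present in the
    pool (assumption: gathered beforehand, e.g. from `rbd ls <pool>`).
    Returns the argv list for the copy, or None if the image can be reused.
    """
    _pool, _, image = dst.partition("/")
    if image in existing_images:
        # Leftover from the failed cleanup on the dead host: reuse it.
        return None
    return ["rbd", "copy", src, dst]


# State after step 3: the old image was never removed.
print(plan_clone("one/one-5", "one/one-5-64-0", {"one-5", "one-5-64-0"}))
# Fresh deployment: the destination image does not exist yet.
print(plan_clone("one/one-5", "one/one-5-64-0", {"one-5"}))
```

Reusing the leftover image is only safe if host1 is really powered off. If it were merely unreachable, two instances of the VM could end up writing to the same image.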


Now some questions:
* Why does ONE try to remove the VM from the failed host? This makes no 
sense, because the host is down and no longer reachable.
* Why does the VM have to be recreated? The disk image lives on shared 
storage (RBD), so the VM should only need to be started on another host, 
not recreated.
* The VM is now in the state "FAILED". How is it supposed to be 
recovered?

Thanks for any clarification on this topic.

Cheers,
Tobias


