[one-users] VM HA with RBD

Ruben S. Montero rsmontero at opennebula.org
Fri Sep 13 02:37:17 PDT 2013


Hi

The HA hooks are really a template to implement a full HA solution. As
stated in the guide, you can end up with two living VMs, so fencing is
needed for this to work (if you have a fencing mechanism configured, it is
just a matter of triggering it from the hook for the failed host).
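
For example, a minimal sketch of that idea, fencing the failed host over
IPMI before the VMs are resubmitted (the IPMI address and credentials are
placeholders for your environment, and where exactly you call this from
host_error.rb is up to you):

    # Power off the failed host so its old VM instances cannot come back.
    # HOST_IPMI_ADDR, IPMI_USER and IPMI_PASS are placeholders.
    ipmitool -I lanplus -H "$HOST_IPMI_ADDR" -U "$IPMI_USER" -P "$IPMI_PASS" \
        chassis power off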

Regarding your questions:

* Why does ONE try to remove the VM from the failed host? It really makes
no sense, because the host is down and not reachable anymore.

Whenever a VM is removed from a host, OpenNebula tries to delete its
resources there; if this operation fails, the process simply continues, so
there is no harm in trying. You can reduce the SSH timeout to shorten the
downtime.
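
For instance, assuming the drivers connect from the frontend as the
oneadmin user, something like this shortens how long each failed SSH
attempt blocks (5 seconds is just an example value):

    # Run as oneadmin on the frontend; appends to that user's ssh config
    printf 'Host *\n    ConnectTimeout 5\n' >> ~/.ssh/config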


* Why has the VM to be recreated? The disk image lies on a shared storage
(RBD) and should only be started on another host, not recreated.

Any other recovery path would need to contact the failing host, so the only
possible one is to recreate the VM. Note that these operations are agnostic
to the underlying infrastructure, so they should work on RBD as well as on
simple storage shared through SSH copies.

That said, it seems we need to modify the Ceph datastore drivers to check
whether the volume already exists before trying to create a new one, so
that this use case is fully supported:

http://dev.opennebula.org/issues/2324
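
The idea would be something along these lines in the clone path (a rough
sketch only; the image and pool names are taken from your log, the real
driver uses its own variables):

    # Reuse the RBD image if it is already there (e.g. left over from the
    # failed host); otherwise clone the source image as usual.
    if rbd info one/one-5-64-0 >/dev/null 2>&1; then
        echo "one/one-5-64-0 already exists, reusing it" >&2
    else
        rbd copy one/one-5 one/one-5-64-0
    fi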

* The VM now has the state "FAILED". How is the VM supposed to be recovered?

You can try onevm delete --recreate.
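
For example, with the VM ID from your log:

    onevm delete --recreate 64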


On Mon, Sep 9, 2013 at 5:27 PM, Tobias Brunner <tobias at tobru.ch> wrote:

> Hi,
>
> While testing the provided HOST_HOOK "host_error.rb" to have VM High
> Availability with RADOS/RBD as block device backend several questions
> popped up which I was not able to solve:
>
> The configuration is very default-ish:
> HOST_HOOK = [
>     name      = "error",
>     on        = "ERROR",
>     command   = "ft/host_error.rb",
>     arguments = "$ID -r",
>     remote    = "no" ]
>
> And this was my test scenario:
>
> Starting position:
> * A VM running on host1, using RBD as block storage
> * There is another host in the cluster: host2
> * The VM is also able to run on host2 (tested with live migration)
>
> 1. Kill host1 (power off)
>
> 2. After some minutes, oned discovers that the host is down:
>
> Mon Sep  9 17:14:10 2013 [InM][I]: Command execution fail: 'if [ -x
> "/var/tmp/one/im/run_probes" ]; then /var/tmp/one/im/run_probes kvm 2
> host1; else                              exit 42; fi'
> Mon Sep  9 17:14:10 2013 [InM][I]: ssh: connect to host host1 port 22:
> Connection timed out
> Mon Sep  9 17:14:10 2013 [InM][I]: ExitCode: 255
> Mon Sep  9 17:14:10 2013 [ONE][E]: Error monitoring Host host1 (2): -
>
>
> 3. ONE tries to remove the VM from host1:
>
> Mon Sep  9 17:14:15 2013 [VMM][D]: Message received: LOG I 64 Command
> execution fail: /var/tmp/one/vmm/kvm/cancel 'one-64' 'host1' 64 host1
> Mon Sep  9 17:14:15 2013 [VMM][D]: Message received: LOG I 64 ssh: connect
> to host host1 port 22: Connection timed out
> Mon Sep  9 17:14:15 2013 [VMM][D]: Message received: LOG I 64 ExitSSHCode:
> 255
> Mon Sep  9 17:14:15 2013 [VMM][D]: Message received: LOG E 64 Error
> connecting to host1
> Mon Sep  9 17:14:15 2013 [VMM][D]: Message received: LOG I 64 Failed to
> execute virtualization driver operation: cancel.
> Mon Sep  9 17:14:15 2013 [VMM][D]: Message received: LOG I 64 Command
> execution fail: /var/tmp/one/vnm/ovswitch/clean [...]
> Mon Sep  9 17:14:15 2013 [VMM][D]: Message received: LOG I 64 ssh: connect
> to host host1 port 22: Connection timed out
> Mon Sep  9 17:14:15 2013 [VMM][D]: Message received: LOG I 64 ExitSSHCode:
> 255
> Mon Sep  9 17:14:15 2013 [VMM][D]: Message received: LOG E 64 Error
> connecting to host1
> Mon Sep  9 17:14:15 2013 [VMM][D]: Message received: LOG I 64 Failed to
> execute network driver operation: clean.
>
> [...]
>
> Mon Sep  9 17:14:21 2013 [VMM][D]: Message received: LOG I 64 Command
> execution fail: /var/lib/one/remotes/tm/ceph/delete host1:/var/lib/one//
> datastores/0/64/disk.0 64 101
> Mon Sep  9 17:14:21 2013 [VMM][D]: Message received: LOG I 64 delete:
> Deleting /var/lib/one/datastores/0/64/disk.0
> Mon Sep  9 17:14:21 2013 [VMM][D]: Message received: LOG E 64 delete:
> Command "rbd rm one/one-5-64-0" failed: ssh: connect to host host1 port 22:
> Connection timed out
> Mon Sep  9 17:14:21 2013 [VMM][D]: Message received: LOG E 64 Error
> deleting one/one-5-64-0 in host1
> Mon Sep  9 17:14:21 2013 [VMM][D]: Message received: LOG I 64 ExitCode: 255
> Mon Sep  9 17:14:21 2013 [VMM][D]: Message received: LOG I 64 Failed to
> execute transfer manager driver operation: tm_delete.
> Mon Sep  9 17:14:26 2013 [VMM][D]: Message received: LOG I 64 Command
> execution fail: /var/lib/one/remotes/tm/shared/delete
> host1:/var/lib/one//datastores/0/64 64 0
> Mon Sep  9 17:14:26 2013 [VMM][D]: Message received: LOG I 64 delete:
> Deleting /var/lib/one/datastores/0/64
> Mon Sep  9 17:14:26 2013 [VMM][D]: Message received: LOG E 64 delete:
> Command "rm -rf /var/lib/one/datastores/0/64" failed: ssh: connect to host
> host1 port 22: Connection timed out
> Mon Sep  9 17:14:26 2013 [VMM][D]: Message received: LOG E 64 Error
> deleting /var/lib/one/datastores/0/64
> Mon Sep  9 17:14:26 2013 [VMM][D]: Message received: LOG I 64 ExitCode: 255
> Mon Sep  9 17:14:26 2013 [VMM][D]: Message received: LOG I 64 Failed to
> execute transfer manager driver operation: tm_delete.
> Mon Sep  9 17:14:26 2013 [VMM][D]: Message received: CLEANUP SUCCESS 64
>
> 4. ONE tries to deploy the VM which was running on host1 to host2, but
> fails because the RBD volume already exists.
>
> Mon Sep  9 17:14:33 2013 [DiM][D]: Deploying VM 64
> Mon Sep  9 17:14:35 2013 [TM][D]: Message received: LOG I 64 Command
> execution fail: /var/lib/one/remotes/tm/ceph/clone quimby:one/one-5
> uetli2:/var/lib/one//datastores/0/64/disk.0 64 101
> Mon Sep  9 17:14:35 2013 [TM][D]: Message received: LOG E 64 clone:
> Command "rbd copy one/one-5 one/one-5-64-0" faileImage copy: 0%
> complete...failed.
> Mon Sep  9 17:14:35 2013 [TM][D]: Message received: LOG I 64 rbd: copy
> failed: (17) File exists
> Mon Sep  9 17:14:35 2013 [TM][D]: Message received: LOG I 64 2013-09-09
> 17:14:35.466472 7f81463a8780 -1 librbd: rbd image one-5-64-0 already exists
> Mon Sep  9 17:14:35 2013 [TM][D]: Message received: LOG I 64 2013-09-09
> 17:14:35.466500 7f81463a8780 -1 librbd: header creation failed
> Mon Sep  9 17:14:35 2013 [TM][D]: Message received: LOG E 64 Error cloning
> one/one-5 to one/one-5-64-0 in quimby
> Mon Sep  9 17:14:35 2013 [TM][D]: Message received: LOG I 64 ExitCode: 1
> Mon Sep  9 17:14:35 2013 [TM][D]: Message received: TRANSFER FAILURE 64
> Error cloning one/one-5 to one/one-5-64-0 in frontend
>
>
> Now some questions:
> * Why does ONE try to remove the VM from the failed host? It really makes
> no sense, because the host is down and not reachable anymore.
> * Why has the VM to be recreated? The disk image lies on a shared storage
> (RBD) and should only be started on another host, not recreated.
> * The VM now has the state "FAILED". How is the VM supposed to be
> recovered?
>
> Thanks for every clarification on this topic.
>
> Cheers,
> Tobias
>
--
Join us at OpenNebulaConf2013 in Berlin, 24-26 September, 2013
--
Ruben S. Montero, PhD
Project co-Lead and Chief Architect
OpenNebula - The Open Source Solution for Data Center Virtualization
www.OpenNebula.org | rsmontero at opennebula.org | @OpenNebula