<div dir="ltr">Hi,<div><br></div><div>The HA hooks are really a template for implementing a full HA solution. As stated in the guide, you can end up with two living VMs, so fencing is needed for this to work (if you have a fencing mechanism configured, it is just a matter of triggering it from the hook for the failed host).</div>
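<div><br></div><div>As a rough illustration of triggering fencing from the hook, here is a minimal Python sketch. It is not part of host_error.rb (which is a Ruby script); the fencing command passed as fence_argv and the recover callback are hypothetical placeholders for whatever your site uses. The only point is the ordering: VMs are recovered only after fencing has been confirmed.</div>
<pre>
```python
import subprocess


def fence_then_recover(host, fence_argv, recover, run=subprocess.run):
    """Fence `host` first; call `recover(host)` only if fencing succeeded.

    fence_argv is a hypothetical site-specific command (list of strings);
    the host name is appended as its last argument.
    """
    result = run(list(fence_argv) + [host])
    if result.returncode != 0:
        # Fencing could not be confirmed, so the VM may still be alive on
        # the host. Restarting it elsewhere could leave two living VMs.
        return False
    recover(host)
    return True
```
</pre>
<div>The run parameter exists only so the command execution can be replaced in a dry run; by default the command is really executed.</div>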


<div><br></div><div>Regarding your questions:</div><div><br></div><div><div>* Why does ONE try to remove the VM from the failed host? It really makes no sense, because the host is down and not reachable anymore.</div><div>


<br></div><div>Whenever a VM is removed from a host, OpenNebula tries to delete its resources there. If this operation fails, the process simply continues, so attempting it does no harm. You can lower the SSH timeout to shorten the downtime.</div>
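<div><br></div><div>To make the timeout point concrete: each failed operation in the log waits for "ssh: connect to host host1 port 22: Connection timed out". The sketch below only shows the standard OpenSSH client options that bound that wait (ConnectTimeout, BatchMode). How the OpenNebula drivers pass options to ssh is not shown in this thread, so ssh_argv is a hypothetical helper, not driver code; the same option can also be set in the SSH client configuration.</div>
<pre>
```python
def ssh_argv(host, command, connect_timeout=5):
    """Build an ssh command line that gives up connecting after N seconds.

    Hypothetical helper for illustration only. ConnectTimeout and BatchMode
    are standard OpenSSH client options.
    """
    return [
        "ssh",
        "-o", "BatchMode=yes",  # never prompt for a password
        "-o", "ConnectTimeout=%d" % connect_timeout,
        host,
        command,
    ]
```
</pre>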


<div><br></div><div><br></div><div>* Why has the VM to be recreated? The disk image lies on a shared storage (RBD) and should only be started on another host, not recreated.</div><div><br></div><div>Any other process would try to contact the failing host, so the only possible path is to recreate the VM. Note that these operations are agnostic of the underlying infrastructure, so this should work on RBD as well as on simple storage shared through SSH copies.</div>


<div><br></div><div>That said, it seems that we need to modify the Ceph datastore to check whether the volume exists before trying to create a new one, so that this use case is fully supported.</div><div><br></div><div><a href="http://dev.opennebula.org/issues/2324" target="_blank">http://dev.opennebula.org/issues/2324</a><br>
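<div><br></div><div>The existence check described above could look roughly like the following Python sketch. This is not the actual change tracked in the issue (the Ceph drivers are shell scripts); it only illustrates the idea, assuming that "rbd info" exits with a non-zero status when the image does not exist. The function names are made up for this example; the "rbd copy" call is the one from your log.</div>
<pre>
```python
import subprocess


def rbd_image_exists(image, run=subprocess.run):
    """Return True if `rbd info IMAGE` exits with status 0."""
    result = run(
        ["rbd", "info", image],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def clone_unless_present(src, dst, run=subprocess.run):
    """Copy src to dst only when dst does not exist yet.

    Returns "reused" if dst was already there, "copied" otherwise.
    """
    if rbd_image_exists(dst, run=run):
        # The volume survived the host failure on shared storage.
        return "reused"
    run(["rbd", "copy", src, dst], check=True)
    return "copied"
```
</pre>
<div>For example, clone_unless_present("one/one-5", "one/one-5-64-0") would skip the copy in your scenario. Note that reusing a volume is only appropriate when it really is the VM's own disk left over from the failed host; the sketch does not verify that.</div>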


</div><div><br></div><div>* The VM now has the state "FAILED". How is the VM supposed to be recovered?</div></div><div><br></div><div>You can try onevm delete --recreate.</div><div><br></div><div><br></div><div><br></div>


<div><br></div><div class="gmail_extra"><br><br><div class="gmail_quote">On Mon, Sep 9, 2013 at 5:27 PM, Tobias Brunner <span dir="ltr">&lt;<a href="mailto:tobias@tobru.ch" target="_blank">tobias@tobru.ch</a>&gt;</span> wrote:<br>


<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hi,<br>

<br>

While testing the provided HOST_HOOK "host_error.rb" to have VM High Availability with RADOS/RBD as block device backend several questions popped up which I was not able to solve:<br>

<br>

The configuration is very default-ish:<br>

HOST_HOOK = [<br>

    name      = "error",<br>

    on        = "ERROR",<br>

    command   = "ft/host_error.rb",<br>

    arguments = "$ID -r",<br>

    remote    = "no" ]<br>

<br>

And this was my test scenario:<br>

<br>

Starting position:<br>

* A VM running on host1, using RBD as block storage<br>

* There is another host in the cluster: host2<br>

* The VM is also able to run on host2 (tested with live migration)<br>

<br>

1. Kill host1 (power off)<br>

<br>

2. After some minutes, oned discovers that the host is down:<br>

<br>

Mon Sep  9 17:14:10 2013 [InM][I]: Command execution fail: 'if [ -x "/var/tmp/one/im/run_probes" ]; then /var/tmp/one/im/run_probes kvm 2 host1; else exit 42; fi'<br>

Mon Sep  9 17:14:10 2013 [InM][I]: ssh: connect to host host1 port 22: Connection timed out<br>

Mon Sep  9 17:14:10 2013 [InM][I]: ExitCode: 255<br>

Mon Sep  9 17:14:10 2013 [ONE][E]: Error monitoring Host host1 (2): -<br>

<br>

<br>

3. ONE tries to remove the VM from host1:<br>

<br>

Mon Sep  9 17:14:15 2013 [VMM][D]: Message received: LOG I 64 Command execution fail: /var/tmp/one/vmm/kvm/cancel 'one-64' 'host1' 64 host1<br>

Mon Sep  9 17:14:15 2013 [VMM][D]: Message received: LOG I 64 ssh: connect to host host1 port 22: Connection timed out<br>

Mon Sep  9 17:14:15 2013 [VMM][D]: Message received: LOG I 64 ExitSSHCode: 255<br>

Mon Sep  9 17:14:15 2013 [VMM][D]: Message received: LOG E 64 Error connecting to host1<br>

Mon Sep  9 17:14:15 2013 [VMM][D]: Message received: LOG I 64 Failed to execute virtualization driver operation: cancel.<br>

Mon Sep  9 17:14:15 2013 [VMM][D]: Message received: LOG I 64 Command execution fail: /var/tmp/one/vnm/ovswitch/clean [...]<br>

Mon Sep  9 17:14:15 2013 [VMM][D]: Message received: LOG I 64 ssh: connect to host host1 port 22: Connection timed out<br>

Mon Sep  9 17:14:15 2013 [VMM][D]: Message received: LOG I 64 ExitSSHCode: 255<br>

Mon Sep  9 17:14:15 2013 [VMM][D]: Message received: LOG E 64 Error connecting to host1<br>

Mon Sep  9 17:14:15 2013 [VMM][D]: Message received: LOG I 64 Failed to execute network driver operation: clean.<br>

<br>

[...]<br>

<br>

Mon Sep  9 17:14:21 2013 [VMM][D]: Message received: LOG I 64 Command execution fail: /var/lib/one/remotes/tm/ceph/delete host1:/var/lib/one//datastores/0/64/disk.0 64 101<br>

Mon Sep  9 17:14:21 2013 [VMM][D]: Message received: LOG I 64 delete: Deleting /var/lib/one/datastores/0/64/disk.0<br>

Mon Sep  9 17:14:21 2013 [VMM][D]: Message received: LOG E 64 delete: Command "rbd rm one/one-5-64-0" failed: ssh: connect to host host1 port 22: Connection timed out<br>

Mon Sep  9 17:14:21 2013 [VMM][D]: Message received: LOG E 64 Error deleting one/one-5-64-0 in host1<br>

Mon Sep  9 17:14:21 2013 [VMM][D]: Message received: LOG I 64 ExitCode: 255<br>

Mon Sep  9 17:14:21 2013 [VMM][D]: Message received: LOG I 64 Failed to execute transfer manager driver operation: tm_delete.<br>

Mon Sep  9 17:14:26 2013 [VMM][D]: Message received: LOG I 64 Command execution fail: /var/lib/one/remotes/tm/shared/delete host1:/var/lib/one//datastores/0/64 64 0<br>

Mon Sep  9 17:14:26 2013 [VMM][D]: Message received: LOG I 64 delete: Deleting /var/lib/one/datastores/0/64<br>

Mon Sep  9 17:14:26 2013 [VMM][D]: Message received: LOG E 64 delete: Command "rm -rf /var/lib/one/datastores/0/64" failed: ssh: connect to host host1 port 22: Connection timed out<br>

Mon Sep  9 17:14:26 2013 [VMM][D]: Message received: LOG E 64 Error deleting /var/lib/one/datastores/0/64<br>

Mon Sep  9 17:14:26 2013 [VMM][D]: Message received: LOG I 64 ExitCode: 255<br>

Mon Sep  9 17:14:26 2013 [VMM][D]: Message received: LOG I 64 Failed to execute transfer manager driver operation: tm_delete.<br>

Mon Sep  9 17:14:26 2013 [VMM][D]: Message received: CLEANUP SUCCESS 64<br>

<br>

4. ONE tries to deploy the VM which was running on host1 to host2, but fails because the RBD volume already exists.<br>

<br>

Mon Sep  9 17:14:33 2013 [DiM][D]: Deploying VM 64<br>

Mon Sep  9 17:14:35 2013 [TM][D]: Message received: LOG I 64 Command execution fail: /var/lib/one/remotes/tm/ceph/clone quimby:one/one-5 uetli2:/var/lib/one//datastores/0/64/disk.0 64 101<br>

Mon Sep  9 17:14:35 2013 [TM][D]: Message received: LOG E 64 clone: Command "rbd copy one/one-5 one/one-5-64-0" failed: Image copy: 0% complete...failed.<br>

Mon Sep  9 17:14:35 2013 [TM][D]: Message received: LOG I 64 rbd: copy failed: (17) File exists<br>

Mon Sep  9 17:14:35 2013 [TM][D]: Message received: LOG I 64 2013-09-09 17:14:35.466472 7f81463a8780 -1 librbd: rbd image one-5-64-0 already exists<br>

Mon Sep  9 17:14:35 2013 [TM][D]: Message received: LOG I 64 2013-09-09 17:14:35.466500 7f81463a8780 -1 librbd: header creation failed<br>

Mon Sep  9 17:14:35 2013 [TM][D]: Message received: LOG E 64 Error cloning one/one-5 to one/one-5-64-0 in quimby<br>

Mon Sep  9 17:14:35 2013 [TM][D]: Message received: LOG I 64 ExitCode: 1<br>

Mon Sep  9 17:14:35 2013 [TM][D]: Message received: TRANSFER FAILURE 64 Error cloning one/one-5 to one/one-5-64-0 in frontend<br>

<br>

<br>

Now some questions:<br>

* Why does ONE try to remove the VM from the failed host? It really makes no sense, because the host is down and not reachable anymore.<br>

* Why has the VM to be recreated? The disk image lies on a shared storage (RBD) and should only be started on another host, not recreated.<br>

* The VM now has the state "FAILED". How is the VM supposed to be recovered?<br>

<br>

Thanks for every clarification on this topic.<br>

<br>

Cheers,<br>

Tobias<br>

<br>

_______________________________________________<br>

Users mailing list<br>

<a href="mailto:Users@lists.opennebula.org" target="_blank">Users@lists.opennebula.org</a><br>

<a href="http://lists.opennebula.org/listinfo.cgi/users-opennebula.org" target="_blank">http://lists.opennebula.org/listinfo.cgi/users-opennebula.org</a><br clear="all"><div><br></div>-- <br><div dir="ltr"><div>Join us at OpenNebulaConf2013 in Berlin, 24-26 September, 2013</div><div>-- </div>Ruben S. Montero, PhD<br>
Project co-Lead and Chief Architect<br>OpenNebula - The Open Source Solution for Data Center Virtualization<br><a href="http://www.OpenNebula.org" target="_blank">www.OpenNebula.org</a> | <a href="mailto:rsmontero@opennebula.org" target="_blank">rsmontero@opennebula.org</a> | @OpenNebula</div>


</blockquote></div></div></div>