[one-users] Multiple guests running after failed cleanup

Ruben S. Montero rsmontero at opennebula.org
Mon Jan 13 04:11:38 PST 2014


Hi Matthew,

The stock ha-restart scripts need to include a proper fencing mechanism
for the VM hosts. This is needed to prevent the split-brain condition
described in your email.

Simply include the fencing command in the hook (you have the hostname of
the target host in the script, so it should be straightforward). This will
typically reboot the host, shutting down any VMs running on it.
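
As a rough sketch of the idea (not the stock script), the hook could call a
small wrapper that fences the host first and only restarts the VM if fencing
succeeded. FENCE_CMD and its default path, BOOT_CMD, and the function name
fence_then_boot are placeholders made up for this example; the actual
fencing command depends on your hardware (IPMI, PDU, etc.), and how the
hostname is passed in depends on your hook.

```shell
#!/bin/sh
# Sketch only: fence the failed host, then restart the VM.
# FENCE_CMD is a hypothetical, site-specific command that takes a hostname
# and returns 0 only when the host is known to be powered off or rebooted.
FENCE_CMD="${FENCE_CMD:-/usr/local/sbin/fence-host}"
# BOOT_CMD defaults to the command used by the original hook.
BOOT_CMD="${BOOT_CMD:-onevm boot}"

fence_then_boot() {
    vm_id="$1"
    host="$2"

    if [ -z "$vm_id" ] || [ -z "$host" ]; then
        echo "usage: fence_then_boot VM_ID HOSTNAME" >&2
        return 2
    fi

    # If fencing fails we cannot know whether the old instance is still
    # running, so refuse to restart the VM rather than risk two copies.
    if ! $FENCE_CMD "$host"; then
        echo "fencing $host failed; not restarting VM $vm_id" >&2
        return 1
    fi

    # The host is fenced, so it is now safe to restart the VM.
    $BOOT_CMD "$vm_id"
}
```

The important part is the ordering: the restart is skipped whenever the
fencing command reports failure, so an unreachable but still running host
cannot end up with a duplicate guest.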

Cheers

Ruben


On Thu, Jan 9, 2014 at 5:39 PM, Matthew Richardson <m.richardson at ed.ac.uk> wrote:

> Hi,
>
> I'm running a ONE 4.2 pool, and had some issues with it earlier today.
>
> I had some vm hosts lock up due to networking issues, where the vm hosts
> could see the rest of the world, but not be reached by the ONE server.
>
> As a result, the ONE server called a hook script:
>
> VM_HOOK = [ name = "on_crash_boot", on = "UNKNOWN", command =
> "/usr/bin/env onevm boot", arguments = "$ID" ]
>
> This resulted in an attempted cleanup (which appears to fail due to the
> ongoing network problems) followed by a restart elsewhere.  However, the
> failed cleanup meant that I then had 2 instances of the same guest
> running on 2 vm hosts, which led to MAC address conflicts on the network.
>
> Is this a bug in ONE's handling of cleanup failure, or is there
> something else I should be doing in my hook script to ensure that it is
> safe to call onevm boot?
>
> Any advice appreciated! (other than to take better care of the network :) )
>
> thanks,
>
> Matthew
>
>
> oned.log starts as follows:
>
> Thu Jan  9 08:13:07 2014 [InM][I]: Command execution fail: 'if [ -x
> "/var/tmp/one/im/run_probes" ]; then /var/tmp/one/im/run_probes kvm 2
> vmhost3; else                              exit 42; fi'
> Thu Jan  9 08:13:07 2014 [InM][I]: Connection closed by 192.168.12.16
> Thu Jan  9 08:13:07 2014 [InM][I]: ExitCode: 255
> Thu Jan  9 08:13:07 2014 [ONE][E]: Error monitoring Host vmhost3 (2): -
> Thu Jan  9 08:13:07 2014 [ReM][D]: Req:3296 UID:0 VirtualMachineAction
> invoked, "boot", 14
> Thu Jan  9 08:13:07 2014 [DiM][D]: Restarting VM 14
> Thu Jan  9 08:13:07 2014 [ReM][D]: Req:3296 UID:0 VirtualMachineAction
> result SUCCESS, 14
> Thu Jan  9 08:13:07 2014 [HKM][D]: Message received: EXECUTE SUCCESS 14
> on_crash_boot:
>
> Thu Jan  9 08:13:08 2014 [ReM][D]: Req:3328 UID:0 VirtualMachineInfo
> invoked, 14
> Thu Jan  9 08:13:08 2014 [ReM][D]: Req:3328 UID:0 VirtualMachineInfo
> result SUCCESS, "<VM><ID>14</ID><UID>..."
>
> Thu Jan  9 08:13:08 2014 [ReM][D]: Req:9328 UID:0 VirtualMachineAction
> invoked, "delete-recreate", 14
> Thu Jan  9 08:13:08 2014 [ReM][D]: Req:9328 UID:0 VirtualMachineAction
> result SUCCESS, 14
>
> Thu Jan  9 08:13:08 2014 [VMM][D]: Message received: LOG I 14 Driver
> command for 14 cancelled
>
>
>
> The (slightly redacted) guest log (14.log) is as follows:
>
> Thu Jan  9 07:44:53 2014 [LCM][I]: New VM state is RUNNING
> Thu Jan  9 08:13:07 2014 [LCM][I]: New VM state is UNKNOWN
> Thu Jan  9 08:13:07 2014 [LCM][I]: New VM state is BOOT_UNKNOWN
> Thu Jan  9 08:13:07 2014 [HKM][I]: Success executing Hook: on_crash_boot: .
> Thu Jan  9 08:13:07 2014 [VMM][I]: Generating deployment file:
> /var/lib/one/vms/14/deployment.4917
> Thu Jan  9 08:13:08 2014 [LCM][I]: New VM state is CLEANUP.
> Thu Jan  9 08:13:08 2014 [VMM][I]: Driver command for 14 cancelled
> Thu Jan  9 08:18:52 2014 [VMM][I]: Command execution fail:
> /var/tmp/one/vmm/kvm/cancel 'one-14' 'vmhost3' 14 vmhost3
> Thu Jan  9 08:18:52 2014 [VMM][I]: Connection closed by 192.168.12.16
> Thu Jan  9 08:18:52 2014 [VMM][I]: ExitSSHCode: 255
> Thu Jan  9 08:18:52 2014 [VMM][E]: Error connecting to vmhost3
> Thu Jan  9 08:18:52 2014 [VMM][I]: Failed to execute virtualization
> driver operation: cancel.
> Thu Jan  9 08:18:52 2014 [VMM][I]: Command execution fail:
> /var/tmp/one/vnm/dummy/clean <...snip...>
> Thu Jan  9 08:18:52 2014 [VMM][I]: Connection closed by 192.168.12.16
> Thu Jan  9 08:18:52 2014 [VMM][I]: ExitSSHCode: 255
> Thu Jan  9 08:18:52 2014 [VMM][E]: Error connecting to vmhost3
> Thu Jan  9 08:18:52 2014 [VMM][I]: Failed to execute network driver
> operation: clean.
> Thu Jan  9 08:19:01 2014 [VMM][I]: Successfully execute transfer manager
> driver operation: tm_delete.
> Thu Jan  9 08:19:02 2014 [VMM][I]: Successfully execute transfer manager
> driver operation: tm_delete.
> Thu Jan  9 08:19:02 2014 [VMM][I]: Host successfully cleaned.
> Thu Jan  9 08:19:03 2014 [DiM][I]: New VM state is PENDING
> Thu Jan  9 08:20:54 2014 [DiM][I]: New VM state is ACTIVE.
> Thu Jan  9 08:20:54 2014 [LCM][I]: New VM state is PROLOG.
> Thu Jan  9 08:20:54 2014 [VM][I]: Virtual Machine has no context
> Thu Jan  9 08:20:54 2014 [LCM][I]: New VM state is BOOT
> Thu Jan  9 08:20:54 2014 [VMM][I]: Generating deployment file:
> /var/lib/one/vms/14/deployment.4918
> Thu Jan  9 08:20:56 2014 [VMM][I]: ExitCode: 0
> Thu Jan  9 08:20:56 2014 [VMM][I]: Successfully execute network driver
> operation: pre.
> Thu Jan  9 08:20:56 2014 [VMM][I]: ExitCode: 0
> Thu Jan  9 08:20:56 2014 [VMM][I]: Successfully execute virtualization
> driver operation: deploy.
> Thu Jan  9 08:20:56 2014 [VMM][I]: ExitCode: 0
> Thu Jan  9 08:20:56 2014 [VMM][I]: Successfully execute network driver
> operation: post.
> Thu Jan  9 08:20:56 2014 [LCM][I]: New VM state is RUNNING
>
>
>
> --
> The University of Edinburgh is a charitable body, registered in
> Scotland, with registration number SC005336.
>
> --
> Ruben S. Montero, PhD
> Project co-Lead and Chief Architect
> OpenNebula - Flexible Enterprise Cloud Made Simple
> www.OpenNebula.org | rsmontero at opennebula.org | @OpenNebula
>