Hi Florian, Neil, DuDu,<div><br></div><div>You are quite right: one possible cause of this error message is OpenNebula attempting two simultaneous monitoring actions on the same host. One possible solution is to increase HOST_MONITORING_INTERVAL (in our latest development revision of OpenNebula we have already increased it to 10 minutes, i.e. 600 seconds). And, of course, using the snmp driver has also proved to be a great solution for this scalability issue.</div>
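<div><br></div><div>As a rough sketch of that change, assuming the interval is set in oned.conf and is given in seconds as above (please check the oned.conf shipped with your version for the actual default, and note that oned will most likely need a restart to pick up the new value):</div>
<pre>
```
# oned.conf (sketch): seconds between two monitoring passes of a host
HOST_MONITORING_INTERVAL = 600
```
</pre>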
<div><br></div><div>Hope it helps,</div><div><br></div><div>-Tino</div><div><br></div><div>--<br>Constantino Vázquez Blanco | <a href="http://dsa-research.org/tinova">dsa-research.org/tinova</a><br>Virtualization Technology Engineer / Researcher<br>
OpenNebula Toolkit | <a href="http://opennebula.org">opennebula.org</a><br>
<br><br><div class="gmail_quote">On Mon, Jul 19, 2010 at 6:18 PM, Floris Sluiter <span dir="ltr"><<a href="mailto:Floris.Sluiter@sara.nl">Floris.Sluiter@sara.nl</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">
<div lang="EN-US" link="blue" vlink="purple">
<div>
<p class="MsoNormal"><span style="font-size:11.0pt;color:#1F497D">Hi Dudu, Tino and all,</span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;color:#1F497D"> </span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;color:#1F497D">We saw the exact same messages (</span>Command execution
fail and bad interpreter: Text file busy<span style="font-size:11.0pt;color:#1F497D">) on our cluster last week when
we expanded it from 12 to 16 hosts (with add host) and deployed 10 VMs at
the same time. We did not have multiple instances of OpenNebula running; we
only added hosts to a running one, so it is unlikely that was the issue (the cluster
had already been running stably for a while). We investigated and thought it was a
timing issue caused by the monitoring (ssh) driver being set to 60 seconds while having many
hosts and many VMs. </span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;color:#1F497D">We had started using the ssh monitoring driver again after the
latest update to OpenNebula; before that we used our in-house developed snmp
monitoring driver. </span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;color:#1F497D">When we deployed our snmp driver, the error messages stopped, and
for the last week we have had a stable cloud again, now with 16 hosts…</span></p>
<p><span style="font-size:11.0pt;color:#1F497D">For
people who think they are seeing the same timing issues as we did, the snmp_driver is
available in the ecosystem (but make sure you know what snmp is before you try
;-)): </span><span style="font-size:9.5pt;color:#484848"><a href="http://opennebula.org/software:ecosystem:snmp_im_driver" target="_blank">http://opennebula.org/software:ecosystem:snmp_im_driver</a></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;color:#1F497D">Regards,</span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;color:#1F497D"> </span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;color:#1F497D">Floris </span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;color:#1F497D">HPC project leader</span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;color:#1F497D">Sara</span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;color:#1F497D"> </span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;color:#1F497D"> </span></p>
<div style="border:none;border-top:solid #B5C4DF 1.0pt;padding:3.0pt 0cm 0cm 0cm">
<p class="MsoNormal"><b><span style="font-size:10.0pt">From:</span></b><span style="font-size:10.0pt">
<a href="mailto:users-bounces@lists.opennebula.org" target="_blank">users-bounces@lists.opennebula.org</a> [mailto:<a href="mailto:users-bounces@lists.opennebula.org" target="_blank">users-bounces@lists.opennebula.org</a>] <b>On
Behalf Of </b>Tino Vazquez<br>
<b>Sent:</b> Monday 19 July 2010 16:15<br>
<b>To:</b> DuDu<br>
<b>Cc:</b> <a href="mailto:users@lists.opennebula.org" target="_blank">users@lists.opennebula.org</a><br>
<b>Subject:</b> Re: [one-users] oned hang</span></p>
</div><div><div></div><div class="h5">
<p class="MsoNormal"> </p>
<p class="MsoNormal">Dear DuDu,</p>
<div>
<p class="MsoNormal"> </p>
</div>
<div>
<p class="MsoNormal">This happens when two monitoring actions take place at
the same time.</p>
</div>
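<div>
<p class="MsoNormal">For anyone curious about the mechanism: "Text file busy" is the ETXTBSY error, which Linux returns when a file is executed while some process still has it open for writing. That would fit two overlapping monitoring actions if one is still running "cat &gt; script" while the other is already trying to run the same script. The snippet below is only a minimal sketch of that situation, not OpenNebula code; the function name and the script name are made up for illustration.</p>
</div>
<pre>
```python
import errno
import os
import subprocess
import tempfile


def exec_while_open_for_writing():
    """Try to execute a script that is still open for writing.

    Returns the errno of the failed exec, or None if the exec succeeded.
    """
    with tempfile.TemporaryDirectory() as workdir:
        # Hypothetical file name, loosely modelled on the names in the log.
        script = os.path.join(workdir, "one_im-example")
        writer = open(script, "w")  # plays the role of the running "cat"
        try:
            writer.write("#!/bin/sh\nexit 0\n")
            writer.flush()
            os.chmod(script, 0o755)
            try:
                # Plays the role of the second action executing the script
                # while the first one still holds it open for writing.
                subprocess.run([script], check=False)
            except OSError as exc:
                return exc.errno
            return None
        finally:
            writer.close()


if __name__ == "__main__":
    result = exec_while_open_for_writing()
    print(errno.errorcode.get(result, "exec succeeded"))
```
</pre>
<div>
<p class="MsoNormal">On a typical Linux host this is expected to report ETXTBSY; if the temporary directory is mounted noexec, the exec fails with EACCES instead.</p>
</div>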
<div>
<p class="MsoNormal"> </p>
</div>
<div>
<p class="MsoNormal">First thing, which OpenNebula version are you using?</p>
</div>
<div>
<p class="MsoNormal"> </p>
</div>
<div>
<p class="MsoNormal">Are you by any chance running two OpenNebula instances? Did you
change the host polling time?</p>
</div>
<div>
<p class="MsoNormal"> </p>
</div>
<div>
<p class="MsoNormal">Regards,</p>
</div>
<div>
<p class="MsoNormal"> </p>
</div>
<div>
<p class="MsoNormal">-Tino</p>
</div>
<div>
<p class="MsoNormal" style="margin-bottom:12.0pt"><br clear="all">
--<br>
Constantino Vázquez Blanco | <a href="http://dsa-research.org/tinova" target="_blank">dsa-research.org/tinova</a><br>
Virtualization Technology Engineer / Researcher<br>
OpenNebula Toolkit | <a href="http://opennebula.org" target="_blank">opennebula.org</a><br>
<br>
</p>
<div>
<p class="MsoNormal">On Wed, Jul 14, 2010 at 3:13 PM, DuDu <<a href="mailto:blackass@gmail.com" target="_blank">blackass@gmail.com</a>> wrote:</p>
<div>
<p class="MsoNormal"> </p>
</div>
<div>
<p class="MsoNormal">Hi,</p>
</div>
<div>
<p class="MsoNormal"> </p>
</div>
<div>
<p class="MsoNormal">We deployed a small OpenNebula cluster with 8 hosts. It
is the default OpenNebula installation; however, we found that after several
days of running, oned hung. All CLI commands hang too. No new logs are generated in
one_xmlrpc.log, and there are quite a few error messages like the following in
oned.log:</p>
</div>
<div>
<p class="MsoNormal"> </p>
</div>
<div>
<p class="MsoNormal">[root@vm-container-31-0 logdir]# tail oned.log<br>
Wed Jul 14 14:51:02 2010 [InM][I]: Warning: untrusted X11 forwarding setup
failed: xauth key data not generated<br>
Wed Jul 14 14:51:02 2010 [InM][I]: Warning: No xauth data; using fake
authentication data for X11 forwarding.<br>
Wed Jul 14 14:51:02 2010 [InM][I]: bash:
/tmp/one-im//one_im-c4718299a313d89398ea693104dcce5f: /bin/sh: bad interpreter:
Text file busy<br>
Wed Jul 14 14:51:02 2010 [InM][I]: ExitCode: 126<br>
Wed Jul 14 14:51:02 2010 [InM][I]: Command execution fail: 'mkdir -p
/tmp/one-im/; cat > /tmp/one-im//one_im-f3817715aa24450225bafb4c19b23822; if
[ "x$?" != "x0" ]; then exit -1; fi; chmod +x
/tmp/one-im//one_im-f3817715aa24450225bafb4c19b23822;
/tmp/one-im//one_im-f3817715aa24450225bafb4c19b23822'<br>
Wed Jul 14 14:51:02 2010 [InM][I]: STDERR follows.<br>
Wed Jul 14 14:51:02 2010 [InM][I]: Warning: untrusted X11 forwarding setup
failed: xauth key data not generated<br>
Wed Jul 14 14:51:02 2010 [InM][I]: Warning: No xauth data; using fake
authentication data for X11 forwarding.<br>
Wed Jul 14 14:51:02 2010 [InM][I]: bash:
/tmp/one-im//one_im-f3817715aa24450225bafb4c19b23822: /bin/sh: bad interpreter:
Text file busy<br>
Wed Jul 14 14:51:02 2010 [InM][I]: ExitCode: 126</p>
</div>
<div>
<p class="MsoNormal"> </p>
</div>
<div>
<p class="MsoNormal">We have to SIGKILL oned and restart it, and that solves all
the problems.</p>
</div>
<div>
<p class="MsoNormal"> </p>
</div>
<div>
<p class="MsoNormal">Any ideas about this?</p>
</div>
<div>
<p class="MsoNormal"> </p>
</div>
<div>
<p class="MsoNormal" style="margin-bottom:12.0pt">Thanks!</p>
</div>
</div>
<p class="MsoNormal"> </p>
</div>
</div></div></div>
</div>
</blockquote></div><br></div>