[one-users] Monitor continually cycles through finding machines RUNNING and stat UNKNOWN - Possibly solved

Gerry O'Brien gerry at scss.tcd.ie
Mon Jan 20 09:18:56 PST 2014


Hi,

     I think we've figured out the cause of the issues reported above; 
they are particular to our installation.

     All our hosts use an NFS-mounted root partition. The reasons for 
using this approach are historical; it was meant to make it easier to 
keep the hosts equally up to date.
     The problem was that /tmp was therefore the same for every host. 
collectd-client_control.sh writes the PID of collectd-client.rb to 
/tmp, so it could not find the PID of the already running 
collectd-client.rb and kept starting new instances.
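The collision can be sketched in shell. This is an illustration, not the 
actual collectd-client_control.sh: the file name, function name, and path 
are made up. The point is that keying the PID file on the hostname keeps 
hosts that share /tmp over NFS from clobbering or misreading each other's 
PID files.

```shell
# Hypothetical per-host PID file; with a plain /tmp/collectd-client.pid,
# every NFS-root host would read and overwrite the same file.
PIDFILE="/tmp/collectd-client.$(hostname).pid"

start_daemon() {
    # Refuse to start a second copy if the recorded PID is still alive.
    if [ -f "$PIDFILE" ] && kill -0 "$(cat "$PIDFILE")" 2>/dev/null; then
        echo "already running (pid $(cat "$PIDFILE"))"
        return 1
    fi
    echo "$$" > "$PIDFILE"   # the real script would store the ruby PID here
    echo "started"
}

start_daemon
```

With a shared PID file, the liveness check above tests a PID belonging to 
some other host, so it never matches and a new daemon is spawned on every 
monitoring cycle, which is exactly the pile-up of collectd-client.rb 
processes shown further down.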

     My guess is that the DNS issue is related to the explicit use of 
the hostname in "ruby /var/tmp/one/im/kvm.d/collectd-client.rb kvm 
/var/lib/one//datastores 4124 20 3 host104.scss.tcd.ie". This seems to 
have changed since 4.2.
     The multiple copies of collectd-client.rb only exacerbated the 
problem. As we have a single hosts file shared by every host, the 
solution was to place DNS entries for all hosts in /etc/hosts.
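For illustration, entries like these can be generated once for the shared 
hosts file; the address range and host numbering below are made up, not 
our real ones:

```shell
# Sketch: emit /etc/hosts-style entries for a block of hosts so the
# monitoring probes resolve names locally instead of hammering the DNS
# server on every probe run. Addresses and names are illustrative only.
hostsfile=$(mktemp)
for n in 101 102 103 104; do
    printf '10.0.0.%s\thost%s.scss.tcd.ie host%s\n' "$n" "$n" "$n"
done > "$hostsfile"
cat "$hostsfile"
```

As long as /etc/nsswitch.conf consults files before dns (the common 
default), lookups of these names never leave the host.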

         Regards,
           Gerry


On 20/01/2014 15:15, Javier Fontan wrote:
> The problem seems to be the high amount of collectd processes running.
> Try killing all "collectd-client.rb" processes. There should be only
> one running per host.
>
> In case you want to use the old method of monitoring you can follow this guide:
>
> http://docs.opennebula.org/stable/administration/monitoring/imsshpullg.html#imsshpullg
>
> On Mon, Jan 20, 2014 at 2:17 PM, Gerry O'Brien <gerry at scss.tcd.ie> wrote:
>> Hi Ruben,
>>
>>      Below is the output of 'ps -ef | grep one' on a host that has been
>> disabled, rebooted and enabled. There are multiple instances of
>> collectd-client.rb running.
>>
>>
>>      We discovered today a serious issue that is having an adverse
>> effect on our DNS system. When the machine below was enabled, our DNS
>> server was immediately flooded with requests from the host (see a sample
>> below).
>>       Our logs show that this only started happening after the upgrade to
>> 4.4. If we don't get a fix for this we will have to go back to 4.2, which is
>> something I really don't want to do.
>>
>>          Regards,
>>              Gerry
>>
>>
>>
>>
>> oneadmin  3628     1  0 13:04 ?        00:00:00 ruby
>> /var/tmp/one/im/kvm.d/collectd-client.rb kvm /var/lib/one//datastores 4124
>> 20 0 host101.scss.tcd.ie
>> oneadmin  4600     1  0 13:05 ?        00:00:00 ruby
>> /var/tmp/one/im/kvm.d/collectd-client.rb kvm /var/lib/one//datastores 4124
>> 20 0 host101.scss.tcd.ie
>> oneadmin  6400     1  0 13:07 ?        00:00:00 ruby
>> /var/tmp/one/im/kvm.d/collectd-client.rb kvm /var/lib/one//datastores 4124
>> 20 0 host101.scss.tcd.ie
>> oneadmin  9003     1  0 13:08 ?        00:00:00 ruby
>> /var/tmp/one/im/kvm.d/collectd-client.rb kvm /var/lib/one//datastores 4124
>> 20 0 host101.scss.tcd.ie
>> oneadmin 12953  3628  0 13:10 ?        00:00:00 /bin/bash
>> /var/tmp/one/im/kvm.d/../run_probes kvm-probes /var/lib/one//datastores 4124
>> 20 0 host101.scss.tcd.ie
>> oneadmin 12955  6400  0 13:10 ?        00:00:00 /bin/bash
>> /var/tmp/one/im/kvm.d/../run_probes kvm-probes /var/lib/one//datastores 4124
>> 20 0 host101.scss.tcd.ie
>> oneadmin 12969 12953  0 13:10 ?        00:00:00 /bin/bash
>> /var/tmp/one/im/kvm.d/../run_probes kvm-probes /var/lib/one//datastores 4124
>> 20 0 host101.scss.tcd.ie
>> oneadmin 12970 12969  0 13:10 ?        00:00:00 /bin/bash
>> /var/tmp/one/im/kvm.d/../run_probes kvm-probes /var/lib/one//datastores 4124
>> 20 0 host101.scss.tcd.ie
>> oneadmin 12972 12955  0 13:10 ?        00:00:00 /bin/bash
>> /var/tmp/one/im/kvm.d/../run_probes kvm-probes /var/lib/one//datastores 4124
>> 20 0 host101.scss.tcd.ie
>> oneadmin 12973 12972  0 13:10 ?        00:00:00 /bin/bash
>> /var/tmp/one/im/kvm.d/../run_probes kvm-probes /var/lib/one//datastores 4124
>> 20 0 host101.scss.tcd.ie
>> oneadmin 13029 12973  0 13:10 ?        00:00:00 /bin/bash ./monitor_ds.sh
>> kvm-probes /var/lib/one//datastores 4124 20 0 host101.scss.tcd.ie
>> oneadmin 13030 12970  0 13:10 ?        00:00:00 /bin/bash ./monitor_ds.sh
>> kvm-probes /var/lib/one//datastores 4124 20 0 host101.scss.tcd.ie
>>
>>
>>
>> 20-Jan-2014 13:14:26.675 client 134.226.59.101#52314: query: host101.scss.tcd.ie
>> IN AAAA + (134.226.32.57)
>> 20-Jan-2014 13:14:26.680 client 134.226.59.101#51356: query:
>> host101.scss.tcd.ie IN A + (134.226.32.57)
>> 20-Jan-2014 13:14:26.680 client 134.226.59.101#51356: query:
>> host101.scss.tcd.ie IN AAAA + (134.226.32.57)
>> 20-Jan-2014 13:14:26.822 client 134.226.59.101#47870: query:
>> host101.scss.tcd.ie IN A + (134.226.32.57)
>> 20-Jan-2014 13:14:26.822 client 134.226.59.101#47870: query:
>> host101.scss.tcd.ie IN AAAA + (134.226.32.57)
>> 20-Jan-2014 13:14:26.824 client 134.226.59.101#58734: query:
>> host101.scss.tcd.ie IN A + (134.226.32.57)
>> 20-Jan-2014 13:14:26.825 client 134.226.59.101#58734: query:
>> host101.scss.tcd.ie IN AAAA + (134.226.32.57)
>> 20-Jan-2014 13:14:26.952 client 134.226.59.101#39659: query:
>> host101.scss.tcd.ie IN A + (134.226.32.57)
>> 20-Jan-2014 13:14:26.952 client 134.226.59.101#39659: query:
>> host101.scss.tcd.ie IN AAAA + (134.226.32.57)
>> 20-Jan-2014 13:14:26.952 client 134.226.59.101#53975: query:
>> host101.scss.tcd.ie IN A + (134.226.32.57)
>> 20-Jan-2014 13:14:26.953 client 134.226.59.101#53975: query:
>> host101.scss.tcd.ie IN AAAA + (134.226.32.57)
>> 20-Jan-2014 13:14:27.108 client 134.226.59.101#36294: query:
>> host101.scss.tcd.ie IN A + (134.226.32.57)
>> 20-Jan-2014 13:14:27.108 client 134.226.59.101#36294: query:
>> host101.scss.tcd.ie IN AAAA + (134.226.32.57)
>> 20-Jan-2014 13:14:27.109 client 134.226.59.101#59277: query:
>> host101.scss.tcd.ie IN A + (134.226.32.57)
>> 20-Jan-2014 13:14:27.109 client 134.226.59.101#59277: query:
>> host101.scss.tcd.ie IN AAAA + (134.226.32.57)
>> 20-Jan-2014 13:14:27.347 client 134.226.59.101#49614: query:
>> host101.scss.tcd.ie IN A + (134.226.32.57)
>> 20-Jan-2014 13:14:27.348 client 134.226.59.101#49614: query:
>> host101.scss.tcd.ie IN AAAA + (134.226.32.57)
>> 20-Jan-2014 13:14:27.350 client 134.226.59.101#44058: query:
>> host101.scss.tcd.ie IN A + (134.226.32.57)
>> 20-Jan-2014 13:14:27.357 client 134.226.59.101#44058: query:
>> host101.scss.tcd.ie IN AAAA + (134.226.32.57)
>> 20-Jan-2014 13:14:27.458 client 134.226.59.101#51830: query:
>> host101.scss.tcd.ie IN A + (134.226.32.57)
>> 20-Jan-2014 13:14:27.458 client 134.226.59.101#51830: query:
>> host101.scss.tcd.ie IN AAAA + (134.226.32.57)
>> 20-Jan-2014 13:14:27.461 client 134.226.59.101#38419: query:
>> host101.scss.tcd.ie IN A + (134.226.32.57)
>> 20-Jan-2014 13:14:27.461 client 134.226.59.101#38419: query:
>> host101.scss.tcd.ie IN AAAA + (134.226.32.57)
>> 20-Jan-2014 13:14:31.184 client 134.226.59.101#38617: query:
>> host101.scss.tcd.ie IN A + (134.226.32.57)
>> 20-Jan-2014 13:14:31.184 client 134.226.59.101#38617: query:
>> host101.scss.tcd.ie IN AAAA + (134.226.32.57)
>> 20-Jan-2014 13:14:31.302 client 134.226
>>
>>
>>
>>
>>
>>
>>
>> On 17/01/2014 17:45, Ruben S. Montero wrote:
>>> Hi Gerry
>>>
>>> Just to check, are you using 4.4 final? We saw this in the betas and
>>> thought we had fixed it for the final version. Also, could you check that
>>> there is just one monitoring process on each host (collectd-client.sh, or
>>> something equivalent, should be the name of the process)?
>>>
>>> Also, could you send us the lines from oned.log between Thu Jan 16 16:56:25
>>> 2014 and Thu Jan 16 17:25:43 2014, plus the first lines of your oned.conf
>>> (we are especially interested in the values related to the monitoring
>>> interval)?
>>>
>>>
>>> Cheers
>>>
>>> Ruben
>>>
>>>
>>>
>>>
>>> On Fri, Jan 17, 2014 at 2:27 PM, Gerry O'Brien <gerry at scss.tcd.ie> wrote:
>>>
>>>> Hi,
>>>>
>>>>       Below is a truncated log file for a VM. The monitor continually
>>>> cycles through finding the machine RUNNING and then in state UNKNOWN.
>>>> This occurs for many machines at the same time. All the machines were
>>>> created by a script.
>>>>
>>>>       The VMs are Microsoft Windows 7 64-bit Enterprise. Individual
>>>> context is created by a startup script. They run fine, but eventually
>>>> /var/log/one is going to overflow.
>>>>
>>>>       Restarting oned seems to fix the problem, but this is hardly a
>>>> long-term solution.
>>>>
>>>>       Any suggestions on what could be causing this?
>>>>
>>>>           Regards,
>>>>               Gerry
>>>>
>>>>
>>>>
>>>>
>>>> Thu Jan 16 16:56:21 2014 [DiM][I]: New VM state is ACTIVE.
>>>> Thu Jan 16 16:56:22 2014 [LCM][I]: New VM state is PROLOG.
>>>> Thu Jan 16 16:56:22 2014 [VM][I]: Virtual Machine has no context
>>>> Thu Jan 16 16:56:22 2014 [LCM][I]: New VM state is BOOT
>>>> Thu Jan 16 16:56:22 2014 [VMM][I]: Generating deployment file:
>>>> /var/lib/one/vms/1788/deployment.0
>>>> Thu Jan 16 16:56:23 2014 [VMM][I]: ExitCode: 0
>>>> Thu Jan 16 16:56:23 2014 [VMM][I]: Successfully execute network driver
>>>> operation: pre.
>>>> Thu Jan 16 16:56:25 2014 [VMM][I]: ExitCode: 0
>>>> Thu Jan 16 16:56:25 2014 [VMM][I]: Successfully execute virtualization
>>>> driver operation: deploy.
>>>> Thu Jan 16 16:56:25 2014 [VMM][I]: ExitCode: 0
>>>> Thu Jan 16 16:56:25 2014 [VMM][I]: Successfully execute network driver
>>>> operation: post.
>>>> Thu Jan 16 16:56:25 2014 [LCM][I]: New VM state is RUNNING
>>>> Thu Jan 16 16:56:51 2014 [LCM][I]: New VM state is UNKNOWN
>>>> Thu Jan 16 16:59:01 2014 [VMM][I]: VM found again, state is RUNNING
>>>> Thu Jan 16 16:59:23 2014 [LCM][I]: New VM state is UNKNOWN
>>>> Thu Jan 16 17:01:41 2014 [VMM][I]: VM found again, state is RUNNING
>>>> Thu Jan 16 17:01:58 2014 [LCM][I]: New VM state is UNKNOWN
>>>> Thu Jan 16 17:04:18 2014 [VMM][I]: VM found again, state is RUNNING
>>>> Thu Jan 16 17:04:39 2014 [LCM][I]: New VM state is UNKNOWN
>>>> Thu Jan 16 17:06:55 2014 [VMM][I]: VM found again, state is RUNNING
>>>> Thu Jan 16 17:07:06 2014 [LCM][I]: New VM state is UNKNOWN
>>>> Thu Jan 16 17:09:31 2014 [VMM][I]: VM found again, state is RUNNING
>>>> Thu Jan 16 17:09:31 2014 [LCM][I]: New VM state is UNKNOWN
>>>> Thu Jan 16 17:12:22 2014 [VMM][I]: VM found again, state is RUNNING
>>>> Thu Jan 16 17:12:27 2014 [LCM][I]: New VM state is UNKNOWN
>>>> Thu Jan 16 17:15:11 2014 [VMM][I]: VM found again, state is RUNNING
>>>> Thu Jan 16 17:15:22 2014 [LCM][I]: New VM state is UNKNOWN
>>>> Thu Jan 16 17:17:49 2014 [VMM][I]: VM found again, state is RUNNING
>>>> Thu Jan 16 17:18:00 2014 [LCM][I]: New VM state is UNKNOWN
>>>> Thu Jan 16 17:20:27 2014 [VMM][I]: VM found again, state is RUNNING
>>>> Thu Jan 16 17:20:34 2014 [LCM][I]: New VM state is UNKNOWN
>>>> Thu Jan 16 17:23:04 2014 [VMM][I]: VM found again, state is RUNNING
>>>> Thu Jan 16 17:23:08 2014 [LCM][I]: New VM state is UNKNOWN
>>>> Thu Jan 16 17:25:41 2014 [VMM][I]: VM found again, state is RUNNING
>>>> Thu Jan 16 17:25:43 2014 [LCM][I]: New VM state is UNKNOWN
>>>>
>>>> --
>>>> Gerry O'Brien
>>>>
>>>> Systems Manager
>>>> School of Computer Science and Statistics
>>>> Trinity College Dublin
>>>> Dublin 2
>>>> IRELAND
>>>>
>>>> 00 353 1 896 1341
>>>>
>>>> _______________________________________________
>>>> Users mailing list
>>>> Users at lists.opennebula.org
>>>> http://lists.opennebula.org/listinfo.cgi/users-opennebula.org
>>>>
>>>
>>
>> --
>> Gerry O'Brien
>>
>> Systems Manager
>> School of Computer Science and Statistics
>> Trinity College Dublin
>> Dublin 2
>> IRELAND
>>
>> 00 353 1 896 1341
>>
>
>


-- 
Gerry O'Brien

Systems Manager
School of Computer Science and Statistics
Trinity College Dublin
Dublin 2
IRELAND

00 353 1 896 1341


