[one-users] Monitor continually cycles through finding machines RUNNING and stat UNKNOWN

Javier Fontan jfontan at opennebula.org
Fri Feb 14 02:11:53 PST 2014


I've checked on our machines and that is not normal. Kill those
processes. After some time the probes will be started again, hopefully
only one instance.
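
For reference, a minimal sketch of how the check and cleanup could be
done on a node. The helper name count_procs is made up for illustration
and is not part of OpenNebula. The pkill line assumes the probes run as
the oneadmin user, as in the ps output quoted below.

```shell
#!/bin/sh
# Illustrative helper (hypothetical name, not shipped with OpenNebula):
# count the lines of "ps -ef"-style output on stdin that mention a script.
count_procs() {
    # grep -c prints the number of matching lines; it exits 1 when the
    # count is 0, so "|| true" keeps the function's exit status at 0.
    grep -c -F -- "$1" || true
}

# Typical use on a node (commented out so sourcing this file is harmless):
#   ps -ef | grep -v grep | count_procs collectd-client.rb
#   ps -ef | grep -v grep | count_procs run_probes
#
# To kill the stray probe processes owned by oneadmin:
#   pkill -u oneadmin -f run_probes
```

The count for collectd-client.rb should be 1 on a healthy host. The
"grep -v grep" step only keeps the counting grep itself out of the
total.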

On Tue, Jan 21, 2014 at 1:53 PM, Gerry O'Brien <gerry at scss.tcd.ie> wrote:
> Hi,
>
>     I've gotten down to only one collectd-client.rb process (see below). Are
> the multiple kvm-probes OK?
>
>         Regards,
>           Gerry
>
>
>
>
> root at host101:~# ps -ef | grep one
> oneadmin  3349     1  0 12:23 ?        00:00:00 ruby /var/tmp/one/im/kvm.d/collectd-client.rb kvm /var/lib/one//datastores 4124 20 0 host101.scss.tcd.ie
> oneadmin 21068  3349  0 12:51 ?        00:00:00 /bin/bash /var/tmp/one/im/kvm.d/../run_probes kvm-probes /var/lib/one//datastores 4124 20 0 host101.scss.tcd.ie
> oneadmin 21076 21068  0 12:51 ?        00:00:00 /bin/bash /var/tmp/one/im/kvm.d/../run_probes kvm-probes /var/lib/one//datastores 4124 20 0 host101.scss.tcd.ie
> oneadmin 21077 21076  0 12:51 ?        00:00:00 /bin/bash /var/tmp/one/im/kvm.d/../run_probes kvm-probes /var/lib/one//datastores 4124 20 0 host101.scss.tcd.ie
>
>
>
>
>
> On 21/01/2014 10:10, Javier Fontan wrote:
>>
>> It seems that more people are having this problem, and we are
>> looking at several ways to fix it. One problem with /var/run
>> is that it is normally owned by root, so a process started by the
>> oneadmin user cannot write there. On the frontend a new directory for
>> OpenNebula pid files is created, but on the nodes it does not exist.
>>
>> On Tue, Jan 21, 2014 at 8:07 AM, Gerry O'Brien <gerry at scss.tcd.ie> wrote:
>>>
>>> Hi Javier,
>>>
>>>    See my previous email. Another scenario is when
>>> "/tmp/one-collectd-client.pid" does not exist due to issues with /tmp.
>>>
>>>     A change seems to have been made to put a pid file in /tmp instead of
>>> /run or /var/run.
>>>
>>>          Regards,
>>>            Gerry
>>>
>>>
>>>
>>> On 20/01/2014 17:44, Javier Fontan wrote:
>>>>
>>>> I've been trying to reproduce the problem, that is, making OpenNebula
>>>> start a large number of collectd-client processes. The only way I was
>>>> able to do it is when the file "/tmp/one-collectd-client.pid" exists
>>>> and has the wrong permissions. Can you check the ownership and
>>>> permissions of that file?
>>>>
>>>> On Mon, Jan 20, 2014 at 4:15 PM, Javier Fontan <jfontan at opennebula.org>
>>>> wrote:
>>>>>
>>>>> The problem seems to be the large number of collectd processes running.
>>>>> Try killing all "collectd-client.rb" processes. There should be only
>>>>> one running per host.
>>>>>
>>>>> In case you want to use the old method of monitoring, you can follow
>>>>> this guide:
>>>>>
>>>>> http://docs.opennebula.org/stable/administration/monitoring/imsshpullg.html#imsshpullg
>>>>>
>>>>> On Mon, Jan 20, 2014 at 2:17 PM, Gerry O'Brien <gerry at scss.tcd.ie>
>>>>> wrote:
>>>>>>
>>>>>> Hi Ruben,
>>>>>>
>>>>>>       Below is the output of 'ps -ef | grep one' on a host that has
>>>>>> been disabled, rebooted and enabled. There are multiple instances of
>>>>>> collectd-client.rb kvm running.
>>>>>>
>>>>>>
>>>>>>       Today we discovered a serious issue that is having an adverse
>>>>>> effect on our DNS system. When the machine below was enabled, our DNS
>>>>>> server was immediately flooded with requests from the host (see a
>>>>>> sample below).
>>>>>>        Our logs show that this has only started happening since the
>>>>>> upgrade to 4.4. If we don't get a fix for this we will have to go
>>>>>> back to 4.2, which is something I really don't want to do.
>>>>>>
>>>>>>           Regards,
>>>>>>               Gerry
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> oneadmin  3628     1  0 13:04 ?        00:00:00 ruby /var/tmp/one/im/kvm.d/collectd-client.rb kvm /var/lib/one//datastores 4124 20 0 host101.scss.tcd.ie
>>>>>> oneadmin  4600     1  0 13:05 ?        00:00:00 ruby /var/tmp/one/im/kvm.d/collectd-client.rb kvm /var/lib/one//datastores 4124 20 0 host101.scss.tcd.ie
>>>>>> oneadmin  6400     1  0 13:07 ?        00:00:00 ruby /var/tmp/one/im/kvm.d/collectd-client.rb kvm /var/lib/one//datastores 4124 20 0 host101.scss.tcd.ie
>>>>>> oneadmin  9003     1  0 13:08 ?        00:00:00 ruby /var/tmp/one/im/kvm.d/collectd-client.rb kvm /var/lib/one//datastores 4124 20 0 host101.scss.tcd.ie
>>>>>> oneadmin 12953  3628  0 13:10 ?        00:00:00 /bin/bash /var/tmp/one/im/kvm.d/../run_probes kvm-probes /var/lib/one//datastores 4124 20 0 host101.scss.tcd.ie
>>>>>> oneadmin 12955  6400  0 13:10 ?        00:00:00 /bin/bash /var/tmp/one/im/kvm.d/../run_probes kvm-probes /var/lib/one//datastores 4124 20 0 host101.scss.tcd.ie
>>>>>> oneadmin 12969 12953  0 13:10 ?        00:00:00 /bin/bash /var/tmp/one/im/kvm.d/../run_probes kvm-probes /var/lib/one//datastores 4124 20 0 host101.scss.tcd.ie
>>>>>> oneadmin 12970 12969  0 13:10 ?        00:00:00 /bin/bash /var/tmp/one/im/kvm.d/../run_probes kvm-probes /var/lib/one//datastores 4124 20 0 host101.scss.tcd.ie
>>>>>> oneadmin 12972 12955  0 13:10 ?        00:00:00 /bin/bash /var/tmp/one/im/kvm.d/../run_probes kvm-probes /var/lib/one//datastores 4124 20 0 host101.scss.tcd.ie
>>>>>> oneadmin 12973 12972  0 13:10 ?        00:00:00 /bin/bash /var/tmp/one/im/kvm.d/../run_probes kvm-probes /var/lib/one//datastores 4124 20 0 host101.scss.tcd.ie
>>>>>> oneadmin 13029 12973  0 13:10 ?        00:00:00 /bin/bash ./monitor_ds.sh kvm-probes /var/lib/one//datastores 4124 20 0 host101.scss.tcd.ie
>>>>>> oneadmin 13030 12970  0 13:10 ?        00:00:00 /bin/bash ./monitor_ds.sh kvm-probes /var/lib/one//datastores 4124 20 0 host101.scss.tcd.ie
>>>>>>
>>>>>>
>>>>>>
>>>>>> -2014 13:14:26.675 client 134.226.59.101#52314: query:
>>>>>> host101.scss.tcd.ie
>>>>>> IN AAAA + (134.226.32.57)
>>>>>> 20-Jan-2014 13:14:26.680 client 134.226.59.101#51356: query:
>>>>>> host101.scss.tcd.ie IN A + (134.226.32.57)
>>>>>> 20-Jan-2014 13:14:26.680 client 134.226.59.101#51356: query:
>>>>>> host101.scss.tcd.ie IN AAAA + (134.226.32.57)
>>>>>> 20-Jan-2014 13:14:26.822 client 134.226.59.101#47870: query:
>>>>>> host101.scss.tcd.ie IN A + (134.226.32.57)
>>>>>> 20-Jan-2014 13:14:26.822 client 134.226.59.101#47870: query:
>>>>>> host101.scss.tcd.ie IN AAAA + (134.226.32.57)
>>>>>> 20-Jan-2014 13:14:26.824 client 134.226.59.101#58734: query:
>>>>>> host101.scss.tcd.ie IN A + (134.226.32.57)
>>>>>> 20-Jan-2014 13:14:26.825 client 134.226.59.101#58734: query:
>>>>>> host101.scss.tcd.ie IN AAAA + (134.226.32.57)
>>>>>> 20-Jan-2014 13:14:26.952 client 134.226.59.101#39659: query:
>>>>>> host101.scss.tcd.ie IN A + (134.226.32.57)
>>>>>> 20-Jan-2014 13:14:26.952 client 134.226.59.101#39659: query:
>>>>>> host101.scss.tcd.ie IN AAAA + (134.226.32.57)
>>>>>> 20-Jan-2014 13:14:26.952 client 134.226.59.101#53975: query:
>>>>>> host101.scss.tcd.ie IN A + (134.226.32.57)
>>>>>> 20-Jan-2014 13:14:26.953 client 134.226.59.101#53975: query:
>>>>>> host101.scss.tcd.ie IN AAAA + (134.226.32.57)
>>>>>> 20-Jan-2014 13:14:27.108 client 134.226.59.101#36294: query:
>>>>>> host101.scss.tcd.ie IN A + (134.226.32.57)
>>>>>> 20-Jan-2014 13:14:27.108 client 134.226.59.101#36294: query:
>>>>>> host101.scss.tcd.ie IN AAAA + (134.226.32.57)
>>>>>> 20-Jan-2014 13:14:27.109 client 134.226.59.101#59277: query:
>>>>>> host101.scss.tcd.ie IN A + (134.226.32.57)
>>>>>> 20-Jan-2014 13:14:27.109 client 134.226.59.101#59277: query:
>>>>>> host101.scss.tcd.ie IN AAAA + (134.226.32.57)
>>>>>> 20-Jan-2014 13:14:27.347 client 134.226.59.101#49614: query:
>>>>>> host101.scss.tcd.ie IN A + (134.226.32.57)
>>>>>> 20-Jan-2014 13:14:27.348 client 134.226.59.101#49614: query:
>>>>>> host101.scss.tcd.ie IN AAAA + (134.226.32.57)
>>>>>> 20-Jan-2014 13:14:27.350 client 134.226.59.101#44058: query:
>>>>>> host101.scss.tcd.ie IN A + (134.226.32.57)
>>>>>> 20-Jan-2014 13:14:27.357 client 134.226.59.101#44058: query:
>>>>>> host101.scss.tcd.ie IN AAAA + (134.226.32.57)
>>>>>> 20-Jan-2014 13:14:27.458 client 134.226.59.101#51830: query:
>>>>>> host101.scss.tcd.ie IN A + (134.226.32.57)
>>>>>> 20-Jan-2014 13:14:27.458 client 134.226.59.101#51830: query:
>>>>>> host101.scss.tcd.ie IN AAAA + (134.226.32.57)
>>>>>> 20-Jan-2014 13:14:27.461 client 134.226.59.101#38419: query:
>>>>>> host101.scss.tcd.ie IN A + (134.226.32.57)
>>>>>> 20-Jan-2014 13:14:27.461 client 134.226.59.101#38419: query:
>>>>>> host101.scss.tcd.ie IN AAAA + (134.226.32.57)
>>>>>> 20-Jan-2014 13:14:31.184 client 134.226.59.101#38617: query:
>>>>>> host101.scss.tcd.ie IN A + (134.226.32.57)
>>>>>> 20-Jan-2014 13:14:31.184 client 134.226.59.101#38617: query:
>>>>>> host101.scss.tcd.ie IN AAAA + (134.226.32.57)
>>>>>> 20-Jan-2014 13:14:31.302 client 134.226
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 17/01/2014 17:45, Ruben S. Montero wrote:
>>>>>>>
>>>>>>> Hi Gerry
>>>>>>>
>>>>>>> Just to check, are you using 4.4 Final? We've seen this in the betas
>>>>>>> and "thought" we had fixed it for the final version. Also, could you
>>>>>>> check that there is just one monitoring process on the hosts
>>>>>>> (collectd-client.sh, or equiv, should be the name of the process)?
>>>>>>>
>>>>>>> Also, could you send us the lines from oned.log between Thu Jan 16
>>>>>>> 16:56:25 2014 and Thu Jan 16 17:25:43 2014, plus the first lines,
>>>>>>> which include your oned.conf values (we are especially interested in
>>>>>>> those related to the monitoring interval)?
>>>>>>>
>>>>>>>
>>>>>>> Cheers
>>>>>>>
>>>>>>> Ruben
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Jan 17, 2014 at 2:27 PM, Gerry O'Brien <gerry at scss.tcd.ie>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>>        Below is a truncated log file for a VM. The monitor
>>>>>>>> continually cycles through finding the machine RUNNING and state
>>>>>>>> UNKNOWN. This occurs for many, many machines at the same time. All
>>>>>>>> machines were created by a script.
>>>>>>>>
>>>>>>>>        The VMs are Microsoft Windows 7 64bit Enterprise. Individual
>>>>>>>> context is created by a startup script. They run fine, but eventually
>>>>>>>> /var/log/one is going to overflow.
>>>>>>>>
>>>>>>>>        Restarting oned seems to fix the problem, but this is hardly a
>>>>>>>> long-term solution.
>>>>>>>>
>>>>>>>>        Any suggestions on what could be causing this?
>>>>>>>>
>>>>>>>>            Regards,
>>>>>>>>                Gerry
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Thu Jan 16 16:56:21 2014 [DiM][I]: New VM state is ACTIVE.
>>>>>>>> Thu Jan 16 16:56:22 2014 [LCM][I]: New VM state is PROLOG.
>>>>>>>> Thu Jan 16 16:56:22 2014 [VM][I]: Virtual Machine has no context
>>>>>>>> Thu Jan 16 16:56:22 2014 [LCM][I]: New VM state is BOOT
>>>>>>>> Thu Jan 16 16:56:22 2014 [VMM][I]: Generating deployment file: /var/lib/one/vms/1788/deployment.0
>>>>>>>> Thu Jan 16 16:56:23 2014 [VMM][I]: ExitCode: 0
>>>>>>>> Thu Jan 16 16:56:23 2014 [VMM][I]: Successfully execute network driver operation: pre.
>>>>>>>> Thu Jan 16 16:56:25 2014 [VMM][I]: ExitCode: 0
>>>>>>>> Thu Jan 16 16:56:25 2014 [VMM][I]: Successfully execute virtualization driver operation: deploy.
>>>>>>>> Thu Jan 16 16:56:25 2014 [VMM][I]: ExitCode: 0
>>>>>>>> Thu Jan 16 16:56:25 2014 [VMM][I]: Successfully execute network driver operation: post.
>>>>>>>> Thu Jan 16 16:56:25 2014 [LCM][I]: New VM state is RUNNING
>>>>>>>> Thu Jan 16 16:56:51 2014 [LCM][I]: New VM state is UNKNOWN
>>>>>>>> Thu Jan 16 16:59:01 2014 [VMM][I]: VM found again, state is RUNNING
>>>>>>>> Thu Jan 16 16:59:23 2014 [LCM][I]: New VM state is UNKNOWN
>>>>>>>> Thu Jan 16 17:01:41 2014 [VMM][I]: VM found again, state is RUNNING
>>>>>>>> Thu Jan 16 17:01:58 2014 [LCM][I]: New VM state is UNKNOWN
>>>>>>>> Thu Jan 16 17:04:18 2014 [VMM][I]: VM found again, state is RUNNING
>>>>>>>> Thu Jan 16 17:04:39 2014 [LCM][I]: New VM state is UNKNOWN
>>>>>>>> Thu Jan 16 17:06:55 2014 [VMM][I]: VM found again, state is RUNNING
>>>>>>>> Thu Jan 16 17:07:06 2014 [LCM][I]: New VM state is UNKNOWN
>>>>>>>> Thu Jan 16 17:09:31 2014 [VMM][I]: VM found again, state is RUNNING
>>>>>>>> Thu Jan 16 17:09:31 2014 [LCM][I]: New VM state is UNKNOWN
>>>>>>>> Thu Jan 16 17:12:22 2014 [VMM][I]: VM found again, state is RUNNING
>>>>>>>> Thu Jan 16 17:12:27 2014 [LCM][I]: New VM state is UNKNOWN
>>>>>>>> Thu Jan 16 17:15:11 2014 [VMM][I]: VM found again, state is RUNNING
>>>>>>>> Thu Jan 16 17:15:22 2014 [LCM][I]: New VM state is UNKNOWN
>>>>>>>> Thu Jan 16 17:17:49 2014 [VMM][I]: VM found again, state is RUNNING
>>>>>>>> Thu Jan 16 17:18:00 2014 [LCM][I]: New VM state is UNKNOWN
>>>>>>>> Thu Jan 16 17:20:27 2014 [VMM][I]: VM found again, state is RUNNING
>>>>>>>> Thu Jan 16 17:20:34 2014 [LCM][I]: New VM state is UNKNOWN
>>>>>>>> Thu Jan 16 17:23:04 2014 [VMM][I]: VM found again, state is RUNNING
>>>>>>>> Thu Jan 16 17:23:08 2014 [LCM][I]: New VM state is UNKNOWN
>>>>>>>> Thu Jan 16 17:25:41 2014 [VMM][I]: VM found again, state is RUNNING
>>>>>>>> Thu Jan 16 17:25:43 2014 [LCM][I]: New VM state is UNKNOWN
>>>>>>>>
>>>>>>>> --
>>>>>>>> Gerry O'Brien
>>>>>>>>
>>>>>>>> Systems Manager
>>>>>>>> School of Computer Science and Statistics
>>>>>>>> Trinity College Dublin
>>>>>>>> Dublin 2
>>>>>>>> IRELAND
>>>>>>>>
>>>>>>>> 00 353 1 896 1341
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> Users mailing list
>>>>>>>> Users at lists.opennebula.org
>>>>>>>> http://lists.opennebula.org/listinfo.cgi/users-opennebula.org
>>>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Javier Fontán Muiños
>>>>> Developer
>>>>> OpenNebula - The Open Source Toolkit for Data Center Virtualization
>>>>> www.OpenNebula.org | @OpenNebula | github.com/jfontan
>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>
>
>



-- 
Javier Fontán Muiños
Developer
OpenNebula - The Open Source Toolkit for Data Center Virtualization
www.OpenNebula.org | @OpenNebula | github.com/jfontan

