[one-users] What remotes commands does one 4.6 use:

Ruben S. Montero rsmontero at opennebula.org
Wed Jul 30 08:02:58 PDT 2014


Sorry for the noise, just saw it in the other thread...


On Wed, Jul 30, 2014 at 5:01 PM, Ruben S. Montero <rsmontero at opennebula.org>
wrote:

> BTW, could you paste the output of the run_probes command once it finishes?
>
>
> On Wed, Jul 30, 2014 at 4:58 PM, Ruben S. Montero <
> rsmontero at opennebula.org> wrote:
>
>> This seems to be a bug: when collectd does not respond (because it is
>> waiting for a sudo password), OpenNebula does not move the host to ERROR.
>> The probes are designed not to start another collectd process, but we
>> should probably check that a running one is actually working and send the
>> ERROR message to OpenNebula.
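A check of roughly this shape could catch the "running but dead" case the bug describes. This is a sketch, not OpenNebula's actual probe code: the state-file convention and the 90-second threshold (about 3 monitor steps) are illustrative assumptions.

```shell
#!/bin/sh
# Sketch: decide whether a collectd-client should be treated as dead based
# on the age of a state file it is assumed to touch on every monitor cycle.
# NOT OpenNebula's real shepherd logic; path/threshold are hypothetical.

check_collectd_fresh() {
    state_file=$1   # file the client is assumed to refresh each cycle
    max_age=$2      # seconds; e.g. 3 monitor steps of 30s = 90

    if [ ! -f "$state_file" ]; then
        echo "ERROR collectd state file missing"
        return 1
    fi

    now=$(date +%s)
    mtime=$(stat -c %Y "$state_file")   # GNU stat; 'stat -f %m' on BSD
    age=$((now - mtime))

    if [ "$age" -gt "$max_age" ]; then
        echo "ERROR collectd stale for ${age}s"
        return 1
    fi
    echo "OK"
}
```

A probe built this way would print an ERROR line that the monitoring driver could relay, instead of silently leaving the host MONITORED.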
>>
>> Pointer to the issue:
>> http://dev.opennebula.org/issues/3118
>>
>> Cheers
>>
>>
>> On Wed, Jul 30, 2014 at 4:53 PM, Steven Timm <timm at fnal.gov> wrote:
>>
>>> On Wed, 30 Jul 2014, Ruben S. Montero wrote:
>>>
>>>  Hi,
>>>> 1.- monitor_ds.sh may use LVM commands (vgdisplay) that need sudo
>>>> access. This should be set up automatically by the opennebula-node
>>>> packages.
>>>>
>>>> 2.- It is not a real daemon; the first time a host is monitored, a
>>>> process is left behind to periodically send information. OpenNebula
>>>> restarts it if no information is received within 3 monitor steps.
>>>> Nothing needs to be set up...
>>>>
>>>> Cheers
>>>>
>>>>
>>> On further inspection I found that this collectd was running on my
>>> nodes, and had obviously been failing up until now because sudoers was
>>> not set up correctly.  But there was nothing to warn us about it.
>>> Nothing on the opennebula head node to even tell us that the information
>>> was stale. No log file on the node to show the errors we were getting.
>>> In short, it was just quietly dying and we had no idea.  How can we make
>>> sure this doesn't happen again in the future?
>>>
>>> Steve Timm
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>> On Wed, Jul 30, 2014 at 3:50 PM, Steven Timm <timm at fnal.gov> wrote:
>>>>       On Wed, 30 Jul 2014, Ruben S. Montero wrote:
>>>>
>>>>
>>>>             Maybe you could try to execute the monitor probes on the node:
>>>>
>>>>             1. ssh to the node
>>>>             2. Go to /var/tmp/one/im
>>>>             3. Execute run_probes kvm-probes
>>>>
>>>>
>>>>       When I do that (using sh -x), I get the following:
>>>>
>>>>       -bash-4.1$ sh -x ./run_probes kvm-probes
>>>>       ++ dirname ./run_probes
>>>>       + source ./../scripts_common.sh
>>>>       ++ export LANG=C
>>>>       ++ LANG=C
>>>>       ++ export PATH=/bin:/sbin:/usr/bin:/usr/krb5/bin:/usr/lib64/qt-3.3/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin
>>>>       ++ PATH=/bin:/sbin:/usr/bin:/usr/krb5/bin:/usr/lib64/qt-3.3/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin
>>>>       ++ AWK=awk
>>>>       ++ BASH=bash
>>>>       ++ CUT=cut
>>>>       ++ DATE=date
>>>>       ++ DD=dd
>>>>       ++ DF=df
>>>>       ++ DU=du
>>>>       ++ GREP=grep
>>>>       ++ ISCSIADM=iscsiadm
>>>>       ++ LVCREATE=lvcreate
>>>>       ++ LVREMOVE=lvremove
>>>>       ++ LVRENAME=lvrename
>>>>       ++ LVS=lvs
>>>>       ++ LN=ln
>>>>       ++ MD5SUM=md5sum
>>>>       ++ MKFS=mkfs
>>>>       ++ MKISOFS=genisoimage
>>>>       ++ MKSWAP=mkswap
>>>>       ++ QEMU_IMG=qemu-img
>>>>       ++ RADOS=rados
>>>>       ++ RBD=rbd
>>>>       ++ READLINK=readlink
>>>>       ++ RM=rm
>>>>       ++ SCP=scp
>>>>       ++ SED=sed
>>>>       ++ SSH=ssh
>>>>       ++ SUDO=sudo
>>>>       ++ SYNC=sync
>>>>       ++ TAR=tar
>>>>       ++ TGTADM=tgtadm
>>>>       ++ TGTADMIN=tgt-admin
>>>>       ++ TGTSETUPLUN=tgt-setup-lun-one
>>>>       ++ TR=tr
>>>>       ++ VGDISPLAY=vgdisplay
>>>>       ++ VMKFSTOOLS=vmkfstools
>>>>       ++ WGET=wget
>>>>       +++ uname -s
>>>>       ++ '[' xLinux = xLinux ']'
>>>>       ++ SED='sed -r'
>>>>       +++ basename ./run_probes
>>>>       ++ SCRIPT_NAME=run_probes
>>>>       + export LANG=C
>>>>       + LANG=C
>>>>       + HYPERVISOR_DIR=kvm-probes.d
>>>>       + ARGUMENTS=kvm-probes
>>>>       ++ dirname ./run_probes
>>>>       + SCRIPTS_DIR=.
>>>>       + cd .
>>>>       ++ '[' -d kvm-probes.d ']'
>>>>       ++ run_dir kvm-probes.d
>>>>       ++ cd kvm-probes.d
>>>>       +++ ls architecture.sh collectd-client-shepherd.sh cpu.sh kvm.rb monitor_ds.sh name.sh poll.sh version.sh
>>>>       ++ for i in '`ls *`'
>>>>       ++ '[' -x architecture.sh ']'
>>>>       ++ ./architecture.sh kvm-probes
>>>>       ++ EXIT_CODE=0
>>>>       ++ '[' x0 '!=' x0 ']'
>>>>       ++ for i in '`ls *`'
>>>>       ++ '[' -x collectd-client-shepherd.sh ']'
>>>>       ++ ./collectd-client-shepherd.sh kvm-probes
>>>>       ++ EXIT_CODE=0
>>>>       ++ '[' x0 '!=' x0 ']'
>>>>       ++ for i in '`ls *`'
>>>>       ++ '[' -x cpu.sh ']'
>>>>       ++ ./cpu.sh kvm-probes
>>>>       ++ EXIT_CODE=0
>>>>       ++ '[' x0 '!=' x0 ']'
>>>>       ++ for i in '`ls *`'
>>>>       ++ '[' -x kvm.rb ']'
>>>>       ++ ./kvm.rb kvm-probes
>>>>       ++ EXIT_CODE=0
>>>>       ++ '[' x0 '!=' x0 ']'
>>>>       ++ for i in '`ls *`'
>>>>       ++ '[' -x monitor_ds.sh ']'
>>>>       ++ ./monitor_ds.sh kvm-probes
>>>>       [sudo] password for oneadmin:
>>>>
>>>>       and it stays hung on the password for oneadmin.
>>>>
>>>>       What's going on?
>>>>
>>>>       Also, you mentioned collectd -- are you saying that OpenNebula
>>>>       4.6 now needs to run a daemon on every single VM host?
>>>>       Where is it documented how to set it up?
>>>>
>>>>       Steve
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>             Make sure you do not have a host using the same hostname fgtest14 and running a collectd process
>>>>
>>>>             On Jul 29, 2014 4:35 PM, "Steven Timm" <timm at fnal.gov> wrote:
>>>>
>>>>                   I am still trying to debug a nasty monitoring inconsistency.
>>>>
>>>>                   -bash-4.1$ onevm list | grep fgtest14
>>>>                     26 oneadmin oneadmin fgt6x4-26     runn    6    4G fgtest14 117d 19h50
>>>>                     27 oneadmin oneadmin fgt5x4-27     runn   10    4G fgtest14 117d 17h57
>>>>                     28 oneadmin oneadmin fgt1x1-28     runn   10  4.1G fgtest14 117d 16h59
>>>>                     30 oneadmin oneadmin fgt5x1-30     runn    0    4G fgtest14 116d 23h50
>>>>                     33 oneadmin oneadmin ip6sl5vda-33  runn    6    4G fgtest14 116d 19h57
>>>>                   -bash-4.1$ onehost list
>>>>                     ID NAME      CLUSTER  RVM  ALLOCATED_CPU    ALLOCATED_MEM     STAT
>>>>                      3 fgtest11  ipv6       0  0 / 400 (0%)     0K / 15.7G (0%)   on
>>>>                      4 fgtest12  ipv6       0  0 / 400 (0%)     0K / 15.7G (0%)   on
>>>>                      7 fgtest13  ipv6       0  0 / 800 (0%)     0K / 23.6G (0%)   on
>>>>                      8 fgtest14  ipv6       5  0 / 800 (0%)     0K / 23.6G (0%)   on
>>>>                      9 fgtest20  ipv6       3  300 / 800 (37%)  12G / 31.4G (38%) on
>>>>                     11 fgtest19  ipv6       0  0 / 800 (0%)     0K / 31.5G (0%)   on
>>>>                   -bash-4.1$ onehost show 8
>>>>                   HOST 8 INFORMATION
>>>>                   ID                    : 8
>>>>                   NAME                  : fgtest14
>>>>                   CLUSTER               : ipv6
>>>>                   STATE                 : MONITORED
>>>>                   IM_MAD                : kvm
>>>>                   VM_MAD                : kvm
>>>>                   VN_MAD                : dummy
>>>>                   LAST MONITORING TIME  : 07/29 09:25:45
>>>>
>>>>                   HOST SHARES
>>>>                   TOTAL MEM             : 23.6G
>>>>                   USED MEM (REAL)       : 876.4M
>>>>                   USED MEM (ALLOCATED)  : 0K
>>>>                   TOTAL CPU             : 800
>>>>                   USED CPU (REAL)       : 0
>>>>                   USED CPU (ALLOCATED)  : 0
>>>>                   RUNNING VMS           : 5
>>>>
>>>>                   LOCAL SYSTEM DATASTORE #102 CAPACITY
>>>>                   TOTAL:                : 548.8G
>>>>                   USED:                 : 175.3G
>>>>                   FREE:                 : 345.6G
>>>>
>>>>                   MONITORING INFORMATION
>>>>                   ARCH="x86_64"
>>>>                   CPUSPEED="2992"
>>>>                   HOSTNAME="fgtest14.fnal.gov"
>>>>                   HYPERVISOR="kvm"
>>>>                   MODELNAME="Intel(R) Xeon(R) CPU           E5450  @ 3.00GHz"
>>>>                   NETRX="234844577"
>>>>                   NETTX="21553126"
>>>>                   RESERVED_CPU=""
>>>>                   RESERVED_MEM=""
>>>>                   VERSION="4.6.0"
>>>>
>>>>                   VIRTUAL MACHINES
>>>>
>>>>                     ID USER     GROUP    NAME          STAT UCPU  UMEM HOST     TIME
>>>>                     26 oneadmin oneadmin fgt6x4-26     runn    6    4G fgtest14 117d 19h50
>>>>                     27 oneadmin oneadmin fgt5x4-27     runn   10    4G fgtest14 117d 17h57
>>>>                     28 oneadmin oneadmin fgt1x1-28     runn   10  4.1G fgtest14 117d 17h00
>>>>                     30 oneadmin oneadmin fgt5x1-30     runn    0    4G fgtest14 116d 23h50
>>>>                     33 oneadmin oneadmin ip6sl5vda-33  runn    6    4G fgtest14 116d 19h57
>>>>                   -----------------------------------------------------------------------------------
>>>>
>>>>                   All of this looks great, right?
>>>>                   Just one problem: there are no VMs running on fgtest14,
>>>>                   and there haven't been for 4 days.
>>>>
>>>>                   [root at fgtest14 ~]# virsh list
>>>>                    Id    Name                           State
>>>>                   ----------------------------------------------------
>>>>
>>>>                   [root at fgtest14 ~]#
>>>>
>>>>                   -------------------------------------------------------------------------
>>>>                   Yet the monitoring reports no errors.
>>>>
>>>>                   Tue Jul 29 09:28:10 2014 [InM][D]: Host fgtest14 (8) successfully monitored.
>>>>
>>>>                   -----------------------------------------------------------------------------
>>>>                   At the same time, there is no evidence that ONE is
>>>>                   actually trying, or succeeding, to monitor these five
>>>>                   VMs, yet they are still stuck in "runn", which means
>>>>                   I can't do a onevm restart to restart them.
>>>>                   (The VM images of these 5 VMs are still out there on
>>>>                   the VM host, and I would like to save and restart them
>>>>                   if I can.)
>>>>
>>>>                   What is the remotes command that ONE 4.6 would use to
>>>>                   monitor this host? Can I do it manually and see what
>>>>                   output I get?
>>>>
>>>>                   Are we dealing with some kind of a bug, or just a very
>>>>                   confused system? Any help is appreciated. I have to get
>>>>                   this sorted out before I dare deploy ONE 4.x in
>>>>                   production.
>>>>
>>>>                   Steve Timm
>>>>
>>>>
>>>>                   ------------------------------------------------------------------
>>>>                   Steven C. Timm, Ph.D  (630) 840-8525
>>>>                   timm at fnal.gov  http://home.fnal.gov/~timm/
>>>>                   Fermilab Scientific Computing Division, Scientific Computing Services Quad.
>>>>                   Grid and Cloud Services Dept., Associate Dept. Head for Cloud Computing
>>>>                   _______________________________________________
>>>>                   Users mailing list
>>>>                   Users at lists.opennebula.org
>>>>                   http://lists.opennebula.org/listinfo.cgi/users-opennebula.org
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>
>>
>>
>>
>
>
>
>



-- 
-- 
Ruben S. Montero, PhD
Project co-Lead and Chief Architect
OpenNebula - Flexible Enterprise Cloud Made Simple
www.OpenNebula.org | rsmontero at opennebula.org | @OpenNebula

