[one-users] What remotes commands does one 4.6 use:

Wed Jul 30 07:58:35 PDT 2014

This seems to be a bug: when collectd does not respond (because it is
waiting for the sudo password), OpenNebula does not move the host to ERROR.
The probes are designed not to start another collectd process, but we
should probably check whether a running one is not working and, if so, send
the ERROR message to OpenNebula.
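
As a rough illustration of that kind of check (a sketch only, not the fix
tracked in the issue), a probe runner could put a deadline on each probe and
report an error when it expires. The function name run_probe_with_deadline
and the deadline value are made up for this example; timeout is the
coreutils command, which exits with status 124 when it has to kill the
command it was running.

```shell
# Hypothetical helper, not part of OpenNebula: run one probe with a deadline
# so that a probe blocked on a prompt cannot hang the whole run.
# Usage: run_probe_with_deadline SECONDS COMMAND [ARGS...]
run_probe_with_deadline() {
    local deadline="$1"
    shift
    timeout "$deadline" "$@"
    local rc=$?
    if [ "$rc" -eq 124 ]; then
        # timeout(1) returns 124 when the deadline expired
        echo "ERROR: probe '$*' did not finish within ${deadline}s" >&2
    fi
    return "$rc"
}
```

For example, run_probe_with_deadline 30 ./monitor_ds.sh kvm-probes would
give up after 30 seconds instead of waiting at the password prompt.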

Pointer to the issue:
http://dev.opennebula.org/issues/3118
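
In the meantime, stale data can be spotted from the outside by comparing
the last monitoring timestamp with the current time. A minimal sketch:
is_stale and its arguments are hypothetical names, not an OpenNebula
command, and the caller has to supply the timestamp in epoch seconds.

```shell
# Hypothetical check, not an OpenNebula command: succeeds (exit 0) when the
# last monitoring timestamp is older than the allowed age.
# Arguments: last update (epoch seconds), max age (seconds), optional "now".
is_stale() {
    local last="$1"
    local max_age="$2"
    local now="${3:-$(date +%s)}"
    [ $(( now - last )) -gt "$max_age" ]
}
```

Something like is_stale "$last_epoch" 300 && echo "WARNING: monitoring data
is older than 300s" could then be run from cron on the head node.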

Cheers

On Wed, Jul 30, 2014 at 4:53 PM, Steven Timm <timm at fnal.gov> wrote:

> On Wed, 30 Jul 2014, Ruben S. Montero wrote:
>
>> Hi,
>> 1.- monitor_ds.sh may use LVM commands (vgdisplay) that need sudo
>> access. This should be set up automatically by the opennebula node
>> packages.
>>
>> 2.- It is not a real daemon. The first time a host is monitored, a process
>> is left running to periodically send information. OpenNebula
>> restarts it if no information is received in 3 monitor steps. Nothing
>> needs to be set up...
>>
>> Cheers
>>
>>
> On further inspection I found that this collectd was running on my nodes,
> and obviously failing up until now because sudoers was not set up
> correctly.  But there was nothing to warn us about it: nothing on
> the opennebula head node to even tell us that the information was stale,
> and no log file on the node to show the errors we were getting. In short,
> it was just quietly dying and we had no idea.  How can we make sure this
> doesn't happen again in the future?
>
> Steve Timm
>
>> On Wed, Jul 30, 2014 at 3:50 PM, Steven Timm <timm at fnal.gov> wrote:
>>       On Wed, 30 Jul 2014, Ruben S. Montero wrote:
>>
>>
>>             Maybe you could try to execute the monitor probes on the node:
>>
>>             1. ssh the node
>>             2. Go to /var/tmp/one/im
>>             3. Execute run_probes kvm-probes
>>
>>
>>       When I do that, (using sh -x ) I get the following:
>>
>>       -bash-4.1$ sh -x ./run_probes kvm-probes
>>       ++ dirname ./run_probes
>>       + source ./../scripts_common.sh
>>       ++ export LANG=C
>>       ++ LANG=C
>>       ++ export PATH=/bin:/sbin:/usr/bin:/usr/krb5/bin:/usr/lib64/qt-3.3/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin
>>       ++ PATH=/bin:/sbin:/usr/bin:/usr/krb5/bin:/usr/lib64/qt-3.3/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin
>>       ++ AWK=awk
>>       ++ BASH=bash
>>       ++ CUT=cut
>>       ++ DATE=date
>>       ++ DD=dd
>>       ++ DF=df
>>       ++ DU=du
>>       ++ GREP=grep
>>       ++ ISCSIADM=iscsiadm
>>       ++ LVCREATE=lvcreate
>>       ++ LVREMOVE=lvremove
>>       ++ LVRENAME=lvrename
>>       ++ LVS=lvs
>>       ++ LN=ln
>>       ++ MD5SUM=md5sum
>>       ++ MKFS=mkfs
>>       ++ MKISOFS=genisoimage
>>       ++ MKSWAP=mkswap
>>       ++ QEMU_IMG=qemu-img
>>       ++ RADOS=rados
>>       ++ RBD=rbd
>>       ++ READLINK=readlink
>>       ++ RM=rm
>>       ++ SCP=scp
>>       ++ SED=sed
>>       ++ SSH=ssh
>>       ++ SUDO=sudo
>>       ++ SYNC=sync
>>       ++ TAR=tar
>>       ++ TGTADM=tgtadm
>>       ++ TGTADMIN=tgt-admin
>>       ++ TGTSETUPLUN=tgt-setup-lun-one
>>       ++ TR=tr
>>       ++ VGDISPLAY=vgdisplay
>>       ++ VMKFSTOOLS=vmkfstools
>>       ++ WGET=wget
>>       +++ uname -s
>>       ++ '[' xLinux = xLinux ']'
>>       ++ SED='sed -r'
>>       +++ basename ./run_probes
>>       ++ SCRIPT_NAME=run_probes
>>       + export LANG=C
>>       + LANG=C
>>       + HYPERVISOR_DIR=kvm-probes.d
>>       + ARGUMENTS=kvm-probes
>>       ++ dirname ./run_probes
>>       + SCRIPTS_DIR=.
>>       + cd .
>>       ++ '[' -d kvm-probes.d ']'
>>       ++ run_dir kvm-probes.d
>>       ++ cd kvm-probes.d
>>       +++ ls architecture.sh collectd-client-shepherd.sh cpu.sh kvm.rb monitor_ds.sh name.sh poll.sh version.sh
>>       ++ for i in '`ls *`'
>>       ++ '[' -x architecture.sh ']'
>>       ++ ./architecture.sh kvm-probes
>>       ++ EXIT_CODE=0
>>       ++ '[' x0 '!=' x0 ']'
>>       ++ for i in '`ls *`'
>>       ++ '[' -x collectd-client-shepherd.sh ']'
>>       ++ ./collectd-client-shepherd.sh kvm-probes
>>       ++ EXIT_CODE=0
>>       ++ '[' x0 '!=' x0 ']'
>>       ++ for i in '`ls *`'
>>       ++ '[' -x cpu.sh ']'
>>       ++ ./cpu.sh kvm-probes
>>       ++ EXIT_CODE=0
>>       ++ '[' x0 '!=' x0 ']'
>>       ++ for i in '`ls *`'
>>       ++ '[' -x kvm.rb ']'
>>       ++ ./kvm.rb kvm-probes
>>       ++ EXIT_CODE=0
>>       ++ '[' x0 '!=' x0 ']'
>>       ++ for i in '`ls *`'
>>       ++ '[' -x monitor_ds.sh ']'
>>       ++ ./monitor_ds.sh kvm-probes
>>       [sudo] password for oneadmin:
>>
>>       and it stays hung on the password for oneadmin.
>>
>>       What's going on?
>>
>>       Also, you mentioned a collectd. Are you saying that OpenNebula 4.6
>>       now needs to run a daemon on every single VM host?
>>       Where is it documented how to set it up?
>>
>>       Steve
>>
>>             Make sure you do not have a host using the same hostname
>>             fgtest14 and running a collectd process.
>>
>>             On Jul 29, 2014 4:35 PM, "Steven Timm" <timm at fnal.gov> wrote:
>>
>>                   I am still trying to debug a nasty monitoring inconsistency.
>>
>>                   -bash-4.1$ onevm list | grep fgtest14
>>                       26 oneadmin oneadmin fgt6x4-26       runn    6      4G fgtest14   117d 19h50
>>                       27 oneadmin oneadmin fgt5x4-27       runn   10      4G fgtest14   117d 17h57
>>                       28 oneadmin oneadmin fgt1x1-28       runn   10    4.1G fgtest14   117d 16h59
>>                       30 oneadmin oneadmin fgt5x1-30       runn    0      4G fgtest14   116d 23h50
>>                       33 oneadmin oneadmin ip6sl5vda-33    runn    6      4G fgtest14   116d 19h57
>>                   -bash-4.1$ onehost list
>>                     ID NAME            CLUSTER   RVM      ALLOCATED_CPU      ALLOCATED_MEM STAT
>>                      3 fgtest11        ipv6        0       0 / 400 (0%)    0K / 15.7G (0%) on
>>                      4 fgtest12        ipv6        0       0 / 400 (0%)    0K / 15.7G (0%) on
>>                      7 fgtest13        ipv6        0       0 / 800 (0%)    0K / 23.6G (0%) on
>>                      8 fgtest14        ipv6        5       0 / 800 (0%)    0K / 23.6G (0%) on
>>                      9 fgtest20        ipv6        3    300 / 800 (37%)  12G / 31.4G (38%) on
>>                     11 fgtest19        ipv6        0       0 / 800 (0%)    0K / 31.5G (0%) on
>>                   -bash-4.1$ onehost show 8
>>                   HOST 8 INFORMATION
>>                   ID                    : 8
>>                   NAME                  : fgtest14
>>                   CLUSTER               : ipv6
>>                   STATE                 : MONITORED
>>                   IM_MAD                : kvm
>>                   VM_MAD                : kvm
>>                   VN_MAD                : dummy
>>                   LAST MONITORING TIME  : 07/29 09:25:45
>>
>>                   HOST SHARES
>>                   TOTAL MEM             : 23.6G
>>                   USED MEM (REAL)       : 876.4M
>>                   USED MEM (ALLOCATED)  : 0K
>>                   TOTAL CPU             : 800
>>                   USED CPU (REAL)       : 0
>>                   USED CPU (ALLOCATED)  : 0
>>                   RUNNING VMS           : 5
>>
>>                   LOCAL SYSTEM DATASTORE #102 CAPACITY
>>                   TOTAL:                : 548.8G
>>                   USED:                 : 175.3G
>>                   FREE:                 : 345.6G
>>
>>                   MONITORING INFORMATION
>>                   ARCH="x86_64"
>>                   CPUSPEED="2992"
>>                   HOSTNAME="fgtest14.fnal.gov"
>>                   HYPERVISOR="kvm"
>>                   MODELNAME="Intel(R) Xeon(R) CPU           E5450  @ 3.00GHz"
>>                   NETRX="234844577"
>>                   NETTX="21553126"
>>                   RESERVED_CPU=""
>>                   RESERVED_MEM=""
>>                   VERSION="4.6.0"
>>
>>                   VIRTUAL MACHINES
>>
>>                       ID USER     GROUP    NAME            STAT UCPU    UMEM HOST             TIME
>>                       26 oneadmin oneadmin fgt6x4-26       runn    6      4G fgtest14   117d 19h50
>>                       27 oneadmin oneadmin fgt5x4-27       runn   10      4G fgtest14   117d 17h57
>>                       28 oneadmin oneadmin fgt1x1-28       runn   10    4.1G fgtest14   117d 17h00
>>                       30 oneadmin oneadmin fgt5x1-30       runn    0      4G fgtest14   116d 23h50
>>                       33 oneadmin oneadmin ip6sl5vda-33    runn    6      4G fgtest14   116d 19h57
>>                   ------------------------------------------------------------
>>
>>                   All of this looks great, right?
>>                   Just one problem: there are no VMs running on fgtest14,
>>                   and there haven't been for 4 days.
>>
>>                   [root at fgtest14 ~]# virsh list
>>                    Id    Name                           State
>>                   ----------------------------------------------------
>>
>>                   [root at fgtest14 ~]#
>>
>>                   ------------------------------------------------------------
>>                   Yet the monitoring reports no errors.
>>
>>                   Tue Jul 29 09:28:10 2014 [InM][D]: Host fgtest14 (8) successfully monitored.
>>
>>                   ------------------------------------------------------------
>>                   At the same time, there is no evidence that ONE is
>>                   actually trying to monitor these five VMs, let alone
>>                   succeeding, yet they are still stuck in "runn", which
>>                   means I can't do a onevm restart to restart them.
>>                   (The VM images of these 5 VMs are still out there on
>>                   the VM host, and I would like to save and restart them
>>                   if I can.)
>>
>>                   What is the remotes command that ONE 4.6 would use to
>>                   monitor this host?
>>                   Can I do it manually and see what output I get?
>>
>>                   Are we dealing with some kind of a bug, or just a very
>>                   confused system?
>>                   Any help is appreciated. I have to get this sorted out
>>                   before I dare deploy ONE 4.x in production.
>>
>>                   Steve Timm
>>
>>
> ------------------------------------------------------------------
> Steven C. Timm, Ph.D  (630) 840-8525
> timm at fnal.gov  http://home.fnal.gov/~timm/
> Fermilab Scientific Computing Division, Scientific Computing Services Quad.
> Grid and Cloud Services Dept., Associate Dept. Head for Cloud Computing
>

-- 
-- 
Ruben S. Montero, PhD
Project co-Lead and Chief Architect
OpenNebula - Flexible Enterprise Cloud Made Simple
www.OpenNebula.org | rsmontero at opennebula.org | @OpenNebula