[one-users] What remotes commands does one 4.6 use:
Ruben S. Montero
rsmontero at opennebula.org
Wed Jul 30 08:01:48 PDT 2014
BTW, could you paste the output of the run_probes command once it finishes?
On Wed, Jul 30, 2014 at 4:58 PM, Ruben S. Montero <rsmontero at opennebula.org>
wrote:
> This seems to be a bug: when collectd does not respond (because it is
> waiting for the sudo password), OpenNebula does not move the host to ERROR.
> The probes are designed not to start another collectd process, but we
> should probably check that a running one is actually working and send the
> ERROR message to OpenNebula.
>
> Pointer to the issue:
> http://dev.opennebula.org/issues/3118
>
> Cheers
>
>
> On Wed, Jul 30, 2014 at 4:53 PM, Steven Timm <timm at fnal.gov> wrote:
>
>> On Wed, 30 Jul 2014, Ruben S. Montero wrote:
>>
>> Hi,
>>> 1.- monitor_ds.sh may use LVM commands (vgdisplay) that need sudo
>>> access. This should be set up automatically by the opennebula-node
>>> packages.
>>>
>>> 2.- It is not a real daemon: the first time a host is monitored, a
>>> process is left behind to periodically send information. OpenNebula
>>> restarts it if no information is received within 3 monitor steps. Nothing
>>> needs to be set up...
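[For reference, the sudoers rule the node packages are expected to install looks roughly like the sketch below. The file name and exact command list are assumptions, not verified package contents; check the opennebula-node package on your distribution.]

```shell
# /etc/sudoers.d/opennebula -- illustrative sketch only; verify the real
# file shipped by your opennebula-node package.
# Lets oneadmin run the LVM probe commands without a password, so
# monitor_ds.sh never blocks on an interactive sudo prompt.
oneadmin ALL=(ALL) NOPASSWD: /sbin/vgdisplay, /sbin/lvs
```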
>>>
>>> Cheers
>>>
>>>
>> On further inspection I found that this collectd was running on my nodes,
>> and had obviously been failing up until now because the sudoers file was
>> not set up correctly. But there was nothing to warn us about it: nothing on
>> the OpenNebula head node to even tell us that the information was stale,
>> and no log file on the node to show the errors we were getting. In short,
>> it was just quietly dying and we had no idea. How can we make sure this
>> doesn't happen again?
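[One pragmatic safeguard is a cron-able staleness check on the node. The sketch below assumes each probe run touches a state file; the path in the usage comment is hypothetical, not an OpenNebula default.]

```shell
#!/bin/sh
# Hypothetical watchdog sketch: flag monitoring data that has gone stale.
# stale FILE MAX_AGE_SECONDS -> exit 0 (true) if FILE is older than
# MAX_AGE_SECONDS or missing, exit 1 (false) otherwise.
stale() {
    file="$1"; max_age="$2"
    now=$(date +%s)
    # A missing state file counts as stale.
    mtime=$(stat -c %Y "$file" 2>/dev/null) || return 0
    [ $(( now - mtime )) -gt "$max_age" ]
}

# Example cron usage (state-file path is an assumption):
#   stale /var/tmp/one/im/last_probe_run 300 && logger "ONE probes stale"
```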
>>
>> Steve Timm
>>
>>> On Wed, Jul 30, 2014 at 3:50 PM, Steven Timm <timm at fnal.gov> wrote:
>>> On Wed, 30 Jul 2014, Ruben S. Montero wrote:
>>>
>>>
>>> Maybe you could try to execute the monitor probes on the
>>> node:
>>>
>>> 1. ssh to the node
>>> 2. Go to /var/tmp/one/im
>>> 3. Execute run_probes kvm-probes
>>>
>>>
>>> When I do that (using sh -x), I get the following:
>>>
>>> -bash-4.1$ sh -x ./run_probes kvm-probes
>>> ++ dirname ./run_probes
>>> + source ./../scripts_common.sh
>>> ++ export LANG=C
>>> ++ LANG=C
>>> ++ export PATH=/bin:/sbin:/usr/bin:/usr/krb5/bin:/usr/lib64/qt-3.3/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin
>>> ++ PATH=/bin:/sbin:/usr/bin:/usr/krb5/bin:/usr/lib64/qt-3.3/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin
>>> ++ AWK=awk
>>> ++ BASH=bash
>>> ++ CUT=cut
>>> ++ DATE=date
>>> ++ DD=dd
>>> ++ DF=df
>>> ++ DU=du
>>> ++ GREP=grep
>>> ++ ISCSIADM=iscsiadm
>>> ++ LVCREATE=lvcreate
>>> ++ LVREMOVE=lvremove
>>> ++ LVRENAME=lvrename
>>> ++ LVS=lvs
>>> ++ LN=ln
>>> ++ MD5SUM=md5sum
>>> ++ MKFS=mkfs
>>> ++ MKISOFS=genisoimage
>>> ++ MKSWAP=mkswap
>>> ++ QEMU_IMG=qemu-img
>>> ++ RADOS=rados
>>> ++ RBD=rbd
>>> ++ READLINK=readlink
>>> ++ RM=rm
>>> ++ SCP=scp
>>> ++ SED=sed
>>> ++ SSH=ssh
>>> ++ SUDO=sudo
>>> ++ SYNC=sync
>>> ++ TAR=tar
>>> ++ TGTADM=tgtadm
>>> ++ TGTADMIN=tgt-admin
>>> ++ TGTSETUPLUN=tgt-setup-lun-one
>>> ++ TR=tr
>>> ++ VGDISPLAY=vgdisplay
>>> ++ VMKFSTOOLS=vmkfstools
>>> ++ WGET=wget
>>> +++ uname -s
>>> ++ '[' xLinux = xLinux ']'
>>> ++ SED='sed -r'
>>> +++ basename ./run_probes
>>> ++ SCRIPT_NAME=run_probes
>>> + export LANG=C
>>> + LANG=C
>>> + HYPERVISOR_DIR=kvm-probes.d
>>> + ARGUMENTS=kvm-probes
>>> ++ dirname ./run_probes
>>> + SCRIPTS_DIR=.
>>> + cd .
>>> ++ '[' -d kvm-probes.d ']'
>>> ++ run_dir kvm-probes.d
>>> ++ cd kvm-probes.d
>>> +++ ls architecture.sh collectd-client-shepherd.sh cpu.sh kvm.rb monitor_ds.sh name.sh poll.sh version.sh
>>> ++ for i in '`ls *`'
>>> ++ '[' -x architecture.sh ']'
>>> ++ ./architecture.sh kvm-probes
>>> ++ EXIT_CODE=0
>>> ++ '[' x0 '!=' x0 ']'
>>> ++ for i in '`ls *`'
>>> ++ '[' -x collectd-client-shepherd.sh ']'
>>> ++ ./collectd-client-shepherd.sh kvm-probes
>>> ++ EXIT_CODE=0
>>> ++ '[' x0 '!=' x0 ']'
>>> ++ for i in '`ls *`'
>>> ++ '[' -x cpu.sh ']'
>>> ++ ./cpu.sh kvm-probes
>>> ++ EXIT_CODE=0
>>> ++ '[' x0 '!=' x0 ']'
>>> ++ for i in '`ls *`'
>>> ++ '[' -x kvm.rb ']'
>>> ++ ./kvm.rb kvm-probes
>>> ++ EXIT_CODE=0
>>> ++ '[' x0 '!=' x0 ']'
>>> ++ for i in '`ls *`'
>>> ++ '[' -x monitor_ds.sh ']'
>>> ++ ./monitor_ds.sh kvm-probes
>>> [sudo] password for oneadmin:
>>>
>>> and it stays hung on the password for oneadmin.
>>>
>>> What's going on?
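[The hang above is sudo prompting for a password that nothing will ever type. sudo's -n option fails instead of prompting, which makes it usable as a non-blocking diagnostic. A sketch, with a hypothetical wrapper function not taken from OpenNebula:]

```shell
# Hypothetical helper: succeed only if the given command completes without
# waiting for terminal input. stdin is redirected from /dev/null so nothing
# can block on a prompt the way monitor_ds.sh does above.
runs_noninteractively() {
    "$@" </dev/null >/dev/null 2>&1
}

# On a node one would check the exact command the probe needs:
#   runs_noninteractively sudo -n vgdisplay || echo "probes will hang on sudo"
```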
>>>
>>> Also, you mentioned collectd: are you saying that OpenNebula 4.6
>>> now needs to run a daemon on every single VM host? Where is it
>>> documented how to set it up?
>>>
>>> Steve
>>>
>>> Make sure you do not have another host using the same hostname
>>> fgtest14 and running a collectd process.
>>>
>>> On Jul 29, 2014 4:35 PM, "Steven Timm" <timm at fnal.gov>
>>> wrote:
>>>
>>> I am still trying to debug a nasty monitoring
>>> inconsistency.
>>>
>>> -bash-4.1$ onevm list | grep fgtest14
>>> 26 oneadmin oneadmin fgt6x4-26    runn    6    4G fgtest14 117d 19h50
>>> 27 oneadmin oneadmin fgt5x4-27    runn   10    4G fgtest14 117d 17h57
>>> 28 oneadmin oneadmin fgt1x1-28    runn   10  4.1G fgtest14 117d 16h59
>>> 30 oneadmin oneadmin fgt5x1-30    runn    0    4G fgtest14 116d 23h50
>>> 33 oneadmin oneadmin ip6sl5vda-33 runn    6    4G fgtest14 116d 19h57
>>> -bash-4.1$ onehost list
>>> ID NAME     CLUSTER RVM ALLOCATED_CPU    ALLOCATED_MEM      STAT
>>>  3 fgtest11 ipv6      0 0 / 400 (0%)     0K / 15.7G (0%)    on
>>>  4 fgtest12 ipv6      0 0 / 400 (0%)     0K / 15.7G (0%)    on
>>>  7 fgtest13 ipv6      0 0 / 800 (0%)     0K / 23.6G (0%)    on
>>>  8 fgtest14 ipv6      5 0 / 800 (0%)     0K / 23.6G (0%)    on
>>>  9 fgtest20 ipv6      3 300 / 800 (37%)  12G / 31.4G (38%)  on
>>> 11 fgtest19 ipv6      0 0 / 800 (0%)     0K / 31.5G (0%)    on
>>> -bash-4.1$ onehost show 8
>>> HOST 8 INFORMATION
>>> ID : 8
>>> NAME : fgtest14
>>> CLUSTER : ipv6
>>> STATE : MONITORED
>>> IM_MAD : kvm
>>> VM_MAD : kvm
>>> VN_MAD : dummy
>>> LAST MONITORING TIME : 07/29 09:25:45
>>>
>>> HOST SHARES
>>> TOTAL MEM : 23.6G
>>> USED MEM (REAL) : 876.4M
>>> USED MEM (ALLOCATED) : 0K
>>> TOTAL CPU : 800
>>> USED CPU (REAL) : 0
>>> USED CPU (ALLOCATED) : 0
>>> RUNNING VMS : 5
>>>
>>> LOCAL SYSTEM DATASTORE #102 CAPACITY
>>> TOTAL: : 548.8G
>>> USED: : 175.3G
>>> FREE: : 345.6G
>>>
>>> MONITORING INFORMATION
>>> ARCH="x86_64"
>>> CPUSPEED="2992"
>>> HOSTNAME="fgtest14.fnal.gov"
>>> HYPERVISOR="kvm"
>>> MODELNAME="Intel(R) Xeon(R) CPU E5450 @ 3.00GHz"
>>> NETRX="234844577"
>>> NETTX="21553126"
>>> RESERVED_CPU=""
>>> RESERVED_MEM=""
>>> VERSION="4.6.0"
>>>
>>> VIRTUAL MACHINES
>>>
>>> ID USER     GROUP    NAME         STAT UCPU  UMEM HOST     TIME
>>> 26 oneadmin oneadmin fgt6x4-26    runn    6    4G fgtest14 117d 19h50
>>> 27 oneadmin oneadmin fgt5x4-27    runn   10    4G fgtest14 117d 17h57
>>> 28 oneadmin oneadmin fgt1x1-28    runn   10  4.1G fgtest14 117d 17h00
>>> 30 oneadmin oneadmin fgt5x1-30    runn    0    4G fgtest14 116d 23h50
>>> 33 oneadmin oneadmin ip6sl5vda-33 runn    6    4G fgtest14 116d 19h57
>>> ---------------------------------------------------------------------------
>>>
>>> All of this looks great, right?
>>> Just one problem: there are no VMs running on fgtest14, and
>>> there haven't been for 4 days.
>>>
>>> [root at fgtest14 ~]# virsh list
>>> Id Name State
>>> ----------------------------------------------------
>>>
>>> [root at fgtest14 ~]#
>>>
>>> -------------------------------------------------------------------------
>>> Yet the monitoring reports no errors.
>>>
>>> Tue Jul 29 09:28:10 2014 [InM][D]: Host fgtest14 (8)
>>> successfully monitored.
>>>
>>> -----------------------------------------------------------------------------
>>> At the same time, there is no evidence that ONE is actually trying,
>>> or succeeding, to monitor these five VMs, yet they are still stuck
>>> in "runn", which means I can't do a onevm restart to restart them.
>>> (The VM images of these 5 VMs are still out there on the VM host,
>>> and I would like to save and restart them if I can.)
>>>
>>> What is the remotes command that ONE 4.6 would use to monitor this
>>> host? Can I do it manually and see what output I get?
>>>
>>> Are we dealing with some kind of bug, or just a very confused
>>> system? Any help is appreciated. I have to get this sorted out
>>> before I dare deploy ONE 4.x in production.
>>>
>>> Steve Timm
>>>
>>>
>>> ------------------------------------------------------------------
>>> Steven C. Timm, Ph.D (630) 840-8525
>>> timm at fnal.gov http://home.fnal.gov/~timm/
>>> Fermilab Scientific Computing Division, Scientific
>>> Computing Services Quad.
>>> Grid and Cloud Services Dept., Associate Dept. Head
>>> for Cloud Computing
>>> _______________________________________________
>>> Users mailing list
>>> Users at lists.opennebula.org
>>> http://lists.opennebula.org/listinfo.cgi/users-opennebula.org
>>>
>>>
>>>
>>>
>>>
>>> --
>>> --
>>> Ruben S. Montero, PhD
>>> Project co-Lead and Chief Architect OpenNebula - Flexible Enterprise
>>> Cloud Made Simple
>>> www.OpenNebula.org | rsmontero at opennebula.org | @OpenNebula
>>>
>>>
>>>
>> ------------------------------------------------------------------
>> Steven C. Timm, Ph.D (630) 840-8525
>> timm at fnal.gov http://home.fnal.gov/~timm/
>> Fermilab Scientific Computing Division, Scientific Computing Services
>> Quad.
>> Grid and Cloud Services Dept., Associate Dept. Head for Cloud Computing
>>
>
>
>