[one-users] OpenNebula 4.6.0 monitoring question

Ruben S. Montero rsmontero at opennebula.org
Wed Jul 30 08:29:35 PDT 2014


This seems to be a problem introduced when upgrading the DB. See the
inconsistency in fgtest14:

<RUNNING_VMS>5</RUNNING_VMS>....<VMS></VMS>

That's the reason you are not seeing any action taken on VM 26: it is not
registered in the host (empty <VMS> set).

I suggest stopping oned and running onedb fsck.
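
The inconsistency above can also be checked mechanically. Below is a minimal
sketch in Python, not part of onedb. It assumes the input is the XML stored in
the body column of host_pool and that registered VMs appear as <ID> children
of <VMS>. The function name host_vm_mismatch and the trimmed sample are made
up for illustration.

```python
import xml.etree.ElementTree as ET


def host_vm_mismatch(body):
    """Compare RUNNING_VMS with the VM IDs listed under VMS in a host body."""
    root = ET.fromstring(body)
    # Counter kept in the host share (what the head node believes is running).
    running = int(root.findtext("HOST_SHARE/RUNNING_VMS", default="0"))
    # Assumption: each registered VM appears as an <ID> child of <VMS>.
    vms = root.find("VMS")
    registered = [] if vms is None else [int(e.text) for e in vms.findall("ID")]
    return running, registered, running != len(registered)


# Hypothetical, trimmed-down version of the fgtest14 row quoted below.
SAMPLE_BODY = (
    "<HOST><ID>8</ID><NAME>fgtest14</NAME>"
    "<HOST_SHARE><RUNNING_VMS>5</RUNNING_VMS></HOST_SHARE>"
    "<VMS></VMS></HOST>"
)

running, registered, mismatch = host_vm_mismatch(SAMPLE_BODY)
print(running, registered, mismatch)
```

For the trimmed sample this reports a counter of 5 against an empty list of
registered VMs, which is the mismatch described above. It only reads the XML;
the repair itself should still be left to onedb fsck.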

Cheers


On Wed, Jul 30, 2014 at 4:44 PM, Steven Timm <timm at fnal.gov> wrote:

> OK--I have now installed the opennebula-node-kvm rpm on
> all of the VM hosts (SURPRISE), made sure that the collectd
> that is running is the current one from opennebula 4.6,
> and verified that the run_probes kvm-probes can
> run interactively as oneadmin on all of the nodes.  The one on
> fgtest14 correctly reports that there are no running VM's,
> and the two machines that do have running vm's correctly report
> that they do have running VM's.
>
> The only problem is that the five virtual machines that opennebula still
> thinks are running on fgtest14 still report back as running, even
> though opennebula hasn't made any attempt to monitor them.
>
> How do we get things back into sync and tell opennebula that VM #26
> isn't really running anymore? Is there a way to force this vm into
> "unknown" state so we can do a onevm boot on it?  Database hackery
> included?  Even better, has someone come up with an XML hacker to
> do the XML substitution of one field in the huge mysql field?
>
> Even more important: it's clear that the monitoring had been
> failing for a long time because we didn't have the sudoers file
> that the opennebula-node-kvm package provides.
> But there was absolutely no warning of that. As far as the
> head node was concerned, we were happy as a clam.
>
>
> ----
>
> The important pieces of output from run_probes kvm-probes
>
> fgtest19
> ARCH=x86_64
> MODELNAME="Intel(R) Xeon(R) CPU           E5450  @ 3.00GHz"
> HYPERVISOR=kvm
> TOTALCPU=800
> CPUSPEED=2992
> TOTALMEMORY=33010680
> USEDMEMORY=1586216
> FREEMEMORY=31424464
> FREECPU=800.0
> USEDCPU=0.0
> NETRX=5958104400
> NETTX=2323329968
> DS_LOCATION_USED_MB=1924
> DS_LOCATION_TOTAL_MB=280380
> DS_LOCATION_FREE_MB=264129
> DS = [
>   ID = 102,
>   USED_MB = 1924,
>   TOTAL_MB = 280380,
>   FREE_MB = 264129
> ]
> HOSTNAME=fgtest19.fnal.gov
> VM_POLL=YES
> VM=[
>   ID=55,
>   DEPLOY_ID=one-55,
>   POLL="NETRX=25289118 USEDCPU=0.0 NETTX=214808 USEDMEMORY=4194304
> STATE=a" ]
> VERSION="4.6.0"
> fgtest20
> ARCH=x86_64
> MODELNAME="Intel(R) Xeon(R) CPU           E5450  @ 3.00GHz"
> HYPERVISOR=kvm
> TOTALCPU=800
> CPUSPEED=2992
> TOTALMEMORY=32875804
> USEDMEMORY=8801100
> FREEMEMORY=24074704
> FREECPU=793.6
> USEDCPU=6.39999999999998
> NETRX=184155823062
> NETTX=58685116817
> DS_LOCATION_USED_MB=50049
> DS_LOCATION_TOTAL_MB=281012
> DS_LOCATION_FREE_MB=216499
> DS = [
>   ID = 102,
>   USED_MB = 50049,
>   TOTAL_MB = 281012,
>   FREE_MB = 216499
> ]
> HOSTNAME=fgtest20.fnal.gov
> VM_POLL=YES
> VM=[
>   ID=31,
>   DEPLOY_ID=one-31,
>   POLL="NETRX=71728978887 USEDCPU=0.5 NETTX=54281255903 USEDMEMORY=4270812
> STATE=a" ]
> VM=[
>   ID=24,
>   DEPLOY_ID=one-24,
>   POLL="NETRX=2383960153 USEDCPU=0.0 NETTX=17345416 USEDMEMORY=4194304
> STATE=a" ]
> VM=[
>   ID=48,
>   DEPLOY_ID=one-48,
>   POLL="NETRX=2546074171 USEDCPU=0.0 NETTX=145782495 USEDMEMORY=4194304
> STATE=a" ]
> VERSION="4.6.0"
>
> fgtest14
> ARCH=x86_64
> MODELNAME="Intel(R) Xeon(R) CPU           E5450  @ 3.00GHz"
> HYPERVISOR=kvm
> TOTALCPU=800
> CPUSPEED=2992
> TOTALMEMORY=24736796
> USEDMEMORY=937004
> FREEMEMORY=23799792
> FREECPU=800.0
> USEDCPU=0.0
> NETRX=285471609
> NETTX=25467521
> DS_LOCATION_USED_MB=179498
> DS_LOCATION_TOTAL_MB=561999
> DS_LOCATION_FREE_MB=353864
> DS = [
>   ID = 102,
>   USED_MB = 179498,
>   TOTAL_MB = 561999,
>   FREE_MB = 353864
> ]
>
> -------------------------
> And the appropriate excerpts from oned.log:
>
> /var/log/one/oned.log.20140728111811:Fri Jul 25 15:22:05 2014 [DiM][D]:
> Restarting VM 26
> /var/log/one/oned.log.20140728111811:Fri Jul 25 15:22:05 2014 [DiM][E]:
> Could not restart VM 26, wrong state.
> /var/log/one/oned.log.20140728111811:Fri Jul 25 15:37:48 2014 [DiM][D]:
> Stopping VM 26
> /var/log/one/oned.log.20140728111811:Fri Jul 25 15:37:48 2014 [VMM][D]:
> VM 26 successfully monitored: STATE=-
> -----------------------------------
>
> This is the mysql row in host_pool for host fgtest14
> mysql>
> mysql> select * from host_pool where oid=8 \G
> *************************** 1. row ***************************
>           oid: 8
>          name: fgtest14
>          body: <HOST><ID>8</ID><NAME>fgtest14</NAME><STATE>2</STATE>
> <IM_MAD>kvm</IM_MAD><VM_MAD>kvm</VM_MAD><VN_MAD>dummy</VN_MAD>
> <LAST_MON_TIME>1406731190</LAST_MON_TIME>
> <CLUSTER_ID>101</CLUSTER_ID><CLUSTER>ipv6</CLUSTER>
> <HOST_SHARE><DISK_USAGE>0</DISK_USAGE><MEM_USAGE>0</MEM_USAGE>
> <CPU_USAGE>0</CPU_USAGE><MAX_DISK>561999</MAX_DISK>
> <MAX_MEM>24736796</MAX_MEM><MAX_CPU>800</MAX_CPU>
> <FREE_DISK>353864</FREE_DISK><FREE_MEM>23802216</FREE_MEM>
> <FREE_CPU>800</FREE_CPU><USED_DISK>179498</USED_DISK>
> <USED_MEM>934580</USED_MEM><USED_CPU>0</USED_CPU>
> <RUNNING_VMS>5</RUNNING_VMS>
> <DATASTORES><DS><FREE_MB><![CDATA[353864]]></FREE_MB>
> <ID><![CDATA[102]]></ID><TOTAL_MB><![CDATA[561999]]></TOTAL_MB>
> <USED_MB><![CDATA[179498]]></USED_MB></DS></DATASTORES></HOST_SHARE>
> <VMS></VMS>
> <TEMPLATE><ARCH><![CDATA[x86_64]]></ARCH>
> <CPUSPEED><![CDATA[2992]]></CPUSPEED>
> <HOSTNAME><![CDATA[fgtest14.fnal.gov]]></HOSTNAME>
> <HYPERVISOR><![CDATA[kvm]]></HYPERVISOR>
> <MODELNAME><![CDATA[Intel(R) Xeon(R) CPU           E5450  @ 3.00GHz]]></MODELNAME>
> <NETRX><![CDATA[285677608]]></NETRX><NETTX><![CDATA[25489275]]></NETTX>
> <RESERVED_CPU><![CDATA[]]></RESERVED_CPU>
> <RESERVED_MEM><![CDATA[]]></RESERVED_MEM>
> <VERSION><![CDATA[4.6.0]]></VERSION></TEMPLATE></HOST>
>         state: 2
> last_mon_time: 1406731190
>           uid: 0
>           gid: 0
>       owner_u: 1
>       group_u: 0
>       other_u: 0
>           cid: 101
> 1 row in set (0.00 sec)
>
>
>
> And this is the row in vm_pool for VM id 26
>
> *************************** 1. row ***************************
>       oid: 26
>      name: fgt6x4-26
>      body: <VM><ID>26</ID><UID>0</UID><GID>0</GID>
> <UNAME>oneadmin</UNAME><GNAME>oneadmin</GNAME><NAME>fgt6x4-26</NAME>
> <PERMISSIONS><OWNER_U>1</OWNER_U><OWNER_M>1</OWNER_M>
> <OWNER_A>0</OWNER_A><GROUP_U>0</GROUP_U><GROUP_M>0</GROUP_M>
> <GROUP_A>0</GROUP_A><OTHER_U>0</OTHER_U><OTHER_M>0</OTHER_M>
> <OTHER_A>0</OTHER_A></PERMISSIONS>
> <LAST_POLL>1406320668</LAST_POLL><STATE>3</STATE>
> <LCM_STATE>3</LCM_STATE><RESCHED>0</RESCHED>
> <STIME>1396463735</STIME><ETIME>0</ETIME>
> <DEPLOY_ID>one-26</DEPLOY_ID><MEMORY>4194304</MEMORY><CPU>6</CPU>
> <NET_TX>748982286</NET_TX><NET_RX>1588690678</NET_RX>
> <TEMPLATE>
> <AUTOMATIC_REQUIREMENTS><![CDATA[CLUSTER_ID = 101 & !(PUBLIC_CLOUD = YES)]]></AUTOMATIC_REQUIREMENTS>
> <CONTEXT><CTX_USER><![CDATA[PFVTRVI+
> PElEPjA8L0lEPjxHSUQ+MDwvR0lEPjxHUk9VUFM+PElEPjA8L0lEPjwvR1JPVVBTPjxHTk
> FNRT5vbmVhZG1pbjwvR05BTUU+PE5BTUU+b25lYWRtaW48L05BTUU+
> PFBBU1NXT1JEPjFmNjQxYzdlMzZkZWU5MmUzNDQ0Mjk2NmI1OTYwMGJkMGE3
> ZmU5ZDQ8L1BBU1NXT1JEPjxBVVRIX0RSSVZFUj5jb3JlPC9BVVRIX0RSSVZF
> Uj48RU5BQkxFRD4xPC9FTkFCTEVEPjxURU1QTEFURT48VE9LRU5fUEFTU1dPUkQ+
> PCFbQ0RBVEFbNzFhYzU0OWM5MzhmNjA0NmY3NDEzMDI4Y2ZhOGNjODU2YzI2
> ZGNhNV1dPjwvVE9LRU5fUEFTU1dPUkQ+PC9URU1QTEFURT48REFUQVNUT1JFX1
> FVT1RBPjwvREFUQVNUT1JFX1FVT1RBPjxORVRXT1JLX1FVT1RBPjwvTkVUV0
> 9SS19RVU9UQT48Vk1fUVVPVEE+PC9WTV9RVU9UQT48SU1BR0VfUVVPVE
> E+PC9JTUFHRV9RVU9UQT48L1VTRVI+]]></CTX_USER>
> <DISK_ID><![CDATA[2]]></DISK_ID>
> <ETH0_DNS><![CDATA[131.225.0.254]]></ETH0_DNS>
> <ETH0_GATEWAY><![CDATA[131.225.41.200]]></ETH0_GATEWAY>
> <ETH0_IP><![CDATA[131.225.41.169]]></ETH0_IP>
> <ETH0_IPV6><![CDATA[2001:400:2410:29::169]]></ETH0_IPV6>
> <ETH0_MAC><![CDATA[00:16:3e:06:06:04]]></ETH0_MAC>
> <ETH0_MASK><![CDATA[255.255.255.128]]></ETH0_MASK>
> <FILES><![CDATA[/cloud/images/OpenNebula/scripts/one3.2/contextualization/init.sh
> /cloud/images/OpenNebula/scripts/one3.2/contextualization/credentials.sh
> /cloud/images/OpenNebula/scripts/one3.2/contextualization/kerberos.sh]]></FILES>
> <GATEWAY><![CDATA[131.225.41.200]]></GATEWAY>
> <INIT_SCRIPTS><![CDATA[init.sh credentials.sh kerberos.sh]]></INIT_SCRIPTS>
> <IP_PUBLIC><![CDATA[131.225.41.169]]></IP_PUBLIC>
> <NETMASK><![CDATA[255.255.255.128]]></NETMASK>
> <NETWORK><![CDATA[YES]]></NETWORK>
> <ROOT_PUBKEY><![CDATA[id_dsa.pub]]></ROOT_PUBKEY>
> <TARGET><![CDATA[hdc]]></TARGET>
> <USERNAME><![CDATA[opennebula]]></USERNAME>
> <USER_PUBKEY><![CDATA[id_dsa.pub]]></USER_PUBKEY></CONTEXT>
> <CPU><![CDATA[1]]></CPU>
> <DISK><CLONE><![CDATA[NO]]></CLONE>
> <CLONE_TARGET><![CDATA[SYSTEM]]></CLONE_TARGET>
> <CLUSTER_ID><![CDATA[101]]></CLUSTER_ID>
> <DATASTORE><![CDATA[ip6_img_ds]]></DATASTORE>
> <DATASTORE_ID><![CDATA[101]]></DATASTORE_ID>
> <DEV_PREFIX><![CDATA[hd]]></DEV_PREFIX>
> <DISK_ID><![CDATA[0]]></DISK_ID>
> <IMAGE><![CDATA[fgt6x4_os]]></IMAGE><IMAGE_ID><![CDATA[5]]></IMAGE_ID>
> <IMAGE_UNAME><![CDATA[oneadmin]]></IMAGE_UNAME>
> <LN_TARGET><![CDATA[SYSTEM]]></LN_TARGET>
> <PERSISTENT><![CDATA[YES]]></PERSISTENT>
> <READONLY><![CDATA[NO]]></READONLY><SAVE><![CDATA[YES]]></SAVE>
> <SIZE><![CDATA[46080]]></SIZE>
> <SOURCE><![CDATA[/var/lib/one//datastores/101/3078b4235100008fbdbf9dff7eea95b1]]></SOURCE>
> <TARGET><![CDATA[vda]]></TARGET><TM_MAD><![CDATA[ssh]]></TM_MAD>
> <TYPE><![CDATA[FILE]]></TYPE></DISK>
> <DISK><DEV_PREFIX><![CDATA[hd]]></DEV_PREFIX>
> <DISK_ID><![CDATA[1]]></DISK_ID><SIZE><![CDATA[5120]]></SIZE>
> <TARGET><![CDATA[vdb]]></TARGET><TYPE><![CDATA[swap]]></TYPE></DISK>
> <FEATURES><ACPI><![CDATA[yes]]></ACPI></FEATURES>
> <GRAPHICS><AUTOPORT><![CDATA[yes]]></AUTOPORT>
> <KEYMAP><![CDATA[en-us]]></KEYMAP>
> <LISTEN><![CDATA[127.0.0.1]]></LISTEN>
> <PORT><![CDATA[5926]]></PORT><TYPE><![CDATA[vnc]]></TYPE></GRAPHICS>
> <MEMORY><![CDATA[4096]]></MEMORY>
> <NIC><BRIDGE><![CDATA[br0]]></BRIDGE>
> <CLUSTER_ID><![CDATA[101]]></CLUSTER_ID>
> <IP><![CDATA[131.225.41.169]]></IP>
> <IP6_LINK><![CDATA[fe80::216:3eff:fe06:604]]></IP6_LINK>
> <MAC><![CDATA[00:16:3e:06:06:04]]></MAC>
> <MODEL><![CDATA[virtio]]></MODEL>
> <NETWORK><![CDATA[Static_IPV6_Public]]></NETWORK>
> <NETWORK_ID><![CDATA[1]]></NETWORK_ID>
> <NETWORK_UNAME><![CDATA[oneadmin]]></NETWORK_UNAME>
> <NIC_ID><![CDATA[0]]></NIC_ID><VLAN><![CDATA[NO]]></VLAN></NIC>
> <OS><ARCH><![CDATA[x86_64]]></ARCH></OS>
> <RAW><DATA><![CDATA[
>                 <devices>
>                 <serial type='pty'>
>                         <target port='0'/>
>                 </serial>
>                 <console type='pty'>
>                 <target type='serial' port='0'/>
>                 </console>
>
> </devices>]]></DATA><TYPE><![CDATA[kvm]]></TYPE></RAW>
> <TEMPLATE_ID><![CDATA[6]]></TEMPLATE_ID><VCPU><![CDATA[2]]></VCPU>
> <VMID><![CDATA[26]]></VMID></TEMPLATE>
> <USER_TEMPLATE><ERROR><![CDATA[Fri Jul 25 15:37:48 2014 : Error saving
> VM state: Could not save one-26 to
> /var/lib/one/datastores/102/26/checkpoint]]></ERROR>
> <NPTYPE><![CDATA[NPERNLM]]></NPTYPE>
> <RANK><![CDATA[FREEMEMORY]]></RANK>
> <USERVO><![CDATA[test181818]]></USERVO></USER_TEMPLATE>
> <HISTORY_RECORDS><HISTORY><OID>26</OID><SEQ>0</SEQ>
> <HOSTNAME>fgtest14</HOSTNAME><HID>10</HID><CID>101</CID>
> <STIME>1396463752</STIME><ETIME>0</ETIME>
> <VMMMAD>kvm</VMMMAD><VNMMAD>dummy</VNMMAD><TMMAD>ssh</TMMAD>
> <DS_LOCATION>/var/lib/one/datastores</DS_LOCATION><DS_ID>102</DS_ID>
> <PSTIME>1396463752</PSTIME><PETIME>1396465032</PETIME>
> <RSTIME>1396465032</RSTIME><RETIME>0</RETIME>
> <ESTIME>0</ESTIME><EETIME>0</EETIME>
> <REASON>0</REASON><ACTION>0</ACTION></HISTORY></HISTORY_RECORDS></VM>
>       uid: 0
>       gid: 0
> last_poll: 1406320668
>     state: 3
> lcm_state: 3
>   owner_u: 1
>   group_u: 0
>   other_u: 0
> 1 row in set (0.00 sec)
>
>
> -------------------------------
>
>
>
>
> On Wed, 30 Jul 2014, Steven Timm wrote:
>
>  On Wed, 30 Jul 2014, Ruben S. Montero wrote:
>>
>>
>>>  Not really sure what can be going on...  The monitor scripts return the
>>>  information of all VMs running in the node.  In 4.6 the monitoring
>>>  system uses a push approach, through UDP, so you may have the
>>>  information being reported by misbehaved monitoring daemons.
>>>  Sometimes this may happen in dev environments if you are
>>>  resetting the DB,...
>>>
>>
>> When we ran the update to take this database from ONE4.4 to ONE4.6, one
>> host (the aforementioned fgtest14) and one datastore (image store 101) got
>> wiped out of the database. I reinserted them both and restarted
>> opennebula.
>>
>> Steve Timm
>>
>>
>>
>>
>>
>>>  On Jul 28, 2014 6:32 PM, "Steven Timm" <timm at fnal.gov> wrote:
>>>
>>>        I am currently dealing with an unexplained monitoring question
>>>        in OpenNebula 4.6 on my development cloud.
>>>
>>>        I frequently see OpenNebula return that the status of a ONe
>>>        host is "ON" even in the case of a system misconfiguration where,
>>>        given the credentials, it is impossible for opennebula to
>>>        even ssh into the node as oneadmin.
>>>
>>>
>>>        I've fixed all those instances, restarted OpenNebula,
>>>        but opennebula still reports a number of VM's
>>>        in state "running" even though the node they are running
>>>        on was rebooted three days ago and is running no
>>>        virtual machines whatsoever.
>>>
>>>        I think I could be dealing with database corruption of some type
>>>        (generated on the one4.4->one4.6 update), or there could
>>>        be some problem with the remote scripts on the nodes.
>>>        I saw, and I think I fixed, the problems with the database
>>>        corruption (namely one of the hosts and one of the datastores
>>>        got knocked out of the database for reasons unknown, and I
>>>        re-inserted them).   But in any case there is some
>>>        error handling that is not working in the monitoring
>>>        and something is exiting with status 0 that shouldn't be.
>>>
>>>        Ideas?  Has anyone else seen something like this?
>>>
>>>        Steve Timm
>>>
>>>
>>>
>>>        ------------------------------------------------------------------
>>>        Steven C. Timm, Ph.D  (630) 840-8525
>>>        timm at fnal.gov  http://home.fnal.gov/~timm/
>>>        Fermilab Scientific Computing Division, Scientific Computing
>>>        Services Quad.
>>>        Grid and Cloud Services Dept., Associate Dept. Head for Cloud
>>>        Computing
>>>
>>>
>>>
>>>
>



-- 
Ruben S. Montero, PhD
Project co-Lead and Chief Architect
OpenNebula - Flexible Enterprise Cloud Made Simple
www.OpenNebula.org | rsmontero at opennebula.org | @OpenNebula