[one-users] OpenNebula 4.6.0 monitoring question

Wed Jul 30 07:44:53 PDT 2014

OK--I have now installed the opennebula-node-kvm rpm on
all of the VM hosts (SURPRISE), made sure that the collectd
that is running is the current one from OpenNebula 4.6,
and verified that run_probes kvm-probes can
run interactively as oneadmin on all of the nodes.  The one on
fgtest14 correctly reports that there are no running VMs,
and the two machines that do have running VMs correctly report
them.

The only problem is that the five virtual machines that OpenNebula still
thinks are running on fgtest14 still show up as running, even though
OpenNebula hasn't made any attempt to monitor them.

How do we get things back into sync and tell OpenNebula that VM #26
isn't really running anymore? Is there a way to force this VM into
"unknown" state so we can do a onevm boot on it?  Database hackery
included?  Even better, has someone come up with an XML hacker to
do the XML substitution of one field in the huge mysql field?
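
A minimal sketch of the kind of one-field substitution asked about above, in Python. It is a plain text replacement of the first matching element rather than a real XML parse, so the CDATA sections and everything else in the body stay byte-for-byte identical. The function name is made up for illustration, and the value 16 for LCM_STATE "unknown" is an assumption that should be checked against the 4.6 sources before touching a live database. The vm_pool table also has separate state and lcm_state columns, which would presumably have to be kept consistent with the body, with oned stopped.

```python
import re

def replace_first_element(body: str, tag: str, new_value: str) -> str:
    """Replace the text of the first <tag>...</tag> element in body.

    Hypothetical helper: purely textual, so all other bytes are untouched.
    """
    pattern = re.compile(r"<{0}>[^<]*</{0}>".format(re.escape(tag)))
    replacement = "<{0}>{1}</{0}>".format(tag, new_value)
    # A callable replacement avoids backslash interpretation in new_value.
    new_body, count = pattern.subn(lambda m: replacement, body, count=1)
    if count != 1:
        raise ValueError("element <%s> not found" % tag)
    return new_body

# Shortened stand-in for the body column of VM 26.
sample = (
    "<VM><ID>26</ID><STATE>3</STATE><LCM_STATE>3</LCM_STATE>"
    "<TEMPLATE><VMID><![CDATA[26]]></VMID></TEMPLATE></VM>"
)

# ASSUMPTION: 16 is the LCM_STATE code for "unknown"; verify before use.
patched = replace_first_element(sample, "LCM_STATE", "16")
print(patched)
```

The patched string would then go back into the body column with an ordinary UPDATE; taking a mysqldump of vm_pool first seems prudent.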

Even more important:  it's clear that the monitoring had been
failing for a long time because we didn't have the
sudoers file that the opennebula-node-kvm rpm provides.
But there was absolutely no warning of that.  As far as the
head node was concerned, we were happy as a clam.

----

The important pieces of output from run_probes kvm-probes

fgtest19
ARCH=x86_64
MODELNAME="Intel(R) Xeon(R) CPU           E5450  @ 3.00GHz"
HYPERVISOR=kvm
TOTALCPU=800
CPUSPEED=2992
TOTALMEMORY=33010680
USEDMEMORY=1586216
FREEMEMORY=31424464
FREECPU=800.0
USEDCPU=0.0
NETRX=5958104400
NETTX=2323329968
DS_LOCATION_USED_MB=1924
DS_LOCATION_TOTAL_MB=280380
DS_LOCATION_FREE_MB=264129
DS = [
   ID = 102,
   USED_MB = 1924,
   TOTAL_MB = 280380,
   FREE_MB = 264129
]
HOSTNAME=fgtest19.fnal.gov
VM_POLL=YES
VM=[
   ID=55,
   DEPLOY_ID=one-55,
   POLL="NETRX=25289118 USEDCPU=0.0 NETTX=214808 USEDMEMORY=4194304 STATE=a" ]
VERSION="4.6.0"
fgtest20
ARCH=x86_64
MODELNAME="Intel(R) Xeon(R) CPU           E5450  @ 3.00GHz"
HYPERVISOR=kvm
TOTALCPU=800
CPUSPEED=2992
TOTALMEMORY=32875804
USEDMEMORY=8801100
FREEMEMORY=24074704
FREECPU=793.6
USEDCPU=6.39999999999998
NETRX=184155823062
NETTX=58685116817
DS_LOCATION_USED_MB=50049
DS_LOCATION_TOTAL_MB=281012
DS_LOCATION_FREE_MB=216499
DS = [
   ID = 102,
   USED_MB = 50049,
   TOTAL_MB = 281012,
   FREE_MB = 216499
]
HOSTNAME=fgtest20.fnal.gov
VM_POLL=YES
VM=[
   ID=31,
   DEPLOY_ID=one-31,
   POLL="NETRX=71728978887 USEDCPU=0.5 NETTX=54281255903 USEDMEMORY=4270812 STATE=a" ]
VM=[
   ID=24,
   DEPLOY_ID=one-24,
   POLL="NETRX=2383960153 USEDCPU=0.0 NETTX=17345416 USEDMEMORY=4194304 STATE=a" ]
VM=[
   ID=48,
   DEPLOY_ID=one-48,
   POLL="NETRX=2546074171 USEDCPU=0.0 NETTX=145782495 USEDMEMORY=4194304 STATE=a" ]
VERSION="4.6.0"

fgtest14
ARCH=x86_64
MODELNAME="Intel(R) Xeon(R) CPU           E5450  @ 3.00GHz"
HYPERVISOR=kvm
TOTALCPU=800
CPUSPEED=2992
TOTALMEMORY=24736796
USEDMEMORY=937004
FREEMEMORY=23799792
FREECPU=800.0
USEDCPU=0.0
NETRX=285471609
NETTX=25467521
DS_LOCATION_USED_MB=179498
DS_LOCATION_TOTAL_MB=561999
DS_LOCATION_FREE_MB=353864
DS = [
   ID = 102,
   USED_MB = 179498,
   TOTAL_MB = 561999,
   FREE_MB = 353864
]
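
To make the mismatch concrete, here is a rough Python sketch that pulls the VM IDs out of run_probes output like the above and compares them with what the head node believes. The function name and the believed_running set are made up for illustration; only VM 26 is named in this message, so the set holds just that one ID. The samples are trimmed copies of the fgtest19 and fgtest14 output.

```python
import re

def reported_vm_ids(probe_output: str) -> set:
    """Return the numeric ID of every VM=[ ... ] block in run_probes output."""
    ids = set()
    # Each VM block starts at the beginning of a line and ends at the first ']'.
    for block in re.findall(r"^VM=\[(.*?)\]", probe_output, flags=re.S | re.M):
        # \b keeps DEPLOY_ID=one-55 from matching; only the bare ID= field does.
        match = re.search(r"\bID=(\d+)", block)
        if match:
            ids.add(int(match.group(1)))
    return ids

FGTEST19_SAMPLE = """HOSTNAME=fgtest19.fnal.gov
VM_POLL=YES
VM=[
   ID=55,
   DEPLOY_ID=one-55,
   POLL="NETRX=25289118 USEDCPU=0.0 NETTX=214808 USEDMEMORY=4194304 STATE=a" ]
VERSION="4.6.0"
"""

FGTEST14_SAMPLE = """DS = [
   ID = 102,
   USED_MB = 179498,
   TOTAL_MB = 561999,
   FREE_MB = 353864
]
"""

# Hypothetical: the VM IDs the head node still lists as running on fgtest14.
believed_running = {26}

stale = believed_running - reported_vm_ids(FGTEST14_SAMPLE)
print("reported on fgtest19:", sorted(reported_vm_ids(FGTEST19_SAMPLE)))
print("believed running on fgtest14 but not reported:", sorted(stale))
```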

-------------------------
And the appropriate excerpts from oned.log:

/var/log/one/oned.log.20140728111811:Fri Jul 25 15:22:05 2014 [DiM][D]: Restarting VM 26
/var/log/one/oned.log.20140728111811:Fri Jul 25 15:22:05 2014 [DiM][E]: Could not restart VM 26, wrong state.
/var/log/one/oned.log.20140728111811:Fri Jul 25 15:37:48 2014 [DiM][D]: Stopping VM 26
/var/log/one/oned.log.20140728111811:Fri Jul 25 15:37:48 2014 [VMM][D]: VM 26 successfully monitored: STATE=-
-----------------------------------

This is the mysql row in host_pool for host fgtest14
mysql>
mysql> select * from host_pool where oid=8 \G
*************************** 1. row ***************************
           oid: 8
          name: fgtest14
          body: 
<HOST><ID>8</ID><NAME>fgtest14</NAME><STATE>2</STATE><IM_MAD>kvm</IM_MAD><VM_MAD>kvm</VM_MAD><VN_MAD>dummy</VN_MAD><LAST_MON_TIME>1406731190</LAST_MON_TIME><CLUSTER_ID>101</CLUSTER_ID><CLUSTER>ipv6</CLUSTER><HOST_SHARE><DISK_USAGE>0</DISK_USAGE><MEM_USAGE>0</MEM_USAGE><CPU_USAGE>0</CPU_USAGE><MAX_DISK>561999</MAX_DISK><MAX_MEM>24736796</MAX_MEM><MAX_CPU>800</MAX_CPU><FREE_DISK>353864</FREE_DISK><FREE_MEM>23802216</FREE_MEM><FREE_CPU>800</FREE_CPU><USED_DISK>179498</USED_DISK><USED_MEM>934580</USED_MEM><USED_CPU>0</USED_CPU><RUNNING_VMS>5</RUNNING_VMS><DATASTORES><DS><FREE_MB><![CDATA[353864]]></FREE_MB><ID><![CDATA[102]]></ID><TOTAL_MB><![CDATA[561999]]></TOTAL_MB><USED_MB><![CDATA[179498]]></USED_MB></DS></DATASTORES></HOST_SHARE><VMS></VMS><TEMPLATE><ARCH><![CDATA[x86_64]]></ARCH><CPUSPEED><![CDATA[2992]]></CPUSPEED><HOSTNAME><![CDATA[fgtest14.fnal.gov]]></HOSTNAME><HYPERVISOR><![CDATA[kvm]]></HYPERVISOR><MODELNAME><![CDATA[Intel(R) 
Xeon(R) CPU           E5450  @ 
3.00GHz]]></MODELNAME><NETRX><![CDATA[285677608]]></NETRX><NETTX><![CDATA[25489275]]></NETTX><RESERVED_CPU><![CDATA[]]></RESERVED_CPU><RESERVED_MEM><![CDATA[]]></RESERVED_MEM><VERSION><![CDATA[4.6.0]]></VERSION></TEMPLATE></HOST>
         state: 2
last_mon_time: 1406731190
           uid: 0
           gid: 0
       owner_u: 1
       group_u: 0
       other_u: 0
           cid: 101
1 row in set (0.00 sec)

And this is the row in vm_pool for VM id 26

*************************** 1. row ***************************
       oid: 26
      name: fgt6x4-26
      body: 
<VM><ID>26</ID><UID>0</UID><GID>0</GID><UNAME>oneadmin</UNAME><GNAME>oneadmin</GNAME><NAME>fgt6x4-26</NAME><PERMISSIONS><OWNER_U>1</OWNER_U><OWNER_M>1</OWNER_M><OWNER_A>0</OWNER_A><GROUP_U>0</GROUP_U><GROUP_M>0</GROUP_M><GROUP_A>0</GROUP_A><OTHER_U>0</OTHER_U><OTHER_M>0</OTHER_M><OTHER_A>0</OTHER_A></PERMISSIONS><LAST_POLL>1406320668</LAST_POLL><STATE>3</STATE><LCM_STATE>3</LCM_STATE><RESCHED>0</RESCHED><STIME>1396463735</STIME><ETIME>0</ETIME><DEPLOY_ID>one-26</DEPLOY_ID><MEMORY>4194304</MEMORY><CPU>6</CPU><NET_TX>748982286</NET_TX><NET_RX>1588690678</NET_RX><TEMPLATE><AUTOMATIC_REQUIREMENTS><![CDATA[CLUSTER_ID 
= 101 & !(PUBLIC_CLOUD = 
YES)]]></AUTOMATIC_REQUIREMENTS><CONTEXT><CTX_USER><![CDATA[PFVTRVI+PElEPjA8L0lEPjxHSUQ+MDwvR0lEPjxHUk9VUFM+PElEPjA8L0lEPjwvR1JPVVBTPjxHTkFNRT5vbmVhZG1pbjwvR05BTUU+PE5BTUU+b25lYWRtaW48L05BTUU+PFBBU1NXT1JEPjFmNjQxYzdlMzZkZWU5MmUzNDQ0Mjk2NmI1OTYwMGJkMGE3ZmU5ZDQ8L1BBU1NXT1JEPjxBVVRIX0RSSVZFUj5jb3JlPC9BVVRIX0RSSVZFUj48RU5BQkxFRD4xPC9FTkFCTEVEPjxURU1QTEFURT48VE9LRU5fUEFTU1dPUkQ+PCFbQ0RBVEFbNzFhYzU0OWM5MzhmNjA0NmY3NDEzMDI4Y2ZhOGNjODU2YzI2ZGNhNV1dPjwvVE9LRU5fUEFTU1dPUkQ+PC9URU1QTEFURT48REFUQVNUT1JFX1FVT1RBPjwvREFUQVNUT1JFX1FVT1RBPjxORVRXT1JLX1FVT1RBPjwvTkVUV09SS19RVU9UQT48Vk1fUVVPVEE+PC9WTV9RVU9UQT48SU1BR0VfUVVPVEE+PC9JTUFHRV9RVU9UQT48L1VTRVI+]]></CTX_USER><DISK_ID><![CDATA[2]]></DISK_ID><ETH0_DNS><![CDATA[131.225.0.254]]></ETH0_DNS><ETH0_GATEWAY><![CDATA[131.225.41.200]]></ETH0_GATEWAY><ETH0_IP><![CDATA[131.225.41.169]]></ETH0_IP><ETH0_IPV6><![CDATA[2001:400:2410:29::169]]></ETH0_IPV6><ETH0_MAC><![CDATA[00:16:3e:06:06:04]]></ETH0_MAC><ETH0_MASK><![CDATA[255.255.255.128]]></ETH0_MASK><FILES><![CDATA[/cloud/images/OpenNebula/scripts/one3.2/contextualization/init.sh 
/cloud/images/OpenNebula/scripts/one3.2/contextualization/credentials.sh 
/cloud/images/OpenNebula/scripts/one3.2/contextualization/kerberos.sh]]></FILES><GATEWAY><![CDATA[131.225.41.200]]></GATEWAY><INIT_SCRIPTS><![CDATA[init.sh 
credentials.sh 
kerberos.sh]]></INIT_SCRIPTS><IP_PUBLIC><![CDATA[131.225.41.169]]></IP_PUBLIC><NETMASK><![CDATA[255.255.255.128]]></NETMASK><NETWORK><![CDATA[YES]]></NETWORK><ROOT_PUBKEY><![CDATA[id_dsa.pub]]></ROOT_PUBKEY><TARGET><![CDATA[hdc]]></TARGET><USERNAME><![CDATA[opennebula]]></USERNAME><USER_PUBKEY><![CDATA[id_dsa.pub]]></USER_PUBKEY></CONTEXT><CPU><![CDATA[1]]></CPU><DISK><CLONE><![CDATA[NO]]></CLONE><CLONE_TARGET><![CDATA[SYSTEM]]></CLONE_TARGET><CLUSTER_ID><![CDATA[101]]></CLUSTER_ID><DATASTORE><![CDATA[ip6_img_ds]]></DATASTORE><DATASTORE_ID><![CDATA[101]]></DATASTORE_ID><DEV_PREFIX><![CDATA[hd]]></DEV_PREFIX><DISK_ID><![CDATA[0]]></DISK_ID><IMAGE><![CDATA[fgt6x4_os]]></IMAGE><IMAGE_ID><![CDATA[5]]></IMAGE_ID><IMAGE_UNAME><![CDATA[oneadmin]]></IMAGE_UNAME><LN_TARGET><![CDATA[SYSTEM]]></LN_TARGET><PERSISTENT><![CDATA[YES]]></PERSISTENT><READONLY><![CDATA[NO]]></READONLY><SAVE><![CDATA[YES]]></SAVE><SIZE><![CDATA[46080]]></SIZE><SOURCE><![CDATA[/var/lib/one//datastores/101/3078b4235100008fbdbf9dff7eea95b1]]></SOURCE><TARGET><![CDATA[vda]]></TARGET><TM_MAD><![CDATA[ssh]]></TM_MAD><TYPE><![CDATA[FILE]]></TYPE></DISK><DISK><DEV_PREFIX><![CDATA[hd]]></DEV_PREFIX><DISK_ID><![CDATA[1]]></DISK_ID><SIZE><![CDATA[5120]]></SIZE><TARGET><![CDATA[vdb]]></TARGET><TYPE><![CDATA[swap]]></TYPE></DISK><FEATURES><ACPI><![CDATA[yes]]></ACPI></FEATURES><GRAPHICS><AUTOPORT><![CDATA[yes]]></AUTOPORT><KEYMAP><![CDATA[en-us]]></KEYMAP><LISTEN><![CDATA[127.0.0.1]]></LISTEN><PORT><![CDATA[5926]]></PORT><TYPE><![CDATA[vnc]]></TYPE></GRAPHICS><MEMORY><![CDATA[4096]]></MEMORY><NIC><BRIDGE><![CDATA[br0]]></BRIDGE><CLUSTER_ID><![CDATA[101]]></CLUSTER_ID><IP><![CDATA[131.225.41.169]]></IP><IP6_LINK><![CDATA[fe80::216:3eff:fe06:604]]></IP6_LINK><MAC><![CDATA[00:16:3e:06:06:04]]></MAC><MODEL><![CDATA[virtio]]></MODEL><NETWORK><![CDATA[Static_IPV6_Public]]></NETWORK><NETWORK_ID><![CDATA[1]]></NETWORK_ID><NETWORK_UNAME><![CDATA[oneadmin]]></NETWORK_UNAME><NIC_ID><![CDATA[0]]></NIC_ID><VLAN><![CDATA[NO]]><
/VLAN></NIC><OS><ARCH><![CDATA[x86_64]]></ARCH></OS><RAW><DATA><![CDATA[
                 <devices>
                 <serial type='pty'>
                         <target port='0'/>
                 </serial>
                 <console type='pty'>
                 <target type='serial' port='0'/>
                 </console>

</devices>]]></DATA><TYPE><![CDATA[kvm]]></TYPE></RAW><TEMPLATE_ID><![CDATA[6]]></TEMPLATE_ID><VCPU><![CDATA[2]]></VCPU><VMID><![CDATA[26]]></VMID></TEMPLATE><USER_TEMPLATE><ERROR><![CDATA[Fri 
Jul 25 15:37:48 2014 : Error saving VM state: Could not save one-26 to 
/var/lib/one/datastores/102/26/checkpoint]]></ERROR><NPTYPE><![CDATA[NPERNLM]]></NPTYPE><RANK><![CDATA[FREEMEMORY]]></RANK><USERVO><![CDATA[test181818]]></USERVO></USER_TEMPLATE><HISTORY_RECORDS><HISTORY><OID>26</OID><SEQ>0</SEQ><HOSTNAME>fgtest14</HOSTNAME><HID>10</HID><CID>101</CID><STIME>1396463752</STIME><ETIME>0</ETIME><VMMMAD>kvm</VMMMAD><VNMMAD>dummy</VNMMAD><TMMAD>ssh</TMMAD><DS_LOCATION>/var/lib/one/datastores</DS_LOCATION><DS_ID>102</DS_ID><PSTIME>1396463752</PSTIME><PETIME>1396465032</PETIME><RSTIME>1396465032</RSTIME><RETIME>0</RETIME><ESTIME>0</ESTIME><EETIME>0</EETIME><REASON>0</REASON><ACTION>0</ACTION></HISTORY></HISTORY_RECORDS></VM>
       uid: 0
       gid: 0
last_poll: 1406320668
     state: 3
lcm_state: 3
   owner_u: 1
   group_u: 0
   other_u: 0
1 row in set (0.00 sec)
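
For reference, the two timestamps in the rows above can be compared directly. The small Python sketch below uses the last_mon_time of host fgtest14 and the last_poll of VM 26 exactly as dumped. The helper names are made up for illustration.

```python
from datetime import datetime, timezone

HOST_LAST_MON_TIME = 1406731190  # host_pool.last_mon_time for fgtest14
VM_LAST_POLL = 1406320668        # vm_pool.last_poll for VM 26

def to_utc(epoch_seconds: int) -> datetime:
    """Convert a Unix timestamp to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)

def poll_lag_seconds(host_last_mon: int, vm_last_poll: int) -> int:
    """Seconds by which the VM's last poll trails the host's last monitoring."""
    return host_last_mon - vm_last_poll

lag = poll_lag_seconds(HOST_LAST_MON_TIME, VM_LAST_POLL)
print("host last monitored:", to_utc(HOST_LAST_MON_TIME).isoformat())
print("VM 26 last polled:  ", to_utc(VM_LAST_POLL).isoformat())
print("lag in days: %.2f" % (lag / 86400.0))
```

The VM's last poll lines up with the 15:37:48 "successfully monitored: STATE=-" entry in oned.log (the log is in local time), while the host itself kept being monitored for days afterwards.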

-------------------------------

On Wed, 30 Jul 2014, Steven Timm wrote:

> On Wed, 30 Jul 2014, Ruben S. Montero wrote:
>
>>
>>  Not really sure what can be going on...  The monitor scripts return the
>>  information of all VMs running in the node.  In 4.6 the
>>  monitoring system uses a push approach,  through UDP,  so you may have the
>>  information being reported by misbehaved monitoring
>>  daemons.  Sometimes this may happen in dev environments if you are
>>  resetting the DB,... 
>
> when we ran the update to take this database from ONE4.4 to ONE4.6, one host 
> (the aforementioned fgtest14) and one datastore (image store 101) got
> wiped out of the database, I reinserted them both back in and restarted 
> opennebula.
>
> Steve Timm
>
>
>
>
>>
>>  On Jul 28, 2014 6:32 PM, "Steven Timm" <timm at fnal.gov> wrote:
>>
>>        I am currently dealing with an unexplained monitoring question
>>        in OpenNebula 4.6 on my development cloud.
>>
>>        I frequently see OpenNebula return that the status of a ONe
>>        host is "ON" even in the case of a system misconfiguration where,
>>        given the credentials, it is impossible for opennebula to
>>        even ssh into the node as oneadmin.
>> 
>>
>>        I've fixed all those instances, restarted OpenNebula,
>>        but opennebula still reports a number of VM's
>>        in state "running" even though the node they are running
>>        on was rebooted three days ago and is running no
>>        virtual machines whatsoever.
>>
>>        I think I could be dealing with database corruption of some type
>>        (generated on the one4.4->one4.6 update), or there could
>>        be some problem with the remote scripts on the nodes.
>>        I saw, and I think I fixed, the problems with the database
>>        corruption (namely one of the hosts and one of the datastores
>>        got knocked out of the database for reasons unknown, and I
>>        re-inserted them).   But in any case there is some
>>        error handling that is not working in the monitoring
>>        and something is exiting with status 0 that shouldn't be.
>>
>>        ideas?  Has anyone else seen something like this?
>>
>>        Steve Timm
>>
>> 
>>
>>        ------------------------------------------------------------------
>>        Steven C. Timm, Ph.D  (630) 840-8525
>>        timm at fnal.gov  http://home.fnal.gov/~timm/
>>        Fermilab Scientific Computing Division, Scientific Computing
>>        Services Quad.
>>        Grid and Cloud Services Dept., Associate Dept. Head for Cloud
>>        Computing
>>        _______________________________________________
>>        Users mailing list
>>        Users at lists.opennebula.org
>>        http://lists.opennebula.org/listinfo.cgi/users-opennebula.org
>>
>> 
>> 
>
>

------------------------------------------------------------------
Steven C. Timm, Ph.D  (630) 840-8525
timm at fnal.gov  http://home.fnal.gov/~timm/
Fermilab Scientific Computing Division, Scientific Computing Services Quad.
Grid and Cloud Services Dept., Associate Dept. Head for Cloud Computing