[one-users] OpenNebula 4.6.0 monitoring question
Ruben S. Montero
rsmontero at opennebula.org
Wed Jul 30 08:29:35 PDT 2014
This seems to be a problem introduced when upgrading the DB. See the
inconsistency in fgtest14:
<RUNNING_VMS>5</RUNNING_VMS>....<VMS></VMS>
That's the reason no action is taken on VM 26: it is not
registered in the host (empty <VMS> set).
I suggest stopping oned and running onedb fsck.
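
A rough sketch of the procedure, assuming the MySQL backend with the
default "opennebula" database name (adjust credentials to your setup,
and consider taking a backup with "onedb backup" first):

  # as oneadmin on the front-end: stop oned and the scheduler
  one stop

  # check and repair pool consistency
  onedb fsck -S localhost -u oneadmin -p <DB password> -d opennebula

  # bring OpenNebula back up
  one start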
Cheers
On Wed, Jul 30, 2014 at 4:44 PM, Steven Timm <timm at fnal.gov> wrote:
> OK--I have now installed the opennebula-node-kvm rpm on
> all of the VM hosts (SURPRISE), made sure that the collectd
> that is running is the current one from OpenNebula 4.6,
> and verified that run_probes kvm-probes can be
> run interactively as oneadmin on all of the nodes. The one on
> fgtest14 correctly reports that there are no running VMs,
> and the two machines that do have running VMs correctly report
> that they do.
>
> The only problem is that the five virtual machines OpenNebula still
> thinks are running on fgtest14 still report back as running, even
> though OpenNebula hasn't made any attempt to monitor them.
>
> How do we get things back into sync and tell OpenNebula that VM #26
> isn't really running anymore? Is there a way to force this VM into the
> "unknown" state so we can do a onevm boot on it? Database hackery
> included? Even better, has someone come up with an XML hacking tool to
> substitute one field in the huge MySQL body column?
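
For what it's worth, the mismatch can be checked straight from the DB
without editing anything, with something like this (a sketch assuming
the MySQL backend and the default "opennebula" database; ExtractValue()
is MySQL's built-in XPath helper):

  mysql -u oneadmin -p opennebula -e "SELECT name,
    ExtractValue(body, '/HOST/HOST_SHARE/RUNNING_VMS') AS running_vms,
    ExtractValue(body, 'count(/HOST/VMS/ID)')          AS vms_listed
    FROM host_pool;"

A host where running_vms and vms_listed disagree (5 vs 0 for fgtest14
in the dump below) is exactly the kind of inconsistency onedb fsck is
meant to repair.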
>
> Even more important: it's clear that the monitoring had been failing
> for a long time, because we didn't have the sudoers file that
> opennebula-node-kvm provides. But there was absolutely no warning of
> that; as far as the head node was concerned, we were happy as a clam.
>
>
> ----
>
> The important pieces of output from run_probes kvm-probes:
>
> fgtest19
> ARCH=x86_64
> MODELNAME="Intel(R) Xeon(R) CPU E5450 @ 3.00GHz"
> HYPERVISOR=kvm
> TOTALCPU=800
> CPUSPEED=2992
> TOTALMEMORY=33010680
> USEDMEMORY=1586216
> FREEMEMORY=31424464
> FREECPU=800.0
> USEDCPU=0.0
> NETRX=5958104400
> NETTX=2323329968
> DS_LOCATION_USED_MB=1924
> DS_LOCATION_TOTAL_MB=280380
> DS_LOCATION_FREE_MB=264129
> DS = [
> ID = 102,
> USED_MB = 1924,
> TOTAL_MB = 280380,
> FREE_MB = 264129
> ]
> HOSTNAME=fgtest19.fnal.gov
> VM_POLL=YES
> VM=[
> ID=55,
> DEPLOY_ID=one-55,
> POLL="NETRX=25289118 USEDCPU=0.0 NETTX=214808 USEDMEMORY=4194304
> STATE=a" ]
> VERSION="4.6.0"
> fgtest20
> ARCH=x86_64
> MODELNAME="Intel(R) Xeon(R) CPU E5450 @ 3.00GHz"
> HYPERVISOR=kvm
> TOTALCPU=800
> CPUSPEED=2992
> TOTALMEMORY=32875804
> USEDMEMORY=8801100
> FREEMEMORY=24074704
> FREECPU=793.6
> USEDCPU=6.39999999999998
> NETRX=184155823062
> NETTX=58685116817
> DS_LOCATION_USED_MB=50049
> DS_LOCATION_TOTAL_MB=281012
> DS_LOCATION_FREE_MB=216499
> DS = [
> ID = 102,
> USED_MB = 50049,
> TOTAL_MB = 281012,
> FREE_MB = 216499
> ]
> HOSTNAME=fgtest20.fnal.gov
> VM_POLL=YES
> VM=[
> ID=31,
> DEPLOY_ID=one-31,
> POLL="NETRX=71728978887 USEDCPU=0.5 NETTX=54281255903 USEDMEMORY=4270812
> STATE=a" ]
> VM=[
> ID=24,
> DEPLOY_ID=one-24,
> POLL="NETRX=2383960153 USEDCPU=0.0 NETTX=17345416 USEDMEMORY=4194304
> STATE=a" ]
> VM=[
> ID=48,
> DEPLOY_ID=one-48,
> POLL="NETRX=2546074171 USEDCPU=0.0 NETTX=145782495 USEDMEMORY=4194304
> STATE=a" ]
> VERSION="4.6.0"
>
> fgtest14
> ARCH=x86_64
> MODELNAME="Intel(R) Xeon(R) CPU E5450 @ 3.00GHz"
> HYPERVISOR=kvm
> TOTALCPU=800
> CPUSPEED=2992
> TOTALMEMORY=24736796
> USEDMEMORY=937004
> FREEMEMORY=23799792
> FREECPU=800.0
> USEDCPU=0.0
> NETRX=285471609
> NETTX=25467521
> DS_LOCATION_USED_MB=179498
> DS_LOCATION_TOTAL_MB=561999
> DS_LOCATION_FREE_MB=353864
> DS = [
> ID = 102,
> USED_MB = 179498,
> TOTAL_MB = 561999,
> FREE_MB = 353864
> ]
>
> -------------------------
> And the appropriate excerpts from oned.log:
>
> /var/log/one/oned.log.20140728111811:Fri Jul 25 15:22:05 2014 [DiM][D]:
> Restarting VM 26
> /var/log/one/oned.log.20140728111811:Fri Jul 25 15:22:05 2014 [DiM][E]:
> Could not restart VM 26, wrong state.
> /var/log/one/oned.log.20140728111811:Fri Jul 25 15:37:48 2014 [DiM][D]:
> Stopping VM 26
> /var/log/one/oned.log.20140728111811:Fri Jul 25 15:37:48 2014 [VMM][D]:
> VM 26 successfully monitored: STATE=-
> -----------------------------------
>
> This is the MySQL row in host_pool for host fgtest14:
> mysql>
> mysql> select * from host_pool where oid=8 \G
> *************************** 1. row ***************************
> oid: 8
> name: fgtest14
> body: <HOST><ID>8</ID><NAME>fgtest14</NAME><STATE>2</STATE>
> <IM_MAD>kvm</IM_MAD><VM_MAD>kvm</VM_MAD><VN_MAD>dummy</VN_MAD>
> <LAST_MON_TIME>1406731190</LAST_MON_TIME><CLUSTER_ID>101</CLUSTER_ID>
> <CLUSTER>ipv6</CLUSTER><HOST_SHARE><DISK_USAGE>0</DISK_USAGE>
> <MEM_USAGE>0</MEM_USAGE><CPU_USAGE>0</CPU_USAGE>
> <MAX_DISK>561999</MAX_DISK><MAX_MEM>24736796</MAX_MEM><MAX_CPU>800</MAX_CPU>
> <FREE_DISK>353864</FREE_DISK><FREE_MEM>23802216</FREE_MEM><FREE_CPU>800</FREE_CPU>
> <USED_DISK>179498</USED_DISK><USED_MEM>934580</USED_MEM><USED_CPU>0</USED_CPU>
> <RUNNING_VMS>5</RUNNING_VMS>
> <DATASTORES><DS><FREE_MB><![CDATA[353864]]></FREE_MB><ID><![CDATA[102]]></ID>
> <TOTAL_MB><![CDATA[561999]]></TOTAL_MB><USED_MB><![CDATA[179498]]></USED_MB>
> </DS></DATASTORES></HOST_SHARE>
> <VMS></VMS>
> <TEMPLATE><ARCH><![CDATA[x86_64]]></ARCH><CPUSPEED><![CDATA[2992]]></CPUSPEED>
> <HOSTNAME><![CDATA[fgtest14.fnal.gov]]></HOSTNAME>
> <HYPERVISOR><![CDATA[kvm]]></HYPERVISOR>
> <MODELNAME><![CDATA[Intel(R) Xeon(R) CPU E5450 @ 3.00GHz]]></MODELNAME>
> <NETRX><![CDATA[285677608]]></NETRX><NETTX><![CDATA[25489275]]></NETTX>
> <RESERVED_CPU><![CDATA[]]></RESERVED_CPU><RESERVED_MEM><![CDATA[]]></RESERVED_MEM>
> <VERSION><![CDATA[4.6.0]]></VERSION></TEMPLATE></HOST>
> state: 2
> last_mon_time: 1406731190
> uid: 0
> gid: 0
> owner_u: 1
> group_u: 0
> other_u: 0
> cid: 101
> 1 row in set (0.00 sec)
>
>
>
> And this is the row in vm_pool for VM ID 26:
>
> *************************** 1. row ***************************
> oid: 26
> name: fgt6x4-26
> body: <VM><ID>26</ID><UID>0</UID><GID>0</GID><UNAME>oneadmin</
> UNAME><GNAME>oneadmin</GNAME><NAME>fgt6x4-26</NAME><
> PERMISSIONS><OWNER_U>1</OWNER_U><OWNER_M>1</OWNER_M><OWNER_
> A>0</OWNER_A><GROUP_U>0</GROUP_U><GROUP_M>0</GROUP_M><
> GROUP_A>0</GROUP_A><OTHER_U>0</OTHER_U><OTHER_M>0</OTHER_M><
> OTHER_A>0</OTHER_A></PERMISSIONS><LAST_POLL>1406320668</LAST_POLL><STATE>
> 3</STATE><LCM_STATE>3</LCM_STATE><RESCHED>0</RESCHED><
> STIME>1396463735</STIME><ETIME>0</ETIME><DEPLOY_ID>one-
> 26</DEPLOY_ID><MEMORY>4194304</MEMORY><CPU>6</CPU><NET_TX>
> 748982286</NET_TX><NET_RX>1588690678</NET_RX><TEMPLATE><
> AUTOMATIC_REQUIREMENTS><![CDATA[CLUSTER_ID = 101 & !(PUBLIC_CLOUD =
> YES)]]></AUTOMATIC_REQUIREMENTS><CONTEXT><CTX_USER><![CDATA[PFVTRVI+
> PElEPjA8L0lEPjxHSUQ+MDwvR0lEPjxHUk9VUFM+PElEPjA8L0lEPjwvR1JPVVBTPjxHTk
> FNRT5vbmVhZG1pbjwvR05BTUU+PE5BTUU+b25lYWRtaW48L05BTUU+
> PFBBU1NXT1JEPjFmNjQxYzdlMzZkZWU5MmUzNDQ0Mjk2NmI1OTYwMGJkMGE3
> ZmU5ZDQ8L1BBU1NXT1JEPjxBVVRIX0RSSVZFUj5jb3JlPC9BVVRIX0RSSVZF
> Uj48RU5BQkxFRD4xPC9FTkFCTEVEPjxURU1QTEFURT48VE9LRU5fUEFTU1dPUkQ+
> PCFbQ0RBVEFbNzFhYzU0OWM5MzhmNjA0NmY3NDEzMDI4Y2ZhOGNjODU2YzI2
> ZGNhNV1dPjwvVE9LRU5fUEFTU1dPUkQ+PC9URU1QTEFURT48REFUQVNUT1JFX1
> FVT1RBPjwvREFUQVNUT1JFX1FVT1RBPjxORVRXT1JLX1FVT1RBPjwvTkVUV0
> 9SS19RVU9UQT48Vk1fUVVPVEE+PC9WTV9RVU9UQT48SU1BR0VfUVVPVE
> E+PC9JTUFHRV9RVU9UQT48L1VTRVI+]]></CTX_USER><DISK_ID><![
> CDATA[2]]></DISK_ID><ETH0_DNS><![CDATA[131.225.0.254]]></
> ETH0_DNS><ETH0_GATEWAY><![CDATA[131.225.41.200]]></ETH0_
> GATEWAY><ETH0_IP><![CDATA[131.225.41.169]]></ETH0_IP><ETH0_
> IPV6><![CDATA[2001:400:2410:29::169]]></ETH0_IPV6><ETH0_
> MAC><![CDATA[00:16:3e:06:06:04]]></ETH0_MAC><ETH0_MASK><![
> CDATA[255.255.255.128]]></ETH0_MASK><FILES><![CDATA[/
> cloud/images/OpenNebula/scripts/one3.2/contextualization/init.sh
> /cloud/images/OpenNebula/scripts/one3.2/contextualization/credentials.sh
> /cloud/images/OpenNebula/scripts/one3.2/contextualization/kerberos.sh]
> ]></FILES><GATEWAY><![CDATA[131.225.41.200]]></GATEWAY><INIT_SCRIPTS><![CDATA[init.sh
> credentials.sh kerberos.sh]]></INIT_SCRIPTS><
> IP_PUBLIC><![CDATA[131.225.41.169]]></IP_PUBLIC><NETMASK><![
> CDATA[255.255.255.128]]></NETMASK><NETWORK><![CDATA[YES]
> ]></NETWORK><ROOT_PUBKEY><![CDATA[id_dsa.pub]]></ROOT_
> PUBKEY><TARGET><![CDATA[hdc]]></TARGET><USERNAME><![CDATA[
> opennebula]]></USERNAME><USER_PUBKEY><![CDATA[id_dsa.pub]]><
> /USER_PUBKEY></CONTEXT><CPU><![CDATA[1]]></CPU><DISK><CLONE>
> <![CDATA[NO]]></CLONE><CLONE_TARGET><![CDATA[SYSTEM]]></
> CLONE_TARGET><CLUSTER_ID><![CDATA[101]]></CLUSTER_ID><
> DATASTORE><![CDATA[ip6_img_ds]]></DATASTORE><DATASTORE_ID><!
> [CDATA[101]]></DATASTORE_ID><DEV_PREFIX><![CDATA[hd]]></
> DEV_PREFIX><DISK_ID><![CDATA[0]]></DISK_ID><IMAGE><![CDATA[
> fgt6x4_os]]></IMAGE><IMAGE_ID><![CDATA[5]]></IMAGE_ID><
> IMAGE_UNAME><![CDATA[oneadmin]]></IMAGE_UNAME><LN_TARGET><![
> CDATA[SYSTEM]]></LN_TARGET><PERSISTENT><![CDATA[YES]]></
> PERSISTENT><READONLY><![CDATA[NO]]></READONLY><SAVE><![
> CDATA[YES]]></SAVE><SIZE><![CDATA[46080]]></SIZE><SOURCE><
> ![CDATA[/var/lib/one//datastores/101/3078b4235100008fbdbf9dff7eea95
> b1]]></SOURCE><TARGET><![CDATA[vda]]></TARGET><TM_MAD><
> ![CDATA[ssh]]></TM_MAD><TYPE><![CDATA[FILE]]></TYPE></DISK><
> DISK><DEV_PREFIX><![CDATA[hd]]></DEV_PREFIX><DISK_ID><![
> CDATA[1]]></DISK_ID><SIZE><![CDATA[5120]]></SIZE><TARGET><!
> [CDATA[vdb]]></TARGET><TYPE><![CDATA[swap]]></TYPE></DISK><
> FEATURES><ACPI><![CDATA[yes]]></ACPI></FEATURES><GRAPHICS><
> AUTOPORT><![CDATA[yes]]></AUTOPORT><KEYMAP><![CDATA[en-
> us]]></KEYMAP><LISTEN><![CDATA[127.0.0.1]]></LISTEN><
> PORT><![CDATA[5926]]></PORT><TYPE><![CDATA[vnc]]></TYPE></
> GRAPHICS><MEMORY><![CDATA[4096]]></MEMORY><NIC><BRIDGE><
> ![CDATA[br0]]></BRIDGE><CLUSTER_ID><![CDATA[101]]></
> CLUSTER_ID><IP><![CDATA[131.225.41.169]]></IP><IP6_LINK><!
> [CDATA[fe80::216:3eff:fe06:604]]></IP6_LINK><MAC><![
> CDATA[00:16:3e:06:06:04]]></MAC><MODEL><![CDATA[virtio]]><
> /MODEL><NETWORK><![CDATA[Static_IPV6_Public]]></
> NETWORK><NETWORK_ID><![CDATA[1]]></NETWORK_ID><NETWORK_
> UNAME><![CDATA[oneadmin]]></NETWORK_UNAME><NIC_ID><![
> CDATA[0]]></NIC_ID><VLAN><![CDATA[NO]]></VLAN></NIC><OS><
> ARCH><![CDATA[x86_64]]></ARCH></OS><RAW><DATA><![CDATA[
> <devices>
> <serial type='pty'>
> <target port='0'/>
> </serial>
> <console type='pty'>
> <target type='serial' port='0'/>
> </console>
>
> </devices>]]></DATA><TYPE><![CDATA[kvm]]></TYPE></RAW><
> TEMPLATE_ID><![CDATA[6]]></TEMPLATE_ID><VCPU><![CDATA[2]]
> ></VCPU><VMID><![CDATA[26]]></VMID></TEMPLATE><USER_TEMPLATE><ERROR><![CDATA[Fri
> Jul 25 15:37:48 2014 : Error saving VM state: Could not save one-26 to
> /var/lib/one/datastores/102/26/checkpoint]]></ERROR><
> NPTYPE><![CDATA[NPERNLM]]></NPTYPE><RANK><![CDATA[
> FREEMEMORY]]></RANK><USERVO><![CDATA[test181818]]></USERVO><
> /USER_TEMPLATE><HISTORY_RECORDS><HISTORY><OID>26</OID>
> <SEQ>0</SEQ><HOSTNAME>fgtest14</HOSTNAME><HID>10</
> HID><CID>101</CID><STIME>1396463752</STIME><ETIME>0</
> ETIME><VMMMAD>kvm</VMMMAD><VNMMAD>dummy</VNMMAD><TMMAD>
> ssh</TMMAD><DS_LOCATION>/var/lib/one/datastores</DS_
> LOCATION><DS_ID>102</DS_ID><PSTIME>1396463752</PSTIME><
> PETIME>1396465032</PETIME><RSTIME>1396465032</RSTIME><
> RETIME>0</RETIME><ESTIME>0</ESTIME><EETIME>0</EETIME><
> REASON>0</REASON><ACTION>0</ACTION></HISTORY></HISTORY_RECORDS></VM>
> uid: 0
> gid: 0
> last_poll: 1406320668
> state: 3
> lcm_state: 3
> owner_u: 1
> group_u: 0
> other_u: 0
> 1 row in set (0.00 sec)
>
>
> -------------------------------
>
>
>
>
> On Wed, 30 Jul 2014, Steven Timm wrote:
>
>> On Wed, 30 Jul 2014, Ruben S. Montero wrote:
>>
>>
>>> Not really sure what could be going on... The monitor scripts return
>>> the information of all VMs running on the node. In 4.6 the
>>> monitoring system uses a push approach, over UDP, so you may have
>>> information being reported by misbehaving monitoring
>>> daemons. Sometimes this happens in dev environments if you are
>>> resetting the DB, ...
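
(For reference, the push setup is driven by the collectd IM_MAD in
oned.conf; on a stock 4.6 install that section should look roughly like
the snippet below, though the UDP port and timing arguments may be
tuned per site. Each node runs a small collectd client, started by the
probes, that pushes its monitoring data back to this port on the
front-end.)

  IM_MAD = [
        name       = "collectd",
        executable = "collectd",
        arguments  = "-p 4124 -f 5 -t 50 -i 20" ]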
>>>
>>
>> When we ran the update to take this database from ONE 4.4 to ONE 4.6,
>> one host (the aforementioned fgtest14) and one datastore (image store
>> 101) got wiped out of the database. I reinserted them both and
>> restarted OpenNebula.
>>
>> Steve Timm
>>
>>
>>
>>
>>
>>> On Jul 28, 2014 6:32 PM, "Steven Timm" <timm at fnal.gov> wrote:
>>>
>>> I am currently dealing with an unexplained monitoring question
>>> in OpenNebula 4.6 on my development cloud.
>>>
>>> I frequently see OpenNebula report the status of a host as "ON",
>>> even in the case of a system misconfiguration where, given the
>>> credentials, it is impossible for OpenNebula to even ssh into the
>>> node as oneadmin.
>>>
>>>
>>> I've fixed all those instances and restarted OpenNebula,
>>> but OpenNebula still reports a number of VMs
>>> in state "running", even though the node they are supposedly running
>>> on was rebooted three days ago and is running no
>>> virtual machines whatsoever.
>>>
>>> I think I could be dealing with database corruption of some type
>>> (generated by the ONE 4.4 -> ONE 4.6 update), or there could
>>> be some problem with the remote scripts on the nodes.
>>> I saw, and I think I fixed, the problems with the database
>>> corruption (namely, one of the hosts and one of the datastores
>>> got knocked out of the database for reasons unknown, and I
>>> re-inserted them). But in any case there is some
>>> error handling in the monitoring that is not working,
>>> and something is exiting with status 0 that shouldn't be.
>>>
>>> Ideas? Has anyone else seen something like this?
>>>
>>> Steve Timm
>>>
>>>
>>>
>>> ------------------------------------------------------------------
>>> Steven C. Timm, Ph.D (630) 840-8525
>>> timm at fnal.gov http://home.fnal.gov/~timm/
>>> Fermilab Scientific Computing Division, Scientific Computing
>>> Services Quad.
>>> Grid and Cloud Services Dept., Associate Dept. Head for Cloud
>>> Computing
>>> _______________________________________________
>>> Users mailing list
>>> Users at lists.opennebula.org
>>> http://lists.opennebula.org/listinfo.cgi/users-opennebula.org
>>>
>>>
>>>
>>>
>> ------------------------------------------------------------------
>> Steven C. Timm, Ph.D (630) 840-8525
>> timm at fnal.gov http://home.fnal.gov/~timm/
>> Fermilab Scientific Computing Division, Scientific Computing Services
>> Quad.
>> Grid and Cloud Services Dept., Associate Dept. Head for Cloud Computing
>>
>>
> ------------------------------------------------------------------
> Steven C. Timm, Ph.D (630) 840-8525
> timm at fnal.gov http://home.fnal.gov/~timm/
> Fermilab Scientific Computing Division, Scientific Computing Services Quad.
> Grid and Cloud Services Dept., Associate Dept. Head for Cloud Computing
>
--
Ruben S. Montero, PhD
Project co-Lead and Chief Architect
OpenNebula - Flexible Enterprise Cloud Made Simple
www.OpenNebula.org | rsmontero at opennebula.org | @OpenNebula