[one-users] OpenNebula 4.6.0 monitoring question
Steven Timm
timm at fnal.gov
Wed Jul 30 11:53:47 PDT 2014
Thanks Ruben.
onedb fsck turned up and fixed a bunch of problems, including the main
one: fgtest14 had once been host ID 10 and I had mistakenly
re-inserted it into the db as host ID 8. I had to manually modify the
mysql rows for those 5 entries in vm_pool to change the <HID> from
10 to 8, but once I did, opennebula finally detected
that they were down and now shows them as UNKN.
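
A minimal sketch of the kind of edit described above, assuming the VM
body stores the host reference as <HID>10</HID> inside its HISTORY
record (as in the vm_pool row quoted later in this thread). The helper
name replace_hid and the cut-down sample body are hypothetical. The
replacement is done on the raw text so that CDATA sections, quotes and
newlines elsewhere in the body are left untouched; the result is parsed
only to confirm it is still well-formed XML.

```python
import re
import xml.etree.ElementTree as ET


def replace_hid(body: str, old_hid: int, new_hid: int) -> str:
    """Return body with every <HID>old_hid</HID> rewritten to new_hid.

    Hypothetical helper: works on the raw text so the rest of the
    document (CDATA, quotes, newlines) is preserved as-is.
    """
    pattern = re.compile(r"<HID>\s*{}\s*</HID>".format(old_hid))
    new_body, count = pattern.subn("<HID>{}</HID>".format(new_hid), body)
    if count == 0:
        raise ValueError("no <HID>{}</HID> element found".format(old_hid))
    # Parse only as a sanity check; raises ParseError if the XML is broken.
    ET.fromstring(new_body)
    return new_body


# Cut-down, made-up body in the shape of a vm_pool row.
sample_body = (
    "<VM><ID>26</ID><TEMPLATE><RAW><DATA><![CDATA[\n"
    "<devices>\n  <serial type='pty'/>\n</devices>]]></DATA></RAW></TEMPLATE>"
    "<HISTORY_RECORDS><HISTORY><HOSTNAME>fgtest14</HOSTNAME>"
    "<HID>10</HID></HISTORY></HISTORY_RECORDS></VM>"
)

fixed_body = replace_hid(sample_body, 10, 8)
print(fixed_body)
```

Working on the text rather than re-serializing a parsed tree is a
deliberate choice here: Python's ElementTree does not keep CDATA
sections when writing a document back out.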
There is one remaining problem and that is the following:
To successfully modify the BODY field in the vm_pool of the mysql database
it was necessary to strip out some newlines and single quotes that
were in the XML and so now I have XML that doesn't actually work to
start a VM.
(I ran a mysql command of the form
update vm_pool set body='a bunch of xml' where oid=nnn;
and mysql accepted neither the newlines nor the single quotes inside
that string. That's a problem because some of the things we are using
need single quotes and maybe newlines too.)
Does anyone have an XML editor that can more easily modify the
text of the body field in the opennebula database?
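
One possible way around the quoting problem, sketched under
assumptions: pass the XML as a bound parameter instead of pasting it
into the SQL text, so the database driver does the quoting. The example
uses Python's built-in sqlite3 module with an in-memory table as a
stand-in for the real MySQL vm_pool table (which has more columns); a
MySQL driver for Python would commonly use %s placeholders where
sqlite3 uses ?. The table contents here are made up.

```python
import sqlite3

# XML fragment with the single quotes and newlines that had to be stripped.
raw_xml = (
    "<devices>\n"
    "  <serial type='pty'>\n"
    "    <target port='0'/>\n"
    "  </serial>\n"
    "</devices>"
)

# In-memory SQLite table standing in for vm_pool (assumption: the real
# table lives in MySQL and has more columns).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE vm_pool (oid INTEGER PRIMARY KEY, body TEXT)")
conn.execute("INSERT INTO vm_pool (oid, body) VALUES (?, ?)", (26, "<VM/>"))

# The value is passed separately from the SQL text, so the driver handles
# quoting; nothing in the XML needs to be stripped or escaped by hand.
conn.execute("UPDATE vm_pool SET body = ? WHERE oid = ?", (raw_xml, 26))
conn.commit()

stored = conn.execute(
    "SELECT body FROM vm_pool WHERE oid = ?", (26,)
).fetchone()[0]
print(stored == raw_xml)
```

As with any direct edit of the OpenNebula database, this would be done
with oned stopped and a backup of the table at hand.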
Steve Timm
P.S. Before the edit, the XML in question looked like this:
>
> <devices>
> <serial type='pty'>
> <target port='0'/>
> </serial>
> <console type='pty'>
> <target type='serial' port='0'/>
> </console>
> </devices>
And now it looks like this:
< <devices> <serial type=pty>
<target port=0/> </serial> <console type=pty> <target type=serial port=0/>
</console> </devices>
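
The difference between the two versions can be checked mechanically.
The sketch below, using only Python's standard library, parses both
fragments (copied from above without the leading markers); the stripped
version is rejected because XML requires attribute values to be quoted.
The helper name is_well_formed is made up for this example.

```python
import xml.etree.ElementTree as ET

# Fragment as it looked before the quotes and newlines were stripped.
before = """<devices>
<serial type='pty'>
<target port='0'/>
</serial>
<console type='pty'>
<target type='serial' port='0'/>
</console>
</devices>"""

# Fragment as it looks after stripping.
after = (
    "<devices> <serial type=pty> <target port=0/> </serial> "
    "<console type=pty> <target type=serial port=0/> </console> </devices>"
)


def is_well_formed(xml_text: str) -> bool:
    """Return True if xml_text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


print("before:", is_well_formed(before))
print("after: ", is_well_formed(after))
```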
---
Steve Timm
On Wed, 30 Jul 2014, Ruben S. Montero wrote:
> This seems to be a problem from upgrading the DB. See the inconsistency in fgtest14:
> <RUNNING_VMS>5</RUNNING_VMS>....<VMS></VMS>
>
> That's the reason you are not seeing any action taken on VM 26: it is not registered in the host (empty <VMS> set).
>
> I suggest stopping oned and running onedb fsck.
>
> Cheers
>
>
> On Wed, Jul 30, 2014 at 4:44 PM, Steven Timm <timm at fnal.gov> wrote:
> OK--I have now installed the opennebula-node-kvm rpm on
> all of the VM hosts (SURPRISE), made sure that the collectd
> that is running is the current one from opennebula 4.6,
> and verified that the run_probes kvm-probes can
> run interactively as oneadmin on all of the nodes. The one on
> fgtest14 correctly reports that there are no running VM's,
> and the two machines that do have running vm's correctly report
> that they do have running VM's.
>
> Only problem is, the five virtual machines that opennebula still thinks
> are running on fgtest14 still report back as running,
> even though opennebula hasn't made any attempt to monitor them.
>
> How do we get things back into sync and tell opennebula that VM #26
> isn't really running anymore? Is there a way to force this VM into "unknown" state so we can do a onevm boot on it?
> Database hackery included? Even better, has someone come up with an XML hacker to
> do the XML substitution of one field in the huge mysql field?
>
> Even more important: it's clear that the monitoring was obviously
> failing and failing for a long time because we didn't have the
> sudoers file there that the opennebula-node-kvm provides.
> But there was absolutely no warning of that. As far as the
> head node was concerned we were happy as a clam.
>
>
> ----
>
> The important pieces of output from run_probes kvm-probes
>
> fgtest19
> ARCH=x86_64
> MODELNAME="Intel(R) Xeon(R) CPU E5450 @ 3.00GHz"
> HYPERVISOR=kvm
> TOTALCPU=800
> CPUSPEED=2992
> TOTALMEMORY=33010680
> USEDMEMORY=1586216
> FREEMEMORY=31424464
> FREECPU=800.0
> USEDCPU=0.0
> NETRX=5958104400
> NETTX=2323329968
> DS_LOCATION_USED_MB=1924
> DS_LOCATION_TOTAL_MB=280380
> DS_LOCATION_FREE_MB=264129
> DS = [
> ID = 102,
> USED_MB = 1924,
> TOTAL_MB = 280380,
> FREE_MB = 264129
> ]
> HOSTNAME=fgtest19.fnal.gov
> VM_POLL=YES
> VM=[
> ID=55,
> DEPLOY_ID=one-55,
> POLL="NETRX=25289118 USEDCPU=0.0 NETTX=214808 USEDMEMORY=4194304 STATE=a" ]
> VERSION="4.6.0"
> fgtest20
> ARCH=x86_64
> MODELNAME="Intel(R) Xeon(R) CPU E5450 @ 3.00GHz"
> HYPERVISOR=kvm
> TOTALCPU=800
> CPUSPEED=2992
> TOTALMEMORY=32875804
> USEDMEMORY=8801100
> FREEMEMORY=24074704
> FREECPU=793.6
> USEDCPU=6.39999999999998
> NETRX=184155823062
> NETTX=58685116817
> DS_LOCATION_USED_MB=50049
> DS_LOCATION_TOTAL_MB=281012
> DS_LOCATION_FREE_MB=216499
> DS = [
> ID = 102,
> USED_MB = 50049,
> TOTAL_MB = 281012,
> FREE_MB = 216499
> ]
> HOSTNAME=fgtest20.fnal.gov
> VM_POLL=YES
> VM=[
> ID=31,
> DEPLOY_ID=one-31,
> POLL="NETRX=71728978887 USEDCPU=0.5 NETTX=54281255903 USEDMEMORY=4270812 STATE=a" ]
> VM=[
> ID=24,
> DEPLOY_ID=one-24,
> POLL="NETRX=2383960153 USEDCPU=0.0 NETTX=17345416 USEDMEMORY=4194304 STATE=a" ]
> VM=[
> ID=48,
> DEPLOY_ID=one-48,
> POLL="NETRX=2546074171 USEDCPU=0.0 NETTX=145782495 USEDMEMORY=4194304 STATE=a" ]
> VERSION="4.6.0"
>
> fgtest14
> ARCH=x86_64
> MODELNAME="Intel(R) Xeon(R) CPU E5450 @ 3.00GHz"
> HYPERVISOR=kvm
> TOTALCPU=800
> CPUSPEED=2992
> TOTALMEMORY=24736796
> USEDMEMORY=937004
> FREEMEMORY=23799792
> FREECPU=800.0
> USEDCPU=0.0
> NETRX=285471609
> NETTX=25467521
> DS_LOCATION_USED_MB=179498
> DS_LOCATION_TOTAL_MB=561999
> DS_LOCATION_FREE_MB=353864
> DS = [
> ID = 102,
> USED_MB = 179498,
> TOTAL_MB = 561999,
> FREE_MB = 353864
> ]
>
> -------------------------
> And the appropriate excerpts from oned.log:
>
> /var/log/one/oned.log.20140728111811:Fri Jul 25 15:22:05 2014 [DiM][D]: Restarting VM 26
> /var/log/one/oned.log.20140728111811:Fri Jul 25 15:22:05 2014 [DiM][E]: Could not restart VM 26, wrong state.
> /var/log/one/oned.log.20140728111811:Fri Jul 25 15:37:48 2014 [DiM][D]: Stopping VM 26
> /var/log/one/oned.log.20140728111811:Fri Jul 25 15:37:48 2014 [VMM][D]: VM 26 successfully monitored: STATE=-
> -----------------------------------
>
> This is the mysql row in host_pool for host fgtest14
> mysql>
> mysql> select * from host_pool where oid=8 \G
> *************************** 1. row ***************************
> oid: 8
> name: fgtest14
> body:<HOST><ID>8</ID><NAME>fgtest14</NAME><STATE>2</STATE><IM_MAD>kvm</IM_MAD><VM_MAD>kvm</VM_MAD><VN_MAD>dummy</VN_MAD><LAST_MON_TIME>1
> 406731190</LAST_MON_TIME><CLUSTER_ID>101</CLUSTER_ID><CLUSTER>ipv6</CLUSTER><HOST_SHARE><DISK_USAGE>0</DISK_USAGE><MEM_USAGE>0</MEM
> _USAGE><CPU_USAGE>0</CPU_USAGE><MAX_DISK>561999</MAX_DISK><MAX_MEM>24736796</MAX_MEM><MAX_CPU>800</MAX_CPU><FREE_DISK>353864</FREE_
> DISK><FREE_MEM>23802216</FREE_MEM><FREE_CPU>800</FREE_CPU><USED_DISK>179498</USED_DISK><USED_MEM>934580</USED_MEM><USED_CPU>0</USED
> _CPU><RUNNING_VMS>5</RUNNING_VMS><DATASTORES><DS><FREE_MB><![CDATA[353864]]></FREE_MB><ID><![CDATA[102]]></ID><TOTAL_MB><![CDATA[56
> 1999]]></TOTAL_MB><USED_MB><![CDATA[179498]]></USED_MB></DS></DATASTORES></HOST_SHARE><VMS></VMS><TEMPLATE><ARCH><![CDATA[x86_64]]>
> </ARCH><CPUSPEED><![CDATA[2992]]></CPUSPEED><HOSTNAME><![CDATA[fgtest14.fnal.gov]]></HOSTNAME><HYPERVISOR><![CDATA[kvm]]></HYPERVIS
> OR><MODELNAME><![CDATA[Intel(R) Xeon(R) CPU E5450 @ 3.00GHz]]></MODELNAME><NETRX><![CDATA[285677608]]></NETRX><NETTX><![CDATA[25489275]]></NETTX><RESERVED_CPU><![CDATA[]]></RESERVED_C
> PU><RESERVED_MEM><![CDATA[]]></RESERVED_MEM><VERSION><![CDATA[4.6.0]]></VERSION></TEMPLATE></HOST>
> state: 2
> last_mon_time: 1406731190
> uid: 0
> gid: 0
> owner_u: 1
> group_u: 0
> other_u: 0
> cid: 101
> 1 row in set (0.00 sec)
>
>
>
> And this is the row in vm_pool for VM id 26
>
> *************************** 1. row ***************************
> oid: 26
> name: fgt6x4-26
> body:<VM><ID>26</ID><UID>0</UID><GID>0</GID><UNAME>oneadmin</UNAME><GNAME>oneadmin</GNAME><NAME>fgt6x4-26</NAME><PERMISSIONS><OWNER_U>1<
> /OWNER_U><OWNER_M>1</OWNER_M><OWNER_A>0</OWNER_A><GROUP_U>0</GROUP_U><GROUP_M>0</GROUP_M><GROUP_A>0</GROUP_A><OTHER_U>0</OTHER_U><O
> THER_M>0</OTHER_M><OTHER_A>0</OTHER_A></PERMISSIONS><LAST_POLL>1406320668</LAST_POLL><STATE>3</STATE><LCM_STATE>3</LCM_STATE><RESCH
> ED>0</RESCHED><STIME>1396463735</STIME><ETIME>0</ETIME><DEPLOY_ID>one-26</DEPLOY_ID><MEMORY>4194304</MEMORY><CPU>6</CPU><NET_TX>748
> 982286</NET_TX><NET_RX>1588690678</NET_RX><TEMPLATE><AUTOMATIC_REQUIREMENTS><![CDATA[CLUSTER_ID = 101 & !(PUBLIC_CLOUD =YES)]]></AUTOMATIC_REQUIREMENTS><CONTEXT><CTX_USER><![CDATA[PFVTRVI+PElEPjA8L0lEPjxHSUQ+MDwvR0lEPjxHUk9VUFM+PElEPjA8L0lEPjwvR1JPVVB
> TPjxHTkFNRT5vbmVhZG1pbjwvR05BTUU+PE5BTUU+b25lYWRtaW48L05BTUU+PFBBU1NXT1JEPjFmNjQxYzdlMzZkZWU5MmUzNDQ0Mjk2NmI1OTYwMGJkMGE3ZmU5ZDQ8L1
> BBU1NXT1JEPjxBVVRIX0RSSVZFUj5jb3JlPC9BVVRIX0RSSVZFUj48RU5BQkxFRD4xPC9FTkFCTEVEPjxURU1QTEFURT48VE9LRU5fUEFTU1dPUkQ+PCFbQ0RBVEFbNzFhY
> zU0OWM5MzhmNjA0NmY3NDEzMDI4Y2ZhOGNjODU2YzI2ZGNhNV1dPjwvVE9LRU5fUEFTU1dPUkQ+PC9URU1QTEFURT48REFUQVNUT1JFX1FVT1RBPjwvREFUQVNUT1JFX1FV
> T1RBPjxORVRXT1JLX1FVT1RBPjwvTkVUV09SS19RVU9UQT48Vk1fUVVPVEE+PC9WTV9RVU9UQT48SU1BR0VfUVVPVEE+PC9JTUFHRV9RVU9UQT48L1VTRVI+]]></CTX_US
> ER><DISK_ID><![CDATA[2]]></DISK_ID><ETH0_DNS><![CDATA[131.225.0.254]]></ETH0_DNS><ETH0_GATEWAY><![CDATA[131.225.41.200]]></ETH0_GAT
> EWAY><ETH0_IP><![CDATA[131.225.41.169]]></ETH0_IP><ETH0_IPV6><![CDATA[2001:400:2410:29::169]]></ETH0_IPV6><ETH0_MAC><![CDATA[00:16:
> 3e:06:06:04]]></ETH0_MAC><ETH0_MASK><![CDATA[255.255.255.128]]></ETH0_MASK><FILES><![CDATA[/cloud/images/OpenNebula/scripts/one3.2/
> contextualization/init.sh /cloud/images/OpenNebula/scripts/one3.2/contextualization/credentials.sh /cloud/images/OpenNebula/scripts/one3.2/contextualization/kerberos.sh]]></FILES><GATEWAY><![CDATA[131.225.41.200]]></GATEWAY><INIT_
> SCRIPTS><![CDATA[init.sh credentials.sh kerberos.sh]]></INIT_SCRIPTS><IP_PUBLIC><![CDATA[131.225.41.169]]></IP_PUBLIC><NETMASK><![CDATA[255.255.255.128]]></NETMASK><NETWOR
> K><![CDATA[YES]]></NETWORK><ROOT_PUBKEY><![CDATA[id_dsa.pub]]></ROOT_PUBKEY><TARGET><![CDATA[hdc]]></TARGET><USERNAME><![CDATA[open
> nebula]]></USERNAME><USER_PUBKEY><![CDATA[id_dsa.pub]]></USER_PUBKEY></CONTEXT><CPU><![CDATA[1]]></CPU><DISK><CLONE><![CDATA[NO]]><
> /CLONE><CLONE_TARGET><![CDATA[SYSTEM]]></CLONE_TARGET><CLUSTER_ID><![CDATA[101]]></CLUSTER_ID><DATASTORE><![CDATA[ip6_img_ds]]></DA
> TASTORE><DATASTORE_ID><![CDATA[101]]></DATASTORE_ID><DEV_PREFIX><![CDATA[hd]]></DEV_PREFIX><DISK_ID><![CDATA[0]]></DISK_ID><IMAGE><
> ![CDATA[fgt6x4_os]]></IMAGE><IMAGE_ID><![CDATA[5]]></IMAGE_ID><IMAGE_UNAME><![CDATA[oneadmin]]></IMAGE_UNAME><LN_TARGET><![CDATA[SY
> STEM]]></LN_TARGET><PERSISTENT><![CDATA[YES]]></PERSISTENT><READONLY><![CDATA[NO]]></READONLY><SAVE><![CDATA[YES]]></SAVE><SIZE><![
> CDATA[46080]]></SIZE><SOURCE><![CDATA[/var/lib/one//datastores/101/3078b4235100008fbdbf9dff7eea95b1]]></SOURCE><TARGET><![CDATA[vda
> ]]></TARGET><TM_MAD><![CDATA[ssh]]></TM_MAD><TYPE><![CDATA[FILE]]></TYPE></DISK><DISK><DEV_PREFIX><![CDATA[hd]]></DEV_PREFIX><DISK_
> ID><![CDATA[1]]></DISK_ID><SIZE><![CDATA[5120]]></SIZE><TARGET><![CDATA[vdb]]></TARGET><TYPE><![CDATA[swap]]></TYPE></DISK><FEATURE
> S><ACPI><![CDATA[yes]]></ACPI></FEATURES><GRAPHICS><AUTOPORT><![CDATA[yes]]></AUTOPORT><KEYMAP><![CDATA[en-us]]></KEYMAP><LISTEN><!
> [CDATA[127.0.0.1]]></LISTEN><PORT><![CDATA[5926]]></PORT><TYPE><![CDATA[vnc]]></TYPE></GRAPHICS><MEMORY><![CDATA[4096]]></MEMORY><N
> IC><BRIDGE><![CDATA[br0]]></BRIDGE><CLUSTER_ID><![CDATA[101]]></CLUSTER_ID><IP><![CDATA[131.225.41.169]]></IP><IP6_LINK><![CDATA[fe
> 80::216:3eff:fe06:604]]></IP6_LINK><MAC><![CDATA[00:16:3e:06:06:04]]></MAC><MODEL><![CDATA[virtio]]></MODEL><NETWORK><![CDATA[Stati
> c_IPV6_Public]]></NETWORK><NETWORK_ID><![CDATA[1]]></NETWORK_ID><NETWORK_UNAME><![CDATA[oneadmin]]></NETWORK_UNAME><NIC_ID><![CDATA
> [0]]></NIC_ID><VLAN><![CDATA[NO]]></VLAN></NIC><OS><ARCH><![CDATA[x86_64]]></ARCH></OS><RAW><DATA><![CDATA[
> <devices>
> <serial type='pty'>
> <target port='0'/>
> </serial>
> <console type='pty'>
> <target type='serial' port='0'/>
> </console>
>
> </devices>]]></DATA><TYPE><![CDATA[kvm]]></TYPE></RAW><TEMPLATE_ID><![CDATA[6]]></TEMPLATE_ID><VCPU><![CDATA[2]]></VCPU><VMID><![CD
> ATA[26]]></VMID></TEMPLATE><USER_TEMPLATE><ERROR><![CDATA[Fri Jul 25 15:37:48 2014 : Error saving VM state: Could not
> save one-26 to /var/lib/one/datastores/102/26/checkpoint]]></ERROR><NPTYPE><![CDATA[NPERNLM]]></NPTYPE><RANK><![CDATA[FREEMEMORY]]></RANK><USERVO>
> <![CDATA[test181818]]></USERVO></USER_TEMPLATE><HISTORY_RECORDS><HISTORY><OID>26</OID><SEQ>0</SEQ><HOSTNAME>fgtest14</HOSTNAME><HID
> >10</HID><CID>101</CID><STIME>1396463752</STIME><ETIME>0</ETIME><VMMMAD>kvm</VMMMAD><VNMMAD>dummy</VNMMAD><TMMAD>ssh</TMMAD><DS_LOC
> ATION>/var/lib/one/datastores</DS_LOCATION><DS_ID>102</DS_ID><PSTIME>1396463752</PSTIME><PETIME>1396465032</PETIME><RSTIME>13964650
> 32</RSTIME><RETIME>0</RETIME><ESTIME>0</ESTIME><EETIME>0</EETIME><REASON>0</REASON><ACTION>0</ACTION></HISTORY></HISTORY_RECORDS></
> VM>
> uid: 0
> gid: 0
> last_poll: 1406320668
> state: 3
> lcm_state: 3
> owner_u: 1
> group_u: 0
> other_u: 0
> 1 row in set (0.00 sec)
>
>
> -------------------------------
>
>
>
> On Wed, 30 Jul 2014, Steven Timm wrote:
>
> On Wed, 30 Jul 2014, Ruben S. Montero wrote:
>
>
> Not really sure what can be going on... The monitor scripts return the
> information of all VMs running in the node. In 4.6 the
> monitoring system uses a push approach, through UDP, so you may have the
> information being reported by misbehaved monitoring
> daemons. Sometimes this may happen in dev environments if you are
> resetting the DB,...
>
>
> When we ran the update to take this database from ONE4.4 to ONE4.6, one host (the aforementioned fgtest14)
> and one datastore (image store 101) got
> wiped out of the database. I reinserted them both and restarted opennebula.
>
> Steve Timm
>
>
>
>
>
> On Jul 28, 2014 6:32 PM, "Steven Timm" <timm at fnal.gov> wrote:
>
> I am currently dealing with an unexplained monitoring question
> in OpenNebula 4.6 on my development cloud.
>
> I frequently see OpenNebula return that the status of a ONe
> host is "ON" even in the case of a system misconfiguration where,
> given the credentials, it is impossible for opennebula to
> even ssh into the node as oneadmin.
>
>
> I've fixed all those instances, restarted OpenNebula,
> but opennebula still reports a number of VM's
> in state "running" even though the node they are running
> on was rebooted three days ago and is running no
> virtual machines whatsoever.
>
> I think I could be dealing with database corruption of some type
> (generated on the one4.4->one4.6 update), or there could
> be some problem with the remote scripts on the nodes.
> I saw, and I think I fixed, the problems with the database
> corruption (namely one of the hosts and one of the datastores
> got knocked out of the database for reasons unknown, and I
> re-inserted them). But in any case there is some
> error handling that is not working in the monitoring
> and something is exiting with status 0 that shouldn't be.
>
> Ideas? Has anyone else seen something like this?
>
> Steve Timm
>
>
>
> ------------------------------------------------------------------
> Steven C. Timm, Ph.D (630) 840-8525
> timm at fnal.gov http://home.fnal.gov/~timm/
> Fermilab Scientific Computing Division, Scientific Computing
> Services Quad.
> Grid and Cloud Services Dept., Associate Dept. Head for Cloud
> Computing
> _______________________________________________
> Users mailing list
> Users at lists.opennebula.org
> http://lists.opennebula.org/listinfo.cgi/users-opennebula.org
>
> --
> --
> Ruben S. Montero, PhD
> Project co-Lead and Chief Architect OpenNebula - Flexible Enterprise Cloud Made Simple
> www.OpenNebula.org | rsmontero at opennebula.org | @OpenNebula
>
>
------------------------------------------------------------------
Steven C. Timm, Ph.D (630) 840-8525
timm at fnal.gov http://home.fnal.gov/~timm/
Fermilab Scientific Computing Division, Scientific Computing Services Quad.
Grid and Cloud Services Dept., Associate Dept. Head for Cloud Computing