[one-users] Oned and Sched failing to start on parallel file system

Nicholas Robison nrobison at purdue.edu
Sun Jan 29 08:47:53 PST 2012


I'm trying to deploy an OpenNebula installation on a cluster over the OrangeFS file system. I've successfully set up this cluster in the past using local storage, but now I'm testing performance over distributed storage. I've configured the installation to use the parallel storage along with a MySQL database hosted locally on the head node, but now I'm seeing a couple of errors.

Most of the time I get this:

/srv/cloud/one/bin/one: line 172: /srv/cloud/one/var/sched.pid: Input/output error
oned failed to start
/srv/cloud/one/bin/one: line 84: 28706 Terminated              $ONED -f 2>&1

There are no further errors in dmesg, oned.log or messages. Other times oned will start, but then sched fails with this error:

/srv/cloud/one/bin/one: line 172: /srv/cloud/one/var/sched.pid: Input/output error
/srv/cloud/one/bin/one: line 112: 29006 Segmentation fault      (core dumped) $ONE_SCHEDULER -p $PORT -t 30 -m 300 -d 30 -h 1
cat: /srv/cloud/one/var/oned.pid: Input/output error

Again, there is nothing in dmesg or messages, but sched.log reports:

Sun Jan 29 11:40:27 2012 [POOL][E]: Could not retrieve pool info from ONE
Sun Jan 29 11:40:32 2012 [HOST][E]: Exception raised: Unable to transport XML to server and get XML response back.  libcurl failed to execute the HTTP POST transaction.  couldn't connect to host
Sun Jan 29 11:40:32 2012 [POOL][E]: Could not retrieve pool info from ONE

There are no errors in the OrangeFS logs and performance seems good, so I'm assuming the file system is working. I've never seen this error in any of my other OpenNebula clusters using local storage.
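
One way to check that assumption independently of OpenNebula would be to repeat what the start script does with its pid files: write a small file into the var directory, read it back, and delete it. The following is only a minimal sketch in Python. The function name check_pid_roundtrip and the pidtest.* file name are made up for illustration, and nothing here is part of OpenNebula or OrangeFS.

```python
import errno
import os


def check_pid_roundtrip(directory, pid=None):
    """Write a pid-style file in `directory`, read it back, and delete it.

    Returns (True, "ok") on success, or (False, description) on the first
    OSError, so an EIO from the file system is reported instead of raised.
    The file name "pidtest.<pid>" is a hypothetical name for this sketch.
    """
    pid = os.getpid() if pid is None else pid
    path = os.path.join(directory, "pidtest.%d" % pid)
    try:
        with open(path, "w") as f:
            f.write("%d\n" % pid)
            f.flush()
            os.fsync(f.fileno())  # push the write through to the file system
        with open(path) as f:
            read_back = f.read().strip()
        os.remove(path)
    except OSError as exc:
        name = errno.errorcode.get(exc.errno, "unknown")
        return False, "%s: %s" % (name, exc.strerror)
    if read_back != str(pid):
        return False, "mismatch: wrote %d, read %r" % (pid, read_back)
    return True, "ok"
```

Calling check_pid_roundtrip("/srv/cloud/one/var") in a loop on the head node should show whether the Input/output error can be reproduced outside of oned and sched; a result beginning with EIO would point at the file system mount rather than at OpenNebula.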

Any ideas? I'm sure I'm forgetting some helpful details but any thoughts would be greatly appreciated.

Thanks,
Nick

