[one-users] Some issues while using OpenNebula

Ruben S. Montero rubensm at dacya.ucm.es
Thu Feb 5 14:01:55 PST 2009

Hi Boris, 
	Thank you very much for your interest in OpenNebula, and sharing your 
concerns with us. Comments below:

> The first issue is about ONEs behaviour in the case of malfunctions.
> Is it right, that ONE does not detect it, if an physical host crashed
> and has problems with handling that?

Not really; the logic is already there, so if you have experienced problems with 
this it should be a bug. In fact, OpenNebula checks the physical hosts at 
several points. If there is a misconfiguration or just a failure (e.g. the 
physical node crashed), the information system detects it and marks the host as 
error, like this:

 HID NAME                      RVM   TCPU   FCPU   ACPU    TMEM    FMEM STAT
   1 cluster02                   0    100    100    100 1047552  896000  err

In that case OpenNebula should not allocate VMs on that node.

> What is the procedure, if an host running virtual machines crashes?

Well, there is not much room to do things right ;). We are implementing a 
general hook mechanism that lets you program actions on different VM states, 
such as executing a pre-defined action (on_failure=reschedule) or a custom 
command on specific VM states like boot or failure. (This feature will not be 
available in 1.2.)
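To give an idea of what such a hook declaration could look like in a VM 
template, here is a sketch. Keep in mind this feature is not in 1.2, and the 
attribute names below (ON_FAILURE, HOOK, STATE, COMMAND, ARGS) are purely 
illustrative, not documented OpenNebula syntax:

# Hypothetical sketch only -- the hook feature is still being implemented
# and these attribute names are illustrative, not a documented syntax.
ON_FAILURE = reschedule

HOOK = [
  STATE   = "FAILURE",
  COMMAND = "/usr/local/bin/notify_admin.sh",   # made-up custom command
  ARGS    = "$VMID" ]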

Note that this will be done at the cluster level, so you have to be cautious. 
For example, automatically re-scheduling a VM with a cloned disk will not 
preserve your data. If you have a shared file system, the disk may contain 
inconsistencies and need an fsck, so the VM will not boot...

> Is it right, that you can not delete a virtual machine (VM), when it
> gets stucked in the boot-status for any reasons?

This is a known issue in 1.2, but I must say that it is only cosmetic. We do 
not delete VMs, even when they are done; we just do not show them in the 
onevm list command (grep -v boot will fix this ;). 

All that info is kept in the DB to generate accounting reports, or to use it 
for billing purposes. Providing a friendly accounting API & CLI is also on our 
short-term roadmap. Note that the info is in a standard SQLite DB, so it is 
really easy to access the accounting data.
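As a sketch of how easy this is, the DB can be read with nothing more than 
Python's standard sqlite3 module (or the sqlite3 command-line tool). The real 
DB lives under $ONE_LOCATION/var, and the table and column names below 
(vm_pool, uid, stime, etime) are assumptions for illustration -- check the 
actual schema before writing real accounting queries. The example uses an 
in-memory DB as a stand-in:

```python
import sqlite3

# Stand-in for OpenNebula's DB file; the real schema may differ from the
# assumed vm_pool(oid, uid, stime, etime) layout used here.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE vm_pool (oid INTEGER, uid INTEGER, stime INTEGER, etime INTEGER)")

# A few sample VM records: (vm id, user id, start epoch, end epoch).
rows = [
    (1, 100, 1000, 4600),   # user 100, ran 3600 s
    (2, 100, 2000, 2900),   # user 100, ran 900 s
    (3, 101, 1500, 8700),   # user 101, ran 7200 s
]
conn.executemany("INSERT INTO vm_pool VALUES (?, ?, ?, ?)", rows)

# Accounting query: total runtime in seconds per user.
usage = dict(conn.execute(
    "SELECT uid, SUM(etime - stime) FROM vm_pool GROUP BY uid"))
print(usage)  # {100: 4500, 101: 7200}
```

The same GROUP BY query works unchanged from the sqlite3 shell against the 
real DB once you substitute the actual table and column names.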

> Also, when anything ONE tries to do via ssh fails, the ssh connection
> remains, which brought down our ssh clients because of too many
> connections.

I'd love to hear more about this one; we are very interested in making 
OpenNebula scalable. In fact, we have successfully performed tests with ~100 
simultaneous ssh connections. Could you give me more details so we can 
reproduce this and track down the problem?

> Then we have got a problem with the tm_mv.sh in the nfs version we
> use. It does not seem to work. Maybe you can on help us:
> If I try to use nfs and cloning, the vm runs as expected. But when I
> shut it down (the save-tag in the template is set to "yes"), it is not
> saved.
> The vm log-file says "Will not move, is not saving image"

You are right, this should not happen. In fact, as you mention, it is a very 
simple script. I've filed a bug for this:


Could you send us the vm.log and transfer.0 files for the VM? And just a silly 
question: egrep is installed, isn't it?

> Even more critical seems to be, that the images-subfolder gets deleted
> anyway, so the changes made to your image are lost and you can only
> set up another vm from the previous image saved in the images-folder.
> (A vm's image is supposed to be saved back to the image folder, if you
> use onevm shutdown, isnt it?)

If tm_mv.sh works properly you will have a copy in $VAR_LOCATION/<VMID>, so 
the temporary images dir should be cleaned up. Then you can set up a template 
with the saved disk.
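For illustration, a DISK section with saving enabled might look roughly like 
this. The attribute names are a sketch in the 1.2 template style and the path 
is made up, so please check the documentation for the exact syntax:

# Sketch only -- attribute names and path are illustrative.
DISK = [
  source = "/srv/cloud/images/debian.img",
  target = "sda",
  clone  = "yes",
  save   = "yes" ]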

> Actually, I am still not sure which commands are meant for the regular
> operations? If I like to stop a VM and continue it anywhere later,
> should I use shutdown and then submit its template again (therefore
> getting a new onevm id)? Or should I use suspend and resume?

OK. Maybe we should improve the docs here:

Stop. This stops the VM, generates a checkpoint and "transfers back" the 
image, where "transfer back" means different things for NFS or SCP. If you 
then resume the VM, it is scheduled on another resource and continues its 
execution.

Suspend. Same as above, but everything is left on the physical host where the 
VM is running. When you resume it, the VM will continue on the same resource. 
Note that the scheduler is not invoked here and images are not moved, so this 
one should be faster.

Shutdown. In this case, imagine that you have installed a base system, then 
you boot up the machine and configure something. You can make the 
modifications persistent by cloning the disk (so you can continue the VM by 
just submitting it again after shutdown) or by saving the disk. In the latter 
case you keep the original image and the modifications are saved in the 
$VAR_LOCATION/<VM_ID> directory. You have to move the saved image to your 
repo and write a new template for it.

Note that the first two are stateful.

> One note:
> Before we even got it to run, we had to do a small change in the
> XenDriver.cc There you use tap:aio in the 1.2 beta version. But to use
> "file:" is not deprecated, as mentioned in the comment in XenDriver.cc.
> tap:aio just has a higher performance. But it does not work on all Linux
> distributions and xen-versions! So please consider making this
> configurable.

Maybe we are wrong and this is Xen/distribution dependent. We will add a note 
under the known issues for 1.2.
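For reference, the two disk specifications in question look like this in a Xen 
domain config (the image path is illustrative):

# tap:aio uses the blktap driver and is usually faster, but it is not
# available on every distribution/Xen version.
disk = [ 'tap:aio:/srv/cloud/images/disk.img,xvda,w' ]

# file: uses a loopback device and is more widely supported.
# disk = [ 'file:/srv/cloud/images/disk.img,xvda,w' ]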

Thanks for your feedback!


> Kind regards,
> Boris Konrad
> _______________________________________________
> Users mailing list
> Users at lists.opennebula.org
> http://lists.opennebula.org/listinfo.cgi/users-opennebula.org

 Dr. Ruben Santiago Montero
 Associate Professor
 Distributed System Architecture Group (http://dsa-research.org)

 URL:    http://dsa-research.org/doku.php?id=people:ruben
 Weblog: http://blog.dsa-research.org/?author=7
 GridWay, http://www.gridway.org
 OpenNEbula, http://www.opennebula.org
