[one-users] Multi-deployment of VMs: Slow "Dispatching virtual machine"

Tue May 28 08:45:45 PDT 2013

Hi,

I'm benchmarking the multi-deployment of VMs in OpenNebula to test the 
scalability of our distributed file system XtreemFS.

My benchmark procedure is:

- stop the scheduler
- "onevm create" multiple VMs
- start the scheduler again
- wait until the last VM has booted

Recently, we upgraded our OpenNebula installation from 2.2 to 3.8 on our 
32-node test cluster. With OpenNebula 2.2 the VMs were deployed almost 
simultaneously. In 3.8, however, the scheduler takes quite some time 
(1-2 seconds) to dispatch each single VM.

Here are the details:

I log timestamps around the creation of the qcow2 snapshot in the "clone" 
transfer manager script. This is what deploying 10 VMs looked like with 
OpenNebula 2.2:

1362253295.5779 clone_starting n03
1362253295.5929 clone_starting n01
1362253295.6138 clone_starting n00
1362253295.6418 clone_starting n05
1362253295.6428 clone_starting n04
1362253295.6905 clone_starting n08
1362253295.6960 clone_starting n09
1362253295.7047 clone_starting n06
1362253295.7113 clone_starting n02
1362253295.7330 clone_starting n07
1362253296.7047 clone_finished n05
1362253296.7214 clone_finished n03
1362253296.7353 clone_finished n01
1362253296.7571 clone_finished n06
1362253296.7677 clone_finished n09
1362253296.7705 clone_finished n04
1362253296.8035 clone_finished n08
1362253296.8206 clone_finished n00
1362253296.8214 clone_finished n02
1362253296.8292 clone_finished n07

The whole thing finished in under two seconds.

With OpenNebula 3.8 it looks very different:

1369752457.4118 clone_starting n13
1369752457.4195 clone_finished n13
1369752459.6483 clone_starting n17
1369752459.6561 clone_finished n17
1369752460.6465 clone_starting n08
1369752460.6544 clone_finished n08
1369752461.9516 clone_starting n12
1369752461.9602 clone_finished n12
1369752463.2860 clone_starting n15
1369752463.2948 clone_finished n15
1369752465.7036 clone_starting n14
1369752465.7120 clone_finished n14
1369752466.7329 clone_starting n11
1369752466.7406 clone_finished n11
1369752467.9151 clone_starting n10
1369752467.9231 clone_finished n10
1369752468.8460 clone_starting n16
1369752468.8539 clone_finished n16
1369752469.8849 clone_starting n09
1369752469.8958 clone_finished n09
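
To put numbers on this, the timing log can be reduced to two values: the 
gap between consecutive clone_starting events and the duration of each 
clone. Below is a minimal Python sketch of that reduction. The function 
name summarize is only illustrative, and the inline sample is the first 
three hosts of the 3.8 run above.

```python
def summarize(log_text):
    """Return (gaps between consecutive clone starts, clone duration per host)."""
    starts = []      # timestamps of all clone_starting events
    begun = {}       # host -> timestamp of its clone_starting event
    durations = {}   # host -> seconds between clone_starting and clone_finished
    for line in log_text.strip().splitlines():
        stamp, event, host = line.split()
        t = float(stamp)
        if event == "clone_starting":
            starts.append(t)
            begun[host] = t
        elif event == "clone_finished":
            durations[host] = t - begun[host]
    starts.sort()
    gaps = [later - earlier for earlier, later in zip(starts, starts[1:])]
    return gaps, durations


# First three hosts of the OpenNebula 3.8 run, copied from the log above.
sample = """
1369752457.4118 clone_starting n13
1369752457.4195 clone_finished n13
1369752459.6483 clone_starting n17
1369752459.6561 clone_finished n17
1369752460.6465 clone_starting n08
1369752460.6544 clone_finished n08
"""

gaps, durations = summarize(sample)
for gap in gaps:
    print(f"gap between clone starts: {gap:.2f} s")
for host, seconds in durations.items():
    print(f"{host}: clone took {seconds * 1000:.1f} ms")
```

For the 3.8 log, each clone itself completes within a few milliseconds, 
so the time is lost between the dispatches and not in the clone script.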

Now, dispatching a single VM takes between 1 and 2 seconds. Here are the 
corresponding lines from sched.log:

Tue May 28 16:47:35 2013 [VM][I]: Dispatching virtual machine 266 to host 98
Tue May 28 16:47:36 2013 [VM][I]: Dispatching virtual machine 267 to host 102
Tue May 28 16:47:36 2013 [VM][I]: Dispatching virtual machine 268 to host 93
Tue May 28 16:47:39 2013 [VM][I]: Dispatching virtual machine 269 to host 97
Tue May 28 16:47:41 2013 [VM][I]: Dispatching virtual machine 270 to host 100
Tue May 28 16:47:41 2013 [VM][I]: Dispatching virtual machine 271 to host 99
Tue May 28 16:47:43 2013 [VM][I]: Dispatching virtual machine 272 to host 96
Tue May 28 16:47:44 2013 [VM][I]: Dispatching virtual machine 273 to host 95
Tue May 28 16:47:44 2013 [VM][I]: Dispatching virtual machine 274 to host 101
Tue May 28 16:47:45 2013 [VM][I]: Dispatching virtual machine 275 to host 94
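
The delay between dispatches can also be read directly from sched.log. 
The following Python sketch extracts the "Dispatching virtual machine" 
entries and computes the time between them. It assumes each log entry 
sits on one line, as in the original sched.log. The names PATTERN and 
parse_dispatches are only illustrative, and the sample is three entries 
from the excerpt above. Note that the log has one-second resolution, so 
the result is coarse.

```python
import re
from datetime import datetime

# Matches entries such as:
# Tue May 28 16:47:39 2013 [VM][I]: Dispatching virtual machine 269 to host 97
PATTERN = re.compile(
    r"^(\w{3} \w{3} +\d+ \d\d:\d\d:\d\d \d{4}) \[VM\]\[I\]: "
    r"Dispatching virtual machine (\d+) to host (\d+)$"
)


def parse_dispatches(log_text):
    """Return a list of (timestamp, vm_id, host_id) for every dispatch entry."""
    events = []
    for line in log_text.splitlines():
        match = PATTERN.match(line.strip())
        if match:
            when = datetime.strptime(match.group(1), "%a %b %d %H:%M:%S %Y")
            events.append((when, int(match.group(2)), int(match.group(3))))
    return events


# Three consecutive entries copied from the sched.log excerpt above.
sample_log = """
Tue May 28 16:47:36 2013 [VM][I]: Dispatching virtual machine 268 to host 93
Tue May 28 16:47:39 2013 [VM][I]: Dispatching virtual machine 269 to host 97
Tue May 28 16:47:41 2013 [VM][I]: Dispatching virtual machine 270 to host 100
"""

events = parse_dispatches(sample_log)
delays = [
    (later[0] - earlier[0]).total_seconds()
    for earlier, later in zip(events, events[1:])
]
for (when, vm_id, host_id), delay in zip(events[1:], delays):
    print(f"VM {vm_id} -> host {host_id}: {delay:.0f} s after the previous dispatch")
```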

Looking at the sources, I suspect that part of the problem is the 
blocking XML-RPC call to oned, the OpenNebula daemon (?):

https://github.com/OpenNebula/one/blob/d732c5ae2fe774a2f0c0e24e6b60b3dc832a5f35/src/scheduler/src/pool/VirtualMachinePoolXML.cc#L133
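
To illustrate why a blocking call per VM would serialize the dispatch, 
here is a toy model in Python. It is not OpenNebula code: fake_deploy is 
a hypothetical stand-in that only sleeps, and CALL_LATENCY is an 
arbitrary value chosen for the demonstration, not a measurement. It only 
shows that sequential blocking calls add up their latencies while 
overlapping calls do not.

```python
import time
from concurrent.futures import ThreadPoolExecutor

CALL_LATENCY = 0.05  # hypothetical per-call latency in seconds (assumption)


def fake_deploy(vm_id):
    """Stand-in for one blocking deploy request; sleeps instead of doing RPC."""
    time.sleep(CALL_LATENCY)
    return vm_id


def dispatch_sequential(vm_ids):
    """Issue one blocking call after another and return the elapsed seconds."""
    start = time.monotonic()
    for vm_id in vm_ids:
        fake_deploy(vm_id)
    return time.monotonic() - start


def dispatch_overlapped(vm_ids):
    """Issue all calls from a thread pool and return the elapsed seconds."""
    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=len(vm_ids)) as pool:
        list(pool.map(fake_deploy, vm_ids))
    return time.monotonic() - start


vm_ids = list(range(266, 276))  # the ten VM ids from the sched.log excerpt
sequential = dispatch_sequential(vm_ids)
overlapped = dispatch_overlapped(vm_ids)
print(f"sequential: {sequential:.2f} s, overlapped: {overlapped:.2f} s")
```

In this model the sequential run takes about ten times the per-call 
latency, which matches the one-at-a-time pattern in the 3.8 logs, 
although it does not explain why a single call would take a second or 
more.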

Even so, it shouldn't take that long. My questions are:

- Is this normal? Can you advise how to track down what takes so long?

- With 2.2 you can clearly see the interleaving of multiple deployments 
while 3.8 processes them one at a time. Is there a way to get the old 
behavior back in a recent OpenNebula installation?

Thank you very much for your help.

Best regards,
Michael