[one-users] Hung sshd processes on VM hosts

Steven Timm timm at fnal.gov
Fri Mar 14 10:55:15 PDT 2014


We recently deployed several new, larger hosts on our OpenNebula 3.2
cloud and are seeing some issues.  At this point we are not sure whether
we are dealing with an OS problem in sshd or something else.
The symptom is that an OpenNebula monitoring process comes into
the VM host as oneadmin and does its thing, but then the sshd process
(owned by root) that spawned it starts using up to 100% of system
CPU and cannot be killed at all.  An strace of the sshd process simply
hangs.  Eventually many of these build up on the VM host and it becomes
almost impossible to do anything.  The only way we have found so far to
kill them is to restart the parent sshd, after which we can kill all the
child sshd processes.
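
Since strace hangs on these processes, one alternative is to read their
state straight from /proc, which needs no attach.  The following is a
minimal sketch, assuming a Linux /proc filesystem as described in
proc(5); the helper names parse_stat, list_processes and report are
hypothetical and are not part of OpenNebula or OpenSSH.  It lists every
process named sshd with its state letter, parent PID and accumulated
user and system CPU time.  A state of D (uninterruptible sleep) or a
fast-growing system time on a process in state R might help narrow down
whether the process is stuck inside the kernel, though it does not by
itself identify the cause.

```python
import os


def parse_stat(text):
    """Parse the contents of /proc/<pid>/stat into a small dict."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split on the last ')' instead of on whitespace.
    lparen = text.index("(")
    rparen = text.rindex(")")
    rest = text[rparen + 1:].split()
    # rest[0] is field 3 (state), so field n of proc(5) is rest[n - 3]:
    # ppid is field 4, utime is field 14, stime is field 15.
    return {
        "pid": int(text[:lparen]),
        "comm": text[lparen + 1:rparen],
        "state": rest[0],
        "ppid": int(rest[1]),
        "utime": int(rest[11]),
        "stime": int(rest[12]),
    }


def list_processes(name="sshd", proc_root="/proc"):
    """Return parsed stat entries for every process whose name matches."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                info = parse_stat(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited or entry was unreadable
        if info["comm"] == name:
            found.append(info)
    return found


def report(name="sshd"):
    """Print matching processes, highest system CPU time first."""
    ticks = os.sysconf("SC_CLK_TCK")  # clock ticks per second
    procs = sorted(list_processes(name), key=lambda p: p["stime"],
                   reverse=True)
    for p in procs:
        print("pid=%d ppid=%d state=%s user=%.1fs system=%.1fs" % (
            p["pid"], p["ppid"], p["state"],
            p["utime"] / ticks, p["stime"] / ticks))
```

Calling report() on an affected host a few seconds apart would show
whether the system time of a given sshd child keeps climbing between the
two runs.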

The symptom tends to appear when there are more than 20 virtual machines
on the same host.  These are new Ivy Bridge-based hosts that should be
good for at least 40 VMs apiece.

Has anyone seen anything like this before?  And yes, I know the 4.x series
of OpenNebula is a lot more efficient in its monitoring, and we are trying
to get there as fast as we can.

Steve Timm
------------------------------------------------------------------
Steven C. Timm, Ph.D  (630) 840-8525
timm at fnal.gov  http://home.fnal.gov/~timm/
Fermilab Scientific Computing Division, Scientific Computing Services Quad.
Grid and Cloud Services Dept., Associate Dept. Head for Cloud Computing

