[one-users] Stop-Resume failing with shared storage

Rangababu Chakravarthula rbabu at hexagrid.com
Sun Mar 7 20:04:55 PST 2010


Thank you Tino. Sorry for the late reply. Here are the detailed logs. 
Any help is appreciated.

NFS SHARED IMAGES DIRECTORY BETWEEN ALL HOSTS /mnt/sharedimagesdir

Contents of ONED.CONF

VM_DIR=/mnt/sharedimagesdir
IM_MAD = [
    name       = "im_kvm",
    executable = "one_im_ssh",
    arguments  = "im_kvm/im_kvm.conf",
    default    = "im_kvm/im_kvm.conf" ]
VM_MAD = [
     name       = "vmm_kvm",
     executable = "one_vmm_kvm",
     default    = "vmm_kvm/vmm_kvm.conf",
     type       = "kvm" ]
TM_MAD = [
        name       = "tm_nfs",
        executable = "one_tm",
        arguments  = "tm_nfs/tm_nfs.conf",
        default    = "tm_nfs/tm_nfs.conf" ]

WE MODIFIED tm_clone.sh & tm_ln.sh to add SSH


SUBMITTED NEW VM

onevm show 433

VID            : 433                
UID            : 0                  
STATE          : ACTIVE             
LCM STATE      : RUNNING            
DEPLOY ID      : one-433            
MEMORY         : 262144             
CPU            : 0                  
LAST POLL      : 1267828125         
START TIME     : 03/05 16:12:02     
STOP TIME      : 12/31 18:00:00     
NET TX         : 0                  
NET RX         : 0                  

....: Template :....
    DISK            : CLONE=no,SOURCE=/mnt/sharedimagesdir/images/onetest0,TARGET=hda,TYPE=disk
    GRAPHICS        : LISTEN=0.0.0.0,PORT=6003,TYPE=vnc
    INPUT           : TYPE=tablet        
    MEMORY          : 256                
    NAME            : onetest            
    NIC             : BRIDGE=br171,MAC=00:04:c9:5b:44:8a
    OS              : BOOT=hd            
    VCPU            : 1                  


ON THE MANAGEMENT NODE

root@ManagementNode:/etc/one/tm_nfs# ls -al /var/lib/one/433/
total 24
drwxrwxrwx   2 oneadmin nogroup  4096 2010-03-05 16:12 .
drwxr-xr-x 437 oneadmin root    12288 2010-03-05 16:26 ..
-rw-r--r--   1 oneadmin nogroup   549 2010-03-05 16:12 deployment.0
-rw-r--r--   1 oneadmin nogroup    89 2010-03-05 16:12 transfer.0

/var/log/one/433.log

Fri Mar  5 16:12:11 2010 [DiM][I]: New VM state is ACTIVE.
Fri Mar  5 16:12:11 2010 [LCM][I]: New VM state is PROLOG.
Fri Mar  5 16:12:11 2010 [TM][I]: tm_ln.sh: Creating directory /mnt/sharedimagesdir/433/images
Fri Mar  5 16:12:11 2010 [TM][I]: tm_ln.sh: Executed "ssh 10.10.20.190 mkdir -p /mnt/sharedimagesdir/433/images".
Fri Mar  5 16:12:11 2010 [TM][I]: tm_ln.sh: Executed "ssh 10.10.20.190 chmod a+w /mnt/sharedimagesdir/433/images".
Fri Mar  5 16:12:11 2010 [TM][I]: tm_ln.sh: Link /mnt/sharedimagesdir/images/onetest0
Fri Mar  5 16:12:11 2010 [TM][I]: tm_ln.sh: Executed "ssh 10.10.20.190 ln -s /mnt/sharedimagesdir/images/onetest0 /mnt/sharedimagesdir/433/images/disk.0".
Fri Mar  5 16:12:11 2010 [LCM][I]: New VM state is BOOT
Fri Mar  5 16:12:11 2010 [VMM][I]: Generating deployment file: /var/lib/one/433/deployment.0
Fri Mar  5 16:12:11 2010 [VMM][I]: Command: scp /var/lib/one/433/deployment.0 10.10.20.190:/mnt/sharedimagesdir/433/images/deployment.0
Fri Mar  5 16:12:11 2010 [VMM][I]: Copy success
Fri Mar  5 16:12:12 2010 [VMM][I]: Connecting to uri: qemu:///system
Fri Mar  5 16:12:12 2010 [VMM][I]: ExitCode: 0
Fri Mar  5 16:12:12 2010 [LCM][I]: New VM state is RUNNING
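
The PROLOG steps in the log above can be sketched in shell as follows. This is only an illustration reconstructed from the logged commands, not the actual modified tm_ln.sh; the function name link_commands and its arguments are hypothetical. It prints the remote commands instead of running them, so it is safe to try anywhere.

```shell
#!/bin/sh
# Hypothetical helper (not the real tm_ln.sh): print the three remote
# commands that the log for VM 433 shows being executed over SSH.
link_commands() {
    host=$1   # execution host, e.g. 10.10.20.190
    src=$2    # image in the shared repository
    dst=$3    # per-VM link, e.g. /mnt/sharedimagesdir/433/images/disk.0
    dst_dir=$(dirname "$dst")

    echo "ssh $host mkdir -p $dst_dir"   # create the per-VM images directory
    echo "ssh $host chmod a+w $dst_dir"  # make it writable for everyone
    echo "ssh $host ln -s $src $dst"     # link the shared image as disk.0
}

link_commands 10.10.20.190 \
    /mnt/sharedimagesdir/images/onetest0 \
    /mnt/sharedimagesdir/433/images/disk.0
```

Because the image is linked rather than copied (CLONE=no in the template), disk.0 on the host points straight at the file in the shared repository.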


onevm list

 433  onetest runn   0  262144    10.10.20.190 00 00:16:44





ON THE HOST

root@00238bbda914:/mnt/sharedimagesdir# ls -ltr /mnt/sharedimagesdir/433/images/
total 2
lrwxrwxrwx  1 oneadmin nogroup  32 2010-03-05 22:08 disk.0 -> /mnt/sharedimagesdir/images/onetest0
-rw-r--r--+ 1 oneadmin nogroup 549 2010-03-05 22:08 deployment.0
root@00238bbda914:/mnt/sharedimagesdir#


/var/log/libvirt/qemu/433.log on HOST

LC_ALL=C 
PATH=/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin 
/usr/bin/kvm -S -M pc-0.11 -m 256 -smp 1 -name one-433 -uuid 
74c151d6-b1f5-3e41-fc45-e7fdc9247722 -monitor 
unix:/var/run/libvirt/qemu/one-433.monitor,server,nowait -boot c -drive 
file=/mnt/sharedimagesdir/433/images/disk.0,if=ide,index=0,boot=on -net 
nic,macaddr=00:04:c9:5b:44:8a,vlan=0,name=nic.0 -net 
tap,fd=20,vlan=0,name=tap.0 -serial none -parallel none -usb -usbdevice 
tablet -vnc 0.0.0.0:103 -vga cirrus

deployment.0 file on HOST

<domain type='kvm'>
        <name>one-433</name>
        <vcpu>1</vcpu>
        <memory>262144</memory>
        <os>
                <type>hvm</type>
                <boot dev='hd'/>
        </os>
        <devices>
                <emulator>/usr/bin/kvm</emulator>
                <disk type='file' device='disk'>
                        <source file='/mnt/sharedimagesdir/433/images/disk.0'/>
                        <target dev='hda'/>
                </disk>
                <interface type='bridge'>
                        <source bridge='br171'/>
                        <mac address='00:04:c9:5b:44:8a'/>
                </interface>
                <graphics type='vnc' listen='0.0.0.0' port='6003'/>
                <input type='tablet'/>
        </devices>
        <features>
                <acpi/>
        </features>
</domain>


SUSPEND INVOKED


onevm list

 433  onetest susp   0  262144    10.10.20.190 00 00:25:08

433.log

Fri Mar  5 16:35:28 2010 [LCM][I]: New VM state is SAVE_SUSPEND
Fri Mar  5 16:35:29 2010 [VMM][I]: Connecting to uri: qemu:///system
Fri Mar  5 16:35:29 2010 [VMM][I]: ExitCode: 0
Fri Mar  5 16:35:29 2010 [DiM][I]: New VM state is SUSPENDED

Oned.log

Fri Mar  5 16:35:28 2010 [ReM][D]: VirtualMachineAction invoked
Fri Mar  5 16:35:28 2010 [DiM][D]: Suspending VM 433
Fri Mar  5 16:35:29 2010 [VMM][D]: Message received: LOG - 433 Connecting to uri: qemu:///system

Fri Mar  5 16:35:29 2010 [VMM][D]: Message received: LOG - 433 ExitCode: 0

Fri Mar  5 16:35:29 2010 [VMM][D]: Message received: SAVE SUCCESS 433

ON THE HOST

root@00238bbda914:/mnt/sharedimagesdir/433/images# ls -ltr
total 3
lrwxrwxrwx  1 oneadmin nogroup     32 2010-03-05 22:08 disk.0 -> /mnt/sharedimagesdir/images/onetest0
-rw-r--r--+ 1 oneadmin nogroup    549 2010-03-05 22:08 deployment.0
-rw-------+ 1 root     root    940894 2010-03-05 22:31 checkpoint
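
In the listing above the checkpoint is owned by root with mode -rw-------, unlike the other two files. Whether that plays any part in the failed resume is not established here; it is only something that could be checked. A minimal sketch of such a check, to be run on each host as the user that performs the restore (the function name check_checkpoint is hypothetical):

```shell
#!/bin/sh
# Hypothetical helper: report whether a checkpoint file is visible and
# readable for the user running this script on the current host.
check_checkpoint() {
    file=$1
    if [ ! -e "$file" ]; then
        echo "missing: $file"        # not visible on this host at all
        return 1
    fi
    if [ ! -r "$file" ]; then
        echo "not readable: $file"   # visible, but this user cannot read it
        return 2
    fi
    echo "ok: $file"
}
```

For the VM in this message the call would be: check_checkpoint /mnt/sharedimagesdir/433/images/checkpoint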






Tino Vazquez wrote:
> Hi Ranga,
>
> If you are using a shared repository (I'll assume you use NFS or a
> similar distributed FS), then the "<vmid>/images/" is shared between
> all the remote hosts, so there is no need to move the checkpoint files
> and they should be available in all the nodes.
>
> Please send us the log of the VM that is failing so we can try and
> reproduce the problem.
>
> Regards,
>
> -Tino
>
> --
> Constantino Vázquez, Grid & Virtualization Technology
> Engineer/Researcher: http://www.dsa-research.org/tinova
> DSA Research Group: http://dsa-research.org
> Globus GridWay Metascheduler: http://www.GridWay.org
> OpenNebula Virtual Infrastructure Engine: http://www.OpenNebula.org
>
>
>
> On Thu, Feb 18, 2010 at 2:44 AM, Rangababu Chakravarthula
> <rbabu at hexagrid.com> wrote:
>   
>> We are using shared storage as defined here
>>
>> http://www.opennebula.org/doku.php?id=documentation:rel1.2:sm#samplea_shared_image_repository
>>
>> When we run onevm stop or onevm suspend it tries to do SAVE_STOP and
>> SAVE_SUSPEND and creates a checkpoint file on the host
>> /var/lib/one/<vmid>/images/
>>
>> and in the logs we see
>> tm_mv.sh: Will not move, is not saving image
>>
>> I think it is trying to move the checkpoint file back to the management node
>> and based on logic in tm_mv.sh it is not moving.
>>
>> Later, when we try to do onevm resume, ONE picks a different host and tries
>> to move the checkpoint file from the management node to the new host and
>> again says "Will not move, is not saving image", and on the host it fails to
>> bring up the VM since there is no checkpoint file on the new host.
>>
>> How can we ask ONE not to resume from the checkpoint file but instead load
>> from the disk file that is in the template?
>>
>> Ranga
>> _______________________________________________
>> Users mailing list
>> Users at lists.opennebula.org
>> http://lists.opennebula.org/listinfo.cgi/users-opennebula.org
>>
>>     



