[one-users] (thanks) Re: experiences with distributed FS?

João Pagaime jpsp at fccn.pt
Thu Feb 9 04:38:54 PST 2012


Thanks for sharing your knowledge.

It will take a while to look into this very useful information.

best regards,
João

Here's a short summary, organized by file system:

GPFS===============================
• ------------ (H) We are looking forward to using GPFS as a clustered FS 
for our upcoming KVM & OpenNebula setup
• we are already a happy GPFS user
• … GPFS … is pretty flexible (support for Linux, AIX, and Windows) and 
very scalable
• BUT it is not open source - 
http://www-03.ibm.com/systems/software/gpfs/resources.html
• Stability (is it maintainable without a general shutdown)?
o If best practice has been followed, you can survive without forced 
downtime for years.
o If you use NFS (hard mounted) on top of GPFS, you could survive for 
decades, provided a 60-second NFS 'hang' is acceptable, since this 
approach lets you survive even a complete GPFS cluster reboot. Best 
practice, however, is a rolling update/upgrade approach, so that only 
part of the cluster servers is down at any given time and the service is 
not affected.
• Effort/learning curve - if you are already familiar with a clustered 
FS: a few days, with a little help
• --------------(GUL) Beware of the clustering software; be sure to use 
a very recent release (i.e. the one in Debian Stable).
• I tried it with an older, broken version of the clustering software, 
and initially it worked wonderfully well... until the cluster software 
began to malfunction and kick servers out of the cluster.
Conga CLVM / GFS2 (maybe not strictly a distributed FS?) 
==========================
• ----------------(A) I implemented the following, which I have in 
production:
• 4 nodes connected to external storage (a NAS) via iSCSI;
• a cluster deployed with Conga and CLVM, with one large volume created 
across the 4 nodes and GFS2 on top of it (a rough command sketch is 
included at the end of this section).
• It has really been very successful in production, and that is the 
experience I can share with you.
• More documentation: 
http://docs.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/index.html
• Check for the following manuals: Cluster Administration; Global File 
System 2
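
For reference, here is a rough sketch of the commands behind this kind of 
setup (my own reconstruction, not A's exact steps). It assumes a running 
4-node Red Hat cluster (e.g. managed via Conga/luci); the portal address, 
device path, cluster name, volume names and mount point are placeholders.

# Rough sketch: GFS2 on clustered LVM over iSCSI (placeholders throughout).
import subprocess

def run(cmd):
    """Print and run a shell command, stopping if it fails."""
    print("+", cmd)
    subprocess.run(cmd, shell=True, check=True)

# 1. Log in to the iSCSI target on every node (placeholder portal 192.0.2.10).
run("iscsiadm -m discovery -t sendtargets -p 192.0.2.10")
run("iscsiadm -m node --login")

# 2. On one node, turn the shared LUN into a clustered LVM volume.
run("pvcreate /dev/sdb")                              # placeholder iSCSI device
run("vgcreate --clustered y vg_shared /dev/sdb")
run("lvcreate -l 100%FREE -n lv_gfs2 vg_shared")

# 3. Create GFS2 with DLM locking and one journal per node (4 nodes here);
#    the cluster name before the colon must match cluster.conf.
run("mkfs.gfs2 -p lock_dlm -t mycluster:gfs2vol -j 4 /dev/vg_shared/lv_gfs2")

# 4. Mount it on every node (placeholder mount point; also add it to fstab
#    or the cluster configuration so it survives reboots).
run("mount -t gfs2 /dev/vg_shared/lv_gfs2 /mnt/gfs2")
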
MooseFS===============================
• -----------------(AZ) I have an installation of OpenNebula 3.0 with the 
Xen hypervisor and MooseFS for storage.
• The OpenNebula server is also the MooseFS master server; obviously we 
have a second OpenNebula/MooseFS slave server configured with uCarp for HA.
• At the moment we have 3 storage nodes, each equipped with 2x 2 TB SATA 
disks, for a total of 11 TB of disk space; we use a replica factor of 2.
• On the OpenNebula side, we use the shared storage driver; the VM disks 
are plain files and the Xen driver is "file:".
• In the past we used OpenNebula 2.2 with the MooseFS driver for immediate 
deployment via snapshots, but that driver wasn't updated for OpenNebula 3.0.
• We are very happy with MooseFS: it's very easy and robust. The only 
caveat is on the disk side: SATA is a little slow, especially for write 
operations, so we use a lot of RAM on each storage server (8 GB) to 
cache data.
• If you want, on my personal blog (http://www.azns.it) there are many 
configuration examples for OpenNebula/MooseFS/uCarp. Sorry, they are in 
Italian.
• --------------------(MM) And my choice: MooseFS. It is redundant (any 
node can go offline and the data stays available), scalable, supports 
striping and CoW (you can create a copy of a huge image in a second), has 
internal checksum correction and commercial support, offers data 
deduplication (commercial version only) and many other features. There is 
a plugin for OpenNebula. It has some bugs, but you can start from this 
point.
• -----------------(HB) One issue with MooseFS is that it relies upon a 
single Metadata server. Therefore, if that Metadata server fails, the 
cluster fails. GlusterFS does not have a Metadata server.
• --------------------(GT) I assure you that it will work on ON 3.0 too, 
as is. However, it should be updated to follow the new tm_shared 
implementation. The upgrade will probably be trivial since the driver is 
very simple; however, I really don't have time to update and test it. 
Feel free to fork it on GitHub 
(https://github.com/libersoft/opennebula-tm-moosefs), update it against 
the latest tm_shared driver and submit the changes back.
• I wrote a small article a few months ago after a successful 
deployment of OpenNebula on top of a MooseFS volume: 
http://blog.opennebula.org/?p=1512. Short answer: yes, it works, but 
you need to fully understand how it works before using it with 
OpenNebula. Build a test environment and try different configurations; 
run some tests pulling power cords at random and try to recover by 
yourself.
• ------------------(CP) I can agree with the point on slowness.
• I had a shared MooseFS setup running (with only 4 MooseFS servers, 
with 2x 1 TB SATA drives in each), and found disk writes to be far too 
slow to host our test environment properly.
• I have migrated this to a custom TM, which uses MooseFS for the image 
store and copies VM images to local storage to actually run them.
• I am busy investigating MooseFS slowness on a separate testing cluster 
(and have so far improved bonnie++ rewrite speed from 1.5 MB/s to 
18 MB/s). I should have something I am willing to put back into use as a 
proper shared file system soon (I really like using the snapshot 
capabilities to deploy VMs).
• Just to clarify - MooseFS is fast in almost all respects, but is slow 
when continually reading from and writing to the same block of a file. 
Unfortunately the workload on some of my VMs seems to hit this case 
fairly often, and that is what I am trying to optimise (a minimal sketch 
of this access pattern follows).
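
A minimal sketch of that access pattern (my own illustration, not CP's 
actual benchmark; the path, block size and round count are placeholders). 
Running it once against a file on the MooseFS mount and once against local 
disk shows the difference; the fsync() call matters, since without it the 
client-side cache hides most of the rewrite cost.

# Minimal sketch: repeatedly read and rewrite the same block of one file.
import os, time

TARGET = "/mnt/mfs/rewrite-test.bin"   # placeholder path on the MooseFS mount
BLOCK = 64 * 1024                      # 64 KiB block, rewritten in place
ROUNDS = 1000

# Create the test file once.
with open(TARGET, "wb") as f:
    f.write(os.urandom(BLOCK))

start = time.time()
with open(TARGET, "r+b") as f:
    for _ in range(ROUNDS):
        f.seek(0)
        data = f.read(BLOCK)           # read the block ...
        f.seek(0)
        f.write(data)                  # ... and write it back in place
        f.flush()
        os.fsync(f.fileno())           # push it to the file system each round
elapsed = time.time() - start

mb = BLOCK * ROUNDS / (1024.0 * 1024.0)
print("rewrote %.1f MiB in %.1f s -> %.2f MiB/s" % (mb, elapsed, mb / elapsed))
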
Lustre===============================
• -------------(MM) Lustre does not have redundancy: if you use file 
striping between nodes and one node goes offline, the data is no longer 
available.
Gluster===============================
• ------------(MM) Gluster does not support KVM virtualization. The lead 
software developer mentioned that it will be fixed in the next release 
(April).
• -------------(HB) Our research project is currently using GlusterFS 
for our distributed NFS storage system.
• We're leveraging the distributed-replicated configuration, in which 
every two servers form a replica pair and all pairs together form a 
distributed cluster.
• We do not do data striping, since achieving uptime reliability with 
striping would require too many servers.
• Furthermore, another nice feature of GlusterFS is that you can just 
install it into a VM, clone it a few times, and distribute the clones 
across VMMs. We, however, use physical servers with RAID-1.
Sheepdog ===============================
• -----------(MM) Sheepdog works only with images and does not allow 
storing plain files on it; one image can be attached to only one VM at 
a time.
XtreemFS===============================
• -----------(MM) XtreemFS does not have support. If something is not 
working, it is your problem, even if you are ready to pay for a fix.
• -------------------(MB) I'm one of the developers of the distributed 
file system XtreemFS.
• Internally we also run an OpenNebula cluster and use XtreemFS as the 
shared file system there. Since XtreemFS is POSIX compatible, you can use 
tm_nfs and just point OpenNebula to the mounted POSIX volume.
• ….. So far we haven't seen any problems with using XtreemFS in 
OpenNebula; otherwise we would have fixed them.
• Regarding performance: we have not done any measurements so far. Any 
numbers or suggestions on how to benchmark different distributed file 
systems are welcome (a minimal benchmark sketch is included at the end 
of this section).
• In XtreemFS the metadata of a volume is stored on one metadata server. 
If you like, you can set it up replicated and then your file system will 
have no single point of failure.
• Regarding support: Although we cannot offer commercial support, we 
provide support through our mailing list and are always eager to help.
• ----------------(RW) … you are using FUSE; writing fancy file systems 
with FUSE is easy, but making them fast and scalable is very hard (see 
the quoted reply below).
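
Since benchmark suggestions were requested: below is a minimal sketch of a 
streaming write/read test that can be pointed at any mounted volume 
(XtreemFS, MooseFS, GlusterFS, a local disk) for a first rough comparison. 
The mount points and file size are placeholders; for serious numbers a 
dedicated tool such as bonnie++ is the better choice, and note that the 
read pass may be served largely from the client cache.

# Minimal sketch: sequential write/read throughput on several mount points.
import os, time

MOUNTS = ["/mnt/xtreemfs", "/mnt/mfs", "/tmp"]   # placeholder mount points
SIZE_MB = 256                                    # test file size in MiB
CHUNK = 1024 * 1024                              # 1 MiB per write/read call

def throughput(path):
    """Return (write MiB/s, read MiB/s) for a temporary file under path."""
    testfile = os.path.join(path, "dfs-bench.tmp")
    buf = os.urandom(CHUNK)

    start = time.time()
    with open(testfile, "wb") as f:
        for _ in range(SIZE_MB):
            f.write(buf)
        f.flush()
        os.fsync(f.fileno())                     # don't measure the cache only
    write_s = time.time() - start

    start = time.time()
    with open(testfile, "rb") as f:
        while f.read(CHUNK):
            pass
    read_s = time.time() - start

    os.remove(testfile)
    return SIZE_MB / write_s, SIZE_MB / read_s

for mount in MOUNTS:
    w, r = throughput(mount)
    print("%-15s write %6.1f MiB/s   read %6.1f MiB/s" % (mount, w, r))
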





On 09-02-2012 10:49, richard -rw- weinberger wrote:
> On Thu, Feb 9, 2012 at 11:17 AM, Michael Berlin
>> Regarding the performance: We did not do any measurements so far. Any
>> numbers or suggestions how to benchmark different distributed file systems
>> are welcome.
> Hmm, you are using FUSE.
> Performance measurements would be really nice to have.
>
> Writing fancy file systems using FUSE is easy. Making them fast and scalable
> is a damn hard job and often impossible.
>


-- 
João Pagaime
FCCN - Área de Infra-estruturas Aplicacionais
Av. do Brasil, n.º 101 - Lisboa
Telef. +351 218440100  Fax +351 218472167
www.fccn.pt




