[one-users] (thanks) Re: experiences with distributed FS?
João Pagaime
jpsp at fccn.pt
Thu Feb 9 04:38:54 PST 2012
Thanks for sharing your knowledge.
It will take a while to look into all of this very useful information.
best regards,
João
Here's a short summary, organized by file system:
GPFS===============================
• ------------ (H) We are looking forward to using GPFS as a clustered FS
for our upcoming KVM & OpenNebula setup.
• We are already happy GPFS users.
• … GPFS … is pretty flexible (support for Linux, AIX, and Windows) and very
scalable.
• BUT it is not open source:
http://www-03.ibm.com/systems/software/gpfs/resources.html
• Stability (is it maintainable without a general shutdown)?
o If best practice has been followed, you can survive without forced
downtime for years.
o If you use NFS (hard mounted) on top of GPFS, you could survive for
decades, provided an occasional 60-second NFS 'hang' is acceptable, since
that approach lets you survive even a complete GPFS cluster reboot (see the
mount sketch after this list). Best practice, though, is a rolling
update/upgrade, so that only part of the cluster's servers is down at any
time and the service is not affected.
• Effort/learning curve: if you are already familiar with a clustered FS,
a few days with a little help.
• --------------(GUL) Beware of the clustering software; be sure to use
a very recent release (i.e. the one in Debian Stable).
• I tried it with an older, broken version of the clustering software
and initially it worked wonderfully well... until the clustering software
began to malfunction and kick servers out of the cluster.
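As a rough illustration of the hard-mounted NFS approach (H) mentions, here
is a minimal sketch from an NFS client's point of view; the server name,
export path and mount point are made up, and it assumes the GPFS filesystem
is already exported over NFS by some of the cluster nodes:

    # Hard-mount the NFS export that sits on top of GPFS. With "hard" the
    # client retries forever instead of returning I/O errors, so a short
    # NFS hang (e.g. during a rolling GPFS upgrade or even a full cluster
    # restart) only stalls applications instead of breaking them.
    mount -t nfs -o hard,intr,timeo=600 gpfs-nfs1:/gpfs/fs1 /mnt/shared
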
Conga / CLVM / GFS2 (maybe not strictly a distributed FS?)
==========================
• ----------------(A) I implemented the following, which I have in
production:
• 4 nodes connected to external storage (a NAS) via iSCSI;
• a cluster deployed with Conga and CLVM, with one large shared volume
created across the 4 nodes and GFS2 on top of it (see the sketch after
this list).
• It has really been very successful in production, and that is the
experience I can share with you.
• More documentation:
http://docs.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/index.html
• Check for the following manuals: Cluster Administration; Global File
System 2
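For reference, a minimal sketch of the kind of setup (A) describes; the
target IP, device, cluster name and paths are made up, and it assumes the
Red Hat cluster stack (cman/clvmd, e.g. configured through Conga) is
already running on all 4 nodes:

    # On each node: discover and log in to the iSCSI target on the NAS
    iscsiadm -m discovery -t sendtargets -p 192.0.2.10
    iscsiadm -m node --login

    # On one node: create a clustered volume group and one big LV
    pvcreate /dev/sdb
    vgcreate -cy vg_cluster /dev/sdb
    lvcreate -l 100%FREE -n lv_shared vg_cluster

    # Make a GFS2 filesystem with one journal per node (4 nodes),
    # using DLM locking and the cluster name from cluster.conf
    mkfs.gfs2 -p lock_dlm -t mycluster:shared -j 4 /dev/vg_cluster/lv_shared

    # On every node: mount the shared filesystem
    mount -t gfs2 /dev/vg_cluster/lv_shared /srv/shared
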
MooseFS===============================
• -----------------(AZ) I have an installation of OpenNebula 3.0 with the
Xen hypervisor and storage on MooseFS.
• The OpenNebula server is also the MooseFS master server; obviously we
have a second OpenNebula/MooseFS slave server configured with uCarp for HA.
• At the moment we have 3 storage nodes; every server is equipped with 2x
2 TB SATA disks, for a total of 11 TB of disk space; we use a replica
factor of 2.
• On the OpenNebula side we use the shared storage driver; the VM disks are
plain files and the Xen disk driver is "file:" (see the sketch at the end
of this section).
• In the past we used OpenNebula 2.2 with the MooseFS driver for immediate
deployment via snapshots, but the driver wasn't updated for OpenNebula 3.0.
• We are very happy with MooseFS: it's very easy and robust. The only
consideration is on the disk side: SATA is a little slow, especially for
write operations, so we use a lot of RAM on each storage server (8 GB) to
cache data.
• If you want, there are many configuration examples of
OpenNebula/MooseFS/uCarp on my personal blog (http://www.azns.it). Sorry,
they are in Italian.
• --------------------(MM) And my choice: MooseFS. It is redundant (any
node can go offline and the data stays available), scalable, with striping
and CoW (you can create a copy of a huge image in a second), has internal
checksum correction, commercial support, data deduplication (commercial
version only) and many other features. There is a plugin for OpenNebula.
It has some bugs, but you can start from this point.
• -----------------(HB) One issue with MooseFS is that it relies upon a
single Metadata server. Therefore, if that Metadata server fails, the
cluster fails. GlusterFS does not have a Metadata server.
• --------------------(GT) I assure you that it will work on ON 3.0 too
as is. However, it should be updated to follow the new tm_shared
implementation. The upgrade will probably be trivial since the driver is
very simple, but I really don't have time to update and test it.
Feel free to fork it on GitHub
(https://github.com/libersoft/opennebula-tm-moosefs), update it against
the latest tm_shared driver and submit the changes back.
• I wrote a small article a few months ago after a successful
deployment of OpenNebula on top of a MooseFS volume:
http://blog.opennebula.org/?p=1512. Short answer: yes, it works, but
you need to fully understand how it works before using it with
OpenNebula. Build a test environment and try different configurations;
run some tests pulling power cords at random and try to recover by yourself.
• ------------------(CP) I can agree with the point about slowness.
• I had a shared MooseFS setup running (with only 4 MooseFS servers,
with 2x 1 TB SATA drives in each) and found disk writes to be far too
slow to host our test environment correctly.
• I have migrated this to a custom TM driver, which uses MooseFS as the
image store and copies VM images to local storage to actually run them.
• I am busy investigating MooseFS slowness on a separate testing cluster
(and have so far improved bonnie++ rewrite speed from 1.5 MB/s to
18 MB/s). I should have something I am willing to put back into use as a
proper shared filesystem soon (I really like using the snapshot
capabilities to deploy VMs).
• Just to clarify: MooseFS is fast in almost all respects, but is slow
when continually reading from and writing to the same block of a file.
Unfortunately the workload on some of my VMs seems to hit this case
fairly often, and that is what I am trying to optimise.
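To make the MooseFS side of these setups concrete, here is a minimal sketch
along the lines of what (AZ) and (GT) describe; the mount point, hostname
and paths are made up, and "mfsmaster" stands for the uCarp-managed address
of the master:

    # Mount the MooseFS namespace on each OpenNebula / Xen host
    mfsmount /srv/cloud -H mfsmaster

    # Keep two copies of everything under the datastore (replica factor 2)
    mfssetgoal -r 2 /srv/cloud

    # CoW snapshot: "copy" a huge base image in about a second
    mfsmakesnapshot /srv/cloud/images/base.img /srv/cloud/one/42/disk.0

    # The Xen guest then uses the plain file via the "file:" driver, e.g.
    #   disk = [ 'file:/srv/cloud/one/42/disk.0,xvda,w' ]
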
Lustre===============================
• -------------(MM) Lustre does not have data redundancy: if you use
file striping across nodes and one node goes offline, the striped data
becomes unavailable (see the sketch below).
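For context, striping in Lustre is set per directory or file with the lfs
tool; a minimal sketch (the paths and stripe count are made up) that shows
the trade-off (MM) points out:

    # Stripe every new file in this directory across 4 OSTs. Striping helps
    # bandwidth, but if any one of those 4 OSTs is offline the striped
    # files are unavailable until it returns; Lustre itself does not
    # replicate the data.
    lfs setstripe -c 4 /mnt/lustre/images

    # Inspect how an existing file is striped
    lfs getstripe /mnt/lustre/images/vm-disk.img
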
Gluster===============================
• ------------(MM) Gluster does not support KVM virtualization. A lead
developer mentioned that it will be fixed in the next release (April).
• -------------(HB) Our research project is currently using GlusterFS
for our distributed NFS storage system.
• We're using the distributed-replicate configuration, in which the servers
are grouped in pairs, each pair holding a replica set, and all pairs
together form a distributed cluster (see the sketch after this list).
• We do not do data striping, since achieving up-time reliability with
striping would require too many servers.
• Another nice feature of GlusterFS is that you can just install it into a
VM, clone it a few times, and distribute the clones across VMMs. We,
however, use physical servers with RAID-1.
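A minimal sketch of the distributed-replicate layout (HB) describes, with
4 made-up server names and brick paths; bricks are taken two at a time, in
the order given, to form the replica pairs:

    # Each consecutive pair of bricks is a replica set; the two sets are
    # then distributed, so each file lands on one pair or the other.
    gluster volume create vmstore replica 2 \
        srv1:/bricks/b1 srv2:/bricks/b1 \
        srv3:/bricks/b1 srv4:/bricks/b1
    gluster volume start vmstore

    # Mount the volume on the OpenNebula front-end and hosts
    mount -t glusterfs srv1:/vmstore /srv/cloud
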
Sheepdog ===============================
• -----------(MM) Sheepdog works only with images; it does not allow
storing plain files on it, and one image can be attached to only one VM
at a time (see the sketch below).
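To illustrate the image-only model (MM) describes, a minimal sketch using
QEMU's Sheepdog support (the VDI name and size are made up):

    # Create a 20 GB virtual disk (VDI) inside the Sheepdog cluster
    qemu-img create sheepdog:one-vm42-disk0 20G
    qemu-img info sheepdog:one-vm42-disk0

    # The VDI is then attached to a single KVM guest as a block device,
    # e.g. with -drive file=sheepdog:one-vm42-disk0,if=virtio
    # There is no mount point, so you cannot store plain files on it.
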
XtreemFS===============================
• -----------(MM) XtreemFS does not have support: if something is not
working, it is your problem, even if you are ready to pay to have it
fixed.
• -------------------(MB) I'm one of the developers of the distributed
file system XtreemFS.
• Internally we also run an OpenNebula cluster and use XtreemFS as the
shared file system there. Since XtreemFS is POSIX compatible, you can use
tm_nfs and just point OpenNebula at the mounted POSIX volume.
• ….. So far we haven't seen any problems with using XtreemFS in
OpenNebula; otherwise we would have fixed them.
• Regarding performance: we have not done any measurements so far. Any
numbers, or suggestions on how to benchmark different distributed file
systems, are welcome (one possible starting point is sketched after the
quoted exchange below).
• In XtreemFS the metadata of a volume is stored on one metadata server.
If you like, you can set it up replicated and then your file system will
have no single point of failure.
• Regarding support: Although we cannot offer commercial support, we
provide support through our mailing list and are always eager to help.
• ----------------(RW)… you are using FUSE.
On 09-02-2012 10:49, richard -rw- weinberger wrote:
> On Thu, Feb 9, 2012 at 11:17 AM, Michael Berlin
>> Regarding the performance: We did not do any measurements so far. Any
>> numbers or suggestions how to benchmark different distributed file systems
>> are welcome.
> Hmm, you are using FUSE.
> Performance measurements would be really nice to have.
>
> Writing fancy file systems using FUSE is easy. Making them fast and scalable
> is a damn hard job and often impossible.
>
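On the benchmarking question raised above, here is a minimal sketch of one
way to compare the FUSE-mounted file systems from this thread, in the
spirit of (CP)'s bonnie++ numbers; the directory-service host, volume name
and mount point are made up:

    # Mount an XtreemFS volume through its FUSE client
    mount.xtreemfs dir-host/myVolume /mnt/xtreemfs

    # Run the same bonnie++ pass against each candidate mount point
    # (sequential write, rewrite, read and seeks) and compare the reports
    bonnie++ -d /mnt/xtreemfs -u nobody
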
--
João Pagaime
FCCN - Área de Infra-estruturas Aplicacionais
Av. do Brasil, n.º 101 - Lisboa
Telef. +351 218440100 Fax +351 218472167
www.fccn.pt