Mounting a federated storage cluster as part of a local file system

For the Belle-II experiment, we run more than 3500 user jobs in parallel on 6 different clouds, all at geographically distant locations. Each of these jobs runs physics simulations and needs a set of 5 input files containing about 5 GB of input data. All available sets together are about 100 GB in size, and each job chooses one of the sets as its input data.
However, if all jobs access a single storage site, it is very easy to run into problems, mainly due to:

  • high load on the storage servers
  • timeouts caused by slow data transfers when many jobs share the bandwidth of the storage site
  • slow (random) read access to the disks when serving files to many different jobs in parallel, especially since the central storage server also serves data for other experiments
  • inefficiencies due to long-distance data transfers

The best solution would be to have the data files at different locations close to where the jobs run. This is not easily possible because
  • all storage sites need to be secured and allow access only for specific X509 certificates
  • all jobs are configured in the same way and it is random on which cloud a job ends up, so all jobs need to access the same storage endpoint
To solve this problem, we use Dynafed, a dynamic data federation based on the WebDAV protocol. It provides a single endpoint and handles authentication and authorization. The storage endpoints behind Dynafed have to support HTTP(S)/WebDAV(S) but can otherwise be very different in nature (a sketch of the corresponding Dynafed configuration follows after the list). We use the following storage endpoints:
  • a Ceph-based storage cluster (radosgw)
  • local storage on VMs in the clouds, using Minio to provide an S3-based interface
  • Amazon S3 storage
  • traditional Grid sites that already have the data
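
As an illustration, the endpoints are registered in Dynafed's UGR configuration. The following is only a rough sketch with made-up host names, paths, and keys; the plugin library locations and option names should be checked against the Dynafed/UGR documentation of the installed version:

# Hypothetical sketch of a Dynafed endpoint configuration (all names, paths, and keys are examples only)
cat > /etc/ugr/conf.d/endpoints.conf <<'EOF'
# WebDAV endpoint on a traditional Grid site, mapped into the federation's /data namespace
glb.locplugin[]: /usr/lib64/ugr/libugrlocplugin_dav.so gridsite 15 https://se.gridsite.example.org/belle/data
locplugin.gridsite.xlatepfx: /data /belle/data

# S3 endpoint, e.g. a Minio server on a VM in one of the clouds
glb.locplugin[]: /usr/lib64/ugr/libugrlocplugin_s3.so minio-cloud1 15 https://minio.cloud1.example.org/belle-data
locplugin.minio-cloud1.xlatepfx: /data /
locplugin.minio-cloud1.s3.pub_key: EXAMPLE_ACCESS_KEY
locplugin.minio-cloud1.s3.priv_key: EXAMPLE_SECRET_KEY
EOF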

Since the overall amount of data is not that large, we copied all files to each of the endpoints and included them in Dynafed in a way that makes them appear in the same logical location: for a user job it looks like each file exists only once, but access to it can go through any of the configured endpoints.
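
Replicating the files can be done, for example, with the gfal2 command-line tools. A minimal sketch, assuming two WebDAV-capable endpoints with placeholder host names, paths, and file names:

# Sketch: copy one input set to two (hypothetical) WebDAV endpoints.
# X509_USER_PROXY has to point to a valid proxy for endpoints that require it.
export X509_USER_PROXY=/tmp/x509up_u$(id -u)
for f in set01_file0{1..5}.root; do
    gfal-copy "file:///local/belle/$f" "davs://se.gridsite.example.org:443/belle/data/$f"
    gfal-copy "file:///local/belle/$f" "davs://se.cloudsite.example.org:443/belle/data/$f"
done
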
That way we can configure all jobs to access the same storage access point (Dynafed's URL), and the jobs then get automatically redirected to the real storage endpoint that is closest to the cloud where the job runs (see the sketch after this list). This solves the problems mentioned above and has additional advantages:
  • authorization and authentication are handled by Dynafed, not by the storage site
    • authenticated jobs get a pre-authorized link to the data which is only valid for a very limited time
  • jobs in different clouds access different storage sites for data transfers, but all jobs are still configured in the same way
  • the system is resilient against the outage of individual sites
    • jobs simply continue to access the data from the next closest site
  • the system can be scaled dynamically to handle additional load
    • new endpoints can easily be set up on different clouds at any time and added to Dynafed, e.g. when an increased job load on a cloud causes performance problems on the currently used storage endpoints
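
From the job's point of view this is all transparent: every job talks to the same Dynafed URL and gets redirected behind the scenes. A sketch of what a job-side transfer can look like (host name, port, and file name are placeholders):

# All jobs use the same federation URL; Dynafed redirects each request
# to a nearby storage endpoint.
export X509_USER_PROXY=/tmp/x509up_u$(id -u)
gfal-copy "davs://dynafed.server.com:PORT/data/set01_file01.root" "file://$PWD/set01_file01.root"

# The redirect can also be seen directly, e.g. with curl following the
# HTTP redirect to the selected (pre-authorized, short-lived) endpoint URL:
curl -L --capath /etc/grid-security/certificates \
     --cert "$X509_USER_PROXY" --key "$X509_USER_PROXY" \
     -o set01_file01.root \
     "https://dynafed.server.com:PORT/data/set01_file01.root"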

However, to be able to use that system, the jobs need to access the data through HTTP(S)/WebDAV(S). Unfortunately, not all jobs support these protocols.

To still benefit from the advantages described above, one can use gfalFS, a FUSE module that allows mounting the whole data federation as a single directory in the Linux file system.
To use X509-based authentication, the environment variable X509_USER_PROXY needs to be set to point to a valid X509 proxy file; then a federation can be mounted as:

gfalFS -s $HOME/data davs://dynafed.server.com:PORT/data

which mounts the path /data, as exposed by the Dynafed server dynafed.server.com, into the user's home directory, accessible through $HOME/data.
The big advantage here is that all files in the federation, no matter where they are actually stored, are accessible with standard Linux tools, e.g. ls and cp. Accessing the federation this way does not lose any of the advantages described above; in particular, access from different geographic locations will still be directed to files on different storage endpoints.
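
Putting it together, a job or an interactive user could do something like the following (the proxy path and file name are examples only; the unmount uses the standard FUSE tool):

# Sketch of the full workflow
export X509_USER_PROXY=/tmp/x509up_u$(id -u)   # valid X509 proxy, e.g. created with voms-proxy-init

mkdir -p "$HOME/data"
gfalFS -s "$HOME/data" davs://dynafed.server.com:PORT/data

ls "$HOME/data"                                # browse the federation like a local directory
cp "$HOME/data/set01_file01.root" /tmp/        # fetch a file with plain cp

fusermount -u "$HOME/data"                     # unmount when done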

Running this configuration with more than 3500 jobs in parallel proved to be a very good way to make use of distributed storage in cases where the needed transfer protocol is not supported by the jobs themselves.
