Posts

Monitoring Dynafed with ELK

The Dynamic Federation project (DynaFed), being developed at CERN, is intended to federate any number of different types of storage endpoints, allowing users to read or write data transparently and efficiently. How it works in a nutshell:

1. A client sends an HTTP request (COPY, GET, PUT, or DELETE) to the DynaFed instance's URL.
2. DynaFed decides which of the storage endpoints it federates is the "best" one to handle the request, sorted according to geographic location and available free space.
3. The client receives a 302 redirect link pointing to this storage endpoint with the protocol and authentication tokens necessary for the transaction.
4. The client re-sends the HTTP request directly to this storage endpoint.

As we can see, after the redirect link is provided, DynaFed is out of the loop and as such is unaware of whether the transaction between the client and the storage endpoint is successful or not. So while we cannot get this valuable information from DynaFed, we instead are interested i
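The redirect flow described above can be illustrated with a small Python sketch. It is not DynaFed's implementation: the `Endpoint` fields, the example URLs, and the ranking rule (closest endpoint first, ties broken by most free space) are assumptions made for illustration only, since the text says only that endpoints are sorted by geographic location and free space.

```python
from dataclasses import dataclass


@dataclass
class Endpoint:
    url: str            # base URL of the storage endpoint (hypothetical)
    distance_km: float  # assumed distance between client and endpoint
    free_bytes: int     # assumed free space reported by the endpoint


def pick_endpoint(endpoints):
    """Prefer the closest endpoint; break ties by the most free space."""
    return min(endpoints, key=lambda e: (e.distance_km, -e.free_bytes))


def redirect_for(path, endpoints):
    """Build the status code and headers of a 302 answer for `path`."""
    best = pick_endpoint(endpoints)
    location = best.url.rstrip("/") + "/" + path.lstrip("/")
    return 302, {"Location": location}


# Two made-up endpoints: one far away with lots of space, one nearby.
endpoints = [
    Endpoint("https://storage-a.example.org/data", 4200.0, 10 * 10**12),
    Endpoint("https://storage-b.example.org/data", 35.0, 2 * 10**12),
]
status, headers = redirect_for("/belle/file1.root", endpoints)
print(status, headers["Location"])
```

After this answer the client talks to the chosen endpoint directly, which is why the federator never sees the outcome of the transfer.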

HEPiX Presentations

We had the opportunity to present our compute and storage system for HEP at the Fall HEPiX at KEK, Japan. In our first presentation, we showed how we run HEP workloads on distributed clouds. We showed how Cloudscheduler is used to automatically start up VMs on any cloud accessible to us when needed, how Shoal is used to connect to the squid closest to a VM, and how we implemented an accounting and benchmark system. Our second presentation covered the data storage part, which for a distributed cloud system is very different from that of a traditional Grid storage site. In this presentation, we showed why a distributed cloud compute system needs a different storage approach, how we realize it, and how the authentication and authorization for such a system work. We will present updates on our accounting and monitoring system, as well as on the production usage of our distributed storage system, soon at CHEP in Sofia, Bulgaria.

Mounting a federated storage cluster as part of a local file system

For the Belle-II experiment, we run more than 3500 user jobs in parallel on 6 different clouds, all at geographic locations far away from each other. Running physics simulations, each of these jobs needs a set of 5 input files with about 5 GB of input data. All available sets together are about 100 GB in size, and each job chooses one of the sets as its input data. However, if all jobs access a single storage site, it is very easy to run into problems, mainly due to:

- high load on the storage servers
- timeouts due to slow data transfers when sharing the bandwidth of the storage site
- slow (random) read access to the disks when providing the files for many different jobs in parallel, especially since the central storage server also serves data for other experiments
- inefficiencies due to long-distance data transfers

The best solution here would be to have the data files at different locations close to where the jobs run. This isn't easily possible

Shoal - Squid Proxy Discovery and Management

The Shoal system has been running stably in a production environment for several years now without much change. The goal of Shoal is to help provide contextualization to new virtual machines in a cloud production environment. More simply, Shoal provides virtual machines with some squid proxies where they can retrieve the software and data they need to run their payloads without going all the way to the source. The Shoal system is broken down into three components:

- shoal-agent
- shoal-server
- shoal-client

The shoal-agent is a daemon process that runs on a squid proxy cache. The daemon collects various health metrics and configuration information about the squid and sends the shoal-server a message via AMQP (Advanced Message Queuing Protocol). Each installation of shoal-agent has a shoal-agent configuration file, typically found at /etc/shoal/shoal_agent.conf. This file allows you to configure several things about the squid cache such as which shoal-server to register to, who the
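To make the agent's role concrete, here is a minimal Python sketch of the kind of message a squid-side daemon might assemble before handing it to an AMQP client. It is not shoal-agent's code: the function `build_agent_message` and every field name in the payload are hypothetical, and the actual AMQP publish step is left out because it needs a client library and a running broker.

```python
import json
import time


def build_agent_message(hostname, load, bytes_out, squid_port=3128):
    """Serialize one heartbeat as JSON; all field names are hypothetical."""
    payload = {
        "hostname": hostname,      # where the squid cache runs
        "squid_port": squid_port,  # 3128 is squid's default listening port
        "load": load,              # example health metric
        "bytes_out": bytes_out,    # example traffic metric
        "timestamp": time.time(),  # lets a receiver drop stale entries
    }
    return json.dumps(payload, sort_keys=True)


message = build_agent_message("squid01.example.org", load=0.42, bytes_out=123456)
decoded = json.loads(message)
print(decoded["hostname"], decoded["squid_port"])
```

A server that collects such messages could then answer clients with the squids that reported most recently.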

ACAT 2017

The 18th International Workshop on Advanced Computing and Analysis Techniques in Physics Research (ACAT) took place in Seattle this week. We presented our work on integrating the dynamic web federation into HEP computing as a poster. The conference focused on the use of machine learning algorithms in physics research, with contributions from industry offering effective computing technology to execute workflows employing deep neural nets. These technologies offer solutions to the computing issues the field is facing in light of a great increase in data with a constant computing budget. When the LHC experiments were planned, it was assumed that Dennard scaling would solve this problem for us, but it has become clear that this is not the case. It was shown that generative adversarial neural nets may be used to do simulation, and that supervised learning may provide options for triggering and reconstruction. In some places these technologies are already used. nVidia, Microsoft, and DW

Glint Version 2 Enters Production

After several months of prototyping and development, Glint version 2.0 (Glint v2) has entered production. Glint v2 is a standalone web service inspired by Colin Leavett-Brown and Ron Desmarais' original Glint service (Glint v1). The idea of Glint was to allow for image replication across multiple OpenStack clouds using a simple interface instead of manually downloading and uploading images to new locations. Version 2 differs from the original in that it is a dedicated web service instead of an extension of the OpenStack Horizon dashboard. Unfortunately, the OpenStack developers had a different philosophy regarding image and repository management and decided not to accept Glint v1 as a proprietary module. The OpenStack 6-month development cycle made it unreasonable for a small group like UVic's HEPRC group to maintain Glint v1 as an OpenStack plugin. Instead a new version of the service was conce

Authorization in DynaFed, Part 2

As we showed previously, there is an easy way to use the information derived from a VOMS server, based on grid-mapfiles, to authorize a specific user to access a specific part of the dynamic federation. This solution was based on 3 parts:

- a grid-mapfile listing the DNs of all users from all supported VOs with all possible roles
- a text file (accessfile) that specifies the different privileges for the different parts of the storage federation
- a Python script that performs the authentication and authorization based on the 2 previously mentioned files

While in this solution the grid-mapfile and accessfile can be changed at any time without the need to reload or restart the httpd and memcache processes, there is also a simpler solution based on the internal authentication methods, which however requires restarting httpd and memcache after each change. This one is explained in the following. Using the built-in authentication in DynaFed, one can grant access to a specific part o
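The three-part scheme from the earlier post (grid-mapfile, accessfile, Python script) could look roughly like the following sketch. It is not the script from that post: the accessfile layout (`<path prefix> <group> <modes>`), the mode letters, the helper names, and the example DN are all assumptions for illustration. Only the grid-mapfile convention of a quoted DN followed by a name is the commonly used one.

```python
import shlex


def parse_gridmap(text):
    """Map each quoted DN to the name that follows it on the line."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = shlex.split(line)  # handles the quoted DN containing spaces
        if len(parts) >= 2:
            mapping[parts[0]] = parts[1]
    return mapping


def parse_access(text):
    """Parse hypothetical lines of the form: <path prefix> <group> <modes>."""
    rules = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 3 and not parts[0].startswith("#"):
            rules.append(tuple(parts))
    return rules


def is_allowed(dn, path, mode, gridmap, rules):
    """Return True if the user's group has `mode` on a prefix of `path`."""
    group = gridmap.get(dn)
    if group is None:
        return False  # unknown DN: authentication fails
    return any(
        path.startswith(prefix) and group == rule_group and mode in modes
        for prefix, rule_group, modes in rules
    )


ALICE = "/DC=org/DC=example/CN=Alice Example"  # made-up DN
gridmap = parse_gridmap('"/DC=org/DC=example/CN=Alice Example" belle\n')
rules = parse_access("/belle belle rl\n/atlas atlas rlwd\n")

print(is_allowed(ALICE, "/belle/data/file1", "r", gridmap, rules))
print(is_allowed(ALICE, "/belle/data/file1", "w", gridmap, rules))
```

Because both files are re-read as plain text, a script of this shape can pick up changes without restarting any service, which matches the trade-off described above.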