Slurm User Group Meeting 2013
Hosted by SchedMD
Agenda
The 2013 SLURM User Group Meeting will be held on September 18 and 19 in Oakland, California, USA. The meeting will include an assortment of tutorials, technical presentations, and site reports. The Schedule and Abstracts are shown below.
Meeting Information
The meeting will be held at California State University's Conference Center, 1000 Broadway Avenue, Suite 109, Oakland, California (Phone 510-208-7001, access from 11th Street). This state-of-the-art facility is located adjacent to the 12th Street BART (Metro) station, with easy access to the entire San Francisco Bay Area. There is also frequent and free bus service to Jack London Square using the Broadway Shuttle.
Hotel Information
Many hotel options are available in Oakland, San Francisco, and elsewhere in the area. Just be sure that your hotel has easy access to BART. Consider the hotels listed below as suggestions:
Waterfront Hotel
Like it says in the name, on the waterfront, with several nice restaurants nearby.
About 1 mile (2 km) from the conference center via the Broadway Shuttle.
Ferry service to San Francisco adjacent to the hotel.
Oakland Marriott City Center
Across the street from the conference center.
Discounted rooms are available to government employees.
Registration
The conference cost is $250 per person for registrations by 29 August and $300 per person for late registration. This includes presentations, tutorials, lunch and snacks on both days, plus dinner on Wednesday evening.
Register here.
Schedule
September 18, 2013
Time | Theme | Speaker | Title |
---|---|---|---|
08:00 - 09:00 | Registration / Breakfast | ||
09:00 - 09:15 | Welcome | Morris Jette (SchedMD) | Welcome to Slurm User Group Meeting |
09:15 - 10:00 | Keynote | Dona Crawford (LLNL) | Future Outlook for Advanced Computing |
10:00 - 10:30 | Coffee break | ||
10:30 - 11:00 | Technical | Morris Jette, Danny Auble (SchedMD), Yiannis Georgiou (Bull) | Overview of Slurm version 2.6 |
11:00 - 12:00 | Tutorial | Yiannis Georgiou, Martin Perry, Thomas Cadeau (Bull), Danny Auble (SchedMD) | Energy Accounting and External Sensor Plugins |
12:00 - 13:00 | Lunch at conference center | ||
13:00 - 13:30 | Technical | Yiannis Georgiou, Thomas Cadeau (Bull), Danny Auble, Moe Jette (SchedMD), Matthieu Hautreux (CEA) | Evaluation of Monitoring and Control Features for Power Management |
13:30 - 14:00 | Technical | Matthieu Hautreux (CEA) | Debugging Large Machines |
14:00 - 14:30 | Technical | Alberto Falzone, Paolo Maggi (Nice) | Creating easy to use HPC portals with NICE EnginFrame and Slurm |
14:30 - 15:00 | Coffee break | ||
15:00 - 15:30 | Technical | David Glesser, Yiannis Georgiou, Joseph Emeras, Olivier Richard (Bull) | Slurm evaluation using emulation and replay of real workload traces |
15:30 - 16:30 | Tutorial | Rod Schultz, Yiannis Georgiou (Bull), Danny Auble (SchedMD) | Usage of new profiling functionalities |
18:00 - | Dinner | | Lungomare, 1 Broadway Ave. |
September 19, 2013
Time | Theme | Speaker | Title |
---|---|---|---|
08:00 - 08:30 | Registration / Breakfast | ||
08:30 - 09:00 | Technical | Morris Jette, David Bigagli, Danny Auble (SchedMD) | Fault Tolerant Workload Management |
09:00 - 09:30 | Technical | Yiannis Georgiou (Bull), Matthieu Hautreux (CEA) | Slurm Layouts Framework |
09:30 - 10:00 | Technical | Bill Brophy (Bull) | License Management |
10:00 - 10:30 | Coffee break | ||
10:30 - 11:00 | Technical | Juan Pancorbo Armada (LRZ) | Multi-Cluster Management |
11:00 - 11:30 | Technical | Francois Diakhate, Matthieu Hautreux (CEA) | Depth Oblivious Hierarchical Fairshare Priority Factor |
11:30 - 12:00 | Technical | Dave Wallace (Cray) | Refactoring ALPS |
12:00 - 13:00 | Lunch at conference center | ||
13:00 - 13:20 | Site Report | Francois Diakhate, Francis Belot, Matthieu Hautreux (CEA) | CEA Site Report |
13:20 - 13:40 | Site Report | Tim Wickberg (George Washington University) | George Washington University Site Report |
13:40 - 14:00 | Site Report | Ryan Cox (BYU) | Brigham Young University Site Report |
14:00 - 14:20 | Site Report | Doug Hughes, Chris Harwell, Eric Radman, Goran Pocina, Michael Fenn (D.E. Shaw Research) | D.E. Shaw Research Site Report |
14:20 - 14:40 | Site Report | Dr. Ulf Markwardt (Technische Universität Dresden) | Technische Universität Dresden Site Report |
14:40 - 15:10 | Coffee break | ||
15:00 - 15:30 | Technical | Morris Jette (SchedMD), Yiannis Georgiou (Bull) | Slurm Roadmap |
15:30 - 16:30 | Discussion | Everyone | Open Discussion |
Abstracts
September 18, 2013
Overview of Slurm Version 2.6
Danny Auble, Morris Jette (SchedMD), Yiannis Georgiou (Bull)
This presentation will provide an overview of Slurm enhancements in version 2.6, released in May. Specific developments to be described include the following (a brief job-array example follows this list):
- Support for job arrays, which increases performance and ease of use for sets of similar jobs.
- Support for MapReduce+.
- Added prolog and epilog support for advanced reservations.
- Much faster throughput for job step execution.
- Advanced reservations now support specifying a different core count on each node.
- Added external sensors plugin to capture temperature and power data.
- Added job profiling capability.
- CPU count limits by partition.
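The job array support mentioned in the list is exposed through the sbatch --array option. Below is a minimal sketch of driving it from Python; the batch script name is a placeholder and the job-ID parsing is deliberately naive.

```python
"""Minimal sketch: submitting a Slurm 2.6 job array from Python.

Assumes `sbatch` is on PATH and that `process_one.sh` is a user-provided
batch script that reads SLURM_ARRAY_TASK_ID to pick its input.
"""
import subprocess


def submit_array(script, first, last):
    """Submit `script` as a job array covering task IDs first..last."""
    out = subprocess.check_output(
        ["sbatch", "--array=%d-%d" % (first, last), script]).decode()
    return out.split()[-1]   # "Submitted batch job <id>" -> "<id>"


if __name__ == "__main__":
    print("submitted array job", submit_array("process_one.sh", 0, 31))
```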
Usage of Energy Accounting and External Sensor Plugins
Yiannis Georgiou, Martin Perry, Thomas Cadeau (Bull), Danny Auble (SchedMD)
Power management has gradually passed from a trend to an important need in High Performance Computing. Slurm version 2.6 provides functionality for energy consumption recording and accounting per node and job, following both in-band and out-of-band strategies. The new implementation consists of two new plugins: one allowing in-band collection of energy consumption data from the BMC of each node, based on the freeipmi library; another allowing out-of-band collection from centralized storage, based on the rrdtool library. The second plugin allows external mechanisms such as wattmeters to be taken into account for the energy consumption recording and accounting per node and job. The data can be used by users and administrators to improve the energy efficiency of their applications and of their clusters in general.
The tutorial will provide a brief description of the various power management features in Slurm and will make a detailed review of the new plugins introduced in 2.6, with configuration and usage details along with examples of actual deployment.
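As a companion to the tutorial material, here is a minimal sketch of how the resulting accounting data can be read back once one of the energy plugins is configured. It assumes sacct is on PATH and uses the ConsumedEnergy accounting field introduced with this work; the job ID is a placeholder.

```python
"""Minimal sketch: reading per-job energy data recorded by the
acct_gather_energy/ipmi (in-band) or ext_sensors/rrd (out-of-band)
plugins. Assumes `sacct` is on PATH and the job has finished."""
import subprocess


def job_energy(jobid):
    """Return the ConsumedEnergy value sacct reports for a job, or None."""
    out = subprocess.check_output(
        ["sacct", "-n", "-P", "-j", str(jobid),
         "--format=JobID,ConsumedEnergy"]).decode()
    for line in out.splitlines():
        job, energy = line.split("|")
        if job == str(jobid):          # skip the .batch / step records
            return energy
    return None


print(job_energy(1234))                # 1234 is a placeholder job ID
```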
Evaluation of Monitoring and Control Features for Power Management
Yiannis Georgiou, Thomas Cadeau (Bull), Danny Auble, Moe Jette (SchedMD), Matthieu Hautreux (CEA)
High Performance Computing platforms are characterized by their increasing needs in power consumption. The Resource and Job Management System (RJMS) is the HPC middleware responsible for distributing computing resources to user applications. The appearance of hardware sensors, along with their support on the kernel/software side, can be exploited by the RJMS to enhance the monitoring and control of executions with energy considerations. This essentially enriches the applications' execution statistics with online energy profiling and gives users the possibility to control the trade-offs between energy consumption and performance. In this work we present the design and evaluation of a new framework, developed upon the SLURM Resource and Job Management System, which allows energy consumption recording and accounting per node and job, along with parameters for job energy control features based on static frequency scaling of the CPUs. We evaluate the overhead of the design choices and the precision of the energy consumption results with different HPC benchmarks (IMB, STREAM, HPL) on real-scale platforms with integrated wattmeters. Since the goal is to deploy the framework on large petaflop-scale clusters such as Curie, scalability is an important aspect.
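As a rough illustration of the kind of experiment described above, the sketch below runs the same benchmark at several fixed CPU frequencies using srun's --cpu-freq option and relies on the energy accounting described in the previous abstract. The benchmark path and the frequency list are placeholders and must match what the nodes actually support.

```python
"""Illustrative sketch: launch the same command at different fixed CPU
frequencies (srun --cpu-freq takes values in kHz) so that the energy
accounted per step can be compared afterwards with sacct."""
import subprocess


def run_at_freq(freq_khz, cmd):
    """Launch `cmd` under srun with a fixed CPU frequency (in kHz)."""
    subprocess.check_call(["srun", "--cpu-freq=%d" % freq_khz] + cmd)


# Hypothetical frequency sweep; ./my_benchmark is a placeholder workload.
for freq in (1200000, 1800000, 2400000):
    run_at_freq(freq, ["./my_benchmark"])
```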
Debugging Large Machines
Matthieu Hautreux (CEA)
This talk will present some cases of particularly interesting bugs that were studied, worked around, or corrected over the past few years on the petaflop-scale machines installed and used at CEA. The goal is to share with the administrator community some methods and tools that help to identify, and in some cases work around or correct, unexpected performance issues or bugs.
Creating easy to use HPC portals with NICE EnginFrame and Slurm
Alberto Falzone, Paolo Maggi (Nice)
NICE EnginFrame is a popular framework to easily create HPC portals that provide user-friendly application-oriented computing and data services, hiding all the complexity of the underlying IT infrastructure. Designed for technical computing users in a broad range of markets (Oil&Gas, Automotive, Aerospace, Medical, Finance, Research, and more), EnginFrame simplifies engineers' and scientists' work through its intuitive, self-documenting interfaces, increasing productivity and streamlining data and resource management. Leveraging all the major HPC job schedulers and remote visualization technologies, EnginFrame translates user clicks into the appropriate actions to submit HPC jobs, create remote visualization sessions, monitor workloads on distributed resources, manage data and much more. In this work we describe the integration between the SLURM Workload Manager and EnginFrame. We will then illustrate how this integration can be leveraged to create easy to use HPC portals for SLURM-based HPC infrastructures.
Slurm evaluation using emulation and replay of real workload traces
David Glesser, Yiannis Georgiou, Joseph Emeras, Olivier Richard (Bull)
The experimentation and evaluation of Resource and Job Management Systems in HPC supercomputers are characterized by important complexities due to the interdependency of multiple parameters that have to be kept under control. In our study we have developed a methodology based upon emulated, controlled experimentation under real conditions, with submission of workload traces extracted from a production system. The methodology is used to compare different Slurm configurations in order to deduce the best configuration for the typical workload that takes place on the supercomputer, without disturbing production. We will present observations and evaluation results using real workload traces extracted from the Curie supercomputer, a Top500 system with 80,640 cores, replayed upon only 128 cores of a machine with similar architecture. Various interesting results are extracted and important side effects are discussed, along with proposed configurations for each type of workload. Ideas for improvements to Slurm are also proposed.
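The sketch below is purely illustrative of the replay idea: it submits jobs from a trace at their recorded inter-arrival offsets. The CSV trace format shown here is invented for the example and is not the format used in the study.

```python
"""Illustrative sketch only: replay a (hypothetical) workload trace by
submitting each recorded job at its original submission offset. The trace
is assumed to be a CSV of: offset_seconds, ntasks, walltime, script."""
import csv
import subprocess
import time


def replay(trace_path):
    start = time.time()
    with open(trace_path) as f:
        for offset, ntasks, walltime, script in csv.reader(f):
            # Sleep until the job's original submission offset is reached.
            delay = float(offset) - (time.time() - start)
            if delay > 0:
                time.sleep(delay)
            subprocess.check_call(
                ["sbatch", "-n", ntasks, "-t", walltime, script])


replay("workload_trace.csv")   # placeholder trace file
```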
Usage of new profiling functionalities
Rod Schultz, Yiannis Georgiou (Bull), Danny Auble (SchedMD)
SLURM version 2.6 includes the ability to gather detailed performance data on jobs. A plugin stores the detailed data in an HDF5 file. Other plugins gather data on task performance such as CPU usage, memory usage, and local disk I/O; I/O to the Lustre file system; traffic through an InfiniBand network interface; and energy information collected from IPMI. This tutorial will describe the new capability, show how to configure the various data sources, show examples of different data streams, and report on actual usage.
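For readers who want to look inside such a profile, the sketch below walks a job's HDF5 file with h5py and prints every dataset it finds. The file location and group layout depend on site configuration (for example, the ProfileHDF5Dir setting), so nothing about the internal structure is assumed here.

```python
"""Minimal sketch: inspect a per-job profile written by the
acct_gather_profile/hdf5 plugin. Requires the h5py package."""
import sys

import h5py


def dump_profile(path):
    """Print the name, shape, and dtype of every dataset in the file."""
    with h5py.File(path, "r") as f:
        def visit(name, obj):
            if isinstance(obj, h5py.Dataset):
                print(name, obj.shape, obj.dtype)
        f.visititems(visit)


if __name__ == "__main__":
    dump_profile(sys.argv[1])   # path to the job's .h5 profile file
```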
September 19, 2013
Fault Tolerant Workload Management
Morris Jette, David Bigagli, Danny Auble (SchedMD)
One of the major issues facing exascale computing is fault tolerance: how can a computer be effectively used if the typical job execution time exceeds its mean time between failures? Part of the solution is providing users with means to address failures in a coordinated fashion with a highly adaptable workload manager. Such a solution would support coordinated recognition of failures, notification of failing and failed components, replacement resources, and extended job time limits using negotiated interactive communications. This paper describes fault tolerance issues from the perspective of a workload manager and the implementation of a solution designed to optimize job fault tolerance, based upon the popular open source workload manager Slurm.
Slurm Layouts Framework
Yiannis Georgiou (Bull), Matthieu Hautreux (CEA)
This talk will describe the origins and goals of the study concerning the Layouts Framework, as well as its first targets, current developments, and results. The layouts framework aims at providing a uniform and generalized way to describe the hierarchical relations between resources managed by an RM in order to use that information in related RM internal logic. Examples of instantiated layouts could be the description of the network connectivity of nodes for Slurm internal communication, the description of the power supply network and capacities per branch powering up the nodes, the description of the racking of the nodes, ...
License Management
Bill Brophy (Bull)
License management becomes an increasingly critical issue as the size of systems increases. These valuable resources deserve the same careful management as all other resources configured in a cluster. When licenses are utilized in both interactive and batch execution environments, with multiple resource managers involved, the complexity of this task increases significantly. Current license management within SLURM is not integrated with any external license managers. This approach is adequate if all jobs requiring licenses are submitted through SLURM, or if SLURM is given a subset of the licenses available on the system to sub-manage. However, sub-management can result in underutilization of valuable license resources. Documentation for other resource managers describes their interaction with external license managers. For SLURM to become an active participant in license management, an evolution of its management approach must occur. This article proposes a two-phased approach for accomplishing that transformation. In the first phase, enhancements are proposed for how SLURM internally deals with licenses: restriction of licenses to specific accounts or users, recommendations for keeping track of license information, and suggestions for how this information can be displayed for SLURM users and administrators. The second phase of this effort, which is considerably more ambitious, is to define an evolution of SLURM's approach to license management. This phase introduces an interaction between SLURM and external license managers. The goal of this effort is to increase SLURM's effectiveness in another area of resource management, namely the management of software licenses.
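For context on the existing internal model that the first phase builds on, the sketch below shows the current usage pattern in which the administrator declares license counts in slurm.conf (for example, `Licenses=fluent:30`) and jobs request them at submission time. The license name and script are placeholders.

```python
"""Minimal sketch: submitting a job that reserves licenses tracked
internally by SLURM. Assumes `sbatch` is on PATH and that the license
name has been declared in slurm.conf by the administrator."""
import subprocess


def submit_with_licenses(script, licenses):
    """Submit `script`, reserving licenses such as "fluent:2"."""
    return subprocess.check_output(
        ["sbatch", "--licenses=%s" % licenses, script]).decode().strip()


print(submit_with_licenses("solve.sh", "fluent:2"))   # placeholder names
```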
Multi-Cluster Management
Juan Pancorbo Armada (LRZ)
As a service provider for scientific high performance computing, Leibniz Rechenzentrum (LRZ) operates compute systems for use by educational institutions in Munich and Bavaria, as well as at the national level. LRZ provides its own computing resources and also houses and manages computing resources from other institutions such as the Max Planck Institute or Ludwig Maximilians University. The tier-2 Linux cluster operated at LRZ is a heterogeneous system with different types of compute nodes, divided into 13 different partitions, each of which is managed by SLURM. The various partitions are configured for the different needs and services requested, ranging from single-node multiple-core NUMAlink shared memory clusters, to a 16-way InfiniBand-connected cluster for parallel job execution, or an 8-way Gbit Ethernet cluster for serial job execution. The management of all partitions is centralized on a single VM. In this VM one SLURM cluster is configured for each of these Linux cluster partitions, and the required SLURM control daemons run concurrently. With the use of a wrapper script called MSLURM, the SLURM administrator can send SLURM commands to any cluster in an easy-to-use and flexible manner, including starting or stopping the complete SLURM subsystem. Although such a setup may not be desirable for large homogeneous supercomputing clusters, on small heterogeneous clusters it has its own advantages. No separate control node is required for each cluster for the slurmctld to run, so the control of small clusters can be grouped on a single control node. This setup also helps to work around the restriction that some parameters cannot be set to different values for different partitions in the same slurm.conf file; such parameters can instead be moved to partition-specific slurm.conf files.
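The real MSLURM wrapper is not reproduced here, but the sketch below illustrates the underlying idea under stated assumptions: each cluster has its own slurm.conf on the management VM, and a small dispatcher points the standard Slurm commands at the chosen controller via the SLURM_CONF environment variable. The configuration directory layout is hypothetical.

```python
"""Hypothetical sketch of the MSLURM idea: one VM hosts one slurmctld per
cluster, each with its own slurm.conf, and a dispatcher selects the right
controller by setting SLURM_CONF before running a standard Slurm command."""
import os
import subprocess
import sys

CONF_DIR = "/etc/slurm-clusters"   # assumed layout: one <cluster>.conf each


def run_on_cluster(cluster, slurm_cmd):
    """Run e.g. ['sinfo'] or ['scontrol', 'show', 'nodes'] on one cluster."""
    env = dict(os.environ,
               SLURM_CONF=os.path.join(CONF_DIR, cluster + ".conf"))
    return subprocess.check_call(slurm_cmd, env=env)


if __name__ == "__main__":
    # Usage: mslurm.py <cluster> <slurm command ...>
    run_on_cluster(sys.argv[1], sys.argv[2:])
```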
Preparing Slurm for use on the Cray XC30
Stephen Trofinoff, Colin McMurtrie (CSCS)
In this paper we describe the technical details associated with the preparation of Slurm for use on an XC30 system installed at the Swiss National Supercomputing Centre (CSCS). The system comprises external login nodes, internal login nodes, and a new ALPS/BASIL version, so a number of technical hurdles had to be overcome in order to have Slurm working as desired on the system. Thanks to the backward compatibility of ALPS/BASIL and the well-written Slurm code, Slurm was able to run with little effort, as it had on previous Cray systems. However, some problems were encountered, and their identification and resolution is described in detail. Moreover, we describe the work involved in enhancing Slurm to utilize the new BASIL protocol. Finally, we provide detail on the work done to improve the Slurm task affinity bindings on a general-purpose Linux cluster so that they match the Cray bindings as closely as possible, thereby providing our users with some degree of consistency in application behavior between these systems.
Refactoring ALPS
Dave Wallace (Cray)
One of the hallmarks of the Cray Linux Environment is the Cray Application Level Placement Scheduler (ALPS). ALPS is a resource placement infrastructure used on all Cray systems. Developed by Cray, ALPS addresses the size, complexity, and unique resource management challenges presented by Cray systems. It works in conjunction with workload management tools such as SLURM to schedule, allocate, and launch applications. ALPS separates policy from placement, so it launches applications but does not conflict with batch system policies. The batch system interacts with ALPS via an XML interface. Over time, the requirement to support more and varied platform and processor capabilities, dynamic resource management and new workload manager features has led Cray to investigate alternatives to provide more flexible methods for supporting expanding workload manager capabilities on Cray systems. This presentation will highlight Cray's plans to expose low level hardware interfaces by refactoring ALPS to allow 'native' workload manager implementations that don't rely on the current ALPS interface mechanism.
CEA Site Report
Francois Diakhate, Francis Belot, Matthieu Hautreux (CEA)
The site report will detail the evolution of Slurm usage at CEA as well as recent developments used on production systems. A modification of the fairshare logic to better handle fair sharing of resources between unbalanced group hierarchies will be detailed.
George Washington University Site Report
Tim Wickberg (George Washington University)
The site report will detail the evaluation of Slurm usage at George Washington University and the new Colonial One system.
Brigham Young University Site Report
Ryan Cox (BYU)
The site report will detail the evaluation of Slurm at Brigham Young University.
D.E. Shaw Research Site Report
Doug Hughes, Chris Harwell, Eric Radman, Goran Pocina, Michael Fenn (D.E. Shaw Research)
DESRES uses SLURM to schedule Anton. Anton is a specialized supercomputer which executes molecular dynamics (MD) simulations of proteins and other biological macromolecules orders of magnitude faster than was previously possible. In this report, we present the current SLURM configuration for scheduling Anton and launching our MD application. We take advantage of the ability to run multiple slurmd programs on a single node and use them as placeholders for the Anton machines. We combine that with a pool of commodity Linux nodes which act as front ends to any of the Anton machines where the application is launched. We run a partition-specific prolog to ensure machine health prior to starting a job and to reset ASICs if necessary. We also periodically run health checks and set nodes to drain or resume via scontrol. Recently we have also used the prolog to set a specific QOS for jobs which run on an early (and slower) version of the ASIC in order to adjust the fair-share UsageFactor.
DESRES also uses SLURM to schedule a cluster of commodity nodes for running regressions, our DESMOND MD program and various other computational chemistry software. The jobs are an interesting mix of those with MPI required and those without, short (minutes) and long (weeks).
DESRES is also investigating using SLURM to schedule a small cluster of 8-GPU nodes for a port of the DESMOND MD program to GPUs. This workload includes full-node 8-GPU jobs and multi-node jobs using all 8 GPUs per node, but also jobs with lower GPU requirements such that multiple jobs can share a single node. We have made use of CPU affinity and binding. GRES was not quite flexible enough, so we ended up taking advantage of the 8-CPU-to-8-GPU correspondence, opting to assign GPUs to specific CPUs.
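The periodic health-check pattern described in the first paragraph of this report can be approximated with plain scontrol calls. The sketch below is not DESRES's actual tooling: the health test is a placeholder and the node names are hypothetical.

```python
"""Illustrative sketch: drain nodes that fail a site-specific health check
and leave healthy ones alone; drained nodes can later be returned to
service with State=RESUME."""
import subprocess


def drain(node, reason):
    subprocess.check_call(["scontrol", "update", "NodeName=" + node,
                           "State=DRAIN", "Reason=" + reason])


def resume(node):
    # Only meaningful for nodes previously set to DRAIN or DOWN.
    subprocess.check_call(["scontrol", "update", "NodeName=" + node,
                           "State=RESUME"])


def healthy(node):
    """Placeholder for a site-specific health check."""
    return True


for node in ("fe01", "fe02"):          # hypothetical front-end node names
    if not healthy(node):
        drain(node, "health_check_failed")
```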
Technische Universität Dresden Site Report
Dr. Ulf Markwardt (Technische Universität Dresden)
This site report will detail the recent introduction of Slurm on a new computer at Technische Universität Dresden.
Depth Oblivious Hierarchical Fairshare Priority Factor
Francois Diakhate, Matthieu Hautreux (CEA)
As High Performance Computing use becomes prevalent in increasingly varied scientific and industrial fields, clusters often need to be shared by a growing number of user communities. One aspect of managing these heterogeneous groups involves being able to schedule their jobs fairly according to their respective machine shares. In this talk we look at how Slurm's hierarchical fairshare algorithms handle this task when user groups form complex hierarchies. We propose an alternative formula to compute job priorities which improves fairness in this situation.
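For context, the fairshare factor documented for the existing multifactor priority plugin is shown below; the depth-oblivious alternative proposed in the talk is not reproduced here.

```latex
% Existing (documented) per-association fairshare factor, for context only:
% S is the association's normalized share, U its effective (decayed) usage.
F_{\mathrm{fairshare}} = 2^{-U/S}
```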
Slurm Roadmap
Morris Jette (SchedMD), Yiannis Georgiou (Bull)
Slurm continues to evolve rapidly, with two major releases per year. This presentation will outline Slurm development plans in the coming years. Particular attention will be given to describing anticipated workload management requirements for Exascale computing. These requirements include not only scalability issues, but a new focus on power management, fault tolerance, topology optimized scheduling, and heterogeneous computing.
Last modified 16 September 2013