Linux.SYS-CON.com Cover Story: Rapid Cluster Deployment

From delivery to production in hours

At this point, we can run the following on the front-end:

# config_panfs -r [ip_address_of_shelf]

Restart the PanFS service on all compute nodes:

# cluster-fork 'service panfs restart'

(We included an IP in the extend-compute.xml for the shelf in advance.)

The cluster is ready to run jobs and you can read/write data to the shelf.

End of the basic installation for the cluster ***time 7:35:00***

So, you've seen the basics of setting up a cluster. The thing to remember is that Rocks is the middleware component that gives you the tools to do anything you want with the system, within reason. We make many customizations to the cluster to support a variety of user requests. One example is enabling MPI-based applications to use Panasas-specific parallel I/O extensions.

Panasas offers an SDK that includes modifications for ROMIO, the MPI component that implements parallel I/O. ROMIO provides the MPI-IO layer in MPICH, one of the more popular MPI implementations. The Panasas SDK contains a patch for MPICH (we run MVAPICH from OSU, which is based on MPICH and MVICH) that enables the Panasas-specific features.

Required items:
- Panasas DirectFLOW client software (installed during the initial cluster configuration)
- Panasas SDK
- Source code for a ROMIO-based MPI implementation

Unpack the source and apply the patch:

# tar zxvf mvapich.tgz
# cd mpich
# patch -Np1 < ~/romio.patch

Configure, make, and install MVAPICH with the following options (only the PanFS flags are shown):

# export CFLAGS="-D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE \
    -I/opt/panfs/include/ -D__linux__=1"
# ./configure --with-romio --file-system=ufs+nfs+panfs
# make
# make install

The PanFS patch implements several MPI hints to specify data storage layout and/or concurrent write access when opening a file on a Panasas Storage Cluster.

Here's a sample PanFS-specific hint, panfs_concurrent_write: if the value of this hint is "1", the file is opened in concurrent write mode. If the value is "0" or the hint is missing, the file is opened in standard (non-concurrent write) mode.
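Hints such as this one are normally passed in application code through an MPI_Info object when the file is opened. As a rough sketch of an alternative that needs no source changes, and assuming the ROMIO build in use is recent enough to honor the ROMIO_HINTS environment variable (older MPICH-based releases may not be), a hints file could be prepared as follows. The HINTS_DIR variable and the file name are hypothetical; the file should live somewhere every compute node can read.

```shell
#!/bin/bash
# Sketch: supply the panfs_concurrent_write hint through a ROMIO hints file.
# Assumption: the ROMIO build honors the ROMIO_HINTS environment variable.
# HINTS_DIR and the file name are hypothetical; use a directory shared by all nodes.
hints_file="${HINTS_DIR:-${HOME:-/tmp}}/romio-hints"

# One hint per line: the hint name, whitespace, then its value.
cat > "$hints_file" <<'EOF'
panfs_concurrent_write 1
EOF

# Export before calling mpirun so the MPI-IO layer can locate the file.
export ROMIO_HINTS="$hints_file"
echo "ROMIO_HINTS=$ROMIO_HINTS"
```

Whether the variable reaches ranks on remote nodes depends on how the launcher propagates the environment, so treat this as a starting point to verify on your own build.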

So, now you have a much better idea of what's possible with the configuration of the system. Let's try a simple compile and run.

We use a simple program named "bounce" as the first benchmark for every system we build. This program times blocking sends and receives and reports the latency and bandwidth of the communication system. It's a great tool for us because it's small and portable, and it tests our MPI F90 capability, which is important to us. Here's how simple it is to use.

Compile:

$ mpif90 -o bounce bounce.f

Run:

$ qsub -I -lnodes=4:ppn=2
waiting for job.x to start
$ mpirun -ssh -np 8 -hostfile $PBS_NODEFILE ./bounce >& \
    $PBS_O_WORKDIR/log.bounce
$ tail $PBS_O_WORKDIR/log.bounce
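
To put the numbers in the log in context: a send/receive timing benchmark of this kind typically measures the round trip of a fixed-size message, then reports one-way latency as half the round-trip time and bandwidth as bytes moved per unit of time. The sketch below works one such calculation. It assumes a ping-pong pattern, and the function name and sample figures are made up for illustration; they are not output from bounce.

```shell
#!/bin/bash
# Hypothetical helper: derive one-way latency and bandwidth from a ping-pong
# timing. Arguments: message size in bytes, round-trip time in microseconds.
pingpong_stats() {
    awk -v bytes="$1" -v rtt_us="$2" 'BEGIN {
        one_way_us = rtt_us / 2            # half the round trip
        mb_per_s   = bytes / one_way_us    # bytes per microsecond equals 10^6 bytes/s
        printf "latency_us=%.2f bandwidth_MBps=%.2f\n", one_way_us, mb_per_s
    }'
}

# Illustrative input only: a 1,048,576-byte message with a 2,000 us round trip.
pingpong_stats 1048576 2000
```

With those sample inputs the helper prints latency_us=1000.00 bandwidth_MBps=1048.58.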

Start adding users:

# useradd alice
# passwd alice

What I typically do at this point is create an e-mail distribution list for the cluster to announce additions, changes, and maintenance windows. I also send simple instructions covering how to compile, the location and naming conventions of the compilers, and a sample PBS script. This usually answers most of the questions users have when they first log on to the cluster.

Here is a sample PBS script:

#!/bin/bash
#PBS -N BOUNCE
#PBS -e BOUNCE.err
#PBS -o BOUNCE.out
#PBS -m aeb
#PBS -M alice@cluster.com
#PBS -l nodes=8:ppn=1
#PBS -l walltime=30:00:00

PBS_O_WORKDIR='/home/alice/bounce'
export PBS_O_WORKDIR

### ---------------------------------------
### BEGINNING OF EXECUTION
### ---------------------------------------

echo The master node of this job is `hostname`
echo The working directory is `echo $PBS_O_WORKDIR`
echo This job runs on the following nodes:
echo `cat $PBS_NODEFILE`

### End of information preamble
cd $PBS_O_WORKDIR
cmd="/apps/bin/mpirun -ssh -np 8 -hostfile $PBS_NODEFILE \
$PBS_O_WORKDIR/bounce"
echo "running mpirun with: $cmd in directory "`pwd`
$cmd >& $PBS_O_WORKDIR/log.bounce
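
The sample hard-codes -np 8 to match nodes=8:ppn=1, so the two must be edited together. A common alternative, sketched here as an assumption rather than as part of the script above, is to count the entries in $PBS_NODEFILE, which PBS fills with one line per allocated processor slot. The function name count_slots and the demo host names are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: count processor slots listed in a PBS node file.
# PBS writes one hostname per allocated slot, so nodes=4:ppn=2 yields 8 lines.
count_slots() {
    # grep -c . counts non-empty lines and tolerates a missing trailing newline.
    grep -c . "$1"
}

# Inside a job, PBS sets PBS_NODEFILE. Outside one, fall back to a demo file
# so the sketch can be tried on any machine.
if [ -z "${PBS_NODEFILE:-}" ]; then
    PBS_NODEFILE=$(mktemp)
    printf '%s\n' compute-0-0 compute-0-0 compute-0-1 compute-0-1 > "$PBS_NODEFILE"
fi

np=$(count_slots "$PBS_NODEFILE")
echo "would run: mpirun -ssh -np $np -hostfile $PBS_NODEFILE ./bounce"
```

With this in place, changing the nodes or ppn request in the #PBS header is enough; the process count follows automatically.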

Why did I select the solutions above? The Rocks Cluster Distribution is widely adopted, with over 500 locations running it. Many companies offer fee-based support for the package as part of their product line. Dell even ships clusters with Rocks pre-installed.

Panasas was a blessing. Our researchers were finding it difficult to work on one system with 164 processors backed by a Linux NFS server on SATA RAID storage. I followed all the recommended steps for optimizing NFS on Linux servers, but the server was still easy to saturate. The first thing I tried was splitting users across multiple NFS servers, but this quickly became an issue because users simply scaled up their I/O traffic to run faster, which eventually recreated the same bottleneck: the NFS server.

I found Panasas through a contact at the AMD Developer Center who manages their clusters, and decided to give it a try. I was impressed by the ease of installation on our cluster and by the results I was able to achieve. To compare the Panasas system against our NFS servers, I ran a benchmark using "bonnie++", placing the executable in the shared location and running the tests through the queue.

More Stories By Steve Jones

Steve Jones is currently the technology operations manager at the Institute for Computational and Mathematical Engineering at Stanford University. Steve designed and administered a Top 500 Supercomputer and speaks regularly about the design and management of High Performance Computing Clusters, most recently as a keynote speaker at the annual Rocks-a-Palooza conference at the San Diego Supercomputer Center. His free time is spent with his significant other, Leilani, far away from a keyboard. More information about Steve can be found at http://www.hpcclusters.org.

Most Recent Comments
clusteradmin.net 02/18/08 06:17:49 PM EST

For those who came here searching for cluster resources you may consider visiting my blog (http://clusteradmin.net) about cluster administration. Some introductory stuff, load-balancing guide, monitoring and other articles.

Thanks,

-marek

Grid 04/01/06 10:38:44 AM EST

Seems like SGE was not mentioned:
http://gridengine.sunsource.net

SYS-CON Belgium News Desk 03/17/06 09:36:01 AM EST

After building a number of clusters from the ground up - including one that made it to the Top500 Supercomputer list - I decided to try a service that many vendors now offer - having a system racked and stacked at the factory then shipped to us. Such a service saves a huge amount of time, not to mention my back, not having to build the cluster and cable all the equipment together. I've been a fan of well-cabled systems and have found the quality control to be acceptable. The key component is the pre-build requirements and verification before the system is built. This will ensure the system shipped is what is expected when it arrives at your front door. There can still be a fair amount of cabling that has to be done once it arrives, if you have a multi-rack configuration, but it's usually limited to plugging in the system's power and public network.