Lustre on AWS Cloud Clusters

To set up a Lustre file system quickly, skip to the quick start.

Amazon CLOUD CLUSTERS

When Amazon initially launched their cloud service EC2, it was primarily geared towards running simple web services. The hardware they offered fitted those needs: 1 Gbit/s network bandwidth and paravirtual machines. However, virtual performance is not the same as real performance. Amazon actually run several virtual machines on the same machine, and thus CPU, network, and I/O bandwidth are shared. On average the performance is good, but sometimes two virtual machines contend for the same resource. This is OK for running a web service, but it can be bad for parallel applications that need a lot of communication. Bandwidth of 1 Gbit/s is very slow compared to the kinds of networks used in supercomputing - 10GigE or Infiniband. The variance in performance of EC2 networking is also a major problem for most high performance applications, since they tend to use a lot of collective communications, where each processor needs to communicate with many others at the same time. If one link on one machine is temporarily slow, then this can affect all other machines. The larger the virtual cluster gets, the more this variance will affect the performance of the parallel application.

In 2010, Amazon started offering HPC instances. A cluster of these machines are connected with a high bandwidth network with a speed of 10 Gbit/s. With several of these HPC instances you can build a High Performance Compute (HPC) cluster in the cloud on which you can run parallel applications.

However, when you launch a number of HPC instances on Amazon EC2, you just have a bunch of idle machines. How do you make them useful? One of the things you may need is a shared file system. Amazon offers two storage solutions: S3 and EBS. If you want to run high performance applications, then neither of those will help you out. S3 is aimed at durability and availability, but single file performance is poor. EBS provides reliable, high performance block devices, which can be attached to a machine, but cannot be shared. When you need a shared file system, you'll have to run it yourself on your cluster.

For a Linux environment you can choose from a number of file systems, including

  • NFS, which lets you share one particular directory on the file system of one server
  • HDFS, the Hadoop Distributed File System, a clone of the Google File System
  • Lustre, a parallel distributed file system aimed at high performance applications

There are various documents on the web  on how to set up an NFS or HDFS file system. However, there are currently no descriptions of how to set up a Lustre file system on EC2. Cloudscale's big data analytics cloud service can be used with S3, HDFS, MapR or Lustre as backend file systems. Of these various backends, Lustre offers the highest levels of performance. As a helpful guide for those that want to use Lustre on AWS we have produced this quick how-to guide.

Lustre

Lustre is a parallel and distributed file system. This file system is hosted by several servers: one management server (MGS), one metadata server (MDS), and several object storage servers (OSS). Together they serve a single file system, which can be mounted on thousands of Linux clients. It achieves very high I/O throughput by distributing and striping files over multiple disks - Object Storage Targets (OST) in Lustre jargon. It is the ideal network file system for high performance computer clusters, and is used in most of the high end supercomputing systems around the world. Now Cloudscale is leading the charge to bring the power and performance of Lustre to cloud computing.

If you've used file systems like HDFS or MapR, you will be wondering how different Lustre is to set-up. The short answer is -  quite a bit. First of all, you'll need different hardware:

Lustre doesn't handle reliability within the software, but instead relies on specific hardware. It assumes that your storage devices -- Management Target (MGT), Metadata Target (MDT), and Object Storage Targets (OSTs) - are all built on reliable hardware, like RAID 6 arrays. Furthermore, if you want your file system to be high available, then the storage devices should be reachable from multiple servers, so that a server can take over the task of another when that fails. (failover).

Lustre servers are intended to have only one role: serving the file system. If you do use a server also as a client, then you may encounter some memory performance issues.

Lustre needs a fast network between the nodes, as storage and computation are not combined on the same machine.

Beyond that, it basically comes down to replacing the standard Linux kernel on your servers with a patched one, attaching and formatting your MGT, MDT, and OSTs, and mounting them. The machines that will function as clients don't need their kernels patched, but do need a package of kernel modules and some Lustre tools.

Summarizing, you will need:

  • Reliable and fast storage devices to serve as MGT, MDT, and OSTs,
  • Linux machines with lots of RAM to serve as MDS, MGS, and OSSs. It should be possible to replace their kernel with a custom version.
  • Fast network.
  • If you need High-Availability then storage devices should be automatically re-attachable to other servers, and separate failover software should be able to switch off servers.

It might seem daunting at first, but Amazon Web Services actually provide all hardware necessary to set-up a Lustre file system. And with this Cloudscale how-to, you can do it too.

Starting with storage, EBS satisfies all the requirements for MGT, MDT, and OSTs since (a) to the OS they behave like a normal hard-disk, (b) they are durable, and (c) they can be connected to any server in your cluster.

For servers, the AWS Cluster Compute instance types c1.4xlarge and cc2.8xlarge are suitable since they have a 10 Gbit/s interconnect, and they can switch off other machines via the Amazon EC2 API.

Quick Start

Step 1: Prepare an AMI

  1. Create a security group: Allow all network traffic within the security group, open up only the SSH port (22) to the outside world. Call it 'lustre'
  2. Start an Amazon CentOS 5.4 AMI: ami-7ea24a17 in the security group 'lustre'.
  3. Login as root with your 'lustre' key.
  4. Execute: yum -y upgrade
  5. Execute: yum -y install sysstat
  6. Download Lustre packages from Whamcloud: http://downloads.whamcloud.com/public/
  7. Install the packages
    yum install --nogpgcheck e2fsprogs-1.41.90.wc3-0redhat.x86_64.rpm \
       uuidd-1.41.90.wc3-0redhat.x86_64.rpm \
       kernel-2.6.18-238.19.1.el5_lustre.g65156ed.x86_64.rpm \
       lustre-2.1.0-2.6.18_238.19.1.el5_lustre.g65156ed_g9d71fe8.x86_64.rpm \
       lustre-ldiskfs-3.3.0-2.6.18_238.19.1.el5_lustre.g65156ed_g9d71fe8.x86_64.rpm \
       lustre-modules-2.1.0-2.6.18_238.19.1.el5_lustre.g65156ed_g9d71fe8.x86_64.rpm
  8. Disable SELinux: Edit /boot/grub/grub.conf, and add selinux=0 to boot parameters of Lustre kernel. For example the entry:

    title CentOS (2.6.18-238.19.1.el5_lustre.g65156ed)
      root (hd0,0)
      kernel /vmlinuz-2.6.18-238.19.1.el5_lustre.g65156ed ro root=/dev/VolGroup00/LogVol00 rhgb quiet console=tty0 console=ttyS0,115200n8
      initrd /initrd-2.6.18-238.19.1.el5_lustre.g65156ed.img

    becomes

    title CentOS (2.6.18-238.19.1.el5_lustre.g65156ed)
      root (hd0,0)
      kernel /vmlinuz-2.6.18-238.19.1.el5_lustre.g65156ed ro root=/dev/VolGroup00/LogVol00 rhgb quiet console=tty0 console=ttyS0,115200n8 selinux=0
      initrd /initrd-2.6.18-238.19.1.el5_lustre.g65156ed.img 
  9. Download the EC2 API tools, unzip them, and copy your AWS keys to the instance. For example, you unzipped the ec2 api tools in the directory /root/ec2-api-tools and you copied your keys pk.pem and cert.pem to /root/keys, then add the following to your ~/.bashrc
    export EC2_HOME=$HOME/ec2-api-tools
    export EC2_PRIVATE_KEY=$HOME/keys/pk.pem
    export EC2_CERT=$HOME/keys/cert.pem
    export PATH=$EC2_HOME/bin:$PATH
    export JAVA_HOME=/usr
  10. Make a directory lustre in your /root directory, and save the following scripts:
    /root/lustre/make-mds

    #!/bin/bash

    # Default settings. Change these to your needs.
    FSNAME=lustre # The name by which your file system will be exported.
    MDT_SIZE=80 # (in GB) The size of the MDT & MGS. Should be around 1-2% of total storage.


    # the script

    if [ $(whoami) != root ]; then
       echo You need to be root to run this script
       exit 1
    fi
    if [ x$EC2_HOME = x ]; then
      echo You should export your environment by using the -E option with sudo
      echo or you should set EC2_HOME, PATH, etc...
      exit 1
    fi

    instance=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
    zone=$(curl -s http://169.254.169.254/latest/meta-data/placement/availability-zone)
    size=1024
    device=/dev/sdf

    if [ -e $device ]; then
      echo Device $device already exists, so reusing that.
    else
      echo Creating ${size}GB volume in zone $zone
      volume=$(ec2-create-volume -s $size -z $zone | awk 'BEGIN { FS="\t" }; NR==1 { print $2 }' )

      echo Attaching volume $volume to device $device
      ec2-attach-volume $volume -i $instance -d $device
      sleep 10
    fi

    echo Making a partition of $MDT_SIZE GB.
    cylinders=$((133674 * $MDT_SIZE / 1000 ))
    sfdisk /dev/sdf <<EOF
    ,$cylinders,L
    EOF

    sleep 2

    echo Formatting volume as MDT and MGS
    mkfs.lustre --fsname=$FSNAME --mgs --mdt --verbose --index 0 ${device}1

    sleep 2

    echo Mount it
    mkdir -p /mnt/mdt0
    mount -t lustre ${device}1 /mnt/mdt0


    /root/lustre/make-oss

    #!/bin/bash

    # parameters
    MDS=$1
    INDEX_OFFSET=$2
    FSNAME=lustre

    if [ x$MDS = x ]; then
      echo You must supply the IP address of the MDS as the first parameter
      exit 1
    fi

    if [ x$INDEX_OFFSET = x ]; then
      echo You must supply the OST index offset as the second parameter
      exit 1
    fi

    # the script

    if [ $(whoami) != root ]; then
      echo You need to be root to run this script
      exit 1
    fi

    if [ x$EC2_HOME = x ]; then
      echo You should export your environment by using the -E option with sudo
      echo or you should set EC2_HOME, PATH, etc...
      exit 1
    fi

    instance=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
    zone=$(curl -s http://169.254.169.254/latest/meta-data/placement/availability-zone)
    size=1024
    devices="/dev/sdg /dev/sdh /dev/sdi /dev/sdj /dev/sdk /dev/sdl /dev/sdm /dev/sdn"

    echo Creating 8 volumes of ${size}GB in zone $zone
    index=$INDEX_OFFSET
    for device in $devices
    do
      volume=$(ec2-create-volume -s $size -z $zone | awk 'BEGIN { FS="\t" }; NR==1 { print $2 }' )

      echo Attaching volume $volume to device $device
      ec2-attach-volume $volume -i $instance -d $device

      sleep 10

      echo Formatting volume as OST
      mkfs.lustre --fsname=$FSNAME --ost --mgsnode=${MDS}@tcp --verbose --index $index ${device}

      echo Mount it
      mkdir -p /mnt/ost${index}
      mount -t lustre ${device} /mnt/ost${index}

      index=$(( $index + 1 ))
    done

  11. You also might want to add some users. For lustre it is important that all UIDs and GIDs are the same on all machines. You can do that by setting up NIS/YP, but that's beyond the scope of this text, or by adding some users now with the adduser command.
  12. If you don't like the fact that 'root' can login, you can edit the /usr/local/sbin/get-credentials.sh script so that it appends the key to the ~/.ssh/authorized_keys file of an other user. Don't forget that the owner and permissions of the ~/.ssh/authorized_keys file matter to SSH a lot. Also you might want to add this user to sudoers via visudo.
  13. Remove the key with which you're logged in now from /root/.ssh/authorized_keys so that you can choose a different key the next time you start the resulting AMI.
  14. Stop the instance.
  15. Create an image of the AMI. Let's call it 'centos-5-lustre-2.1'.
  16. Terminate the instance.

Step 2: Start your Lustre file system

  1. Determine the size and throughput requirements of your filesystem, so that you can determine how many Lustre servers you need. To give you a ball-park figure: One cc1.4xlarge OSS with 8 EBS volumes is able to read with 200 MB/s, write with 100 MB/s, and has a storage capacity of 8 TB. The size of the MDT should be around 80-160GB per such an OSS, see also the lustre manual. For the sake of this walk-through, assume that you start 1 MDS, 2 OSSs, and 1 client.
  2. Start 4 Cluster Compute instances with the freshly build 'centos-5-lustre-2.1' AMI within the same placement group. Use the security group 'lustre'.
  3. Login to the first. Change the MDT_SIZE parameter in the /root/lustre/make-mds, and run it. Wait for it to finish
  4. Login to second and third, and run on both the /root/lustre/make-oss script. It takes two command line parameters: the hostname of the MDS and OST index offset. The first parameter should be obvious, but the second probably needs a little explaining. In Lustre each OST gets its own index number. Because the script runs local on one OSS, it doesn't know what other OSSs are running, so this index offset number is the number it should add to each OST index. In the current example you should run
    bash /root/lustre/make-oss MDS 0
    on the first OSS and
    bash /root/lustre/make-oss MDS 8
    on the second, where you replace MDS with the internal hostname of the MDS.
  5. Before the scripts on the OSSs have finished you can already logon on the fourth instance and mount the lustre file system. So, make a mount point, e.g. /lustre, and mount it with
    mount -t lustre -o flock MDS@tcp:/lustre /lustre.
    Again, replace MDS with the internal hostname of the MDS. Instead of the parameter -o flock you can also use -o localflock, or omit it altogether. For a discussion about this parameter, see the lustre manual.
  6. With lfs df -h you can check on the client which OSTs are already available.
  7. When all OSTs have been attached, it is time to see how fast your file system really is. Bonnie++ is an excellent benchmarking tool, which you can download here. To install and run bonnie++ do the following from the command line:
    yum -y install gcc-c++ cd $HOME
    wget http://www.coker.com.au/bonnie++/bonnie++-1.03e.tgz
    tar xzvf bonnie++-1.03e.tgz
    cd bonnie++-1.03
    ./configure && make
    lfs setstripe -c -1 /lustre
    cd /lustre
    mkdir bonnie chown nobody:nobody bonnie
    $HOME/bonnie++-1.03e/bonnie++ -u nobody:nobody
  8. If the lustre file system is very slow, it could be that a single degraded EBS volume is the cause. In that case use iostat on an OSS or the monitoring tool in the AWS Console to find out which EBS volume is responsible, and then follow in the instructions in the lustre manual to remove the bad OST, after which the EBS volume can be unmounted, detached, and deleted.

More Information

More information about Lustre can be found on the following websites: Whamcloud , Oracle: www.lustre.org , OpenSFS , Wikipedia