
Data SuperCell Deployment Guide

This document describes the various machines in a Data Supercell deployment and their functions.

Table of Contents

Deployment Infrastructure
Metadata FS Snapshots and Backups
System Activity Logging
Reporting
Monitoring
Maintenance
1. Deployment Infrastructure

1.1. SLASH2 Metadata Server(s)

One or more SLASH2 MDS servers (machines running slashd) are needed. These machines need storage for the SLASH2 file system metadata. This typically means low-latency storage such as SSD, and it should be redundant (mirrored). This file system is actually handled by slashd via zfs-fuse internally, so the steps to create the zpool for the metadata look like normal zpool creation. Refer to slashd(8) for more details. See Metadata Backups and Snapshots for discussion on how to protect this critical metadata against loss.
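For illustration, a mirrored metadata pool might be created along these lines; this is a sketch, where the device paths are placeholders and the pool name simply matches the s2_mdfs examples used later in this guide:

    # Illustrative sketch: a mirrored SSD zpool for the metadata file system.
    # Device paths are placeholders; substitute your own.
    zfs-fuse                          # start zfs-fuse if it is not already running
    zpool create s2_mdfs mirror \
        /dev/disk/by-id/ata-SSD_A /dev/disk/by-id/ata-SSD_B
    zpool status s2_mdfs              # the mirror should show as ONLINE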

1.2. SLASH2 I/O servers

One or more SLASH2 I/O servers are necessary which actually store the data in the SLASH2 deployment.  Typical installations use modern Linux with ZFSOnLinux that present POSIX file systems upon which the lightweight SLASH2 I/O service sliod runs.
It is often desirable to split large amounts of storage across multiple zpools.  In this scenario, multiple instances of sliod will run on the same system.  Under these configurations, an IP address is necessary for each sliod instance.
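As a sketch of the multi-pool case, two backend pools might be created as follows; pool names, mountpoints, and device paths are placeholders, and how each sliod instance is bound to its own IP address is left to the sliod documentation and your deployment configuration:

    # Illustrative sketch (ZFSOnLinux): two backend pools on one I/O server,
    # each intended to be served by its own sliod instance.
    zpool create -m /s2io_pool0 s2io_pool0 raidz2 \
        /dev/disk/by-id/ata-DISK0 /dev/disk/by-id/ata-DISK1 \
        /dev/disk/by-id/ata-DISK2 /dev/disk/by-id/ata-DISK3
    zpool create -m /s2io_pool1 s2io_pool1 raidz2 \
        /dev/disk/by-id/ata-DISK4 /dev/disk/by-id/ata-DISK5 \
        /dev/disk/by-id/ata-DISK6 /dev/disk/by-id/ata-DISK7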

1.3. Clients

These machines run the SLASH2 client daemon software mount_slash. The only requirement is modern FUSE support (Linux, IllumOS, BSD, Mac OS X), so clients may run on a variety of machines:

Dedicated front-end machines, which have exclusive access to the SuperCell for users.
Compute resources, so jobs can access data from the SuperCell directly.
Administrative nodes, for any administration that needs to be performed without interfering with user workloads.
Test machines, to roll out configuration/system changes and test workloads in a nondisruptive manner to the rest of the deployment.

1.4. Auxiliary/Management

syslog servers, for aggregating all SuperCell activity information to a central location for analysis, reporting, etc.
database servers, for tracking historical activity.
SEC servers, for performing actions in response to events (data replication, etc.)
Nagios servers, for monitoring health of various SuperCell machines.
MDFS mirroring nodes, for scans, dumps, reports, etc.
2. Metadata FS Snapshots and Backups

This section outlines very basic instructions on getting familiar with ZFS snapshotting capabilities. This is important for a SuperCell deployment since the metadata file system (MDFS) is actually a zpool, so backing up, restoring, etc. are all necessary components of properly administering it.

While slashd is running, zfs and zpool commands will work the same way as one would expect under any other ZFS installation.

2.1. Terminology

The notion of a snapshot in ZFS is a near-instantaneous copy-on-write clone of the entire zpool at a given point in time. Each snapshot has a name and can be exported into a raw stream that can be used later for recovery purposes. A snapshot stream is the raw file system data that has been exported via the zfs send command.

There are two types of snapshot streams: full and incremental (or “partial”). A full snapshot stream contains the entire contents of the zpool at a certain point in time, serialized into a single file, which may be compressed to save space if desired. (SLASH2 metadata is highly compressible, so compression is advised, ideally with a parallel compression tool such as pbzip2 or pxz.) An incremental snapshot stream is made relative to an existing snapshot, similar to a logical diff of file system changes.

2.2. Snapshotting a ZFS file system

A snapshot can be made of the entire MDFS reflecting the current moment in time:

DSTAMP=$(date +%Y-%m-%d_%H:%M:%S)

zfs snapshot s2_mdfs@$DSTAMP

Again, snapshots in ZFS apply to the entire file system; i.e. they cannot be made on a single subdirectory tree or individual file. The state for a snapshot is held internally inside the zpool. The zfs send and zfs recv commands can be used to externalize or re-internalize a snapshot (i.e. import and fully recover state in the live file system). More details on how snapshots work are available in a variety of ZFS documentation/guides.

2.3. Listing ZFS snapshots

All snapshots held in a zpool can be listed:

zfs list -t snapshot

NAME              USED  AVAIL  REFER  MOUNTPOINT
s2_mdfs@$DSTAMP      0      -  64.5K  -

2.4. Exporting a ZFS snapshot to a stream file

To externalize a snapshot into a compressed stream, an action necessary for proper backup maintenance, run:

zfs send s2_mdfs@$DSTAMP | nice bzip2 -7 - > \
    /local/snapshots/full/s2_mdfs@$DSTAMP.snap.bz2

ls -lFah /local/snapshots/full/s2_mdfs@$DSTAMP.snap.bz2

-rw-r--r-- 1 root wheel 80K Jun 8 19:42 /local/snapshots/full/s2_mdfs@$DSTAMP.snap.bz2

As the metadata constitutes all data except for actual user data, the footprint is several orders of magnitude less than the SuperCell data volume. Still, the space can grow, so it is advisable to compress the streams. Snappy, bzip2, xz, etc. and the parallel versions of these tools are great things to have in the toolchain.

It is the advice of the authors to create a periodic (cron) process to automatically create these streams and back them up on other hosts for file system analysis and disaster recovery.
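A minimal sketch of such a cron-driven script follows; it is not part of the SLASH2 distribution, and the paths, pool name, and backup host are assumptions:

    #!/bin/sh
    # Illustrative full-backup sketch: snapshot the MDFS, serialize the
    # snapshot with bzip2, and copy the stream to a backup host.
    set -e
    DSTAMP=$(date +%Y-%m-%d_%H:%M:%S)
    snapdir=/local/snapshots
    zfs snapshot s2_mdfs@$DSTAMP
    zfs send s2_mdfs@$DSTAMP | nice bzip2 -7 - > $snapdir/full/s2_mdfs@$DSTAMP.snap.bz2
    scp $snapdir/full/s2_mdfs@$DSTAMP.snap.bz2 backuphost:/backups/mdfs/

Such a script could then be scheduled weekly or daily from cron on the MDS; the installed path of the script is up to the administrator.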

2.5. Incremental snapshot stream for MDFS mirroring nodes

When synchronizing clones of ZFS file systems, instead of serializing full snapshots, the ZFS incremental snapshot feature can be leveraged, reducing export, transfer, and import time and resources. An incremental snapshot stream is made relative to another snapshot. This stream can then be copied to other machines and imported to bring the clone up to date:

PREV_SNAP=$(zfs list -t snapshot | grep s2_mdfs | tail -1 | awk '{print $1}')

CURRENT=$(date +%Y-%m-%d_%H:%M)

snapfn=$snapdir/partial/s2_mdfs@$CURRENT.snap.bz2

zfs send -i $PREV_SNAP s2_mdfs@$CURRENT | nice bzip2 -7 - > $snapfn

scp $snapfn destination:$snapfn

2.6. Verifying a snapshot stream

The zstreamdump(8) command may be used to verify that a ZFS snapshot stream was properly generated and can later be used to rebuild a file system from backup (in ZFS terminology: “received”).

bzcat s2_mdfs@$DSTAMP.snap.bz2 | zstreamdump

2.7. Applying snapshots on an MDS clone system

Copy the full snapshot stream (/local/snapshots/full/s2_mdfs@$DSTAMP.snap.bz2) and the next incremental stream (/local/snapshots/partial/s2_mdfs@$CURRENT.snap.bz2) from the MDS to the clone. Install the slash2 source, then run make and make install in the zfs-fuse directory. Then run these commands to apply the full snapshot, followed by the incremental:

zfs-fuse

zpool create s2_clone /dev/disk/by-id/device_name

zfs set compression=lz4 s2_clone

zfs set atime=off s2_clone

zfs set readonly=on s2_clone

bzcat s2_mdfs@$DSTAMP.snap.bz2 | zfs recv -F s2_clone

bzcat s2_mdfs@$CURRENT.snap.bz2 | zfs recv s2_clone

Continue generating partial snapshots on the MDS 3-4 times a day, copy each stream file to the clone system, and “zfs recv” the incremental streams there, preferably from a cron job; a minimal sketch follows.
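This sketch shows the clone-side receive step, assuming the stream file has already been copied over as in section 2.5; the paths and pool name are placeholders:

    #!/bin/sh
    # Illustrative clone-side sketch: apply one incremental stream copied
    # over from the MDS.
    set -e
    snapfn=$1        # e.g. /local/snapshots/partial/s2_mdfs@2017-05-15_04:00.snap.bz2
    bzcat "$snapfn" | zfs recv s2_clone
    zfs list -t snapshot | tail -3    # the newly received snapshot should appear here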

2.8. Recovering a zpool from a snapshot stream

For disaster recovery, the MDFS zpool needs to be rebuilt and the data imported back in from a recent full snapshot stream:

slmctl stop # kill SLASH2 slashd

zfs-fuse

zpool create s2_mdfs raidz3 /dev/disk/by-id/ata-...

zfs set compression=… s2_mdfs # any FS options

decompress /local/snapshots/full/s2_mdfs@snaptest0.zsnap | \

zfs recv -F s2_mdfs

Once this is complete, unmount /s2_mdfs, kill zfs-fuse, and relaunch slashd.
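For example, one possible sequence; the mount point and the way slashd is launched are deployment-specific assumptions:

    umount /s2_mdfs            # or: fusermount -u /s2_mdfs
    pkill zfs-fuse
    # ...then relaunch slashd as your deployment normally starts it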

3. System Activity Logging

3.1. Log sources

All machines participating in the SuperCell deployment (MDS, IOS, and clients) should send all of their syslog messages to “syslog servers” to make aggregate log processing easier.

/etc/rsyslog.conf (or /etc/syslogd.conf, etc.) example:

# Send a copy of everything to the log servers
*.* @@syslog0.yourdomain.com
*.* @@syslog1.yourdomain.com

Non-dedicated client machines that also mount the SuperCell should send at least the daemon syslog messages, which include all activity logged by the SLASH2 mount_slash client software.

/etc/syslog-ng.d/syslog-ng.conf example:

# Send a copy of daemon messages to the log servers
filter f_mountslash { program("mount_slash-%"); };
destination arclog0 { tcp("arclog0.psc.edu" port(514)); };
destination arclog1 { tcp("arclog1.psc.edu" port(514)); };
log { source(src); filter(f_mountslash); destination(arclog0); };
log { source(src); filter(f_mountslash); destination(arclog1); };

Note: in rsyslog, @ denotes UDP for the transport whereas @@ denotes TCP. TCP is recommended but may not be supported by all systems.

3.2. Log activity

A periodic test to ensure healthy service, e.g. a daily file access (read/write) cron job, is helpful: each component registers baseline activity that shows up in the log processing system.
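A minimal sketch of such a check, assuming the SuperCell is mounted at /supercell on the node running the cron job (the test directory is also an assumption):

    #!/bin/sh
    # Illustrative daily health check: write and read a small file through
    # the SLASH2 mount so baseline activity appears in the logs.
    set -e
    MNT=/supercell
    STAMP=$(date +%Y%m%d%H%M%S)
    TESTFILE=$MNT/.healthcheck/$(hostname).$STAMP
    mkdir -p $MNT/.healthcheck
    echo "healthcheck $STAMP" > $TESTFILE
    cat $TESTFILE > /dev/null
    rm -f $TESTFILE
    logger -t s2_healthcheck "read/write check passed at $STAMP"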

3.3. Log servers

We have two log servers: arclog0.psc.edu and arclog1.psc.edu
Log messages should be sent to BOTH servers so that we don't lose them (e.g. in case one server is down)
Logs for the archiver are kept in /var/log/arc on the log servers

messages from 'mount_slash' client daemons are sorted into /var/log/arc/mount_slash
logs from the metadata server(s) are stored into /var/log/arc/mdserver
logs from the I/O servers are stored into /var/log/arc/ioserver
logs from all firehoses are stored in /var/log/arc/client

Log Rotation

ioserver and mount_slash logs are rotated daily (since weekly logs get too large)
other logs are rotated weekly

To rotate any logs daily, add the log file name to /etc/logrotate.d/arclog, as in the sketch below.
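An illustrative logrotate stanza; the glob patterns here are examples based on the directories above:

    /var/log/arc/ioserver/*.log /var/log/arc/mount_slash/*.log {
        daily
        rotate 30
        compress
        missingok
        notifempty
    }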

Logs copied to /arc/logs/server/arclog(0,1)

Logs are copied to this archiver directory
Changes can be made in /SYSTEM.CONF to set the copies to DAILY from WEEKLY

4. Reporting

4.1. File System Usage Reporting

You may find that you want to track and report various aspects of your file system including the current state of the system as well as the daily change (via syslog).

4.2. Current state reporting

The dumpfid tool included in the SLASH2 distribution traverses the file system and reports the attributes of each file encountered. It can return the uid, gid, file size, timestamp and permissions of each file in the file system.

As an example, one user directory is scanned with three worker threads to exploit parallelism provided by the underlying storage hardware (triple mirrored SSDs):

dumpfid -t 3 -O /local/reports/${user}_output_files.%n -RF \

 '%g %u %s %Tm<%Y-%m-%dT%H:%M:%S> %m %f'$'\n' /s2_mdfs_mirror/$user

4.3. Log Reporting

You can use syslog data with the Simple Event Correlator (SEC) to track what the file system is doing in real time: reads, writes, deletes, and so on.

Sample Reports (see Appendix A for a sample PSC report):

Top 10 readers and writers by day
All Hosts, Data/File Count Read and Written
Total Daily Growth
Current Size/Space Available
MDS Disk Capacity
Compression Ratio
Current state of file system: usage by group, user, file count, GB used
5. Monitoring

5.1. Nagios

Nagios is an extensible monitoring framework. A Nagios deployment consists of two components:

A Nagios server: a collector of service reports and service health monitor

Configuration of each node and service is required
The default Web-based monitoring interface may be used, but many alternative packages are available, such as Thruk, which provide flexibility in what is deemed important to monitor.

Nagios clients: each service reports details on its health to the server

Each component in a SLASH2 deployment may use the slash2_check.py script included in the SLASH2 source distribution.  The script is intended to run periodically from a cron job, such as every 5 minutes, and requires only Python and the send_nsca Nagios client utility.
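For example, a crontab entry along these lines could drive the check; the installed path of slash2_check.py is an assumption for this sketch:

    # run the SLASH2 Nagios check every 5 minutes
    */5 * * * *  /usr/local/slash2/nagios/slash2_check.py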

5.2. Other Frameworks

SEC – Simple Event Correlator for reacting to log events - http://simple-evcorr.sourceforge.net/SEC-tutorial/article.html
Ganglia is a fairly feature-complete third-party monitoring package that is easy to install and can provide a number of useful performance observability stats: http://ganglia.sourceforge.net/
6. System Maintenance

6.1. IOS scrubbing

On the constituent I/O servers, periodic scrubbing of the zpools is recommended. Scrubbing in ZFS is a process where the file system itself scans and reads each piece of data stored for integrity. During this process, any consistency errors discovered are fixed.

The command zpool scrub $poolname initiates this process.
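Scrubs can be scheduled periodically from cron, for example (the pool name here is a placeholder):

    # scrub the backend pool at 02:00 on the first day of each month
    0 2 1 * *  /sbin/zpool scrub s2io_pool0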

6.2. IOS/MDFS backend file system degradation detection

When errors are encountered on the underlying hardware comprising zpools, ZFS reports such conditions via zpool status. It is essential to have a monitoring/notification framework in place that alerts administrators to take action on failing disks, to avoid further degradation and potential data loss.
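As an illustration, a simple cron-driven check might look like the following sketch; the recipient address is a placeholder, and in practice a Nagios check would typically fill this role:

    #!/bin/sh
    # Illustrative sketch: mail an alert if any zpool is unhealthy.
    status=$(zpool status -x)
    if [ "$status" != "all pools are healthy" ]; then
        echo "$status" | mail -s "zpool problem on $(hostname)" admin@yourdomain.com
    fi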

6.3. SLASH2 Software Install/Updates

First, obtain a clone of the SLASH2 code base on the local machine you will manage the updates from (e.g. a workstation or laptop):

$ git clone https://github.com/pscedu/slash2-stable slash2-stable

Next, run git pull after cd’ing to the directory:

$ cd slash2-stable; git pull

New binaries should be available in the work directories on each machine.  The final procedure is to install and restart the respective binaries on each machine:

$ ssh $machine

$ cd /path/to/slash2/src

$ git pull; make build

$ sudo make install

$ sudo pkill sliod # for IOS nodes

$ sudo pkill slashd # for MDS nodes

$ sudo umount /supercell # for clients
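The per-machine steps above could also be scripted, for example; the host names and source path in this sketch are placeholders:

    #!/bin/sh
    # Illustrative sketch: push an update to each node.  Restart the
    # appropriate daemon afterwards (sliod on IOS nodes, slashd on the MDS,
    # remount on clients).
    for machine in io0 io1 mds0 client0; do
        ssh $machine 'cd /path/to/slash2/src && git pull && make build && sudo make install'
    done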

Appendix A (sample of daily log reporting)

On 11/16/2016 14 users wrote 1,344 files totaling 8,024.000 GB

Top 15 Writers (GB) to /arc 11/16/2016

Username        Group            GB   Replicate?

   susan      ms4s84p     3,947.510           No
   blood      ms4s84p     2,543.602           No
 dsimmel      ms4s84p     1,402.645           No
 rbudden        staff        58.594           No
 rbudden     pscstaff        54.688           No
  ingres        staff        16.965           No
 nawendt      at4s84p         5.421           No
 sorescu         nscd         0.008           No
    root          arc         0.004           No
antonraw      antonbk         0.000           No
    root         root         0.000           No

Top 10 Writers (file count) to /arc 11/16/2016

Username        Group      Count   Replicate?

 dsimmel      ms4s84p        496           No
   blood      ms4s84p        383           No
   susan      ms4s84p        325           No
  ingres        staff         92           No
 nawendt      at4s84p         15           No
 rbudden        staff         15           No
 rbudden     pscstaff         14           No
    root         root          1           No
antonraw      antonbk          1           No
 sorescu         nscd          1           No

On 11/16/2016 11 users read 288 files totaling 38,118.797 GB

Top 10 Readers (GB) from /arc 11/16/2016

   susan    22,264.590
   blood    10,224.324
 dsimmel     5,509.431
 rbudden       115.304
 hewadew         3.579
   dchin         1.553
 backups         0.009
 sorescu         0.008
    root         0.000
 afsbkup         0.000

Top 10 Readers (file count) to /arc 11/16/2016

 dsimmel        112
   blood         81
   susan         46
 hewadew         26
   dchin         10
 backups          5
    root          2
 rbudden          2
 sorescu          2
 afsbkup          1

Hosts Read and Write (GB) 11/16/2016

                 Host        Read GB       Write GB

              arclog0          0.000          0.004
    br033.pvt.bridges     19,263.616      3,723.372
    br034.pvt.bridges     18,797.867      4,230.494
            firehose1          0.203          0.000
            firehose2          0.000         16.965
            firehose3          1.027          0.008
            firehose4          0.205          0.000
            firehose5          0.129          0.000
            firehose7         55.751         58.594

The largest written file for 11/16/2016 was 551 GB by user susan file id 0x0014000000146e43

The archiver has 1528.963 TB used and 1364.764 TB available

The archiver storage use grew by 3.931 TB since 11/16/2016

The daily growth is from 4 AM to 4 AM. The remainder of the report is from midnight to midnight.

SSD capacities over last 5 days

    Host         Date   Size TB/GB  Alloc TB/GB   Free TB/GB      Percent

illusion2   11/13/2016          888          465          423          52%
illusion2   11/14/2016          888          466          422          52%
illusion2   11/15/2016          888          467          421          52%
illusion2   11/16/2016          888          469          419          52%
illusion2   11/17/2016          888          470          418          52%

zfs_list over last 5 days

    Host         Date      Used GB     Avail GB     Refer GB      Mountpoint

illusion2   11/13/2016          465          409          418   /arc_s2ssd_prod
illusion2   11/14/2016          466          408          418   /arc_s2ssd_prod
illusion2   11/15/2016          467          407          418   /arc_s2ssd_prod
illusion2   11/16/2016          469          406          418   /arc_s2ssd_prod
illusion2   11/17/2016          470          404          418   /arc_s2ssd_prod

zfs_compression info over last 5 days

    Host         Date             Name   Comp Ratio   Compression        Local

illusion2   11/13/2016   arc_s2ssd_prod         2.8           lz4        local
illusion2   11/14/2016   arc_s2ssd_prod         2.8           lz4        local
illusion2   11/15/2016   arc_s2ssd_prod        2.79           lz4        local
illusion2   11/16/2016   arc_s2ssd_prod        2.79           lz4        local
illusion2   11/17/2016   arc_s2ssd_prod        2.79           lz4        local
