Ingesting

Ingesting files from external storage

A particularly powerful feature of ngenea is the ability to ingest existing data into a Spectrum Scale / GPFS file system by "reverse stubbing". This process creates a migrated file stub on the file system that points to a file on a defined external storage target. The file is then immediately accessible via the Spectrum Scale / GPFS file system, as if it had been created natively and then migrated via ngenea.

In this way, it is possible to rapidly and efficiently migrate any existing data into a Spectrum Scale / GPFS file system, without requiring a wholesale copy or move of data. Only metadata records need to be created before the data can be used via the Spectrum Scale / GPFS file system. Once this initial metadata creation is complete, data will automatically migrate to the file system on access, and can also be brought across as a background process.

The process for ingesting existing data holdings will vary based on requirements, but will typically consist of the following steps:

  1. define the ngenea configuration for the external storage
  2. generate a list of file/object paths to be ingested
  3. create any required directories
  4. create reverse stubs inside the directories with the command ngrecall --stub

Example - ingesting existing NFS storage

In this example, an existing storage system is mounted via NFS at /mnt/legacy on the ngenea node(s).

All data will be ingested into the legacy/ folder on the Spectrum Scale / GPFS file system at /gpfs1/.

The goal is to eventually move all data from the legacy system into the /gpfs1 file system.

Simultaneously, the /gpfs1 file system will be enabled with ngenea to migrate data to an S3 storage target, as per a standard ngenea deployment.

Master configuration files

Here, the external storage is coupled to the /gpfs1/legacy path. Since a different target is being used for subsequent migrations, the ingest uses a dedicated configuration file (/opt/arcapix/etc/ngenea-ingest.conf). The default configuration file (/opt/arcapix/etc/ngenea.conf) defines how to recall data from the legacy target, and sets the default migration (and recall) target(s) for subsequently migrated data.

/opt/arcapix/etc/ngenea-ingest.conf

This specifies the location of the configuration file for the legacy storage (legacy_nfs.conf), and assigns it as the target to be used for files under the /gpfs1/legacy path, with the remote path taken relative to /gpfs1/legacy. For example, a file with path /gpfs1/legacy/folder1/file1 would be mapped to a file on the target storage at /folder1/file1, relative to its root, i.e. /mnt/legacy/folder1/file1 in this scenario.

[Storage legacy_nfs]
StorageType=FS
ConfigFile=/opt/arcapix/etc/ngeneahsm/legacy_nfs.conf
RemoteLocationXAttrRegex=legacy_nfs:(.+)
LocalFileRegex=/gpfs1/legacy/(.+)
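
To make the path mapping concrete, the following bash sketch applies the same capture-and-substitute idea to a single path. It only illustrates the regex semantics, under the assumption that LocalFileRegex is matched against the full file path; the function name map_to_legacy is hypothetical and this is not a description of how ngenea is implemented internally.

```shell
#!/bin/bash
# Illustration only: mimic how the capture group from LocalFileRegex
# could be substituted into RetrieveObjectName (/mnt/legacy/$1).
# map_to_legacy is a hypothetical helper, not part of ngenea.
map_to_legacy() {
  local path=$1
  local regex='^/gpfs1/legacy/(.+)$'   # LocalFileRegex, anchored for this sketch
  if [[ $path =~ $regex ]]; then
    printf '/mnt/legacy/%s\n' "${BASH_REMATCH[1]}"
  else
    return 1                           # path is outside /gpfs1/legacy
  fi
}

map_to_legacy /gpfs1/legacy/folder1/file1   # prints: /mnt/legacy/folder1/file1
```

A path outside /gpfs1/legacy, such as /gpfs1/other/file1, produces no output and a non-zero exit status in this sketch.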

/opt/arcapix/etc/ngenea.conf

This specifies the configuration file to be used to recall files whose data blocks are stored in the legacy_nfs target, and also defines an AWS object storage target which will be the default for subsequent migrations across the whole /gpfs1 file system.

Using /READONLY(.+) as the LocalFileRegex for legacy_nfs effectively disables any use of the legacy NFS storage for standard data migrations.

[Storage aws_bucket1]
StorageType=AmazonS3
ConfigFile=/opt/arcapix/etc/ngeneahsm/aws_bucket1.conf
RemoteLocationXAttrRegex=aws_bucket1:(.+)
LocalFileRegex=/gpfs1/(.+)
[Storage legacy_nfs]
StorageType=FS
ConfigFile=/opt/arcapix/etc/ngeneahsm/legacy_nfs.conf
RemoteLocationXAttrRegex=legacy_nfs:(.+)
LocalFileRegex=/READONLY(.+)
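
As a rough illustration of why the /READONLY(.+) pattern keeps the legacy target out of normal migrations, the sketch below reports which of the two LocalFileRegex entries match a given path. It assumes the patterns are matched against the full path (the ^ and $ anchors are added by the sketch), and matching_targets is a hypothetical helper rather than an ngenea command.

```shell
#!/bin/bash
# Illustration only: report which LocalFileRegex entries from the example
# ngenea.conf match a path. Anchoring with ^...$ is an assumption of this sketch.
matching_targets() {
  local path=$1 entry name regex
  for entry in 'aws_bucket1=/gpfs1/(.+)' 'legacy_nfs=/READONLY(.+)'; do
    name=${entry%%=*}            # text before the first '='
    regex="^${entry#*=}\$"       # text after the first '=', anchored
    if [[ $path =~ $regex ]]; then
      echo "$name"
    fi
  done
}

matching_targets /gpfs1/legacy/folder1/file1   # prints: aws_bucket1
```

Under these assumptions, no path under /gpfs1 is reported as matching legacy_nfs, so only aws_bucket1 would be considered for new migrations.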

Storage Target configuration files

/opt/arcapix/etc/ngeneahsm/legacy_nfs.conf

This specifies that the legacy storage is mounted at /mnt/legacy. It also enables DeleteOnRecall, as the goal is to move all data off the legacy storage over time.

If the legacy storage were to remain in use as a migration target, DeleteOnRecall would typically be set to False.

[General]
RemoteLocationXAttr=legacy_nfs:$1
RetrieveObjectName=/mnt/legacy/$1
StoreObjectName=/mnt/legacy/$1
EnsureMountPoint=/mnt/legacy
DeleteOnRecall=True
ObjectXAttrManipulationMode=auto

/opt/arcapix/etc/ngeneahsm/aws_bucket1.conf

Note that we use DeleteOnRecall=False here, as this will be the general purpose migration target, and we wish to make use of premigration functionality.

[General]
AccessKeyId=ACCESSKEYID
SecretAccessKey=SECRETACCESSKEY
Bucket=my_ngenea_bucket
Region=eu-west-2
Scheme=HTTPS
SSLVerify=True
RemoteLocationXAttr=aws_bucket1:$1
RetrieveObjectName=$1
StoreObjectName=$1
DeleteOnRecall=False

Ingest script

The following script will:

  1. scan the legacy storage;
  2. create any required directories;
  3. create reverse stubs for all files contained, using the ngrecall --stub command

It uses the GNU Parallel tool to efficiently distribute the stub creation across many processes.

It will skip any files named *.xattr, as this is the file name pattern used by ngenea to store metadata where xattrs are not natively supported by the mounted legacy file system.

It is safe to re-run the script multiple times; it will skip any files which already exist.

Depending on the speed of the systems and storage in use, it will typically create around 500-1000 stub files per second.

#!/bin/bash

# Number of parallel worker processes
THREADS=32

function usage() {
  echo "Usage: ngenea_ingest.sh <GPFS_FS_MOUNTPOINT> <INGEST_SOURCE_MOUNTPOINT> <INGEST_SOURCE_SUBFOLDER>"
  # Print the optional error message, if one was given
  [ -n "$1" ] && echo "$1"
  exit 1
}

# Create reverse stubs for a batch of files, relative to the GPFS mount point
function reverse_stub() {
  cd "$1" || exit 1
  shift
  /opt/arcapix/bin/ngrecall --stub --config-file=/opt/arcapix/etc/ngenea-ingest.conf "$@"
}

# Create a directory (and any missing parents) under the GPFS mount point
function reverse_dir() {
  mkdir -p "$1/$2"
}

# Export the functions so that the shells started by parallel can call them
export -f reverse_dir
export -f reverse_stub

start_dir=$(pwd)

if [ $# -ne 3 ]; then usage; fi

if [ ! -e "$2/$3" ]; then usage "$2/$3 not found"; fi

mountpoint -q "$1" || usage "$1 is not a mountpoint"

mountpoint -q "$2" || usage "$2 is not a mountpoint"


# Discover and stub:
if [ -d "$2" ]; then
  cd "$2" || exit 1
  # First, make all directories:
  find "$3" -type d | SHELL=$(type -p bash) parallel -j "$THREADS" reverse_dir "$1" {}
  # Now, reverse stub all files:
  find "$3" -type f ! -name '*.xattr' | SHELL=$(type -p bash) parallel -j "$THREADS" --xargs reverse_stub "$1" {}
fi
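
Before launching a large ingest, it may help to gauge how long stub creation is likely to take. The sketch below counts the files that would be stubbed, using the same find filter as the ingest script, and converts that count into a duration at a given stub rate. The function names are hypothetical, the count assumes no file names contain newlines, and the rate is whatever figure applies to the systems in use (500 to 1000 stubs per second is the typical range quoted above).

```shell
#!/bin/bash
# Count the files that would be stubbed: every regular file except
# the *.xattr metadata files, mirroring the filter in the ingest script.
count_ingest_candidates() {
  find "$1" -type f ! -name '*.xattr' | wc -l
}

# Estimate the stub creation time in whole seconds (rounded up)
# for a given file count and a given rate in stubs per second.
estimate_seconds() {
  local files=$1 rate=$2
  echo $(( (files + rate - 1) / rate ))
}
```

For example, estimate_seconds 1000000 500 prints 2000 and estimate_seconds 1000000 1000 prints 1000, so one million files would be expected to take roughly 17 to 33 minutes at the quoted rates.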

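The ingest script can then be executed on a sub-folder basis to perform a selective ingest. The sub-folder is given relative to the source mount point, since the script checks for it at <INGEST_SOURCE_MOUNTPOINT>/<INGEST_SOURCE_SUBFOLDER>:

ngenea_ingest.sh /gpfs1 /mnt/legacy folder1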

Alternatively, to ingest the whole file system, pass . as the sub-folder:

ngenea_ingest.sh /gpfs1 /mnt/legacy .
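
After an ingest run, one way to confirm that nothing was missed is to list the source files that have no counterpart under the destination. The following is a minimal sketch of such a check, not part of ngenea: list_missing is a hypothetical helper, the destination root must be given as an absolute path, and the check only tests for existence, not for stub state.

```shell
#!/bin/bash
# Sketch: print source files (as relative paths) that have no counterpart
# under the destination root. *.xattr metadata files are ignored, as in
# the ingest script. dest_root must be an absolute path.
list_missing() {
  local src_root=$1 dest_root=$2 rel
  (
    cd "$src_root" || exit 1
    find . -type f ! -name '*.xattr' -print0 |
      while IFS= read -r -d '' rel; do
        [ -e "$dest_root/$rel" ] || printf '%s\n' "${rel#./}"
      done
  )
}
```

An empty result suggests that every source file has an entry at the destination; any paths printed can be passed back through the ingest script, which is safe to re-run.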