# Ingesting

## Ingesting files from external storage

A particularly powerful feature of ngenea is the ability to ingest existing data into a Spectrum Scale / GPFS file system by "reverse stubbing". This process creates a migrated file stub on the file system which points to any file on a defined external storage target. The file is then immediately accessible via the Spectrum Scale / GPFS file system as if it had been natively created and then migrated via ngenea.

In this way, it is possible to rapidly and efficiently migrate any existing data into a Spectrum Scale / GPFS file system, without requiring a wholesale copy or move of data. Only metadata records need to be created prior to beginning use of the data via the Spectrum Scale / GPFS file system. Once this initial metadata creation is complete, data will automatically migrate to the file system on access, and can also be brought across as a background process.

The process for ingesting existing data holdings will vary based on requirements, but the process will typically consist of:

1. define the ngenea configuration for the external storage
2. generate a list of file/object paths to be ingested
3. create any required directories
4. create reverse stubs inside the directories with the command ngrecall --stub

## Example - ingesting existing NFS storage

In this example, an existing storage system is mounted via NFS at /mnt/legacy on the ngenea node(s).

All data will be ingested into the legacy/ folder on the Spectrum Scale / GPFS file system at /gpfs1/.

The goal is to eventually move all data from the legacy system into the /gpfs1 file system.

Simultaneously, the /gpfs1 file system will be enabled with ngenea to migrate data to an S3 storage target, as per a standard ngenea deployment.

### Master configuration files

Here, the external storage is coupled to the /gpfs1/legacy path. Since a different target is being used for subsequent migrations, this is created as a dedicated configuration file for the ingest (/opt/arcapix/etc/ngenea-ingest.conf). The default configuration file (/opt/arcapix/etc/ngenea.conf) defines how to recall data from the target, as well as setting the default migration (and recall) target(s) for subsequently migrated data.

#### /opt/arcapix/etc/ngenea-ingest.conf

This specifies the location of the configuration file for the legacy storage (legacy_nfs.conf), and assigns it as the target to be used for files under the /gpfs1/legacy path, with the remote path taken relative to /gpfs1/legacy. For example, a file with path /gpfs1/legacy/folder1/file1 would be mapped to a file on the target storage at /folder1/file1, relative to its root - i.e. /mnt/legacy/folder1/file1 in this scenario.

```
[Storage legacy_nfs]
StorageType=FS
ConfigFile=/opt/arcapix/etc/ngeneahsm/legacy_nfs.conf
RemoteLocationXAttrRegex=legacy_nfs:(.+)
LocalFileRegex=/gpfs1/legacy/(.+)
```
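The regex mapping described above can be mimicked with a small shell snippet (illustrative only: ngenea applies LocalFileRegex internally; this simply reproduces the capture-group substitution by hand):

```shell
# Capture the path relative to /gpfs1/legacy, as LocalFileRegex=/gpfs1/legacy/(.+)
# would, then prefix it with the legacy mount point to locate the source file.
gpfs_path=/gpfs1/legacy/folder1/file1
rel=$(printf '%s' "$gpfs_path" | sed -E 's|^/gpfs1/legacy/(.+)$|\1|')
echo "/mnt/legacy/$rel"   # prints /mnt/legacy/folder1/file1
```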


#### /opt/arcapix/etc/ngenea.conf

This specifies the configuration file to be used where a file's data blocks are stored in the legacy_nfs target, and also defines an AWS object storage target, which will be the default for subsequent migrations across the whole /gpfs1 file system.

The use of READONLY as the LocalFileRegex effectively disables any use of the legacy NFS storage for standard data migrations.

```
[Storage aws_bucket1]
StorageType=AmazonS3
ConfigFile=/opt/arcapix/etc/ngeneahsm/aws_bucket1.conf
RemoteLocationXAttrRegex=aws_bucket1:(.+)
LocalFileRegex=/gpfs1/(.+)

[Storage legacy_nfs]
StorageType=FS
ConfigFile=/opt/arcapix/etc/ngeneahsm/legacy_nfs.conf
RemoteLocationXAttrRegex=legacy_nfs:(.+)
LocalFileRegex=READONLY
```

### Storage Target configuration files

#### /opt/arcapix/etc/ngeneahsm/legacy_nfs.conf

This specifies that the legacy storage is mounted at /mnt/legacy. It also enables DeleteOnRecall, as the goal is to move all data off the legacy storage over time.

If it were to be used ongoing as a migration target, DeleteOnRecall would typically be set to False.

```
[General]
RemoteLocationXAttr=legacy_nfs:$1
RetrieveObjectName=/mnt/legacy/$1
StoreObjectName=/mnt/legacy/$1
EnsureMountPoint=/mnt/legacy
DeleteOnRecall=True
ObjectXAttrManipulationMode=auto
```

#### /opt/arcapix/etc/ngeneahsm/aws_bucket1.conf

Note that we use DeleteOnRecall=False here, as this will be the general purpose migration target, and we wish to make use of premigration functionality.

```
[General]
AccessKeyId=ACCESSKEYID
SecretAccessKey=SECRETACCESSKEY
Bucket=my_ngenea_bucket
Region=eu-west-2
Scheme=HTTPS
SSLVerify=True
RemoteLocationXAttr=aws_bucket1:$1
RetrieveObjectName=$1
StoreObjectName=$1
DeleteOnRecall=False
```


### Ingest script

The following script will:

1. scan the legacy storage;
2. create any required directories;
3. create reverse stubs for all files contained, using the ngrecall --stub command

It uses the GNU Parallel tool to efficiently distribute the stub creation across many processes.
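For readers unfamiliar with this pattern, the essential trick is exporting a bash function so that child shells can call it. A rough, GNU-Parallel-free sketch using xargs (the function name and inputs here are hypothetical; the real script uses parallel):

```shell
# Export a bash function so the child shells spawned by xargs can invoke it,
# then fan a list of names out over up to 4 concurrent processes.
work() { echo "processed $1"; }
export -f work
printf '%s\n' a b c | xargs -P 4 -I{} bash -c 'work "$@"' _ {}
```

parallel's --xargs mode additionally batches many arguments into each invocation, which is what keeps the per-file ngrecall overhead low.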

It will skip any files named *.xattr, as this is the file name pattern used by ngenea to store metadata where xattrs are not natively supported by the mounted legacy file system.
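The filter can be demonstrated in isolation (the scratch directory and file names here are hypothetical):

```shell
# Create a data file plus the .xattr metadata sidecar ngenea would write,
# then show that the ingest filter matches only the data file.
demo=$(mktemp -d)
touch "$demo/file1" "$demo/file1.xattr"
find "$demo" -type f ! -name '*.xattr'   # lists only .../file1
rm -rf "$demo"
```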

It is safe to re-run it multiple times - it will skip past any files which already exist.

Depending on the speed of the systems and storage in use, it will typically create around 500-1000 stub files per second.

```bash
#!/bin/bash

# Number of parallel jobs (override by exporting THREADS before running)
THREADS=${THREADS:-16}

function usage() {
    echo "Usage: ngenea_ingest.sh <GPFS_FS_MOUNTPOINT> <INGEST_SOURCE_MOUNTPOINT> <INGEST_SOURCE_SUBFOLDER>"
    echo "$1"
    exit 1
}

function reverse_stub() {
    cd "$1"
    shift
    /opt/arcapix/bin/ngrecall --stub --config-file=/opt/arcapix/etc/ngenea-ingest.conf "$@"
}

function reverse_dir() {
    mkdir -p "$1/$2"
}

export -f reverse_dir
export -f reverse_stub

start_dir=$(pwd)

if [ $# -ne 3 ]; then usage; fi

if [ ! -e "$2/$3" ]; then usage "$2/$3 not found"; fi

mountpoint -q "$1" || usage "$1 is not a mountpoint"
mountpoint -q "$2" || usage "$2 is not a mountpoint"

# Discover and stub:
if [ -d "$2" ]; then
    cd "$2"
    # First, make all directories:
    find "$3" -type d | SHELL=$(type -p bash) parallel -j $THREADS reverse_dir "$1" {}
    # Now, reverse stub all files:
    find "$3" -type f ! -name '*.xattr' | SHELL=$(type -p bash) parallel -j $THREADS --xargs reverse_stub "$1" {}
fi

cd "$start_dir"
```


The script can then be executed on a sub-folder basis to do a selective ingest:

```bash
ngenea_ingest.sh /gpfs1 /mnt/legacy folder1
```


Alternatively, to ingest the whole file system:

```bash
ngenea_ingest.sh /gpfs1 /mnt/legacy .
```