Ingesting files from external storage

A particularly powerful feature of ngenea is the ability to ingest existing data into a Spectrum Scale / GPFS file system by "reverse stubbing". This process creates a migrated file stub on the file system which points to any file on a defined external storage target. The file is then immediately accessible via the Spectrum Scale / GPFS file system as if it had been natively created and then migrated via ngenea.

In this way, it is possible to rapidly and efficiently migrate any existing data into a Spectrum Scale / GPFS file system, without requiring a wholesale copy or move of data. Only metadata records need to be created prior to beginning use of the data via the Spectrum Scale / GPFS file system. Once this initial metadata creation is complete, data will automatically migrate to the file system on access, and can also be brought across as a background process.

The process for ingesting existing data holdings will vary based on requirements, but will typically consist of the following steps:

  1. define the ngenea configuration for the external storage
  2. generate a list of file/object paths to be ingested
  3. create any required directories
  4. create reverse stubs inside the directories with the command ngrecall --stub
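
Steps 2 to 4 can be sketched in shell as below. This is a minimal dry run under stated assumptions, not the ingest tooling itself: the function names list_paths, list_dirs and plan_stubs are hypothetical, the external storage is assumed to be visible as a local directory tree, and the ngrecall --stub command lines are only printed, never executed. A real run would normally need further options, such as a configuration file.

```bash
#!/bin/bash
# Dry-run sketch: function names are hypothetical and nothing is created.

# Step 2: list file paths under source root $1, relative to that root.
list_paths() {
  (cd "$1" && find . -type f | sed 's|^\./||' | LC_ALL=C sort)
}

# Step 3: list the directories that would need to exist, relative to $1.
list_dirs() {
  (cd "$1" && find . -mindepth 1 -type d | sed 's|^\./||' | LC_ALL=C sort)
}

# Step 4: print (not run) one stub command per file, to be run from target root $2.
plan_stubs() {
  list_paths "$1" | while IFS= read -r path; do
    printf 'cd %s && ngrecall --stub %s\n' "$2" "$path"
  done
}
```

For example, plan_stubs /mnt/legacy /gpfs1/legacy would print one line per file found under /mnt/legacy, which can be reviewed before any stubs are created.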

Example - ingesting existing NFS storage

In this example, an existing storage system is mounted via NFS at /mnt/legacy on the ngenea node(s).

All data will be ingested into the legacy/ folder on the Spectrum Scale / GPFS file system at /gpfs1/.

The goal is to eventually move all data from the legacy system into the /gpfs1 file system.

Simultaneously, the /gpfs1 file system will be enabled with ngenea to migrate data to an S3 storage target, as per a standard ngenea deployment.

Master configuration files

Here, the external storage is coupled to the /gpfs1/legacy path. Since a different target is being used for subsequent migrations, this coupling is defined in a dedicated configuration file for the ingest (/opt/arcapix/etc/ngenea-ingest.conf). The default configuration file (/opt/arcapix/etc/ngenea.conf) defines how to recall data from the legacy target, as well as setting the default migration (and recall) target(s) for subsequently migrated data.


This specifies the location of the configuration file for the legacy storage (legacy_nfs.conf), and assigns it as the target to be used for files under the /gpfs1/legacy path, with the relative path set underneath /gpfs1/legacy. For example, a file with path /gpfs1/legacy/folder1/file1 would be mapped to a file on the target storage at /folder1/file1, relative to its root, i.e. /mnt/legacy/folder1/file1 in this scenario.
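
The path mapping described above can be illustrated with a small shell helper. This is only a sketch of the mapping rule for this scenario, assuming the /gpfs1/legacy and /mnt/legacy roots from the example; the function name map_to_legacy is hypothetical and is not part of ngenea.

```bash
#!/bin/bash
# Hypothetical helper: map a path under the GPFS ingest root to the
# corresponding path on the mounted legacy storage.
map_to_legacy() {
  local gpfs_root="/gpfs1/legacy"   # path coupled to the legacy storage target
  local legacy_root="/mnt/legacy"   # mount point of the legacy storage
  local path="$1"
  case "$path" in
    "$gpfs_root"/*)
      # Strip the GPFS root, then prepend the legacy root.
      printf '%s%s\n' "$legacy_root" "${path#"$gpfs_root"}"
      ;;
    *)
      echo "not under $gpfs_root: $path" >&2
      return 1
      ;;
  esac
}
```

Calling map_to_legacy /gpfs1/legacy/folder1/file1 prints /mnt/legacy/folder1/file1, while a path outside /gpfs1/legacy is rejected with a non-zero exit status.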

[Storage legacy_nfs]


This specifies the configuration file to be used for files whose data blocks are stored in the legacy_nfs target, and also defines an AWS object storage target, which will be the default for subsequent migrations across the whole /gpfs1 file system.

The use of READONLY as the LocalFileRegex effectively disables any use of the legacy NFS storage for standard data migrations.

[Storage aws_bucket1]
[Storage legacy_nfs]

Storage Target configuration files


This specifies that the legacy storage is mounted at /mnt/legacy. It also enables DeleteOnRecall, as the goal is to move all data off the legacy storage over time.

If the legacy storage were to be used as a migration target on an ongoing basis, DeleteOnRecall would typically be set to False.



Note that we use DeleteOnRecall=False here, as this will be the general purpose migration target, and we wish to make use of premigration functionality.


Ingest script

The following script will:

  1. scan the legacy storage;
  2. create any required directories;
  3. create reverse stubs for all files contained, using the ngrecall --stub command

It uses the GNU Parallel tool to efficiently distribute the stub creation across many processes.

It will skip any files named *.xattr, as this is the file name pattern used by ngenea for storing metadata where xattrs are not natively supported by the mounted legacy file system.

It is safe to re-run it multiple times - it will skip past any files which already exist.

Depending on the speed of the systems and storage in use, it will typically create around 500-1000 stub files per second.



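#!/bin/bash

# Number of parallel jobs. The default here is an assumed value; adjust to suit the system.
THREADS="${THREADS:-8}"

# Print an error message and exit.
function usage() {
  echo "$1"
  exit 1
}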

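# Create reverse stubs: $1 is the target directory, remaining arguments are file paths.
function reverse_stub() {
  cd "$1" || return 1
  shift  # drop the directory so that only file paths are passed to ngrecall
  /opt/arcapix/bin/ngrecall --stub --config-file=/opt/arcapix/etc/ngenea-ingest.conf "$@"
}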

# Create directory $2 underneath the target root $1.
function reverse_dir() {
  mkdir -p "$1/$2"
}

export -f reverse_dir
export -f reverse_stub


if [ $# -ne 3 ]; then usage "Expected 3 arguments, got $#"; fi

if [ ! -e "$2/$3" ]; then usage "$2/$3 not found"; fi

mountpoint -q "$1" || usage "$1 is not a mountpoint"

mountpoint -q "$2" || usage "$2 is not a mountpoint"

# Discover and stub:
if [ -d "$2" ]; then
  cd "$2" || exit 1
  # First, make all directories:
  find "$3" -type d | SHELL=$(type -p bash) parallel -j "$THREADS" reverse_dir "$1" {}
  # Now, reverse stub all files:
  find "$3" -type f ! -name '*.xattr' | SHELL=$(type -p bash) parallel -j "$THREADS" --xargs reverse_stub "$1" {}
fi

The script can then be executed on a sub-folder basis to do a selective ingest: /gpfs1 /mnt/legacy /mnt/legacy/folder1

Alternatively, to ingest the whole file system: /gpfs1 /mnt/legacy /mnt/legacy
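
Before a large ingest, it can be useful to check how many files remain to be stubbed and roughly how long stub creation might take at the 500-1000 stubs per second quoted above. The following is a minimal pre-flight sketch, not part of ngenea: the function names pending_files and estimate_seconds are hypothetical, and a file is treated as already ingested if anything exists at the same relative path under the target root.

```bash
#!/bin/bash
# Pre-flight sketch; function names are hypothetical.

# List files under source root $1 (relative paths, *.xattr excluded)
# that do not yet exist under target root $2.
pending_files() {
  (cd "$1" && find . -type f ! -name '*.xattr' | sed 's|^\./||' | LC_ALL=C sort) |
  while IFS= read -r path; do
    [ -e "$2/$path" ] || printf '%s\n' "$path"
  done
}

# Estimate the seconds needed to create $1 stubs at $2 stubs per second,
# rounded up to a whole second.
estimate_seconds() {
  echo $(( ($1 + $2 - 1) / $2 ))
}
```

For example, estimate_seconds 1000000 500 prints 2000, so one million files would take a little over half an hour at the lower quoted rate. Running pending_files /mnt/legacy /gpfs1/legacy after an ingest should print nothing if every file has a stub.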