Plugin Development Walk-through

In this guide, we’ll walk through developing a custom plugin for PixStor Search.

Introduction

The plugin we’ll develop will be a metadata plugin, which will extract sheet names from Excel spreadsheets.

This guide will cover:

  • writing a plugin
  • validating the plugin
  • installing and enabling the plugin
  • performing an ingest with the plugin

This is a relatively high-level, practical guide. See the Plugin Development Overview for a more detailed description of plugin development.

See also Asynchronous Proxy Plugin Walk-Through for a description of developing an asynchronous, proxy-generating plugin.

Plugin template

We’ll start with a plugin template.

Create a new file in the user plugins directory

$ mkdir -p /opt/arcapix/usr/share/apsearch/plugins/available/user
$ vim /opt/arcapix/usr/share/apsearch/plugins/available/user/excel_plugin.py

We’ve put the plugin under available, not enabled - we’re not ready to enable it yet.

Warning

It’s best not to enable the plugin until we’re happy with the schema. Once the schema is registered, it’s hard to change.

The plugin can’t be used in ingests until it is enabled.

Paste the following code into the new file

from arcapix.search.metadata.plugins.base import Plugin, PluginStatus
from arcapix.search.metadata.helpers import Metadata


class SamplePlugin(Plugin):

    def namespace(self):
        """Return the namespace for the metadata for this plugin."""
        return 'sample'

    def is_async(self):
        """Return whether this plugin does any operations asynchronously."""
        return False

    def handles(self, ext=None, mimetype=None):
        """Return whether the plugin can handle a given file, based on extension/mimetype

        :param ext: file extension, including a leading '.'
        :param mimetype: file mime type, e.g. image/png

        :returns: True if the plugin needs to process this file, false otherwise
        """
        return False

    def schema(self):
        """Return the schema for the metadata produced by this plugin.

        All metadata will be validated against this before inserting into the database.
        """
        return []

    def _extract(self, filename):
        """Private worker function to extract metadata from a file.

        :param filename: file to extract data from

        :return: dict of metadata
        """
        raise NotImplementedError("TODO")

    def process(self, id_, file_, fileinfo=None):
        """Extract metadata from a given file and submit an update.

        :param id_: black box identifier which shall be passed to the metadata update functions
        :param file_: full path to the file
        :param fileinfo: dict of information which may have already been gathered and may be of use

        :returns: PluginStatus indicating success/failure
        """
        try:
            data = self._extract(file_)

            if not data:
                # didn't find any metadata
                return PluginStatus.SKIPPED

            if Metadata(id_, self).update(data):
                return PluginStatus.SUCCESS

            return PluginStatus.ERRORED

        except Exception as exc:
            self.logger.error(
                "An exception was raised while processing '%s' (%s): %s",
                file_, id_, exc
            )
            return PluginStatus.FATAL
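
The control flow of process() can be exercised outside of Search using stand-ins. The sketch below is not the arcapix API: the PluginStatus enum is a local stand-in with made-up values, and run_process, extract, and update are hypothetical names standing in for process, _extract, and Metadata(...).update. It only shows which status each branch of the template returns.

```python
import enum
import logging


class PluginStatus(enum.Enum):
    # Local stand-in for arcapix's PluginStatus; these values are made up.
    SUCCESS = "success"
    SKIPPED = "skipped"
    ERRORED = "errored"
    FATAL = "fatal"


def run_process(extract, update, filename):
    """Mirror the template's process() flow with injected callables."""
    try:
        data = extract(filename)

        if not data:
            # nothing extracted, so there is nothing to submit
            return PluginStatus.SKIPPED

        if update(data):
            return PluginStatus.SUCCESS

        # update() reported a failure without raising
        return PluginStatus.ERRORED

    except Exception as exc:
        logging.getLogger("sketch").error("failed on %r: %s", filename, exc)
        return PluginStatus.FATAL


def unimplemented(filename):
    # Same behaviour as the template's _extract before it is filled in.
    raise NotImplementedError("TODO")


if __name__ == "__main__":
    print(run_process(lambda f: {}, lambda d: True, "a.xlsx"))
    print(run_process(lambda f: {"sheets": ["A"]}, lambda d: True, "a.xlsx"))
    print(run_process(lambda f: {"sheets": ["A"]}, lambda d: False, "a.xlsx"))
    print(run_process(unimplemented, lambda d: True, "a.xlsx"))
```

The last case is the one the unfinished template hits: _extract raises NotImplementedError, so the exception is logged and FATAL is returned.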

Plugin check tool

As we develop this plugin, we’ll use the searchctl plugins check tool to check for issues.

Let’s try it now. To specify our plugin, we give it the full path and the plugin (class) name, separated by ::.

For brevity, let’s define an environment variable for the plugin path

$ export PLUGIN_PATH=/opt/arcapix/usr/share/apsearch/plugins/available/user/excel_plugin.py
$ searchctl plugins check $PLUGIN_PATH::SamplePlugin

WARNING:arcapix.search.metadata.plugins.validation:plugin: plugin doesn't appear to be installed in plugin path
ERROR:arcapix.search.metadata.plugins.validation:sample: field isn't registered
WARNING:arcapix.search.metadata.plugins.validation:PxS middleware must be restarted after a plugin is installed:
WARNING:arcapix.search.metadata.plugins.validation:    $ systemctl restart apsearch-middleware
WARNING:arcapix.search.metadata.plugins.validation:No test files given. Skipping file-based checks. Some issues may be missed

SamplePlugin :: server :: sample
❌ field isn't registered
plugin
❗ plugin doesn't appear to be installed in plugin path

As we can see, the check reports that the plugin isn’t installed and that its namespace isn’t registered. Since we’re not ready to enable the plugin yet, we can ignore this for now.

Filling in the template

Now we can start to fill in the plugin template.

First we’ll rename the plugin class to ExcelPlugin, and set the namespace to excel

class ExcelPlugin(Plugin):

    def namespace(self):
        return 'excel'

    # etc.

(docstrings have been omitted for clarity).

This is a synchronous plugin, so we can leave the is_async method as False.

Handles

For handles we want to match files with

Extension    MIME type
.xls         application/vnd.ms-excel
.xlsx        application/vnd.openxmlformats-officedocument.spreadsheetml.sheet

So let’s add these to our handles method

EXTENSIONS = ['.xls', '.xlsx']
MIME_TYPES = [
    'application/vnd.ms-excel',
    'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet',
]

class ExcelPlugin(Plugin):

    def handles(self, ext=None, mimetype=None):
        return ext in EXTENSIONS or mimetype in MIME_TYPES
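
To get a feel for how this matching behaves, here is a standalone sketch of the same expression as a plain function. The membership test is exact, so an upper-case extension would not match. Whether Search normalises the case of ext before calling handles is not stated in this guide, so handles_case_insensitive is a hypothetical variant and only needed if it does not.

```python
EXTENSIONS = ['.xls', '.xlsx']
MIME_TYPES = [
    'application/vnd.ms-excel',
    'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet',
]


def handles(ext=None, mimetype=None):
    """Same expression as the plugin method, without the class."""
    return ext in EXTENSIONS or mimetype in MIME_TYPES


def handles_case_insensitive(ext=None, mimetype=None):
    """Hypothetical variant: lower-case both inputs before comparing."""
    ext = ext.lower() if ext else ext
    mimetype = mimetype.lower() if mimetype else mimetype
    return ext in EXTENSIONS or mimetype in MIME_TYPES


if __name__ == "__main__":
    print(handles('.xlsx'))                               # extension match
    print(handles(mimetype='application/vnd.ms-excel'))   # mimetype match
    print(handles('.txt', 'text/plain'))                  # neither matches
    print(handles('.XLSX'), handles_case_insensitive('.XLSX'))
```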

Schema

The metadata we want this plugin to extract is a list of sheet names - a list of strings - so we define a schema as follows

def schema(self):
    return [
        {
            "name": "sheets",
            "prompt": "Names of sheets in this workbook",
            "value": {
                "datatype": "[String]"
            }
        }
    ]

The square brackets around [String] indicate that this is a list.
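
As an illustration of what the schema implies for the data returned by _extract, the sketch below checks a metadata dict against a schema of this shape. It is not the validator Search uses: matches_datatype and check are hypothetical helpers, and they only understand the String and Boolean datatypes and their bracketed list forms.

```python
SCHEMA = [
    {
        "name": "sheets",
        "prompt": "Names of sheets in this workbook",
        "value": {"datatype": "[String]"},
    }
]


def matches_datatype(value, datatype):
    """Hypothetical check; an unsupported datatype raises KeyError."""
    scalar_types = {"String": str, "Boolean": bool}
    if datatype.startswith("[") and datatype.endswith("]"):
        # bracketed form: a list whose items all have the inner type
        inner = scalar_types[datatype[1:-1]]
        return isinstance(value, list) and all(isinstance(v, inner) for v in value)
    return isinstance(value, scalar_types[datatype])


def check(schema, data):
    """Return a list of problems (empty when data fits the schema)."""
    fields = {field["name"]: field["value"]["datatype"] for field in schema}
    problems = []
    for name, value in data.items():
        if name not in fields:
            problems.append("%s: not in schema" % name)
        elif not matches_datatype(value, fields[name]):
            problems.append("%s: expected %s" % (name, fields[name]))
    return problems


if __name__ == "__main__":
    print(check(SCHEMA, {"sheets": ["Summary", "Expenses"]}))
    print(check(SCHEMA, {"sheets": "Summary"}))
    print(check(SCHEMA, {"readonly": False}))
```

A bare string fails the [String] check because the field is declared as a list, and a key that is missing from the schema is reported separately.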

Review

Before we move on to implementing metadata extraction, let’s review what we have so far.

Here’s the full, updated plugin

from arcapix.search.metadata.plugins.base import Plugin, PluginStatus
from arcapix.search.metadata.helpers import Metadata


EXTENSIONS = ['.xls', '.xlsx']
MIME_TYPES = [
    'application/vnd.ms-excel',
    'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet',
]


class ExcelPlugin(Plugin):

    def namespace(self):
        """Return the namespace for the metadata for this plugin."""
        return 'excel'

    def is_async(self):
        """Return whether this plugin does any operations asynchronously."""
        return False

    def handles(self, ext=None, mimetype=None):
        """Return whether the plugin can handle a given file, based on extension/mimetype

        :param ext: file extension, including a leading '.'
        :param mimetype: file mime type, e.g. image/png

        :returns: True if the plugin needs to process this file, false otherwise
        """
        return ext in EXTENSIONS or mimetype in MIME_TYPES

    def schema(self):
        """Return the schema for the metadata produced by this plugin.

        All metadata will be validated against this before inserting into the database.
        """
        return [
            {
                "name": "sheets",
                "prompt": "Names of sheets in this workbook",
                "value": {
                    "datatype": "[String]"
                }
            }
        ]

    def _extract(self, filename):
        """Private worker function to extract metadata from a file.

        :param filename: file to extract data from

        :return: dict of metadata
        """
        raise NotImplementedError("TODO")

    def process(self, id_, file_, fileinfo=None):
        """Extract metadata from a given file and submit an update.

        :param id_: black box identifier which shall be passed to the metadata update functions
        :param file_: full path to the file
        :param fileinfo: dict of information which may have already been gathered and may be of use

        :returns: PluginStatus indicating success/failure
        """
        try:
            data = self._extract(file_)

            if not data:
                # didn't find any metadata
                return PluginStatus.SKIPPED

            if Metadata(id_, self).update(data):
                return PluginStatus.SUCCESS

            return PluginStatus.ERRORED

        except Exception as exc:
            self.logger.error(
                "An exception was raised while processing '%s' (%s): %s",
                file_, id_, exc
            )
            return PluginStatus.FATAL

Let’s check the plugin with searchctl - remember that we renamed the plugin class

$ searchctl plugins check $PLUGIN_PATH::ExcelPlugin

WARNING:arcapix.search.metadata.plugins.validation:plugin: plugin doesn't appear to be installed in plugin path
ERROR:arcapix.search.metadata.plugins.validation:excel: field isn't registered
WARNING:arcapix.search.metadata.plugins.validation:PxS middleware must be restarted after a plugin is installed:
WARNING:arcapix.search.metadata.plugins.validation:    $ systemctl restart apsearch-middleware
WARNING:arcapix.search.metadata.plugins.validation:No test files given. Skipping file-based checks. Some issues may be missed

ExcelPlugin :: server :: excel
❌ field isn't registered
plugin
❗ plugin doesn't appear to be installed in plugin path

So far, so good; we haven’t broken anything.

Let’s check it against a sample file

$ searchctl plugins check $PLUGIN_PATH::ExcelPlugin /mmfs1/data/admin/expenses.xlsx
...
WARNING:arcapix.search.metadata.plugins.validation:Plugin returned PluginStatus.FATAL
ERROR:arcapix.search.metadata.plugins.validation:An error occured while processing "/mmfs1/data/admin/expenses.xlsx"
Traceback (most recent call last):
File "/usr/share/arcapix/apsearch/lib/python3.6/site-packages/arcapix/search/metadata/plugins/validation.py", line 692, in _check_process_no_errors_raised
    raising_process(self.plugin, MetadataId.from_string(0), file_)
File "process", line 11, in process
File "/opt/arcapix/usr/share/apsearch/plugins/available/user/excel_plugin.py", line 54, in _extract
    raise NotImplementedError("TODO")
NotImplementedError: TODO

ExcelPlugin :: /mmfs1/data/admin/expenses.xlsx :: process
❌ exception NotImplementedError: "TODO" was caught during process

Okay, time to extract some metadata.

Extracting metadata

To extract our metadata we’re going to use a 3rd-party library: openpyxl

from openpyxl import load_workbook

class ExcelPlugin(Plugin):

    # ...

    def _extract(self, filename):
        workbook = load_workbook(filename=filename)

        return {'sheets': workbook.sheetnames}

sheets is the field name we defined in the schema. workbook.sheetnames returns a list of strings, so there’s nothing more to do.
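
For context on what is being read: an .xlsx file is a ZIP archive, and the sheet names are stored in a workbook XML part. The stdlib-only sketch below reads them directly. It is an illustration rather than a replacement for openpyxl, it assumes the workbook part is at its conventional location xl/workbook.xml, and sheet_names and make_minimal_xlsx are hypothetical helpers. It does not apply to the legacy binary .xls format; openpyxl likewise reads only the Office Open XML formats, so the '.xls' entry in EXTENSIONS may warrant a second look.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# XML namespace used by SpreadsheetML workbook parts.
MAIN_NS = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"


def sheet_names(source):
    """Return sheet names from an .xlsx file (path or file-like object)."""
    with zipfile.ZipFile(source) as archive:
        # Assumption: the workbook part lives at the conventional path.
        root = ET.fromstring(archive.read("xl/workbook.xml"))
    return [sheet.attrib["name"] for sheet in root.iter("{%s}sheet" % MAIN_NS)]


def make_minimal_xlsx(names):
    """Build an in-memory archive holding only xl/workbook.xml, for the demo."""
    sheets = "".join(
        '<sheet name="%s" sheetId="%d"/>' % (name, index)
        for index, name in enumerate(names, start=1)
    )
    xml = '<workbook xmlns="%s"><sheets>%s</sheets></workbook>' % (MAIN_NS, sheets)
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("xl/workbook.xml", xml)
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    print(sheet_names(make_minimal_xlsx(["Summary", "Expenses", "Mileage"])))
```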

Let’s check it

$ searchctl plugins check $PLUGIN_PATH::ExcelPlugin /mmfs1/data/admin/expenses.xlsx
searchctl error: No module named openpyxl

Oops..!

Installing dependencies

Since our plugin relies on a 3rd-party library, we need to install it.

PixStor Search lives in its own virtual environment, so it doesn’t have access to libraries installed outside of that environment (e.g. using yum or the system Python).

So we need to install the dependency inside of Search’s virtual environment

$ /usr/share/arcapix/apsearch/bin/pip install openpyxl
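
To confirm a dependency is visible to a particular interpreter, a small check like the following can help. It is a sketch: missing_modules is a hypothetical helper, and it only tells you about the interpreter that runs it, so it would need to be run with the Python from Search’s virtual environment (presumably the python next to the pip used above, though that path is an assumption).

```python
import importlib.util
import sys


def missing_modules(names):
    """Return the top-level module names that this interpreter cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    print("interpreter:", sys.executable)
    print("missing:", missing_modules(["openpyxl"]))
```

An empty list means the import in the plugin should succeed under that interpreter.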

Validate metadata

Okay, let’s try validating again

$ searchctl plugins check $PLUGIN_PATH::ExcelPlugin /mmfs1/data/admin/expenses.xlsx

WARNING:arcapix.search.metadata.plugins.validation:plugin: plugin doesn't appear to be installed in plugin path
ERROR:arcapix.search.metadata.plugins.validation:excel: field isn't registered
WARNING:arcapix.search.metadata.plugins.validation:PxS middleware must be restarted after a plugin is installed:
WARNING:arcapix.search.metadata.plugins.validation:    $ systemctl restart apsearch-middleware

ExcelPlugin :: server :: excel
❌ field isn't registered
plugin
❗ plugin doesn't appear to be installed in plugin path

No issues! (except that the plugin still isn’t installed).

Let’s increase the logging level to notify, so we can see what it’s extracting

$ APLOGLEVEL=notify searchctl plugins check $PLUGIN_PATH::ExcelPlugin /mmfs1/data/admin/expenses.xlsx
...
NOTIFY:arcapix.search.metadata.plugins.validation:Checking file "/mmfs1/data/admin/expenses.xlsx"
NOTIFY:arcapix.search.metadata.plugins.validation:File "/mmfs1/data/admin/expenses.xlsx" with extension ".xlsx" and mimetype "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet" is handled by ExcelPlugin
NOTIFY:arcapix.search.metadata.plugins.validation:Plugin process returned <PluginStatus.SUCCESS: 1>
NOTIFY:arcapix.search.metadata.plugins.validation:Plugin processing took 0.176973 seconds
NOTIFY:arcapix.search.metadata.plugins.validation:Extracted metadata:
{
    "sheets": [
        "Summary",
        "Expenses",
        "Mileage"
    ]
}

Looks good to me.

Review

Here’s the complete plugin, updated to include the extract implementation

from openpyxl import load_workbook

from arcapix.search.metadata.plugins.base import Plugin, PluginStatus
from arcapix.search.metadata.helpers import Metadata


EXTENSIONS = ['.xls', '.xlsx']
MIME_TYPES = [
    'application/vnd.ms-excel',
    'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet',
]


class ExcelPlugin(Plugin):

    def namespace(self):
        """Return the namespace for the metadata for this plugin."""
        return 'excel'

    def is_async(self):
        """Return whether this plugin does any operations asynchronously."""
        return False

    def handles(self, ext=None, mimetype=None):
        """Return whether the plugin can handle a given file, based on extension/mimetype

        :param ext: file extension, including a leading '.'
        :param mimetype: file mime type, e.g. image/png

        :returns: True if the plugin needs to process this file, false otherwise
        """
        return ext in EXTENSIONS or mimetype in MIME_TYPES

    def schema(self):
        """Return the schema for the metadata produced by this plugin.

        All metadata will be validated against this before inserting into the database.
        """
        return [
            {
                "name": "sheets",
                "prompt": "Names of sheets in this workbook",
                "value": {
                    "datatype": "[String]"
                }
            }
        ]

    def _extract(self, filename):
        """Private worker function to extract metadata from a file.

        :param filename: file to extract data from

        :return: dict of metadata
        """
        workbook = load_workbook(filename=filename)

        return {'sheets': workbook.sheetnames}

    def process(self, id_, file_, fileinfo=None):
        """Extract metadata from a given file and submit an update.

        :param id_: black box identifier which shall be passed to the metadata update functions
        :param file_: full path to the file
        :param fileinfo: dict of information which may have already been gathered and may be of use

        :returns: PluginStatus indicating success/failure
        """
        try:
            data = self._extract(file_)

            if not data:
                # didn't find any metadata
                return PluginStatus.SKIPPED

            if Metadata(id_, self).update(data):
                return PluginStatus.SUCCESS

            return PluginStatus.ERRORED

        except Exception as exc:
            self.logger.error(
                "An exception was raised while processing '%s' (%s): %s",
                file_, id_, exc
            )
            return PluginStatus.FATAL

Now that we’re happy that our plugin works, let’s install and enable it.

Enabling the plugin

To enable a plugin, first we symlink it into the enabled directory

$ mkdir -p /opt/arcapix/usr/share/apsearch/plugins/enabled/user
$ ln -s /opt/arcapix/usr/share/apsearch/plugins/available/user/excel_plugin.py /opt/arcapix/usr/share/apsearch/plugins/enabled/user
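
If you script your deployments, the same step can be done from Python. The sketch below is an assumption-laden equivalent of the two commands above, not a Search tool: enable_plugin is a hypothetical helper, and the demo runs against a temporary directory rather than the real plugin paths. Unlike a bare ln -s, it can be re-run safely.

```python
import os
import tempfile


def enable_plugin(available_path, enabled_dir):
    """Symlink a plugin file into enabled_dir; safe to call repeatedly."""
    os.makedirs(enabled_dir, exist_ok=True)
    link = os.path.join(enabled_dir, os.path.basename(available_path))

    if os.path.islink(link):
        if os.readlink(link) == available_path:
            return link  # already enabled
        os.unlink(link)  # stale link pointing somewhere else

    os.symlink(available_path, link)
    return link


if __name__ == "__main__":
    # Demo in a scratch directory that mimics the available/enabled layout.
    root = tempfile.mkdtemp()
    source = os.path.join(root, "available", "user", "excel_plugin.py")
    os.makedirs(os.path.dirname(source))
    open(source, "w").close()
    print(enable_plugin(source, os.path.join(root, "enabled", "user")))
```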

Next, as the plugin check keeps reminding us, we need to restart the Search server to register the plugin

$ systemctl restart apsearch-middleware

Registering the plugin updates the Search database with the plugin’s namespace and schema.

To check if it’s successfully installed, we can list all the installed plugins

$ searchctl plugins list
excel_plugin::ExcelPlugin    # <--- that's our plugin
gpfs::GPFSPolicyPlugin
image::ImagePlugin
stat::StatPlugin
...

And one last check - this time, we can use the short plugin name listed above

$ searchctl plugins check excel_plugin::ExcelPlugin
✔️ No issues found!

Perfect!

Perform an ingest

Now we can ingest some files using our plugin.

Let’s try ingesting our test file

$ APLOGLEVEL=info searchctl add-file /mmfs1/data/admin/expenses.xlsx --plugins excel_plugin::ExcelPlugin
...
INFO:arcapix.search.metadata.broker:Starting ingest of '/mmfs1/data/admin/expenses.xlsx' (3927216287062101581)
INFO:arcapix.search.metadata.broker:/mmfs1/data/admin/expenses.xlsx (3927216287062101581) default::DefaultPlugin SUCCESS
INFO:arcapix.search.metadata.broker:/mmfs1/data/admin/expenses.xlsx (3927216287062101581) excel_plugin::ExcelPlugin SUCCESS
NOTIFY:arcapix.search.metadata.broker:Ingested '/mmfs1/data/admin/expenses.xlsx' (3927216287062101581) with 2 plugins - 2 plugins were successful, 0 failed

We’re using the --plugins flag so the ingest only uses the excel plugin. (default::DefaultPlugin is a special, internal plugin, which always runs.)

Open up the PixStor Search UI, and verify that the excel metadata is present for our test file.

Now we want to ingest any other Excel files on the file system

$ searchctl ingest mmfs1 --include '*.xls,*.xlsx' --plugins excel_plugin::ExcelPlugin

Here, we’re using an --include to only match files with the extensions our plugin handles.
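
To preview which file names a pattern list like this would select, the sketch below applies shell-style globbing to a few names. This assumes --include uses glob semantics with comma-separated patterns; the guide does not spell out the exact matching rules, and included is a hypothetical helper, not part of searchctl.

```python
from fnmatch import fnmatchcase

INCLUDE = '*.xls,*.xlsx'


def included(filename, include=INCLUDE):
    """Return True if filename matches any comma-separated glob pattern."""
    patterns = [p.strip() for p in include.split(',') if p.strip()]
    return any(fnmatchcase(filename, pattern) for pattern in patterns)


if __name__ == "__main__":
    for name in ['expenses.xlsx', 'old.xls', 'notes.txt', 'macro.xlsm']:
        print(name, included(name))
```

Under these semantics '*.xls' does not match 'expenses.xlsx', which is why both patterns are listed.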

If you have scheduled ingests enabled, and Search is in ‘rich’ mode, then the plugin should be used automatically the next time the ingest is run.

Adding more metadata

Suppose we want to extract more metadata with the plugin - let’s say, whether the spreadsheet is read-only. We simply add the new field to the schema, and update the _extract method

def schema(self):
    return [
        {
            "name": "sheets",
            "prompt": "Names of sheets in this workbook",
            "value": {
                "datatype": "[String]"
            }
        },
        {
            "name": "readonly",
            "prompt": "Indicates whether the workbook is read-only",
            "value": {
                "datatype": "Boolean"
            }
        }
    ]

def _extract(self, filename):
    workbook = load_workbook(filename=filename)

    return {
        'sheets': workbook.sheetnames,
        'readonly': workbook.read_only,
    }

Check the new metadata

$ APLOGLEVEL=notify searchctl plugins check excel_plugin::ExcelPlugin /mmfs1/data/admin/expenses.xlsx
...
ERROR:arcapix.search.metadata.plugins.validation:readonly: field isn't registered
WARNING:arcapix.search.metadata.plugins.validation:PxS middleware must be restarted after a plugin is installed:
WARNING:arcapix.search.metadata.plugins.validation:    $ systemctl restart apsearch-middleware
NOTIFY:arcapix.search.metadata.plugins.validation:Checking file "/mmfs1/data/admin/expenses.xlsx"
NOTIFY:arcapix.search.metadata.plugins.validation:File "/mmfs1/data/admin/expenses.xlsx" with extension ".xlsx" and mimetype "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet" is handled by ExcelPlugin
NOTIFY:arcapix.search.metadata.plugins.validation:Plugin process returned <PluginStatus.SUCCESS: 1>
NOTIFY:arcapix.search.metadata.plugins.validation:Plugin processing took 0.318276 seconds
NOTIFY:arcapix.search.metadata.plugins.validation:Extracted metadata:
{
    "readonly": false,
    "sheets": [
        "Summary",
        "Expenses",
        "Mileage"
    ]
}
NOTIFY:arcapix.search.metadata.plugins.validation:Some issues were found with ExcelPlugin

ExcelPlugin :: server :: excel :: readonly
❌ field isn't registered

Once again, we have to restart the Search server to register the new field

$ systemctl restart apsearch-middleware
$ APLOGLEVEL=notify searchctl plugins check excel_plugin::ExcelPlugin /mmfs1/data/admin/expenses.xlsx
...
NOTIFY:arcapix.search.metadata.plugins.validation:Extracted metadata:
{
    "readonly": false,
    "sheets": [
        "Summary",
        "Expenses",
        "Mileage"
    ]
}
✔️ No issues found!

Adding new metadata is as easy as above.

Changing or removing existing metadata is hard. It requires either rebuilding the DB from scratch, or reindexing.

Summary

In this walk-through we developed a plugin which extracts metadata from Excel files.

We used searchctl plugins check to validate our progress as we went along.

Finally, we installed and enabled our plugin, and ingested some files with it.

For a discussion of more advanced topics, including developing plugins for generating proxies and submitting asynchronous jobs, see the Plugin Development Overview and Asynchronous Proxy Plugin Walk-Through.

Appendix: Complete Plugin

Here is the complete ExcelPlugin we developed over the course of this walk-through

from openpyxl import load_workbook

from arcapix.search.metadata.plugins.base import Plugin, PluginStatus
from arcapix.search.metadata.helpers import Metadata


EXTENSIONS = ['.xls', '.xlsx']
MIME_TYPES = [
    'application/vnd.ms-excel',
    'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet',
]


class ExcelPlugin(Plugin):

    def namespace(self):
        """Return the namespace for the metadata for this plugin."""
        return 'excel'

    def is_async(self):
        """Return whether this plugin does any operations asynchronously."""
        return False

    def handles(self, ext=None, mimetype=None):
        """Return whether the plugin can handle a given file, based on extension/mimetype

        :param ext: file extension, including a leading '.'
        :param mimetype: file mime type, e.g. image/png

        :returns: True if the plugin needs to process this file, false otherwise
        """
        return ext in EXTENSIONS or mimetype in MIME_TYPES

    def schema(self):
        """Return the schema for the metadata produced by this plugin.

        All metadata will be validated against this before inserting into the database.
        """
        return [
            {
                "name": "sheets",
                "prompt": "Names of sheets in this workbook",
                "value": {
                    "datatype": "[String]"
                }
            },
            {
                "name": "readonly",
                "prompt": "Indicates whether the workbook is read-only",
                "value": {
                    "datatype": "Boolean"
                }
            }
        ]

    def _extract(self, filename):
        """Private worker function to extract metadata from a file.

        :param filename: file to extract data from

        :return: dict of metadata
        """
        workbook = load_workbook(filename=filename)

        return {
            'sheets': workbook.sheetnames,
            'readonly': workbook.read_only,
        }

    def process(self, id_, file_, fileinfo=None):
        """Extract metadata from a given file and submit an update.

        :param id_: black box identifier which shall be passed to the metadata update functions
        :param file_: full path to the file
        :param fileinfo: dict of information which may have already been gathered and may be of use

        :returns: PluginStatus indicating success/failure
        """
        try:
            data = self._extract(file_)

            if not data:
                # didn't find any metadata
                return PluginStatus.SKIPPED

            if Metadata(id_, self).update(data):
                return PluginStatus.SUCCESS

            return PluginStatus.ERRORED

        except Exception as exc:
            self.logger.error(
                "An exception was raised while processing '%s' (%s): %s",
                file_, id_, exc
            )
            return PluginStatus.FATAL