Plugin Development Tutorial

PixStor Search provides stock plugins for specific well-known data types, which can extract deep metadata from those files, and generate both thumbnail and larger size proxies.

However, a custom data type may also need to be analysed. In such instances a custom plugin can be added by the end customer’s developers.

Development Process

Plugins are written in Python (http://www.python.org/).

There are two types of plugin:

Custom Metadata Plugins require the developer to:

  • Define a schema for the plugin, and the types of files it handles
  • Define the namespace for the metadata (which is used to ensure metadata with similar names does not clash)
  • Decide on a priority, if it is important that the plugin runs before/after others.
  • Extract the required information to fulfill that schema using whatever approach is convenient for them
  • Use the Plugin helper library to submit that information to the broker for processing

Custom Proxy generation plugins require the developer to:

  • Define the types of files the plugin can make proxies for (see Schemas)
  • Build the proxy using suitable tools, submit a processing function to be executed asynchronously, or submit to an external job manager
  • Use the plugin helper library to update the system about the location of the proxy file - this will then be moved into the Proxy Store.

Warning

It is important to note that the proxy ingest process moves the proxy that the plugin has generated. If you wish to retain a copy of the generated proxy, you must make that copy before submitting it for ingest.

Definition

Plugins must inherit from the arcapix-supplied “Plugin” class, located in the arcapix.search.metadata.plugins.base package.

Discovery

Developers should install plugins into the directory configured via the arcapix.search.metadata.plugins.path config setting (Default: /opt/arcapix/usr/share/apsearch/plugins).

The Broker will examine all files in that directory for classes which are derived from the Plugin base class, and add them to the processing queue. Plugins with a name starting with an underscore will be ignored.

This process happens every time the broker starts, i.e. when the next scheduled run is performed - this may happen manually via the CLI tool, or automatically based on a recurring incremental ingest.

Schemas

Overview

The schema definition is a list of Python dictionaries, and is expected to be constant for any given version of a plugin. It is read once, when a file pool is built.

The schema can change between versions of a plugin, but note that depending on the change, data may need to be deleted and re-ingested. It is therefore recommended that careful consideration be given to the design of the schema at the outset.

Data Types

The following PixStor Search Schema datatypes are supported:

Name       Description                                Example
integer    Integer less than 2^31-1                   42
long       Long integers of non-limited length        419430462003
float      A floating point number                    12.6
double     A double precision floating point number   3.141592653589793…
string     Human-readable text                        QuickTime
datetime   A value representing the date and time     2017-06-14T09:39:30
uri        A URL to the file                          http://my.server/myproject/myfile.ext
geopoint   A location as a longitude-latitude pair    [-0.4683333, 51.760333]
[X]        A set of values of type X                  red,green,blue
proxy      Indicates that the field is a proxy        N/A
remote     A URL to an external resource              /tagbox/similar?id=1

Data type should be chosen carefully. If you set a field as ‘integer’ then decide you need ‘long’ instead, the type can’t be changed without purging the database first and rebuilding from scratch.

By utilising a String field, and a valid_values constraint, one can implement the “Enumeration” concept found in many programming languages.
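As a rough illustration of the enumeration idea, the sketch below declares a String field restricted to three values. It assumes that valid_values is given as a list inside the field's "value" dictionary, next to datatype. The field name colourspace and the helper is_allowed are made up for this example and are not part of the plugin library.

```python
# Hypothetical schema entry: a String field restricted to a fixed set of values.
# Assumption: "valid_values" sits next to "datatype" inside the "value" dict.
COLOURSPACE_FIELD = {
    "name": "colourspace",
    "prompt": "Colour space",
    "value": {
        "datatype": "String",
        "valid_values": ["sRGB", "AdobeRGB", "CMYK"],
    },
}


def is_allowed(field, candidate):
    """Return True if candidate is permitted by the field's valid_values (if any)."""
    allowed = field["value"].get("valid_values")
    return allowed is None or candidate in allowed
```

A plugin could run a check like this before submitting, so that an unexpected value is dropped rather than causing the whole update to be rejected.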

Note

Fields that contain a set of values are unordered, and duplicate values are ignored.
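The behaviour described in this note matches that of a Python set, so one simple way to prepare such a value is to pass the extracted items through set() first. This is only a sketch: normalise_keywords is a hypothetical helper, and returning a sorted list is a choice made here so that the output is repeatable, not something the search service is known to require.

```python
def normalise_keywords(raw_keywords):
    """Drop duplicate keywords; the original order is not preserved."""
    return sorted(set(raw_keywords))


keywords = normalise_keywords(["red", "green", "blue", "red"])
```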

Note

Datetime can be a Python datetime object or an ISO-8601 formatted date string.
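Both accepted forms can be produced with the standard library. The short sketch below builds the datetime from the table above and converts it to its ISO-8601 string; the variable names are illustrative only.

```python
from datetime import datetime

captured = datetime(2017, 6, 14, 9, 39, 30)

# Either form is described as acceptable for a datetime field
as_object = captured
as_string = captured.isoformat()  # '2017-06-14T09:39:30'

# The string form can be turned back into a datetime if needed
round_trip = datetime.strptime(as_string, "%Y-%m-%dT%H:%M:%S")
```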

Note

Geopoints follow the GeoJSON format. In particular, this means a list of [longitude, latitude], in that specific order.
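Since many sources (EXIF GPS tags, for instance) report latitude first, it is easy to submit the pair the wrong way round. A small helper such as the following can make the ordering explicit. The name to_geopoint and the range checks are assumptions made for this sketch, not part of the plugin library.

```python
def to_geopoint(latitude, longitude):
    """Convert a (latitude, longitude) pair to GeoJSON order: [longitude, latitude]."""
    if not -90.0 <= latitude <= 90.0:
        raise ValueError("latitude out of range: %r" % (latitude,))
    if not -180.0 <= longitude <= 180.0:
        raise ValueError("longitude out of range: %r" % (longitude,))
    return [longitude, latitude]


point = to_geopoint(latitude=51.760333, longitude=-0.4683333)
```

Using keyword arguments at the call site, as above, makes it harder to swap the two values by accident.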

A Remote type field should provide a link to some external resource - this could be, for example, a REST API, or some foreign database. Performing a GET request on the provided URL should either return a Collection+JSON formatted collection of items, or else redirect to some PixStor Search query. For example, the external resource may come up with a collection of item ids, and return a redirect to a query matching those ids.

Augmenting Data Types

As well as specifying the datatype, the value of a metadata field can also define other expected attributes of the data such as list of valid values, or the character encoding/language.

Providing these may improve the efficiency and/or usability of querying or lead to a better UI experience. For example, one can specify default_filter=False to indicate that a particular property should not be presented in the dynamic guided search links.

Remote type fields can also provide a ‘hint’ entry - this will be passed through into the output from search, and can be used to describe the structure of the data returned by the link. For example, using “hint” : “collection” will cause the search user interface to assume the returned data is in the same format as it expects for all other search results, and render a relevant search refinement button.
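To make the two attributes mentioned in this section more concrete, here is a hypothetical schema fragment that uses both. It assumes that default_filter and hint are placed inside the "value" dictionary alongside datatype. The field names checksum and similar are invented for the example.

```python
# Hypothetical schema fragment showing extra attributes next to "datatype".
# Assumption: both "default_filter" and "hint" live inside the "value" dict.
SCHEMA = [
    {
        "name": "checksum",
        "prompt": "File checksum",
        "value": {
            "datatype": "String",
            "default_filter": False,  # keep out of the guided search links
        },
    },
    {
        "name": "similar",
        "prompt": "Similar items",
        "value": {
            "datatype": "Remote",
            "hint": "collection",  # link returns data shaped like search results
        },
    },
]
```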

Example

Digital images have a width, height and a megapixel size. As such we could choose to define a schema for a digital image as follows:

[{
   "name": "height",
   "prompt": "Image height",
   "value": {
       "datatype": "Integer"
       }
   },
   {
   "name": "width",
   "prompt": "Image width",
   "value": {
       "datatype": "Integer"
       }
   },
   {
   "name": "megapixels",
   "prompt": "Image megapixels",
   "value": {
       "datatype": "Float"
       }
   }]

Here we have defined three metadata attributes. Each metadata attribute has

  • name
  • prompt
  • value: datatype

For images, width and height are typically integer numbers and the megapixels is typically floating point. Each metadata attribute also has an associated prompt which provides a human-readable label for display in the PixStor Search UI.
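A metadata dictionary matching this schema might be assembled as in the sketch below. The helper image_metadata is hypothetical, and it assumes that a megapixel is taken to be one million pixels.

```python
def image_metadata(width, height):
    """Build a metadata dict for the height, width and megapixels fields."""
    return {
        "width": int(width),                         # Integer field
        "height": int(height),                       # Integer field
        "megapixels": (width * height) / 1000000.0,  # Float field
    }


data = image_metadata(4000, 3000)
```

The keys of the dictionary are the "name" entries from the schema, and each value has the Python type corresponding to the declared datatype.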

Nesting

The schema can include nested fields. This is constructed as a field name as the key, and a list of dicts as the value - e.g.

{"user": [
    {
    "name": "name",
    "prompt": "User name",
    "value": {
        "datatype": "String"
        }
    },
    {
    "name": "id",
    "prompt": "User id",
    "value": {
        "datatype": "Integer"
        }
    }
]}

This is best utilised for fields which can be represented in different formats, such as above, where we have ‘user’ in a name format and numerical id format.

Nesting can also be used for grouping a small number of closely related properties, where a separate namespace doesn’t make sense - e.g. lens.make and lens.model

Note - nesting shouldn’t be used for creating sub-namespaces - e.g. image.jpeg... In this case, it would be better to use just jpeg as the namespace.

Fields shouldn’t be nested deeper than two levels - i.e. namespace.field.subfield
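The naming that results from nesting can be illustrated with a short sketch that walks a schema and lists the dotted names it would produce. It assumes that a nested group appears as one element of the schema list, in the {"group": [fields]} form shown above. The function field_names and the camera schema are made up for this example.

```python
def field_names(namespace, schema):
    """Yield dotted names such as 'camera.lens.make' for every field in a schema."""
    for entry in schema:
        if "name" in entry:
            # An ordinary field: {"name": ..., "prompt": ..., "value": {...}}
            yield "%s.%s" % (namespace, entry["name"])
        else:
            # A nested group: {"group_name": [field, field, ...]}
            for group, subfields in entry.items():
                for sub in subfields:
                    yield "%s.%s.%s" % (namespace, group, sub["name"])


CAMERA_SCHEMA = [
    {"name": "focal_length", "prompt": "Focal length", "value": {"datatype": "Float"}},
    {"lens": [
        {"name": "make", "prompt": "Lens make", "value": {"datatype": "String"}},
        {"name": "model", "prompt": "Lens model", "value": {"datatype": "String"}},
    ]},
]

names = list(field_names("camera", CAMERA_SCHEMA))
```

No name in the result has more than three parts, which matches the two-level limit described above.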

A note on Taxonomy/Units

To make the system most useful, it is important to ensure that the taxonomy, and the units used, are consistent both within and between plugins. Examples of inconsistency would be a single field called “Image Size” in one plugin versus two fields called “Width” and “Height” in another, or a File Size field that is in bytes in one place and in KBytes in another.

A future version will include tools to help highlight and resolve these sorts of issues.

Namespace, handles, priority, offline, and async

Namespace

The namespace is a string, which must contain only alpha-numeric characters and underscores. It is used for grouping/name-clash resolution.

def namespace(self):
    return "image"

Namespaces can be unique per plugin, or shared across multiple plugins, e.g. one plugin may do basic extraction in a namespace, and another may perform more detailed extraction within the same namespace.
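The character rule for namespaces is easy to check before a plugin is deployed. The sketch below does so with a regular expression. It assumes "alpha-numeric" means ASCII letters and digits, and is_valid_namespace is a hypothetical helper rather than part of the plugin library.

```python
import re

# Assumption: only ASCII letters, digits and underscores are allowed
NAMESPACE_PATTERN = re.compile(r"[A-Za-z0-9_]+")


def is_valid_namespace(namespace):
    """Return True if namespace contains only alphanumerics and underscores."""
    return NAMESPACE_PATTERN.fullmatch(namespace) is not None
```

For example, "image" and "my_plugin2" pass this check, while "image.jpeg" and "my-plugin" do not.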

Namespace Advice

Namespaces can be considered hierarchically - for example

common - image - camera - canon

You should choose the namespace most appropriate for each given metadata field

As examples:

  • there’s no meaningful difference between the image width of an image and of a video, so both would use the image namespace.
  • focal length is a feature of a camera, so would use the camera namespace
  • camera properties which are specific to Canon cameras would use the canon namespace - Note: different cameras may offer the same properties under different names/formats. In this case those fields should be standardised and put under the camera namespace
  • copyright could apply to any filetype, so would use the common namespace
  • creation time can have different meanings depending on context - for example there is the time at which a photo was captured, and there’s the time at which the image was created on the filesystem. These two values would go under different namespaces - image and gpfs
  • lenses have multiple properties, but they are relevant to all cameras. So rather than having a lens namespace, they could be nested fields within the camera namespace - e.g. camera: [lens.make, lens.model]

Handles

Plugins must declare a function which is called by the broker to determine if it can handle a given file. The function has access to both a file’s mimetype (which the broker will already have determined via heuristics) and its file extension.

For example, an image metadata plugin might include the following:

def handles(self, ext=None, mimetype=None):
    return (mimetype and mimetype.startswith("image/")
            or ext in ('.dpx', '.exr'))  # uncommon file types

If the broker is unable to determine a file’s mimetype (e.g. if the file can’t be read) its value will be None.

For uncommon file types, the mimetype will often be returned as application/octet-stream. In these cases it helps to specify extensions to match instead.

The file extension will be passed as a string starting with a period - e.g. .jpg

Note - handles shouldn’t be made too broad. If a plugin receives files it can’t actually handle, it will likely result in errors.
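Because the matching logic is plain Python, it can be exercised outside the broker before a plugin is installed. The sketch below is a stand-alone copy of the logic from the example above, together with a hypothetical helper extension_of that shows how an extension with a leading period can be derived from a path.

```python
import os


def handles(ext=None, mimetype=None):
    """Stand-alone version of the handles() logic, for testing outside a plugin."""
    if mimetype and mimetype.startswith("image/"):
        return True
    # uncommon file types are matched on extension instead
    return ext in (".dpx", ".exr")


def extension_of(path):
    """Return the extension with its leading period, e.g. '.jpg'."""
    return os.path.splitext(path)[1]
```

Calling handles(ext=".exr", mimetype="application/octet-stream") returns True, whereas handles(ext="exr") returns False because the leading period is missing.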

Priority

Plugins can define a priority between -1000 and +1000. All the shipped plugins have a priority between -100 and +100. If your plugin has no requirement to be called in any particular order, declare the priority as 0.

def priority(self):
    return 0

A low priority can be used to make a plugin run later in the queue, thus allowing it to (potentially) overwrite any existing metadata with the same field name. For example, there might be a plugin which does generic metadata extraction for all text type files (word count, etc.), and another which extracts some of the same metadata for html files specifically.
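The effect of priority on overlapping fields can be sketched as follows. This is not how the broker is implemented. It only illustrates the ordering described above, using hypothetical extractor functions and made-up values.

```python
def run_in_priority_order(plugins, filename):
    """Run extractors from highest to lowest priority and merge their output.

    plugins is a list of (priority, extract_function) pairs. Later results
    overwrite earlier ones when field names collide.
    """
    merged = {}
    for _priority, extract in sorted(plugins, key=lambda p: p[0], reverse=True):
        merged.update(extract(filename))
    return merged


def generic_text(filename):
    # Stand-in for a generic text plugin (priority 0)
    return {"word_count": 120, "encoding": "utf-8"}


def html_specific(filename):
    # Stand-in for an html plugin (priority -10), e.g. counting visible text only
    return {"word_count": 95}


result = run_in_priority_order([(-10, html_specific), (0, generic_text)], "index.html")
```

Here the html extractor runs last because of its lower priority, so its word_count replaces the generic one, while the encoding field from the generic extractor is kept.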

Offline

You can indicate that a plugin can handle offline files by having the offline method return True.

If a file is offline/migrated, trying to read it while extracting metadata will recall it. This can be expensive. For this reason, if a file is offline, it will only be passed to plugins which specify offline as True.

Offline plugins should not attempt to open offline files. PixStor Search will attempt to catch some calls to open(), but care should be taken by users as this won’t cover all methods for reading a file.

In particular, anything that reads a file via subprocess won’t be caught. To remedy this, use the execute method from arcapix.search.metadata.utils

Note

offline files will have a mimetype of None, so the handles method for offline plugins should either handle all files, or else be able to discriminate based on extension.
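A minimal sketch of the two methods for an offline-capable plugin is given below. The class is a hypothetical stand-in that does not inherit from the real Plugin base class, so that it can be run on its own; the extensions are chosen purely for illustration.

```python
class OfflineAwareExample(object):
    """Hypothetical plugin fragment; a real plugin would inherit from Plugin."""

    def offline(self):
        # Declare that offline/migrated files may be passed to this plugin
        return True

    def handles(self, ext=None, mimetype=None):
        # mimetype is None for offline files, so match on extension only
        return ext in (".mov", ".mxf")


plugin = OfflineAwareExample()
```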

Async

A plugin should declare whether it intends to perform any operations asynchronously, i.e. whether the plugin may return before having completed everything it needs to do.

This doesn’t mean that any particular invocation will do so, but by declaring that no asynchronous processing will occur, the database ingest speed can be optimised.

For example, a video proxy generating plugin would do:

def async(self):
    return True

Extraction

Assuming the plugin has declared it will handle the file, the plugin’s process() function will be called. This function has three parameters: a “black box” identifier for the ingest operation, the name of the file, and a dictionary of information which has already been extracted and may be useful (for example, mimetype, size, access permissions).

This identifier must be passed to subsequent invocations of the helper library functions.

The exact way the extraction or proxy generation happens is entirely up to the plugin writer. However, it should be remembered that any external dependencies must be installed on all nodes which may perform processing, not just the node that the operation was initiated on.

The process function returns one of 5 constants:

Constant     Meaning
FATAL        an error occurred while extracting metadata
ERRORED      an error occurred while sending metadata to the database
SUCCESS      this plugin completed correctly
SKIPPED      this plugin was unable to extract any metadata from the file
INPROGRESS   this plugin submitted one or more operations asynchronously
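How an inline plugin might choose between these constants can be sketched as below. The PluginStatus class here is a local stand-in with arbitrary values, defined only so that the example runs on its own. The real constants come from the plugin library, and choose_status is a hypothetical helper.

```python
from enum import Enum


class PluginStatus(Enum):
    """Local stand-in for the library's constants (values here are arbitrary)."""
    FATAL = "fatal"
    ERRORED = "errored"
    SUCCESS = "success"
    SKIPPED = "skipped"
    INPROGRESS = "inprogress"


def choose_status(extract, submit):
    """Map the outcome of extraction and submission to a status constant."""
    try:
        data = extract()
    except Exception:
        return PluginStatus.FATAL      # extraction itself failed
    if not data:
        return PluginStatus.SKIPPED    # nothing could be extracted
    if submit(data):
        return PluginStatus.SUCCESS
    return PluginStatus.ERRORED        # submission to the database failed
```

INPROGRESS is not returned by this helper because it only applies to plugins that hand work off asynchronously.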

Async

By default, plugin processing is done ‘inline’. If a plugin needs to do particularly heavy processing - such as transcoding a video, or extracting text from a large pdf file - the processing can be done asynchronously instead.

Asynchronous processing can be done by any means, but the Plugin interface provides a _submit method which allows you to submit a processing function for asynchronous processing via a job engine (HTCondor by default).

def process(self, id_, file_, fileinfo=None):

    self._submit(processing_function, args=[id_, file_])
    return PluginStatus.INPROGRESS  # the submitted work has not completed yet

Here we pass some python processing function to _submit along with a list of arguments for the function.

Note - the async processing function must include appropriate calls to submit the extracted metadata/generated proxies (see below).

If a plugin does asynchronous processing, its async method must return True.

Submission

Metadata

Once metadata has been extracted, one of two helper objects is used to tell the broker about the metadata which has been generated. The metadata will first be validated against the declared schema, before being passed to the database.

Note - make sure the extracted metadata matches the defined schema. Fields of the wrong type or fields not specified in the schema will cause the metadata to be rejected.
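A plugin author may find it useful to run a similar check locally before submitting. The sketch below is not the validator used by the broker. It assumes a simple mapping from datatype names to Python types, only looks at top-level fields, and find_problems is a hypothetical helper.

```python
# Assumed mapping from schema datatype names to Python types (illustrative only)
PYTHON_TYPES = {
    "Integer": int,
    "Long": int,
    "Float": float,
    "Double": float,
    "String": str,
}


def find_problems(schema, data):
    """Return a list of human-readable problems; an empty list means data fits."""
    expected = {f["name"]: f["value"]["datatype"] for f in schema}
    problems = []
    for key, value in data.items():
        if key not in expected:
            problems.append("%s: not in schema" % key)
            continue
        # Datatypes missing from the mapping are not checked
        python_type = PYTHON_TYPES.get(expected[key], object)
        if not isinstance(value, python_type):
            problems.append("%s: expected %s" % (key, expected[key]))
    return problems


SCHEMA = [
    {"name": "height", "prompt": "Image height", "value": {"datatype": "Integer"}},
    {"name": "megapixels", "prompt": "Image megapixels", "value": {"datatype": "Float"}},
]
```

With this schema, {"height": "100"} is reported as having the wrong type and {"depth": 8} as not being in the schema.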

Not all fields specified in the schema need to be extracted. For example, not all jpg files have exif data. Such files can simply omit those fields. It’s better to omit a missing field than post a null value (e.g. None, ‘’, ‘n/a’).
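One way to follow this advice is to filter the extracted dictionary just before it is submitted. In the sketch below, drop_empty is a hypothetical helper, and the set of placeholder values is taken from the examples in the paragraph above.

```python
EMPTY_VALUES = (None, "", "n/a")


def drop_empty(data):
    """Return a copy of data without fields whose value is a null placeholder."""
    return {key: value for key, value in data.items() if value not in EMPTY_VALUES}


cleaned = drop_empty({"width": 100, "artist": "", "copyright": None, "rotation": 0})
```

Note that a genuine value of 0 is kept, since only the listed placeholders are treated as missing.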

It is worth noting that the database is only updated lazily.

The helper must be passed the “blackbox” id, as well as the plugin object.

def process(self, id_, file_, fileinfo=None):
    '''
    Extracts metadata from the files, and calls relevant helper functions
    to submit it to the search service
    @param id_ Black box identifier which must be passed to helper operations
    @param file_ Full Pathname to the file
    @param fileinfo Structure containing already extracted metadata which may be useful.
                     It should not be assumed that any particular item is included
    @return PluginStatus One of SUCCESS, FATAL, ERRORED, SKIPPED,
                                INPROGRESS (only for plugins which declare async() as true)
    '''
    try:
        data = { "width" : 100 } # In reality, metadata would be generated using whatever
                                 # specialist interrogation techniques are appropriate
        if Metadata(id_, self).update(data):
            return PluginStatus.SUCCESS
        return PluginStatus.ERRORED
    except Exception:
        return PluginStatus.FATAL

The update call returns True or False depending on whether or not the update operation was successful.

Note

Your plugins should catch any exception and return status FATAL, as raised exceptions will cause ingest to stop completely.

Proxy

A proxy plugin will create some sort of “proxy” of the original source file. This proxy will typically be reduced in size, or easier to render in a user interface, compared to the original.

In particular, the PixStor Search UI expects two special proxies for each file:

  • thumbnail: this is a smaller proxy - typically an image or animated gif - which appears in the UI thumbnail view. Its size should be that defined in the arcapix.search.proxies.thumbnail.size config (default: 150x150)
  • preview: this is a larger proxy - an image, or even a video or audio file - which is displayed in the UI preview pane. For images and videos, the size should be that defined in the arcapix.search.proxies.preview.size config (default: 400x300)
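When generating either proxy, the target dimensions have to be worked out from the configured size. The sketch below assumes the setting is a string of the form WIDTHxHEIGHT, as the defaults above suggest, and that the source aspect ratio should be preserved. Both parse_size and fit_within are hypothetical helpers, not part of the plugin library.

```python
def parse_size(setting):
    """Parse a 'WIDTHxHEIGHT' string such as '150x150' into a (width, height) tuple."""
    width, height = setting.lower().split("x")
    return int(width), int(height)


def fit_within(source, box):
    """Return the largest (w, h) with the source's aspect ratio that fits in box."""
    src_w, src_h = source
    box_w, box_h = box
    if src_w * box_h >= src_h * box_w:
        # Width is the limiting dimension
        return box_w, max(1, src_h * box_w // src_w)
    # Height is the limiting dimension
    return max(1, src_w * box_h // src_h), box_h
```

For a 1920x1080 source and the default preview size, fit_within((1920, 1080), parse_size("400x300")) gives (400, 225). Integer arithmetic is used throughout to avoid rounding surprises.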

The workflow for a proxy plugin is very similar to that for a Metadata plugin, with the exception of the helper functions. As mentioned above, note that the proxy object will be MOVED into the proxy store - therefore, if the proxy is required elsewhere, a copy should be kept.

Tip: This can be done with potentially zero space usage by creating a hard-link to the file prior to ingest
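The hard-link approach from this tip can be sketched with os.link. In the example below, keep_copy is a hypothetical helper, and shutil.move is used as a stand-in for the move performed by the ingest step. Hard links only work when both paths are on the same filesystem.

```python
import os
import shutil
import tempfile


def keep_copy(proxy_path, keep_path):
    """Hard-link proxy_path to keep_path so the data survives the ingest move.

    Both paths must be on the same filesystem for os.link to succeed.
    """
    os.link(proxy_path, keep_path)
    return keep_path


# Demonstration in a scratch directory; shutil.move stands in for the ingest step
workdir = tempfile.mkdtemp()
proxy = os.path.join(workdir, "thumb.png")
with open(proxy, "wb") as handle:
    handle.write(b"proxy bytes")

kept = keep_copy(proxy, os.path.join(workdir, "thumb-kept.png"))
shutil.move(proxy, os.path.join(workdir, "ingested.png"))
still_there = os.path.exists(kept)
shutil.rmtree(workdir)
```

The link must be created before the proxy is submitted, because the original path no longer exists once the proxy has been moved into the store.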

In order to improve clarity, the method on the helper is called ingest() rather than update()

Following is an example of proxy generation using async functionality.

Note - proxy generation doesn’t have to be performed asynchronously. If it’s relatively lightweight, it can be performed inline.

def _process_async(self, id_, source_path, thumbnail_size):
    # Note: This function is not part of the defined interface specification for a plugin,
    # and is being used to demonstrate how async operations can be used
    '''
    Make the proxy and add it to the proxy store/db. Triggered asynchronously by process()
    @param id_ Black box identifier for the item
    @param source_path Source file path
    @param thumbnail_size Required size of the thumbnail
    @return Nothing - Exceptions are raised in the case of errors
    '''
    proxy_filename = self._make_proxy(source_path, thumbnail_size) # Function to create the proxy

    try:
        # Insert the proxy into the store. Note properties are the filename of the proxy,
        # the proxy type, a name which the proxy should have in the store (in case the proxy
        # has been named to some random name), and the mime-type of the proxy.
        Proxy(id_, None).ingest(proxy_filename, 'preview', 'preview.mp4', 'video/mp4')
    finally:
        # Normally the proxy will have been removed, but tidy up in case of unexpected errors
        if os.path.exists(proxy_filename):
            os.remove(proxy_filename)


def process(self, id_, file_, fileinfo=None):
    '''
    Trigger generation of downsized version of the video for previewing.

    @param id_ Black box identifier to be passed to subsequent ingest operations
    @param file_ Name of the source file to proxy
    @param fileinfo Potentially useful information
    '''
    try:
        # the _submit method is defined on the Plugin base class
        self._submit(self._process_async, args=[id_, file_, (400, 300)])
        # Note the return of INPROGRESS here - to indicate that the operation has
        # not been completed
        return PluginStatus.INPROGRESS
    except Exception:
        # Typically logging or other operations would be performed here.
        return PluginStatus.FATAL

Examples

Metadata Plugin

Note - there is no technical reason why a plugin cannot perform both Metadata and Proxy generation operations (assuming the same namespace), however this is not supported.

class SampleImagePlugin(Plugin):

    def namespace(self):
        '''
        Returns the namespace for the metadata for this plugin

        @return String
        '''
        return 'image'

    def async(self):
        '''
        Returns whether this plugin does any operations asynchronously

        @return False always (for this plugin)
        '''
        return False

    def handles(self, ext=None, mimetype=None):
        '''
        Return whether the plugin can handle a given file, based on extension/mimetype
        @param ext File extension (includes a leading '.')
        @param mimetype Mime type, e.g. image/png
        @return True if the plugin needs to process this file, false otherwise
        '''
        if mimetype and mimetype.startswith("image/"):
            return True
        # some uncommon images are identified as
        # application/octet-stream
        return ext in ('.dpx', '.exr')

    def schema(self):
        '''
        Returns the schema for the metadata produced by this plugin. All metadata will be validated
        against this before inserting

        @return Python nested data structure according to the schema definition format
        '''
        return [
        {
        "name": "height",
        "prompt": "Image height",
        "value": {
            "datatype": "Long"
            }
        },
        {
        "name": "width",
        "prompt": "Image width",
        "value": {
            "datatype": "Long"
            }
        },
        {
        "name": "megapixels",
        "prompt": "Image megapixels",
        "value": {
            "datatype": "Float"
            }
        }
    ]

    def _extract(self, filename):
        '''
        Private worker function to extract metadata from image files using EXIF
        @param filename File to work from
        @return dict structure conforming to the defined schema
        '''
        exif = get_exiftool_data(filename)
        data = {"height": exif['ImageHeight'],
                "width": exif['ImageWidth'],
                "megapixels": exif['Megapixels']}
        return data

    def process(self, id_, file_, fileinfo=None):
        '''
        Extract metadata and submit an update for a given file.

        @param id_ A black box identifier which shall be passed to the metadata update functions
        @param file_ The full path to the file
        @param fileinfo Information which may have already been gathered and may be of use. (Not used in this plugin)
        @return Constant indicating success/failure (Cannot return INPROGRESS as async() is False)
        '''
        try:
            data = self._extract(file_)
            if Metadata(id_, self).update(data):
                return PluginStatus.SUCCESS
            return PluginStatus.ERRORED
        except Exception:
            return PluginStatus.FATAL

Proxy Plugin

import os
from tempfile import NamedTemporaryFile


class SampleImageThumbnail(Plugin):

    def namespace(self):
        '''
        Returns namespace for the plugin. Note this may overlap with other plugins
        @return String of namespace
        '''
        return 'image'

    def handles(self, ext=None, mimetype=None):
        '''
        Return whether the plugin can handle a given file, based on extension/mimetype
        @param ext File extension (includes a leading '.')
        @param mimetype Mime type, e.g. image/png
        @return True if the plugin needs to process this file, false otherwise
        '''
        return ((mimetype or '').startswith('image/')
            or ext in ('.dpx', '.tga'))  # uncommon formats

    def schema(self):
        '''
        Return schema definition for a plugin
        @return Python dict for thumbnail structure
        '''
        return [{
            "name": "thumbnail",
            "prompt": "Thumbnail image",
            "value": {
                "datatype": "Proxy"
            }
        }]

    def generate_temp_filename(self, *args, **kwargs):
        '''
        Return a temporary file name that can be used to write the proxy to.
        '''
        #NB. In production, this would be more complex, since we would aim to
        #place the temporary file on the same filesystem as the proxy store
        with NamedTemporaryFile(*args, **kwargs) as f:
            return f.name

    def _make_proxy(self, source_path):
        '''
        Create a thumbnail of the specified size from the source file.
        @param source_path Path to the image
        @return Path to created thumbnail
        @raises Exception if image cannot be created
        '''
        source_image = load_image(source_path)

        # try to fix orientation if the image is rotated
        source_image = fix_orientation(source_image)

        # resize to the expected dimensions
        source_image = image_to_thumbnail(source_image, (150, 150)) # Thumbnails are 150px

        proxy_filename = self.generate_temp_filename(suffix='.png')

        source_image.save(proxy_filename, 'PNG', optimize=True)

        return proxy_filename

    def process(self, id_, file_, fileinfo=None):
        '''
        Creates a thumbnail and submits it to the proxy store for a given file

        @param id_ A black box identifier which shall be passed to the metadata update functions
        @param file_ The full path to the file
        @param fileinfo Information which may have already been gathered and may be of use. (Not used in this plugin)
        @return Constant indicating success/failure (Cannot return INPROGRESS as async() is False)
        '''
        p = Proxy(id_, self)

        thumbnail_path = ''

        try:
            thumbnail_path = self._make_proxy(file_)

            if p.ingest(thumbnail_path, 'thumbnail', 'thumb.png', 'image/png'):
                return PluginStatus.SUCCESS
            return PluginStatus.ERRORED

        except Exception:
            return PluginStatus.FATAL

        finally:
            if os.path.exists(thumbnail_path):
                os.remove(thumbnail_path)