Plugin API Reference

Plugin

class arcapix.search.metadata.plugins.base.Plugin

The Plugin will be instantiated one or more times - state should not be shared between instances.

Similarly, not all instances will be instantiated in the same Python process, so mutable class/static state should be avoided.

In essence, it should be a stateless object.

augment(id_, file_, current_metadata, context)

Provides an entry point for the Dynamic Plugin to filter the results returned.

NB. Dynamic (augmented) plugins are not currently supported for third-party use, and are thus undocumented at present.

id_ is a blackbox identifier for this file, which needs to be given to the Metadata helper library. The reason for using the library is to allow the process function to perform a wider variety of extraction stages than a simple return value would offer.

The blackbox ID MAY be str()’ed - this will produce an identifier which will remain consistent between invocations, i.e. the same file will always have the same string representation of its ID.

file_ is the filename of the file as it appears in the index. Note: there is no absolute guarantee that the file exists or is readable at this point in time; it merely existed when the index was last updated.

current_metadata is the metadata which will be returned at present. NB. Priority is respected between plugins, so if you require a specific order of amendments, this can be expressed using the priority() function. In the extreme case where you require different priorities for the ingest portion and the dynamic portion, this must be achieved with 2 plugins with overlapping namespaces. This variable is immutable.

context is a dictionary containing a variety of request context. This variable is immutable. Your plugin should not assume that any particular item will be set in this dictionary, as it may change depending on the calling context. However, the following are commonly included:

username - the username of the logged in user

expiry_time - the time at which the current authentication mechanism will expire

The function must return a ‘PluginStatus’ which will have the following impact:

FATAL - will cause the search to fail entirely

SKIPPED - will cause any returned metadata to be ignored

INPROGRESS - invalid

SUCCESS - normal execution

ERRORED - a warning will be logged

NB. An exception being raised in a plugin is considered FATAL, and will abort the search. Therefore, you should be careful to handle all reasonable exceptions within the augment method.
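The defensive pattern described above might look roughly like the following. This is a minimal sketch, not the library's own code: PluginStatus is a local stand-in enum mirroring the documented status names, and SafeAugmentPlugin and lookup_owner are hypothetical names introduced for illustration. A real augment would also hand its results to the Metadata helper, which is omitted here.

```python
import enum
import logging


class PluginStatus(enum.Enum):
    """Local stand-in mirroring the documented status names (assumption)."""
    SUCCESS = "SUCCESS"
    SKIPPED = "SKIPPED"
    INPROGRESS = "INPROGRESS"
    ERRORED = "ERRORED"
    FATAL = "FATAL"


def lookup_owner(file_):
    """Hypothetical helper that fails for some inputs."""
    if not file_:
        raise ValueError("empty filename")
    return "owner-of-" + file_


class SafeAugmentPlugin(object):
    """Hypothetical plugin; the real base class is not imported here."""

    logger = logging.getLogger("safe_augment_example")

    def augment(self, id_, file_, current_metadata, context):
        # No key is guaranteed to be present in context, so use .get().
        username = context.get("username")
        if username is None:
            return PluginStatus.SKIPPED
        try:
            lookup_owner(file_)
        except Exception:
            # An uncaught exception would be treated as FATAL and abort
            # the whole search, so report a recoverable error instead.
            self.logger.warning("augment failed for %s", file_, exc_info=True)
            return PluginStatus.ERRORED
        return PluginStatus.SUCCESS
```

Catching Exception broadly is deliberate in this sketch: the cost of a missed exception type is an aborted search, which is usually worse than a logged warning.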

dynamic()

Indicates that the plugin has dynamic properties, i.e. that it computes certain properties at runtime, rather than upon ingest. NB. Whilst it may be possible to ‘overwrite’ properties with new values at this point, this can be confusing since the search phase will return results based on the stored values. It is therefore recommended to use “new” properties for dynamic values.

handles(ext=None, mimetype=None)

Returns whether the plugin can handle a particular file.

Given an extension and mimetype, return whether the plugin is applicable. handles can return True for all if it handles all file types.

As far as is practicable, plugins should define the minimal set of types they can process.

ext may or may not start with a dot.

NB. A plugin should not assume that search will only send it files which match; it should therefore deal gracefully with any input provided.
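As an illustration of the points above, a handles implementation could normalise the extension before comparing it. This is only a sketch: ImageOnlyPlugin and its extension and MIME type sets are hypothetical, and the real base class is not imported.

```python
class ImageOnlyPlugin(object):
    """Hypothetical plugin; the real base class is not imported here."""

    # Assumed extension and MIME type sets, for illustration only.
    _EXTENSIONS = frozenset(["jpg", "jpeg", "png"])
    _MIMETYPES = frozenset(["image/jpeg", "image/png"])

    def handles(self, ext=None, mimetype=None):
        # Either argument may be None, so each is checked before use.
        if mimetype is not None and mimetype.lower() in self._MIMETYPES:
            return True
        if ext:
            # ext may or may not start with a dot, so strip it first.
            return ext.lower().lstrip(".") in self._EXTENSIONS
        return False
```

Because the extension check does not depend on the mimetype, this shape also copes with callers that supply no mimetype at all.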

is_async()

Returns whether this plugin will utilise Asynchronous functionality under any circumstances.

A value of False implies that process() cannot return INPROGRESS. This may be used to optimise workflows - for example, if all plugins which can handle a particular document are not async, then the metadata document can be lazily/bulk injected into the data store. Otherwise, it will need to be eagerly inserted, in order to ensure long running jobs don’t cause problems.

Plugins which declare handles=None, async=True should be _strongly avoided_, as these would cause all documents to be inserted eagerly.
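The optimisation described above can be pictured as follows. This is a hedged sketch of the decision only, not the actual scheduling code: StubPlugin and can_insert_lazily are hypothetical names, and the stub exposes nothing beyond handles() and is_async().

```python
class StubPlugin(object):
    """Hypothetical stand-in exposing only handles() and is_async()."""

    def __init__(self, extensions, is_async):
        self._extensions = extensions  # None means "handles everything"
        self._is_async = is_async

    def handles(self, ext=None, mimetype=None):
        if self._extensions is None:
            return True
        return (ext or "").lstrip(".").lower() in self._extensions

    def is_async(self):
        return self._is_async


def can_insert_lazily(plugins, ext=None, mimetype=None):
    """Return True if no plugin applicable to this file is async."""
    applicable = [p for p in plugins if p.handles(ext=ext, mimetype=mimetype)]
    return not any(p.is_async() for p in applicable)
```

With this model, a single async plugin that handles every file type makes can_insert_lazily return False for every document, which is why that combination is discouraged.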

logger

A logger instance which should be used for plugin logging messages.

max_file_size()

Indicates the maximum size of a file that this plugin will handle.

Any file larger than this size will return SKIPPED when the plugin is sandboxed.

By default this value is taken from apcore-config. If no value is configured, or the configured value is None, no limit is imposed.

max_process_time()

Indicates the maximum time that process can operate on a file (in seconds).

If the plugin is sandboxed, and takes longer than this time to complete, a Timeout exception will be raised.

Note

This time limit isn’t applied to asynchronous jobs.

By default this value is taken from apcore-config. If no value is configured, the default is 10 minutes.

namespace()

Returns the namespace this plugin will operate under (static). (REQUIRED)

It is not essential that this namespace is unique - it is possible to write multiple plugins within the same namespace, for example to extract the same information in different ways.

offline()

Returns whether or not the plugin can process offline/migrated files.

This means the plugin doesn’t attempt to open/read the file, as this would cause the expensive operation of recalling said file.

For protection, the Python default open() method will raise an exception if called when processing an offline file - however, this doesn’t protect against all possible open/read operations, so care should be taken when writing an offline plugin.

Note: offline files will have a mimetype of None, so the plugin’s handles method should either handle all, or else be able to discriminate based on extension.

priority()

The priority is a value between -1000 and 1000 which is used to indicate the relative order in which plugins should be called.

This typically should only be used by the same author, although technically there is no reason why a user derived plugin should not use this to override a standard plugin’s metadata.

All ArcaPix curated plugins have a priority of -10 to +10.

Plugins with higher priority get called first. Plugins with lower priority get called later, so can potentially overwrite metadata from higher priority plugins.
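The ordering rule can be illustrated with a small simulation. This is a sketch under simplifying assumptions: each plugin is reduced to a hypothetical (name, priority, metadata) tuple, and all plugins are assumed to write into one shared namespace represented by a flat dictionary.

```python
def apply_in_priority_order(plugins):
    """Merge metadata from (name, priority, metadata) tuples.

    Higher priority plugins are applied first, so values written by
    lower priority plugins, which run later, win on conflict.
    """
    merged = {}
    ordered = sorted(plugins, key=lambda plugin: plugin[1], reverse=True)
    for _name, _priority, metadata in ordered:
        merged.update(metadata)
    return merged


# Hypothetical plugins sharing one namespace.
EXAMPLE_PLUGINS = [
    ("custom", -50, {"title": "custom title"}),
    ("standard", 5, {"title": "standard title", "pages": 3}),
]
```

Here the custom plugin has the lower priority, so it runs after the standard one and its title replaces the standard title, while pages is left untouched.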

process(id_, file_, fileinfo=None)

Process the file, and extract metadata from it.

Metadata is inserted using functions from the Metadata Helper Library.

The fileinfo object may contain useful information about the file which has already been extracted, for example: fileinfo['mimetype'] or fileinfo['extension']

It may also have GPFS specific metadata, for example, fileinfo['gpfs']['poolname'].

However, the plugin should operate if none of this information is available.

id_ is a blackbox identifier for this file, which needs to be given to the Metadata helper library. The reason for using the library is to allow the process function to perform a wider variety of extraction stages than a simple return value would offer.

The blackbox ID MAY be str()’ed - this will produce an identifier which will remain consistent between invocations, i.e. the same file will always have the same string representation of its ID.

NB. It should not be presumed that this function will be called from within the search scan process. It could be called from the job queue engine instead, and therefore be invoked on an entirely different object instance.
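Putting these points together, a process implementation might be shaped roughly like this. It is a sketch only: PluginStatus and Metadata below are local stand-ins (the stand-in Metadata merely records updates in the FAKE_STORE dictionary), and LineCountPlugin, its namespace and its "lines" field are hypothetical.

```python
import enum

FAKE_STORE = {}  # stand-in for the real data store (assumption)


class PluginStatus(enum.Enum):
    """Local stand-in mirroring the documented status names (assumption)."""
    SUCCESS = "SUCCESS"
    SKIPPED = "SKIPPED"
    INPROGRESS = "INPROGRESS"
    ERRORED = "ERRORED"
    FATAL = "FATAL"


class Metadata(object):
    """Stand-in for the Metadata helper; it only records updates in memory."""

    def __init__(self, id_, plugin):
        # str(id_) is documented as stable for a given file.
        self._key = (str(id_), plugin.namespace())

    def update(self, data):
        FAKE_STORE.setdefault(self._key, {}).update(data)


class LineCountPlugin(object):
    """Hypothetical plugin; the real base class is not imported here."""

    def namespace(self):
        return "linecount"

    def process(self, id_, file_, fileinfo=None):
        # fileinfo may be missing entirely, or lack any given key.
        mimetype = (fileinfo or {}).get("mimetype")
        if mimetype is not None and not mimetype.startswith("text/"):
            return PluginStatus.SKIPPED
        try:
            with open(file_, "rb") as handle:
                lines = sum(1 for _ in handle)
        except (IOError, OSError):
            # The file may have vanished since the index was updated.
            return PluginStatus.ERRORED
        Metadata(id_, self).update({"lines": lines})
        return PluginStatus.SUCCESS
```

The plugin keeps no state on self between calls, so it behaves the same whether it is invoked from the scan process or from another instance in the job queue engine.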

schema()

Return the schema that this plugin will provide.

All schemas and namespaces will be merged and cross-validated to ensure they do not conflict.

This will then be validated against the central data store, with one of 3 outcomes:
  1. A match - startup proceeds
  2. A forward compatible difference (i.e. added fields) - the schema will be updated
  3. A forward incompatible difference (i.e. deleted fields, data would be destroyed) - startup will be aborted, and CLI tools need to be used to force an overwrite

The schema will be based around the following grammar (roughly):

schema := namespace*
namespace := (name, child)
child := (namespace | (name, prompt, value [, default_filter]))*
name := <unicode_string with no .’s - unique in this namespace>
prompt := <Meaningful string to describe the value for use in Human printable documentation>
default_filter := <Boolean - whether the field should be included in the “filters: ‘.’” projection>
value := (datatype [, encoding] [, language] [, valid_values*])
datatype := base_type | array | “Proxy”
array := “[” + base_type + “]”
base_type := “String” | “Integer” | “Long” | “Float” | “Double” | “Datetime” | “Boolean” | “GeoPoint” | “URI”
encoding := <any recognised character encoding description, e.g. UTF-8, Latin-1, ASCII etc.>
language := <any recognised ISO language code, e.g. en, de etc.>
valid_values := <String, date, or URI literal>
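To make the grammar more concrete, the sketch below writes two fields as plain tuples that follow the productions literally, and checks them against the name and datatype rules. The concrete Python structure that schema() is expected to return is not specified here, so the tuple layout, the field names and the helper functions are all assumptions made for illustration.

```python
BASE_TYPES = frozenset([
    "String", "Integer", "Long", "Float", "Double",
    "Datetime", "Boolean", "GeoPoint", "URI",
])


def valid_datatype(datatype):
    """datatype := base_type | array | "Proxy"."""
    if datatype == "Proxy" or datatype in BASE_TYPES:
        return True
    # array := "[" + base_type + "]"
    return (datatype.startswith("[") and datatype.endswith("]")
            and datatype[1:-1] in BASE_TYPES)


def valid_name(name):
    """Names must be non-empty and contain no dots."""
    return bool(name) and "." not in name


def check_fields(fields):
    """Check (name, prompt, value) tuples within a single namespace."""
    names = [field[0] for field in fields]
    return (len(names) == len(set(names))  # unique in this namespace
            and all(valid_name(name) for name in names)
            and all(valid_datatype(field[2][0]) for field in fields))


# Hypothetical fields: (name, prompt, (datatype [, encoding] [, language]))
EXAMPLE_FIELDS = [
    ("width", "Image width in pixels", ("Integer",)),
    ("keywords", "Keywords embedded in the file", ("[String]", "UTF-8", "en")),
]
```

Note that the array production only wraps a base_type, so a value such as "[Proxy]" is rejected by this check.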

PluginStatus

class arcapix.search.metadata.plugins.base.PluginStatus

SUCCESS - Processing completed normally, metadata was updated.

SKIPPED - The file wasn’t of interest to this plugin (informational)

INPROGRESS - Processing was in some fashion deferred, but there is no reason to believe it will fail

ERRORED - A non-permanent error occurred. Further files should be offered to the plugin

FATAL - A permanent error occurred. Further files should not be offered to this plugin until service restart or manual reset

Metadata

class arcapix.search.metadata.helpers.Metadata(id_, plugin)

Helper to update metadata on an object.

Metadata is stored in:

file:
  metadata:
    <pluginnamespace>:
      a: 23
      b: 24
      c:
        d: 25
        e: 26

etc.

augment(data)

Amend/Augment metadata.

Only available during the search response phase.

clean_metadata(metadata)

Brings metadata into line with the plugin schema.

This involves removing null fields, coercing fields to their schema type, and removing fields which don’t belong to the schema.

Note - this mutates the provided metadata.
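The three cleaning steps can be pictured with the following sketch. It is not the helper's implementation: clean_metadata_sketch is a hypothetical function, and the schema is simplified to an assumed mapping from field name to a Python type used for coercion. How the real helper reacts to a value that cannot be coerced is not documented here, so the sketch simply lets the exception propagate.

```python
def clean_metadata_sketch(metadata, schema_types):
    """Clean metadata in place against a {field: type} mapping."""
    for key in list(metadata):  # copy the keys, as entries are deleted
        value = metadata[key]
        if key not in schema_types or value is None:
            # Drop fields outside the schema, and null fields.
            del metadata[key]
            continue
        # Coerce the remaining fields to their schema type.
        metadata[key] = schema_types[key](value)
    return metadata


# Hypothetical schema and input, for illustration only.
EXAMPLE_TYPES = {"width": int, "title": str, "ratio": float}
EXAMPLE_METADATA = {"width": "640", "title": None, "bogus": 1, "ratio": 1}
```

Because the dictionary is modified in place, a caller that needs the original values should pass in a copy.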

update(data)

Update the metadata store with the information stored in ‘data’, which must be conformant to the schema.

Any missing values are not removed.

There is no “insert” operation, as the wrapper document _may_ have already been put in by the broker.

validate(data)

Verify that the data matches the schema, without actually doing an update on the store.

This is the first thing called by update(), but may be useful to the calling plugin.

Raises an exception if invalid, and returns the result when it has been taxonomy-stabilised.

NB. This may well cause the data to transform somewhat.

Proxy

class arcapix.search.metadata.helpers.Proxy(id_, plugin)

Helper to update proxy details on an object.

Proxy metadata is stored in:

file:
  proxies:
    <pluginnamespace>:
      <proxyname>:
        location: 23
        mimetype: 24

etc.

ingest(proxy_path, typeidentifier, filename, mimetype)

Moves a proxy file into the configured proxy store, and puts the relevant metadata into the data store.