Plugin API Reference



The Plugin will be instantiated one or more times - state should not be shared between instances.

Similarly, not all instances will be instantiated in the same Python process, so mutable class/static state should be avoided.

In essence, it should be a stateless object.
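As an illustration of the difference, the sketch below contrasts a plugin that keeps mutable class state with one that depends only on its arguments. The class names and the plain integer return values are hypothetical simplifications; a real process() reports a status code and writes metadata through the helper library.

```python
import io


class CountingPlugin:
    """Anti-pattern: mutable class-level state shared by every instance."""

    seen = []  # shared within one process, invisible to other processes

    def process(self, id_, file_, fileinfo=None):
        CountingPlugin.seen.append(id_)
        return len(CountingPlugin.seen)  # result depends on call history


class StatelessPlugin:
    """Preferred: the result depends only on the arguments passed in."""

    def process(self, id_, file_, fileinfo=None):
        text = file_.read()
        return len(text.split())  # all working data is local


a, b = StatelessPlugin(), StatelessPlugin()
result_a = a.process("id-1", io.StringIO("one two three"))
result_b = b.process("id-1", io.StringIO("one two three"))
```

Because neither instance stores anything, it does not matter which instance, or which Python process, ends up handling a given file.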


Returns whether this plugin will utilise Asynchronous functionality under any circumstances.

A value of False implies that process() cannot return INPROGRESS. This may be used to optimise workflows: for example, if none of the plugins which can handle a particular document are async, the metadata document can be lazily/bulk injected into the data store. Otherwise, it must be inserted eagerly, so that long-running jobs don’t cause problems.

Plugins which declare handles=None, async=True should be _strongly avoided_, as these would cause all documents to be inserted eagerly.
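The eager/lazy decision described above might be sketched as follows. This is not the scheduler's actual code: StubPlugin, is_async and must_insert_eagerly are hypothetical names (async itself is a reserved word in modern Python, so the sketch avoids it).

```python
class StubPlugin:
    """Hypothetical stand-in for a plugin, used only in this sketch."""

    def __init__(self, extensions, is_async):
        self._extensions = extensions  # None means "handles everything"
        self._is_async = is_async

    def handles(self, ext=None, mimetype=None):
        return self._extensions is None or ext in self._extensions

    def is_async(self):
        return self._is_async


def must_insert_eagerly(plugins, ext, mimetype):
    """True if any plugin that can handle the file might return INPROGRESS."""
    return any(p.is_async() and p.handles(ext, mimetype) for p in plugins)


plugins = [StubPlugin({"jpg"}, False), StubPlugin({"mov"}, True)]
catch_all_async = StubPlugin(None, True)
```

With the first two plugins, a "jpg" document can be inserted lazily and only "mov" documents need eager insertion. Adding catch_all_async forces eager insertion for every document, which is why that combination is discouraged.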

handles(ext=None, mimetype=None)

Returns whether the plugin can handle a particular file.

Given an extension and mimetype, return whether the plugin is applicable. handles can return True for every input if the plugin handles all file types.

As far as is practicable, plugins should define the minimal set of types they can process.

ext may or may not start with a dot, so both forms should be accepted.

NB. A plugin should not assume, however, that search will never send it a file which doesn’t match. It is therefore sensible for a plugin to deal gracefully with any input provided.
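A minimal sketch of a handles implementation that declares a small set of types and accepts ext with or without a leading dot. ImagePlugin and its type lists are invented for illustration.

```python
class ImagePlugin:
    """Hypothetical plugin that only declares the types it can process."""

    # Immutable constants, so no mutable class state is introduced.
    _EXTENSIONS = frozenset({"jpg", "jpeg", "png"})
    _MIMETYPES = frozenset({"image/jpeg", "image/png"})

    def handles(self, ext=None, mimetype=None):
        if mimetype in self._MIMETYPES:
            return True
        if ext:
            # Normalise ".JPG" and "jpg" to the same key.
            return ext.lstrip(".").lower() in self._EXTENSIONS
        return False


plugin = ImagePlugin()
```

Falling back to the extension when the mimetype is unknown keeps the plugin usable for files whose mimetype is None.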


Returns the namespace this plugin will operate under (static). (REQUIRED)

It is not essential that this namespace is unique - it is possible to write multiple plugins within the same namespace, for example to extract the same information in different ways.


Returns whether or not the plugin can process offline/migrated files.

This means the plugin doesn’t attempt to open/read the file, as this would cause the expensive operation of recalling said file.

For protection, Python’s built-in open() function will raise an exception if called when processing an offline file - however, this doesn’t protect against all possible open/read operations, so care should be taken when writing an offline plugin.

Note: offline files will have a mimetype of None, so the plugin’s handles method should either handle all, or else be able to discriminate based on extension.
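To show why guarding open() alone is not complete protection, here is a rough sketch of how such a guard could behave. It is an assumption for illustration only, not the product's implementation; offline_guard and OfflineFileError are hypothetical names.

```python
import builtins
from contextlib import contextmanager
from unittest import mock


class OfflineFileError(IOError):
    """Hypothetical exception type used only in this sketch."""


@contextmanager
def offline_guard():
    """Make the built-in open() refuse to run inside the with-block."""

    def _refuse(*args, **kwargs):
        raise OfflineFileError("open() is disabled for offline files")

    # Only the name builtins.open is rebound; os.open() and io.open()
    # are separate references and are NOT affected by this patch.
    with mock.patch.object(builtins, "open", _refuse):
        yield


original_open = builtins.open
blocked = False
with offline_guard():
    try:
        open("any-path")  # never touches the filesystem: the guard raises first
    except OfflineFileError:
        blocked = True
```

A plugin that reaches the file through another route (a C extension, os.open, a subprocess) would bypass a guard like this and trigger a recall, hence the need for care.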


The priority is a value between -1000 and 1000 which indicates the relative order in which plugins are called.

This should typically only be used to order plugins from the same author, although technically there is no reason why a user-derived plugin should not use this to override a standard plugin’s metadata.

All ArcaPix curated plugins have a priority of -10 to +10.

Plugins with higher priorities get called later (allowing them to overwrite earlier plugins).
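The effect of priority ordering can be sketched with plain dictionaries. merge_in_priority_order is a hypothetical helper, and the (priority, metadata) pairs stand in for whatever each plugin would have written.

```python
def merge_in_priority_order(results):
    """Merge (priority, metadata_dict) pairs, lowest priority first."""
    merged = {}
    for _priority, metadata in sorted(results, key=lambda pair: pair[0]):
        merged.update(metadata)  # later (higher priority) values overwrite
    return merged


results = [
    (50, {"title": "User supplied title"}),       # user-derived plugin
    (-10, {"title": "Default title", "pages": 3}),  # curated plugin
]
merged = merge_in_priority_order(results)
```

The higher-priority plugin runs last, so its title replaces the curated one while the untouched pages field survives.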

process(id_, file_, fileinfo=None)

Process the file, and extract metadata from it.

Metadata is inserted using functions from the Metadata Helper Library.

The fileinfo object may contain useful information about the file which has already been extracted, for example fileinfo['mimetype'] or fileinfo['extension'].

It may also have GPFS specific metadata, for example fileinfo['gpfs']['poolname'].

However, the plugin should operate if none of this information is available.

id_ is a black-box identifier for this file, which needs to be given to the Metadata Helper Library. The reason for using the library is to allow the process function to perform a wider variety of extraction stages than a simple return value would offer.

NB. It should not be presumed that this function will be called from within the search scan process. It could be called from the job queue engine instead, and therefore be invoked on an entirely different object instance.
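Since fileinfo and any of its keys may be absent, a plugin might read it defensively along these lines. The sketch assumes fileinfo behaves like a nested dict (the subscript notation above suggests this, but it is an assumption), and summarise_fileinfo is a hypothetical helper.

```python
def summarise_fileinfo(fileinfo=None):
    """Collect optional hints without failing when they are missing."""
    info = fileinfo or {}
    gpfs = info.get("gpfs") or {}
    return {
        "mimetype": info.get("mimetype"),
        "extension": info.get("extension"),
        "poolname": gpfs.get("poolname"),
    }


empty = summarise_fileinfo()
partial = summarise_fileinfo({"extension": "pdf", "gpfs": {"poolname": "fast"}})
```

Every missing value comes back as None, so the rest of process() can decide what to do without catching KeyError or TypeError.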


Return the schema that this plugin will provide.

All schemas and namespaces will be merged and cross-validated to ensure they do not conflict.

This will then be validated against the central data store, with one of three outcomes:
  1. A match - startup proceeds
  2. A forward compatible difference (i.e. added fields) - the schema will be updated
  3. A forward incompatible difference (i.e. deleted fields, data would be destroyed) - startup will be aborted, and CLI tools need to be used to force an overwrite
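The three outcomes can be sketched as a comparison between the stored and proposed schemas. This is a simplification under stated assumptions: each schema is flattened to a dict of dotted field name to datatype, compare_schemas is a hypothetical name, and treating a changed datatype as incompatible is a guess not confirmed by the text above.

```python
def compare_schemas(stored, proposed):
    """Classify the difference between two flattened schemas."""
    if stored == proposed:
        return "match"
    for field, datatype in stored.items():
        # A field that vanished (or, by assumption, changed type)
        # would destroy existing data.
        if proposed.get(field) != datatype:
            return "incompatible"
    return "compatible"  # only additions remain


stored = {"document.title": "String"}
added = {"document.title": "String", "document.pages": "Integer"}
```

Here compare_schemas(stored, added) reports a forward compatible change, while comparing in the opposite direction reports an incompatible one because document.pages would be dropped.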

The schema will be based around the following grammar (roughly):

    schema         := namespace*
    namespace      := (name, child)
    child          := (namespace | (name, prompt, value [, default_filter]))*
    name           := <unicode_string with no .’s - unique in this namespace>
    prompt         := <Meaningful string to describe the value for use in Human printable documentation>
    default_filter := <Boolean - whether the field should be included in the “filters: ‘.’” projection>
    value          := (datatype [, encoding] [, language] [, valid_values*])
    datatype       := base_type | array | “Proxy”
    array          := “[” + base_type + “]”
    base_type      := “String” | “Integer” | “Long” | “Float” | “Double” | “Datetime” | “Boolean” | “GeoPoint” | “URI”
    encoding       := <any recognised character encoding description, e.g. UTF-8, Latin-1, ASCII etc.>
    language       := <Any recognised ISO language code, e.g. en, de etc.>
    valid_values   := <String, date, or URI literal>
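One possible reading of this grammar as Python data is sketched below. The concrete representation (lists of tuples) and the field names are assumptions made for illustration; the grammar itself does not say how a schema is encoded.

```python
# schema := namespace*  -> a list of (name, child) pairs
schema = [
    ("document", [
        # (name, prompt, value [, default_filter])
        ("title", "Document title", ("String", "UTF-8", "en"), True),
        ("keywords", "Author supplied keywords", ("[String]",)),
        # A nested namespace: (name, child)
        ("layout", [
            ("pages", "Number of pages", ("Integer",)),
        ]),
    ]),
]


def field_paths(children, prefix=""):
    """Yield (dotted_path, datatype) for every leaf field."""
    for entry in children:
        name = entry[0]
        if isinstance(entry[1], list):  # namespace: recurse into its children
            yield from field_paths(entry[1], prefix + name + ".")
        else:  # field: the datatype is the first element of value
            yield prefix + name, entry[2][0]


paths = list(field_paths(schema))
```

Because names may not contain dots, joining them with "." gives every field an unambiguous path such as document.layout.pages.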



SUCCESS - Processing completed normally, metadata was updated.

SKIPPED - The file wasn’t of interest to this plugin (informational)

INPROGRESS - Processing was in some fashion deferred, but there is no reason to believe it will fail

ERRORED - A non-permanent error occurred. Further files should be offered to the plugin

FATAL - A permanent error occurred. Further files should not be offered to this plugin until service restart or manual reset
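The difference between ERRORED and FATAL can be sketched from the caller's side. How the real constants are defined is not shown in this reference, so the Status enum, offer_files and flaky_process below are hypothetical.

```python
from enum import Enum


class Status(Enum):
    SUCCESS = "SUCCESS"
    SKIPPED = "SKIPPED"
    INPROGRESS = "INPROGRESS"
    ERRORED = "ERRORED"
    FATAL = "FATAL"


def offer_files(process, files):
    """Offer files to one plugin's process callable until it reports FATAL."""
    outcomes = []
    for file_ in files:
        status = process(file_)
        outcomes.append((file_, status))
        if status is Status.FATAL:
            break  # permanent error: stop offering files to this plugin
    return outcomes


def flaky_process(file_):
    # Stand-in for a plugin's process(); the mapping is invented test data.
    canned = {"b.txt": Status.ERRORED, "c.txt": Status.FATAL}
    return canned.get(file_, Status.SUCCESS)


outcomes = offer_files(flaky_process, ["a.txt", "b.txt", "c.txt", "d.txt"])
```

The ERRORED result for b.txt does not stop the loop, whereas the FATAL result for c.txt means d.txt is never offered.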


class, plugin)

Helper to update metadata on an object.

Metadata is stored in


a : 23

b: 24


d : 25

e : 26


clean_metadata(metadata, schema=None)

Brings metadata into line with the plugin schema.

This involves removing null fields, coercing fields to their schema type, and removing fields which don’t belong to the schema.

Note - this mutates the provided metadata.
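The three cleaning steps might look roughly like this. It is a sketch rather than the helper's real code: the schema is flattened to a dict of field name to datatype, only three datatypes are covered, and returning the same dict is an assumption.

```python
_COERCERS = {"String": str, "Integer": int, "Float": float}


def clean_metadata_sketch(metadata, schema):
    """Drop nulls and unknown fields, coerce the rest; mutates metadata."""
    for key in list(metadata):  # copy the keys, since entries are deleted
        value = metadata[key]
        if value is None or key not in schema:
            del metadata[key]
        else:
            metadata[key] = _COERCERS[schema[key]](value)
    return metadata


schema = {"pages": "Integer", "title": "String", "ratio": "Float"}
metadata = {"pages": "12", "title": None, "ratio": "1.5", "junk": 1}
cleaned = clean_metadata_sketch(metadata, schema)
```

Note that cleaned is the very same object as metadata, so a caller that needs the original values should copy the dict first.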


Update the metadata store with the information stored in ‘data’, which must be conformant to the schema. Values missing from ‘data’ are not removed from the store.

There is no “insert” operation, as the wrapper document _may_ have already been put in by the broker.


Verify that the data matches the schema, without actually doing an update on the store.

This is the first thing called by update(), but may be useful to the calling plugin.

Raises an exception if invalid, and returns the result when it has been taxonomy-stabilised.

NB. This may well cause the data to transform somewhat.


class, plugin)

Helper to update proxy details on an object

Proxy metadata is stored in

location : 23 mimetype: 24



Removes a proxy, both from the metadata and from the store.

For the time being, we’re assuming that the proxy is being deleted as part of deleting a file from the database - in that case we don’t have to worry about removing the field from the parent metadata document, because the parent document is going to be completely deleted anyway. This will probably have to change in future.

ingest(proxy_path, typeidentifier, filename, mimetype)

Moves a proxy file into the configured proxy store, and puts the relevant metadata into the data store.