######
Ingest
######

The ingest phase scans the filesystem and builds a number of sets of files to be ingested in parallel.
This process is performed by the *broker*.
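
As a rough illustration of building sets of files for parallel ingest, the sketch below walks a directory tree and deals the paths round-robin into a fixed number of batches. The function name ``build_batches`` and the round-robin strategy are assumptions made for this example; how the broker actually partitions files is not described here.

.. code-block:: python

    import os


    def build_batches(root, batch_count):
        """Walk ``root`` and deal file paths round-robin into ``batch_count`` lists."""
        batches = [[] for _ in range(batch_count)]
        index = 0
        for dirpath, _dirnames, filenames in os.walk(root):
            # Sort names so the assignment is repeatable within a directory.
            for name in sorted(filenames):
                batches[index % batch_count].append(os.path.join(dirpath, name))
                index += 1
        return batches

Each resulting list could then be handed to a separate worker, so that no file is processed twice.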

Search provides two modes: 'Search' and 'Search Plus'.

.. warning::

    Search Plus ingests (reads) file data, so the Access Time (atime) of
    every ingested file is updated when the file is read. If the customer
    performs migrations, backups, Analytics or workflows based on
    Access Time (atime), the customer must be told that after a
    Search Plus ingest they will no longer be able to differentiate
    files by Access Time.

    Search auto-deploys in lightweight 'file status' mode (stat-only) which
    provides minimal functionality and does not alter the
    Access Time (atime). More information is provided below.
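
The difference between the two modes can be pictured with plain Python. This is a minimal sketch, not the Search implementation: ``stat_only`` and ``read_contents`` are hypothetical helpers. A ``stat`` call only reads the inode, so it leaves atime alone, whereas opening and reading the file may update atime (whether it does depends on mount options such as ``noatime`` or ``relatime``).

.. code-block:: python

    import os


    def stat_only(path):
        """Collect core attributes without opening the file (atime is untouched)."""
        info = os.stat(path)
        return {"size": info.st_size, "mtime": info.st_mtime, "atime": info.st_atime}


    def read_contents(path):
        """Read the file data; on most mounts this may update atime."""
        with open(path, "rb") as handle:
            return handle.read()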

The Broker
----------

For each pool, the broker examines each file in turn. Core metadata is extracted, and then plugins are offered access to the file
to update the metadata and generate proxies.

Plugins can declare a priority, to ensure they are processed in a particular order.
In addition, they can tell the broker that they wish to execute code asynchronously,
which improves performance during the ingest phase.

Finally, the plugin must define which sorts of files it is interested in. This is typically based on MIME type and/or file extension.
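
The sketch below shows one way these three declarations could fit together. Every name in it (``SketchPlugin``, ``priority``, ``is_async``, ``mime_prefixes``, ``extensions``, ``plugins_for``) is hypothetical and is not the arcapix plugin API; it also assumes that a lower priority number runs first.

.. code-block:: python

    import mimetypes
    import os


    class SketchPlugin:
        """Hypothetical plugin description; not the arcapix base class."""

        name = "base"
        priority = 100        # assumption: lower numbers are processed first
        is_async = False      # assumption: flag read by the broker
        mime_prefixes = ()    # e.g. ("image/",)
        extensions = ()       # e.g. (".cr2",)

        def handles(self, path):
            """Return True if this plugin is interested in ``path``."""
            mime, _encoding = mimetypes.guess_type(path)
            ext = os.path.splitext(path)[1].lower()
            by_mime = mime is not None and mime.startswith(self.mime_prefixes)
            return by_mime or ext in self.extensions


    class ImagePlugin(SketchPlugin):
        name = "image"
        priority = 10
        mime_prefixes = ("image/",)


    class RawPhotoPlugin(SketchPlugin):
        name = "rawphoto"
        priority = 20
        extensions = (".cr2", ".nef")


    def plugins_for(path, plugins):
        """Select the interested plugins and order them by priority."""
        matching = [p for p in plugins if p.handles(path)]
        return sorted(matching, key=lambda p: p.priority)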


.. warning::

    Search only supports Unicode-compliant paths.

    Files with non-Unicode paths will not be ingested (AP000010051E).
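
To find affected files ahead of an ingest, a check along the following lines may help. It is a sketch under the assumption that "Unicode-compliant" means the raw path bytes decode as UTF-8; the exact test Search applies is not documented here, and ``is_unicode_path`` is a hypothetical helper.

.. code-block:: python

    import os


    def is_unicode_path(raw_path):
        """Return True if the raw bytes of ``raw_path`` decode cleanly as UTF-8."""
        try:
            # os.fsencode returns bytes unchanged and converts str back to raw bytes.
            os.fsencode(raw_path).decode("utf-8")
        except UnicodeDecodeError:
            return False
        return True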


Metadata Plugins
----------------

Metadata plugins examine the contents of a file, and retrieve interesting attributes about it -
for example, image widths and heights, word counts in documents, or GPS locations.

The plugin uses a helper object, defined in the arcapix libraries, to submit values for these attributes to the database.

Each plugin must also define a 'schema', which indicates what items it intends to extract, the data types of those values,
and in some cases, information about valid values, or range bounds.

It should be noted that a minimal schema is quite easy to define, but the more precise the schema,
the better the performance and query experience of the resulting database.

For example, it would be possible to declare all values as strings, but this would make range-based searches impossible.
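
The effect of typing can be shown with a small sketch. The ``SCHEMA`` layout and the ``validate`` helper below are hypothetical and do not reflect the real arcapix schema format; the point is only that typed values compare numerically, whereas strings compare character by character.

.. code-block:: python

    # Hypothetical schema layout: data type plus optional bounds or valid values.
    SCHEMA = {
        "width": {"type": int, "min": 1},
        "height": {"type": int, "min": 1},
        "format": {"type": str, "allowed": {"png", "jpeg", "tiff"}},
    }


    def validate(field, value, schema=SCHEMA):
        """Check ``value`` against the declared type and bounds for ``field``."""
        rule = schema[field]
        if not isinstance(value, rule["type"]):
            raise TypeError("%s must be %s" % (field, rule["type"].__name__))
        if "min" in rule and value < rule["min"]:
            raise ValueError("%s is below the minimum" % field)
        if "allowed" in rule and value not in rule["allowed"]:
            raise ValueError("%s is not a valid value" % field)
        return value


    widths_as_int = [200, 1000, 90]
    widths_as_str = ["200", "1000", "90"]

    # Numeric comparison finds the one width of at least 500.
    wide_int = [w for w in widths_as_int if w >= 500]
    # String comparison is lexicographic, so the same query gives a wrong answer.
    wide_str = [w for w in widths_as_str if w >= "500"]

Here ``wide_int`` is ``[1000]`` while ``wide_str`` is ``["90"]``, which is why a range query over string-typed values cannot be trusted.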

Proxy generation plugins
------------------------

Proxy generation plugins are usually called after metadata extraction has completed (though note the impact of the asynchronous capability).
They produce a file or files generated from the original. These will typically be a smaller version of the original (e.g. a thumbnail),
or some visualisation of the data, meant for consumption via user interfaces. However, proxies need not be 'browseable'.

In particular, there should be plugins which produce special proxies called ``<namespace>.thumbnail`` (e.g. ``image.thumbnail``, ``video.thumbnail``),
which can be used to represent files in "grid views". These proxies are not strictly required, but they are expected.
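
A user interface might locate the thumbnail among a file's proxies by matching this naming convention. The helper below is a hypothetical sketch of that lookup, assuming proxies are identified by dotted names; it is not part of the arcapix libraries.

.. code-block:: python

    def find_thumbnail(proxy_names):
        """Return the first ``<namespace>.thumbnail`` proxy name, or None."""
        for name in sorted(proxy_names):
            namespace, _dot, kind = name.rpartition(".")
            # Require a non-empty namespace in front of the "thumbnail" suffix.
            if namespace and kind == "thumbnail":
                return name
        return None

A grid view could fall back to a generic icon whenever this returns ``None``.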

Asynchronous support
--------------------

Plugins can also request that some of their processing is offloaded onto an asynchronous queue.
This is particularly helpful for very heavy processes such as video transcoding.
Support for this is built in, and easily accessed from the plugin using the ``Plugin._submit`` method.
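
The general pattern is sketched below with a stand-in class. The signature used for ``_submit`` here (a callable followed by its arguments, returning a future) is an assumption made for illustration, and the queue is imitated with a standard-library thread pool; consult the arcapix plugin reference for the real ``Plugin._submit`` behaviour. ``AsyncSketchPlugin``, ``VideoPlugin`` and ``transcode`` are hypothetical names.

.. code-block:: python

    from concurrent.futures import ThreadPoolExecutor


    class AsyncSketchPlugin:
        """Stand-in for a plugin base class; NOT the arcapix implementation."""

        def __init__(self, executor):
            self._executor = executor

        def _submit(self, func, *args, **kwargs):
            # Assumed signature: queue func(*args, **kwargs) and return a future.
            return self._executor.submit(func, *args, **kwargs)


    def transcode(path, profile):
        # Placeholder for heavy work such as video transcoding.
        return "%s -> %s" % (path, profile)


    class VideoPlugin(AsyncSketchPlugin):
        def process(self, path):
            # Cheap metadata extraction would happen inline here;
            # only the expensive step is pushed onto the queue.
            return self._submit(transcode, path, profile="h264-proxy")


    with ThreadPoolExecutor(max_workers=2) as pool:
        future = VideoPlugin(pool).process("/data/clip.mov")
        result = future.result()

The ingest loop can move on to the next file as soon as ``process`` returns, while the queued work completes in the background.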