##################
Using the REST API
##################

This section describes how to make requests to PixStor Search using the REST API.

The API is compliant with the `Collection+JSON (C+J) specification `_, but note that it also makes use of the collection-level ``properties`` extension, and a minor extension to allow for templated URLs.

It may be a little different to some other REST approaches, but it is more predictable and easier for machines to consume, since the schema is entirely consistent across any service which uses the C+J approach.

It should be viewed as somewhat like a website, with links that can be followed, forms that can be filled in, and so on, with the strong caveat that it is designed for machine consumption rather than human, and not as a simple fire-and-forget approach.

In addition, it is possible to short-circuit the process, but this requires an understanding of the URL format, which consuming the C+J directly does not. The URL format is considered to be less canonical (i.e. more likely to change between versions).

.. note::

   In order to reduce the documentation size, in all circumstances success is indicated by a 200, 201, or 202 HTTP status code.

   A 500 HTTP status code can also be returned for a generic server error.

   All methods are ``GET``, unless otherwise specified.

.. note::

   In order to reduce the documentation size, in all circumstances OTHER THAN authorisation requests, the following HTTP headers must be included

   **Accept:** application/vnd.collection+json

   **WWW-Authenticate:**

==============
Authentication
==============

Before accessing the API, one must first authenticate, using the `RFC6749 oAuth2 `_ process for resources.
It should be noted that in this context, the developer's application is the "Client", and typically utilisation is via ``grant_type=password``.

Arcapix supplies a basic example Authorisation Server which supports only a ``grant_type`` of ``password``, and authenticates against the standard PAM system.

Note that the authentication server is a distinct service from the search server, and is accessed on a different port. Access must in all cases be via SSL.

As standard, the search server runs on port 5000, and the auth server runs on port 5001.

Depending on the client library/language you are using, you may need to accept self-signed certificates or install the Arcapix CA.

URL
---

.. code::

   https://authserver/oauth2/token

Method
------

POST

Request Parameters
------------------

==========  =================  ========
Parameter   Description        Required
==========  =================  ========
grant_type  Always "password"  Yes
username    username           Yes
password    password           Yes
==========  =================  ========

Success Status Code
-------------------

200

Payload
-------

.. code-block:: JSON

   {
     "access_token": "feoF8MpdWqnUAI3FiMc9v6_PDspAfJPzc_-uwueC9I7IDkxz_hvYITVsNWZ5IOH19nfwIADhIpo9q_GDaCLyUGvA-_RUAEaPcurWFSTX5zClBGZ-I3n2WQbnvLVkvweVWGNilBTdNwdNndmNyqYI-lVt4RO1tIylV29mN7GQOMRXZAWKMXunc_0qpNpJy47M8tPZVVReXREnGd96SovGspKQ-AUAH1IcaD3mqlzrxiNg_j9cRP3KSdhSy_cHSuhN4QdX96jJ5TnsPPHXbFnK26k4jbBPb7sOx39LcXXOOuCjV_RioqaZHe_xt7l3tuuetxlNeU5PhgM2vJsWxBHQrJau9bG0pO24tkMEj5ByUBIH4EiXCyCtx9NbfpB_Hyu0KsHv8IFPcMAZlC7Ijcpg9g2zCa7iGIA_o-uYrHDzxg6sQPQVzgPmJuD1RkFVMXsbiwan7vFCFOscoeCKfcxHW8GTB9SFEZ3aErnGsHMgIRIvBbcH3nyIATcnaTVVZOKYP82851NJgHQUaCmZ1zDkjndbcmiAdvYnOh2EUVVlAoL0UiTLS4qh6EgEF4OIj3_blEn0iSzF5269tiDgaMYtf39839_2eN1zr9Td7BEs9srz5OWQm482Djz04LjL2veYhLOdxVaDYoiRYrvyeDblRPaMu4AWZmjlJEqtDSm664AARCAPIX",
     "expires_in": 86400,
     "token_type": "Bearer"
   }

.. note::

   Tokens by default expire after 24 hours. The sample auth server does not support token refresh - a new token must be requested.
   Tokens may not persist across server restarts, depending on the configuration of the server.

Error status code
-----------------

400 - Bad request (NB. This is not terribly precise - 403 or 412 might be more useful but are not what's specified in the RFC)

Onward Usage
------------

However it is achieved, the end point of a successful authentication is an **access token**. This must be passed to the search server via an appropriately encoded Bearer ``WWW-Authenticate`` Header.

NB. Most libraries will take care of the encoding if you pass the access token as the username, with an empty password, e.g.

.. code-block:: Python

   import requests

   requests.get("https://mypixsearchserver/api/files/", auth=requests.auth.HTTPBasicAuth(access_token, ''))

=============
Billboard URL
=============

The C+J exploration starts by retrieving the server's root URL. Provided the correct **access token** is passed via the standard WWW-Authenticate header, we will receive a response containing a list of possible queries. By filling in the parameters requested, one can craft a suitable query without knowing the URL structure.

URL
---

.. code::

   https://mypixsearchserver/api/files/

Response
--------

See :doc:`example_responses`

Considering a snippet of the response above:

.. code-block:: JSON

   {
     "data": [
       {
         "prompt": "Search string",
         "name": "where",
         "value": ""
       }
     ],
     "href": "/files/?where={\"_all\":\"{where}\"}",
     "prompt": "Enter a string to search in all fields across all files",
     "rel": "search"
   }

By replacing the ``{where}`` entries with values prompted for using the supplied prompts (``Search string``), a suitable query URL can be constructed - e.g. ``/files/?where={"_all":"promptedvalue"}``

A small command line tool might be written as follows:

.. code-block:: Python

   import requests
   from requests.auth import HTTPBasicAuth

   # access_token is the token obtained from the auth server
   auth = HTTPBasicAuth(access_token, '')

   # Fetch the billboard and take the first advertised query
   r = requests.get("https://mypixsearchserver/api/files/", auth=auth)
   query = r.json()['collection']['queries'][0]
   href = query['href']
   print(query['prompt'] + "\n")

   # Fill in each templated parameter with a value typed by the user
   for param in query['data']:
       href = href.replace("{" + param['name'] + "}", input(param['prompt'] + ":\n"))

   # The href is assumed to be relative to the API root
   results = requests.get("https://mypixsearchserver/api" + href, auth=auth)

=================
Rich/Direct query
=================

It is possible to query directly without going via the billboard URL, although this may mean your application needs updating should the URL format change.

URL
---

.. code::

   https://mypixsearchserver/api/files/

Request Parameters
------------------

===========  ===========================  ========  ==============
Parameter    Description                  Required  Default
===========  ===========================  ========  ==============
where        Clause to filter results by  Yes\*     NA
sort         Key to sort by               No        relevance
page         Desired page of results      No        first
projection   Specify fields to return     No        all properties
max_results  Number of results per page   No        25
===========  ===========================  ========  ==============

\*The where clause isn't strictly needed, but no items are returned if you do not provide one, in order to reduce the chances of a malformed query overloading the server.

where
_____

filters
^^^^^^^

Filters are applied to specific, named properties. Property names are of the form ``.``

The format is as follows ``where={"property1":"value1", "property2":"value2"}``, which will produce an "AND" search.

It is possible to pass multiple values (OR) with an array syntax ``where={"property1":["value1","value2"]}``

It is possible to apply an AND filter on a single field ``where={"property1": {"and": ["value1", "value2"]}}``

It is also possible to exclude a particular value ``where={"property1": {"not": "value1"}}``

Numerical and date-based properties can be searched using ranges, with the keywords ``gt``, ``gte``, ``lt``, and ``lte`` e.g.
``where={"property1":{"gte":"value1","lt":"value2"}}``.

Date-range query values can be **milliseconds** since epoch (not seconds), or ISO 8601 formatted strings e.g. ``where={"core.modificationtime":{"gt":"2000-01-01T00:00"}}``

These property filters produce **exact** matches - this means the whole term must match, including case.

For example, ``{"location.city":"new"}`` will not match ``location.city: New York``, nor will ``{"location.city":"new york"}``

'_all' queries
^^^^^^^^^^^^^^

There is a special, magic ``_all`` field, which performs a search across **all** properties.

The ``_all`` query is tokenised and case-insensitive, meaning ``where={"_all": "new"}`` *would* match ``New York``.

Additionally, the ``_all`` query supports a rich query syntax, including boolean operators and wildcards.

Some example queries are:

.. code-block:: python

   # files matching either cats or dogs
   cats OR dogs

   # files matching BOTH cats and dogs
   cats AND dogs

   # files matching cats and black, or dogs and black
   (cats OR dogs) AND black

   # files matching cats, but not matching black
   cats AND NOT black

   # files matching an exact filename
   # without quotes, this would be split into three terms: cats, 16, jpg
   "cats-16.jpg"

   # wildcard query
   cats-*

   # fuzzy search - files *almost* matching 'cast', such as 'cats'
   cast~

   # boost - match either cats or dogs, favouring cats
   # that is, files matching cats will be preferred over those matching dogs in the results
   # this doesn't guarantee that cats will appear first - you may need to use a larger boost
   # if 'sort' is used (see below), it takes precedence over boosts
   cats^2 OR dogs

   # query a specific metadata field
   core.directory:cats

   # unlike filters (see above), field queries are 'analysed'
   # without quotes, the following would be split into: mmfs1, cats
   # for an exact match, either use a filter, or quote the query
   core.directory:"/mmfs1/cats"

   # query by range on a specific field
   # less than value
   core.size:<1024

   # greater or equal to date
   core.modificationtime:>=2020-03-01

   # value in range (inclusive)
   image.width:[800 TO 1920]

These queries would be used as, e.g. ``where={"_all": "cats OR dogs"}``

In the case of an exact match, where the search term is quoted, the quotation marks need to be escaped, e.g. ``where={"_all": "\"cats-16.jpg\""}``

.. warning::

   Avoid using queries with leading wildcards, like ``*.jpg``, or worse ``*foo*``.

   Queries with leading wildcards are very slow and resource heavy, and may time out.

   In the case of searching for files with a particular extension, one can simply search the extension without the wildcard, e.g. ``{"_all": "jpg"}``

``_all`` queries can be combined with property filters - e.g. ``where={"_all": "cats", "core.size": {"gt": 1024}}``

.. note::

   Queries don't match substrings - for example a query for ``cat`` won't match ``caterpillar``

   To match a substring, you would have to use wildcards - ``cat*``

   Similarly, strings aren't split on underscores, so ``cat`` won't match ``cat_pictures``. In that case, you would need to search for the full string ``cat_pictures``

sort
____

The sort property specifies a column to sort the data on, with a preceding ``-`` used to indicate an inversion of the sort.

Multiple, comma-separated fields can be specified e.g. ``sort=-core.size,core.modificationtime``

.. note::

   By default, the items are returned in a "relevance" order. Unless the filter has been very precise, a lot of matches are likely, and sorting on these matches is unlikely to be useful, as well as being a performance hit.

projection
__________

It is technically possible to request only a subset of the properties for items to be returned - if one knew for example that a particular metadata field was very large (say 10K or more), it may make sense to not have it returned, to reduce both network utilisation and JSON parsing overhead.

The syntax is ``projection={"property1":0}`` to exclude a field.
Alternatively, you can specify ``projection={"property1":1}`` to return *only* that field.

page
____

The page property indicates where in a paged result set you wish to be. In essence, ``page*max_results`` is the index of the first result you want.

NB. Using the C+J 'HATEOAS' links means you don't need to do computations to provide "previous", "next", "last" type functionality - the required URLs are given to you.

.. important::

   The underlying database, Elasticsearch, has a pagination limit of 10k results.

   Pagination links take this limit into account, so if you follow "next" or "last" links you will never exceed the limit.

   If you explicitly request a page beyond the pagination limit, a 416 (Range Not Satisfiable) will be returned. This indicates that, while there are more results in the database, the REST server can't return them.

max_results
___________

The maximum number of results to return. This has a default of 25 and an absolute maximum of 1000. Smaller pages give faster results.

Payload
-------

(See typical response below)

Error status codes
------------------

403 - Forbidden - most likely incorrect access token

Example request
---------------

.. code-block:: HTTP

   GET https://mypixsearchserver/api/files/?where={"_all":"jpg"}&sort=core.pathname&projection={"core.size":0}&page=1&max_results=20 HTTP/1.1
   Accept: application/vnd.collection+json
   WWW-Authenticate:

======================
Typical query response
======================

The response (in C+J format) will contain 4 major sections (for a full example, see :doc:`example_responses`).

Items
-----

This is a list of matches, typically the first 25. Each nested item will contain a "data" key, which in turn is a list of triples giving each property's name, value, and prompt.

.. code-block:: JSON

   {
     "items": [
       {
         "href": "/files/3735374022151170231",
         "data": [
           {
             "prompt": "File basename (string)",
             "name": "core.filename",
             "value": "cats-22.jpg"
           },
           {
             "prompt": "File mime-type (string)",
             "name": "core.mimetype",
             "value": "image/jpeg"
           }
         ]
       }
     ]
   }

The properties provided are

**name** - name of the field

**value** - field value

**prompt** - human readable description of the field

The href attribute gives a direct link to this item, which will return this item, and only this item, with all properties returned. Thus, detail views can be built when used with projections.

Item Links
----------

This contains a list of links to other resources connected with the item, typically the proxies. The type of proxy is indicated by the "rel" attribute. In particular, the special ``_thumbnail`` rel can be assumed to be a small image.

.. code-block:: JSON

   {
     "items": [
       {
         "href": "/files/3735374022151170231",
         "data": [],
         "links": [
           {
             "prompt": "Thumbnail image",
             "name": "thumb.png",
             "render": "image",
             "accept": "image/png",
             "href": "/media/090/453/627/9045362721810216358.png",
             "rel": "_thumbnail"
           },
           {
             "prompt": "Preview image",
             "name": "preview.png",
             "render": "image",
             "accept": "image/png",
             "href": "/media/051/069/261/5106926172767688680.png",
             "rel": "image.preview"
           }
         ]
       }
     ]
   }

By careful utilisation of the **render**, **accept** and **href** attributes, user interfaces with the correct controls can be produced.

Collection links
----------------

This provides a list of links to other related collections.

Firstly, it contains links to the previous, next, last etc. pages of results, using IANA registered relation types. This enables the crafting of paged result sets without explicit URL calculations.

.. code-block:: JSON

   {
     "links": [
       {
         "render": "link",
         "href": "/files/?where={\"_all\":\"cats\"}&page=3",
         "prompt": "Last",
         "name": "last",
         "rel": "last"
       },
       {
         "render": "link",
         "href": "/files/?where={\"_all\":\"cats\"}&page=2",
         "prompt": "Next",
         "name": "next",
         "rel": "next"
       }
     ]
   }

But more interestingly, it contains the link to kick off the guided dynamic search.

.. code-block:: JSON

   {
     "links": [
       {
         "render": "link",
         "href": "/files/?where={\"_all\":\"cats\"}&projection={\"filters\": \".\"}",
         "prompt": "Search filters",
         "name": "_filters",
         "rel": "links"
       }
     ]
   }

By following the href in that "link", you will retrieve a much wider variety of useful collections related to your search, which enables efficient drill down through large result sets.

Collection properties
---------------------

These provide a list of data items indicating the total number of hits.

.. code-block:: JSON

   {
     "properties": [
       {
         "prompt": "Number of matching documents",
         "name": "hits",
         "value": 73
       }
     ]
   }

=====================
Dynamic guided search
=====================

If one follows a link of the form

.. code::

   https://mypixsearchserver/api/files/?projection={"filters":"."}&where=...

either from the initial query response or by crafting it directly, then a more verbose list of related collections of results is returned (see :doc:`example_responses`). The actual results are not returned - this is a secondary operation.

The links are of the form

.. code-block:: JSON

   {
     "links": [
       {
         "href": "\/files/?where={\"_all\":\"cats\"}&projection={\"filters\": \"core.size\"}",
         "prompt": "Core - Size (73)",
         "name": "core.size",
         "rel": "links"
       },
       {
         "href": "\/files/?where={\"_all\": \"cats\", \"core.size\": {\"lt\": 44000, \"gte\": 27400}}",
         "prompt": "Core - Size - 27400 - 44000 (14)",
         "name": "core.size.27400-44000",
         "rel": "collection"
       }
     ]
   }

Here, the ``rel`` attribute indicates that by following the link, you will get a new collection of items (bottom example), or a new collection of links (top example).
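
As a rough illustration of the distinction between the two ``rel`` values, the following sketch splits a list of guided-search links into those that lead to items and those that lead to further links. The helper name ``split_links`` is hypothetical and not part of PixStor Search; the sample data is copied from the example above, and only the ``rel`` values ``collection`` and ``links`` are taken from the documented response.

```python
def split_links(links):
    """Partition guided-search links by their ``rel`` value."""
    # rel == "collection": following the href returns matching items
    item_links = [link for link in links if link.get("rel") == "collection"]
    # rel == "links": following the href returns a further list of links
    more_links = [link for link in links if link.get("rel") == "links"]
    return item_links, more_links


# Sample data copied from the example response above
sample_links = [
    {
        "href": '/files/?where={"_all":"cats"}&projection={"filters": "core.size"}',
        "prompt": "Core - Size (73)",
        "name": "core.size",
        "rel": "links",
    },
    {
        "href": '/files/?where={"_all": "cats", "core.size": {"lt": 44000, "gte": 27400}}',
        "prompt": "Core - Size - 27400 - 44000 (14)",
        "name": "core.size.27400-44000",
        "rel": "collection",
    },
]

item_links, more_links = split_links(sample_links)
for link in item_links:
    print("items:", link["prompt"])
for link in more_links:
    print("more filters:", link["prompt"])
```

A user interface might present the first group as ready-made drill-down choices and fetch the second group only on demand.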
The collection of items will be a subset of your initial search, but restricted by a certain property - in this case, restricting to only those files which have a size of between 27400 and 44000 bytes.

The interesting thing about the results is that they are ordered in such a way as to present those which sub-divide the collection most effectively first. For example, if there were approximately equal numbers of True & False values for a given property, this would be a good candidate. If almost all the results were True, it would not be, and would be presented further down the list of links.

With ranged values (e.g. integers or dates), the process is similar in concept; however, in that instance, the system automatically computes the most effective bucket sizes, with the aim of dividing the total into around 5 roughly equally sized sets. So in the above example, of the 73 values which match the initial search, 14 are in the range 27400-44000.

This 'Auto-bucket' and 'Most useful' approach lets users rapidly reduce a large result set down to a more specific set of results.

The first 5 or so properties return the sub-divisions inline; the remaining properties (the "less useful" ones) require an additional fetch step - these are indicated by ``"rel": "links"``, without matching "collection" entries.

Contrived Example
-----------------

One could envisage a system with a wizard, asking a series of questions based on the "most useful" discriminator, in order to get to one page of results in the fewest number of steps.

In pseudo code, this might look like

.. code-block:: Python

   href = '/files/?where={"_all":"jpg"}'
   while True:
       items = get_href(href)  # Get the items matching the search
       if len(items)

Method
------

PATCH

Headers
-------

============  =======================================  ========  ===============================
Parameter     Description                              Required  Default
============  =======================================  ========  ===============================
If-Match      ETag for the current version of the doc  Yes       NA
Content-Type  the mimetype of the data being sent      Yes       application/vnd.collection+json
============  =======================================  ========  ===============================

Payload
-------

Collection+JSON template of key-values to update - e.g.

.. code-block:: JSON

   {
     "template": {
       "data": [
         {
           "name": "core.creator",
           "value": "arcapix"
         }
       ]
     }
   }

Success Status Code
-------------------

200 (OK)

Response
--------

.. code-block:: JSON

   {
     "collection": {
       "href": "/files/3721279936826738506",
       "items": [
         {
           "href": "/files/3721279936826738506",
           "data": [
             {
               "prompt": "_updated",
               "name": "_updated",
               "value": "2019-01-16T13:18:33"
             },
             {
               "prompt": "_created",
               "name": "_created",
               "value": "2018-10-12T09:08:57"
             },
             {
               "prompt": "_status",
               "name": "_status",
               "value": "OK"
             },
             {
               "prompt": "_id",
               "name": "_id",
               "value": 3721279936826738506
             },
             {
               "prompt": "_etag",
               "name": "_etag",
               "value": "bfd9c38ac604b7a86c5b34242c2c940a0f84b9af"
             }
           ],
           "links": []
         }
       ],
       "version": "1.0",
       "links": [
         {
           "render": "link",
           "href": "/files/3721279936826738506",
           "prompt": "File",
           "name": "self",
           "rel": "self"
         }
       ]
     }
   }

Error Status Code
-----------------

403 (Forbidden) - invalid access token; this might be caused by an incorrect user or password, an expired token, or the user not having the ``update_search_metadata`` auth right

412 (Precondition Failed) - ETag is invalid or outdated

422 (Unprocessable Entity) - the PATCHed metadata failed validation. In this case, the response body should include an explanation of the issue(s) - e.g.

.. code-block:: JSON

   {
     "collection": {
       "error": {
         "title": "Error",
         "message": "common : {u'creator': 'must be of string type'}"
       }
     }
   }

428 (Precondition Required) - ``If-Match`` header (ETag) wasn't provided

What can be updated
-------------------

You can update the value for any field defined in the PixStor Search schema. A field is in the PxS schema if it is defined in one of the installed PxS plugins.

Any metadata field not defined in the schema will be rejected with status 422.

Similarly, updated values are validated against the schema - e.g. you can't update a string field with an integer. Any value that fails validation will be rejected with status 422.

Note that if a given metadata field is populated via a plugin, any user changes made to the value of that field are likely to be replaced during the next ingest.

One possible way around this is to create a 'schema plugin' - a plugin which defines a schema, but doesn't extract any metadata.

.. code-block:: python

   class TagSchemaPlugin(Plugin):

       def namespace(self):
           return 'user'

       def handles(self, mimetype, ext):
           # doesn't handle any files
           return False

       def schema(self):
           return [{
               "name": "tags",
               "prompt": "file tags",
               "value": {
                   "datatype": "[String]"  # list of strings
               }
           }]

       def process(self, id_, path, fileinfo=None):
           return PluginStatus.SUCCESS

This plugin will add the ``user.tags`` field to PxS's schema - making the field 'valid'. But the plugin doesn't generate any metadata itself, so won't override any user-provided values on ingest.

==================
Deleting Documents
==================

Documents can be removed from the database by performing a DELETE request against a given file.

It's not possible to remove individual metadata fields. Only whole documents can be removed.

To delete metadata, you will need an auth token, and your user must have the ``delete_search_metadata`` auth right.

URL
---

.. code::

   https://mypixsearchserver/api/files/

Method
------

DELETE

Headers
-------

============  =======================================  ========  ===============================
Parameter     Description                              Required  Default
============  =======================================  ========  ===============================
If-Match      ETag for the current version of the doc  Yes       NA
============  =======================================  ========  ===============================

Success Status Code
-------------------

204 (No Content)

Response
--------

No response body is returned.

Error Status Code
-----------------

403 (Forbidden) - invalid access token; this might be caused by an incorrect user or password, an expired token, or the user not having the ``delete_search_metadata`` auth right

412 (Precondition Failed) - ETag is invalid or outdated

428 (Precondition Required) - ``If-Match`` header (ETag) wasn't provided

===============
Delete by Query
===============

It is also possible to bulk delete multiple documents in one go.

This is done by performing a DELETE request against the ``/files`` endpoint, with some query, e.g.

.. code::

   DELETE https://mypixsearchserver/api/files/?where={"core.extension":".DS_Store"}

.. warning::

   In general, you should avoid using delete by query, and great care should be taken if you do use it.

   There are no special safety checks, so it is very easy to unintentionally delete large numbers of documents.

URL
---

.. code::

   https://mypixsearchserver/api/files/?where=

Method
------

DELETE

Headers
-------

Unlike a single file delete, delete by query doesn't require an If-Match header, since each file matched by the query has its own unique ETag.

Consequently, you don't have the safety that ETags provide for per-file delete.

Success Status Code
-------------------

204 (No Content)

Response
--------

No response body is returned.

Error Status Code
-----------------

403 (Forbidden) - invalid access token; this might be caused by an incorrect user or password, an expired token, or the user not having the ``delete_search_metadata`` auth right
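
Since delete by query has no safety checks, one cautious pattern is to run the same ``where`` clause as an ordinary ``GET`` first and inspect the ``hits`` collection property before issuing the ``DELETE``. The following is a minimal sketch under stated assumptions, not a definitive implementation: the helper names ``build_query_url`` and ``count_hits`` are hypothetical, the ``properties`` list is assumed to sit under the top-level ``collection`` key, the server is assumed to accept a URL-encoded ``where`` value, and the host is the placeholder used throughout this section. No request is made; a stand-in response body is used instead.

```python
import json
from urllib.parse import urlencode

API_ROOT = "https://mypixsearchserver/api"  # placeholder host used in this section


def build_query_url(where, **params):
    """Build a /files/ URL with a JSON-encoded, URL-escaped where clause."""
    query = {"where": json.dumps(where, separators=(",", ":"))}
    query.update(params)
    return API_ROOT + "/files/?" + urlencode(query)


def count_hits(response_body):
    """Return the value of the 'hits' property, or None if it is absent."""
    # Assumption: 'properties' is nested under the top-level 'collection' key
    for prop in response_body.get("collection", {}).get("properties", []):
        if prop.get("name") == "hits":
            return prop.get("value")
    return None


# Stand-in for the parsed body of a GET response; no request is made here
sample_body = {
    "collection": {
        "properties": [
            {"prompt": "Number of matching documents", "name": "hits", "value": 73}
        ]
    }
}

url = build_query_url({"core.extension": ".DS_Store"}, max_results=1)
print(url)
print("documents that would be deleted:", count_hits(sample_body))
```

With the ``requests`` library, the URL could be fetched with ``requests.get`` and the parsed body passed to ``count_hits``; only if the count looks plausible would the same ``where`` clause be sent with ``requests.delete``.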