Using the REST API

This section contains information on how to make requests to PixStor Search by using the REST API.

The API is compliant with the Collection+JSON (C+J) specification, but note that it also uses the collection-level properties extension and a minor extension that allows templated URLs.

It may be a little different from some other REST approaches, but it is more predictable and easier for machines to consume, since the schema is entirely consistent across any service that uses the C+J approach.

It should be viewed as somewhat like a website, with links that can be followed, forms that can be filled in, and so on, rather than as a simple fire-and-forget approach, with the strong caveat that it is designed for machine consumption rather than human.

In addition, it is possible to short-circuit the process, but this requires an understanding of the URL format, which consuming the C+J directly does not. The URL format is considered to be less canonical (i.e. more likely to change between versions).

Note

In order to reduce the documentation size, in all circumstances success is indicated by a 200, 201, or 202 HTTP status code. A 500 HTTP status code can also be returned for a generic server error.

All methods are GET, unless otherwise specified.

Note

In order to reduce the documentation size, in all circumstances OTHER THAN Authorisation requests, the following HTTP Headers must be included

Accept: application/vnd.collection+json

WWW-Authenticate: <bearer based token string>

Authentication

Before accessing the API, one must first authenticate, using the RFC 6749 OAuth2 process for resources. Note that in this context the developer’s application is the “Client”, and typical usage is via grant_type=password.

Arcapix supplies a basic example Authorisation Server, which supports only a grant_type of password and authenticates against the standard PAM system.

Note that the authentication server is a distinct service from the search server, and is accessed on a different port. Access must in all cases be via SSL. As standard, the search server runs on port 5000, but the auth server runs on 5001. Depending on the client library/language you are using, you may need to accept self-signed certificates or install the Arcapix CA.

URL

https://authserver/oauth2/token

Method

POST

Request Parameters

Parameter    Description          Required
grant_type   Always “password”    Yes
username     username             Yes
password     password             Yes

Success Status Code

200

Payload

{
  "access_token": "feoF8MpdWqnUAI3FiMc9v6_PDspAfJPzc_-uwueC9I7IDkxz_hvYITVsNWZ5IOH19nfwIADhIpo9q_GDaCLyUGvA-_RUAEaPcurWFSTX5zClBGZ-I3n2WQbnvLVkvweVWGNilBTdNwdNndmNyqYI-lVt4RO1tIylV29mN7GQOMRXZAWKMXunc_0qpNpJy47M8tPZVVReXREnGd96SovGspKQ-AUAH1IcaD3mqlzrxiNg_j9cRP3KSdhSy_cHSuhN4QdX96jJ5TnsPPHXbFnK26k4jbBPb7sOx39LcXXOOuCjV_RioqaZHe_xt7l3tuuetxlNeU5PhgM2vJsWxBHQrJau9bG0pO24tkMEj5ByUBIH4EiXCyCtx9NbfpB_Hyu0KsHv8IFPcMAZlC7Ijcpg9g2zCa7iGIA_o-uYrHDzxg6sQPQVzgPmJuD1RkFVMXsbiwan7vFCFOscoeCKfcxHW8GTB9SFEZ3aErnGsHMgIRIvBbcH3nyIATcnaTVVZOKYP82851NJgHQUaCmZ1zDkjndbcmiAdvYnOh2EUVVlAoL0UiTLS4qh6EgEF4OIj3_blEn0iSzF5269tiDgaMYtf39839_2eN1zr9Td7BEs9srz5OWQm482Djz04LjL2veYhLOdxVaDYoiRYrvyeDblRPaMu4AWZmjlJEqtDSm664AARCAPIX",
  "expires_in": 86400,
  "token_type": "Bearer"
}

Note

Tokens by default expire after 24 hours. The sample auth server does not support token refresh - a new token must be requested. Tokens may not persist across server restarts, depending on the configuration of the server.
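The request fields and the expiry handling described above can be sketched as follows. This is a minimal illustration, not the reference client: the helper names (build_token_form, token_expiry, is_expired) and the 60 second safety margin are assumptions, and the fields are assumed to be sent form-encoded as RFC 6749 describes.

```python
import time

def build_token_form(username, password):
    """Form fields for the token request; grant_type is always 'password'."""
    return {"grant_type": "password", "username": username, "password": password}

def token_expiry(payload, issued_at=None):
    """Epoch time (seconds) at which the token in `payload` stops being valid."""
    if issued_at is None:
        issued_at = time.time()
    return issued_at + payload["expires_in"]

def is_expired(expires_at, now=None, margin=60):
    """True if the token has expired, or will do so within `margin` seconds."""
    if now is None:
        now = time.time()
    return now >= expires_at - margin
```

With the requests library, the form could be sent as requests.post("https://authserver/oauth2/token", data=build_token_form(user, pw)). Since the sample auth server does not support refresh, a client would request a fresh token once is_expired returns True.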

Error status code

400 - Bad request (NB. This is not terribly precise - 403 or 412 might be more useful but are not what’s specified in the RFC)

Onward Usage

However it is achieved, the end point of a successful authentication is an access token. This must be passed to the search server via an appropriately encoded Bearer WWW-Authenticate Header.

NB. Most libraries will take care of the encoding, if you pass the access token as the username, and an empty password e.g.

import requests
requests.get("https://mypixsearchserver/api/files/", auth=requests.auth.HTTPBasicAuth(access_token, ''))
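For clients without such a library, the header that results from passing the token as the username with an empty password can be built by hand. The following is a minimal sketch, assuming standard HTTP Basic encoding (which is what requests.auth.HTTPBasicAuth produces); basic_auth_header is a hypothetical helper name.

```python
import base64

def basic_auth_header(access_token):
    """Authorization header value for the token as username, empty password."""
    # HTTP Basic credentials are "username:password", base64 encoded
    credentials = (access_token + ":").encode("utf-8")
    return "Basic " + base64.b64encode(credentials).decode("ascii")
```

The returned string would be sent as the value of the Authorization request header.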

Billboard URL

The C+J exploration starts by retrieving the server’s root URL.

Provided the correct access token is passed via the standard WWW-Authenticate header, we will receive a response containing a list of possible queries.

By filling in the parameters requested, one can craft a suitable query without knowing the URL structure.

URL

https://mypixsearchserver/api/files/

Response

See Example Responses

Considering a snippet of the response above:

{
  "data": [
    {
      "prompt": "Search string",
      "name": "where",
      "value": ""
    }
  ],
  "href": "/files/?where={\"_all\":\"{where}\"}",
  "prompt": "Enter a string to search in all fields across all files",
  "rel": "search"
}

By replacing the {where} entries with values prompted for using the supplied prompts (Search string), a suitable query URL can be constructed - e.g. /files/?where={"_all":"promptedvalue"}

A small command line tool might be written as follows:

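import requests
from requests.auth import HTTPBasicAuth

# hrefs in the response are relative to the API root (assumed here to be /api)
api_root = "https://mypixsearchserver/api"
auth = HTTPBasicAuth(access_token, '')  # access_token obtained as described under Authentication

r = requests.get(api_root + "/files/", auth=auth)

# take the first query advertised by the billboard
query = r.json()['collection']['queries'][0]
href = query['href']

print(query['prompt'] + "\n")

# substitute each templated parameter with a value typed in by the user
for param in query['data']:
    href = href.replace("{" + param['name'] + "}", input(param['prompt'] + ":\n"))

results = requests.get(api_root + href, auth=auth)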
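A non-interactive variant might pick the query by its rel rather than by position, and escape supplied values so that quotes inside them do not break the JSON in the template. This is a sketch under assumptions: find_query and fill_template are hypothetical helper names, and the response layout is taken from the snippet above.

```python
import json

def find_query(collection, rel):
    """Return the first query in a C+J collection with the given rel, or None."""
    for query in collection.get("queries", []):
        if query.get("rel") == rel:
            return query
    return None

def fill_template(query, values):
    """Substitute {name} placeholders in the query href with supplied values."""
    href = query["href"]
    for param in query["data"]:
        name = param["name"]
        value = values.get(name, param.get("value", ""))
        # json.dumps escapes quotes and backslashes; drop its surrounding quotes
        escaped = json.dumps(str(value))[1:-1]
        href = href.replace("{" + name + "}", escaped)
    return href
```

The resulting href is still relative and would need percent-encoding where the HTTP library does not apply it automatically.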

Rich/Direct query

It is possible to directly query without going via the billboard URL, although this may mean your application needs updating should the URL format change.

URL

https://mypixsearchserver/api/files/

Request Parameters

Parameter     Description                   Required   Default
where         Clause to filter results by   Yes*       NA
sort          Key to sort by                No         relevance
page          Desired page of results       No         first
projection    Specify fields to return      No         all properties
max_results   Number of results per page    No         25

*The where clause isn’t strictly needed, but no items are returned if you do not provide one in order to reduce the chances of a malformed query overloading the server.

where

filters

Filters are applied to specific, named properties. Property names are of the form <namespace>.<property>

The format is as follows where={"property1":"value1", "property2":"value2"}, which will produce an “AND” search.

It is possible to pass multiple values (OR) with an array syntax where={"property1":["value1","value2"]}

It is possible to apply an AND filter on a single field where={"property1": {"and": ["value1", "value2"]}}

It is also possible to exclude a particular value where={"property1": {"not": "value1"}}

Numerical and date-based properties can be searched using ranges, with the keywords gt, gte, lt, and lte e.g. where={"property1":{"gte":"value1","lt":"value2"}}.

Date-range query values can be milliseconds since epoch (not seconds), or ISO 8601 formatted strings e.g. where={"core.modificationtime":{"gt":"2000-01-01T00:00"}}

These property filters produce exact matches - this means the whole terms must match, including matching case. For example, {"location.city":"new"} will not match location.city: New York, nor will {"location.city":"new york"}
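Building the where value from a dictionary and serialising it with a JSON library avoids hand-written quoting mistakes. The following is a minimal sketch; where_param is a hypothetical helper, and the property names are simply ones that appear elsewhere in this section.

```python
import json

def where_param(filters):
    """Serialise a dict of property filters to the JSON string for `where`."""
    return json.dumps(filters, separators=(",", ":"))

# AND across two properties
both = where_param({"core.extension": ".jpg", "location.city": "New York"})

# OR within a single property
either = where_param({"core.extension": [".jpg", ".png"]})

# exclude a value
not_jpg = where_param({"core.extension": {"not": ".jpg"}})

# ranges: size of at least 1024, modified after 1 January 2000
ranged = where_param({
    "core.size": {"gte": 1024},
    "core.modificationtime": {"gt": "2000-01-01T00:00"},
})
```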

‘_all’ queries

There is a special, magic _all field, which performs search across all properties.

The _all query is tokenised and case-insensitive, meaning where={"_all": "new"} would match New York.

Additionally, the _all query supports a rich query syntax, including boolean operators and wildcards.

Some example queries are:

# files matching either cats or dogs
cats OR dogs

# files matching BOTH cats and dogs
cats AND dogs

# files matching cats and black, or dogs and black
(cats OR dogs) AND black

# files matching cats, but not matching black
cats AND NOT black

# files matching an exact filename
# without quotes, this would be split into three terms: cats, 16, jpg
"cats-16.jpg"

# wildcard query
cats-*

# fuzzy search - files *almost* matching 'cast', such as 'cats'
cast~

# boost - match either cats or dogs, favouring cats
# that is, files matching cats will be preferred over those matching dogs in the results
# this doesn't guarantee that cats will appear first - you may need to use a larger boost
# if 'sort' is used (see below), it takes precedence over boosts
cats^2 OR dogs

# query a specific metadata field
core.directory:cats

# unlike filters (see above), field queries are 'analysed'
# without quotes, the following would be split into: mmfs1, cats
# for an exact match, either use a filter, or quote the query
core.directory:"/mmfs1/cats"

# query by range on a specific field

# less than value
core.size:<1024

# greater or equal to date
core.modificationtime:>=2020-03-01

# value in range (inclusive)
image.width:[800 TO 1920]

These queries would be used as, e.g. where={"_all": "cats OR dogs"}

In the case of exact match, where the search term is quoted, the quotation marks would need to be escaped, e.g. where={"_all": "\"cats-16.jpg\""}

Warning

Avoid using queries with leading wildcards, like *.jpg, or worse *foo*. Queries with leading wildcards are very slow and resource heavy, and may time out.

In the case of searching for files with a particular extension, one can simply search the extension without the wildcard, e.g. {"_all": "jpg"}

‘_all’ queries can be combined with property filters - e.g. where={"_all": "cats", "core.size": {"gt": 1024}}
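The quoting, leading-wildcard, and combination rules above can be wrapped in a few small helpers. This is only a sketch: exact, has_leading_wildcard, and all_where are hypothetical names, and the wildcard check is a rough client-side guard rather than anything the server provides.

```python
import json

def exact(term):
    """Wrap a term in double quotes so it is matched as one exact phrase."""
    return '"' + term + '"'

def has_leading_wildcard(query):
    """True if any whitespace-separated term starts with '*'."""
    return any(term.lstrip('("').startswith("*") for term in query.split())

def all_where(query, filters=None):
    """Build a `where` value combining an _all query with property filters."""
    if has_leading_wildcard(query):
        raise ValueError("leading wildcards are slow and may time out")
    clause = {"_all": query}
    clause.update(filters or {})
    # json.dumps takes care of escaping the quotation marks added by exact()
    return json.dumps(clause)
```

For example, all_where(exact("cats-16.jpg")) yields the escaped form shown above, and all_where("cats", {"core.size": {"gt": 1024}}) yields the combined query.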

Note

Queries don’t match substrings - for example a query for cat won’t match caterpillar

To match a substring, you would have to use wildcards - cat*

Similarly, strings aren’t split on underscores, so cat won’t match cat_pictures. In that case, you would need to search for the full string cat_pictures

sort

The sort property specifies a column to sort the data on, with a preceding - used to indicate an inversion of the sort. Multiple, comma-separated fields can be specified e.g. sort=-core.size,core.modificationtime

Note

By default, the items are returned in a “relevance” order. Unless the filter has been very precise, a lot of matches are likely, and sorting on these matches is likely to not be terribly useful, as well as being a performance hit.
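A sort value can be assembled from (field, descending) pairs. This is a small sketch, assuming that the preceding - selects descending order; sort_param is a hypothetical helper name.

```python
def sort_param(*keys):
    """Join (field, descending) pairs into a value for the `sort` parameter."""
    return ",".join(("-" if descending else "") + field for field, descending in keys)
```

For instance, sort_param(("core.size", True), ("core.modificationtime", False)) gives the value used in the example above.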

projection

It is technically possible to request only a subset of the properties for items to be returned - if one knew for example that a particular metadata field was very large (say 10K or more), it may make sense to not have it returned, to reduce both network utilisation and JSON parsing overhead.

The syntax is projection={"property1":0} to exclude a field.

Alternatively, you can specify projection={"property1":1} to return only that field.
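Both forms of projection can be generated from lists of field names. A minimal sketch follows; projection_param is a hypothetical helper, and whether the server accepts a mix of included and excluded fields in one request is not stated here, so the sketch handles one list at a time.

```python
import json

def projection_param(fields, include=True):
    """Build a `projection` value that returns only, or leaves out, `fields`."""
    flag = 1 if include else 0
    return json.dumps({name: flag for name in fields}, separators=(",", ":"))
```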

page

The page property indicates where in a paged result set you wish to be. In essence, page*max_results is the index of the first result you want.

NB. Using the C+J ‘HATEOAS’ links means you don’t need to do computations to provide “previous”, “next”, “last” type functionality - the required URLs are given to you.

Important

The underlying database, Elasticsearch, has a pagination limit of 10k results.

Pagination links take this limit into account, so if you follow “next” or “last” links you will never exceed the limit.

If you explicitly request a page beyond the pagination limit, a 416 (Range Not Satisfiable) will be returned. This indicates that, while there are more results in the database, the REST server can’t return them.
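Following the pagination links, rather than computing page numbers, keeps a client within the limit. The sketch below assumes that the link for the following page carries rel "next" in the collection's links list; link_href and iter_pages are hypothetical helpers, and fetch stands for any callable that retrieves an href and returns the parsed collection object.

```python
def link_href(collection, rel):
    """Return the href of the first collection-level link with the given rel."""
    for link in collection.get("links", []):
        if link.get("rel") == rel:
            return link["href"]
    return None

def iter_pages(first_href, fetch):
    """Yield each page's collection, following 'next' links until none remain.

    `fetch` takes an href and returns the parsed 'collection' object for it,
    for example a thin wrapper around requests.get(...).json()['collection'].
    """
    href = first_href
    while href is not None:
        collection = fetch(href)
        yield collection
        href = link_href(collection, "next")
```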

max_results

The maximum number of results to return. This has a default of 25 and an absolute maximum of 1000. Smaller pages give faster results.

Payload

(See typical response below)

Error status codes

403 - Forbidden - most likely incorrect access token

Example request

GET https://mypixsearchserver/api/files/?where={"_all":"jpg"}&sort=core.pathname&projection={"core.size":0}&page=1&max_results=20 HTTP/1.1
Accept: application/vnd.collection+json
WWW-Authenticate: <bearer based token string>
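The request line above is shown unencoded for readability. When building the URL by hand, the JSON values need percent-encoding; a sketch using only the standard library follows. build_query_url is a hypothetical helper, and the parameter names are those from the table above.

```python
import json
from urllib.parse import urlencode

def build_query_url(base, where, sort=None, projection=None, page=None, max_results=None):
    """Return `base` with a percent-encoded query string for a direct query."""
    params = {"where": json.dumps(where, separators=(",", ":"))}
    if sort is not None:
        params["sort"] = sort
    if projection is not None:
        params["projection"] = json.dumps(projection, separators=(",", ":"))
    if page is not None:
        params["page"] = page
    if max_results is not None:
        params["max_results"] = max_results
    # urlencode escapes braces, quotes, and colons in the JSON values
    return base + "?" + urlencode(params)
```

Libraries such as requests perform the same encoding when the values are passed via their params argument.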

Typical query response

The response (in C+J format) will contain 4 major sections

(For a full example, see Example Responses)

Items

This is a list of matches, typically the first 25. Each nested item will contain a “data” key, which in turn is a list of triples giving each property’s name, value, and prompt.

{
   "items": [
      {
        "href": "/files/3735374022151170231",
        "data": [
          {
            "prompt": "File basename (string)",
            "name": "core.filename",
            "value": "cats-22.jpg"
          },
          {
            "prompt": "File mime-type (string)",
            "name": "core.mimetype",
            "value": "image/jpeg"
          }
          ]
       }
   ]
}

Properties provided are

name - name of the field property
value - field value
prompt - human readable description of the field

The href attribute gives a direct link to this item, which will return this item, and only this item, with all properties returned. Thus, detail views can be built when used with projections.
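For most client code, the list of triples is easier to work with as a plain mapping of property name to value. A minimal sketch, with item_to_dict as a hypothetical helper name:

```python
def item_to_dict(item):
    """Flatten a C+J item's data triples into a {name: value} mapping."""
    return {field["name"]: field["value"] for field in item["data"]}
```

The prompts are dropped by this conversion; a client that displays field labels would keep them alongside the values.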

Collection properties

These provide a list of data items indicating the total number of hits.

{
"properties": [
  {
    "prompt": "Number of matching documents",
    "name": "hits",
    "value": 73
  }
]
}
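Reading the hit count and turning it into a page count might look like the sketch below. collection_property and page_count are hypothetical helpers, and the properties list is assumed to sit on the object passed in.

```python
def collection_property(collection, name, default=None):
    """Return the value of the named collection-level property."""
    for prop in collection.get("properties", []):
        if prop.get("name") == name:
            return prop.get("value")
    return default

def page_count(hits, max_results=25):
    """Number of pages needed to hold `hits` results (ceiling division)."""
    return -(-hits // max_results)
```

Note that the pagination limit described earlier still applies, so not every computed page is necessarily retrievable.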

Updating metadata

Metadata can be updated by performing a PATCH request against a given file.

To update metadata, you will need an auth token, and your user must have the update_search_metadata auth right. By default, only the special ‘broker’ user (which performs ingest) has full read-write permission. Additional users can be given the update_search_metadata right via apconfig. Only a sysadmin will have permission to make this change.

URL

https://mypixsearchserver/api/files/<fileid>

Method

PATCH

Headers

Parameter      Description                               Required   Default
If-Match       etag for the current version of the doc   Yes        NA
Content-Type   the mimetype of the data being sent       Yes        application/vnd.collection+json

Payload

Collection+JSON template of key-values to update - e.g.

{
  "template": {
    "data": [
      {
        "name": "core.creator",
        "value": "arcapix"
      }
    ]
  }
}
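The headers and template body for a PATCH can be generated from a plain dictionary of updates. This is a sketch only: build_patch is a hypothetical helper, and the etag is assumed to have been read from the document's _etag property beforehand.

```python
import json

CJ_MIME = "application/vnd.collection+json"

def build_patch(updates, etag):
    """Return (headers, body) for a PATCH that applies `updates`."""
    headers = {"If-Match": etag, "Content-Type": CJ_MIME}
    template = {
        "template": {
            "data": [{"name": name, "value": value} for name, value in updates.items()]
        }
    }
    return headers, json.dumps(template)
```

With the requests library, the result could be sent as requests.patch(url, data=body, headers=headers, auth=...).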

Success Status Code

200 (OK)

Response

{
  "collection": {
      "href": "/files/3721279936826738506",
      "items": [
          {
              "href": "/files/3721279936826738506",
              "data": [
                  {
                      "prompt": "_updated",
                      "name": "_updated",
                      "value": "2019-01-16T13:18:33"
                  },
                  {
                      "prompt": "_created",
                      "name": "_created",
                      "value": "2018-10-12T09:08:57"
                  },
                  {
                      "prompt": "_status",
                      "name": "_status",
                      "value": "OK"
                  },
                  {
                      "prompt": "_id",
                      "name": "_id",
                      "value": 3721279936826738506
                  },
                  {
                      "prompt": "_etag",
                      "name": "_etag",
                      "value": "bfd9c38ac604b7a86c5b34242c2c940a0f84b9af"
                  }
              ],
              "links": []
          }
      ],
      "version": "1.0",
      "links": [
          {
              "render": "link",
              "href": "/files/3721279936826738506",
              "prompt": "File",
              "name": "self",
              "rel": "self"
          }
      ]
  }
}

Error Status Code

403 (Forbidden) - invalid access token; this might be caused by incorrect user or password, token expired, or user doesn’t have the update_search_metadata auth right

412 (Precondition Failed) - ETAG is invalid or outdated

422 (Unprocessable Entity) - the PATCHed metadata failed validation. In this case, the response body should include an explanation of the issue(s) - e.g.

{
  "collection": {
      "error": {
          "title": "Error",
          "message": "common : {u'creator': 'must be of string type'}"
      }
  }
}

428 (Precondition Required) - If-Match header (ETAG) wasn’t provided
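A 412 usually means the document changed between reading its ETag and sending the PATCH. One way a client might handle this is to re-read the ETag and try again a limited number of times. The sketch below is an assumption about client design, not part of the API; patch_with_retry and its two callables are hypothetical.

```python
def patch_with_retry(get_etag, send_patch, attempts=3):
    """Retry a PATCH while the server reports a stale ETag (412).

    get_etag() returns the document's current _etag; send_patch(etag) sends
    the PATCH with that value in If-Match and returns the HTTP status code.
    """
    for _ in range(attempts):
        status = send_patch(get_etag())
        if status != 412:
            return status
    # every attempt hit a stale ETag
    return 412
```

Retrying blindly overwrites whatever the concurrent change was, so a careful client would re-read the document and check that its update still makes sense before each retry.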

What can be updated

You can update the value for any field defined in the PixStor Search schema. A field is in the PxS schema if it is defined in one of the installed PxS plugins.

Any metadata field not defined in the schema will be rejected with status 422. Similarly, updated values are validated against the schema - e.g. you can’t update a string field with an integer. Any value that fails validation will be rejected with status 422.

Note that if a given metadata field is populated via a plugin, any user changes made to the value of that field are likely to be replaced during the next ingest.

One possible way around this is to create a ‘schema plugin’ - a plugin which defines a schema, but doesn’t extract any metadata.

class TagSchemaPlugin(Plugin):

  def namespace(self):
    return 'user'

  def handles(self, mimetype, ext):
    # doesn't handle any files
    return False

  def schema(self):
    return [{
      "name": "tags",
      "prompt": "file tags",
      "value": {
        "datatype": "[String]"  # list of strings
      }
    }]

  def process(self, id_, path, fileinfo=None):
    return PluginStatus.SUCCESS

This plugin will add the user.tags field to PxS’s schema - making the field ‘valid’. But the plugin doesn’t generate any metadata itself, so won’t override any user-provided values on ingest.

Deleting Documents

Documents can be removed from the database by performing a DELETE request against a given file.

It’s not possible to remove individual metadata fields. Only whole documents can be removed.

To delete metadata, you will need an auth token, and your user must have the delete_search_metadata auth right.

URL

https://mypixsearchserver/api/files/<fileid>

Method

DELETE

Headers

Parameter   Description                               Required   Default
If-Match    etag for the current version of the doc   Yes        NA

Success Status Code

204 (No Content)

Response

<empty>

Error Status Code

403 (Forbidden) - invalid access token; this might be caused by incorrect user or password, token expired, or user doesn’t have the delete_search_metadata auth right

412 (Precondition Failed) - ETAG is invalid or outdated

428 (Precondition Required) - If-Match header (ETAG) wasn’t provided

Delete by Query

It is also possible to bulk delete multiple documents in one go. This is done by performing a DELETE request against the /files endpoint, with some query.

e.g.

DELETE https://mypixsearchserver/api/files/?where={"core.extension":".DS_Store"}

Warning

In general, you should avoid using delete by query, and great care should be taken if you do use it.

There are no special safety checks, so it is very easy to unintentionally delete large numbers of documents.

URL

https://mypixsearchserver/api/files/?where=<query>

Method

DELETE

Headers

Unlike a single file delete, delete by query doesn’t require an If-Match header, since each file matched by the query has its own unique ETag.

Consequently, you don’t have the safety that ETags provide for per-file delete.
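Because the server applies no safety checks, a client may want its own. One possible guard is to count the matches with a normal GET query first (reading the hits property) and refuse to delete above a threshold. The sketch below is an assumption about client design; guarded_delete, its callables, and the default threshold of 100 are all hypothetical.

```python
def guarded_delete(where, count_hits, delete, max_docs=100):
    """Delete by query only if the number of matches is within `max_docs`.

    count_hits(where) returns the number of matching documents, for example
    from the 'hits' property of a GET with the same where clause;
    delete(where) issues the DELETE request.
    """
    hits = count_hits(where)
    if hits > max_docs:
        raise RuntimeError(
            "refusing to delete %d documents (limit %d)" % (hits, max_docs)
        )
    if hits:
        delete(where)
    return hits
```

The count and the delete are separate requests, so documents ingested in between may still be matched by the delete.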

Success Status Code

204 (No Content)

Response

<empty>

Error Status Code

403 (Forbidden) - invalid access token; this might be caused by incorrect user or password, token expired, or user doesn’t have the delete_search_metadata auth right