Python API Almanac

Introduction

This almanac provides basic background on Python concepts, in particular those relevant to utilising Callback Functions and List Processing with the ArcaPix API. In general, the API can be utilised without in-depth knowledge of the following topics; however, awareness of them can aid troubleshooting.

For more information on these topics, various links exist throughout this document. The official Python documentation is highly recommended as a first point of reference.

Generators

A Generator is similar to a list in that it represents a collection of items that can be iterated (looped) over. Unlike a list, a generator doesn’t load the whole collection into memory. Instead, each item is produced and returned as it is requested. This reduces memory usage when a collection contains millions of elements.

A side effect of this is that some functions - such as ‘len’ - won’t work with generators directly. To determine the total number of elements, the generator must be iterated over (E.G. for i in gen: count += 1). Alternatively, the generator can be converted to a list, which can then be passed to len. However, converting to a list re-introduces the memory usage issues that the generator seeks to avoid.
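E.G. a minimal sketch of counting the items in a generator without building a list. The squares() generator is a hypothetical example, not part of the ArcaPix API. Note that iterating a generator consumes it, so it cannot be looped over a second time.

```python
def squares(n):
    """Yield the squares of 0..n-1 one at a time."""
    for i in range(n):
        yield i * i

gen = squares(5)

# len(gen) would raise TypeError, because a generator has no length.
count = sum(1 for _ in gen)   # counts the items without storing them

# The generator has now been consumed, so nothing is left to iterate.
leftover = list(gen)

print(count)
print(leftover)
```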

Python Functions as Objects

In Python, functions are objects. Function objects are evaluated (‘called’) by adding a set of brackets.

>>> sum
<built-in function sum>

>>> sum([1,2,3])
6

This behaviour allows functions to be assigned to variables or passed into other functions.

E.G. Function objects sum and len are passed to another function (apply_function).

>>> def apply_function(fn, lst):
...    return fn(lst)

>>> apply_function(sum, [1, 2, 3]) # returns sum([1, 2, 3])
6

>>> apply_function(len, [1, 2, 3]) # returns len([1, 2, 3])
3
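Assignment works the same way as passing. The sketch below binds built-in functions to other names and stores them in a dictionary. The names total, summaries and report are purely illustrative.

```python
# Bind the function object to a second name (no brackets, so it is not called).
total = sum
result = total([1, 2, 3])

# Function objects can be stored in collections like any other value.
summaries = {"total": sum, "count": len, "largest": max}
report = {name: fn([4, 8, 15]) for name, fn in summaries.items()}

print(result)
print(report)
```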

Anonymous Functions

Python provides the ability to create ‘anonymous’ functions using the lambda keyword. These functions are ‘anonymous’ because they are not bound to a function name.

E.G. the conventional way of creating a function to add together two numbers:

>>> def add(x, y):
...    return x + y

>>> add(3, 4)
7

Or the same function using the lambda syntax:

>>> add = lambda x, y : x + y

>>> add(3, 4)
7

The format of a lambda-based function is:

>>> function_name = lambda var1, var2, ... : return_expression

Lambda functions can be defined ‘in-line’ at runtime, without the requirement to name the function.

If we wanted to pass the add() function into a List Processing Rule, we could define the add() function and pass the function by name:

>>> ListProcessingRule(processor=add)

or the function could be defined ‘in-line’ utilising lambda:

>>> ListProcessingRule(processor=lambda x, y: x + y)

Both methods are equally valid, but for simple functions using lambda is more compact. Typically, lambda is utilised where the function is not required for reuse.
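The same choice between a named function and an in-line lambda appears in the standard library. E.g. the key argument of sorted() accepts either. This is a sketch with made-up data, independent of the ArcaPix API.

```python
# Hypothetical (name, size) pairs used only for illustration.
files = [("c.txt", 300), ("a.txt", 20), ("b.txt", 1000)]

# Named function, passed by name.
def get_size(item):
    return item[1]

by_size_named = sorted(files, key=get_size)

# Equivalent in-line lambda, used once and never named.
by_size_inline = sorted(files, key=lambda item: item[1])

print(by_size_inline)
```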

Things You Can’t Do With Lambdas

Lambdas cannot perform assignment. E.G. dictionary assignment is not possible:

>>> reducefn = lambda d, (key,val): d[key] = val
SyntaxError: can't assign to lambda

Lambdas can technically update collections. E.G.:

>>> reducefn = lambda mylist, val: mylist.append(val)

>>> result = [1, 2, 3]
>>> reducefn(result, 4)
>>> result
[1, 2, 3, 4]

However, Python’s list.append() always returns None, so the lambda also returns None rather than the updated list.

>>> result = reducefn(result, 5)
>>> print(result)
None

This is a problem for MapReduceRules. Instead, a function needs to be defined which explicitly returns the updated list:

>>> def reducefn(lst, val):
...    lst.append(val)
...    return lst
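To see why the return value matters, the sketch below uses functools.reduce from the standard library as a stand-in for a reduce step. How a MapReduceRule actually invokes its reduce function is an assumption here; the point is only that each call receives the previous call's return value. The names append_value and set_item are hypothetical.

```python
from functools import reduce

def append_value(lst, val):
    lst.append(val)
    return lst          # explicitly hand the updated list back

def set_item(d, item):
    key, val = item
    d[key] = val        # assignment is fine inside a def
    return d

good_list = reduce(append_value, [1, 2, 3], [])
good_dict = reduce(set_item, [("a", 1), ("b", 2)], {})

# The lambda version returns None after the first call, so the
# second call tries None.append(...) and fails.
try:
    reduce(lambda lst, val: lst.append(val), [1, 2, 3], [])
    failure = None
except AttributeError as err:
    failure = str(err)

print(good_list)
print(good_dict)
print(failure)
```

A lambda can still be used if its expression evaluates to the new collection, E.G. lambda lst, val: lst + [val], at the cost of building a new list on every call.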

One-Liners and Virtue of Clarity

Python provides comprehensions and generator expressions, which can compress a for-loop into a single line.

E.G. a for-loop in traditional long style:

def total_size(file_list):
    result = 0
    for f in file_list:
        result = result + f.filesize
    return result

The for-loop can be compressed into a single line:

def total_size(file_list):
   return sum(f.filesize for f in file_list)

As the function is on a single line, the lambda syntax can be utilised to define the function in-line.

E.G. A ListProcessingRule is defined using lambda in-line style:

ListProcessingRule( 'size_list', processor=lambda file_list: sum(f.filesize for f in file_list) )

Aside from compactness there is no run-time advantage to defining a function inline rather than standalone or to using one line rather than many.
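The three styles can be checked against each other. In the sketch below, FileRecord is a hypothetical stand-in for the file objects supplied by the API, and is assumed only to expose a filesize attribute.

```python
from collections import namedtuple

# Hypothetical stand-in for the API's file objects.
FileRecord = namedtuple("FileRecord", ["pathname", "filesize"])

files = [FileRecord("/data/a.bin", 100), FileRecord("/data/b.bin", 250)]

def total_size_loop(file_list):
    result = 0
    for f in file_list:
        result = result + f.filesize
    return result

def total_size_short(file_list):
    return sum(f.filesize for f in file_list)

total_size_inline = lambda file_list: sum(f.filesize for f in file_list)

# All three produce the same total.
totals = [fn(files) for fn in (total_size_loop, total_size_short, total_size_inline)]
print(totals)
```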

It is advisable not to write all functions in-line. E.G. try to quickly determine what this ‘one-liner’ achieves:

MapReduceRule('path_sizes',
              mapfn = lambda x: Counter(
              {"/".join(x.pathname.split("/")[:i]):x.filesize
              for i in range(2,x.pathname.count("/")+1)}),
              output = lambda x: "\n".join("%s  %s" %
              (name.ljust(max(len(i) for i in x.keys())),
              str(size).rjust(10)) for (name, size) in sorted(x.items()))
              )

For simple functions, in-line definitions are perfectly applicable. However, more often than not, clarity is far superior to brevity.
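For comparison, the ‘one-liner’ above could be unpacked into two named functions along the following lines. This is a sketch rather than the definitive form: the names directory_sizes and format_sizes are illustrative, and FileRecord is a hypothetical stand-in for the object passed to mapfn, assumed to expose pathname and filesize.

```python
from collections import Counter, namedtuple

# Hypothetical stand-in for the API's file objects.
FileRecord = namedtuple("FileRecord", ["pathname", "filesize"])

def directory_sizes(f):
    """Map a file to a Counter of {parent directory: file size}."""
    parts = f.pathname.split("/")
    depth = f.pathname.count("/")
    # parts[:i] for i = 2..depth gives each parent directory in turn.
    return Counter({"/".join(parts[:i]): f.filesize
                    for i in range(2, depth + 1)})

def format_sizes(sizes):
    """Format {directory: size} as aligned text, sorted by directory."""
    width = max(len(name) for name in sizes)
    lines = []
    for name, size in sorted(sizes.items()):
        lines.append("%s  %s" % (name.ljust(width), str(size).rjust(10)))
    return "\n".join(lines)

example = FileRecord("/a/b/c.txt", 10)
print(format_sizes(directory_sizes(example)))
```

The functions could then be passed by name (mapfn=directory_sizes, output=format_sizes), which keeps the rule definition short while leaving the logic readable.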

Serialisation

Serialisation (also known as pickling) is used to convert a Python function to a byte string - data that can be stored and passed to the relevant driver script provided by the ArcaPix API.

Cloudpickle

The ArcaPix API uses the cloudpickle library for serialisation. Verify that the ‘cloudpickle’ library is installed on your system by running:

$ pip show cloudpickle
---
Metadata-Version: 1.1
Name: cloudpickle
Version: 0.2.1

If cloudpickle is not installed, it can be added by performing a pip install:

$ pip install cloudpickle

Make sure you have the latest version and that all nodes in your cluster have the same version.
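As background, one practical difference between cloudpickle and the standard-library pickle module is that pickle stores a function by reference (its module and name), so it cannot serialise a lambda. The sketch below uses only the standard library to show that limitation; it does not call cloudpickle or the ArcaPix API.

```python
import pickle

double = lambda x: x * 2

# pickle looks the function up by name, and a lambda has no importable name.
try:
    pickle.dumps(double)
    lambda_pickled = True
except (pickle.PicklingError, AttributeError):
    lambda_pickled = False

# A named, importable function is stored by reference and round-trips fine.
restored_len = pickle.loads(pickle.dumps(len))

print(lambda_pickled)
print(restored_len([1, 2, 3]))
```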

Debugging Serialisation

To determine whether a callable can be (de)serialised properly, perform a test via the Python console:

>>> from arcapix.fs.gpfs import serialise

>>> def my_func():
...    pass  # function code

>>> serialise(my_func)
'eJyFi81KxDAABtOfXde…'

Copy the resulting serialisation, restart the Python console, then de-serialise the callable:

>>> from arcapix.fs.gpfs import deserialise

>>> f = deserialise('eJyFi81KxDAABtOfX...')
>>> f
<function my_func at 0x7fe5d0810230>

An error will be raised if (de)serialising cannot be performed. It is advisable to call the de-serialised function to ensure it still exhibits the required behaviour.
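The behaviour check can be scripted. The helper below is a hypothetical sketch: it accepts any pair of serialise/deserialise callables, and is demonstrated with the standard-library pickle module so that it runs without the API. The assumption is that the API's serialise and deserialise functions could be passed in the same way. Because it runs in a single process, it does not replace the console restart described above.

```python
import math
import pickle

def survives_round_trip(fn, dumps, loads, *args):
    """Serialise fn, restore it, and compare results on sample arguments."""
    restored = loads(dumps(fn))
    return restored(*args) == fn(*args)

# Stand-in serialiser: pickle handles importable functions such as math.sqrt.
ok = survives_round_trip(math.sqrt, pickle.dumps, pickle.loads, 16.0)
print(ok)
```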

Note

Pickles aren’t compatible between Python versions. A function serialised in Python 2 can’t be deserialised in Python 3 (or vice versa).

Handling Import Issues

A typical situation that might cause a de-serialisation error is when a function relies on a module which cannot be imported by the relevant driver scripts.

In such instances it can help to include any import statements within your function definition, and utilise try-except where applicable:

def tweet():
    import sys  # imported inside the function so it travels with it
    try:
        import twitter
    except ImportError as err:
        sys.stderr.write("Error: failed to import module ({})\n".format(err))
    ...
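A variation on this pattern is to fall back to an alternative module when the preferred one is unavailable. The sketch below is illustrative only: to_json is a hypothetical function, simplejson is an optional third-party package that may not be installed, and json is the standard-library fallback.

```python
def to_json(record):
    # Imports live inside the function so they are resolved wherever
    # the function is eventually run.
    try:
        import simplejson as json   # optional third-party package
    except ImportError:
        import json                 # standard-library fallback
    return json.dumps(record, sort_keys=True)

print(to_json({"b": 1, "a": 2}))
```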