GSoC’13 Summary-2: NumPy’s Bottlenecks and their Removal
In the last post, I mentioned the tools I used for profiling NumPy. The call graph made it a bit simpler to detect time-consuming methods. But the time taken only indicates the possibility of a bottleneck, since many time-consuming methods are already highly optimized. Hence, critically reading the code alongside the call graph is the trick.
Identifying bottlenecks
The following types of bottlenecks were observed:
Overhead for smaller arrays
- NumPy used to release the Global Interpreter Lock (GIL) for all one- and two-operand loops. But for a short-length array, doing so produced relative overhead instead.
- So, releasing the GIL for small operations was not beneficial at all.
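A toy cost model makes the overhead concrete: saving and restoring the thread state is a fixed price per call, while the loop pays per element, so for short loops the fixed price dominates. The cost constants below are illustrative assumptions, not measurements.

```c
#include <assert.h>

/* Hypothetical costs, in arbitrary units: releasing and re-acquiring the
 * GIL is a fixed price per call; the loop pays per element. */
#define RELEASE_COST  100
#define WORK_PER_ELEM   1

static long cost_with_gil_release(long n)
{
    return RELEASE_COST + n * WORK_PER_ELEM;
}

static long cost_without_gil_release(long n)
{
    return n * WORK_PER_ELEM;
}
```

In this model the release only amortizes once the loop is long enough to cover the fixed cost, which is exactly why a threshold on the loop size helps.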
Redundant code doing extra work
- PyArray_CanCastArrayTo used to run its check even when newtype is NULL. In PyArray_FromArray, it was found that if the argument newtype is NULL, it takes the value of oldtype. So, technically, it ends up checking whether casting is possible to the same type, which is useless.
- NumPy used to convert a Python scalar into its matching NumPy scalar (e.g. PyLong -> int32) and then extract the C value from the NumPy scalar.
- Clearing the floating-point error flags at every function call, then checking them. This happened unconditionally, even when there was no need to.
Improper structure
- For every single operation call, NumPy has to extract the values of buffersize, errormask, and name to build an error object. The two functions involved, _extract_pyvals and PyUFunc_GetPyValues, together use significant time, all of it spent looking up entries in a Python dict, extracting them, and converting them into C-level data. Not once, but again and again on every operation, and the result remains unused if no error occurs.
- The loop selection method for scalar operations is inefficient and consumes time: it checks all the dtypes associated with a function, one by one, from its types array.
Not-so-effective memory allocation
- When allocating the return value, NumPy allocates memory twice: once for the array object itself, and a second time for the array data.
- Also, shapes and strides together have 2*ndim elements, but to hold them NumPy allocates a memory region sized for 3*ndim elements.
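As a minimal sketch of what the second bullet implies (with `long` standing in for `npy_intp`, which is an assumption of this example, not NumPy's real layout), the shape-and-strides block only ever needs 2*ndim entries:

```c
#include <assert.h>
#include <stdlib.h>

/* Stand-in for npy_intp. */
typedef long intp_t;

/* Shape and strides together need exactly 2*ndim entries, so sizing the
 * block for 2*ndim (instead of 3*ndim) wastes nothing. */
static intp_t *alloc_shape_and_strides(int ndim)
{
    return malloc(2 * (size_t)ndim * sizeof(intp_t));
}
```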
Removing bottlenecks
I really appreciate the effort and mentoring of Charles Harris, which provided the lead for removing the above-mentioned bottlenecks, making NumPy a bit quicker for smaller arrays too.
Removing known overhead for smaller arrays
By avoiding the GIL release for smaller arrays. NPY_BEGIN_THREADS_THRESHOLDED(loop_size), a variant of the NPY_BEGIN_THREADS macro, has been created in ndarraytypes.h with a threshold. The threshold has been conservatively taken as 500; the loss from guessing too low can be neglected.
#define NPY_BEGIN_THREADS_THRESHOLDED(loop_size) do { \
        if (loop_size > 500) { \
            _save = PyEval_SaveThread(); \
        } \
    } while (0);
Avoid doing extra work, remove redundancy
In PyArray_FromArray, it is better to bypass the cast check when newtype is NULL and flags is 0. The following improvement was made:
oldtype = PyArray_DESCR(arr);
if (newtype == NULL) {
    /*
     * Check if object is an array with NULL newtype.
     * If so, return it directly instead of checking for casting.
     */
    if (flags == 0) {
        Py_INCREF(arr);
        return (PyObject *)arr;
    }
    newtype = oldtype;
    Py_INCREF(oldtype);
}
Avoiding conversion of known dtypes to NumPy scalars
When trying to extract the underlying C value from a Python integer, NumPy converts the Python scalar into its matching NumPy scalar (e.g. PyLong -> int32) and then extracts the C value from the NumPy scalar.
Raul Cota optimized this for floats by avoiding the conversion. I extended it to integers, which was a bit messy because different OSes handle int differently. I avoided the conversion for known integer types (byte, short, int, long) and extracted the value directly:
#if PY_VERSION_HEX >= 0x03000000
    if (PyLong_CheckExact(a)) {
        *arg1 = PyLong_AsLong(a);
        return 0;
    }
#else
    if (PyInt_CheckExact(a)) {
        *arg1 = PyInt_AS_LONG(a);
        return 0;
    }
#endif
Avoiding the heavy floating-point flag clear
Nathaniel pointed out something mysterious with the floating-point error flags. UFUNC_CHECK_STATUS is a single macro that both checks the status and clears the errors; it clears the error flags every time after checking. The clear operation should be avoided when it is not needed, as it is a bit expensive and takes a significant amount of time. I changed the code to skip the clear when it is unnecessary: before each ufunc loop, PyUFunc_clearfperr() checks the error flags and clears them only if necessary. Now the check result in the macro doesn't get ignored, unlike before.
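The check-then-clear pattern can be sketched in isolation. This is a toy model with an invented status word and counters (the real code touches the hardware FP status flags, where the clear is the expensive part):

```c
#include <assert.h>

/* Toy model of the FP status word; count how often the expensive
 * clear path actually runs. */
static int status_flags = 0;
static int clear_count = 0;

static void raise_fp_flag(void)
{
    status_flags = 1;
}

/* Check first, and pay for the clear only when a flag is actually set,
 * instead of clearing unconditionally on every call. */
static int check_and_maybe_clear(void)
{
    int raised = status_flags;
    if (raised) {
        status_flags = 0;
        clear_count++;
    }
    return raised;
}
```

On the common no-error path the clear never runs, which is where the time is saved.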
Refactoring
Deferred the creation of the error object to when an error actually occurs. A huge amount of time used to be consumed by PyUFunc_GetPyValues building the error object.
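The deferral can be sketched as follows. All names here (`errobj_t`, `build_errobj`, `run_op`) are invented stand-ins for the real error-object machinery; the point is only that the expensive construction moves onto the error path:

```c
#include <assert.h>
#include <stdlib.h>

/* Invented stand-in for the real error object. */
typedef struct {
    int bufsize;
    int errmask;
    const char *name;
} errobj_t;

static int build_count = 0;

/* Stands in for the expensive PyUFunc_GetPyValues-style construction. */
static errobj_t *build_errobj(void)
{
    errobj_t *e = malloc(sizeof(*e));
    e->bufsize = 8192;
    e->errmask = 0;
    e->name = "add";
    build_count++;
    return e;
}

/* The error object is built lazily, only when an error actually occurs. */
static int run_op(int error_occurred, errobj_t **out)
{
    *out = NULL;
    if (error_occurred) {
        *out = build_errobj();
        return -1;
    }
    return 0;
}
```

Across a thousand successful calls the builder never runs, whereas the old structure paid for it on every call.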
Replaced the loop with specialized conditions. Most of the functions share identical signatures; e.g. the sets (add, subtract) and (arccos, arcsin, arctan, arcsinh, arccosh) each share the same signature array. In fact, there are only 32 distinct signature arrays. I wrote a code generator to identify them and emit specialized conditions for each distinct signature array.
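A sketch of the idea, with made-up type codes (the real types arrays hold NumPy dtype numbers): the generic path scans the array linearly on every call, while the generated path tests the few entries of one known signature array directly.

```c
#include <assert.h>

/* Made-up type codes for illustration. */
enum { T_INT32 = 5, T_INT64 = 7, T_FLOAT64 = 12, T_NOTFOUND = -1 };

/* Generic path: scan the function's types array linearly. */
static int select_loop_linear(const int *types, int ntypes, int wanted)
{
    for (int i = 0; i < ntypes; i++) {
        if (types[i] == wanted) {
            return i;
        }
    }
    return T_NOTFOUND;
}

/* Generated path: test the entries of one known signature array
 * directly, as the code generator would emit for each of the 32
 * distinct signature arrays. */
static int select_loop_specialized(int wanted)
{
    if (wanted == T_INT32) {
        return 0;
    }
    if (wanted == T_INT64) {
        return 1;
    }
    if (wanted == T_FLOAT64) {
        return 2;
    }
    return T_NOTFOUND;
}
```

Both paths return the same loop index; the specialized one simply skips the per-call scan.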
I also tried to merge the two memory allocations for the return value into one, but that would have taken a huge amount of refactoring, so it moved out of scope for GSoC.
Something Interesting…
I came to know about tinyarray, which is similar to a NumPy array but optimized for small sizes. Christoph Groth, who developed it, used more compact objects: for example, the array object is a PyVarObject rather than a PyObject, and the array data is held in the same structure as the Python object itself, so only a single memory allocation per array is done. This could be a good starting point for anyone who wants to optimize memory allocation.
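The single-allocation layout can be sketched with a C99 flexible array member. This mirrors the idea (header and data in one block), not tinyarray's actual structures:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Header and data share one block, so a single malloc covers both. */
typedef struct {
    size_t nelem;
    double data[];   /* C99 flexible array member; data follows the header */
} small_array;

static small_array *small_array_new(size_t nelem)
{
    small_array *a = malloc(sizeof(small_array) + nelem * sizeof(double));
    if (a != NULL) {
        a->nelem = nelem;
        memset(a->data, 0, nelem * sizeof(double));
    }
    return a;
}
```

One allocation, one free, and better cache locality than a separate header and data block.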
And more…
It is not possible to write about all the coding and pull requests in this post, but all those modifications are documented in a Google doc: http://goo.gl/M3vrnz
Originally published at http://arinkverma1.wordpress.com on September 24, 2013.