Function application over numpy's matrix row/column

Question:

I am using NumPy to store data in matrices. Coming from an R background, I am used to an extremely simple way to apply a function over the rows, the columns, or both of a matrix.

Is there something similar for the Python/NumPy combination? It’s not a problem to write my own little implementation, but it seems to me that most of the versions I come up with will be significantly less efficient and more memory-intensive than an existing implementation.

I would like to avoid copying from the numpy matrix to a local variable etc., is that possible?

The functions I am trying to implement are mainly simple comparisons (e.g. how many elements of a certain column are smaller than number x or how many of them have absolute value larger than y).

Asked By: petr


Answers:

Almost all numpy functions operate on whole arrays, and/or can be told to operate on a particular axis (row or column).

As long as you can define your function in terms of numpy functions acting on numpy arrays or array slices, your function will automatically operate on whole arrays, rows or columns.
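As a rough illustration of the axis argument, here is a minimal sketch. The matrix M and its values are made up for the example; they are not from the question.

```python
import numpy as np

# Hypothetical 3x4 example matrix
M = np.array([[ 1, -5,  3, 8],
              [ 4,  2, -7, 0],
              [-6,  9,  1, 2]])

col_means = M.mean(axis=0)   # axis=0 collapses the rows: one value per column
row_max = M.max(axis=1)      # axis=1 collapses the columns: one value per row
total = M.sum()              # no axis given: reduce over the whole array

print(col_means)
print(row_max)   # [8 4 9]
print(total)     # 12
```

No explicit loop and no copy into a temporary Python list is needed; the reduction runs inside NumPy.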

It may be more helpful to ask about how to implement a particular function to get more concrete advice.


Numpy provides np.vectorize and np.frompyfunc to turn Python functions which operate on numbers into functions that operate on numpy arrays.

For example,

import numpy as np

def myfunc(a, b):
    # return the larger of two scalars
    if a > b:
        return a
    else:
        return b

vecfunc = np.vectorize(myfunc)
result = vecfunc([[1, 2, 3], [5, 6, 9]], [7, 4, 5])
print(result)
# [[7 4 5]
#  [7 6 9]]

(The elements of the first array get replaced by the corresponding element of the second array when the second is bigger.)

But don’t get too excited; np.vectorize and np.frompyfunc are just syntactic sugar. They don’t actually make your code any faster. If your underlying Python function is operating on one value at a time, then np.vectorize will feed it one item at a time, and the whole operation is going to be pretty slow (compared to using a numpy function which calls some underlying C or Fortran implementation).
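For this particular function a built-in alternative exists: np.maximum computes the element-wise maximum in compiled code. The sketch below redefines myfunc only to check that both approaches agree; the names a and b are chosen for the example.

```python
import numpy as np

a = np.array([[1, 2, 3], [5, 6, 9]])
b = np.array([7, 4, 5])

# Element-wise maximum; b is broadcast across the rows of a
result = np.maximum(a, b)
print(result)
# [[7 4 5]
#  [7 6 9]]

# Same values as the np.vectorize version, without a Python-level call per element
def myfunc(x, y):
    return x if x > y else y

same = np.array_equal(result, np.vectorize(myfunc)(a, b))
print(same)  # True
```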


To count how many elements of column x are smaller than a number y, you could use an expression such as:

(array['x']<y).sum()

For example:

import numpy as np
array=np.arange(6).view([('x',int),('y',int)])  # np.int was removed in NumPy 1.24; the builtin int works
print(array)
# [(0, 1) (2, 3) (4, 5)]

print(array['x'])
# [0 2 4]

print(array['x']<3)
# [ True  True False]

print((array['x']<3).sum())
# 2
Answered By: unutbu

Selecting elements from a NumPy array based on one or more conditions is straightforward using NumPy’s beautifully dense syntax:

>>> import numpy as NP
>>> # generate a matrix to demo the code
>>> A = NP.random.randint(0, 10, 40).reshape(8, 5)
>>> A
  array([[6, 7, 6, 4, 8],
         [7, 3, 7, 9, 9],
         [4, 2, 5, 9, 8],
         [3, 8, 2, 6, 3],
         [2, 1, 8, 0, 0],
         [8, 3, 9, 4, 8],
         [3, 3, 9, 8, 4],
         [5, 4, 8, 3, 0]])

How many elements in column 2 (zero-based, i.e. the third column) are greater than 6?

>>> ndx = A[:,2] > 6
>>> ndx
      array([False,  True, False, False,  True,  True,  True,  True])
>>> NP.sum(ndx)
      5

How many elements in the last column of A have an absolute value larger than 3?

>>> A = NP.random.randint(-4, 4, 40).reshape(8, 5)
>>> A
  array([[-4, -1,  2,  0,  3],
         [-4, -1, -1, -1,  1],
         [-1, -2,  2, -2,  3],
         [ 1, -4, -1,  0,  0],
         [-4,  3, -3,  3, -1],
         [ 3,  0, -4, -1, -3],
         [ 3, -4,  0, -3, -2],
         [ 3, -4, -4, -4,  1]])

>>> ndx = NP.abs(A[:,-1]) > 3
>>> NP.sum(ndx)
      0

How many elements in the first two rows of A are greater than or equal to 2?

>>> ndx = A[:2,:] >= 2
>>> NP.sum(ndx.ravel())    # 'ravel' just flattens ndx, which is originally 2D (2x5)
      2
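The same idea extends to all columns (or rows) at once by summing the boolean mask along an axis. This is a minimal sketch with a small made-up matrix B and made-up thresholds x and y, not the random A above:

```python
import numpy as np

# Small hypothetical matrix and thresholds
B = np.array([[-4, -1,  2],
              [ 3,  0, -5],
              [ 1,  6, -2]])
x, y = 1, 2

# Number of elements smaller than x in each column
smaller_per_col = (B < x).sum(axis=0)

# Number of elements with absolute value larger than y in each column
large_abs_per_col = (np.abs(B) > y).sum(axis=0)

# Same test per row, using count_nonzero instead of sum
large_abs_per_row = np.count_nonzero(np.abs(B) > y, axis=1)

print(smaller_per_col)     # [1 2 2]
print(large_abs_per_col)   # [2 1 1]
print(large_abs_per_row)   # [1 2 1]
```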

NumPy’s indexing syntax is pretty close to R’s; given your fluency in R, here are the key differences between R and NumPy in this context:

NumPy indices are zero-based; in R, indexing begins with 1.

NumPy (like Python) allows you to index from right to left using negative indices, e.g.:

# to get the last column in A
A[:, -1]

# to get the penultimate column in A
A[:, -2]

# this is a big deal, because in R, the equivalent expression is:
A[, ncol(A) - 1]

NumPy uses the colon “:” to denote “unsliced” (take everything along that axis), where R leaves the position empty. For example, to get the first three rows of A in R, you would use A[1:3, ]. In NumPy, you would use A[0:3, :], because the stop index is exclusive; the “0” is not necessary, and A[:3, :] is preferable.
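The indexing differences above can be checked on a small array. This is only a sketch: the 4x5 matrix is made up, and the R equivalents appear in comments for comparison.

```python
import numpy as np

A = np.arange(20).reshape(4, 5)   # hypothetical 4x5 matrix holding 0..19

first_three_rows = A[:3, :]   # R: A[1:3, ]
last_col = A[:, -1]           # R: A[, ncol(A)]
penultimate_col = A[:, -2]    # R: A[, ncol(A) - 1]

print(first_three_rows.shape)   # (3, 5)
print(last_col)                 # [ 4  9 14 19]
print(penultimate_col)          # [ 3  8 13 18]

# Basic slices are views on A's memory, not copies
is_view = np.shares_memory(A, last_col)
print(is_view)                  # True
```

Since basic slicing returns views, selecting a row or column this way does not copy the data, which speaks to the concern raised in the question.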

Answered By: doug

I also come from more of an R background, and bumped into the lack of a more versatile apply that could take short customized functions. I’ve seen forum posts suggesting basic numpy functions, because many of them handle arrays. However, I’ve been getting confused over the way “native” numpy functions handle arrays (sometimes 0 seems to be row-wise and 1 column-wise, sometimes the opposite).
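For what it is worth, here is a small sketch of how the axis number behaves for a reduction and for apply_along_axis on a 2-D array. The matrix M is made up. In both cases axis=0 yields one result per column (like MARGIN=2 in R's apply) and axis=1 yields one result per row (like MARGIN=1).

```python
import numpy as np

M = np.array([[1, 2, 3],
              [4, 5, 6]])

# axis=0: the function sees each column in turn -> one result per column
print(M.sum(axis=0))                       # [5 7 9]
print(np.apply_along_axis(np.sum, 0, M))   # [5 7 9]

# axis=1: the function sees each row in turn -> one result per row
print(M.sum(axis=1))                       # [ 6 15]
print(np.apply_along_axis(np.sum, 1, M))   # [ 6 15]
```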

My personal solution for more flexible functions with apply_along_axis was to combine it with the anonymous lambda functions available in Python. Lambda functions should be very easy to understand for R-minded people who use a more functional programming style, as with the R functions apply, sapply, lapply, etc.

So, for example, I wanted to standardise the variables in a matrix. Typically in R there’s a function for this (scale), but you can also build it easily with apply:

(R code)

apply(Mat,2,function(x) (x-mean(x))/sd(x) ) 

You see how the body of the function inside apply (x-mean(x))/sd(x) is the bit we can’t type directly for the python apply_along_axis. With lambda this is easy to implement FOR ONE SET OF VALUES, so:

(Python)

import numpy as np
vec=np.random.randint(1,10,10)  # some random data vector of integers

(lambda x: (x-np.mean(x))/np.std(x)  )(vec)

Then, all we need is to plug this inside the python apply and pass the array of interest through apply_along_axis

Mat=np.random.randint(1,10,3*4).reshape((3,4))  # some random 3x4 data matrix
np.apply_along_axis(lambda x: (x-np.mean(x))/np.std(x),0,Mat )
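As a side note, the same standardisation can be written without apply_along_axis by letting NumPy broadcast the per-column mean and standard deviation. The sketch below uses a fixed, made-up matrix so the comparison is reproducible. Note that np.std divides by n by default, whereas R's sd divides by n-1; passing ddof=1 gives the R behaviour.

```python
import numpy as np

# Fixed hypothetical 3x4 matrix (every column has non-zero spread)
Mat = np.array([[1., 5., 9., 2.],
                [4., 2., 7., 8.],
                [6., 3., 1., 5.]])

# Column-wise standardisation via apply_along_axis and a lambda
via_apply = np.apply_along_axis(lambda x: (x - np.mean(x)) / np.std(x), 0, Mat)

# Same computation via broadcasting: column means/stds have shape (4,)
via_broadcast = (Mat - Mat.mean(axis=0)) / Mat.std(axis=0)

print(np.allclose(via_apply, via_broadcast))   # True

# Variant matching R's sd(), which uses the n-1 denominator
r_like = (Mat - Mat.mean(axis=0)) / Mat.std(axis=0, ddof=1)
```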

Obviously, the lambda function could be implemented as a separate function, but I guess the whole point is to use rather small functions contained within the line where apply originated.

I hope you find it useful!

Answered By: markcelo

Pandas is very useful for this. For instance, DataFrame.apply() and groupby’s apply() should help you.
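A minimal sketch of what that might look like, assuming pandas is installed. The DataFrame, its column names, and the thresholds are made up for illustration.

```python
import pandas as pd

# Hypothetical data: two numeric columns and a grouping column
df = pd.DataFrame({
    "x": [0, 2, 4, 7],
    "y": [-5, 1, -3, 2],
    "group": ["a", "a", "b", "b"],
})

# DataFrame.apply runs the function once per column by default (axis=0):
# count the values below 3 in each numeric column
below_3 = df[["x", "y"]].apply(lambda col: (col < 3).sum())

# groupby + apply: count values of y with absolute value above 2 in each group
big_abs_y = df.groupby("group")["y"].apply(lambda s: (s.abs() > 2).sum())

print(below_3)
print(big_abs_y)
```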

Answered By: Peter Battaglia