How to find all occurrences of a substring in a numpy string array

Question:

I’m trying to find all occurrences of a substring in a numpy string array. Let’s say:

myArray = np.array(['Time', 'utc_sec', 'UTC_day', 'Utc_Hour'])
sub = 'utc'

The search should be case-insensitive, so the method should return the indices [1, 2, 3].

Asked By: eljamba


Answers:

You can loop over the array and test each lowercased item with the in operator:

import numpy as np

myArray = np.array(['Time', 'utc_sec', 'UTC_day', 'Utc_Hour'])
sub = 'utc'

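found = []
# enumerate yields each element's index alongside its value,
# so the stored index is correct wherever the matches occur
for index, item in enumerate(myArray):
    if sub in item.lower():
        found.append(index)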

print(found)

output:

[1, 2, 3]
Answered By: Filipdominik

A vectorized approach using np.char.lower and np.char.find:

import numpy as np
myArray = np.array(['Time', 'utc_sec', 'UTC_day', 'Utc_Hour'])
res = np.where(np.char.find(np.char.lower(myArray), 'utc') > -1)[0]
print(res)

Output

[1 2 3]

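The idea is to lowercase the array with np.char.lower so that np.char.find becomes case-insensitive. np.char.find returns -1 for elements that do not contain the substring, so np.where on the condition > -1 gives the indices of the elements that do contain it.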
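As a minimal sketch, the same idea can be wrapped in a reusable function. The name find_substring_indices is hypothetical and not part of the original answer, and np.flatnonzero is swapped in for np.where(...)[0]; the sketch assumes a one-dimensional string array and also lowercases the substring, so the caller does not have to.

```python
import numpy as np


def find_substring_indices(arr, sub):
    """Return the indices of the elements of arr that contain sub, ignoring case.

    Hypothetical helper for illustration; assumes a 1-D array of strings.
    """
    lowered = np.char.lower(arr)
    # np.char.find gives the position of the first match in each element,
    # or -1 when the substring is absent
    positions = np.char.find(lowered, sub.lower())
    # flatnonzero returns the indices where the condition is True
    return np.flatnonzero(positions > -1)


myArray = np.array(['Time', 'utc_sec', 'UTC_day', 'Utc_Hour'])
print(find_substring_indices(myArray, 'UTC'))  # [1 2 3]
print(find_substring_indices(myArray, 'xyz'))  # []
```

When nothing matches, the function returns an empty array rather than raising, which is usually easier to handle in calling code.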

Answered By: Dani Mesejo

We can use a list comprehension to get the right indexes:

occ = [i for i in range(len(myArray)) if 'utc' in myArray[i].lower()]

Output

>>> print(occ)
[1, 2, 3]

Let’s generalize this: we can write a function that returns the indexes of the elements containing a given substring in a numpy string array.

def get_occ_idx(sub, np_array):
    """ Indexes of the elements of a numpy string array that contain a substring
    """
    
    # compare against sub.lower() so substrings without letters (e.g. '_') are accepted
    assert sub == sub.lower(), f"Your substring '{sub}' must be lower case (should be : {sub.lower()})"
    # every element must be a string (np.str_ is a subclass of str)
    assert all(isinstance(x, str) for x in np_array), "All items in the array must be strings"
    # at least one element must contain the substring
    assert any(sub in x.lower() for x in np_array), f"There is no occurrence of substring :'{sub}'"
    
    occ = [i for i in range(len(np_array)) if sub in np_array[i].lower()]
    
    return occ
Answered By: Khaled DELLAL