Find the nth occurrence of substring in a string

Question:

This seems like it should be pretty trivial, but I am new at Python and want to do it the most Pythonic way.

I want to find the index corresponding to the n’th occurrence of a substring within a string.

There’s got to be something equivalent to what I WANT to do which is

mystring.find("substring", 2nd)

How can you achieve this in Python?

Asked By: prestomation


Answers:

I’d probably do something like this, using the find function that takes an index parameter:

def find_nth(s, x, n):
    # start at -len(x) so that the first search begins at index 0
    i = -len(x)
    for _ in range(n):
        i = s.find(x, i + len(x))
        if i == -1:
            break
    return i

print(find_nth('bananabanana', 'an', 3))

It’s not particularly Pythonic I guess, but it’s simple. You could do it using recursion instead:

def find_nth(s, x, n, i = 0):
    i = s.find(x, i)
    if n == 1 or i == -1:
        return i 
    else:
        return find_nth(s, x, n - 1, i + len(x))

print(find_nth('bananabanana', 'an', 3))

It’s a functional way to solve it, but I don’t know if that makes it more Pythonic.

Answered By: Mark Byers

Mark’s iterative approach would be the usual way, I think.

Here’s an alternative with string-splitting, which can often be useful for finding-related processes:

def findnth(haystack, needle, n):
    parts= haystack.split(needle, n+1)
    if len(parts)<=n+1:
        return -1
    return len(haystack)-len(parts[-1])-len(needle)
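
To see why the arithmetic in the split-based version works, here is a small trace. It is only a sketch: the function is repeated so the snippet runs on its own, and note that n is zero-based here (n=0 asks for the first occurrence).

```python
def findnth(haystack, needle, n):
    # n is zero-based: n=0 asks for the first occurrence
    parts = haystack.split(needle, n + 1)
    if len(parts) <= n + 1:
        return -1
    # everything before the last piece, minus the needle that precedes it
    return len(haystack) - len(parts[-1]) - len(needle)

haystack = "foo bar bar bar"
print(haystack.split("bar", 2))        # ['foo ', ' ', ' bar']
print(findnth(haystack, "bar", 1))     # 8, the second 'bar'
print(findnth(haystack, "bar", 5))     # -1, there is no sixth 'bar'
```

The last piece holds everything after the occurrence we want, so subtracting its length and the needle's length from the total gives the start index.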

And here’s a quick (and somewhat dirty, in that you have to choose some chaff that can’t match the needle) one-liner:

'foo bar bar bar'.replace('bar', 'XXX', 1).find('bar')
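
A short sketch of why the choice of chaff matters for the one-liner: the returned index is only right when the chaff has the same length as the needle and cannot itself match the needle. The variable names below are just for illustration.

```python
text = "foo bar bar bar"

# Same-length chaff: the index of the second 'bar' is preserved.
same_length = text.replace("bar", "XXX", 1).find("bar")

# Shorter chaff shifts everything after the first match left by one.
shorter = text.replace("bar", "XX", 1).find("bar")

# Chaff that contains the needle matches at the first occurrence again.
self_matching = text.replace("bar", "bar", 1).find("bar")

print(same_length, shorter, self_matching)   # 8 7 4
```
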
Answered By: bobince

Understanding that regex is not always the best solution, I’d probably use one here:

>>> import re
>>> s = "ababdfegtduab"
>>> [m.start() for m in re.finditer(r"ab",s)]
[0, 2, 11]
>>> [m.start() for m in re.finditer(r"ab",s)][2] #index 2 is third occurrence 
11
Answered By: Mark Peters

Here is another approach using re.finditer.
The difference is that this only looks into the haystack as far as necessary. Note that n is zero-based here, so n=2 finds the third occurrence.

import re
from itertools import dropwhile
needle='an'
haystack='bananabanana'
n=2
next(dropwhile(lambda x: x[0]<n, enumerate(re.finditer(needle,haystack))))[1].start() 
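
The same lazy idea can also be written with itertools.islice. This is a sketch under a few assumptions: nth_start is a hypothetical helper name, n is 1-based, and the needle is passed through re.escape so it is treated as a literal string rather than a pattern.

```python
import re
from itertools import islice

def nth_start(haystack, needle, n):
    """Return the start of the n-th (1-based) non-overlapping match, or -1."""
    matches = re.finditer(re.escape(needle), haystack)
    # islice skips n-1 matches lazily; the default for next() avoids StopIteration
    match = next(islice(matches, n - 1, None), None)
    return match.start() if match else -1

print(nth_start("bananabanana", "an", 3))   # 7
print(nth_start("bananabanana", "an", 5))   # -1
```

Because finditer is an iterator, scanning stops as soon as the requested match has been produced.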
Answered By: John La Rooy

Here’s a more Pythonic version of the straightforward iterative solution:

def find_nth(haystack, needle, n):
    start = haystack.find(needle)
    while start >= 0 and n > 1:
        start = haystack.find(needle, start+len(needle))
        n -= 1
    return start

Example:

>>> find_nth("foofoofoofoo", "foofoo", 2)
6

If you want to find the nth overlapping occurrence of needle, you can increment by 1 instead of len(needle), like this:

def find_nth_overlapping(haystack, needle, n):
    start = haystack.find(needle)
    while start >= 0 and n > 1:
        start = haystack.find(needle, start+1)
        n -= 1
    return start

Example:

>>> find_nth_overlapping("foofoofoofoo", "foofoo", 2)
3

This is easier to read than Mark’s version, and it doesn’t require the extra memory of the splitting version or importing the regular expression module. It also adheres to a few of the rules in the Zen of Python, unlike the various re approaches:

  1. Simple is better than complex.
  2. Flat is better than nested.
  3. Readability counts.
Answered By: Todd Gamblin

>>> s="abcdefabcdefababcdef"
>>> j=0
>>> for n,i in enumerate(s):
...   if s[n:n+2] =="ab":
...     print(n,i)
...     j=j+1
...     if j==2: print("2nd occurrence at index position: ",n)
...
0 a
6 a
2nd occurrence at index position:  6
12 a
14 a
Answered By: ghostdog74

Here’s another re + itertools version that should work when searching for either a str or a RegexpObject. I will freely admit that this is likely over-engineered, but for some reason it entertained me.

import itertools
import re

def find_nth(haystack, needle, n = 1):
    """
    Find the starting index of the nth occurrence of ``needle`` in 
    ``haystack``.

    If ``needle`` is a ``str``, this will perform an exact substring
    match; if it is a ``RegexpObject``, this will perform a regex
    search.

    If ``needle`` doesn't appear in ``haystack``, return ``-1``. If
    ``needle`` doesn't appear in ``haystack`` ``n`` times,
    return ``-1``.

    Arguments
    ---------
    * ``needle`` the substring (or a ``RegexpObject``) to find
    * ``haystack`` is a ``str``
    * an ``int`` indicating which occurrence to find; defaults to ``1``

    >>> find_nth("foo", "o", 1)
    1
    >>> find_nth("foo", "o", 2)
    2
    >>> find_nth("foo", "o", 3)
    -1
    >>> find_nth("foo", "b")
    -1
    >>> import re
    >>> either_o = re.compile("[oO]")
    >>> find_nth("foo", either_o, 1)
    1
    >>> find_nth("FOO", either_o, 1)
    1
    """
    if (hasattr(needle, 'finditer')):
        matches = needle.finditer(haystack)
    else:
        matches = re.finditer(re.escape(needle), haystack)
    start_here = itertools.dropwhile(lambda x: x[0] < n, enumerate(matches, 1))
    try:
        return next(start_here)[1].start()
    except StopIteration:
        return -1
Answered By: Hank Gay

This will find the second occurrence of substring in string.

def find_2nd(string, substring):
   return string.find(substring, string.find(substring) + 1)

Edit: I haven’t thought much about the performance, but a quick recursion can help with finding the nth occurrence:

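def find_nth(string, substring, n):
    if n == 1:
        return string.find(substring)
    previous = find_nth(string, substring, n - 1)
    if previous == -1:
        # fewer than n - 1 occurrences: stop instead of restarting at index 0
        return -1
    return string.find(substring, previous + 1)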
Answered By: Sriram Murali

The replace one-liner is great, but it only works because XXX and bar have the same length.

A good and general def would be:

def findN(s,sub,N,replaceString="XXX"):
    i = s.replace(sub,replaceString,N-1).find(sub)
    # undo the length shift from the N-1 replacements; keep -1 when not found
    return -1 if i == -1 else i - (len(replaceString)-len(sub))*(N-1)
Answered By: Charles Doutriaux

I’m offering some benchmarking results comparing the most prominent approaches presented so far, namely @bobince’s findnth() (based on str.split()) vs. @tgamblin’s or @Mark Byers’ find_nth() (based on str.find()). I will also compare with a C extension (_find_nth.so) to see how fast we can go. Here is find_nth.py:

def findnth(haystack, needle, n):
    parts= haystack.split(needle, n+1)
    if len(parts)<=n+1:
        return -1
    return len(haystack)-len(parts[-1])-len(needle)

def find_nth(s, x, n=0, overlap=False):
    l = 1 if overlap else len(x)
    i = -l
    for c in xrange(n + 1):
        i = s.find(x, i + l)
        if i < 0:
            break
    return i

Of course, performance matters most if the string is large, so suppose we want to find the 1000001st newline ('\n') in a 1.3 GB file called 'bigfile'. To save memory, we would like to work on an mmap.mmap object representation of the file:

In [1]: import _find_nth, find_nth, mmap

In [2]: f = open('bigfile', 'r')

In [3]: mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

There is already the first problem with findnth(), since mmap.mmap objects don’t support split(). So we actually have to copy the whole file into memory:

In [4]: %time s = mm[:]
CPU times: user 813 ms, sys: 3.25 s, total: 4.06 s
Wall time: 17.7 s

Ouch! Fortunately s still fits in the 4 GB of memory of my Macbook Air, so let’s benchmark findnth():

In [5]: %timeit find_nth.findnth(s, '\n', 1000000)
1 loops, best of 3: 29.9 s per loop

Clearly a terrible performance. Let’s see how the approach based on str.find() does:

In [6]: %timeit find_nth.find_nth(s, '\n', 1000000)
1 loops, best of 3: 774 ms per loop

Much better! Clearly, findnth()’s problem is that it is forced to copy the string during split(), which is already the second time we copied the 1.3 GB of data around after s = mm[:]. This is where the second advantage of find_nth() comes in: we can use it on mm directly, so that zero copies of the file are required:

In [7]: %timeit find_nth.find_nth(mm, '\n', 1000000)
1 loops, best of 3: 1.21 s per loop

There appears to be a small performance penalty operating on mm vs. s, but this illustrates that find_nth() can get us an answer in 1.2 s compared to findnth()’s total of 47 s (17.7 s for the copy plus 29.9 s for the search).
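
For readers on Python 3, here is a minimal sketch of the zero-copy idea. It is not the benchmark itself: a tiny temporary file stands in for 'bigfile', the needle must be a bytes object because an mmap holds bytes, and no timings are implied. As in find_nth.py, n is zero-based.

```python
import mmap
import os
import tempfile

def find_nth(buf, needle, n):
    """Zero-based n, non-overlapping; works on any object with a find() method."""
    i = -len(needle)
    for _ in range(n + 1):
        i = buf.find(needle, i + len(needle))
        if i < 0:
            break
    return i

# Build a small stand-in for 'bigfile' so the sketch is runnable.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"line0\nline1\nline2\n")
    path = tmp.name

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    try:
        # search the mapping directly, without copying the file into a bytes object
        third_newline = find_nth(mm, b"\n", 2)
    finally:
        mm.close()
os.remove(path)

print(third_newline)   # 17
```
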

I found no cases where the str.find() based approach was significantly worse than the str.split() based approach, so at this point, I would argue that @tgamblin’s or @Mark Byers’ answer should be accepted instead of @bobince’s.

In my testing, the version of find_nth() above was the fastest pure Python solution I could come up with (very similar to @Mark Byers’ version). Let’s see how much better we can do with a C extension module. Here is _find_nthmodule.c:

#include <Python.h>
#include <string.h>

off_t _find_nth(const char *buf, size_t l, char c, int n) {
    off_t i;
    for (i = 0; i < l; ++i) {
        if (buf[i] == c && n-- == 0) {
            return i;
        }
    }
    return -1;
}

off_t _find_nth2(const char *buf, size_t l, char c, int n) {
    const char *b = buf - 1;
    do {
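        /* search only the bytes that remain after b, so memchr cannot read past the buffer */
        b = memchr(b + 1, c, l - (size_t)(b + 1 - buf));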
        if (!b) return -1;
    } while (n--);
    return b - buf;
}

/* mmap_object is private in mmapmodule.c - replicate beginning here */
typedef struct {
    PyObject_HEAD
    char *data;
    size_t size;
} mmap_object;

typedef struct {
    const char *s;
    size_t l;
    char c;
    int n;
} params;

int parse_args(PyObject *args, params *P) {
    PyObject *obj;
    const char *x;

    if (!PyArg_ParseTuple(args, "Osi", &obj, &x, &P->n)) {
        return 1;
    }
    PyTypeObject *type = Py_TYPE(obj);

    if (type == &PyString_Type) {
        P->s = PyString_AS_STRING(obj);
        P->l = PyString_GET_SIZE(obj);
    } else if (!strcmp(type->tp_name, "mmap.mmap")) {
        mmap_object *m_obj = (mmap_object*) obj;
        P->s = m_obj->data;
        P->l = m_obj->size;
    } else {
        PyErr_SetString(PyExc_TypeError, "Cannot obtain char * from argument 0");
        return 1;
    }
    P->c = x[0];
    return 0;
}

static PyObject* py_find_nth(PyObject *self, PyObject *args) {
    params P;
    if (!parse_args(args, &P)) {
        return Py_BuildValue("i", _find_nth(P.s, P.l, P.c, P.n));
    } else {
        return NULL;    
    }
}

static PyObject* py_find_nth2(PyObject *self, PyObject *args) {
    params P;
    if (!parse_args(args, &P)) {
        return Py_BuildValue("i", _find_nth2(P.s, P.l, P.c, P.n));
    } else {
        return NULL;    
    }
}

static PyMethodDef methods[] = {
    {"find_nth", py_find_nth, METH_VARARGS, ""},
    {"find_nth2", py_find_nth2, METH_VARARGS, ""},
    {0}
};

PyMODINIT_FUNC init_find_nth(void) {
    Py_InitModule("_find_nth", methods);
}

Here is the setup.py file:

from distutils.core import setup, Extension
module = Extension('_find_nth', sources=['_find_nthmodule.c'])
setup(ext_modules=[module])

Install as usual with python setup.py install. The C code has an advantage here since it is limited to finding single characters, but let’s see how fast this is:

In [8]: %timeit _find_nth.find_nth(mm, '\n', 1000000)
1 loops, best of 3: 218 ms per loop

In [9]: %timeit _find_nth.find_nth(s, '\n', 1000000)
1 loops, best of 3: 216 ms per loop

In [10]: %timeit _find_nth.find_nth2(mm, '\n', 1000000)
1 loops, best of 3: 307 ms per loop

In [11]: %timeit _find_nth.find_nth2(s, '\n', 1000000)
1 loops, best of 3: 304 ms per loop

Clearly quite a bit faster still. Interestingly, there is no difference on the C level between the in-memory and mmapped cases. It is also interesting to see that _find_nth2(), which is based on string.h’s memchr() library function, loses out against the straightforward implementation in _find_nth(): the additional “optimizations” in memchr() are apparently backfiring…

In conclusion, the implementation in findnth() (based on str.split()) is really a bad idea, since (a) it performs terribly for larger strings due to the required copying, and (b) it doesn’t work on mmap.mmap objects at all. The implementation in find_nth() (based on str.find()) should be preferred in all circumstances (and therefore be the accepted answer to this question).

There is still quite a bit of room for improvement, since the C extension ran almost a factor of 4 faster than the pure Python code, indicating that there might be a case for a dedicated Python library function.

Answered By: Stefan

Here is another “tricky” solution, which uses split and join.

In your example, we can use

len("substring".join([s for s in ori.split("substring")[:2]]))
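
As a sketch, the split-and-join trick can be generalized to any n. The helper name find_nth_by_join is made up for this example, n is 1-based, and the extra length check is needed because the bare expression returns len(ori) when there are too few occurrences.

```python
def find_nth_by_join(text, sub, n):
    """1-based n; returns -1 when there are fewer than n occurrences."""
    parts = text.split(sub)
    # k occurrences produce k + 1 pieces, so we need more than n pieces
    if n < 1 or len(parts) <= n:
        return -1
    # re-joining the first n pieces rebuilds the text before the n-th match
    return len(sub.join(parts[:n]))

print(find_nth_by_join("foo bar bar bar", "bar", 2))   # 8
print(find_nth_by_join("foo bar bar bar", "bar", 4))   # -1
```
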
Answered By: Ivor Zhou

Simplest way?

text = "This is a test from a test ok" 

firstTest = text.find('test')

print(text.find('test', firstTest + 1))
Answered By: forbzie

How about:

import os
c = os.getcwd().split('\\')
print('\\'.join(c[0:-2]))
Answered By: GetItDone

This is the answer you really want:

def Find(String, ToFind, Occurence=1):
    index = 0
    count = 0
    while index <= len(String):
        # slicing never raises IndexError, so no try/except is needed
        if String[index:index + len(ToFind)] == ToFind:
            count += 1
            if count == Occurence:
                return index
        index += 1
    # not found; note that False == 0, so test the result with "is False"
    return False
Answered By: yarz-tech

This will give you a list of the starting indices of the matches in yourstring (the pattern here is ':'):

import re
indices = [s.start() for s in re.finditer(':', yourstring)]

Then your nth entry would be:

n = 2
nth_entry = indices[n-1]

Of course, you have to be careful with the index bounds. You can get the number of matches in yourstring like this:

num_instances = len(indices)
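
One way to handle the bounds, sketched here with a hypothetical helper nth_match_start that returns None when the requested match does not exist (n is 1-based):

```python
import re

def nth_match_start(pattern, text, n):
    """Return the start of the n-th (1-based) match of pattern, or None."""
    indices = [m.start() for m in re.finditer(pattern, text)]
    # guard the bounds explicitly: n < 1 or n > len(indices) has no answer
    if 1 <= n <= len(indices):
        return indices[n - 1]
    return None

print(nth_match_start(":", "a:b:c", 2))   # 3
print(nth_match_start(":", "a:b:c", 3))   # None
```
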
Answered By: modle13

Building on modle13‘s answer, but without the re module dependency.

def iter_find(haystack, needle):
    # startswith(needle, i) tests position i without copying a slice of haystack
    return [i for i in range(len(haystack)) if haystack.startswith(needle, i)]

I kinda wish this was a builtin string method.

>>> iter_find("http://stackoverflow.com/questions/1883980/", '/')
[5, 6, 24, 34, 42]
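
A lazy variant is also possible. The sketch below assumes a generator is acceptable; iter_find_lazy is a made-up name, and it lets str.find do the scanning so that only as many occurrences are located as the caller consumes.

```python
from itertools import islice

def iter_find_lazy(haystack, needle):
    """Yield the start index of every (possibly overlapping) occurrence."""
    start = haystack.find(needle)
    while start != -1:
        yield start
        start = haystack.find(needle, start + 1)

url = "http://stackoverflow.com/questions/1883980/"
print(list(iter_find_lazy(url, "/")))   # [5, 6, 24, 34, 42]

# n-th occurrence (1-based) without building the full list; -1 if missing
n = 3
print(next(islice(iter_find_lazy(url, "/"), n - 1, None), -1))   # 24
```
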
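Answered By: Zv_oDD

# return -1 if the nth substr (0-indexed) does not exist, else return its index
def find_nth(s, substr, n):
    i = -1
    while n >= 0:
        n -= 1
        i = s.find(substr, i + 1)
        if i == -1:
            # stop here; searching again from index 0 would give a wrong match
            break
    return i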
Answered By: Jason

A solution without explicit loops or recursion.

Compile the required pattern and set the desired occurrence in the
variable n; the last statement prints the starting index of the nth
occurrence of the pattern in the given string. The iterator returned
by finditer is converted to a list, and the nth match is read from
index n - 1 because list indices start at 0.

import re
n=2
sampleString="this is history"
pattern=re.compile("is")
matches=pattern.finditer(sampleString)
print(list(matches)[n - 1].span()[0])
Answered By: Karthik

Here is my solution for finding the nth occurrence of b in string a:

from functools import reduce


def findNth(a, b, n):
    return reduce(lambda x, y: -1 if y > x + 1 else a.find(b, x + 1), range(n), -1)

It is pure Python and iterative. If n is 0 or too large, it returns -1. It is a one-liner and can be used directly. Here is an example:

>>> reduce(lambda x, y: -1 if y > x + 1 else 'bibarbobaobaotang'.find('b', x + 1), range(4), -1)
7
Answered By: 黄锐铭

For the special case where you search for the n’th occurrence of a character (i.e. a substring of length 1), the following function works by building a list of all positions where the given character occurs:

def find_char_nth(string, char, n):
    """Find the n'th occurrence of a character within a string."""
    return [i for i, c in enumerate(string) if c == char][n-1]

If there are fewer than n occurrences of the given character, it will give IndexError: list index out of range.

This is derived from @Zv_oDD’s answer and simplified for the case of a single character.
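
If a sentinel is preferred over an exception, the lookup can be wrapped as in the sketch below. The name find_char_nth_safe and the default parameter are assumptions made for this example. It also rejects n < 1, because n - 1 would then be negative and silently index from the end of the list.

```python
def find_char_nth_safe(string, char, n, default=-1):
    """Like find_char_nth, but return `default` instead of raising IndexError."""
    if n < 1:
        # n - 1 would be negative and silently index from the end of the list
        raise ValueError("n must be >= 1")
    positions = [i for i, c in enumerate(string) if c == char]
    try:
        return positions[n - 1]
    except IndexError:
        return default

print(find_char_nth_safe("banana", "a", 2))   # 3
print(find_char_nth_safe("banana", "a", 4))   # -1
```
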

Answered By: coldfix

Def:

def get_first_N_words(mytext, mylen = 3):
    mylist = list(mytext.split())
    if len(mylist)>=mylen: return ' '.join(mylist[:mylen])

To use:

get_first_N_words('  One Two Three Four ' , 3)

Output:

'One Two Three'
Answered By: Chadee Fouad

This version avoids a failure or incorrect output when the requested occurrence is higher than the actual number of occurrences. For example, the string ‘overflow’ contains ‘o’ only twice, so asking for the 3rd occurrence prints the following message and returns None:

Input Occurrence entered has exceeded the actual count of Occurrence.

def check_nth_occurrence(string, substr, n):

    ## Count the (possibly overlapping) occurrences of substr
    cnt = sum(1 for i in range(len(string)) if string.startswith(substr, i))

    ## Check if the Occurrence input has exceeded the actual count of Occurrence
    if n > cnt:
        print('Input Occurrence entered has exceeded the actual count of Occurrence.')
        return

    ## Get the Index value for first Occurrence of the substr
    index = string.find(substr)

    ## Get the Index value for nth Occurrence of the substr
    while index >= 0 and n > 1:
        index = string.find(substr, index + 1)
        n -= 1
    return index
Answered By: PythonLover

Here’s a simple and fun way to do it:

def index_of_nth(text, substring, n) -> int:
    index = 0
    for _ in range(n):
        index = text.index(substring, index) + 1
    return index - 1
Answered By: Zachary Chiodini

Just in case anyone wants to find the n-th occurrence from the back:

def find_nth_reverse(haystack: str, needle: str, n: int) -> int:
    end = haystack.rfind(needle)

    while end >= 0 and n > 1:
        # rfind's end bound is exclusive, so this finds matches that end at or before the previous match's start
        end = haystack.rfind(needle, 0, end)
        n -= 1

    return end
Answered By: Sabih Ismail

I used the findnth() function and ran into some issues, so I rewrote a faster version of the function (no list splitting):

def findnth(haystack, needle, n):
    if not needle in haystack or haystack.count(needle) < n:
        return -1

    last_index = 0
    cumulative_last_index = 0
    for i in range(0, n):
        last_index = haystack[cumulative_last_index:].find(needle)
        cumulative_last_index += last_index
        
        # if not last element, then jump over it
        if i < n-1:
            cumulative_last_index += len(needle)

    return cumulative_last_index
Answered By: Abdulellah Alsaheel

A simple solution for those with basic programming knowledge:

# Function to find the nth occurrence of a single character in a text
def findnth(text, substring, n):
    # variable to store current index in loop
    count = -1

    # number of occurrences seen so far
    occurrence = 0

    # loop through string, one character at a time
    for letter in text:
        # increment count
        count += 1

        # if current letter in loop matches substring target
        # (this comparison only works when substring is a single character)
        if letter == substring:
            # increment occurrence
            occurrence += 1

            # if this is the nth time the substring is found
            if occurrence == n:
                # return its index
                return count

    # otherwise indicate there is no match
    return "No match"

# example of how to call function
print(findnth('C$100$150xx', "$", 2))
Answered By: Elias

I solved it like this.

from typing import Optional

def second_index(text: str, symbol: str) -> Optional[int]:
    """
    Returns the index of the second occurrence of symbol in text,
    or None if there are fewer than two occurrences.
    """
    first = text.find(symbol)
    result = text.find(symbol, first + 1)
    if result > 0:
        return result
    return None
Answered By: Mylinear