Runtime of python's if substring in string

Question:

What is the big O of the following if statement?

if "pl" in "apple":
   ...

What is the overall big O of how python determines if the string “pl” is found in the string “apple”

or any other substring in string search.

Is this the most efficient way to test if a substring is in a string? Does it use the same algorithm as .find()?

Asked By: Liondancer

||

Answers:

You can use timeit and test it yourself:

maroun@DQHCPY1:~$ python -m timeit 's = "apple";s.find("pl")'
10000000 loops, best of 3: 0.125 usec per loop
maroun@DQHCPY1:~$ python -m timeit 's = "apple";"pl" in s'
10000000 loops, best of 3: 0.0371 usec per loop

Using in is indeed faster (0.0371 usec compared to 0.125 usec).

For actual implementation, you can look at the code itself.

Answered By: Maroun

In Python 3.4.2, it looks like they are resorting to the same function, but there may be a difference in timing nevertheless. For example, s.find first is required to look up the find method of the string and such.

The algorithm used is a mix between Boyer-More and Horspool.

Answered By: skyking

I think the best way to find out is to look at the source. This looks like it would implement __contains__:

static int
bytes_contains(PyObject *self, PyObject *arg)
{
    Py_ssize_t ival = PyNumber_AsSsize_t(arg, PyExc_ValueError);
    if (ival == -1 && PyErr_Occurred()) {
        Py_buffer varg;
        Py_ssize_t pos;
        PyErr_Clear();
        if (PyObject_GetBuffer(arg, &varg, PyBUF_SIMPLE) != 0)
            return -1;
        pos = stringlib_find(PyBytes_AS_STRING(self), Py_SIZE(self),
                             varg.buf, varg.len, 0);
        PyBuffer_Release(&varg);
        return pos >= 0;
    }
    if (ival < 0 || ival >= 256) {
        PyErr_SetString(PyExc_ValueError, "byte must be in range(0, 256)");
        return -1;
    }

    return memchr(PyBytes_AS_STRING(self), (int) ival, Py_SIZE(self)) != NULL;
}

in terms of stringlib_find(), which uses fastsearch().

Answered By: Vlad

The time complexity is O(N) on average, O(NM) worst case (N being the length of the longer string, M, the shorter string you search for). As of Python 3.10, heuristics are used to lower the worst-case scenario to O(N + M) by switching algorithms.

The same algorithm is used for str.index(), str.find(), str.__contains__() (the in operator) and str.replace(); it is a simplification of the Boyer-Moore with ideas taken from the Boyer–Moore–Horspool and Sunday algorithms.

See the original stringlib discussion post, as well as the fastsearch.h source code; until Python 3.10, the base algorithm has not changed since introduction in Python 2.5 (apart from some low-level optimisations and corner-case fixes).

The post includes a Python-code outline of the algorithm:

def find(s, p):
    # find first occurrence of p in s
    n = len(s)
    m = len(p)
    skip = delta1(p)[p[m-1]]
    i = 0
    while i <= n-m:
        if s[i+m-1] == p[m-1]: # (boyer-moore)
            # potential match
            if s[i:i+m-1] == p[:m-1]:
                return i
            if s[i+m] not in p:
                i = i + m + 1 # (sunday)
            else:
                i = i + skip # (horspool)
        else:
            # skip
            if s[i+m] not in p:
                i = i + m + 1 # (sunday)
            else:
                i = i + 1
    return -1 # not found

as well as speed comparisons.

In Python 3.10, the algorithm was updated to use an enhanced version of the Crochemore and Perrin’s Two-Way string searching algorithm for larger problems (with p and s longer than 100 and 2100 characters, respectively, with s at least 6 times as long as p), in response to a pathological edgecase someone reported. The commit adding this change included a write-up on how the algorithm works.

The Two-way algorithm has a worst-case time complexity of O(N + M), where O(M) is a cost paid up-front to build a shift table from the s search needle. Once you have that table, this algorithm does have a best-case performance of O(N/M).

Answered By: Martijn Pieters