Regular Expression Matching First Non-Repeated Character

Question:

TL;DR

re.search("(.)(?!.*1)", text).group() doesn’t match the first non-repeating character contained in text (it always returns a character at or before the first non-repeated character, or before the end of the string if there are no non-repeated characters. My understanding is that re.search() should return None if there were no matches).
I’m only interested in understanding why this regex is not working as intended using the Python re module, not in any other method of solving the problem

Full Background

The problem description comes from https://www.codeeval.com/open_challenges/12/. I’ve already solved this problem using a non-regex method, but revisited it to expand my understanding of Python’s re module.
The regular expressions i thought would work (named vs unnamed backreferences) are:

(?P<letter>.)(?!.*(?P=letter)) and (.)(?!.*1) (same results in python2 and python3)

My entire program looks like this

import re
import sys
with open(sys.argv[1], 'r') as test_cases:
    for test in test_cases:
        print(re.search("(?P<letter>.)(?!.*(?P=letter))",
                        test.strip()
                       ).group()
             )

and some input/output pairs are:

rain | r
teetthing | e
cardiff | c
kangaroo | k
god | g
newtown | e
taxation | x
refurbished | f
substantially | u

According to what I’ve read at https://docs.python.org/2/library/re.html:

  • (.) creates a named group that matches any character and allows later backreferences to it as 1.
  • (?!...) is a negative lookahead which restricts matches to cases where ... does not match.
  • .*1 means any number (including zero) of characters followed by whatever was matched by (.) earlier
  • re.search(pattern, string) returns only the first location where the regex pattern produces a match (and would return None if no match could be found)
  • .group() is equivalent to .group(0) which returns the entire match

I think these pieces together should solve the stated problem, and it does work like I think it should for most inputs, but failed on teething. Throwing similar problems at it reveals that it seems to ignore repeated characters if they are consecutive:

tooth | o      # fails on consecutive repeated characters
aardvark | d   # but does ok if it sees them later
aah | a        # verified last one didn't work just because it was at start
heh | e        # but it works for this one
hehe | h       # What? It thinks h matches (lookahead maybe doesn't find "heh"?)
heho | e       # but it definitely finds "heh" and stops "h" from matching here
hahah | a      # so now it won't match h but will match a
hahxyz | a     # but it realizes there are 2 h characters here...
hahxyza | h    # ... Ok time for StackOverflow

I know lookbehind and negative lookbehind are limited to 3-character-max fixed length strings, and cannot contain backreferences even if they evaluate to a fixed length string, but I didn’t see the documentation specify any restrictions on negative lookahead.

Asked By: stevenjackson121

||

Answers:

Well let’s take your tooth example – here is what the regex-engine does (a lot simplified for better understanding)

Start with t then look ahead in the string – and fail the lookahead, as there is another t.

tooth
^  °

Next take o, look ahead in the string – and fail, as there is another o.

tooth
 ^°

Next take the second o, look ahead in the string – no other o present – match it, return it, work done.

tooth
  ^

So your regex doesn’t match the first unrepeated character, but the first one, that has no further repetitions towards the end of the string.

Answered By: Sebastian Proske

The reason why your regex is not working is that it will not match a character that is followed by the same character, but there is nothing to prevent it from matching a character that isn’t followed by the same character, even if it is preceded by the same character.

Answered By: brianpck

Regular expressions are not optimal for the task even if you use alternative implementations of re that do not limit lookbehind by fixed length strings (such as Matthew Barnett’s regex).

The easiest way is to count occurrences of letters and print the first one with frequency equal to 1:

import sys
from collections import Counter, OrderedDict

# Counter that remembers that remembers the order entries were added
class OrderedCounter(Counter, OrderedDict):
    pass

# Calling next() once only gives the first entry
first=next

with open(sys.argv[1], 'r') as test_cases:
    for test in test_cases:
        lettfreq = OrderedCounter(test)
        print(first((l for l in lettfreq if lettfreq[l] == 1)))
Answered By: Pavel Smrz

Sebastian’s answer already explains pretty well why your current attempt doesn’t work.

.NET

Since you’re revo is interested in a .NET flavor workaround, the solution becomes trivial:

(?<letter>.)(?!.*?k<letter>)(?<!k<letter>.+?)

Demo link

This works because .NET supports variable-length lookbehinds. You can also get that result with Python (see below).

So for each letter (?<letter>.) we check:

  • if it’s repeated further in the input (?!.*?k<letter>)
  • if it was already encountered before (?<!k<letter>.+?)
    (we have to skip the letter we’re testing when going backwards, hence the +).

Python

The Python regex module also supports variable-length lookbehinds, so the regex above will work with a small syntactical change: you need to replace k with g (which is quite unfortunate as with this module g is a group backreference, whereas with PCRE it’s a recursion).

The regex is:

(?<letter>.)(?!.*?g<letter>)(?<!g<letter>.+?)

And here’s an example:

$ python
Python 2.7.10 (default, Jun  1 2015, 18:05:38)
[GCC 4.9.2] on cygwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import regex
>>> regex.search(r'(?<letter>.)(?!.*?g<letter>)(?<!g<letter>.+?)', 'tooth')
<regex.Match object; span=(4, 5), match='h'>

PCRE

Ok, now things start to get dirty: since PCRE doesn’t support variable-length lookbehinds, we need to somehow remember whether a given letter was already encountered in the input or not.

Unfortunately, the regex engine doesn’t provide random access memory support. The best we can get in terms of generic memory is a stack – but that’s not sufficient for this purpose, as a stack only lets us access its topmost element.

If we accept to restrain ourselves to a given alphabet, we can abuse capturing groups for the purpose of storing flags. Let’s see this on a limited alphabet of the three letters abc:

# Anchor the pattern
A

# For each letter, test to see if it's duplicated in the input string
(?(?=[^a]*+a[^a]*a)(?<da>))
(?(?=[^b]*+b[^b]*b)(?<db>))
(?(?=[^c]*+c[^c]*c)(?<dc>))

# Skip any duplicated letter and throw it away
[a-c]*?K

# Check if the next letter is a duplicate
(?:
  (?(da)(*FAIL)|a)
| (?(db)(*FAIL)|b)
| (?(dc)(*FAIL)|c)
)

Here’s how that works:

  • First, the A anchor ensures we’ll process the input string only once
  • Then, for each letter X of our alphabet, we’ll set up a is duplicate flag dX:
    • The conditional pattern (?(cond)then|else) is used there:
      • The condition is (?=[^X]*+X[^X]*X) which is true if the input string contains the letter X twice.
      • If the condition is true, the then clause is (?<dX>), which is an empty capturing group that will match the empty string.
      • If the condition is false, the dX group won’t be matched
    • Next, we lazily skip valid letters from our alphabet: [a-c]*?
    • And we throw them out in the final match with K
    • Now, we’re trying to match one letter whose dX flag is not set. For this purpose, we’ll do a conditional branch: (?(dX)(*FAIL)|X)
      • If dX was matched (meaning that X is a duplicated character), we (*FAIL), forcing the engine to backtrack and try a different letter.
      • If dX was not matched, we try to match X. At this point, if this succeeds, we know that X is the first non-duplicated letter.

That last part of the pattern could also be replaced with:

(?:
  a (*THEN) (?(da)(*FAIL))
| b (*THEN) (?(db)(*FAIL))
| c (*THEN) (?(dc)(*FAIL))
)

Which is somewhat more optimized. It matches the current letter first and only then checks if it’s a duplicate.

The full pattern for the lowercase letters a-z looks like this:

# Anchor the pattern
A

# For each letter, test to see if it's duplicated in the input string
(?(?=[^a]*+a[^a]*a)(?<da>))
(?(?=[^b]*+b[^b]*b)(?<db>))
(?(?=[^c]*+c[^c]*c)(?<dc>))
(?(?=[^d]*+d[^d]*d)(?<dd>))
(?(?=[^e]*+e[^e]*e)(?<de>))
(?(?=[^f]*+f[^f]*f)(?<df>))
(?(?=[^g]*+g[^g]*g)(?<dg>))
(?(?=[^h]*+h[^h]*h)(?<dh>))
(?(?=[^i]*+i[^i]*i)(?<di>))
(?(?=[^j]*+j[^j]*j)(?<dj>))
(?(?=[^k]*+k[^k]*k)(?<dk>))
(?(?=[^l]*+l[^l]*l)(?<dl>))
(?(?=[^m]*+m[^m]*m)(?<dm>))
(?(?=[^n]*+n[^n]*n)(?<dn>))
(?(?=[^o]*+o[^o]*o)(?<do>))
(?(?=[^p]*+p[^p]*p)(?<dp>))
(?(?=[^q]*+q[^q]*q)(?<dq>))
(?(?=[^r]*+r[^r]*r)(?<dr>))
(?(?=[^s]*+s[^s]*s)(?<ds>))
(?(?=[^t]*+t[^t]*t)(?<dt>))
(?(?=[^u]*+u[^u]*u)(?<du>))
(?(?=[^v]*+v[^v]*v)(?<dv>))
(?(?=[^w]*+w[^w]*w)(?<dw>))
(?(?=[^x]*+x[^x]*x)(?<dx>))
(?(?=[^y]*+y[^y]*y)(?<dy>))
(?(?=[^z]*+z[^z]*z)(?<dz>))

# Skip any duplicated letter and throw it away
[a-z]*?K

# Check if the next letter is a duplicate
(?:
  a (*THEN) (?(da)(*FAIL))
| b (*THEN) (?(db)(*FAIL))
| c (*THEN) (?(dc)(*FAIL))
| d (*THEN) (?(dd)(*FAIL))
| e (*THEN) (?(de)(*FAIL))
| f (*THEN) (?(df)(*FAIL))
| g (*THEN) (?(dg)(*FAIL))
| h (*THEN) (?(dh)(*FAIL))
| i (*THEN) (?(di)(*FAIL))
| j (*THEN) (?(dj)(*FAIL))
| k (*THEN) (?(dk)(*FAIL))
| l (*THEN) (?(dl)(*FAIL))
| m (*THEN) (?(dm)(*FAIL))
| n (*THEN) (?(dn)(*FAIL))
| o (*THEN) (?(do)(*FAIL))
| p (*THEN) (?(dp)(*FAIL))
| q (*THEN) (?(dq)(*FAIL))
| r (*THEN) (?(dr)(*FAIL))
| s (*THEN) (?(ds)(*FAIL))
| t (*THEN) (?(dt)(*FAIL))
| u (*THEN) (?(du)(*FAIL))
| v (*THEN) (?(dv)(*FAIL))
| w (*THEN) (?(dw)(*FAIL))
| x (*THEN) (?(dx)(*FAIL))
| y (*THEN) (?(dy)(*FAIL))
| z (*THEN) (?(dz)(*FAIL))
)

And here’s the demo on regex101, complete with unit tests.

You can expand on this pattern if you need a larger alphabet, but obviously this is not a general-purpose solution. It’s primarily of educational interest and should not be used for any serious application.


For other flavors, you may try to tweak the pattern to replace PCRE features with simpler equivalents:

  • A becomes ^
  • X (*THEN) (?(dX)(*FAIL)) can be replaced with (?(dX)(?!)|X)
  • You may throw away the K and replace the last noncapturnig group (?:) with a named group like (?<letter>) and treat its content as the result.

The only required but somewhat unusual construct is the conditional group (?(cond)then|else).

Answered By: Lucas Trzesniewski
Categories: questions Tags: , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.