Find "one letter that appears twice" in a string

Question:

I’m trying to catch if one letter that appears twice in a string using RegEx (or maybe there’s some better ways?), for example my string is:

ugknbfddgicrmopn

The output would be:

dd

However, I’ve tried something like:

re.findall('[a-z]{2}', 'ugknbfddgicrmopn')

but in this case, it returns:

['ug', 'kn', 'bf', 'dd', 'gi', 'cr', 'mo', 'pn']   # the except output is `['dd']`

I also have a way to get the expect output:

>>> l = []
>>> tmp = None
>>> for i in 'ugknbfddgicrmopn':
...     if tmp != i:
...         tmp = i
...         continue
...     l.append(i*2)
...     
... 
>>> l
['dd']
>>> 

But that’s too complex…


If it’s 'abbbcppq', then only catch:

abbbcppq
 ^^  ^^

So the output is:

['bb', 'pp']

Then, if it’s 'abbbbcppq', catch bb twice:

abbbbcppq
 ^^^^ ^^

So the output is:

['bb', 'bb', 'pp']
Asked By: Remi Guan

||

Answers:

As a Pythonic way You can use zip function within a list comprehension:

>>> s = 'abbbcppq'
>>>
>>> [i+j for i,j in zip(s,s[1:]) if i==j]
['bb', 'bb', 'pp']

If you are dealing with large string you can use iter() function to convert the string to an iterator and use itertols.tee() to create two independent iterator, then by calling the next function on second iterator consume the first item and use call the zip class (in Python 2.X use itertools.izip() which returns an iterator) with this iterators.

>>> from itertools import tee
>>> first = iter(s)
>>> second, first = tee(first)
>>> next(second)
'a'
>>> [i+j for i,j in zip(first,second) if i==j]
['bb', 'bb', 'pp']

Benchmark with RegEx recipe:

# ZIP
~ $ python -m timeit --setup "s='abbbcppq'" "[i+j for i,j in zip(s,s[1:]) if i==j]"
1000000 loops, best of 3: 1.56 usec per loop

# REGEX
~ $ python -m timeit --setup "s='abbbcppq';import re" "[i[0] for i in re.findall(r'(([a-z])2)', 'abbbbcppq')]"
100000 loops, best of 3: 3.21 usec per loop

After your last edit as mentioned in comment if you want to only match one pair of b in strings like "abbbcppq" you can use finditer() which returns an iterator of matched objects, and extract the result with group() method:

>>> import re
>>> 
>>> s = "abbbcppq"
>>> [item.group(0) for item in re.finditer(r'([a-z])1',s,re.I)]
['bb', 'pp']

Note that re.I is the IGNORECASE flag which makes the RegEx match the uppercase letters too.

Answered By: Mazdak

Using back reference, it is very easy:

import re
p = re.compile(ur'([a-z])1{1,}')
re.findall(p, u"ugknbfddgicrmopn")
#output: [u'd']
re.findall(p,"abbbcppq")
#output: ['b', 'p']

For more details, you can refer to a similar question in perl: Regular expression to match any character being repeated more than 10 times

Answered By: Gurupad Hegde
>>> l = ['ug', 'kn', 'bf', 'dd', 'gi', 'cr', 'mo', 'pn']
>>> import re
>>> newList = [item for item in l if re.search(r"([a-z]{1})1", item)]
>>> newList
['dd']
Answered By: Mayur Koshti

You need use capturing group based regex and define your regex as raw string.

>>> re.search(r'([a-z])1', 'ugknbfddgicrmopn').group()
'dd'
>>> [i+i for i in re.findall(r'([a-z])1', 'abbbbcppq')]
['bb', 'bb', 'pp']

or

>>> [i[0] for i in re.findall(r'(([a-z])2)', 'abbbbcppq')]
['bb', 'bb', 'pp']

Note that , re.findall here should return the list of tuples with the characters which are matched by the first group as first element and the second group as second element. For our case chars within first group would be enough so I mentioned i[0].

Answered By: Avinash Raj

Maybe you can use the generator to achieve this

def adj(s):
    last_c = None
    for c in s:
        if c == last_c:
            yield c * 2
        last_c = c

s = 'ugknbfddgicrmopn'
v = [x for x in adj(s)]
print(v)
# output: ['dd']
Answered By: xhg
A1 = "abcdededdssffffccfxx"

print A1[1]
for i in range(len(A1)-1):
    if A1[i+1] == A1[i]:
        if not A1[i+1] == A1[i-1]:
            print A1[i] *2
Answered By: Mark White

“or maybe there’s some better ways”

Since regex is often misunderstood by the next developer to encounter your code (may even be you),
And since simpler != shorter,

How about the following pseudo-code:

function findMultipleLetters(inputString) {        
    foreach (letter in inputString) {
        dictionaryOfLettersOccurrance[letter]++;
        if (dictionaryOfLettersOccurrance[letter] == 2) {
            multipleLetters.add(letter);
        }
    }
    return multipleLetters;
}
multipleLetters = findMultipleLetters("ugknbfddgicrmopn");
Answered By: Lavi Avigdor

It is pretty easy without regular expressions:

In [4]: [k for k, v in collections.Counter("abracadabra").items() if v==2]
Out[4]: ['b', 'r']
Answered By: Dima Tisnek
Categories: questions Tags: , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.