Two Letter Bigram in Pandas Dataframe

Question:

I'm having trouble finding a way to get every two-letter combination (character bigram) of each string in a DataFrame column. Everything I have found so far deals with word n-grams rather than letters. The expected output is shown in the example below.


I have tried the three lines below.

df['bigram'] = list(zip(df['string'], df['string'][1:]))

Generated this error

ValueError: Length of values (15570) does not match length of index (15571)

df['bigram'] = list(ngrams(df['string'], n=2))

Generated this error

ValueError: Length of values (15570) does not match length of index (15571)

df['bigram']=re.findall(r'[a-zA-z]{2}', df['string'])

Generated this error

TypeError: expected string or bytes-like object

Example:

string output
hello he, el, ll, lo
world wo, or, rl, ld
Asked By: Will L


Answers:

string output
hello he, el, ll, lo
world wo, or, rl, ld

Better formatted table

Answered By: Will L

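You need to loop over the strings and build the bigrams for each one individually. Passing the whole column to zip or ngrams pairs up consecutive rows, not consecutive letters, which is why the result is one item short of the index: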

import pandas as pd
from nltk import ngrams

df = pd.DataFrame({'string': ['abc', 'abcdef']})

# ngrams() accepts any sequence, so call it once per string
df['bigram'] = df['string'].apply(lambda x: list(ngrams(x, n=2)))

Output:

   string                                    bigram
0     abc                          [(a, b), (b, c)]
1  abcdef  [(a, b), (b, c), (c, d), (d, e), (e, f)]
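
To see why the column-level attempts in the question come up one item short, here is a small sketch without nltk. The three-row frame is made up for illustration, and .iloc[1:] is used as an explicit positional stand-in for the [1:] slice in the question.

```python
import pandas as pd

# Illustrative data only, not the asker's real 15571-row frame
df = pd.DataFrame({'string': ['hello', 'world', 'abc']})

# Zipping the column with itself shifted by one pairs each ROW with the next row,
# so there is one pair fewer than there are rows
row_pairs = list(zip(df['string'], df['string'].iloc[1:]))
print(len(df), len(row_pairs))

# Zipping each string with itself shifted by one pairs consecutive LETTERS,
# and apply() returns exactly one list per row
letter_pairs = df['string'].apply(lambda s: list(zip(s, s[1:])))
print(letter_pairs.iloc[0])
```

Assigning row_pairs to a new column would fail with the same kind of length mismatch as in the question, while letter_pairs has the same length as the frame.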

If you want a string:

# a string of length n has n-1 bigrams, so the last start index is len(x)-2
df['bigram'] = [', '.join([x[i:i+2] for i in range(len(x)-1)])
                for x in df['string']]

Output:

   string              bigram
0     abc              ab, bc
1  abcdef  ab, bc, cd, de, ef
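
As a rough sketch, the same per-string idea can be wrapped in a small helper and run on the question's own examples. The name char_bigrams and its sep parameter are hypothetical, and returning an empty string for non-string cells (such as NaN) is an assumption about what is wanted, not something the question specifies.

```python
import pandas as pd


def char_bigrams(text, sep=', '):
    """Join the overlapping two-letter chunks of text with sep (hypothetical helper)."""
    if not isinstance(text, str):
        # Assumption: non-string cells (e.g. NaN) become an empty string
        return ''
    # A string of length n has n - 1 bigrams, starting at indices 0 .. n - 2
    return sep.join(text[i:i + 2] for i in range(len(text) - 1))


df = pd.DataFrame({'string': ['hello', 'world']})
df['bigram'] = df['string'].map(char_bigrams)
print(df)
```

With this helper, 'hello' maps to 'he, el, ll, lo' and 'world' to 'wo, or, rl, ld', matching the expected output in the question.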
Answered By: mozway