How to split a categorical column into character bigrams as separate columns?

Question:

I would like to split the string in every row into character bigrams and use those bigrams as columns, in order to encode the original string column.

I have a dataset like this one:

A
blue
red
black

I want my result to look like this:

A      bl  lu  ue  re  ed  la  ac  ck
blue    1   1   1   0   0   0   0   0
red     0   0   0   1   1   0   0   0
black   1   0   0   0   0   1   1   1

I have tried splitting up A, but it does not split the strings into characters.

Asked By: revarein

Answers:

Here’s one way to do it:

import pandas as pd

# sample data
f = pd.DataFrame({'A': ['blue', 'red', 'black']})

def bigram(s, n=2):
    # all overlapping substrings of length n (character n-grams)
    return [s[i:i+n] for i in range(len(s) - n + 1)]

# using pandas
f['bgm'] = f['A'].apply(bigram)                   # list of bigrams per row
f = f.explode('bgm').reset_index(drop=True)       # one row per (string, bigram)
f = pd.crosstab(f['A'], f['bgm']).reset_index()   # count each bigram per string
f.columns.name = None

print(f)

       A  ac  bl  ck  ed  la  lu  re  ue
0  black   1   1   1   0   1   0   0   0
1   blue   0   1   0   0   0   1   0   1
2    red   0   0   0   1   0   0   1   0
Answered By: YOLO
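
A possible variation, as a minimal sketch: the crosstab approach above sorts rows and columns alphabetically, counts repeated bigrams rather than flagging them, and merges duplicate strings into one row. If the exact layout from the question is needed (original row order, columns in order of first appearance, 0/1 values), something like the following could work. The helper names `char_ngrams` and `encode_bigrams` are hypothetical, and the sketch assumes that the DataFrame has a unique index and that column `A` contains only strings.

```python
import pandas as pd

def char_ngrams(s, n=2):
    """Return the overlapping character n-grams of s, in order (hypothetical helper)."""
    return [s[i:i + n] for i in range(len(s) - n + 1)]

def encode_bigrams(df, col="A", n=2):
    """Append 0/1 indicator columns for the character n-grams of df[col].

    Assumes df has a unique index and df[col] holds strings.
    """
    # One entry per (original row, n-gram); explode repeats the original index.
    # Strings shorter than n yield an empty list, which explodes to NaN.
    grams = df[col].map(lambda s: char_ngrams(s, n)).explode().dropna()

    # Column order: first appearance of each n-gram across the data.
    order = list(dict.fromkeys(grams))

    # Indicator per entry, collapsed back to one row per original index.
    # max() turns repeated n-grams within a string into a single 1.
    dummies = pd.get_dummies(grams).groupby(level=0).max().astype(int)

    # Restore the original row order; rows without any n-gram become all zeros.
    dummies = dummies.reindex(index=df.index, columns=order, fill_value=0)

    return pd.concat([df[[col]], dummies], axis=1)

df = pd.DataFrame({"A": ["blue", "red", "black"]})
result = encode_bigrams(df)
print(result)
```

Using `max()` instead of a count keeps the encoding binary; replacing it with `sum()` would give per-string bigram counts, similar to what `pd.crosstab` returns.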