How to split a categorical column into character bigrams as separate columns?
Question:
I would like to split the string in every row into character bigrams and use those bigrams as columns, in order to encode the original string column.
I have a dataset like this one:
A |
---|
blue |
red |
black |
I want my result to look like this:
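A | bl | lu | ue | re | ed | la | ac | ck |
---|---|---|---|---|---|---|---|---|
blue | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
red | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 |
black | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 1 |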
I have tried splitting column A, but that does not split the strings into characters.
Answers:
Here's one way to do it:
import pandas as pd

# sample data
f = pd.DataFrame({'A': ['blue', 'red', 'black']})

def bigram(s, n=2):
    # overlapping character n-grams of s (bigrams when n=2)
    return [s[i:i+n] for i in range(len(s) - n + 1)]

# using pandas: one row per (string, bigram) pair, then count the pairs
f['bgm'] = f['A'].apply(bigram)
f = f.explode('bgm').reset_index(drop=True)
f = pd.crosstab(f['A'], f['bgm']).reset_index()
f.columns.name = None
print(f)
       A  ac  bl  ck  ed  la  lu  re  ue
0  black   1   1   1   0   1   0   0   0
1   blue   0   1   0   0   0   1   0   1
2    red   0   0   0   1   0   0   1   0
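Note that `pd.crosstab` sorts both the rows and the bigram columns alphabetically. It also counts occurrences, so a bigram that repeats within a string, or a value of `A` that appears in several rows, produces counts above 1. If the original row order and a strict 0/1 encoding matter, one possible variant is sketched below. It is not part of the answer above. The names `char_ngrams`, `order`, and `encoded` are made up for illustration, and it assumes the strings never contain the `|` character used as a separator.

```python
import pandas as pd

df = pd.DataFrame({"A": ["blue", "red", "black"]})

def char_ngrams(s, n=2):
    """Return the overlapping character n-grams of s, in order."""
    return [s[i:i + n] for i in range(len(s) - n + 1)]

# Join each row's bigrams with "|" (assumed absent from the data),
# then let pandas build one 0/1 indicator column per distinct bigram.
joined = df["A"].apply(lambda s: "|".join(char_ngrams(s)))
indicators = joined.str.get_dummies(sep="|")

# get_dummies sorts its columns; restore the order of first appearance.
order = list(dict.fromkeys(bg for s in df["A"] for bg in char_ngrams(s)))
encoded = pd.concat([df[["A"]], indicators[order]], axis=1)
print(encoded)
```

With the sample data this keeps the rows in the order blue, red, black and the columns in the order bl, lu, ue, re, ed, la, ac, ck, which is the layout asked for in the question.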