Split string column based on delimiter and convert it to dict in Pandas without loop
Question:
I have the following DataFrame:
clm1, clm2, clm3
10, a, clm4=1|clm5=5
11, b, clm4=2
My desired result is:
clm1, clm2, clm4, clm5
10, a, 1, 5
11, b, 2, NaN
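To reproduce this, the sample input can be built as below. This is a minimal sketch; the integer dtype for clm1 and the variable name df are assumptions based on the tables above.

```python
import pandas as pd

# Sample input from the question; clm3 holds "key=value" pairs joined by "|"
df = pd.DataFrame(
    {
        "clm1": [10, 11],
        "clm2": ["a", "b"],
        "clm3": ["clm4=1|clm5=5", "clm4=2"],
    }
)
print(df)
```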
I have tried the following method:
from pandas import json_normalize

rows = list(df.index)
dictlist = []
for index in rows:  # loop through each row to convert clm3 to a dict
    i = df.at[index, "clm3"]
    mydict = dict(map(lambda x: x.split('='), [x for x in i.split('|') if '=' in x]))
    dictlist.append(mydict)
l = json_normalize(dictlist)  # convert the list of dicts to a flat dataframe
resultdf = df.join(l).drop('clm3', axis=1)
This gives me the desired result, but I am looking for a more efficient way to convert clm3 to a dict that does not involve looping through each row.
Answers:
Use str.extractall to get your values and unstack to pivot them into one column per match. Then use str.get_dummies to get a column name for each unique clm key.
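values = (
    df['clm3'].str.extractall(r'(=\d+)')[0]  # capture "=<digits>" for every pair
    .str.replace('=', '')
    .unstack()
    .rename_axis(None, axis=1)
)
# strip the "=<digits>" part so only the key names remain, then collect the unique keys
columns = df['clm3'].str.replace(r'=\d+', '', regex=True).str.get_dummies(sep='|').columns
values.columns = columns
dfnew = pd.concat([df[['clm1', 'clm2']], values], axis=1)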
   clm1 clm2 clm4 clm5
0    10    a    1    5
1    11    b    2  NaN
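Note that the answer above names the columns by match position, so it relies on every row listing its keys in the same order. A possible variant, sketched here under the assumption that keys never repeat within a row, captures the key together with the value and pivots on the key instead. The group names key and value and the variable names are illustrative, not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "clm1": [10, 11],
        "clm2": ["a", "b"],
        "clm3": ["clm4=1|clm5=5", "clm4=2"],
    }
)

# One row per "key=value" pair, indexed by (original row, match number)
pairs = df["clm3"].str.extractall(r"(?P<key>[^|=]+)=(?P<value>[^|]*)")

wide = (
    pairs.reset_index(level="match", drop=True)  # keep only the original row index
    .set_index("key", append=True)["value"]      # index by (row, key)
    .unstack()                                   # keys become columns
    .rename_axis(None, axis=1)
)

result = df.drop(columns="clm3").join(wide)
print(result)
```

The extracted values stay strings; convert them with pd.to_numeric if numbers are needed. unstack raises an error if a key appears twice in the same row.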
In two steps: the idea is to split twice (first on "|", then on "="), then group by the index and the key and unstack the values into columns.
s = (
    df["clm3"]
    .str.split("|", expand=True)
    .stack()
    .str.split("=", expand=True)
    .reset_index(level=1, drop=True)
)
final = pd.concat([df, s.groupby([s.index, s[0]])[1].sum().unstack()], axis=1).drop(
    "clm3", axis=1
)
print(final)
   clm1 clm2 clm4 clm5
0    10    a    1    5
1    11    b    2  NaN
# eval executes arbitrary code, so only use this on trusted data
df11 = (
    df1.clm3.map(lambda x: "dict({})".format(x.replace('|', ',')))
    .map(eval).map(pd.Series).pipe(lambda ss: pd.concat(ss.tolist(), axis=1)).T
)
df1.drop("clm3", axis=1).join(df11)
out:
clm1 clm2 clm4 clm5
0 10 a 1.0 5.0
1 11 b 2.0 NaN
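The same idea can be sketched without eval by parsing each string into a dict directly. This still calls a Python function once per row through map, so it is not truly loop-free. The helper name parse_pairs is hypothetical, and the values stay strings rather than floats.

```python
import pandas as pd

df1 = pd.DataFrame(
    {
        "clm1": [10, 11],
        "clm2": ["a", "b"],
        "clm3": ["clm4=1|clm5=5", "clm4=2"],
    }
)


def parse_pairs(text):
    # Hypothetical helper: "clm4=1|clm5=5" -> {"clm4": "1", "clm5": "5"}
    return dict(item.split("=", 1) for item in text.split("|") if "=" in item)


# A list of dicts becomes a frame with one column per key; missing keys become NaN
parsed = pd.DataFrame(df1["clm3"].map(parse_pairs).tolist(), index=df1.index)
result = df1.drop(columns="clm3").join(parsed)
print(result)
```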