How to extract certain parts of string values in column and turn them into columns?
Question:
I have a dataframe which looks like this:
val
"ID: abcnName: JohnnLast name: JohnsonnAge: 27"
"ID: igb1nName: MikenLast name: JacksonnPosition: CEOnAge: 42"
...
I would like to extract Name, Position and Age from those values and turn them into separate columns to get dataframe which looks like this:
Name Position Age
John NaN 27
Mike CEO 42
How could I do that?
Answers:
Option 1
- Use
Series.str.extract
inside a list comprehension to get capturing groups for each string ("key") in the list cols
. Each series thus produced will get the appropiate column name this way.
- Next, we pass the result to
pd.concat
with axis=1
to combine the individual series.
- (You could potentially do this in one go with
Series.str.extractall
, but whether that works, depends on the order of the different capturing groups. E.g. if the order is always the same, this can work.)
cols = ['Name', 'Position', 'Age']
res = pd.concat([df.val.str.extract(fr'(?P<{col}>(?<={col}: ).*)n?')
for col in cols], axis=1)
# e.g. for "Name", we'll look for all chars preceded by "Name :" and followed
# by "n", and same for "Position". Question mark at the end is necessary for "Age"
# (or any other such category that comes at the end)
print(res)
Name Position Age
0 John NaN 27
1 Mike CEO 42
Option 2
- Instead of relying on regex patterns, we could also resort to
Series.str.split
. First, we split on n
and use Series.explode
to get rows with "key: value" pairs (but as strings, of course).
- Next, we apply
split
again, but on :
, and now add expand=True
.
- We will now have one column (
0
) that contains all the keys, and another (1
) that contains the associated values. We use df.pivot
to turn the values in the first column into the column names and get the values from the second column as data.
- Now, we just keep only the columns that we are interested in (i.e.
['Name', 'Position', 'Age']
).
- N.B. you could reverse these last two steps, of course. E.g. do
res[res[0].isin(cols)]
and only then apply pivot
.
cols = ['Name', 'Position', 'Age']
res = df.val.str.split('n').explode().str.split(': ', expand=True)
.pivot(columns=0, values=1)[cols]
# cosmetic (because of `pivot` columns.name will be `0`)
res.columns.name = None
print(res)
Name Position Age
0 John NaN 27
1 Mike CEO 42
I had saved your Input data frame as text file and did this.
Input Data
import pandas as pd
df = pd.DataFrame(columns=['ID', 'Name', 'Last name', 'Position', 'Age'])
file =open('File_path&name.txt','r')
L = file.readlines()
line_count = 0
for line in L:
d={}
s = line[1:len(line)-2]
k =s.split('\n')
for i in k:
t = i.split(': ')
d[t[0]] = t[1]
line_count += 1
df = df.append(d,ignore_index=True)
print(df)
output:
here its not formatted properly check the images
ID Name Last name Position Age
0 abc John Johnson NaN 27
1 igb1 Mike Jackson CEO 4
Process finished with exit code 0
I have a dataframe which looks like this:
val
"ID: abcnName: JohnnLast name: JohnsonnAge: 27"
"ID: igb1nName: MikenLast name: JacksonnPosition: CEOnAge: 42"
...
I would like to extract Name, Position and Age from those values and turn them into separate columns to get dataframe which looks like this:
Name Position Age
John NaN 27
Mike CEO 42
How could I do that?
Option 1
- Use
Series.str.extract
inside a list comprehension to get capturing groups for each string ("key") in the listcols
. Each series thus produced will get the appropiate column name this way. - Next, we pass the result to
pd.concat
withaxis=1
to combine the individual series. - (You could potentially do this in one go with
Series.str.extractall
, but whether that works, depends on the order of the different capturing groups. E.g. if the order is always the same, this can work.)
cols = ['Name', 'Position', 'Age']
res = pd.concat([df.val.str.extract(fr'(?P<{col}>(?<={col}: ).*)n?')
for col in cols], axis=1)
# e.g. for "Name", we'll look for all chars preceded by "Name :" and followed
# by "n", and same for "Position". Question mark at the end is necessary for "Age"
# (or any other such category that comes at the end)
print(res)
Name Position Age
0 John NaN 27
1 Mike CEO 42
Option 2
- Instead of relying on regex patterns, we could also resort to
Series.str.split
. First, we split onn
and useSeries.explode
to get rows with "key: value" pairs (but as strings, of course). - Next, we apply
split
again, but on:
, and now addexpand=True
. - We will now have one column (
0
) that contains all the keys, and another (1
) that contains the associated values. We usedf.pivot
to turn the values in the first column into the column names and get the values from the second column as data. - Now, we just keep only the columns that we are interested in (i.e.
['Name', 'Position', 'Age']
). - N.B. you could reverse these last two steps, of course. E.g. do
res[res[0].isin(cols)]
and only then applypivot
.
cols = ['Name', 'Position', 'Age']
res = df.val.str.split('n').explode().str.split(': ', expand=True)
.pivot(columns=0, values=1)[cols]
# cosmetic (because of `pivot` columns.name will be `0`)
res.columns.name = None
print(res)
Name Position Age
0 John NaN 27
1 Mike CEO 42
I had saved your Input data frame as text file and did this.
Input Data
import pandas as pd
df = pd.DataFrame(columns=['ID', 'Name', 'Last name', 'Position', 'Age'])
file =open('File_path&name.txt','r')
L = file.readlines()
line_count = 0
for line in L:
d={}
s = line[1:len(line)-2]
k =s.split('\n')
for i in k:
t = i.split(': ')
d[t[0]] = t[1]
line_count += 1
df = df.append(d,ignore_index=True)
print(df)
output:
here its not formatted properly check the images
ID Name Last name Position Age
0 abc John Johnson NaN 27
1 igb1 Mike Jackson CEO 4
Process finished with exit code 0