How to extract information from one column to create a new column in a pandas data frame
Question:
I have a lot of Excel files that I want to combine, but as a first step I'm trying to manipulate the individual files.
My data more or less looks like this:
session | type | role |
---|---|---|
parliament: 12 | | |
1 | standing | member |
1 | standing | member |
parliament: 13 | | |
1 | standing | member |
2 | standing | member |
What I'm trying to do is add a new column containing the parliament information from the session column, while keeping all the other information as it is. My final Excel file should look like this:
session | type | role | parliament |
---|---|---|---|
1 | standing | member | 12 |
1 | standing | member | 12 |
1 | standing | member | 13 |
2 | standing | member | 13 |
Can you guys please help me understand how to solve this?
EDIT:
Here's a slice of my data in dictionary form:
{'Session': {0: 'Parliament: 28', 1: 1, 2: 1, 3: 1, 4: 1},
'Composition': {0: nan, 1: 'Senate', 2: 'Senate', 3: 'Senate', 4: 'Senate'},
'Type': {0: nan, 1: 'Standing', 2: 'Standing', 3: 'Standing', 4: 'Standing'},
'Role': {0: nan, 1: 'Chair', 2: 'Member', 3: 'Member', 4: 'Member'},
'Organization': {0: nan,
1: 'Committee of Selection',
2: 'Standing Committee on Banking and Commerce',
3: 'Standing Committee on Finance',
4: 'Standing Committee on Immigration and Labour'},
'Political Affiliation': {0: nan,
1: 'Liberal Party of Canada',
2: 'Liberal Party of Canada',
3: 'Liberal Party of Canada',
4: 'Liberal Party of Canada'}}
Answers:
You can groupby each parliament group using cumsum(), and then just restructure the data in the apply function to match the final output you want:
(df.groupby(df.session.str.contains('parliament').cumsum())
.apply(lambda s: s[1:].assign(parliament=s.head(1).session.item().strip('parliament: ')))
.reset_index(drop=True))
  session      type    role parliament
0       1  standing  member         12
1       1  standing  member         12
2       1  standing  member         13
3       2  standing  member         13
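To see why the cumsum() trick works, it can help to look at the grouping key on its own. The snippet below is only a sketch: the sample frame is rebuilt by hand from the tables in the question (with every session value stored as a string, which is an assumption), and the names is_header and group_id are made up for illustration.

```python
import pandas as pd

# Sample frame rebuilt from the question's table (all session values as strings).
df = pd.DataFrame({
    "session": ["parliament: 12", "1", "1", "parliament: 13", "1", "2"],
    "type": [None, "standing", "standing", None, "standing", "standing"],
    "role": [None, "member", "member", None, "member", "member"],
})

# True on the rows that carry the parliament label, False everywhere else.
is_header = df["session"].str.contains("parliament")

# Each True bumps the running total, so every header row starts a new group id.
group_id = is_header.cumsum()
print(group_id.tolist())  # [1, 1, 1, 2, 2, 2]

# Each group starts with its header row, followed by the rows that belong to it.
for key, chunk in df.groupby(group_id):
    print(f"{key}: {chunk['session'].iloc[0]!r} followed by {len(chunk) - 1} data rows")
```

Inside the apply, s.head(1) is that header row and s[1:] are the data rows, which is why the header can be dropped and its number attached to the rest.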
You can extract the number after parliament: and then forward-fill the value:
out = (df[~df['session'].str.startswith('parliament')]
.join(df['session'].str.extract(r':\s(?P<parliament>\d+)').ffill()))
print(out)
# Output
  session      type    role parliament
1       1  standing  member         12
2       1  standing  member         12
4       1  standing  member         13
5       2  standing  member         13
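The slice posted in the EDIT differs a bit from the simplified table: the column is called Session, the label is capitalised as Parliament, and the data rows hold integers rather than strings. Here is a rough sketch of the same extract-and-fill idea adapted to that shape. It assumes every label row starts with "Parliament" and that the number follows the colon; only a few of the columns are reproduced, and the names is_header and parliament are just for illustration.

```python
import numpy as np
import pandas as pd

# A cut-down version of the slice from the EDIT (three of the six columns).
df = pd.DataFrame({
    "Session": ["Parliament: 28", 1, 1, 1, 1],
    "Composition": [np.nan, "Senate", "Senate", "Senate", "Senate"],
    "Role": [np.nan, "Chair", "Member", "Member", "Member"],
})

# Cast to str first: on a mixed column the .str methods return NaN for the
# integer cells, and a mask containing NaN cannot be negated with ~.
session = df["Session"].astype(str)
is_header = session.str.startswith("Parliament")

# Pull the digits after the colon (NaN on data rows), then carry them downwards.
parliament = session.str.extract(r":\s*(\d+)", expand=False).ffill()

# Drop the label rows and attach the filled values; assign aligns on the index.
out = df.loc[~is_header].assign(Parliament=parliament)
print(out)
```

The new column holds strings here; if you need numbers, something like out["Parliament"].astype(int) should work as long as every row ended up with a value.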