How to correctly pivot a dataframe so the values of the first column are my new columns?
Question:
I have a file with some random census data, in essence multiple lines of the following:
age=senior workclass=Self-emp-not-inc education=Bachelors edu_num=13 marital=Divorced occupation=Craft-repair relationship=Not-in-family race=White sex=Male gain=high loss=none hours=half-time country=United-States salary>50K
I want to transform this into a csv that looks like this:
senior Self-emp-not-inc Bachelors ... >50K
I created the following script that I was hoping would do what I want:
for i in range(df.shape[1]):
    temp_df = df.loc[i].str.split(" ", expand=True)
    temp_df = temp_df[0].str.split("=", expand=True)
    temp_df.columns = ['column_names', 'column_values']
    temp_df = temp_df.reset_index(drop=True)
    temp_df = temp_df.pivot(index=temp_df.index, columns='column_names', values='column_values')
The last line though is throwing an error, specifically:
KeyError: 0
How can I fix my pivot? Or, if this is not the right approach, what would be a better way to achieve what I want?
Answers:
Maybe just create a function that handles the split and returns a dictionary of column_name:column_value pairs, then create a dataframe from a list of those dictionaries.
import re
import pandas as pd

def gen_dic_from_val(x):
    # Split each token on '=', '>' or '<'; the separator itself is dropped
    l = [re.split('=|>|<', v) for v in x.split(' ')]
    return {k[0]: k[1] for k in l}
pd.DataFrame(df.val.apply(gen_dic_from_val).to_list())
(This assumes df is a pd.DataFrame holding your long string values in a column named val.)
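As a minimal sketch of the same dictionary idea, the variant below keeps the comparison sign on the salary value, since the desired output in the question shows ">50K". The helper name parse_line and the one-row frame are assumptions made for illustration, and every token is assumed to contain one of '=', '>' or '<'.

```python
import re
import pandas as pd

# Sample line taken from the question.
line = (
    "age=senior workclass=Self-emp-not-inc education=Bachelors edu_num=13 "
    "marital=Divorced occupation=Craft-repair relationship=Not-in-family "
    "race=White sex=Male gain=high loss=none hours=half-time "
    "country=United-States salary>50K"
)

def parse_line(text):
    """Hypothetical helper: turn 'k=v k=v salary>50K' into a dict."""
    record = {}
    for token in text.split():
        # The capturing group makes re.split return the separator as well.
        name, sep, value = re.split(r'(=|>|<)', token, maxsplit=1)
        # Drop a plain '=', but keep '>' or '<' as part of the value.
        record[name] = value if sep == '=' else sep + value
    return record

df = pd.DataFrame({'val': [line]})
out = pd.DataFrame(df['val'].apply(parse_line).to_list())
```

Because dictionaries preserve insertion order, the columns of out follow the order of the tokens in the line, and out.to_csv(..., index=False) would then write the file.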
Given a data frame with an ID column:
>>> df = pd.DataFrame({'id': [0], 's': 'age=senior workclass=Self-emp-not-inc education=Bachelors edu_num=13 marital=Divorced occupation=Craft-repair relationship=Not-in-family race=White sex=Male gain=high loss=none hours=half-time country=United-States salary>50K'})
Create a new data frame from the list version of the data:
>>> df['sl'] = df['s'].str.split(' ', expand=False)
>>> df1 = df.explode('sl')
>>> df1
id ... sl
0 0 ... age=senior
0 0 ... workclass=Self-emp-not-inc
0 0 ... education=Bachelors
0 0 ... edu_num=13
0 0 ... marital=Divorced
0 0 ... occupation=Craft-repair
0 0 ... relationship=Not-in-family
0 0 ... race=White
0 0 ... sex=Male
0 0 ... gain=high
0 0 ... loss=none
0 0 ... hours=half-time
0 0 ... country=United-States
0 0 ... salary>50K
[14 rows x 3 columns]
Create your names and values from your second split.
>>> df1[['n', 'v']] = df1['sl'].str.split('=', expand=True)
>>> df1[['id', 'n', 'v']]
id n v
0 0 age senior
0 0 workclass Self-emp-not-inc
0 0 education Bachelors
0 0 edu_num 13
0 0 marital Divorced
0 0 occupation Craft-repair
0 0 relationship Not-in-family
0 0 race White
0 0 sex Male
0 0 gain high
0 0 loss none
0 0 hours half-time
0 0 country United-States
0 0 salary>50K None
Then just pivot into place.
>>> df1.pivot(index='id', columns='n', values='v')
n age country edu_num ... salary>50K sex workclass
id ...
0 senior United-States 13 ... None Male Self-emp-not-inc
Because each observation (the index in df.pivot) is uniquely identified, this works over the entire data frame as a whole. If you don't have an id column already, create one with df.reset_index().rename(columns={'index': 'id'}) at the very start.
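The explode-and-pivot steps can be sketched end to end for several rows at once. The two input rows are made up for illustration, and the names long_df, parts and wide are assumptions. Instead of splitting on '=', this sketch uses str.extract so that the salary token yields the name "salary" and the value ">50K" rather than a None value.

```python
import pandas as pd

# Two made-up rows in the same "name=value" format as the question.
df = pd.DataFrame({'s': [
    'age=senior sex=Male salary>50K',
    'age=young sex=Female salary>50K',
]})

# Give every row an explicit id column.
df = df.reset_index().rename(columns={'index': 'id'})

# One row per token; reset the index so it is unique again after explode.
long_df = (
    df.assign(sl=df['s'].str.split(' '))
      .explode('sl')
      .reset_index(drop=True)
)

# Name = everything before the first '=', '<' or '>'. An '=' is dropped,
# while '<' or '>' stays at the front of the value.
parts = long_df['sl'].str.extract(r'^(?P<n>[^=<>]+)=?(?P<v>.*)$')
long_df['n'] = parts['n']
long_df['v'] = parts['v']

wide = long_df.pivot(index='id', columns='n', values='v')

# pivot sorts the new columns alphabetically; restore first-seen order.
wide = wide[long_df['n'].drop_duplicates().tolist()]
```

The final reordering step matters only if the CSV should keep the column order of the input lines.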
Assuming your dataframe does indeed look like this:
print(df)
0 1 2 ... 11 12 13
0 age=senior workclass=Self-emp-not-inc education=Bachelors ... hours=half-time country=United-States salary>50K
You could stack, split and unstack to get the columns you need.
df1 = (
    df.stack()
    .str.split('=|>', expand=True)
    .reset_index(1, drop=True)
    .set_index(0, append=True)
    .unstack(1)
)
print(df1)
0 age country edu_num education gain ... race relationship salary sex workclass
0 senior United-States 13 Bachelors high ... White Not-in-family 50K Male Self-emp-not-inc
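For completeness, here is a rough sketch of the same stack, split and unstack chain starting from raw text and ending in CSV. The in-memory text stands in for the file (the rows are made up), and reading it with sep=' ' and header=None is an assumption about how the frame above was produced. As in the output above, splitting on '=|>' drops the '>' from the salary value.

```python
import io
import pandas as pd

# Stand-in for the input file; the two rows are made up for illustration.
text = (
    "age=senior sex=Male salary>50K\n"
    "age=young sex=Female salary>50K\n"
)

# One column per space-separated token, with no header row.
df = pd.read_csv(io.StringIO(text), sep=' ', header=None)

wide = (
    df.stack()                          # one token per row
      .str.split('=|>', expand=True)    # column 0 = name, column 1 = value
      .reset_index(1, drop=True)        # drop the original column position
      .set_index(0, append=True)[1]     # index by (row, name), keep the values
      .unstack()                        # names become the columns
)
wide.columns.name = None

# With no path argument, to_csv returns the CSV as a string.
csv_text = wide.to_csv(index=False)
```

Selecting column 1 before unstack gives flat column names rather than a two-level column index, which makes the CSV header cleaner.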