How to correctly pivot a dataframe so the values of the first column are my new columns?

Question:

I have a file with some random census data, in essence multiple lines of the following:

age=senior workclass=Self-emp-not-inc education=Bachelors edu_num=13 marital=Divorced occupation=Craft-repair relationship=Not-in-family race=White sex=Male gain=high loss=none hours=half-time country=United-States salary>50K

I want to transform this into a csv that looks like this:

senior Self-emp-not-inc Bachelors ... >50K

I created the following script that I was hoping would do what I want:

for i in range(df.shape[1]):
    temp_df = df.loc[i].str.split(" ", expand=True)
    temp_df = temp_df[0].str.split("=", expand=True)    

    temp_df.columns = ['column_names', 'column_values']
    temp_df = temp_df.reset_index(drop=True)

    temp_df = temp_df.pivot(index=temp_df.index, columns='column_names', values='column_values')

The last line though is throwing an error, specifically:

KeyError: 0

How can I fix my pivot or, if this approach is not correct, what would be a better way to achieve what I want?

Asked By: dearn44

Answers:

Maybe just create a function that handles the split, returning a dictionary of column_name:column_value pairs, and then create a dataframe from a list of those dictionaries.

import re
import pandas as pd

def gen_dic_from_val(x):
    # Split each space-separated token on '=', '>' or '<' into [name, value]
    l = [re.split('=|>|<', v) for v in x.split(' ')]
    return {k[0]: k[1] for k in l}

pd.DataFrame(df.val.apply(gen_dic_from_val).to_list())

(This assumes df is a pd.DataFrame holding your long string values in a column named val)
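
As a minimal, self-contained sketch of the same idea (the sample rows, the helper name parse_record and the variable names are assumptions for illustration, not part of the answer above), the variant below also keeps the comparison sign, so salary>50K ends up as a salary column holding >50K, and then renders the result as CSV text:

```python
import re

import pandas as pd

# Hypothetical sample data: one raw record per row in a column named "val".
raw = (
    "age=senior workclass=Self-emp-not-inc education=Bachelors edu_num=13 "
    "marital=Divorced occupation=Craft-repair relationship=Not-in-family "
    "race=White sex=Male gain=high loss=none hours=half-time "
    "country=United-States salary>50K"
)
df = pd.DataFrame({"val": [raw, raw.replace("age=senior", "age=young")]})


def parse_record(line):
    """Turn 'k1=v1 k2=v2 ...' into {'k1': 'v1', 'k2': 'v2', ...}."""
    record = {}
    for token in line.split():
        # Split once on the first '=', '<' or '>'; the capturing group
        # keeps the separator in the result list.
        key, sep, value = re.split(r"([=<>])", token, maxsplit=1)
        # Drop a plain '=', but keep '<' or '>' as part of the value.
        record[key] = value if sep == "=" else sep + value
    return record


# One dictionary per record, then one dataframe row per dictionary.
wide = pd.DataFrame(df["val"].map(parse_record).tolist())

# CSV text with one column per attribute name.
csv_text = wide.to_csv(index=False)
```

A token that contains none of the three separators would raise a ValueError when unpacking, so malformed lines would need their own handling.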

Answered By: langtang

Given a data frame with an ID column:

>>> df = pd.DataFrame({'id': [0], 's': 'age=senior workclass=Self-emp-not-inc education=Bachelors edu_num=13 marital=Divorced occupation=Craft-repair relationship=Not-in-family race=White sex=Male gain=high loss=none hours=half-time country=United-States salary>50K'})

Create a new data frame from the list version of the data:

>>> df['sl'] = df['s'].str.split(' ', expand=False)
>>> df1 = df.explode('sl')
   id  ...                          sl
0   0  ...                  age=senior
0   0  ...  workclass=Self-emp-not-inc
0   0  ...         education=Bachelors
0   0  ...                  edu_num=13
0   0  ...            marital=Divorced
0   0  ...     occupation=Craft-repair
0   0  ...  relationship=Not-in-family
0   0  ...                  race=White
0   0  ...                    sex=Male
0   0  ...                   gain=high
0   0  ...                   loss=none
0   0  ...             hours=half-time
0   0  ...       country=United-States
0   0  ...                  salary>50K

[14 rows x 3 columns]

Create your names and values from your second split. Because salary>50K contains no '=', it stays whole as a name and gets a value of None.

>>> df1[['n', 'v']] = df1['sl'].str.split('=', expand=True)
>>> df1[['id', 'n', 'v']]
   id             n                 v
0   0           age            senior
0   0     workclass  Self-emp-not-inc
0   0     education         Bachelors
0   0       edu_num                13
0   0       marital          Divorced
0   0    occupation      Craft-repair
0   0  relationship     Not-in-family
0   0          race             White
0   0           sex              Male
0   0          gain              high
0   0          loss              none
0   0         hours         half-time
0   0       country     United-States
0   0    salary>50K              None

Then just pivot into place.

>>> df1.pivot(index='id', columns='n', values='v')
n      age        country edu_num  ... salary>50K   sex         workclass
id                                 ...                                   
0   senior  United-States      13  ...       None  Male  Self-emp-not-inc

Because each observation is uniquely identified by id (the index in df.pivot), this works over the entire data frame at once. If you don’t have an id column already, create one with df.reset_index().rename(columns={'index': 'id'}) at the very start.
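
The whole sequence, including the id step, might be sketched as follows. The shortened sample rows and the regular expression are assumptions made for illustration; str.extract is swapped in for the second split so that a token without '=' (salary>50K) still yields a name and a value:

```python
import pandas as pd

# Hypothetical, shortened sample records.
raw = "age=senior workclass=Self-emp-not-inc hours=half-time salary>50K"
df = pd.DataFrame({"s": [raw, raw.replace("senior", "young")]})

# Turn the default index into an explicit 'id' column.
df = df.reset_index().rename(columns={"index": "id"})

# One row per 'name=value' token; 'id' still says which record it came from.
long = df.assign(sl=df["s"].str.split(" ")).explode("sl", ignore_index=True)

# Group 1: the name (everything before '=', '<' or '>').
# An '=' directly after the name is consumed; '<' and '>' stay in group 2.
parts = long["sl"].str.extract(r"^([^=<>]+)=?(.*)$")
long["n"] = parts[0]
long["v"] = parts[1]

# Names become columns, one row per id.
wide = long.pivot(index="id", columns="n", values="v")
```

Here wide has one row per record and a salary column holding the sign together with the amount.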

Answered By: ifly6

Assuming your dataframe does indeed look like this:

print(df)
               0                           1                    2   ...               11                     12          13
0  age=senior  workclass=Self-emp-not-inc  education=Bachelors  ...  hours=half-time  country=United-States  salary>50K

You could stack, split and unstack to get the columns you need:

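df1 = (
    df.stack()                        # one 'name=value' token per row
    .str.split('=|>', expand=True)    # column 0 = name, column 1 = value
    .reset_index(1, drop=True)        # drop the original column position
    .set_index(0, append=True)        # index becomes (row, name)
    .unstack(1)                       # names become columns
)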

print(df1)
0     age        country edu_num  education  gain  ...   race   relationship salary   sex         workclass
0  senior  United-States      13  Bachelors  high  ...  White  Not-in-family    50K  Male  Self-emp-not-inc
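
A self-contained sketch of the stack / split / unstack idea, under the assumption that each token already sits in its own cell. The small sample frame, the column labels "name" and "value", and the use of str.extract (so that signs such as > or <= stay in the value) are illustrative choices, not part of the answer above:

```python
import pandas as pd

# Hypothetical input: one 'name=value' token per cell.
df = pd.DataFrame(
    [
        ["age=senior", "sex=Male", "salary>50K"],
        ["age=young", "sex=Female", "salary<=50K"],
    ]
)

# Stack into a Series indexed by (row, original column position).
tokens = df.stack()

# Name = everything before '=', '<' or '>'; a leading '=' is dropped
# from the value, while '<' and '>' are kept.
pairs = tokens.str.extract(r"^([^=<>]+)=?(.*)$")
pairs.columns = ["name", "value"]

# Replace the column-position level with the attribute name.
indexed = pairs.reset_index(level=1, drop=True).set_index("name", append=True)

# Unstacking the 'value' Series turns each name into a column.
wide = indexed["value"].unstack("name")
```

Selecting the value column before unstacking gives flat column labels; unstacking the whole frame would produce a two-level column index instead.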

Answered By: Umar.H