How to pivot dataframe into ML format

Question:

My head is spinning trying to figure out if I have to use pivot_table, melt, or some other function.

I have a DF that looks like this:

   month  day  week_day  classname_en  origin  destination
0      1    7         2             1       2            5
1      1    2         6             2       1          167
2      2    1         5             1       2           54
3      2    2         6             4       1            6
4      1    2         6             5       6            1

But I want to turn it into something like:

     month_1 month_2 ...classname_en_1 classname_en_2 ... origin_1 origin_2 ...destination_1
0      1       0              1             0                 0         1        0      
1      1       0              0             1                 1         0        0
2      0       1              1             0                 0         1        0
3      0       1              0             0                 1         0        0
4      1       0              0             0                 0         0        1

Basically, turn every value into its own column and then have binary rows: 1 if the row has that value, 0 if not.

IDK if it is at all possible to do with like a single function or not, but would appreciate all and any help!

Asked By: Sin of Greed

Answers:

Use pd.get_dummies:

out = pd.get_dummies(df, columns=df.columns)
print(out)

# Output
   month_1  month_2  day_1  day_2  day_7  week_day_2  week_day_5  ...  origin_2  origin_6  destination_1  destination_5  destination_6  destination_54  destination_167
0        1        0      0      0      1           1           0  ...         1         0              0              1              0               0                0
1        1        0      0      1      0           0           0  ...         0         0              0              0              0               0                1
2        0        1      1      0      0           0           1  ...         1         0              0              0              0               1                0
3        0        1      0      1      0           0           0  ...         0         0              0              0              1               0                0
4        1        0      0      1      0           0           0  ...         0         1              1              0              0               0                0

[5 rows x 20 columns]
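
For anyone who wants to reproduce this, here is a minimal, self-contained sketch. It rebuilds the frame from the question's table and passes dtype=int, which is an addition not in the snippet above: on newer pandas versions get_dummies returns True/False columns by default, so dtype=int is one way to get the 0/1 output shown here.

```python
import pandas as pd

# Frame rebuilt from the table in the question.
df = pd.DataFrame({
    "month": [1, 1, 2, 2, 1],
    "day": [7, 2, 1, 2, 2],
    "week_day": [2, 6, 5, 6, 6],
    "classname_en": [1, 2, 1, 4, 5],
    "origin": [2, 1, 2, 1, 6],
    "destination": [5, 167, 54, 6, 1],
})

# Encode every column; dtype=int forces 0/1 integers instead of booleans.
out = pd.get_dummies(df, columns=list(df.columns), dtype=int)

print(out.shape)
print(out[["month_1", "month_2", "destination_167"]])
```

Each distinct value in each column becomes one indicator column, which is where the 20 columns come from (2 + 3 + 3 + 4 + 3 + 5).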
Answered By: Corralien

To expand on @Corralien's answer:

It is indeed a way to do it, but since you write that this is for ML purposes, you might introduce a bug.
With the code above you get a matrix with 20 features. Now, say you want to predict on some data which suddenly contains a month that was not in your training data: the matrix for your prediction data would then have 21 features, so you cannot pass it to your fitted model.

To overcome this you can use OneHotEncoder from scikit-learn. It’ll make sure that you always have the same number of features on "new data" as on your training data.

import pandas as pd

df_train = pd.DataFrame({"color":["red","blue"],"age":[10,15]})
pd.get_dummies(df_train)

# output
   age  color_blue  color_red
0   10           0          1
1   15           1          0


df_new = pd.DataFrame({"color":["red","blue","green"],"age":[10,15,20]})
pd.get_dummies(df_new)

#output

   age  color_blue  color_green  color_red
0   10           0            0          1
1   15           1            0          0
2   20           0            1          0

As you can see, the layout of the binary color columns has changed as well: color_green now sits between color_blue and color_red.
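
If you want to stay in pandas, one possible workaround (a sketch, not something the answer itself proposes) is to reindex the encoded new data against the training columns. The names train_encoded, new_encoded and new_aligned are made up for illustration, and dtype=int is only there to get 0/1 output on newer pandas versions.

```python
import pandas as pd

df_train = pd.DataFrame({"color": ["red", "blue"], "age": [10, 15]})
df_new = pd.DataFrame({"color": ["red", "blue", "green"], "age": [10, 15, 20]})

train_encoded = pd.get_dummies(df_train, dtype=int)
new_encoded = pd.get_dummies(df_new, dtype=int)

# Keep only the columns seen at training time, in the training order.
# Columns for unseen categories (color_green) are dropped, and any training
# column missing from the new data would be filled with 0.
new_aligned = new_encoded.reindex(columns=train_encoded.columns, fill_value=0)

print(new_aligned)
```

This gives the new data the same columns as the training data, but you have to keep the training column list around yourself.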

If, on the other hand, you use OneHotEncoder, you can avoid all of those issues:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df_train = pd.DataFrame({"color": ["red", "blue"], "age": [10, 15]})
ohe = OneHotEncoder(handle_unknown="ignore")  # unknown categories are encoded as all zeros

color_ohe_transformed = ohe.fit_transform(df_train[["color"]])  # creates a sparse matrix

ohe_features = ohe.get_feature_names_out()  # ["color_blue", "color_red"]

pd.DataFrame(color_ohe_transformed.todense(), columns=ohe_features, dtype=int)

# output
   color_blue  color_red
0           0          1  
1           1          0      


# now transform new data

df_new = pd.DataFrame({"color":["red","blue","green"],"age":[10,15,20]})

new_data_ohe_transformed = ohe.transform(df_new[["color"]])
pd.DataFrame(new_data_ohe_transformed.todense(), columns=ohe_features, dtype=int)

#output

   color_blue  color_red
0           0          1
1           1          0
2           0          0

Note that in the last row both blue and red are zero, since that row has color="green", which was not present in the training data.

Note that the todense() function is only used here to illustrate how it works. Usually you would keep it as a sparse matrix and use e.g. scipy.sparse.hstack to append your other features, such as age, to it.
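
A minimal sketch of that last step, assuming age is the only other feature and that it is passed through unscaled. The variable names (color_train, X_train and so on) are made up for illustration.

```python
import pandas as pd
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder

df_train = pd.DataFrame({"color": ["red", "blue"], "age": [10, 15]})
df_new = pd.DataFrame({"color": ["red", "blue", "green"], "age": [10, 15, 20]})

ohe = OneHotEncoder(handle_unknown="ignore")
color_train = ohe.fit_transform(df_train[["color"]])  # sparse, 2 color columns
color_new = ohe.transform(df_new[["color"]])          # same 2 columns, "green" is all zeros

# Wrap the numeric column in a sparse matrix and append it on the right.
age_train = sparse.csr_matrix(df_train[["age"]].to_numpy())
age_new = sparse.csr_matrix(df_new[["age"]].to_numpy())

X_train = sparse.hstack([color_train, age_train], format="csr")
X_new = sparse.hstack([color_new, age_new], format="csr")

print(X_train.shape, X_new.shape)
```

Both matrices end up with three columns (color_blue, color_red, age), so X_new can be passed to a model fitted on X_train.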

Answered By: CutePoison

You can use the get_dummies function of pandas to convert the values in each column into their own indicator columns.

For that, your code will be:

import pandas as pd

df = pd.DataFrame({
    'month': [1, 1, 2, 2, 1],
    'day': [7, 2, 1, 2, 2],
    'week_day': [2, 6, 5, 6, 6],
    'classname_en': [1, 2, 1, 4, 5],
    'origin': [2, 1, 2, 1, 6],
    'destination': [5, 167, 54, 6, 1]
})

response = pd.get_dummies(df, columns=df.columns)
print(response)


Answered By: NIKUNJ KOTHIYA