Ordering data in Python or Excel

Question:

I have a large CSV file of unordered data consisting of music tags. I am trying to group similar tags together for easier analysis.

An example of what I have:

Band1, hiphop, pop, rap    
Band2, rock, rap, pop    
band3, hiphop, rap

The output I am looking for would be like:

Band1, hiphop, pop, rap    
Band2, NaN,    pop, rap, rock    
band3, hiphop, NaN, rap

What is the best way to sort the data like this?

I have tried using pandas and doing basic sorts in excel.

Asked By: Eoghan

||

Answers:

Read the file (simulated below with a string). As you read each row, add its genres to a fieldnames set, so that when you write the rows you can pass the full set of genres to csv.DictWriter as its fieldnames.

import csv

text_in = """
Band1, hiphop, pop, rap    
Band2, rock, rap, pop    
band3, hiphop, rap
"""

rows = [
    [col.strip() for col in row.split(",")]
    for row in text_in.split("\n")
    if row
]

fieldnames = set()
rows_reshaped = []
for row in rows:
    name = row[0]
    genres = row[1:]
    fieldnames.update(genres)
    rows_reshaped.append(dict([("name", name)] + [(genre, True) for genre in genres]))
fieldnames = ["name"] + sorted(fieldnames)

with open("band.csv", "w", encoding="utf-8", newline="") as file_out:
    writer = csv.DictWriter(file_out, fieldnames=fieldnames, restval=False)
    writer.writeheader()
    writer.writerows(rows_reshaped)

This should give you a file like:

name,hiphop,pop,rap,rock
Band1,True,True,True,False
Band2,False,True,True,True
band3,True,False,True,False
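
If you would rather have the genre name or NaN in each cell, as in the question's example, the same idea can be adapted. This is only a minimal sketch, not part of the code above: it assumes the sample rows from the question, reads them from a string, and writes to an in-memory buffer (`io.StringIO`) instead of a real file so it can run as-is.

```python
import csv
import io

# Sample rows from the question, read from a string instead of a file
text_in = """Band1, hiphop, pop, rap
Band2, rock, rap, pop
band3, hiphop, rap
"""

rows = [
    [col.strip() for col in row]
    for row in csv.reader(io.StringIO(text_in))
    if row
]

# Collect every genre seen, sorted so each one gets a fixed column
genres = sorted({genre for row in rows for genre in row[1:]})

# StringIO stands in for the output file here
out = io.StringIO()
writer = csv.writer(out, lineterminator="\n")
for name, *tags in rows:
    # Put each genre in its own column, "NaN" where the band lacks it
    writer.writerow([name] + [g if g in tags else "NaN" for g in genres])

result = out.getvalue()
print(result)
```

With the sample data, each band ends up with four genre columns (hiphop, pop, rap, rock), so Band2's row becomes Band2,NaN,pop,rap,rock. Swapping the buffer for an open file handle should be all that is needed to write to disk.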
Answered By: JonSG

Basically, this converts your wide format into a long format (one row per band/genre pair), then one-hot encodes the genres into a dataframe which you can use as you please:

import pandas as pd

# skipinitialspace drops the blank after each comma in the sample data
df = pd.read_csv('./band_csv.csv', header=None, skipinitialspace=True)

# Stack every genre column under the band column (wide -> long)
frames = []
for col in df.columns[1:]:
    frames.append(pd.DataFrame({'band': df[df.columns[0]], 'genre': df[col]}))
new_df = pd.concat(frames, ignore_index=True)

# One-hot encode the genres, then collapse back to one row per band
grouped_df = pd.get_dummies(new_df, columns=['genre']).groupby(['band'], as_index=False).sum()

Your grouped_df should look like:

   band  genre_hiphop  genre_pop  genre_rap  genre_rock
0  Band1             1          1          1           0
1  Band2             0          1          1           1
2  band3             1          0          1           0
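
For comparison, the wide-to-long step and the counting can also be sketched without an explicit loop, using melt and pd.crosstab. This is an illustration under assumptions rather than the code above: the data is the question's sample held in a string, and the column names band, g1, g2, g3 are placeholders picked for the example.

```python
import io

import pandas as pd

# Sample data from the question, without the spaces after the commas
csv_text = "Band1,hiphop,pop,rap\nBand2,rock,rap,pop\nband3,hiphop,rap\n"

# Column names are placeholders; short rows are padded with NaN
df = pd.read_csv(io.StringIO(csv_text), names=["band", "g1", "g2", "g3"])

# Wide -> long: one row per (band, genre) pair, dropping the NaN padding
long_df = df.melt(id_vars="band", value_name="genre").dropna(subset=["genre"])

# Count each pair; every cell is 0 or 1 as long as a band lists a genre once
table = pd.crosstab(long_df["band"], long_df["genre"])
print(table)
```

Here the bands become the index and the genres become the columns, without the genre_ prefix that get_dummies adds.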
Answered By: Hillygoose

Here’s an alternative way that avoids for loops, just melting and pivoting the data to get to your output:

import pandas as pd
import numpy as np

df = pd.read_csv("./test.csv", names=['col1','col2','col3','col4'], skipinitialspace=True)

#melt on all but the first column
df = pd.melt(df, id_vars='col1', value_vars=df.columns[1:], value_name='genres')

#pivot using the new genres column as column names
df = pd.pivot_table(df, values='variable', index='col1', columns='genres', aggfunc='count').reset_index()

#swap non-null values with the column name
cols = df.columns[1:]
df[cols] = np.where(df[cols].notnull(), cols, df[cols])

+--------+-------+--------+-----+-----+------+
| genres | col1  | hiphop | pop | rap | rock |
+--------+-------+--------+-----+-----+------+
|      0 | Band1 | hiphop | pop | rap | NaN  |
|      1 | Band2 | NaN    | pop | rap | rock |
|      2 | band3 | hiphop | NaN | rap | NaN  |
+--------+-------+--------+-----+-----+------+
Answered By: JNevill