Ordering data in Python or Excel
Question:
I have a large CSV file of unordered data consisting of music tags. I am trying to group similar tags together for easier analysis.
An example of what I have:
Band1, hiphop, pop, rap
Band2, rock, rap, pop
band3, hiphop, rap
The output I am looking for would be like:
Band1, hiphop, pop, rap
Band2, NaN, pop, rap, rock
band3, hiphop, NaN, rap
What is the best way to sort the data like this?
I have tried using pandas and doing basic sorts in Excel.
Answers:
Read the file (simulated below). As you read each row, update a set of fieldnames,
so that when you write the rows you can pass this set of genres to your csv.DictWriter.
import csv

text_in = """
Band1, hiphop, pop, rap
Band2, rock, rap, pop
band3, hiphop, rap
"""

# Split the simulated file into rows of stripped cells, skipping blank lines
rows = [
    [col.strip() for col in row.split(",")]
    for row in text_in.split("\n")
    if row.strip()
]

fieldnames = set()
rows_reshaped = []
for row in rows:
    name = row[0]
    genres = row[1:]
    fieldnames.update(genres)
    # Mark each genre this band has; missing genres are filled in by restval
    rows_reshaped.append(dict([("name", name)] + [(genre, True) for genre in genres]))

fieldnames = ["name"] + sorted(fieldnames)

with open("band.csv", "w", encoding="utf-8", newline="") as file_out:
    writer = csv.DictWriter(file_out, fieldnames=fieldnames, restval=False)
    writer.writeheader()
    writer.writerows(rows_reshaped)
This should give you a file like:
name,hiphop,pop,rap,rock
Band1,True,True,True,False
Band2,False,True,True,True
band3,True,False,True,False
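If you would rather get the layout from the question (the tag name where it is present, NaN where it is missing), the same fieldnames idea works with a different cell value and restval. A minimal sketch, assuming the in-memory sample rows rather than your real file; io.StringIO stands in for the output file, and aligned_csv is a name made up for this example:

```python
import csv
import io

rows = [
    ["Band1", "hiphop", "pop", "rap"],
    ["Band2", "rock", "rap", "pop"],
    ["band3", "hiphop", "rap"],
]

# Collect every genre seen in any row; sorted so the column order is stable.
genres = sorted({genre for row in rows for genre in row[1:]})
fieldnames = ["name"] + genres

# StringIO stands in for a real output file in this sketch.
buffer = io.StringIO()
writer = csv.DictWriter(
    buffer, fieldnames=fieldnames, restval="NaN", lineterminator="\n"
)
for name, *tags in rows:
    # Each tag is written under its own column; columns without a tag get restval.
    writer.writerow({"name": name, **{tag: tag for tag in tags}})

aligned_csv = buffer.getvalue()
print(aligned_csv)
```

No header row is written here, to match the output asked for in the question; call writer.writeheader() first if you want one.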
Basically, this converts your wide format into a long format (one row per band and genre) and then turns that into a one-hot encoded dataframe, which you can use as you please:
import pandas as pd

# skipinitialspace drops the blank that follows each comma in the sample file
df = pd.read_csv('./band_csv.csv', header=None, skipinitialspace=True)

# Wide to long: build one (band, genre) frame per genre column, then stack them
frames = []
for col in df.columns[1:]:
    frames.append(pd.DataFrame({'band': df[df.columns[0]], 'genre': df[col]}))
new_df = pd.concat(frames, ignore_index=True)

# One-hot encode the genres, then collapse to one row per band
grouped_df = pd.get_dummies(new_df, columns=['genre']).groupby(['band'], as_index=False).sum()
Your grouped_df should look like:
    band  genre_hiphop  genre_pop  genre_rap  genre_rock
0  Band1             1          1          1           0
1  Band2             0          1          1           1
2  band3             1          0          1           0
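One thing to watch with the groupby step: sum() counts occurrences, so a tag that appears twice for the same band comes out as 2 rather than 1. A small sketch of the difference, assuming pandas is installed and using a made-up long-format frame (the duplicate "rap" tag is hypothetical, not from the question):

```python
import pandas as pd

# Hypothetical long-format data in which Band1 carries the tag "rap" twice.
long_df = pd.DataFrame({
    "band": ["Band1", "Band1", "Band1", "Band2"],
    "genre": ["rap", "rap", "pop", "rock"],
})

dummies = pd.get_dummies(long_df, columns=["genre"])

# sum() counts occurrences, so the duplicated tag shows up as 2.
summed = dummies.groupby("band").sum()

# max() keeps a strict present/absent flag instead.
flags = dummies.groupby("band").max().astype(int)

print(summed)
print(flags)
```

If your file may contain repeated tags and you want a strict 0/1 encoding, max() (or dropping duplicates first) is probably the safer choice.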
Here's an alternative way that avoids for
loops: just melt and pivot the data to get to your output.
import pandas as pd
import numpy as np

# skipinitialspace drops the blank that follows each comma in the sample file
df = pd.read_csv("./test.csv", names=['col1','col2','col3','col4'], skipinitialspace=True)

# melt on all but the first column
df = pd.melt(df, id_vars='col1', value_vars=df.columns[1:], value_name='genres')

# pivot using the new genres column as column names
df = pd.pivot_table(df, values='variable', index='col1', columns='genres', aggfunc='count').reset_index()

# swap non-null values with the column name
cols = df.columns[1:]
df[cols] = np.where(df[cols].notnull(), cols, df[cols])
+--------+-------+--------+-----+-----+------+
| genres | col1 | hiphop | pop | rap | rock |
+--------+-------+--------+-----+-----+------+
| 0 | Band1 | hiphop | pop | rap | NaN |
| 1 | Band2 | NaN | pop | rap | rock |
| 2 | band3 | hiphop | NaN | rap | NaN |
+--------+-------+--------+-----+-----+------+
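Note that names=['col1', ..., 'col4'] assumes no band has more than three tags. If your real file has rows of varying length, one option is to read it with the csv module and build the long format directly. A sketch under that assumption, with in-memory text in place of test.csv; the extra "metal" tag and the names sample, records, long_df and wide are made up for this example:

```python
import csv
import io

import pandas as pd

# In-memory stand-in for the file; Band2 has four tags, band3 only two.
sample = (
    "Band1, hiphop, pop, rap\n"
    "Band2, rock, rap, pop, metal\n"
    "band3, hiphop, rap\n"
)

# csv.reader copes with rows of different lengths;
# skipinitialspace drops the blank after each comma.
records = [
    (row[0], genre)
    for row in csv.reader(io.StringIO(sample), skipinitialspace=True)
    if row
    for genre in row[1:]
]
long_df = pd.DataFrame(records, columns=["band", "genre"])

# Copy the genre into a separate column so it can serve as the cell value,
# then pivot: one column per genre, NaN where a band lacks the tag.
wide = long_df.assign(label=long_df["genre"]).pivot_table(
    index="band", columns="genre", values="label", aggfunc="first"
)
print(wide)
```

Here aggfunc="first" simply keeps the tag name and tolerates repeated tags; with a large file you would pass an open file object to csv.reader instead of the StringIO.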