Append new column to csv based on lookup
Question:
I have two CSV files, lookup.csv and data.csv. I'm converting lookup.csv to a dictionary and need to add a new column to data.csv based on its third column (col3).
Input:
lookup.csv
1 first
2 second
...
data.csv
101 NYC 1
202 DC 2
Expected output:
data.csv
col1 col2 col3 col4
101 NYC 1 first
202 DC 2 second
...
Here, for the first row, the new column col4 contains first because col3 is 1 and its corresponding value in lookup.csv is first.
I tried the logic below, but it fails:
df = pd.read_csv("lookup.csv",header=None, index_col=0, squeeze=True).to_dict()
df1 = pd.read_csv("data.csv")
df1['col4'] = df.get(df1['col3'])
Error: TypeError: unhashable type: 'Series'
Can someone please help in resolving this issue?
Answers:
The get method expects a hashable key (i.e., a single value), but df1['col3'] is a Series object. Try the apply method:
import pandas as pd
lookup_dict = pd.read_csv("lookup.csv", header=None, index_col=0).squeeze("columns").to_dict()
data_df = pd.read_csv("data.csv", header=None, index_col=False)
data_df.columns = ['col1', 'col2', 'col3']
data_df['col4'] = data_df['col3'].apply(lambda x: lookup_dict.get(x))
print(data_df)
Output:
col1 col2 col3 col4
0 101 NYC 1 first
1 202 DC 2 second
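As a variation on the apply approach above, Series.map accepts a dictionary directly and does the per-element lookup for you. This is only a minimal sketch: the in-memory CSV text, the comma separator, the extra third data row, and the names key and value are assumptions made for illustration, not part of the original files.

```python
import io
import pandas as pd

# Hypothetical in-memory stand-ins for lookup.csv and data.csv
# (assumed comma-separated, with no header row).
lookup_csv = io.StringIO("1,first\n2,second\n")
data_csv = io.StringIO("101,NYC,1\n202,DC,2\n303,LA,3\n")

# Build a plain {key: value} dict from the two lookup columns.
lookup_df = pd.read_csv(lookup_csv, header=None, names=["key", "value"])
lookup_dict = dict(zip(lookup_df["key"], lookup_df["value"]))

data_df = pd.read_csv(data_csv, header=None, names=["col1", "col2", "col3"])

# Series.map looks up each element of col3 in the dict; a key that is
# missing from the dict (3 in this made-up data) becomes NaN instead of raising.
data_df["col4"] = data_df["col3"].map(lookup_dict)
print(data_df)
```

Rows whose col3 value has no entry in the dictionary end up with NaN in col4, which is the same outcome as lookup_dict.get(x) returning None in the apply version.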
You can also use the pandas merge method.
If lookup.csv is:
Code Name
0 1 first
1 2 second
and data.csv is:
Pin Initial Code
0 101 NYC 1
1 202 DC 2
2 101 NYC 1
3 202 DC 2
4 101 NYC 1
5 202 DC 2
6 101 NYC 1
7 202 DC 2
Then read each CSV into a dataframe:
import pandas as pd
lookupdf = pd.read_csv('lookup.csv')
datadf = pd.read_csv('data.csv')
And use the following single line of code with merge (which joins on the common column name):
newdf = pd.merge(datadf, lookupdf)
See the result:
print(newdf)
Pin Initial Code Name
0 101 NYC 1 first
1 101 NYC 1 first
2 101 NYC 1 first
3 101 NYC 1 first
4 202 DC 2 second
5 202 DC 2 second
6 202 DC 2 second
7 202 DC 2 second
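If you want every row of data.csv kept in its original order, including rows whose code is not in the lookup table, a left merge on an explicit key is one option. The sketch below assumes comma-separated files with the header rows shown above; the 303/LA/9 row is made up to show what happens with an unmatched code.

```python
import io
import pandas as pd

# Hypothetical in-memory stand-ins for the two files (assumed comma-separated).
lookup_csv = io.StringIO("Code,Name\n1,first\n2,second\n")
data_csv = io.StringIO("Pin,Initial,Code\n101,NYC,1\n202,DC,2\n303,LA,9\n")

lookupdf = pd.read_csv(lookup_csv)
datadf = pd.read_csv(data_csv)

# how="left" keeps every row of datadf in its original order;
# on="Code" names the join key explicitly instead of relying on shared column names;
# validate="many_to_one" raises if a Code appears more than once in lookupdf.
newdf = pd.merge(datadf, lookupdf, on="Code", how="left", validate="many_to_one")
print(newdf)
```

The unmatched row gets NaN in Name rather than being dropped, which is what the default inner merge would do.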
First of all, squeeze=True causes pd.read_csv to return a Series, not a DataFrame [read_csv docs], and to_dict() then turns it into a plain dictionary. The unhashable type error comes from passing the whole df1['col3'] Series to that dictionary's get method as a single key.
Secondly, instead of converting it to a dictionary, you can just merge the dataframes or join them, depending on whether the shared key is a column or the index.
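import pandas as pd
df = pd.read_csv("lookup.csv", header=None, names=['num', 'name'])
# header=0 replaces an existing header row; use header=None if data.csv has none
df1 = pd.read_csv("data.csv", header=0, names=['foo', 'bar', 'num'])
# merge from the data side so its columns come first and every data row is kept
df_merged = df1.merge(df, on='num', how='left')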
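For the other case mentioned above, where the shared key is the index of the lookup table, DataFrame.join can be used in place of merge. This is a rough sketch under the assumption that both files are comma-separated with no header row; the in-memory CSV text stands in for the real files.

```python
import io
import pandas as pd

# Hypothetical in-memory stand-ins for lookup.csv and data.csv.
lookup_csv = io.StringIO("1,first\n2,second\n")
data_csv = io.StringIO("101,NYC,1\n202,DC,2\n")

# Read the lookup table with its key column as the index.
lookup = pd.read_csv(lookup_csv, header=None, names=["num", "name"], index_col="num")
data = pd.read_csv(data_csv, header=None, names=["foo", "bar", "num"])

# join(on="num") matches data's "num" column against lookup's index;
# it is a left join by default, so every row of data is kept.
joined = data.join(lookup, on="num")
print(joined)
```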