How to extract each value in a column of comma separated strings into individual rows
Question:
I have a csv that I am importing into a dataframe. I am trying to split a single column that has a bunch of comma separated values into rows.
df_supplier = pd.read_csv(wf['local_filename'])
print(list(df_supplier))
col = 'Commodities (Use Ctrl to select multiple)'
melt_col = 'Supplier (DTRM ID)'
df_supplier_commodities = df_supplier.loc[:, col] \
    .apply(pd.Series) \
    .reset_index() \
    .melt(id_vars=melt_col) \
    .dropna() \
    .loc[:[melt_col, col]] \
    .set_index(melt_col)
This is the piece of code I have come up with and yes I know that column header is ridiculous, but I don’t make the csvs. So this comes in with the following headers:
['Supplier (DTRM ID)', 'Status', 'Sent for Approval Date', 'Approval Date', 'Legal Company Name', 'Supplier ID', 'Company Description (Owner To Complete)', 'Parent Supplier ID', 'Parent Supplier Name', 'List of Affiliates', 'Category Manager', 'Country', 'DUNS code', 'Trade register name', 'Commodities (Use Ctrl to select multiple)', 'Default Commodity', 'City', 'State', 'Payment Terms', 'Deactivated', 'Tag', 'Created by', 'Creation Date']
The columns I need are Supplier (DTRM ID) and Commodities (Use Ctrl to select multiple). A supplier can have multiple commodities under a single supplier ID, so I want one row per commodity, each with the appropriate supplier ID.
The code errors with the following:
Traceback (most recent call last):
File "/home/ec2-user/determine_etl/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 2656, in get_loc
return self._engine.get_loc(key)
File "pandas/_libs/index.pyx", line 108, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 132, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/hashtable_class_helper.pxi", line 1601, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas/_libs/hashtable_class_helper.pxi", line 1608, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'Supplier (DTRM ID)'
But print(list(df_supplier)) shows that key is there. What am I doing wrong?
I want to make sure I have been clear so I will give an example of the data layout in the dataframe:
+--------------------+---------------------------------------------+
| Supplier (DTRM ID) | Commodities (Use Ctrl to select multiple) |
+--------------------+---------------------------------------------+
| 12333 | Strawberry, Raspberry, Flamingo, Snozzberry |
+--------------------+---------------------------------------------+
Here is the output I am trying to get:
+--------------------+-------------------------------------------+
| Supplier (DTRM ID) | Commodities (Use Ctrl to select multiple) |
+--------------------+-------------------------------------------+
| 12333 | Strawberry |
| 12333 | Raspberry |
| 12333 | Flamingo |
| 12333 | Snozzberry |
+--------------------+-------------------------------------------+
I thought my code would do this, but it tells me that Supplier (DTRM ID) isn’t a valid key (see traceback).
Answers:
It sounds like you have something like:
df = pd.DataFrame({
'A': ['11, 5.1, 2.8','6, 4, 0','0, 2, 0']
})
A
0 11, 5.1, 2.8
1 6, 4, 0
2 0, 2, 0
One column A with “,” separated values.
You can do the following to put each of the values into its own column:
df['A'].str.split(',', expand = True)
You will get the following:
0 1 2
0 11 5.1 2.8
1 6 4 0
2 0 2 0
With columns 0, 1, 2. You can then use .rename() to change the column names, and .T to transpose and make them rows. Without example DataFrames it’s difficult to understand exactly what you’re trying to do.
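One detail worth checking with this approach: splitting on ',' alone keeps the space that follows each comma attached to the values. A minimal sketch, reusing the toy column A from above, that strips the leftover whitespace (the names wide and clean are only illustrative):

```python
import pandas as pd

df = pd.DataFrame({'A': ['11, 5.1, 2.8', '6, 4, 0', '0, 2, 0']})

# One column per value; every value after the first keeps a leading space
wide = df['A'].str.split(',', expand=True)

# Strip surrounding whitespace from every column of the split result
clean = wide.apply(lambda s: s.str.strip())

print(wide[1].tolist())
print(clean[1].tolist())
```

The values are still strings after the split, so convert them (for example with pd.to_numeric) if numbers are needed.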
EDIT:
This works for me:
pd.concat([df['Supplier (DTRM ID)'], df['Commodities (Use Ctrl to select multiple)'].str.split(',', expand = True)], axis = 1) \
    .melt(id_vars=['Supplier (DTRM ID)']) \
    .sort_values(by = 'Supplier (DTRM ID)') \
    .rename(columns = {'value': 'Commodities (Use Ctrl to select multiple)'}) \
    .drop(columns = ['variable']) \
    .dropna()
(The backslashes are for readability.)
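As for the KeyError itself, it is most likely explained by the first step of the chain in the question: df_supplier.loc[:, col] selects only the commodities column, so the supplier column is already gone by the time .melt(id_vars=melt_col) runs. The sketch below reproduces this on a one-row frame built from the question's example (the sample frame and the names step and raised are assumptions for illustration):

```python
import pandas as pd

col = 'Commodities (Use Ctrl to select multiple)'
melt_col = 'Supplier (DTRM ID)'

# Assumed sample data, taken from the example row in the question
df_supplier = pd.DataFrame({
    melt_col: [12333],
    col: ['Strawberry, Raspberry, Flamingo, Snozzberry'],
})

# .loc[:, col] returns a Series holding only the commodities column,
# so nothing derived from it can contain the supplier column
step = df_supplier.loc[:, col].apply(pd.Series).reset_index()
print(list(step))  # the supplier label is not in this list

# melt() cannot find the id column and raises KeyError
try:
    step.melt(id_vars=melt_col)
    raised = False
except KeyError:
    raised = True
print(raised)
```

The key is present in df_supplier, which is why print(list(df_supplier)) shows it, but not in the intermediate frame that melt() actually receives.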
- The best option is to use .str.split and then .explode the list
import pandas as pd
df = pd.DataFrame({'Supplier': [12333, 12334], 'Commodities': ['Strawberry, Raspberry, Flamingo, Snozzberry', 'Steak, Lobster, Salmon, Tuna']})
# display(df)
Supplier Commodities
0 12333 Strawberry, Raspberry, Flamingo, Snozzberry
1 12334 Steak, Lobster, Salmon, Tuna
# split the strings into lists
df['Commodities'] = df['Commodities'].str.split(', ')
# explode the lists
df = df.explode('Commodities', ignore_index=True)
# display(df)
Supplier Commodities
0 12333 Strawberry
1 12333 Raspberry
2 12333 Flamingo
3 12333 Snozzberry
4 12334 Steak
5 12334 Lobster
6 12334 Salmon
7 12334 Tuna
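Applied to the column names from the question, the same idea might look like the sketch below. The sample rows (including the missing value and the inconsistent spacing) are assumptions added to show how such cases could be handled; .reset_index(drop=True) is used in place of ignore_index=True, which needs pandas 1.1 or later, while .explode itself needs pandas 0.25 or later.

```python
import pandas as pd

supplier_col = 'Supplier (DTRM ID)'
commodity_col = 'Commodities (Use Ctrl to select multiple)'

# Assumed sample data; the real frame comes from pd.read_csv
df_supplier = pd.DataFrame({
    supplier_col: [12333, 12334, 12335],
    commodity_col: [
        'Strawberry, Raspberry, Flamingo, Snozzberry',
        'Steak,Lobster',
        None,  # supplier with no commodities listed
    ],
    'Status': ['Active', 'Active', 'Inactive'],
})

# Keep only the two needed columns and drop suppliers without commodities
result = df_supplier[[supplier_col, commodity_col]].dropna(subset=[commodity_col]).copy()

# Split each string into a list, then give every list element its own row
result[commodity_col] = result[commodity_col].str.split(',')
result = result.explode(commodity_col).reset_index(drop=True)

# Remove whitespace left over around each commodity name
result[commodity_col] = result[commodity_col].str.strip()

print(result)
```

Splitting on ',' and stripping afterwards tolerates both "a, b" and "a,b" in the source data, which a fixed ', ' separator would not.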