How to extract each value in a column of comma separated strings into individual rows

Question

I have a csv that I am importing into a dataframe. I am trying to split a single column that has a bunch of comma separated values into rows.

df_supplier = pd.read_csv(wf['local_filename'])
print(list(df_supplier))
col = 'Commodities (Use Ctrl to select multiple)'
melt_col = 'Supplier (DTRM ID)'
df_supplier_commodities = (
    df_supplier.loc[:, col]
    .apply(pd.Series)
    .reset_index()
    .melt(id_vars=melt_col)
    .dropna()
    .loc[:, [melt_col, col]]
    .set_index(melt_col)
)

This is the code I have come up with. Yes, I know that column header is ridiculous, but I don’t make the CSVs. The file comes in with the following headers:

['Supplier (DTRM ID)', 'Status', 'Sent for Approval Date', 'Approval Date', 'Legal Company Name', 'Supplier ID', 'Company Description (Owner To Complete)', 'Parent Supplier ID', 'Parent Supplier Name', 'List of Affiliates', 'Category Manager', 'Country', 'DUNS code', 'Trade register name', 'Commodities (Use Ctrl to select multiple)', 'Default Commodity', 'City', 'State', 'Payment Terms', 'Deactivated', 'Tag', 'Created by', 'Creation Date']

Only two of these headers are needed: Supplier (DTRM ID) and Commodities (Use Ctrl to select multiple). A single supplier ID can have multiple commodities, so I want one row per commodity, each carrying the appropriate supplier ID.

The code errors with the following:

Traceback (most recent call last):
  File "/home/ec2-user/determine_etl/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 2656, in get_loc
    return self._engine.get_loc(key)
  File "pandas/_libs/index.pyx", line 108, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 132, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1601, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1608, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'Supplier (DTRM ID)'

But the print(list(df_supplier)) shows that key is there. What am I doing wrong?

I want to make sure I have been clear so I will give an example of the data layout in the dataframe:

+--------------------+---------------------------------------------+
| Supplier (DTRM ID) |  Commodities (Use Ctrl to select multiple)  |
+--------------------+---------------------------------------------+
|              12333 | Strawberry, Raspberry, Flamingo, Snozzberry |
+--------------------+---------------------------------------------+

Here is the output I am trying to get:

+--------------------+-------------------------------------------+
| Supplier (DTRM ID) | Commodities (Use Ctrl to select multiple) |
+--------------------+-------------------------------------------+
|              12333 | Strawberry                                |
|              12333 | Raspberry                                 |
|              12333 | Flamingo                                  |
|              12333 | Snozzberry                                |
+--------------------+-------------------------------------------+

I thought my code would do this, but it tells me that Supplier (DTRM ID) isn’t a valid key (see the traceback above).

Asked By: Shenanigator

Answer 1

It sounds like you have something like:

df = pd.DataFrame({
                  'A': ['11, 5.1, 2.8','6, 4, 0','0, 2, 0']
                })

       A
0   11, 5.1, 2.8
1   6, 4, 0
2   0, 2, 0

One column A with “,” separated values.

You can do the following to put each of the values into its own column:

df['A'].str.split(',', expand = True)

You will get the following:

    0   1   2
0   11  5.1 2.8
1   6   4   0
2   0   2   0

The resulting columns are labelled 0, 1, 2. You can then use .rename() to change the column names, and .T to transpose and make them rows. Without example DataFrames it’s difficult to understand exactly what you’re trying to do.
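As a rough sketch of the rename and transpose steps just mentioned (the `value_` prefix is an illustrative choice, not something from the question, and the strip step is an addition to remove the space left after each comma):

```python
import pandas as pd

df = pd.DataFrame({'A': ['11, 5.1, 2.8', '6, 4, 0', '0, 2, 0']})

# Split on the comma; expand=True returns one column per piece (labelled 0, 1, 2).
parts = df['A'].str.split(',', expand=True)

# Strip the space left behind after each comma.
parts = parts.apply(lambda s: s.str.strip())

# Give the integer column labels readable names (the prefix is hypothetical).
parts = parts.rename(columns=lambda i: f'value_{i}')

# .T swaps rows and columns, so each split position becomes a row.
transposed = parts.T
print(transposed)
```

Note that the values are still strings after splitting; convert them with something like `.astype(float)` if numbers are needed.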

EDIT:

This works for me:

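(
    pd.concat([df['Supplier (DTRM ID)'], df['Commodities (Use Ctrl to select multiple)'].str.split(',', expand = True)], axis = 1)
    .melt(id_vars=['Supplier (DTRM ID)'])
    .sort_values(by = 'Supplier (DTRM ID)')
    .rename(columns = {'value': 'Commodities (Use Ctrl to select multiple)'})
    .drop(columns = ['variable'])
    .dropna()
)

(The line breaks are for readability; the outer parentheses let the chain span several lines. Splitting on ',' alone leaves a leading space on every value after the first, so split on ', ' or apply .str.strip() if that matters.)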
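As for the KeyError in the question, one likely cause is that `df_supplier.loc[:, col]` returns only the commodities column as a Series, so the supplier column no longer exists by the time `.melt(id_vars=melt_col)` runs. A small sketch on made-up data, with short column names standing in for the real headers and the `.apply(pd.Series)` step left out because it does not bring the supplier column back either:

```python
import pandas as pd

df = pd.DataFrame({
    'Supplier': [12333],
    'Commodities': ['Strawberry, Raspberry, Flamingo, Snozzberry'],
})

# Selecting a single column with .loc returns a Series; 'Supplier' is left behind.
only_commodities = df.loc[:, 'Commodities']

# reset_index() turns the Series into a frame with columns 'index' and 'Commodities'.
widened = only_commodities.reset_index()
print(list(widened.columns))

# melt cannot find 'Supplier' among those columns, so it raises KeyError.
try:
    widened.melt(id_vars='Supplier')
    melt_failed = False
except KeyError:
    melt_failed = True
print(melt_failed)
```

Selecting both columns up front, for example `df.loc[:, ['Supplier', 'Commodities']]`, keeps the ID available for the later steps.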

Answered By: Ben Pap

Answer 2

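The most direct option is to split each string into a list with .str.split and then call .explode, which gives every list element its own row. DataFrame.explode requires pandas 0.25 or later, and its ignore_index parameter requires pandas 1.1 or later.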

import pandas as pd

df = pd.DataFrame({'Supplier': [12333, 12334], 'Commodities': ['Strawberry, Raspberry, Flamingo, Snozzberry', 'Steak, Lobster, Salmon, Tuna']})

# display(df)
   Supplier                                  Commodities
0     12333  Strawberry, Raspberry, Flamingo, Snozzberry
1     12334                 Steak, Lobster, Salmon, Tuna

# split the strings into lists
df['Commodities'] = df['Commodities'].str.split(', ')

# explode the lists
df = df.explode('Commodities', ignore_index=True)

# display(df)
   Supplier Commodities
0     12333  Strawberry
1     12333   Raspberry
2     12333    Flamingo
3     12333  Snozzberry
4     12334       Steak
5     12334     Lobster
6     12334      Salmon
7     12334        Tuna
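The same idea can be applied to the column names from the question. The following is a minimal sketch under some assumptions: the frame below is made-up stand-in data (the real CSV has many more columns), a supplier with no commodities should be skipped, and the spacing around commas may be inconsistent, which is why it splits on ',' and strips afterwards.

```python
import pandas as pd

id_col = 'Supplier (DTRM ID)'
col = 'Commodities (Use Ctrl to select multiple)'

# Made-up stand-in for the CSV; only the two relevant columns matter here.
df_supplier = pd.DataFrame({
    id_col: [12333, 12334],
    'Status': ['Active', 'Active'],
    col: ['Strawberry, Raspberry,Flamingo , Snozzberry', None],
})

df_supplier_commodities = (
    df_supplier[[id_col, col]]                          # keep the two needed columns
    .dropna(subset=[col])                               # skip suppliers with no commodities
    .assign(**{col: lambda d: d[col].str.split(',')})   # strings -> lists
    .explode(col)                                       # one row per list element
    .assign(**{col: lambda d: d[col].str.strip()})      # trim stray spaces
    .set_index(id_col)                                  # index by supplier ID
)
print(df_supplier_commodities)
```

The `**{col: ...}` form is used only because the column name contains spaces and so cannot be written as a keyword argument to `.assign`.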

Answered By: Trenton McKinney
