How would one extract data from one column of strings and set a value in another column?

Question:

I have some scraped data from an ecommerce website that has the package unit count in the product name (see the example below). I want to take the unit count from the name and add the number of units as an int to a "Unit" column. I know I can use df.loc[df[product_column].str.contains('10 pk'), unitColumn] = 10, or even loop through a list that holds sample strings for each unit count. But that becomes cumbersome when you're looking at data from 150 stores.

What I’m looking for is a way for Python to figure it out automatically and set the value, perhaps with ML. I don’t know if it’s possible, but I’m hoping someone can point me in the right direction.

Product name 10 pack
flavor 10 Pack product name
10 pack name
product name flavor 10pk
flavor product name 14Pk
1gx14 Product name
name 2-Pack
2pk different name 
store name 3 pack
product name 5 pack store name
name 5 pack
5 Pack product name
Name 5-pack product name
randome name 5pk product name
name 5pk flavor
6 Pack flavor
flavor 7 pack
prduct name 7 pack 
7pk flavor
pack x2 name

I know I can do

    df.loc[df[product_column].str.contains('10 pk'), unitColumn] = 10

But I’m looking for more of an automated solution.

Asked By: Luke


Answers:

I ended up using this solution:

Put all possibilities in a dictionary.

    dfQty = {
        2: tuple(['Pack x2', '2-pack', '2pk']),
        3: tuple(['3 pack', '3pk']),
        #4: tuple([]),
        5: tuple(['5 pack', '5-pack', '5pk']),
        6: tuple(['6 pack']),
        7: tuple(['7 pack', '7pk']),
        #8: tuple([]),
        #9: tuple([]),
        10: tuple(['10pk', '10 Pack']),
        #11: tuple([]),
        #12: tuple([]),
        #13: tuple([]),
        14: tuple(['14pk', '1gx14']),
    }

Then I looped through the dictionary and used str.contains(item) to set the value.

    for key, value in dfQty.items():
        for item in value:
            # Case-insensitive match; rows with a missing product name are skipped.
            df2.loc[df2['Product'].str.contains(item, case=False, na=False), 'Units'] = key

Important things I realized while doing this are:

  1. It’s important for my dictionary items to be wrapped in a tuple; otherwise, I get an unhashable error.
  2. I had to comment out keys with no values.
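
A note on point 1, offered as a sketch rather than a correction of the code above: in Python only dictionary keys have to be hashable, so plain lists should work as the values, and an "unhashable type" error usually points to a list being used as a key. The small frame below is hypothetical sample data; the column names 'Product' and 'Units' follow the answer, and regex=False is an added assumption so that each sample string is matched literally.

```python
import pandas as pd

# Hypothetical sample data; 'Product' and 'Units' follow the answer above.
df2 = pd.DataFrame({"Product": [
    "name 2-Pack",
    "store name 3 pack",
    "1gx14 Product name",
    "no count here",
]})

# Plain lists are fine as dictionary *values*; only *keys* must be hashable.
qty_patterns = {
    2: ["pack x2", "2-pack", "2pk"],
    3: ["3 pack", "3pk"],
    14: ["14pk", "1gx14"],
}

df2["Units"] = pd.NA  # rows that match nothing stay missing
for units, patterns in qty_patterns.items():
    for pattern in patterns:
        # regex=False treats the sample string literally, not as a regex.
        mask = df2["Product"].str.contains(pattern, case=False, na=False, regex=False)
        df2.loc[mask, "Units"] = units
```

Empty lists would also be harmless here, since the inner loop simply does nothing for them.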
Answered By: Luke

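Adapting this solution to your case, where a number that’s followed by the letter g is not a quantity, you can do

    df['Package_unit_count'] = df['Product'].str.extract(r'(\d+)[^g]')

The pattern captures the first run of digits whose next character is not g, so the 1 in 1gx14 is skipped and 14 is captured instead.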
    index  Product                         Package_unit_count
    0      Product name 10 pack            10
    1      flavor 10 Pack product name     10
    2      10 pack name                    10
    3      product name flavor 10pk        10
    4      flavor product name 14Pk        14
    5      1gx14 Product name              14
    6      name 2-Pack                     2
    7      2pk different name              2
    8      store name 3 pack               3
    9      product name 5 pack store name  5
    10     name 5 pack                     5
    11     5 Pack product name             5
    12     Name 5-pack product name        5
    13     randome name 5pk product name   5
    14     name 5pk flavor                 5
    15     6 Pack flavor                   6
    16     flavor 7 pack                   7
    17     prduct name 7 pack              7
    18     7pk flavor                      7
    19     pack x2 name                    2
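
One caveat with a pattern like the one above is that it accepts any digits not directly followed by g, so a name containing "500 ml" would yield 500, and "10g" would yield 1. If that matters for your data, a stricter variant is to require an explicit pack marker next to the number. The following is only a sketch under that assumption: the three alternatives cover the formats in the question's sample, the sample frame is hypothetical, and other stores may need more alternatives.

```python
import re
import pandas as pd

# Hypothetical sample; 'Product' mirrors the column name used above.
df = pd.DataFrame({"Product": [
    "Product name 10 pack",
    "flavor product name 14Pk",
    "1gx14 Product name",
    "name 2-Pack",
    "pack x2 name",
    "name 500g jar",
]})

# Three alternatives, each with its own capture group:
#   group 0: a number followed by "pack" or "pk"   (10 pack, 14Pk, 2-Pack)
#   group 1: "pack x" followed by a number         (pack x2)
#   group 2: a gram weight, then "x" and a number  (1gx14)
pattern = (
    r"(\d+)[\s-]?(?:pack|pk)\b"
    r"|pack\s*x\s*(\d+)"
    r"|\d+g\s*x\s*(\d+)"
)

groups = df["Product"].str.extract(pattern, flags=re.IGNORECASE)

# At most one group is filled per row; take whichever one matched.
count = groups[0].fillna(groups[1]).fillna(groups[2])

# Nullable integer dtype keeps unmatched rows as <NA> instead of forcing floats.
df["Package_unit_count"] = pd.to_numeric(count).astype("Int64")
```

Rows without a recognizable pack marker, such as "name 500g jar", are left as missing values, which makes them easy to review by hand.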
Answered By: Ignatius Reilly