What's the difference between these two ways to calculate the number of occurrences of two words in a text column?

Question

I’m new to pandas, and I’m learning it on Kaggle now.

Here is an exercise asking about to find the number of occurrences of two words in the description column.

I found the first statement from StackOverflow, but the second one is the correct answer. What’s the reason for this different result?

1. Found from StackOverflow

tropical = reviews.description.str.count("tropical").sum()
fruity = reviews.description.str.count("fruity").sum()
descriptor_counts = pd.Series([tropical,fruity])

`

2. The correct answer

tropical = reviews.description.map(lambda desc: 'tropical' in desc).sum()
fruity = reviews.description.map(lambda desc: 'fruity' in desc).sum()
descriptor_counts = pd.Series([tropical, fruity],index=['tropical','fruity'])

The first result is [3703, 9259]
The second result is [3607, 9090]

Update! The original question is:
Create a Series descriptor_counts counting how many times each of these two words appears in the description column in the dataset.

Asked By: Irene Zhou

||

Source

Answer 1

The first one is less because it’s only getting the values that are 'tropical' or 'fruity'.

So:

>>> s='a'
>>> s=='a'
True

But the second one is getting the values that contain 'tropical' or 'fruity', so the above:

>>> s='ab'
>>> s=='a'
False

So it does:

>>> s='ab'
>>> 'a' in s
True

Answered By: U12-Forward

Answer 2

Here is an example

The first code counts tropical to be 5 and fruity to be 4. It counts repetitions of the word in the same description.
So, the result would be [5,4] in this example.

The second code counts topical to be 4 and fruity to be 3. It counts the presence of the word in each description (if tropical in description). Once you find it, it counts as one no matter how many more are in that description.
So, the result would be [4,3].

So, I feel the question is wrong. If the question is about the number of occurrences of the two words, in this example [5,4] should be correct (in your case [3703,9259]). If the question asks in how many descriptions each word occurs, then you are counting the descriptions by using if word in description. So, check the question one more time.

Answered By: Raj006

Answer 3

count1, count2 = 0, 0

for i in description.iteritems():
    if "fruity" in i[0]:
        count1 += i[1]
    if "tropical" in i[0]:
        count2 += i[1]

descriptor_counts = pd.Series(data = {"fruity": count1, "tropical": count2},
                          index = ["tropical", "fruity"])

Recently I was going through the same problem and this would be my solution without using a "lambda expression"

Answered By: Shreyas Sarda

Answer 4

This is how I could solve it and got the correct answer:

n_trop = reviews['description'].str.contains('tropical').sum()
n_fruit = reviews['description'].str.contains('fruity').sum()
descriptor_counts = pd.Series([n_trop,n_fruit], index=('tropical','fruity')) 
print(descriptor_counts)

Answered By: Aarti Palande

What's the difference between these two ways to calculate the number of occurrences of two words in a text column?

Question:

1. Found from StackOverflow

2. The correct answer

Answers: