Using only for loops and if statements (no built-in functions), group similar values in one column and sum the corresponding values in another column
Question:
I have the following dataframe, df (this is a demo; the actual one is very big):
Text | Score |
---|---|
‘I love pizza!’ | 2 |
‘I love pizza!’ | 1 |
‘I love pizza!’ | 3 |
‘Python rules!’ | 0 |
‘Python rules!’ | 5 |
I want to group the rows by the ‘Text’ column and, for each group, add up the corresponding ‘Score’ values, showing that total on every row of the group.
The desired output is:
Text | Score | Sum |
---|---|---|
‘I love pizza!’ | 2 | 6 |
‘I love pizza!’ | 1 | 6 |
‘I love pizza!’ | 3 | 6 |
‘Python rules!’ | 0 | 5 |
‘Python rules!’ | 5 | 5 |
I know how to get the desired output with the pandas groupby and sum() (or aggregate) methods, for instance:

    df1 = df.groupby('Text')['Score'].sum().reset_index(name='Sum')
    df3 = df.merge(df1, on='Text', how='left')

However, I do not want to use any such built-in functions. I want to use only simple for loops and if statements to accomplish this.
I tried the following:

    def func(df):
        # NOTE: cannot use list append (as it is a built-in function).
        sum = 0
        n = len(df['Text'])  # the for loop works on integers, hence the length
        for i in range(0, n):
            exists = False  # flag to track repeated values
            for j in range(i+1, n):
                if df['Text'][i] == df['Text'][j]:  # if true, the 'Text' rows are similar, i.e. grouped
                    exists = True
                    sum = df['Score'][i] + df['Score'][j]
                    break
            if not exists:
                sum += sum
        return sum

    df['Sum'] = func(df)

The output of this script is incorrect:
Text | Score | Sum |
---|---|---|
‘I love pizza!’ | 2 | 10 |
‘I love pizza!’ | 1 | 10 |
‘I love pizza!’ | 3 | 10 |
‘Python rules!’ | 0 | 10 |
‘Python rules!’ | 5 | 10 |
I have tried several variations of the above script and get different results, but never the correct one. Any help with this is greatly appreciated!
Thank you so much in advance!
Answers:
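A note on why every row shows 10 in the question's attempt: the function returns one number rather than one value per row, and assigning a single number to a DataFrame column repeats it on every row. The sketch below re-runs the question's loop and records the running value after each row. It uses a plain dict of lists as a stand-in for the DataFrame (an assumption, made only so the example runs without pandas), and the names `attempt`, `total` and `trace` are illustrative, not from the original post.

```python
# Plain dict of lists standing in for the DataFrame (assumption for this
# sketch); df['Text'][i] style indexing works the same way on it.
data = {
    'Text': ['I love pizza!', 'I love pizza!', 'I love pizza!',
             'Python rules!', 'Python rules!'],
    'Score': [2, 1, 3, 0, 5],
}


def attempt(df):
    """The question's loop with the same logic, plus a record of the running value."""
    total = 0
    trace = []
    n = len(df['Text'])
    for i in range(0, n):
        exists = False
        for j in range(i + 1, n):
            if df['Text'][i] == df['Text'][j]:
                exists = True
                # Overwrites the running value instead of accumulating it
                total = df['Score'][i] + df['Score'][j]
                break
        if not exists:
            # Doubles the running value rather than adding a score
            total += total
        trace += [total]
    return total, trace


final, trace = attempt(data)
print(trace)  # running value after each outer iteration
print(final)  # the single number that ends up on every row
```

Two things stand out in the trace: a match replaces the running value with the sum of just two scores, and a row with no later match doubles whatever value is left over. Neither step keeps a separate total per group.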
The following script produces the correct output for the example above:

    def func(df):
        result = []
        final_result = []
        n = len(df['Text'])
        # Create a list of zeros the same length as the original list (= n) to flag positions already checked
        flags = [0] * n
        for k in range(0, n):
            sum = df['Score'][k]
            for i in range(0, n):
                # Skip (continue) without doing anything if the position has already been flagged (processed, counted)
                if flags[i]:
                    continue
                else:
                    if i == k:
                        for j in range(i+1, n):
                            if df['Text'][i] == df['Text'][j]:  # if true, the 'Text' rows are similar, i.e. grouped
                                # Every time there is a match, the position is flagged by setting it to 1
                                flags[j] = 1
                                sum += df['Score'][j]
                        result = sum
                        break
            final_result += [result]
        return final_result

    df['Sum'] = func(df)
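One caveat: for a flagged row the script above reuses whatever `result` the previous row produced, so it appears to rely on equal ‘Text’ values sitting in adjacent rows, as they do in the example. With texts ordered a, b, a, the third row would receive b's total. Below is a minimal sketch of a variant that does not depend on row order and still uses only for loops and if statements. The names `group_sums`, `sums` and `done` are hypothetical, and the plain dicts of lists are stand-ins for a DataFrame with the default integer index.

```python
def group_sums(df):
    """Per-row group totals using only for loops and if statements."""
    n = len(df['Text'])
    sums = [0] * n       # one slot per row, filled in place (no append)
    done = [False] * n   # marks rows whose total has already been written
    for i in range(n):
        if not done[i]:
            # Row i is the first unprocessed row of its group, so every
            # member of the group lies at position i or later.
            total = 0
            for j in range(i, n):
                if df['Text'][j] == df['Text'][i]:
                    total += df['Score'][j]
            # Write the finished total to every row of the group
            for j in range(i, n):
                if df['Text'][j] == df['Text'][i]:
                    sums[j] = total
                    done[j] = True
    return sums


# Stand-in data (assumption): the demo rows, then an interleaved case
demo = {
    'Text': ['I love pizza!', 'I love pizza!', 'I love pizza!',
             'Python rules!', 'Python rules!'],
    'Score': [2, 1, 3, 0, 5],
}
mixed = {'Text': ['a', 'b', 'a'], 'Score': [1, 2, 3]}

demo_sums = group_sums(demo)
mixed_sums = group_sums(mixed)
print(demo_sums)
print(mixed_sums)
```

With a real DataFrame the call would look the same as before, `df['Sum'] = group_sums(df)`. Like the original, this compares every row against later rows, so the work grows roughly with the square of the row count, which may matter for a very big frame.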