How do I count the times the word true was printed in my csv file?
Question:
The generated output is a column of True or False values. I need to count how many times True appears in every 10 rows.
The code below reads the CSV files in one folder and writes them to another. Each CSV contains the two columns selected when the DataFrame is defined. Two more columns are added, which are meant to count, through how_many_times, how many times the value meets the condition that I give.
Example of my CSV (the original df has more rows):
In [1]: dff = pd.DataFrame([['20220901-00:00:00', 50.0335,False,True], ['20220901-00:00:01', 50.024,False,False], ['20220901-00:00:02', 50.021,False,False]], columns=['t', 'f','f<49.975','f>50.025'])
This is my code (I used .sum but it didn’t work for what I needed):
import pandas as pd
import numpy as np
import glob
import os

all_files = glob.glob("C:/Users/Gamer/Documents/Colbun/Saturn/*.csv")
file_list = []
for i,f in enumerate(all_files):
    df = pd.read_csv(f,header=0,usecols=["t","f"])
    how_many_times1= df.apply(lambda x: x['f'] < 49.975, axis=1).sum
    df['f<49.975']=how_many_times1
    how_many_times2= df.apply(lambda x: x['f'] > 50.025, axis=1).sum
    df['f>50.025']=how_many_times2
    df.to_csv(f'C:/Users/Gamer/Documents/Colbun/Saturn2/{os.path.basename(f).split(".")[0]}_ext.csv')
Answers:
Call .sum() as a method, with parentheses, directly on the pandas Series that holds the DataFrame column (how_many_times1.sum()). The question's code writes .sum without parentheses, which yields the method object instead of a number. Because True is equivalent to 1 and False to 0, you can count the True entries directly without applying a further condition.
This makes sense if you need the total sum. If you need the sum periodically, every ten rows, it makes sense to count the True values inside the function that is applied to the column.
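As a small illustration of the point about .sum versus .sum(), here is a sketch on four made-up values (the Series f and the variable names are only for this example, and it uses a vectorized comparison instead of apply):

```python
import pandas as pd

# Four illustrative frequency readings (not taken from the real CSV files).
f = pd.Series([50.0335, 50.100, 48.021, 50.100])

# Vectorized comparison: gives a boolean Series (True, True, False, True).
above = f > 50.025

# Calling the method sums the booleans, with True counted as 1.
total = above.sum()

# Without parentheses you only get a reference to the method itself.
not_called = above.sum

print(total)                 # 3
print(callable(not_called))  # True
```

Assigning not_called to a DataFrame column is what the question's code does, which is why the column does not contain counts.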
The code below defines two functions to pass to apply. They create the right entries for the columns and print the sum every ten rows:
import pandas as pd

df = pd.DataFrame([['20220901-00:00:00', 50.0335],
                   ['20220901-00:00:01', 50.100 ],
                   ['20220901-00:00:02', 48.021 ],
                   ['20220901-00:00:01', 50.100 ],
                   ['20220901-00:00:01', 50.100 ],
                   ['20220901-00:00:02', 48.021 ],
                   ['20220901-00:00:01', 50.100 ],
                   ['20220901-00:00:01', 50.100 ],
                   ['20220901-00:00:02', 48.021 ],
                   ['20220901-00:00:01', 50.100 ],
                   ['20220901-00:00:02', 48.021 ],
                   ['20220901-00:00:01', 50.100 ],
                   ['20220901-00:00:01', 50.100 ],
                   ['20220901-00:00:01', 50.100 ]],
                  columns=['t', 'f' ])
columns = ['dummy', 'f<49.975','f>50.025']
print(df)

# State for the first condition: current row number (1-based),
# running count of True values, and the per-row list for the 'sum1' column.
row1 = 1
sum1 = 0
lsm1 = []
def cond1(x):
    global row1, sum1
    cond = False
    if x < 49.975:
        cond=True
        sum1+=1
    if row1%10==0:
        # Every tenth row: report the count and store it for this row.
        print('sum1:', sum1, 'at row:', row1)
        lsm1.append(sum1)
        sum1=0 # comment out if a cumulative sum is required
    else:
        lsm1.append(None)
    row1 += 1
    return cond

# Same bookkeeping for the second condition.
row2 = 1
sum2 = 0
lsm2 = []
def cond2(x):
    global row2, sum2
    cond = False
    if x > 50.025:
        cond=True
        sum2+=1
    if row2%10==0:
        print('sum2:', sum2, 'at row:', row2)
        lsm2.append(sum2)
        sum2=0 # comment out if a cumulative sum is required
    else:
        lsm2.append(None)
    row2 += 1
    return cond

how_many_times1 = df['f'].apply(cond1)
df[columns[1]] = how_many_times1
df['sum1'] = lsm1
how_many_times2 = df['f'].apply(cond2)
df[columns[2]] =how_many_times2
df['sum2'] = lsm2
print(df)
This prints:
                    t        f
0   20220901-00:00:00  50.0335
1   20220901-00:00:01  50.1000
2   20220901-00:00:02  48.0210
3   20220901-00:00:01  50.1000
4   20220901-00:00:01  50.1000
5   20220901-00:00:02  48.0210
6   20220901-00:00:01  50.1000
7   20220901-00:00:01  50.1000
8   20220901-00:00:02  48.0210
9   20220901-00:00:01  50.1000
10  20220901-00:00:02  48.0210
11  20220901-00:00:01  50.1000
12  20220901-00:00:01  50.1000
13  20220901-00:00:01  50.1000
sum1: 3 at row: 10
sum2: 7 at row: 10
                    t        f  f<49.975  sum1  f>50.025  sum2
0   20220901-00:00:00  50.0335     False   NaN      True   NaN
1   20220901-00:00:01  50.1000     False   NaN      True   NaN
2   20220901-00:00:02  48.0210      True   NaN     False   NaN
3   20220901-00:00:01  50.1000     False   NaN      True   NaN
4   20220901-00:00:01  50.1000     False   NaN      True   NaN
5   20220901-00:00:02  48.0210      True   NaN     False   NaN
6   20220901-00:00:01  50.1000     False   NaN      True   NaN
7   20220901-00:00:01  50.1000     False   NaN      True   NaN
8   20220901-00:00:02  48.0210      True   NaN     False   NaN
9   20220901-00:00:01  50.1000     False   3.0      True   7.0
10  20220901-00:00:02  48.0210      True   NaN     False   NaN
11  20220901-00:00:01  50.1000     False   NaN      True   NaN
12  20220901-00:00:01  50.1000     False   NaN      True   NaN
13  20220901-00:00:01  50.1000     False   NaN      True   NaN
Another option would be:
df[["f<49.975", "f>50.025"]] = (
    df.assign(f1=df["f"].lt(49.975), f2=df["f"].gt(50.025))
    .groupby(df.index // 10)[["f1", "f2"]].transform("sum")
    .loc[df.index % 10 == 9]
)
- Add two columns f1, f2 to df, defined by the two conditions.
- Now group every 10 rows and sum over the two new columns to get the truth-count per block. Use .transform to do that to keep the original index.
- Then take only every tenth row and assign the result to the two new columns.
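The three steps above can be sketched as a small helper and run on the 14-row sample from the first answer. The function name count_true_per_block and its block parameter are made up for illustration. As a deliberate variation, it groups by row position (np.arange) rather than by df.index, so it does not assume the index is 0, 1, 2, ..., and it casts the flags to int so the sum is a plain count.

```python
import numpy as np
import pandas as pd

def count_true_per_block(df, block=10):
    """Hypothetical helper: per-block counts of f < 49.975 and f > 50.025.

    The count is written on the last row of each full block; every
    other row gets NaN.
    """
    pos = np.arange(len(df))  # row positions 0, 1, 2, ...

    # Step 1: two flag columns defined by the two conditions.
    flags = df.assign(
        f1=df["f"].lt(49.975).astype(int),
        f2=df["f"].gt(50.025).astype(int),
    )

    # Step 2: sum the flags within each block of `block` rows;
    # transform keeps one result row per original row.
    counts = flags.groupby(pos // block)[["f1", "f2"]].transform("sum")

    # Step 3: keep only the last row of each full block.
    counts = counts.loc[pos % block == block - 1]

    out = df.copy()
    out["f<49.975"] = counts["f1"]  # aligned on the index; other rows are NaN
    out["f>50.025"] = counts["f2"]
    return out

# The 14 sample values of column f from the first answer.
f_values = [50.0335, 50.100, 48.021, 50.100, 50.100, 48.021, 50.100,
            50.100, 48.021, 50.100, 48.021, 50.100, 50.100, 50.100]
sample = pd.DataFrame({"t": ["20220901-00:00:00"] * len(f_values),  # placeholder timestamps
                       "f": f_values})

result = count_true_per_block(sample)
print(result.loc[9, ["f<49.975", "f>50.025"]])  # 3.0 and 7.0
```

In the question's loop, a helper like this would presumably be called on each df right after pd.read_csv and before to_csv. Note that rows 10 to 13 of the sample form an incomplete block, so they get no count.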