Looping over text files and get rows with NaN values in column in python
Question:
I have multiple text files with multiple tab delimited columns, from which I would like to extract rows with NaN values in column PTA
and add the filename as additional column to those extracted rows.
So for example:
File1
i B C D E F G H I J PTA K L
0 0.24055 0.31092 0.03447 0.00015 0.93464 0.08232 0.52609 0.00560 0.44018 0.06337 236 770
1 0.43976 0.45359 0.01220 0.93317 0.05711 0.06316 0.49310 0.05882 0.51825 0.18522 433 573
2 0.48067 0.17356 0.96903 0.02968 0.08864 0.05567 0.30423 0.02337 0.01981 0.56240 481 525
3 0.41872 0.18580 0.00191 0.08048 0.90871 0.02035 0.23598 0.01610 0.19815 NaN 422 584
File2
i B C D E F G H I J PTA K L
0 0.1234 0.31092 0.356 0.00015 0.93464 0.08232 0.52609 0.5873 0.0034 0.06337 367 985
1 0.975 0.367 0.01220 0.875 0.05711 0.0365 0.49310 0.05882 0.51825 NaN 635 784
2 0.48067 0.17356 0.96903 0.02968 0.08864 0.05567 0.30423 0.02337 0.01981 0.823 956 213
3 0.41872 0.18580 0.00191 0.08048 0.90871 0.02035 0.23598 0.01610 0.19815 1.30621 678 943
Expected output df:
i B C D E F G H I J PTA K L
3 0.41872 0.18580 0.00191 0.08048 0.90871 0.02035 0.23598 0.01610 0.19815 NaN 422 584 File1
1 0.975 0.367 0.01220 0.875 0.05711 0.0365 0.49310 0.05882 0.51825 NaN 635 784 File2
This I would like to do over multiple files using python. At the moment I have tried this code, but I am not sure how to put it into a proper loop:
# import required module
import os
import pandas as pd
# assign directory
directory = 'files'
for filename in os.listdir(directory):
f = os.path.join(directory, filename)
df=pd.read_csv(f, sep='t',comment='#')
print(df)
rows=df[df['PTA'].isna()]
print(rows)
At the moment, I am missing the part, where to add those rows into the new data frame.
Answers:
While iterating your files add the new column with filename to your filtered data, append()
the new dataframe
to a list and pd.concat()
all dataframes from list:
...
na_data = []
for filename in os.listdir(directory):
f = os.path.join(directory, filename)
df=pd.read_csv(f, sep='t',comment='#')
rows=df[df['PTA'].isna()]
if not rows.empty():
rows['filename'] = filename
na_data.append(rows)
pd.concat(na_data)
I have multiple text files with multiple tab delimited columns, from which I would like to extract rows with NaN values in column PTA
and add the filename as additional column to those extracted rows.
So for example:
File1
i B C D E F G H I J PTA K L
0 0.24055 0.31092 0.03447 0.00015 0.93464 0.08232 0.52609 0.00560 0.44018 0.06337 236 770
1 0.43976 0.45359 0.01220 0.93317 0.05711 0.06316 0.49310 0.05882 0.51825 0.18522 433 573
2 0.48067 0.17356 0.96903 0.02968 0.08864 0.05567 0.30423 0.02337 0.01981 0.56240 481 525
3 0.41872 0.18580 0.00191 0.08048 0.90871 0.02035 0.23598 0.01610 0.19815 NaN 422 584
File2
i B C D E F G H I J PTA K L
0 0.1234 0.31092 0.356 0.00015 0.93464 0.08232 0.52609 0.5873 0.0034 0.06337 367 985
1 0.975 0.367 0.01220 0.875 0.05711 0.0365 0.49310 0.05882 0.51825 NaN 635 784
2 0.48067 0.17356 0.96903 0.02968 0.08864 0.05567 0.30423 0.02337 0.01981 0.823 956 213
3 0.41872 0.18580 0.00191 0.08048 0.90871 0.02035 0.23598 0.01610 0.19815 1.30621 678 943
Expected output df:
i B C D E F G H I J PTA K L
3 0.41872 0.18580 0.00191 0.08048 0.90871 0.02035 0.23598 0.01610 0.19815 NaN 422 584 File1
1 0.975 0.367 0.01220 0.875 0.05711 0.0365 0.49310 0.05882 0.51825 NaN 635 784 File2
This I would like to do over multiple files using python. At the moment I have tried this code, but I am not sure how to put it into a proper loop:
# import required module
import os
import pandas as pd
# assign directory
directory = 'files'
for filename in os.listdir(directory):
f = os.path.join(directory, filename)
df=pd.read_csv(f, sep='t',comment='#')
print(df)
rows=df[df['PTA'].isna()]
print(rows)
At the moment, I am missing the part, where to add those rows into the new data frame.
While iterating your files add the new column with filename to your filtered data, append()
the new dataframe
to a list and pd.concat()
all dataframes from list:
...
na_data = []
for filename in os.listdir(directory):
f = os.path.join(directory, filename)
df=pd.read_csv(f, sep='t',comment='#')
rows=df[df['PTA'].isna()]
if not rows.empty():
rows['filename'] = filename
na_data.append(rows)
pd.concat(na_data)