How to skip reading empty files with pandas in Python
Question:
I read all the files in one folder one by one into a pandas.DataFrame and then check them for some conditions. There are a few thousand files, and I would like pandas to raise an exception when a file is empty, so that my reader function skips that file.
I have something like:
class StructureReader(FileList):
    def __init__(self, dirname, filename):
        self.dirname = dirname
        self.filename = str(self.dirname + "/" + filename)

    def read(self):
        self.data = pd.read_csv(self.filename, header=None, sep=",")
        if len(self.data) == 0:
            raise ValueError

class Run(object):
    def __init__(self, dirname):
        self.dirname = dirname
        self.file__list = FileList(dirname)
        self.result = Result()

    def run(self):
        for k in self.file__list.file_list[:]:
            self.b = StructureReader(self.dirname, k)
            try:
                self.b.read()
                self.b.find_interesting_bonds(self.result)
                self.b.find_same_direction_chain(self.result)
            except ValueError:
                pass
A regular file that I search for these conditions looks like:
"A/C/24","A/G/14","WW_cis",,
"B/C/24","A/G/15","WW_cis",,
"C/C/24","A/F/11","WW_cis",,
"d/C/24","A/G/12","WW_cis",,
But somehow ValueError is never raised, and my functions end up searching empty files, which gives me a lot of "Empty DataFrame …" lines in my results file. How can I skip empty files?
Answers:
You should not use pandas for this check; use the Python standard library directly. The answer is in the question "python how to check file empty or not".
I’d first check whether the file is empty, and only if it isn’t would I read it with pandas. The answer at https://stackoverflow.com/a/15924160/5088142 shows a nice way to check whether a file is empty:
import os

def is_non_zero_file(fpath):
    return os.path.isfile(fpath) and os.path.getsize(fpath) > 0
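As a minimal sketch of how that size check could be wired into a reading loop: the helper name `read_non_empty_csvs` and the idea of returning a dict keyed by file name are assumptions made for illustration, not part of the original code. The `header=None, sep=","` arguments are taken from the question.

```python
import os

import pandas as pd


def is_non_zero_file(fpath):
    # True only for an existing regular file with at least one byte.
    return os.path.isfile(fpath) and os.path.getsize(fpath) > 0


def read_non_empty_csvs(dirname):
    """Hypothetical helper: map file name -> DataFrame, skipping zero-byte files."""
    frames = {}
    for name in sorted(os.listdir(dirname)):
        fpath = os.path.join(dirname, name)
        if not is_non_zero_file(fpath):
            continue  # skip empty files (and anything that is not a regular file)
        frames[name] = pd.read_csv(fpath, header=None, sep=",")
    return frames
```

Note that the size check only catches files of exactly zero bytes. A file that contains nothing but whitespace has a nonzero size and would still be passed to pd.read_csv.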
You can do this with the following code: set the path variable to the folder that holds your CSVs and run it. After each successful read, raw_data is a pandas DataFrame.
import glob
import os

import pandas as pd

path = "/home/username/data_folder"
files_list = glob.glob(os.path.join(path, "*.csv"))

for fname in files_list:
    try:
        raw_data = pd.read_csv(fname)
    except pd.errors.EmptyDataError:
        # read_csv raises EmptyDataError when the file has no data to parse
        print(fname, "is empty and has been skipped.")
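The same exception-based approach can be folded into a single reader function, which is closer to the `read()` method in the question. This is only a sketch: the name `read_csv_or_none` and the choice to return `None` for a skipped file are assumptions. `pd.errors.EmptyDataError` is a subclass of `ValueError`, so an existing `except ValueError` clause would also catch it.

```python
import pandas as pd


def read_csv_or_none(fpath):
    """Hypothetical helper: return a DataFrame, or None if the file has no data."""
    try:
        data = pd.read_csv(fpath, header=None, sep=",")
    except pd.errors.EmptyDataError:
        # Raised by read_csv when there is nothing to parse in the file.
        return None
    if data.empty:
        # Defensive guard for a frame that parsed but contains no rows.
        return None
    return data
```

The caller can then skip a file with a plain `if data is None: continue` instead of relying on an exception propagating out of the reader.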
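How about this:
import glob, os
import pandas as pd
files = glob.glob('*.csv')
files = list(filter(lambda file: os.stat(file).st_size > 0, files))
# read_csv takes a single path, not a list, so read the files one at a time
data = [pd.read_csv(file) for file in files]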