Read in all csv files from a directory using Python
Question:
I hope this is not trivial, but I am wondering the following: if I have a specific folder with n csv files, how could I iteratively read all of them, one at a time, and perform some calculations on their values?
For a single file, for example, I do something like this and perform some calculations on the x array:
import numpy

directoryPath = input('Directory path for native csv file: ')  # raw_input() on Python 2
csvfile = numpy.genfromtxt(directoryPath, delimiter=",")
x = csvfile[:, 2]  # Creates the array that will undergo a set of calculations
I know that I can check how many csv files there are in a given folder:
import glob

for file_name in glob.glob("*.csv"):
    print(file_name)
But I failed to figure out how to nest the numpy.genfromtxt() call in a for loop, so that I read in all the csv files of a directory that it is up to me to specify.
EDIT
The folder I have only contains jpg and csv files. The latter are named eventX.csv, where X ranges from 1 to 50. The for loop I am referring to should therefore take the file names as they are.
Answers:
That’s how I’d do it:
import os

directory = os.path.join("c:\\", "path")  # "c:\" alone escapes the closing quote; the backslash must be doubled
for root, dirs, files in os.walk(directory):
    for file in files:
        if file.endswith(".csv"):
            f = open(os.path.join(root, file), 'r')  # join with root, or files in subfolders won't open
            # perform calculation
            f.close()
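Putting the walk together with the question's "read the third column" step, a minimal sketch (the helper name third_columns is my invention, and it uses the stdlib csv module rather than numpy to stay dependency-free):

```python
import csv
import os

def third_columns(directory):
    """Map each CSV file under `directory` (subfolders included) to the
    floats in its third column. Hypothetical helper; the walk/filter
    pattern is the one shown in the answer above."""
    result = {}
    for root, _dirs, files in os.walk(directory):
        for name in files:
            if name.endswith(".csv"):
                path = os.path.join(root, name)  # join with root so files in subfolders open correctly
                with open(path, newline="") as f:
                    result[name] = [float(row[2]) for row in csv.reader(f)]
    return result
```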
I think you are looking for something like this:
import glob
import numpy as np

for file_name in glob.glob(directoryPath + '*.csv'):
    x = np.genfromtxt(file_name, delimiter=',')[:, 2]
    # do your calculations
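One wrinkle with the eventX.csv names from the question: glob returns matches in arbitrary (at best lexicographic) order, so event10.csv would sort before event2.csv. If the numeric order matters, sort with a key that parses X out of the name; a sketch with a hypothetical numeric_key helper:

```python
import re

def numeric_key(file_name):
    # Pull the first run of digits out of a name like "event10.csv".
    match = re.search(r"\d+", file_name)
    return int(match.group()) if match else -1

names = ["event10.csv", "event2.csv", "event1.csv"]
print(sorted(names, key=numeric_key))  # ['event1.csv', 'event2.csv', 'event10.csv']
```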
Edit
If you want to get all csv files from a folder (including subfolders) you could use subprocess instead of glob (note that this only works on Linux systems):
import subprocess

output = subprocess.check_output(['find', directoryPath, '-name', '*.csv'])
file_list = output.decode().split('\n')[:-1]  # check_output returns bytes; split on '\n'
for i, file_name in enumerate(file_list):
    x = np.genfromtxt(file_name, delimiter=',')[:, 2]
    # do your calculations
    # now you can use i as an index
It first searches the folder and sub-folders for all file names using the find command from the shell, and applies your calculations afterwards.
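If shelling out to find is undesirable (it ties the code to Unix), glob itself can recurse: since Python 3.5, a ** pattern with recursive=True matches in subdirectories too. A portable sketch (find_csv_files is a hypothetical helper name):

```python
import glob
import os

def find_csv_files(directory):
    # Portable replacement for `find directory -name '*.csv'`:
    # "**" matches any number of nested subdirectories (including none).
    return sorted(glob.glob(os.path.join(directory, "**", "*.csv"), recursive=True))
```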
According to the documentation of numpy.genfromtxt(), the first argument can be a "File, filename, or generator to read." That would mean that you could write a generator that yields the lines of all the files, like this:
import glob
import numpy

def csv_merge_generator(pattern):
    for file_name in glob.glob(pattern):
        with open(file_name) as f:  # glob yields file names, which must be opened before iterating their lines
            for line in f:
                yield line

# then using it like this
numpy.genfromtxt(csv_merge_generator('*.csv'))
should work. (I do not have numpy installed, so cannot test easily)
Using pandas and glob as the base packages:
import glob
import pandas as pd

glued_data = pd.DataFrame()
for file_name in glob.glob(directoryPath + '*.csv'):
    x = pd.read_csv(file_name, low_memory=False)
    glued_data = pd.concat([glued_data, x], axis=0)
Here’s a more succinct way to do this, given some path = "/path/to/dir/".
import glob
import pandas as pd
pd.concat([pd.read_csv(f) for f in glob.glob(path+'*.csv')])
Then you can apply your calculation to the whole dataset, or, if you want to apply it one by one:
pd.concat([process(pd.read_csv(f)) for f in glob.glob(path+'*.csv')])
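The process function above stands for whatever per-file transformation you need. As a hypothetical example (both this process body and the source column are my additions), tagging each row with the file it came from:

```python
import glob
import os
import pandas as pd

def process(df, source):
    # Hypothetical per-file step: record which file each row came from.
    df = df.copy()
    df["source"] = os.path.basename(source)
    return df

def load_all(path):
    # Read every CSV under `path`, apply process() to each, and glue the results.
    files = sorted(glob.glob(os.path.join(path, "*.csv")))
    return pd.concat([process(pd.read_csv(f), f) for f in files], ignore_index=True)
```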
The function below will return a dictionary containing a dataframe for each .csv file in the folder within your defined path.
import pandas as pd
import glob
import os

def panda_read_csv(path):
    pd_csv_dict = {}
    csv_files = glob.glob(os.path.join(path, "*.csv"))
    for csv_file in csv_files:
        file_name = os.path.basename(csv_file)  # os.path.basename is portable, unlike ntpath
        pd_csv_dict['pd_' + file_name] = pd.read_csv(csv_file, sep=";", encoding='mac_roman')
    return pd_csv_dict
You need to import the glob library and then use it like the following:
import glob

path = r'C:\Users\Admin\PycharmProjects\db_conection_screenshot\seclectors_absent_images'  # raw string: "\U" in a plain string is a syntax error
filenames = glob.glob(path + "\\*.png")
print(len(filenames))
You can use pathlib's glob functionality to list all .csv files in a path, and pandas to read them. Then it’s only a matter of applying whatever function you want (which, if systematic, can also be done within the list comprehension).
import pandas as pd
from pathlib import Path

path2csv = Path("/your/path/")
csvlist = path2csv.glob("*.csv")
csvs = [pd.read_csv(g) for g in csvlist]
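The same pathlib listing works without pandas too. As a sketch of the "apply the function within the list comprehension" idea, here each file's line count is computed in place (line_counts is a hypothetical stand-in for your actual calculation):

```python
from pathlib import Path

def line_counts(directory):
    # Hypothetical per-file calculation applied inside the comprehension:
    # count the data lines of every CSV directly in `directory`.
    return {p.name: len(p.read_text().splitlines()) for p in Path(directory).glob("*.csv")}
```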
Another answer using a list comprehension:
from os import listdir

files = [f for f in listdir("./") if f.endswith(".csv")]