Is there a direct way to import the contents of a CSV file into a record array, just like how R’s
read.csv() imports data into R dataframes?
Or should I use csv.reader()?
You can use numpy.genfromtxt() by setting the delimiter kwarg to a comma:
from numpy import genfromtxt
my_data = genfromtxt('my_file.csv', delimiter=',')
You can also try
recfromcsv() which can guess data types and return a properly formatted record array.
import pandas as pd
df = pd.read_csv('myfile.csv', sep=',', header=None)
print(df.values)
array([[ 1. , 2. , 3. ], [ 4. , 5.5, 6. ]])
DataFrame is a 2-dimensional labeled data structure with columns of
potentially different types. You can think of it like a spreadsheet.
import numpy as np
np.genfromtxt('myfile.csv', delimiter=',')
For the following file:
1.0, 2, 3
4, 5.5, 6
the code above gives an array:
array([[ 1. , 2. , 3. ], [ 4. , 5.5, 6. ]])
np.genfromtxt('myfile.csv', delimiter=',', dtype=None)
gives a record array:
array([(1.0, 2.0, 3), (4.0, 5.5, 6)], dtype=[('f0', '<f8'), ('f1', '<f8'), ('f2', '<i4')])
This has the advantage that files with multiple data types (including strings) can be easily imported.
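To illustrate the mixed-type case, here is a minimal sketch. The sample data and column names are made up for the example, and an in-memory io.StringIO stands in for a real file. It assumes a NumPy version that accepts the encoding argument (1.14 or later).

```python
import io
import numpy as np

# Hypothetical sample data standing in for a CSV file on disk.
csv_text = io.StringIO(
    "name,height,count\n"
    "alice,1.5,3\n"
    "bob,2.25,4\n"
)

# names=True takes field names from the header row; dtype=None asks
# genfromtxt to infer a type per column; encoding="utf-8" makes text
# columns come back as str rather than bytes.
records = np.genfromtxt(csv_text, delimiter=",", names=True,
                        dtype=None, encoding="utf-8")

print(records.dtype.names)
print(records["name"])

# Viewing the structured array as a recarray allows attribute access.
rec = records.view(np.recarray)
print(rec.height)
```

Fields can then be read by name (records["height"]) or, on the recarray view, as attributes (rec.height).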
I timed this:
from numpy import genfromtxt
genfromtxt(fname = dest_file, dtype = (<whatever options>))
against this:
import csv
import numpy as np
with open(dest_file, 'r') as dest_f:
    data_iter = csv.reader(dest_f, delimiter = delimiter, quotechar = '"')
    data = [row for row in data_iter]
data_array = np.asarray(data, dtype = <whatever options>)
on 4.6 million rows with about 70 columns, and found that the NumPy path took 2 min 16 s while the csv list-comprehension method took 13 seconds.
I would recommend the csv list-comprehension method, as it most likely relies on pre-compiled libraries and not on the interpreter as much as NumPy does. I suspect the pandas method would have similar interpreter overhead.
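If you want to check this on your own machine, a rough sketch like the following may help. The file is a small, made-up numeric table written to a temporary path, so the absolute numbers will not match the ones above, and which method wins may depend on your data and library versions.

```python
import csv
import os
import tempfile
import timeit

import numpy as np

# Write a small, purely numeric CSV to a temporary file (hypothetical data).
rows, cols = 2000, 10
values = np.arange(rows * cols, dtype=float).reshape(rows, cols)
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False,
                                 newline="") as tmp:
    csv.writer(tmp).writerows(values.tolist())
    path = tmp.name


def load_with_genfromtxt():
    return np.genfromtxt(path, delimiter=",")


def load_with_csv_reader():
    with open(path, newline="") as f:
        return np.asarray(list(csv.reader(f)), dtype=float)


# Check that both loaders give the same array before comparing timings.
same = np.array_equal(load_with_genfromtxt(), load_with_csv_reader())

t_genfromtxt = timeit.timeit(load_with_genfromtxt, number=3)
t_csv_reader = timeit.timeit(load_with_csv_reader, number=3)
print(f"same result: {same}")
print(f"genfromtxt: {t_genfromtxt:.3f} s, csv.reader: {t_csv_reader:.3f} s")

os.remove(path)
```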
You can use this code to read CSV file data into an array:
import numpy as np
data = np.genfromtxt('test.csv', delimiter=",")
print(data)
I tried this:
import pandas as p
import numpy as n
closingValue = p.read_csv("<FILENAME>", usecols=<column indices>, dtype=float)
print(closingValue)
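As a runnable sketch of the same idea: the sample data, the column names, and the choice of column 4 are all assumptions for illustration, since the snippet above does not say which columns were selected.

```python
import io
import pandas as pd

# Hypothetical sample: a date column followed by four numeric columns.
csv_text = io.StringIO(
    "date,open,high,low,close\n"
    "2020-01-02,10.0,11.0,9.5,10.5\n"
    "2020-01-03,10.5,12.0,10.0,11.5\n"
)

# usecols takes a list of column positions (or names); only those
# columns are parsed, so dtype=float is applied to the "close" column
# alone. Column 4 is an assumption made for this example.
closing = pd.read_csv(csv_text, usecols=[4], dtype=float)

# Flatten the single-column DataFrame into a 1-D NumPy array.
closing_array = closing.to_numpy().ravel()
print(closing_array)
```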
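Having tried both NumPy and pandas, I found pandas has clear advantages: in my test it was faster (about 3 s versus 24 s elapsed) and used less memory (about 0.5 GB versus 1.7 GB peak resident size).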
This is my test run:
$ for f in test_pandas.py test_numpy_csv.py ; do /usr/bin/time python $f; done
2.94user 0.41system 0:03.05elapsed 109%CPU (0avgtext+0avgdata 502068maxresident)k
0inputs+24outputs (0major+107147minor)pagefaults 0swaps
23.29user 0.72system 0:23.72elapsed 101%CPU (0avgtext+0avgdata 1680888maxresident)k
0inputs+0outputs (0major+416145minor)pagefaults 0swaps
test_numpy_csv.py:
from numpy import genfromtxt
train = genfromtxt('/home/hvn/me/notebook/train.csv', delimiter=',')
test_pandas.py:
from pandas import read_csv
df = read_csv('/home/hvn/me/notebook/train.csv')
Data file:
$ du -h ~/me/notebook/train.csv
59M /home/hvn/me/notebook/train.csv
With NumPy and pandas at versions:
$ pip freeze | egrep -i 'pandas|numpy'
numpy==1.13.3
pandas==0.20.2
A quite simple method, but it requires all the elements to be numeric (float, int, and so on):
import numpy as np
# Raw string, so that the backslash in the Windows path is not read as an escape.
data = np.loadtxt(r'c:\1.csv', delimiter=',', skiprows=0)
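A small sketch of that limitation, using made-up in-memory data in place of real files: a numeric table loads fine once the header row is skipped, while a text column makes loadtxt raise ValueError.

```python
import io
import numpy as np

# Numeric data with a header row; skiprows=1 skips the header line.
numeric_csv = io.StringIO("x,y\n1,2.5\n3,4\n")
data = np.loadtxt(numeric_csv, delimiter=",", skiprows=1)
print(data)

# A column of text cannot be converted to the default float dtype.
mixed_csv = io.StringIO("1,abc\n2,def\n")
try:
    np.loadtxt(mixed_csv, delimiter=",")
    failed = False
except ValueError as exc:
    failed = True
    print("loadtxt could not parse the text column:", exc)
```

For files with text columns, genfromtxt with dtype=None or pandas.read_csv is probably the better fit.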
This is the easiest way:
import csv
with open('testfile.csv', newline='') as csvfile:
    data = list(csv.reader(csvfile))
Now each entry in data is a record, represented as a list of strings. So you have a 2D list. It saved me so much time.
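Note that csv.reader does no type conversion. A minimal sketch, with made-up data held in an io.StringIO instead of a file, of what the rows look like and how they could be turned into a numeric NumPy array if every field is a number:

```python
import csv
import io
import numpy as np

# In-memory stand-in for a CSV file (hypothetical data).
csv_text = io.StringIO("1.0,2,3\n4,5.5,6\n", newline="")
data = list(csv.reader(csv_text))

# Every field is still a string at this point.
print(data)

# Passing dtype=float makes NumPy parse the strings as numbers.
array = np.array(data, dtype=float)
print(array)
```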
I would suggest using tables (pip3 install tables). You can save your .csv file to .h5 using pandas (pip3 install pandas):
import pandas as pd
data = pd.read_csv("dataset.csv")
store = pd.HDFStore('dataset.h5')
store['mydata'] = data
store.close()
You can then easily, and in less time even for a huge amount of data, load your data into a NumPy array:
import pandas as pd
store = pd.HDFStore('dataset.h5')
data = store['mydata']
store.close()
# Data in NumPy format
data = data.values
This works like a charm:
import csv
import numpy as np
with open("data.csv", 'r') as f:
    data = list(csv.reader(f, delimiter=";"))
# np.float was removed in NumPy 1.24; the builtin float is equivalent.
data = np.array(data, dtype=float)
In : %time my_data = genfromtxt('one.csv', delimiter=',')
CPU times: user 19.8 s, sys: 4.58 s, total: 24.4 s
Wall time: 24.4 s
In : %time df = pd.read_csv("one.csv", skiprows=20)
CPU times: user 1.06 s, sys: 312 ms, total: 1.38 s
Wall time: 1.38 s
This is available in recent versions of pandas and NumPy (DataFrame.to_numpy() was added in pandas 0.24).
import pandas as pd
import numpy as np
data = pd.read_csv('data.csv', header=None)
# Discover, visualize, and preprocess data using pandas if needed.
data = data.to_numpy()
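Since the question asks for a record array, it may be worth noting that to_numpy() gives a plain array, while DataFrame.to_records() gives a NumPy record array with named fields. A minimal sketch with made-up data and column names:

```python
import io
import numpy as np
import pandas as pd

# Hypothetical sample with one text column and one numeric column.
csv_text = io.StringIO("name,score\nalice,1.5\nbob,2.5\n")
df = pd.read_csv(csv_text)

# With mixed column types, to_numpy() falls back to a single object dtype.
plain = df.to_numpy()
print(plain.dtype)

# to_records(index=False) keeps the column names as fields and drops
# the DataFrame index.
records = df.to_records(index=False)
print(records.dtype.names)
print(records.score)
```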
This is a very simple task. The best way to do it is as follows:
import pandas as pd
import numpy as np
# Read the file. Put 'r' before the path string so that special characters
# in the path, such as \, are taken literally. Don't forget to put the file
# name plus ".csv" at the end of the path.
df = pd.read_csv(r'C:\Users\Ron\Desktop\Clients.csv')
print(df)
y = np.array(df)