Python Pandas: How to read only the first n rows of a CSV file?
Question:
I have a very large data set and I can't afford to read it all into memory. So I'm thinking of reading only a chunk of it to train on, but I have no idea how to do it.
Answers:
If you only want to read the first 999,999 (non-header) rows:
pd.read_csv(..., nrows=999999)
If you only want to read rows 1,000,000 … 1,999,999:
pd.read_csv(..., skiprows=range(1, 1000000), nrows=999999)
(Note: passing an integer skiprows would also skip the header line; passing a range of row numbers keeps the header.)
nrows : int, default None
    Number of rows of file to read. Useful for reading pieces of large files.
skiprows : list-like or integer
    Row numbers to skip (0-indexed) or number of rows to skip (int) at the start of the file.
and for large files, you’ll probably also want to use chunksize:
chunksize : int, default None
Return TextFileReader object for iteration
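The options above can be sketched on a small in-memory CSV (the column names and sizes here are invented for illustration; a real call would pass a file path instead of the StringIO object):

```python
import io
import pandas as pd

# A tiny stand-in for a large CSV: a header line plus 10 data rows.
data = "id,value\n" + "\n".join(f"{i},{i * 10}" for i in range(10))

# Read only the first 3 data rows.
head = pd.read_csv(io.StringIO(data), nrows=3)

# Read data rows 3..5 while keeping the header, by skipping a *range*
# of line numbers (an integer skiprows would drop the header line too).
middle = pd.read_csv(io.StringIO(data), skiprows=range(1, 4), nrows=3)

print(head["id"].tolist())    # ids 0, 1, 2
print(middle["id"].tolist())  # ids 3, 4, 5
```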
If you do not want to use Pandas, you can use the csv library and limit the rows read by breaking out of the iteration. For example, I needed to loop over a list of file names (result below) and read only the header of each:
import csv

for csvs in result:
    csvs = './' + csvs
    with open(csvs, encoding='ANSI', newline='') as csv_file:
        csv_reader = csv.reader(csv_file, delimiter=',')
        count = 0
        for row in csv_reader:
            if count:
                break
            # row is the header here; do something with it
            count += 1  # without this, the loop would never break
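If you just need the first n rows without a counter, one option (a sketch; the in-memory StringIO stands in for a real open file) is itertools.islice, which stops the reader after n rows without touching the rest of the file:

```python
import csv
import io
from itertools import islice

# Stand-in for open('your.csv', newline='') on a real file.
csv_file = io.StringIO("a,b\n1,2\n3,4\n5,6\n7,8\n")

reader = csv.reader(csv_file, delimiter=',')
header = next(reader)               # the first row is the header
first_n = list(islice(reader, 2))   # the next 2 data rows; nothing more is read
```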
chunksize= is a very useful argument because with it read_csv returns an iterator, so you can call next() on it to get a specific chunk without straining your memory. For example, to get the first n rows:
chunks = pd.read_csv('file.csv', chunksize=n)
df = next(chunks)
For example, if you have time-series data and want to make the first 700k rows the train set and the remainder the test set:
chunks = pd.read_csv('file.csv', chunksize=700_000)
train_df = next(chunks)      # the first 700,000 rows
test_df = pd.concat(chunks)  # everything after that
(Note that calling next(chunks) again would return only the next 700,000 rows, not the full remainder; pd.concat over the exhausted iterator gathers all remaining chunks.)
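Beyond grabbing one chunk, you can also loop over every chunk to process a file that never fits in memory at once. A minimal sketch (the single column x and the tiny in-memory file are invented for illustration):

```python
import io
import pandas as pd

# Stand-in for a large file on disk: one column, values 0..9.
data = "x\n" + "\n".join(str(i) for i in range(10))

total = 0
n_rows = 0
# Each iteration yields a DataFrame of at most 4 rows.
for chunk in pd.read_csv(io.StringIO(data), chunksize=4):
    total += chunk["x"].sum()
    n_rows += len(chunk)

mean = total / n_rows  # running aggregate over the whole file
```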