Python Pandas: How to read only first n rows of CSV files in?

Question:

I have a very large data set and I can’t afford to read the entire data set in. So, I’m thinking of reading only one chunk of it to train but I have no idea how to do it.

Asked By: bensw

Answers:

If you only want to read the first 999,999 (non-header) rows:

read_csv(..., nrows=999999)

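If you only want to read rows 1,000,000 … 1,999,999 (keeping the header line):

read_csv(..., skiprows=range(1, 1000000), nrows=1000000)

Passing a plain integer, as in skiprows=1000000, also skips the header line, so the column names then have to be supplied with header=None and names=[...].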

nrows : int, default None
Number of rows of file to read. Useful for reading pieces of large files

skiprows : list-like or integer
Row numbers to skip (0-indexed) or number of rows to skip (int) at the start of the file
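
To make the two parameters concrete, here is a minimal sketch on a small in-memory CSV. The data and the variable names (csv_text, head, middle, middle_int) are made up for illustration; with a real file you would pass its path instead of io.StringIO(...).

```python
import io

import pandas as pd

# Made-up data: a header line followed by ten data rows (id 1..10).
csv_text = "id,value\n" + "\n".join(f"{i},{i * 10}" for i in range(1, 11))

# First three data rows; the header is parsed as usual.
head = pd.read_csv(io.StringIO(csv_text), nrows=3)

# Data rows 4..6: skip file lines 1..3 (line 0, the header, is kept).
middle = pd.read_csv(io.StringIO(csv_text), skiprows=range(1, 4), nrows=3)

# An integer skiprows drops the header too, so supply the names explicitly.
middle_int = pd.read_csv(
    io.StringIO(csv_text), skiprows=4, nrows=3, header=None, names=["id", "value"]
)

print(head["id"].tolist())
print(middle["id"].tolist())
print(middle_int["id"].tolist())
```

The list-like form of skiprows is what lets the header survive; the integer form counts from the very first line of the file.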

For large files, you’ll probably also want to use chunksize:

chunksize : int, default None
Return TextFileReader object for iteration
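
As a rough illustration of what iterating over that object looks like, the sketch below sums a column chunk by chunk, so only one chunk is held in memory at a time. The in-memory data, the column name "value", and the chunk size of 4 are assumptions chosen to keep the example small.

```python
import io

import pandas as pd

# Made-up data: ten rows with a single numeric column.
csv_text = "value\n" + "\n".join(str(i) for i in range(1, 11))

total = 0
rows_seen = 0
chunk_sizes = []

# With chunksize set, read_csv returns an iterator of DataFrames.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    total += int(chunk["value"].sum())
    rows_seen += len(chunk)
    chunk_sizes.append(len(chunk))

print(total, rows_seen, chunk_sizes)
```

The last chunk is simply shorter when the row count is not a multiple of chunksize.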

pandas.io.parsers.read_csv documentation

Answered By: smci

If you do not want to use Pandas, you can use the csv library and limit the number of rows read by breaking out of the iteration.

For example, I needed to read only the header of each file in a list of CSV files (result below).

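import csv

for name in result:  # result: list of CSV file names
    # 'ANSI' is an alias of 'mbcs' and only works on Windows.
    with open('./' + name, encoding='ANSI', newline='') as csv_file:
        csv_reader = csv.reader(csv_file, delimiter=',')
        for count, row in enumerate(csv_reader):
            if count:  # second row reached: the header has been read
                break
            print(row)  # the header row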
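
A variation on the same idea, if you want the first n data rows rather than just the header: itertools.islice stops the reader after n rows without a manual counter. This is only a sketch; read_first_rows is a hypothetical helper name, and the io.StringIO sample stands in for a real file opened with open(..., newline='').

```python
import csv
import io
from itertools import islice


def read_first_rows(csv_file, n):
    """Return the header and at most n data rows from an open text file."""
    reader = csv.reader(csv_file, delimiter=",")
    header = next(reader, None)     # None if the file is empty
    rows = list(islice(reader, n))  # stops after n rows; later rows are not parsed
    return header, rows


# Made-up data standing in for a real file.
sample = io.StringIO("name,score\na,1\nb,2\nc,3\n")
header, rows = read_first_rows(sample, 2)
print(header, rows)
```
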
Answered By: Luiz A C K

chunksize= is a very useful argument because read_csv then returns an iterator, so you can call the next() function on it to pull one chunk at a time without loading the whole file into memory. For example, to get the first n rows, you can use:

import pandas as pd

chunks = pd.read_csv('file.csv', chunksize=n)
df = next(chunks)

For example, if you have time-series data and want the first 700k rows as the train set and the remainder as the test set, you can do:

chunks = pd.read_csv('file.csv', chunksize=700_000)
train_df = next(chunks)      # first 700,000 rows
# A second next(chunks) would return at most 700,000 rows, so gather the rest.
test_df = pd.concat(chunks)
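
To check how such a split behaves, here is a small sketch on made-up in-memory data with a chunk size of 3 instead of 700,000. It assumes the remainder fits in memory, since pd.concat collects every remaining chunk into one DataFrame; the column names t and y are invented for the example.

```python
import io

import pandas as pd

# Made-up data: ten rows standing in for a large time series.
csv_text = "t,y\n" + "\n".join(f"{i},{i * i}" for i in range(10))

chunks = pd.read_csv(io.StringIO(csv_text), chunksize=3)
train_df = next(chunks)       # the first 3 rows
test_df = pd.concat(chunks)   # every row that is left

print(len(train_df), len(test_df))
print(test_df["t"].tolist())
```

Calling next(chunks) a second time instead would return only the next 3 rows here, not all 7 remaining ones.
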
Answered By: cottontail