How would I go about converting a .csv to an .arrow file without loading it all into memory?

Question:

I found a similar question here: Read CSV with PyArrow

That answer references sys.stdin.buffer and sys.stdout.buffer, but I am not exactly sure how those would be used to write the .arrow file, or how to name it.
I can’t seem to find the exact information I am looking for in the pyarrow docs. My file will not have any NaNs, but it will have a timestamped index. The file is ~100 GB, so loading it into memory simply isn’t an option. I tried changing the code, but, as I assumed, it ended up overwriting the previous file on every loop.

***This is my first post. I would like to thank all the contributors who answered 99.9% of my other questions before I had even asked them.

import sys

import pandas as pd
import pyarrow as pa

SPLIT_ROWS = 1     ### used one line chunks for a small test

def main():
    writer = None
    for split in pd.read_csv(sys.stdin.buffer, chunksize=SPLIT_ROWS):

        table = pa.Table.from_pandas(split)
        # Write out to file
        with pa.OSFile('test.arrow', 'wb') as sink:     ### no append mode yet; reopening in 'wb' here overwrites the file on every chunk
            with pa.RecordBatchFileWriter(sink, table.schema) as writer:
                writer.write_table(table)
    writer.close()

if __name__ == "__main__":
    main()

Below is the command I used on the command line:

>cat data.csv | python test.py
Asked By: kasbah512


Answers:

As suggested by @Pace, you should consider moving the output file creation outside of the reading loop. Something like this:

import sys

import pandas as pd
import pyarrow as pa

SPLIT_ROWS = 1     ### used one line chunks for a small test

def main():
    # The writer needs the schema up front, so peek at the first chunk
    chunks = pd.read_csv('data.csv', chunksize=SPLIT_ROWS)
    first_chunk = next(chunks)
    schema = pa.Table.from_pandas(first_chunk).schema

    # Write out to file, opened once outside the reading loop
    with pa.OSFile('test.arrow', 'wb') as sink:     ### no append mode yet
        with pa.RecordBatchFileWriter(sink, schema) as writer:
            writer.write_table(pa.Table.from_pandas(first_chunk))
            for split in chunks:
                table = pa.Table.from_pandas(split)
                writer.write_table(table)

if __name__ == "__main__":
    main()        

You also don’t have to use sys.stdin.buffer if you would rather name the input and output files explicitly. You could then just run the script as:

python test.py

By using with statements, both writer and sink will be automatically closed afterwards (in this case when main() returns). This means it should not be necessary to include an explicit close() call.
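
If you do want to keep the pipe-based workflow from the linked answer, a rough sketch of that variant might look like the following (untested; it peeks at the first chunk to learn the schema, writes the Arrow IPC stream format via RecordBatchStreamWriter because a pipe is not seekable, and the output file then gets its name purely from the shell redirection):

import sys

import pandas as pd
import pyarrow as pa

SPLIT_ROWS = 1_000_000

def main():
    chunks = pd.read_csv(sys.stdin.buffer, chunksize=SPLIT_ROWS)
    first = next(chunks)                        # peek at one chunk to learn the schema
    schema = pa.Table.from_pandas(first).schema

    # Stream format rather than file format, since stdout cannot seek;
    # the output name is decided entirely by the shell redirection below.
    with pa.RecordBatchStreamWriter(sys.stdout.buffer, schema) as writer:
        writer.write_table(pa.Table.from_pandas(first))
        for split in chunks:
            writer.write_table(pa.Table.from_pandas(split))

if __name__ == "__main__":
    main()

which you would run as:

cat data.csv | python test.py > test.arrow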

Answered By: Martin Evans

Solution adapted from @Martin-Evans’ code, with the file explicitly closed after the for loop as suggested by @Pace:

import pandas as pd
import pyarrow as pa

SPLIT_ROWS = 1000000

def main():
    schema = pa.Table.from_pandas(pd.read_csv('Data.csv', nrows=2)).schema
    ### reads only the first two rows, just to infer the schema

    with pa.OSFile('test.arrow', 'wb') as sink:
        with pa.RecordBatchFileWriter(sink, schema) as writer:            
            for split in pd.read_csv('Data.csv',chunksize=SPLIT_ROWS):
                table = pa.Table.from_pandas(split)
                writer.write_table(table)

            writer.close()

if __name__ == "__main__":
    main()   
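
To spot-check the result without pulling it back into memory, the finished .arrow file can be memory-mapped and read one record batch at a time; a quick sketch using pyarrow’s IPC file reader (the 'test.arrow' name is just the file written above):

import pyarrow as pa

# Memory-map the Arrow file so nothing is read into RAM up front
with pa.memory_map('test.arrow') as source:
    reader = pa.ipc.open_file(source)
    print(reader.num_record_batches)     # one batch per CSV chunk written
    print(reader.schema)
    first_batch = reader.get_batch(0)    # only this batch is materialized
    print(first_batch.num_rows)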
Answered By: kasbah512

In 2023 you don’t need pandas for this. You can chunk through the CSV with pyarrow itself:

import pyarrow as pa
from pyarrow import csv

schema = pa.schema([
    ('time', pa.timestamp('ms', None)),
    ('deviceid', pa.utf8())
])
convert_dict = {
    'time': pa.timestamp('ms', None),
    'deviceid': pa.utf8()
}
convert_options = pa.csv.ConvertOptions(
    column_types=convert_dict,
    strings_can_be_null=True,
    quoted_strings_can_be_null=True,
    timestamp_parsers=["%Y-%m-%d %H:%M:%S"],
)

arrowfile = "data_dst.arrow"
csvfile = "data_src.csv"

with pa.OSFile(arrowfile, 'wb') as sink:     ### no append mode yet
    with pa.csv.open_csv(csvfile, convert_options=convert_options) as reader:
        with pa.RecordBatchFileWriter(sink, schema) as writer:
            for next_chunk in reader:               # each chunk is a RecordBatch
                next_table = pa.Table.from_batches([next_chunk])
                writer.write_table(next_table)
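
The size of the chunks that open_csv hands back is governed by ReadOptions.block_size (about 1 MB of CSV text per batch by default), so if you want fewer, larger record batches you can pass read options as well. A sketch of that variant, reusing the convert_options above and a purely illustrative 64 MB block size:

import pyarrow as pa
from pyarrow import csv

arrowfile = "data_dst.arrow"
csvfile = "data_src.csv"

# block_size = bytes of CSV text processed per record batch (illustrative value)
read_options = csv.ReadOptions(block_size=64 * 1024 * 1024)

with pa.OSFile(arrowfile, 'wb') as sink:
    with csv.open_csv(csvfile, read_options=read_options,
                      convert_options=convert_options) as reader:
        # reader.schema already reflects the column_types in convert_options
        with pa.RecordBatchFileWriter(sink, reader.schema) as writer:
            for batch in reader:
                writer.write_batch(batch)    # no intermediate Table needed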
Answered By: martin