Script suddenly using all RAM

Question:

I have a Python script that I am using to convert some very densely formatted csv files into another format that I need. The csv files are quite large (3GB), so I read them in chunks to avoid using all the RAM (the machine I am using has 32GB).

The odd thing is that the script processes the first file using only a few GB of memory (about 3GB, based on what top says).

I finish that file and load the next one, again in chunks. Suddenly I am using 25GB, writing to swap, and the process is killed. I'm not sure what is changing between the first and second iteration. I have put in a time.sleep(60) to try to let the garbage collector catch up, but it still goes from ~10% memory to ~85% to a killed process.

Here’s the main chunk of the script:

import pandas as pd
from time import sleep

# data_dict and csv_dict are initialized earlier in the script
for file in files:
    sleep(60)
    print(file)
    read_names = True
    count = 0
    for df in pd.read_csv(file, encoding='unicode_escape', chunksize=1e4, names=['all']):
        start_index = 0
        count += 1
        # first chunk of each file: pull the sensor names out of the header row
        if read_names:
            names = df.iloc[0,:].apply(lambda x: x.split(';')).values[0]
            names = names[1:]
            start_index = 2
            read_names = False
        for row in df.iloc[start_index:,:].iterrows():
            data = row[1]
            data_list = data['all'].split(';')
            date_time = data_list[0]
            values = data_list[1:]
            date, time = date_time.split(' ')
            dd, mm, yyyy = date.split('/')
            date = yyyy + '/' + mm + '/' + dd
            for name, value in zip(names, values):
                try:
                    data_dict[name].append([name, date, time, float(value)])
                except:
                    # silently skip sensors missing from data_dict and non-numeric values
                    pass
        # every 5 chunks, resample each sensor's buffer to 1-minute means
        # and append the results to that sensor's output csv
        if count % 5 == 0:
            for name in names:
                start_date = data_dict[name][0][1]
                start_time = data_dict[name][0][2]
                end_date = data_dict[name][-1][1]
                end_time = data_dict[name][-1][2]
                start_dt = start_date + ' ' + start_time
                end_dt = end_date + ' ' + end_time
                dt_index = pd.date_range(start=start_dt, freq='1S', periods=len(data_dict[name]))
                df = pd.DataFrame(data_dict[name], index=dt_index)
                df = df[3].resample('1T').mean().round(10)
                with open(csv_dict[name], 'a') as ff:
                    for index, value in zip(df.index, df.values):
                        date, time = str(index).split(' ')
                        to_write = f"{name}, {date}, {time}, {value}\n"
                        ff.write(to_write)

Is there something I can do to manage this better? I need to loop over 50 large files for this task.

Data format:
Input

time sensor1 sensor2 sensor3 sensor....
2022-07-01 00:00:00; 2.559;.234;0;0;0;.....
2022-07-01 00:00:01; 2.560;.331;0;0;0;.....
2022-07-01 00:00:02; 2.558;.258;0;0;0;.....

Output

sensor1, 2019-05-13, 05:58:00, 2.559 
sensor1, 2019-05-13, 05:59:00, 2.560 
sensor1, 2019-05-13, 06:00:00, 2.558 

Edit: interesting finding: the files I am writing to are suddenly not being updated; they are several minutes behind where they should be if writing were happening as expected. The data within the files is not changing either when I check their tails. Thus I assume the data is building up in the dictionary and swamping RAM, which makes sense. Now to understand why the writing isn't happening.
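
If the rows really are piling up in data_dict, one option would be to drop each sensor's buffer as soon as it has been flushed to disk. This is only a minimal sketch against the loop above; it assumes the already-written rows are not needed again and that any leftover rows at the end of a file are flushed separately:

if count % 5 == 0:
    for name in names:
        # ... build dt_index, resample, and write to csv_dict[name] as before ...
        # then drop the rows that have just been written, so the buffer
        # does not keep growing across chunks (and across files)
        data_dict[name].clear()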

Edit 2: more interesting finds!! The script runs fine on the first csv and a big chunk of the second csv before filling up the RAM and crashing. It seems the RAM problem starts with the second file, so I skipped processing that one, and magically I am now running longer than I have thus far without a memory issue. Perhaps there is corrupt data in that file that throws something off.
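
If corrupt data in the second file really is the trigger, one defensive option would be to validate each line before buffering it. A minimal sketch for the inner row loop above; it assumes a good row always carries a timestamp plus one value per sensor, and simply skips anything else:

            data_list = data['all'].split(';')
            # skip rows that don't have a timestamp plus one value per sensor,
            # rather than letting a short or garbled row into the buffers
            if len(data_list) != len(names) + 1:
                continue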

Asked By: matt


Answers:

Given file.csv that looks exactly like:

time sensor1 sensor2 sensor3 sensor4 sensor5
2022-07-01 00:00:00; 2.559;.234;0;0;0
2022-07-01 00:00:01; 2.560;.331;0;0;0
2022-07-01 00:00:02; 2.558;.258;0;0;0

You're doing a lot more work than you need to, and not using proper pandas methods will kill you on time (iterrows is basically never the best option). Basically, if you're manually looping over a DataFrame, you're probably doing it wrong.

But if you follow this pattern of using the reader returned by read_csv as a context manager, instead of treating it as a bare iterator and buffering rows yourself, you won't have the memory issues.

import pandas as pd

files = ['file.csv']
for file in files:
    with open(file) as f:
        # Grab the columns:
        cols = f.readline().split()
        # Initialize the context-manager: 
        # You'll want a larger chunksize, 1e5 should even work.
        with pd.read_csv(f, names=cols, sep=';', chunksize=1) as chunks:
            for df in chunks:
                # split the combined timestamp into separate date and time columns
                df[['date', 'time']] = df.time.str.split(expand=True)
                # reshape to long format: one row per (sensor, timestamp) reading
                df = df.melt(['date', 'time'], var_name='sensor')
                df = df[['sensor', 'date', 'time', 'value']]
                # append each sensor's rows to its own csv
                for name, group in df.groupby('sensor', as_index=False):
                    group.to_csv(f'{name}.csv', mode='a', index=False, header=False)

Output of sensor1.csv:

sensor1,2022-07-01,00:00:00,2.559
sensor1,2022-07-01,00:00:01,2.56
sensor1,2022-07-01,00:00:02,2.558

sensor2.csv

sensor2,2022-07-01,00:00:00,0.234
sensor2,2022-07-01,00:00:01,0.331
sensor2,2022-07-01,00:00:02,0.258

sensor3.csv

sensor3,2022-07-01,00:00:00,0.0
sensor3,2022-07-01,00:00:01,0.0
sensor3,2022-07-01,00:00:02,0.0

etc...
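
The original script also resampled each sensor to 1-minute means before writing. If that is still needed, the same chunked pattern can do the resampling per chunk. The sketch below is just one way to bolt it on; the column and output-file names are illustrative, and it assumes chunk boundaries falling mid-minute are acceptable, since those edge minutes would be averaged from partial data:

import pandas as pd

files = ['file.csv']
for file in files:
    with open(file) as f:
        cols = f.readline().split()
        with pd.read_csv(f, names=cols, sep=';', chunksize=100_000) as chunks:
            for df in chunks:
                # index by timestamp so resample() can aggregate per minute
                df['time'] = pd.to_datetime(df['time'])
                df = df.set_index('time').resample('1T').mean().round(10).reset_index()
                # split the timestamp back into date and clock-time columns
                df[['date', 'clock']] = df['time'].astype(str).str.split(expand=True)
                df = df.melt(['time', 'date', 'clock'], var_name='sensor')
                df = df[['sensor', 'date', 'clock', 'value']]
                for name, group in df.groupby('sensor'):
                    group.to_csv(f'{name}_1min.csv', mode='a', index=False, header=False)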

Answered By: BeRT2me