Summing numbers in two different .txt files in Python

Question:

I am currently trying to sum two .txt files, each containing over 35 million values, and put the result in a third file.

File 1 :

2694.28
2694.62
2694.84
2695.17

File 2 :

1.483429484776452
2.2403221757269196
1.101004844694236
1.6119626937837102

File 3 :

2695.76343
2696.86032
2695.941
2696.78196

Any idea how to do that with Python?

Asked By: Nicolas Guibal

Answers:

You can use NumPy for speed. It will be much faster than pure Python, because NumPy uses C/C++ for a lot of its operations.

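import os

import numpy

# Directory containing this script; the input files are expected next to it.
path = os.path.dirname(os.path.realpath(__file__))

file_name_1 = os.path.join(path, 'values_1.txt')
file_name_2 = os.path.join(path, 'values_2.txt')

# Load each file as a 1-D float array, then add element-wise.
a = numpy.loadtxt(file_name_1, dtype=float)
b = numpy.loadtxt(file_name_2, dtype=float)
c = a + b

# Number of decimal places written to the output file.
precision = 10
numpy.savetxt(os.path.join(path, 'sum.txt'), c, fmt=f'%-.{precision}f')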

This assumes your .txt files are in the same directory as your Python script.
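
As a rough check on whether loading everything at once is feasible, here is a back-of-the-envelope estimate. It assumes the values are stored as 64-bit floats (8 bytes each) and counts only the raw array data, not the temporary overhead of parsing the text files. The helper name array_megabytes is made up for this illustration.

```python
# Rough memory estimate for the load-everything approach (assumes float64).
BYTES_PER_FLOAT64 = 8      # size of one 64-bit float
N_VALUES = 35_000_000      # value count stated in the question


def array_megabytes(n_values, n_arrays=1):
    """Return the raw data size in MB (10**6 bytes) for n_arrays float64 arrays."""
    return n_values * BYTES_PER_FLOAT64 * n_arrays / 1e6


print(array_megabytes(N_VALUES))      # one array: 280.0
print(array_megabytes(N_VALUES, 3))   # a, b and c alive at once: 840.0
```

So the three arrays together need on the order of 840 MB, which may or may not be acceptable depending on the machine.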

Answered By: alvrm

You can use pandas.read_csv to read, sum, and then write your files chunk by chunk.
All 35 million records may not fit in memory, so you need to read the files in chunks. That way only one chunk per input file is in memory at a time (two in total, one for file1 and one for file2): you sum them, then append the result to file3 on disk, one chunk at a time.

In this dummy example I set chunksize=2 because I tested on your sample inputs, which are only 4 lines long. The best value depends on the machine you are working on, so run some tests and see which size suits your problem (50k, 100k, 500k, 1M, etc.).

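import pandas as pd

chunksize = 2  # rows per chunk; use a much larger value for real data

# Note: mode='a' appends, so remove any existing file3.txt before running.
with pd.read_csv("file1.txt", chunksize=chunksize, header=None) as reader1, pd.read_csv("file2.txt", chunksize=chunksize, header=None) as reader2:
    # zip pairs up the chunks from both readers in order.
    for chunk1, chunk2 in zip(reader1, reader2):
        (chunk1 + chunk2).to_csv("file3.txt", index=False, header=False, mode='a')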
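
For comparison, the same idea of never holding a whole file in memory can be sketched with the standard library alone. This is only a minimal sketch, not a tuned implementation: the function name sum_files and its precision parameter are hypothetical, and it assumes both files have exactly one number per line with no blank lines.

```python
def sum_files(path_a, path_b, path_out, precision=5):
    """Add two one-number-per-line files line by line and write the sums."""
    with open(path_a) as fa, open(path_b) as fb, open(path_out, "w") as out:
        # Iterating over file objects reads one line at a time, so memory
        # use stays small regardless of file size. zip stops at the end of
        # the shorter file; a blank line would make float() raise ValueError.
        for line_a, line_b in zip(fa, fb):
            total = float(line_a) + float(line_b)
            out.write(f"{total:.{precision}f}\n")
```

It would be called as sum_files("file1.txt", "file2.txt", "file3.txt"). It will likely be slower than the pandas version, but it needs no third-party packages, and opening the output with "w" means a re-run overwrites the result instead of appending to it.
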
Answered By: Massifox