Summing numbers in two different .txt files in Python
Question:
I am currently trying to sum two .txt files, each containing over 35 million values, and put the result in a third file.
File 1 :
2694.28
2694.62
2694.84
2695.17
File 2 :
1.483429484776452
2.2403221757269196
1.101004844694236
1.6119626937837102
File 3 :
2695.76343
2696.86032
2695.941
2696.78196
Any idea how to do that with Python?
Answers:
You can use NumPy for speed. It will be much faster than pure Python, since NumPy uses C/C++ for many of its operations.
import numpy
import os

# Resolve paths relative to the directory that contains this script
path = os.path.dirname(os.path.realpath(__file__))
file_name_1 = path + '/values_1.txt'
file_name_2 = path + '/values_2.txt'

# Load each file into a 1-D float array (reads everything into memory)
a = numpy.loadtxt(file_name_1, dtype=float)
b = numpy.loadtxt(file_name_2, dtype=float)

# Vectorized element-wise sum
c = a + b

# Write one value per line with 10 digits after the decimal point
precision = 10
numpy.savetxt(path + '/sum.txt', c, fmt=f'%-.{precision}f')
This assumes your .txt files are located in the same directory as your Python script.
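Note that numpy.loadtxt reads each file into memory at once; 35 million float64 values come to roughly 280 MB per array. If that is a concern, here is a minimal pure-Python sketch, assuming the same file names as above, that streams both files line by line instead:
# Stream both inputs in lockstep and write each sum immediately,
# so only one line per file is held in memory at a time
with open(file_name_1) as f1, open(file_name_2) as f2, \
        open(path + '/sum.txt', 'w') as out:
    for line1, line2 in zip(f1, f2):
        out.write(f'{float(line1) + float(line2):.10f}\n')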
You can use pandas.read_csv to read, sum, and then write chunks of your file.
Presumably all 35 million records cannot stay in memory at once, so you need to read the files in chunks. That way you read one chunk at a time (two, actually: one from file1 and one from file2), load only those chunks into memory, do the sum, and write the result to file3 one chunk at a time in append mode.
In this dummy example I set chunksize=2, because your sample inputs are only 4 values long. The best size depends on the machine you are working on; do some tests and see what works for your problem (50k, 100k, 500k, 1M, etc.).
import pandas as pd

chunksize = 2
# Open both files as chunked readers; zip stops at the end of the shorter one
with pd.read_csv("file1.txt", chunksize=chunksize, header=None) as reader1, \
     pd.read_csv("file2.txt", chunksize=chunksize, header=None) as reader2:
    for chunk1, chunk2 in zip(reader1, reader2):
        # Element-wise sum of the two chunks, appended to the output file
        (chunk1 + chunk2).to_csv("file3.txt", index=False, header=False, mode='a')
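One caveat with mode='a': rerunning the script appends to whatever file3.txt already contains, so the output would be duplicated. A small guard before the loop, assuming the same file name, keeps the output clean:
import os

# Remove any previous output so the appends start from an empty file
if os.path.exists("file3.txt"):
    os.remove("file3.txt")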