Computing averages of records from multiple files with Python

Question:

Dear all,
I am a beginner in Python, and I am looking for the best way to do the following. Assume I have three text files, named A, B, and C, each with m rows and n columns of numbers. Their contents can be indexed as A[i][j], B[k][l], and so on. I need to compute the average of A[0][0], B[0][0], and C[0][0] and write it to file D at D[0][0], and likewise for the remaining records. For instance, assume that:

A:  
1 2 3   
4 5 6  
B:  
0 1 3  
2 4 5  
C:  
2 5 6  
1 1 1

Therefore, file D should be

D:  
1     2.67   4    
2.33  3.33   4  

My actual files are of course larger than these, on the order of a few MB. I am unsure which approach is best: reading all the file contents into a nested structure indexed by filename, or reading the files line by line and computing the mean as I go. From the manual, the fileinput module does not seem useful here, because it reads the files serially, one after another, whereas I need to read their lines in parallel. Any guidance or advice is highly appreciated.

Asked By: iluvatar


Answers:

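Have a look at numpy. It can read the three text files into three arrays (using loadtxt), calculate the average, and export it to a text file (using savetxt).

import numpy as np

# loadtxt parses whitespace-separated text (fromfile expects raw binary by default);
# pass delimiter=',' for comma-separated files
a = np.loadtxt('A.csv')
b = np.loadtxt('B.csv')
c = np.loadtxt('C.csv')

# element-wise mean of the three arrays
d = (a + b + c) / 3.0

# write the result as text, two decimals per value
np.savetxt('D.csv', d, fmt='%.2f')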

Files of a few MB should not be a problem.
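The same idea can be sketched for any number of files. This is only a minimal sketch: the function name average_text_files and the two-decimal output format are assumptions for illustration, and it assumes whitespace-separated files that all have the same shape.

```python
import numpy as np

def average_text_files(in_paths, out_path):
    """Average same-shaped numeric text files element by element (hypothetical helper)."""
    # ndmin=2 keeps a single-row file two-dimensional
    arrays = [np.loadtxt(path, ndmin=2) for path in in_paths]
    # np.stack gives shape (n_files, m, n) and raises ValueError if shapes differ
    mean = np.mean(np.stack(arrays), axis=0)
    # Write the result as text, two decimals per value
    np.savetxt(out_path, mean, fmt="%.2f")
    return mean
```

Calling average_text_files(["A.dat", "B.dat", "C.dat"], "D.dat") on the example data from the question should write the rows "1.00 2.67 4.00" and "2.33 3.33 4.00".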

Answered By: eumiro

For text files, try these helper functions:

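def readdat(data, sep=','):
    # parse text into a list of rows, each a list of floats
    step2 = []
    for line in data.splitlines():
        if line.strip():  # skip blank lines
            step2.append([float(x) for x in line.split(sep)])
    return step2

def formatdat(data, sep=','):
    # join each row's values with sep, and the rows with newlines
    step1 = []
    for row in data:
        step1.append(sep.join(str(x) for x in row))
    return '\n'.join(step1)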

and then use these functions to format the text into lists.
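For the "read everything into a nested structure" option from the question, a parse, average, format pipeline could look roughly like this. The names parse_table, average_tables, and format_table are hypothetical, and the sketch assumes whitespace-separated input and equally shaped tables (zip silently truncates to the shortest input otherwise).

```python
def parse_table(text, sep=None):
    """Turn text into a list of rows of floats (sep=None splits on whitespace)."""
    return [[float(x) for x in line.split(sep)]
            for line in text.splitlines() if line.strip()]

def average_tables(tables):
    """Element-wise mean of equally shaped nested lists."""
    # zip(*tables) groups matching rows; zip(*rows) groups matching cells
    return [[sum(cells) / len(cells) for cells in zip(*rows)]
            for rows in zip(*tables)]

def format_table(table, sep=" "):
    """Format rows of numbers as text, two decimals per value."""
    return "\n".join(sep.join("%.2f" % x for x in row) for row in table)

# Example data from the question, given as strings instead of files
a = parse_table("1 2 3\n4 5 6\n")
b = parse_table("0 1 3\n2 4 5\n")
c = parse_table("2 5 6\n1 1 1\n")
result = format_table(average_tables([a, b, c]))
print(result)
```

With real files, the strings would come from open(...).read(), which is fine for inputs of a few MB.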

Answered By: Eric Pauley

Just for reference, here’s how you’d do the same sort of thing without numpy (less elegant, but more flexible):

files = zip(open("A.dat"), open("B.dat"), open("C.dat"))
with open("D.dat", "w") as outfile:
    for rowgrp in files:     # e.g. ("1 2 3\n", "0 1 3\n", "2 5 6\n")
        intsbyfile = [[int(a) for a in row.strip().split()] for row in rowgrp]
                             # [[1,2,3], [0,1,3], [2,5,6]]
        intgrps = zip(*intsbyfile) # [(1,0,2), (2,1,5), (3,3,6)]
        # use float() to ensure we get true division in Python 2.
        averages = [float(sum(intgrp))/len(intgrp) for intgrp in intgrps]
        outfile.write(" ".join(str(a) for a in averages) + "\n")

In Python 3, zip is lazy and reads lines from the files only as they are needed. In Python 2, zip builds the whole list in memory, so if the files are too big for that, use itertools.izip instead.
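A Python 3 variant of this line-by-line approach that also closes every file could be sketched as follows. The function name average_files and the two-decimal output format are assumptions for illustration. Note that zip stops at the shortest file, so files of different lengths are truncated silently.

```python
from contextlib import ExitStack

def average_files(in_paths, out_path):
    """Stream the input files in parallel and write element-wise means (hypothetical helper)."""
    with ExitStack() as stack:
        # ExitStack closes every file opened here, even if an error occurs
        files = [stack.enter_context(open(path)) for path in in_paths]
        out = stack.enter_context(open(out_path, "w"))
        # zip is lazy in Python 3: only one line per file is held at a time
        for lines in zip(*files):
            rows = [[float(x) for x in line.split()] for line in lines]
            # zip(*rows) groups the values that share a column
            means = [sum(col) / len(col) for col in zip(*rows)]
            out.write(" ".join("%.2f" % m for m in means) + "\n")
```

For example, average_files(["A.dat", "B.dat", "C.dat"], "D.dat") processes the three files one row at a time, so memory use does not grow with file length.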

Answered By: Thomas K