How can I speed up reading multiple files and putting the data into a dataframe?

Question:

I have a number of text files, say 50, that I need to read into a massive dataframe. At the moment, I am using the following steps.

  1. Read every file and check what the labels are. The information I need is often contained in the first few lines. The same labels just repeat for the rest of the file, with different types of data listed against them each time.
  2. Create a dataframe with those labels.
  3. Read the file again and fill the dataframe with values.
  4. Concatenate that dataframe with a master dataframe.

This works pretty well for files that are of the 100 KB size – a few minutes, but at 50 MB, it just takes hours, and is not practical.

How can I optimise my code? In particular –

  1. How can I identify what functions are taking the most time, which I need to optimise? Is it the reading of the file? Is it the writing to the dataframe? Where is my program spending time?
  2. Should I consider multithreading or multiprocessing?
  3. Can I improve the algorithm?
    • Perhaps read the entire file in in one go into a list, rather than line by line,
    • Parse data in chunks/entire file, rather than line by line,
    • Assign data to the dataframe in chunks/one go, rather than row by row.
  4. Is there anything else that I can do to make my code execute faster?

Here is an example code. My own code is a little more complex, as the text files are more complex such that I have to use about 10 regular expressions and multiple while loops to read the data in and allocate it to the right location in the right array. To keep the MWE simple, I haven’t used repeating labels in the input files for the MWE either, so it would like I’m reading the file twice for no reason. I hope that makes sense!

import re
import pandas as pd

df = pd.DataFrame()
paths = ["../gitignore/test1.txt", "../gitignore/test2.txt"]
reg_ex = re.compile('^(.+) (.+)n')
# read all files to determine what indices are available
for path in paths:
    file_obj = open(path, 'r')
    print file_obj.readlines()

['a 1n', 'b 2n', 'end']
['c 3n', 'd 4n', 'end']

indices = []
for path in paths:
    index = []
    with open(path, 'r') as file_obj:
        line = True
        while line:
            try:
                line = file_obj.readline()
                match = reg_ex.match(line)
                index += match.group(1)
            except AttributeError:
                pass
    indices.append(index)
# read files again and put data into a master dataframe
for path, index in zip(paths, indices):
    subset_df = pd.DataFrame(index=index, columns=["Number"])
    with open(path, 'r') as file_obj:
        line = True
        while line:
            try:
                line = file_obj.readline()
                match = reg_ex.match(line)
                subset_df.loc[[match.group(1)]] = match.group(2)
            except AttributeError:
                pass
    df = pd.concat([df, subset_df]).sort_index()
print df

  Number
a      1
b      2
c      3
d      4

My input files:

test1.txt

a 1
b 2
end

test2.txt

c 3
d 4
end
Asked By: bluprince13

||

Answers:

You can import the multiprocessing model and use a pool of worker processes to open multiple files as file objects concurrently, speeding up the loading portion of your code. To test time, either import the datetime function and use the following code:

import datetime
start=datetime.datetime.now()

#part of your code goes here

execTime1=datetime.datetime.now()
print(execTime1-start)

#the next part of your code goes here

execTime2=datetime.datetime.now()
print(execTime2-execTime1)

As far as reading each file only once, consider using another multiprocessing script to build a list of lines in each file, so you can check for a match without a file I/O operation.

Answered By: Ron Distante

First, use a profiler for your script (see this question). Analyze exactly which part is consuming more time. See if you can optimize it.

Second, I feel that the I/O operation- file reading most likely be the bottleneck. It can be optimized using concurrent approach. I would suggest read files concurrently and create data frame. Each thread can push newly created data frame to a queue. A main thread monitoring queue can pick up data frames from queue and merge it with master data frame.

Hope this helps.

Answered By: Anwar Shaikh

1 create one output template for files (like result data frame should have column A, B C)

2 read every file, transform it into output template (that was established in step 1) and save file like temp_idxx.csv, this can be done in parallel 🙂

3 concatenate these temp_idxx.csv files into one massive file and delete temps

pros of this procedure is that it can be run in parallel, and it will not eat all the memory
cons are creating output format and sticking to it, and disk space usage

Answered By: quester

Read the files directly into a pandas dataframe using using pd.read_csv. To create your subset_df. Use methods such as skipfooter to skip the lines at the end of the file you know you wont need. There are many more methods available that may replace some of the regex loop functions you are using, such as error_bad_lines and skip_blank_lines.

Then use tools provided by pandas to clean out the data that is not needed.

This will allow you to read the open and read the file only once.

Answered By: blindChicken

Before pulling out the multiprocessing hammer, your first step should be to do some profiling. Use cProfile to quickly look through to identify which functions are taking a long time. Unfortunately if your lines are all in a single function call, they’ll show up as library calls. line_profiler is better but takes a little more setup time.

NOTE. If using ipython, you can use %timeit (magic command for the timeit module) and %prun (magic command for the profile module) both to time your statements as well as functions. A google search will show some guides.

Pandas is a wonderful library, but I’ve been an occasional victim of using it poorly with atrocious results. In particular, be wary of append()/concat() operations. That might be your bottleneck but you should profile to be sure. Usually, the numpy.vstack() and numpy.hstack() operations are faster if you don’t need to perform index/column alignment. In your case it looks like you might be able to get by with Series or 1-D numpy ndarrays which can save time.

BTW, a try block in python is much slower often 10x or more than checking for an invalid condition, so be sure you absolutely need it when sticking it into a loop for every single line. This is probably the other hogger of time; I imagine you stuck the try block to check for AttributeError in case of a match.group(1) failure. I would check for a valid match first.

Even these small modifications should be enough for your program to run significantly faster before trying anything drastic like multiprocessing. Those Python libraries are awesome but bring a fresh set of challenges to deal with.

Answered By: clocker

General python considerations :

First of all about time measurement you may use such a snippet:

from time import time, sleep


class Timer(object):
    def __init__(self):
        self.last = time()


    def __call__(self):
        old = self.last
        self.last = time()
        return self.last - old

    @property
    def elapsed(self):
        return time() - self.last



timer = Timer()

sleep(2)
print timer.elapsed
print timer()
sleep(1)
print timer()

Then you could benchmark running code many times, and check for the diff.

About this, i comment inline :

with open(path, 'r') as file_obj:
    line = True
    while line: #iterate on realdines instead.
        try:
            line = file_obj.readline()
            match = reg_ex.match(line)
            index += match.group(1)
            #if match:
            #    index.extend(match.group(1)) # or extend

        except AttributeError:
            pass

You previous code wat not really pythonic, you may want to try/except.
Then try only on do in on the minimum possible lines.

The same notices apply to the second block of code.

If you need to read the same files multiple times. you could store them in RAM using StringIO or easier keep a {path: content} dict that you only read once.

Python regex are known to be slow, your data seems pretty simple, you may consider using split and strip methods on your inputlines.

 striped=[l.split() for l in [c.strip() for c in file_desc.readlines()] if l] 

I recommend you to read this : https://gist.github.com/JeffPaine/6213790 the correspondig video is here https://www.youtube.com/watch?v=OSGv2VnC0go

Answered By: cgte

I’ve used this many times as it’s a particular easy implementation of multiprocessing.

import pandas as pd
from multiprocessing import Pool

def reader(filename):
    return pd.read_excel(filename)

def main():
    pool = Pool(4) # number of cores you want to use
    file_list = [file1.xlsx, file2.xlsx, file3.xlsx, ...]
    df_list = pool.map(reader, file_list) #creates a list of the loaded df's
    df = pd.concat(df_list) # concatenates all the df's into a single df

if __name__ == '__main__':
    main()

Using this you should be able to substantially increase the speed of your program without too much work at all. If you don’t know how many processors you have, you can check by pulling up your shell and typing

echo %NUMBER_OF_PROCESSORS%

EDIT: To make this run even faster, consider changing your files to csvs and using pandas function pandas.read_csv

Answered By: Некто

Your code don’t do what you describe.

Question: 1. Read every file and check what the labels are. The information I need is often contained in the first few lines.

But you read the whole file, not only a few lines.
This result in reading the files twice!

Question: 2. Read the file again and fill the dataframe with values.

You overwrite df['a'|'b'|'c'|'d'] in the loop again and again, which is useless
I belive this is not what you want.
This works for Data given in Question, but not if you have to deal with n values.


Proposal with a different logic:

data = {}
for path in paths:
    with open(path, 'r') as file_obj:
        line = True
        while line:
            try:
                line = file_obj.readline()
                match = reg_ex.match(line)
                if match.group(1) not in data:
                    data[ match.group(1) ] = []

                data[match.group(1)].append( match.group(2) )
            except AttributeError:
                pass

print('data=%s' % data)
df = pd.DataFrame.from_dict(data, orient='index').sort_index()
df.rename(index=str, columns={0: "Number"}, inplace=True)  

Output:

data={'b': ['2'], 'a': ['1'], 'd': ['4'], 'c': ['3']}
<class 'pandas.core.frame.DataFrame'>
Index: 4 entries, a to d
Data columns (total 1 columns):
Number    4 non-null object
dtypes: object(1)
memory usage: 32.0+ bytes
  Number
a      1
b      2
c      3
d      4  

Time Table:

             Code from Q:   to_dict_from_dict
    4 values 0:00:00.033071 0:00:00.022146
 1000 values 0:00:08.267750 0:00:05.536500
10000 values 0:01:22.677500 0:00:55.365000

Tested with Python:3.4.2 – pandas:0.19.2 – re:2.2.1

Answered By: stovfl

First of all, if you are reading the file in multiple times, it seems like that would be the bottleneck. Try reading the file into 1 string object and then using cStringIO on it multiple times.

Second, you haven’t really shown any reason to build the indices before reading in all the files. Even if you do, why are you using Pandas for IO? It seems like you can build it up in regular python data structures (maybe using __slots__) and then put it in the master dataframe. If you don’t need file X index before you read file Y (as you 2nd loop seems to suggest), you just need to loop over the files once.

Third, you can either use simple split/strip on the strings to pull out space separated tokens, or if it’s more complicated (there are string quotes and such) use the CSV module from Python’s standard library. Until you show how you actually build up your data, it’s hard to suggest a fix related to that.

What you have shown so far can be done fairly quickly with the simple

for path in paths:
    data = []
    with open(path, 'r') as file_obj:
        for line in file_obj:
            try:
                d1, d2 = line.strip().split()
            except ValueError:
                pass
            data.append(d1, int(d2)))
    index, values = zip(*data)
    subset_df = pd.DataFrame({"Number": pd.Series(values, index=index)})

Here’s the difference in timings when I run on a virtual machine with the disk space not pre-allocated (the generated files are roughly 24MB in size):

import pandas as pd
from random import randint
from itertools import combinations
from posix import fsync


outfile = "indexValueInput"

for suffix in ('1', '2'):
    with open(outfile+"_" + suffix, 'w') as f:
        for i, label in enumerate(combinations([chr(i) for i in range(ord('a'), ord('z')+1)], 8)) :
            val = randint(1, 1000000)
            print >>f, "%s %d" % (''.join(label), val)
            if i > 3999999:
                break
        print >>f, "end"
        fsync(f.fileno())

def readWithPandas():
    data = []
    with open(outfile + "_2", 'r') as file_obj:
        for line in file_obj:
            try:
                d1, d2 = str.split(line.strip())
            except ValueError:
                pass
            data.append((d1, int(d2)))
    index, values = zip(*data)
    subset_df = pd.DataFrame({"Numbers": pd.Series(values, index=index)})

def readWithoutPandas():
    data = []
    with open(outfile+"_1", 'r') as file_obj:
        for line in file_obj:
            try:
                d1, d2 = str.split(line.strip())
            except ValueError:
                pass
            data.append((d1, int(d2)))
    index, values = zip(*data)

def time_func(func, *args):
    import time
    print "timing function", str(func.func_name)
    tStart = time.clock()
    func(*args)
    tEnd = time.clock()
    print "%f seconds " % (tEnd - tStart)

time_func(readWithoutPandas)
time_func(readWithPandas)

The resulting times are:

timing function readWithoutPandas
4.616853 seconds 
timing function readWithPandas
4.931765 seconds 

You can try these functions with your index buildup and see what the difference in time would be. It is almost certain that the slow down comes from multiple disk reads. And since Pandas will take no time to build up your dataframe from a dictionary, you are better off figuring out how to build up your index in pure Python before passing the data to Pandas. But do both the data read and the build up of the index in 1 disk read.

I guess one other caveat is that if you print from inside of your code, expect that to take a huge amount of time. The time it takes to write plain text to a tty dwarves the time it takes to read/write to disk.

Answered By: Dmitry Rubanovich

It turns out that creating a blank DataFrame first, searching the index to find the right place for a row of data, and then updating just that one row of the DataFrame is a stupidly time expensive process.

A much faster way of doing this is to read the contents of the input file into a primitive data structure such as a list of lists, or a list of dicts, and then converting that into a DataFrame.

Use lists when all of the data that you’re reading in are in the same columns. Otherwise, use dicts to explicitly say which column each bit of data should go to.

Update Jan 18: This is linked to How to parse complex text files using Python? I also wrote a blog article explaining how to parse complex files to beginners.

Answered By: bluprince13