Efficiently editing a large input file based on a simple lookup with Python dataframes

Question:

I have a very large txt file (currently 6 GB, 50M rows) with a structure like this…

**id amount batch transaction sequence**
a2asd 12.6 123456 12394891237124 0
bs9dj 0.6 123456 12394891237124 1
etc...

I read the file like this…

inputFileDf = pd.read_csv(filename, header=None, index_col=False, sep='\t', names=['id', 'amount', 'batch', 'transaction', 'sequence'])

I also have a list that I’m generating during the app run (before loading the inputFileDf) that stores millions of rows of just the "transaction" and "sequence" columns…

runListDf = pd.DataFrame(runList, columns=['transaction','sequence-2'])

At the end of the run I want to update the input file based on the matches in the list as follows…

# merge the two input dfs (this step takes the longest)
combinedDf = pd.merge(inputFileDf, runListDf, how='left', left_on=['transaction', 'sequence'], right_on=['transaction', 'sequence-2'])

# rows with no match get a sentinel that can never equal a real sequence
combinedDf['sequence-2'] = combinedDf['sequence-2'].fillna(value=-1)

# create a new isValid column based on whether the sequence fields match (0 means the row matched the run list and will be removed later)
combinedDf['isValid'] = np.where(combinedDf['sequence'] == combinedDf['sequence-2'], 0, 1)

combinedDf = combinedDf.drop('sequence-2', axis=1)

# we only want to keep the rows that did NOT match the run list
combinedDf = combinedDf.loc[combinedDf['isValid'] == 1]

combinedDf.drop('isValid', axis=1, inplace=True)

combinedDf.to_csv(filename, header=None, index=None, sep='\t')
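
I believe the same filtering could also be expressed as a single anti-join using merge's indicator flag, which would avoid the fillna/compare/drop steps. A minimal sketch using the column names above (untested at full scale):

# anti-join: keep only rows whose (transaction, sequence) pair is absent from the run list
runKeys = runListDf.rename(columns={'sequence-2': 'sequence'}).drop_duplicates()
combinedDf = inputFileDf.merge(runKeys, how='left', on=['transaction', 'sequence'], indicator=True)
combinedDf = combinedDf.loc[combinedDf['_merge'] == 'left_only'].drop(columns='_merge')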

I’m running into performance issues, and I’m not sure whether they’re down to my code or simply the size of the comparisons I need. I’ve experimented (as suggested by ChatGPT!) with replacing my pd.merge operation with set_index + join, but the set_index step takes even longer…

# this is even less efficient in my case
runListDf = runListDf.rename(columns={'sequence-2': 'sequence'})  # align the column names before indexing
inputFileDf.set_index(['transaction', 'sequence'], inplace=True)
runListDf.set_index(['transaction', 'sequence'], inplace=True)
combinedDf = inputFileDf.join(runListDf, how='left')

Really appreciate any thoughts on whether it may be possible to perform this task much more efficiently.

Asked By: d3wannabe


Answers:

After our discussion, it seems you only want to keep the (transaction, sequence) pairs from inputFileDf that are not in runListDf. In that case, using merge is not necessary and isin could be a better choice:

inputFileDf = pd.read_csv(filename, header=None, index_col=False, sep='\t',
                          names=['id', 'amount', 'batch', 'transaction', 'sequence'])
runListDf = pd.DataFrame(runList, columns=['transaction', 'sequence'])

cols = ['sequence', 'transaction']
mask = ~inputFileDf[cols].isin(runListDf[cols]).all(axis=1)
outliers = inputFileDf.loc[mask]

If both transaction and sequence from the left match on the right (all(axis=1)), the row will not be selected in the output.
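
Note that DataFrame.isin with a DataFrame argument aligns on index labels as well as column names, so the comparison above is row-for-row. If the two frames' indexes don't line up like that, a label-independent pair-membership test can be built from a MultiIndex instead. A minimal sketch, assuming the same column names:

# compare (transaction, sequence) pairs regardless of row labels
keys = pd.MultiIndex.from_frame(inputFileDf[['transaction', 'sequence']])
runKeys = pd.MultiIndex.from_frame(runListDf[['transaction', 'sequence']])
outliers = inputFileDf.loc[~keys.isin(runKeys)]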

Output:

>>> inputFileDf
      id  amount   batch  transaction  sequence
0  a2asd    12.6  123456          123         0
1  bs9dj     0.6  123456          123         2
2  c4sdd     4.3  123456          123         3

>>> runListDf  # the column order doesn't matter, only the column names count
   sequence  transaction
0         0          123
1         1          123

>>> outliers
      id  amount   batch  transaction  sequence
1  bs9dj     0.6  123456          123         2
2  c4sdd     4.3  123456          123         3
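
Since the real file is around 6 GB, the same mask can also be applied chunk by chunk so the whole file never has to sit in memory at once. A rough sketch (the chunk size and output filename here are assumptions):

import pandas as pd

# build the lookup keys once from the run list
runKeys = pd.MultiIndex.from_frame(runListDf[['transaction', 'sequence']])

first = True
for chunk in pd.read_csv(filename, header=None, index_col=False, sep='\t',
                         names=['id', 'amount', 'batch', 'transaction', 'sequence'],
                         chunksize=1_000_000):
    keys = pd.MultiIndex.from_frame(chunk[['transaction', 'sequence']])
    # keep only rows whose (transaction, sequence) pair is absent from the run list
    chunk.loc[~keys.isin(runKeys)].to_csv('filtered.txt', header=False, index=False,
                                          sep='\t', mode='w' if first else 'a')
    first = False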
Answered By: Corralien