Efficiently editing a large input file based on a simple lookup with pandas DataFrames
Question:
I have a very large txt file (currently 6Gb, 50m rows) with a structure like this…
**id amount batch transaction sequence**
a2asd 12.6 123456 12394891237124 0
bs9dj 0.6 123456 12394891237124 1
etc...
I read the file like this…
import numpy as np
import pandas as pd

inputFileDf = pd.read_csv(filename, header=None, index_col=False, sep='\t', names=['id','amount','batch','transaction','sequence'])
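(Side note, not from the original post: at 50m rows, letting read_csv infer dtypes costs both time and memory, so declaring them up front can help. A minimal sketch; the exact dtypes here are assumptions to adjust for the real data:)

import pandas as pd

# assumed dtypes -- adjust if the ids are wider or transaction exceeds int64
dtypes = {'id': 'string', 'amount': 'float64', 'batch': 'int64',
          'transaction': 'int64', 'sequence': 'int32'}
inputFileDf = pd.read_csv(filename, header=None, index_col=False, sep='\t',
                          names=list(dtypes), dtype=dtypes)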
I also have a list that I’m generating during the app run (before loading the inputFileDf) that stores millions of rows of just the "transaction" and "sequence" columns…
runListDf = pd.DataFrame(runList, columns=['transaction','sequence-2'])
At the end of the run I want to update the input file based on the matches in the list as follows…
# merge the two input dfs (this step takes the longest)
combinedDf = pd.merge(inputFileDf, runListDf, how='left',
                      left_on=['transaction', 'sequence'],
                      right_on=['transaction', 'sequence-2'])
combinedDf['sequence-2'] = combinedDf['sequence-2'].fillna(value=-1)
# create a new isValid column based on whether there's a match between the sequence fields (0 means invalid and will be removed later)
combinedDf['isValid'] = np.where(combinedDf['sequence'] == combinedDf['sequence-2'], 0, 1)
combinedDf = combinedDf.drop('sequence-2', axis=1)
# keep only the rows that did not match
combinedDf = combinedDf.loc[combinedDf['isValid'] == 1]
combinedDf.drop('isValid', axis=1, inplace=True)
combinedDf.to_csv(filename, header=False, index=False, sep='\t')
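(Side note, not from the original post: the fillna / np.where / isValid steps collapse into a single left anti-join if merge is given indicator=True. A minimal sketch with the same column names:)

# indicator=True adds a "_merge" column marking each row 'left_only',
# 'right_only' or 'both'; 'left_only' is exactly "no match in runListDf"
combinedDf = inputFileDf.merge(runListDf, how='left',
                               left_on=['transaction', 'sequence'],
                               right_on=['transaction', 'sequence-2'],
                               indicator=True)
combinedDf = combinedDf.loc[combinedDf['_merge'] == 'left_only', inputFileDf.columns]
combinedDf.to_csv(filename, header=False, index=False, sep='\t')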
I’m running into performance issues, which I’m not sure are down to my code or just the size of the comparison I need. I’ve experimented (as suggested by ChatGPT!) with replacing my pd.merge operation with set_index + join, but the set_index step takes even longer…
# this is even less efficient in my case
inputFileDf.set_index(['transaction', 'sequence'], inplace=True)
runListDf.set_index(['transaction', 'sequence-2'], inplace=True)
runListDf.index.names = ['transaction', 'sequence']  # align the level names for the join
combinedDf = inputFileDf.join(runListDf, how='left')
Really appreciate any thoughts on whether it may be possible to perform this task much more efficiently.
Answers:
After our discussion, it seems you only want to keep the (transaction, sequence) pairs from inputFileDf that are not in runListDf. Therefore merge is not necessary, and an isin-style membership test can be a better choice:
inputFileDf = pd.read_csv(filename, header=None, index_col=False, sep='\t',
                          names=['id', 'amount', 'batch', 'transaction', 'sequence'])
runListDf = pd.DataFrame(runList, columns=['transaction', 'sequence'])

cols = ['transaction', 'sequence']
# build the set of (transaction, sequence) pairs to drop, then keep the rows
# whose pair is not among them
keys = pd.MultiIndex.from_frame(runListDf[cols])
mask = ~pd.MultiIndex.from_frame(inputFileDf[cols]).isin(keys)
outliers = inputFileDf.loc[mask]
A row is not selected for the output when its (transaction, sequence) pair appears anywhere in runListDf. (Note that DataFrame.isin is not suitable here: it aligns on index and column labels rather than testing row membership, which is why the pairs are compared through a MultiIndex.)
Output:
>>> inputFileDf
id amount batch transaction sequence
0 a2asd 12.6 123456 123 0
1 bs9dj 0.6 123456 123 2
2 c4sdd 4.3 123456 123 3
>>> runListDf # the order of the columns doesn't matter, only the column names count
sequence transaction
0 0 123
1 1 123
>>> outliers
id amount batch transaction sequence
1 bs9dj 0.6 123456 123 2
2 c4sdd 4.3 123456 123 3
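Since the file is 6Gb, it may also be worth never holding it in memory at once: the same membership test works chunk by chunk. A minimal sketch under the same assumptions; outfile is a hypothetical output path and the chunk size is an arbitrary choice (writing to a separate file avoids overwriting the input while it is still being read):

import pandas as pd

cols = ['transaction', 'sequence']
keys = pd.MultiIndex.from_frame(runListDf[cols])

reader = pd.read_csv(filename, header=None, index_col=False, sep='\t',
                     names=['id', 'amount', 'batch', 'transaction', 'sequence'],
                     chunksize=1_000_000)
with open(outfile, 'w') as out:  # outfile: hypothetical output path
    for chunk in reader:
        # keep only the rows whose (transaction, sequence) pair is not in runListDf
        mask = ~pd.MultiIndex.from_frame(chunk[cols]).isin(keys)
        chunk.loc[mask].to_csv(out, header=False, index=False, sep='\t')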