Convert dataframe to multiple dataframes based on condition
Question:
I have a file with data of multiple transaction status, refer "FILE PIC". Now I want to get different dataframes for all transaction status without total of third column.
The string in first column are fixed for all the transaction status and row below the string are column names.
By the strings, I want to differentiate the dataframes, for reference see required output below.
File data
SUCCESS TRANSACTIONS
SA SB SC
255623 145244651 55
45454544 NA 85
1300256655 3511614646 101
241
REJECTED TRANSACTIONS
RJA RJB RJC
3652656 NA 78
7884045455 45681247 97
148428 1242154661 75
250
REFUND TRANSACTIONS
RA RB RC
5546446166 NA 545
1230001556 NA 75
4548754545 4548545877 52
672
Output Required
SUCCESS TRANSACTIONS DF
SA SB SC
255623 145244651 55
45454544 NA 85
1300256655 3511614646 101
REJECTED TRANSACTIONS DF
RJA RJB RJC
3652656 NA 78
7884045455 45681247 97
148428 1242154661 75
REFUND TRANSACTIONS DF
RA RB RC
5546446166 NA 545
1230001556 NA 75
4548754545 4548545877 52
Answers:
One way to parse file into dataframes.
Code
from io import StringIO
import re
def parse_file(filenm):
'''
split file into block of rows based upon 1 or more blank rows
-re.split('ns*n', data)
convert each block into dataframe (function make_dataframe)
'''
def make_dataframe(p):
' Create dataframe from block of rows from file (not including first/last rows)'
return pd.read_csv(
StringIO('n'.join(p[1:-1])),
sep = 's+',
engine = 'python')
with open(filenm) as f:
return {p[0]:make_dataframe(p) # p[0] is section name
for s in re.split('ns*n', f.read()) # split file on 1+ blank lines
if (p:=s.split('n'))} # p has block of lines
Usage
# Parse file into dataframes
dfs = parse_files('test.txt')
# Show dataframes
dfs = parse_file('test.txt')
for k, df in dfs.items():
print(k)
print(df)
print()
Output
SUCCESS TRANSACTIONS DF
SA SB SC
0 255623 145244651 55
1 45454544 NaN 85
2 1300256655 3511614646 101
REJECTED TRANSACTIONS DF
RJA RJB RJC
0 3652656 NaN 78
1 7884045455 45681247 97
2 148428 1242154661 75
REFUND TRANSACTIONS DF
RA RB RC
0 5546446166 NaN 545
1 1230001556 NaN 75
2 4548754545 4548545877 52
Let’s prepare the data:
import pandas as pd
from io import StringIO
data = '''
SUCCESS TRANSACTIONS
SA SB SC
255623 145244651 55
45454544 NA 85
1300256655 3511614646 101
241
REJECTED TRANSACTIONS
RJA RJB RJC
3652656 NA 78
7884045455 45681247 97
148428 1242154661 75
250
REFUND TRANSACTIONS
RA RB RC
5546446166 NA 545
1230001556 NA 75
4548754545 4548545877 52
672
'''
df = pd.read_csv(StringIO(data), sep='ss+', engine='python', names=[*'abc'])
I’m not sure if this data are exactly the same as the one on your side. So be careful with the code below. Anyway, that’s what you posted, so be it.
The common way to split such construction is like this:
- extract some pattern as a marker of a starting row;
- use
cumsum
method to mark rows of a contiguous group with the same number;
- use these numbers to identify groups in
groupby
.
Here’s how it can look in code:
grouper = df['a'].str.endswith('TRANSACTIONS').cumsum()
groups = {}
for _, gr in df.groupby(grouper):
name = gr.iloc[0, 0]
groups[name] = (
gr[1:-1] # skip the very first row with a general name, and the last one with total
.T.set_index(gr.index[1]).T # use the row under the general name as a column names
.reset_index(drop=True) # drop old indexes
.rename_axis(columns='')
)
for name, group in groups.items():
display(group.style.set_caption(name))
The output:
If the data are stored in string ds
:
print('n'.join(['n'.join(ta.split('n')[:-1]) for ta in ds.split(nn)]))
prints what you have requested.
SUCCESS TRANSACTIONS
SA SB SC
255623 145244651 55
45454544 NA 85
1300256655 3511614646 101
REJECTED TRANSACTIONS
RJA RJB RJC
3652656 NA 78
7884045455 45681247 97
148428 1242154661 75
REFUND TRANSACTIONS
RA RB RC
5546446166 NA 545
1230001556 NA 75
4548754545 4548545877 52
Below the entire code printing the output above:
data_as_string = """
SUCCESS TRANSACTIONS
SA SB SC
255623 145244651 55
45454544 NA 85
1300256655 3511614646 101
241
REJECTED TRANSACTIONS
RJA RJB RJC
3652656 NA 78
7884045455 45681247 97
148428 1242154661 75
250
REFUND TRANSACTIONS
RA RB RC
5546446166 NA 545
1230001556 NA 75
4548754545 4548545877 52
672"""
ds = data_as_string
nl = 'n'
dnl = 'nn'
# with finally:
print(nl.join([nl.join(ta.split(nl)[:-1]) for ta in ds.split(dnl)]))
I have a file with data of multiple transaction status, refer "FILE PIC". Now I want to get different dataframes for all transaction status without total of third column.
The string in first column are fixed for all the transaction status and row below the string are column names.
By the strings, I want to differentiate the dataframes, for reference see required output below.
File data
SUCCESS TRANSACTIONS
SA SB SC
255623 145244651 55
45454544 NA 85
1300256655 3511614646 101
241
REJECTED TRANSACTIONS
RJA RJB RJC
3652656 NA 78
7884045455 45681247 97
148428 1242154661 75
250
REFUND TRANSACTIONS
RA RB RC
5546446166 NA 545
1230001556 NA 75
4548754545 4548545877 52
672
Output Required
SUCCESS TRANSACTIONS DF
SA SB SC
255623 145244651 55
45454544 NA 85
1300256655 3511614646 101
REJECTED TRANSACTIONS DF
RJA RJB RJC
3652656 NA 78
7884045455 45681247 97
148428 1242154661 75
REFUND TRANSACTIONS DF
RA RB RC
5546446166 NA 545
1230001556 NA 75
4548754545 4548545877 52
One way to parse file into dataframes.
Code
from io import StringIO
import re
def parse_file(filenm):
'''
split file into block of rows based upon 1 or more blank rows
-re.split('ns*n', data)
convert each block into dataframe (function make_dataframe)
'''
def make_dataframe(p):
' Create dataframe from block of rows from file (not including first/last rows)'
return pd.read_csv(
StringIO('n'.join(p[1:-1])),
sep = 's+',
engine = 'python')
with open(filenm) as f:
return {p[0]:make_dataframe(p) # p[0] is section name
for s in re.split('ns*n', f.read()) # split file on 1+ blank lines
if (p:=s.split('n'))} # p has block of lines
Usage
# Parse file into dataframes
dfs = parse_files('test.txt')
# Show dataframes
dfs = parse_file('test.txt')
for k, df in dfs.items():
print(k)
print(df)
print()
Output
SUCCESS TRANSACTIONS DF
SA SB SC
0 255623 145244651 55
1 45454544 NaN 85
2 1300256655 3511614646 101
REJECTED TRANSACTIONS DF
RJA RJB RJC
0 3652656 NaN 78
1 7884045455 45681247 97
2 148428 1242154661 75
REFUND TRANSACTIONS DF
RA RB RC
0 5546446166 NaN 545
1 1230001556 NaN 75
2 4548754545 4548545877 52
Let’s prepare the data:
import pandas as pd
from io import StringIO
data = '''
SUCCESS TRANSACTIONS
SA SB SC
255623 145244651 55
45454544 NA 85
1300256655 3511614646 101
241
REJECTED TRANSACTIONS
RJA RJB RJC
3652656 NA 78
7884045455 45681247 97
148428 1242154661 75
250
REFUND TRANSACTIONS
RA RB RC
5546446166 NA 545
1230001556 NA 75
4548754545 4548545877 52
672
'''
df = pd.read_csv(StringIO(data), sep='ss+', engine='python', names=[*'abc'])
I’m not sure if this data are exactly the same as the one on your side. So be careful with the code below. Anyway, that’s what you posted, so be it.
The common way to split such construction is like this:
- extract some pattern as a marker of a starting row;
- use
cumsum
method to mark rows of a contiguous group with the same number; - use these numbers to identify groups in
groupby
.
Here’s how it can look in code:
grouper = df['a'].str.endswith('TRANSACTIONS').cumsum()
groups = {}
for _, gr in df.groupby(grouper):
name = gr.iloc[0, 0]
groups[name] = (
gr[1:-1] # skip the very first row with a general name, and the last one with total
.T.set_index(gr.index[1]).T # use the row under the general name as a column names
.reset_index(drop=True) # drop old indexes
.rename_axis(columns='')
)
for name, group in groups.items():
display(group.style.set_caption(name))
The output:
If the data are stored in string ds
:
print('n'.join(['n'.join(ta.split('n')[:-1]) for ta in ds.split(nn)]))
prints what you have requested.
SUCCESS TRANSACTIONS
SA SB SC
255623 145244651 55
45454544 NA 85
1300256655 3511614646 101
REJECTED TRANSACTIONS
RJA RJB RJC
3652656 NA 78
7884045455 45681247 97
148428 1242154661 75
REFUND TRANSACTIONS
RA RB RC
5546446166 NA 545
1230001556 NA 75
4548754545 4548545877 52
Below the entire code printing the output above:
data_as_string = """
SUCCESS TRANSACTIONS
SA SB SC
255623 145244651 55
45454544 NA 85
1300256655 3511614646 101
241
REJECTED TRANSACTIONS
RJA RJB RJC
3652656 NA 78
7884045455 45681247 97
148428 1242154661 75
250
REFUND TRANSACTIONS
RA RB RC
5546446166 NA 545
1230001556 NA 75
4548754545 4548545877 52
672"""
ds = data_as_string
nl = 'n'
dnl = 'nn'
# with finally:
print(nl.join([nl.join(ta.split(nl)[:-1]) for ta in ds.split(dnl)]))