Convert dataframe to multiple dataframes based on condition

Question:

I have a file with data of multiple transaction status, refer "FILE PIC". Now I want to get different dataframes for all transaction status without total of third column.
The string in first column are fixed for all the transaction status and row below the string are column names.

By the strings, I want to differentiate the dataframes, for reference see required output below.

FILE PIC

File data

    SUCCESS TRANSACTIONS    
    SA                      SB               SC
    255623                  145244651        55
    45454544                NA               85
    1300256655              3511614646       101
                                             241

    REJECTED TRANSACTIONS   
    RJA                     RJB              RJC
    3652656                 NA               78
    7884045455              45681247         97
    148428                  1242154661       75
                                             250

    REFUND TRANSACTIONS     
    RA                      RB               RC
    5546446166              NA               545
    1230001556              NA               75
    4548754545              4548545877       52
                                             672

Output Required

SUCCESS TRANSACTIONS DF

    SA                      SB               SC
    255623                  145244651        55
    45454544                NA               85
    1300256655              3511614646       101

REJECTED TRANSACTIONS DF

    RJA                     RJB              RJC
    3652656                 NA               78
    7884045455              45681247         97
    148428                  1242154661       75

REFUND TRANSACTIONS DF

    RA                      RB               RC
    5546446166              NA               545
    1230001556              NA               75
    4548754545              4548545877       52
Asked By: Mohit

||

Answers:

One way to parse file into dataframes.

Code

from io import StringIO
import re

def parse_file(filenm):
    '''
          split file into block of rows based upon 1 or more blank rows 
             -re.split('ns*n', data)
          convert each block into dataframe (function make_dataframe)
    '''
    def make_dataframe(p):
        ' Create dataframe from block of rows from file (not including first/last rows)' 
        return pd.read_csv(
                        StringIO('n'.join(p[1:-1])),
                        sep = 's+', 
                        engine = 'python')

    with open(filenm) as f:
        return {p[0]:make_dataframe(p)                  # p[0] is section name
                for s in re.split('ns*n', f.read())  # split file on 1+ blank lines
                if (p:=s.split('n'))}                  # p has block of lines
                

Usage

# Parse file into dataframes
dfs = parse_files('test.txt')

# Show dataframes
dfs = parse_file('test.txt')
for k, df in dfs.items():
    print(k)
    print(df)
    print()
    

Output

SUCCESS TRANSACTIONS DF
           SA         SB   SC
0      255623  145244651   55
1    45454544        NaN   85
2  1300256655 3511614646  101

REJECTED TRANSACTIONS DF
          RJA        RJB  RJC
0     3652656        NaN   78
1  7884045455   45681247   97
2      148428 1242154661   75

REFUND TRANSACTIONS DF
           RA         RB   RC
0  5546446166        NaN  545
1  1230001556        NaN   75
2  4548754545 4548545877   52
Answered By: DarrylG

Let’s prepare the data:

import pandas as pd
from io import StringIO

data = '''
    SUCCESS TRANSACTIONS    
    SA                      SB               SC
    255623                  145244651        55
    45454544                NA               85
    1300256655              3511614646       101
                                             241

    REJECTED TRANSACTIONS   
    RJA                     RJB              RJC
    3652656                 NA               78
    7884045455              45681247         97
    148428                  1242154661       75
                                             250

    REFUND TRANSACTIONS     
    RA                      RB               RC
    5546446166              NA               545
    1230001556              NA               75
    4548754545              4548545877       52
                                             672
'''

df = pd.read_csv(StringIO(data), sep='ss+', engine='python', names=[*'abc'])

I’m not sure if this data are exactly the same as the one on your side. So be careful with the code below. Anyway, that’s what you posted, so be it.

The common way to split such construction is like this:

  • extract some pattern as a marker of a starting row;
  • use cumsum method to mark rows of a contiguous group with the same number;
  • use these numbers to identify groups in groupby.

Here’s how it can look in code:

grouper = df['a'].str.endswith('TRANSACTIONS').cumsum()
groups = {}
for _, gr in df.groupby(grouper):
    name = gr.iloc[0, 0]
    groups[name] = (
        gr[1:-1]     # skip the very first row with a general name, and the last one with total
        .T.set_index(gr.index[1]).T    # use the row under the general name as a column names
        .reset_index(drop=True)        # drop old indexes
        .rename_axis(columns='')
    )

for name, group in groups.items():
    display(group.style.set_caption(name))

The output:

image

Answered By: Vitalizzare

If the data are stored in string ds:

print('n'.join(['n'.join(ta.split('n')[:-1]) for ta in ds.split(nn)]))

prints what you have requested.

    SUCCESS TRANSACTIONS    
    SA                      SB               SC
    255623                  145244651        55
    45454544                NA               85
    1300256655              3511614646       101
    REJECTED TRANSACTIONS   
    RJA                     RJB              RJC
    3652656                 NA               78
    7884045455              45681247         97
    148428                  1242154661       75
    REFUND TRANSACTIONS     
    RA                      RB               RC
    5546446166              NA               545
    1230001556              NA               75
    4548754545              4548545877       52

Below the entire code printing the output above:

data_as_string = """
    SUCCESS TRANSACTIONS    
    SA                      SB               SC
    255623                  145244651        55
    45454544                NA               85
    1300256655              3511614646       101
                                             241

    REJECTED TRANSACTIONS   
    RJA                     RJB              RJC
    3652656                 NA               78
    7884045455              45681247         97
    148428                  1242154661       75
                                             250

    REFUND TRANSACTIONS     
    RA                      RB               RC
    5546446166              NA               545
    1230001556              NA               75
    4548754545              4548545877       52
                                             672"""
ds  = data_as_string
nl  = 'n'
dnl = 'nn'

# with finally:

print(nl.join([nl.join(ta.split(nl)[:-1]) for ta in ds.split(dnl)]))
Answered By: Claudio
Categories: questions Tags: , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.