Extract multiple dataframes from single .txt file

Question:

I have a file that contains several dataframes (each dataframe has its own variables) I want to extract all dataframes from this file and save them in separate files
Is there a way to do this with pandas?

df.txt

   col1,  col2,
0,     0,   NaN,
1,     1,   NaN,
2,     2,   2.0,
3,     3,   3.0,
   col3,  col4, col5,
4,     0,   1, 0,
5,     1,   NaN, 1,
   col6,  col7,  
6,     0,   NaN, 
7,     1,   3, 

Desired result :

df1.csv

   col1,  col2,
0,     0,   NaN,
1,     1,   NaN,
2,     2,   2.0,
3,     3,   3.0,

df2.csv

   col3,  col4, col5,
4,     0,   1, 0,
5,     1,   NaN, 1,

df3.csv

   col6,  col7,  
6,     0,   NaN, 
7,     1,   3, 

Asked By: Husain

||

Answers:

There’s no way to do this in pandas directly but you can easily do this if you know how to recognize the starting line of a new dataframe within the file.
If the file really looks like you posted and a new dataframe starts with some spaces (could be a tab as well) then the following code will create new files for each of the dataframes within the file.

import re


def create_csv(index, lines):
    with open(f"dfs{index}.csv", "w") as new_file:
        new_file.writelines(lines)


if __name__ == "__main__":
    with open("df.txt", "r") as f:
        lines = f.readlines()
        df_start_indexes = []
        for i, l in enumerate(lines):
            if re.match(r"s", l):
                df_start_indexes.append(i)

        print(df_start_indexes)
        for df_i, start_line in enumerate(df_start_indexes):
            end_line = df_start_indexes[df_i + 1] if len(df_start_indexes) > df_i + 1 else None
            create_csv(df_i, lines[start_line:end_line])
Answered By: Krulvis
df = pd.read_csv('df.txt', header=None)

csv_starts = df.loc[df[0].str.strip().str.startswith('col')].index.to_list()
for i,(a,b) in enumerate(zip(csv_starts, csv_starts[1:]+[None])):
    df.iloc[a:b].to_csv(f'df{i}.csv', index=False, header=False)
  1. Read the whole file as csv, but there are no commas in there, so it will be a 1-column DataFrame:

    df = pd.read_csv(‘df.txt’, header=None)

  2. Get the indices of new-file breaks – the rows with col-names in them:

    csv_starts = df.loc[df[0].str.strip().str.startswith(‘col’)].index.to_list()

  3. Build index-ranges and save them to file:

    for i,(a,b) in enumerate(zip(csv_starts, csv_starts[1:]+[None])):
    df.iloc[a:b].to_csv(f’df{i}.csv’, index=False, header=False)

Answered By: qaziqarta
Categories: questions Tags: , , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.