How to select a range of rows in a Pandas DataFrame

Question:

I have a dataframe (from a weirdly formatted Excel file) that has sections of data in the rows, like this:

Name                 Data1      Data2
First Group Header   some data  some more data
First Group Data     some data  some more data
First Group Data     some data  some more data
Second Group Header  some data  some more data
Second Group Data    some data  some more data
Second Group Data    some data  some more data
Second Group Data    some data  some more data
Second Group Data    some data  some more data
Third Group Header   some data  some more data
Second Group Data    some data  some more data

Question: How do I get an individual dataframe for each header and its associated data rows? The main issue is that the number of rows in each section varies and can change over time; the header names, however, stay the same.

Asked By: Mark


Answers:

I was able to use the following to tag each row with the header it belongs to. You can add more headers to the list if you would like, but the list has to be kept current by hand, which is a small amount of tech debt (unless there is more to your table, which I can't see, that you could work with instead). This also assumes there are no blank rows between the groups. If there are, I would simply remove them with dropna() as needed.

import pandas as pd
import numpy as np

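# df is the dataframe loaded from the Excel file
header_list = ['First Group Header', 'Second Group Header', 'Third Group Header']
# Copy the name on header rows and leave NaN on data rows
df['GROUPER'] = np.where(df['Name'].isin(header_list), df['Name'], np.nan)
# Fill each header name down over the data rows that follow it
df['GROUPER'] = df['GROUPER'].ffill()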
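
The snippet above only labels the rows; it does not yet split them into separate dataframes. A minimal sketch of that last step follows, assuming a small made-up df shaped like the table in the question. The sample data and the name group_dfs are illustrative, and Series.where is used here in place of np.where to produce the same GROUPER column.

```python
import pandas as pd

# Hypothetical sample data shaped like the table in the question.
df = pd.DataFrame({
    "Name": ["First Group Header", "First Group Data", "First Group Data",
             "Second Group Header", "Second Group Data",
             "Third Group Header", "Second Group Data"],
    "Data1": ["some data"] * 7,
    "Data2": ["some more data"] * 7,
})

header_list = ["First Group Header", "Second Group Header", "Third Group Header"]

# Keep the name on header rows, NaN elsewhere, then carry each header downward.
df["GROUPER"] = df["Name"].where(df["Name"].isin(header_list)).ffill()

# One DataFrame per header; sort=False keeps the groups in order of appearance.
group_dfs = {
    header: part.drop(columns="GROUPER")
    for header, part in df.groupby("GROUPER", sort=False)
}
```

Each section can then be looked up by its header name, for example group_dfs["Second Group Header"]. Rows that appear before the first header would have NaN in GROUPER and are dropped by groupby by default.
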
Answered By: ArchAngelPwn

I suggest the following solution:

import pandas as pd
df = pd.DataFrame({
    "Name": ["1st header", "A", "B", "2nd header", "AA", "BB", "CC", "DD", "3rd header", "AAA"],
    "col1": [1, 2, 3, 10, 20, 30, 40, 50, 100, 200],
    "col2": [2, 4, 6, 20, 40, 60, 80, 100, 200, 400],
})
df["groupn"] = df["Name"].str.contains("header").cumsum()
group_dfs = [d for n, d in df.groupby("groupn")]
# group_dfs is now a list of pd.DataFrames
for g_df in group_dfs:
    print(g_df)
    print("=====")

This gives the output:

         Name  col1  col2  groupn
0  1st header     1     2       1
1           A     2     4       1
2           B     3     6       1
=====
         Name  col1  col2  groupn
3  2nd header    10    20       2
4          AA    20    40       2
5          BB    30    60       2
6          CC    40    80       2
7          DD    50   100       2
=====
         Name  col1  col2  groupn
8  3rd header   100   200       3
9         AAA   200   400       3
=====

Explanation: you need a check that gives True for header rows and False for all other rows. I test whether Name contains the substring header, but another check can be used if required. The cumulative sum of that boolean column then gives the group number; this works because True and False are treated as 1 and 0 in arithmetic. Finally I group by that column and build a list of sub-dataframes. Note that each of them still has the groupn column, which you can drop if it is no longer required.
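
As a variation on the same idea, here is a minimal sketch that avoids the helper column entirely and keys each section by its header name. It assumes the data starts with a header row and that every header contains the substring header; the names is_header, group_id and sections are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["1st header", "A", "B", "2nd header", "AA", "3rd header", "AAA"],
    "col1": [1, 2, 3, 10, 20, 100, 200],
})

# True on header rows; the running sum numbers the groups 1, 2, 3, ...
is_header = df["Name"].str.contains("header")
group_id = is_header.cumsum().rename("group_id")

# Group by the external Series so no helper column is added to df.
sections = {}
for _, part in df.groupby(group_id):
    header = part["Name"].iloc[0]  # first row of each group is its header
    # Keep only the data rows that follow the header.
    sections[header] = part.iloc[1:].reset_index(drop=True)
```

Here sections["1st header"] holds only rows A and B. If the header row should stay in each section, use part instead of part.iloc[1:].
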

Answered By: Daweo