How to select range of rows in Pandas Dataframe
Question:
I have a dataframe (from a weirdly-formatted Excel file) that has sections of data in the rows. An example like this:
Name | Data1 | Data2 |
---|---|---|
First Group Header | some data | some more data |
First Group Data | some data | some more data |
First Group Data | some data | some more data |
Second Group Header | some data | some more data |
Second Group Data | some data | some more data |
Second Group Data | some data | some more data |
Second Group Data | some data | some more data |
Second Group Data | some data | some more data |
Third Group Header | some data | some more data |
Second Group Data | some data | some more data |
Question: How do I get individual dataframes for each header and its associated data rows? The main issue is that the number of rows in each section varies and can change over time; the header names, however, remain the same.
Answers:
I was able to use the following to group the rows under their headers. You can add more headers to the list if you need to, but the list does take a small amount of upkeep to stay current (unless there is more to your table than I can see that you could key on instead). This also assumes that there are no blank lines between the groups. If there are, I would simply remove them with dropna() as needed.
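import pandas as pd
import numpy as np

# df is the DataFrame read from the Excel file
header_list = ['First Group Header', 'Second Group Header', 'Third Group Header']
# Copy the name on header rows and leave NaN on every other row
df['GROUPER'] = np.where(df['Name'].isin(header_list), df['Name'], np.nan)
# Carry each header name forward over the data rows below it
df['GROUPER'] = df['GROUPER'].ffill()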
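The snippet above only labels each row with its header; it does not yet split the table. One possible continuation is a groupby on the GROUPER column. This is a minimal sketch, assuming a small stand-in for the asker's table: the sample values, the blank separator row, and the name `sections` are illustrative and not from the original post. It also uses `Series.where` in place of `np.where`, which should produce the same labels.

```python
import pandas as pd

# Hypothetical stand-in for the Excel data, with one fully blank separator row.
df = pd.DataFrame({
    "Name": ["First Group Header", "First Group Data", None,
             "Second Group Header", "Second Group Data", "Second Group Data"],
    "Data1": ["h1", "a", None, "h2", "b", "c"],
    "Data2": ["h1", "x", None, "h2", "y", "z"],
})

header_list = ["First Group Header", "Second Group Header", "Third Group Header"]

# Drop rows in which every column is empty (the blank lines between groups).
df = df.dropna(how="all")

# Keep the name on header rows only, then carry it forward over the data rows.
df["GROUPER"] = df["Name"].where(df["Name"].isin(header_list)).ffill()

# One DataFrame per header, keyed by header name; sort=False keeps file order.
sections = {
    name: part.drop(columns="GROUPER")
    for name, part in df.groupby("GROUPER", sort=False)
}

print(list(sections))
print(sections["Second Group Header"])
```

Keying the result by header name fits the question's constraint that the header names stay fixed while the row counts change: a lookup such as `sections["Second Group Header"]` keeps working when rows are added or removed.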
I suggest the following solution:
import pandas as pd

df = pd.DataFrame({
    "Name": ["1st header", "A", "B", "2nd header", "AA", "BB", "CC", "DD", "3rd header", "AAA"],
    "col1": [1, 2, 3, 10, 20, 30, 40, 50, 100, 200],
    "col2": [2, 4, 6, 20, 40, 60, 80, 100, 200, 400],
})
# Number the groups: the running count goes up by one at each header row
df["groupn"] = df["Name"].str.contains("header").cumsum()
group_dfs = [d for n, d in df.groupby("groupn")]
# group_dfs is now a list of pd.DataFrames
for g_df in group_dfs:
    print(g_df)
    print("=====")
This gives the output:
         Name  col1  col2  groupn
0  1st header     1     2       1
1           A     2     4       1
2           B     3     6       1
=====
         Name  col1  col2  groupn
3  2nd header    10    20       2
4          AA    20    40       2
5          BB    30    60       2
6          CC    40    80       2
7          DD    50   100       2
=====
         Name  col1  col2  groupn
8  3rd header   100   200       3
9         AAA   200   400       3
=====
Explanation: you need a function that gives True for headers and False for non-headers. I check whether Name contains the substring header, but another check can be used if required. I then take the cumulative sum, which gives the group number; this works because True and False are treated as 1 and 0 respectively when needed in Python. Finally, I group by that column and create a list of sub-dataframes. Note that they still have the groupn column, which you may drop if it is no longer required.
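To make the cumulative-sum step concrete, here is a small sketch that prints the boolean mask and the resulting group numbers side by side. The shortened sample data and the names `is_header` and `group_number` are illustrative assumptions. It also shows one way to avoid the helper column altogether, by passing the group numbers to groupby directly instead of storing them in the frame.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["1st header", "A", "B", "2nd header", "AA", "3rd header", "AAA"],
    "col1": [1, 2, 3, 10, 20, 100, 200],
})

# Boolean mask: True on header rows, False elsewhere.
is_header = df["Name"].str.contains("header")

# Summing booleans treats True as 1 and False as 0, so the running total
# increases by one at every header row and stays flat on data rows.
group_number = is_header.cumsum()

print(is_header.tolist())     # [True, False, False, True, False, True, False]
print(group_number.tolist())  # [1, 1, 1, 2, 2, 3, 3]

# Grouping by the Series itself means no groupn column is added to df.
group_dfs = [part for _, part in df.groupby(group_number)]
print(len(group_dfs))         # 3
```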