How to sum rows that start with the same string
Question:
I used pandas to clean a csv:
import pandas as pd
import numpy as np
df = pd.read_csv(r'C:UsersLeo90Downloadsdata-export.csv',encoding='utf-8', header=None, sep='n')
df = df[0].str.split(',', expand=True)
df=df.iloc[:,[0,1,2,3,4,5,6,7]]
df=df.replace(to_replace='None',value=np.nan).dropna()
df=df.reset_index(drop=True)
columnNames = df.iloc[0]
df = df[1:]
df.columns = columnNames
df.groupby('path').head()
The processed data like the screenshot below

I want to use python to make this dataframe look like this

I know I could use str.contain
to match these strings but it can only return bool, so I can’t sum the A&B columns. Are there are any solutions for this problem?
I tried str.contain
to match these strings but I can’t sum A&B.
Answers:
Use groupby.sum
on the starting string:
-
If the starting string is fixed, use .str[:n]
to slice the first n
chars
df = pd.DataFrame({'path': ['google.com/A/123', 'google.com/A/124', 'google.com/A/125', 'google.com/B/3333', 'google.com/C/11111111', 'google.com/C/11111113'], 'A': 1, 'B': 2})
# path A B
# 0 google.com/A/123 1 2
# 1 google.com/A/124 1 2
# 2 google.com/A/125 1 2
# 3 google.com/B/3333 1 2
# 4 google.com/C/11111111 1 2
# 5 google.com/C/11111113 1 2
start = df['path'].str[:12] # first 12 chars of df['path']
out = df.groupby(start).sum()
# A B
# path
# google.com/A 3 6
# google.com/B 1 2
# google.com/C 2 4
-
If the starting string is dynamic, use .str.extract()
to capture the desired pattern (e.g., up to the 2nd slash)
df = pd.DataFrame({'path': ['A.com/A', 'A.com/A/B/C', 'google.com/A/123', 'google.com/A/124', 'google.com/A/125', 'google.com/B/3333', 'google.com/C/11111111', 'google.com/C/11111113'], 'A': 1, 'B': 2})
# path A B
# 0 A.com/A 1 2
# 1 A.com/A/B/C 1 2
# 2 google.com/A/123 1 2
# 3 google.com/A/124 1 2
# 4 google.com/A/125 1 2
# 5 google.com/B/3333 1 2
# 6 google.com/C/11111111 1 2
# 7 google.com/C/11111113 1 2
start = df['path'].str.extract(r'^([^/]+/[^/]+)', expand=False) # up to 2nd slash of df['path']
out = df.groupby(start).sum()
# A B
# path
# A.com/A 2 4
# google.com/A 3 6
# google.com/B 1 2
# google.com/C 2 4
Breakdown of ^([^/]+/[^/]+)
regex:
^
: beginning of string
(
: start capturing
[^/]+
: non-slash characters
/
: slash
[^/]+
: non-slash characters
)
: stop capturing
Group by the path whilst disregarding the final subpath and aggregate (sum) the other columns.
df['path'] = df["path"].apply(lambda x: "/".join(x.split("/")[:-1]))
df.groupby("path").sum()
I used pandas to clean a csv:
import pandas as pd
import numpy as np
df = pd.read_csv(r'C:UsersLeo90Downloadsdata-export.csv',encoding='utf-8', header=None, sep='n')
df = df[0].str.split(',', expand=True)
df=df.iloc[:,[0,1,2,3,4,5,6,7]]
df=df.replace(to_replace='None',value=np.nan).dropna()
df=df.reset_index(drop=True)
columnNames = df.iloc[0]
df = df[1:]
df.columns = columnNames
df.groupby('path').head()
The processed data like the screenshot below
I want to use python to make this dataframe look like this
I know I could use str.contain
to match these strings but it can only return bool, so I can’t sum the A&B columns. Are there are any solutions for this problem?
I tried str.contain
to match these strings but I can’t sum A&B.
Use groupby.sum
on the starting string:
-
If the starting string is fixed, use
.str[:n]
to slice the firstn
charsdf = pd.DataFrame({'path': ['google.com/A/123', 'google.com/A/124', 'google.com/A/125', 'google.com/B/3333', 'google.com/C/11111111', 'google.com/C/11111113'], 'A': 1, 'B': 2}) # path A B # 0 google.com/A/123 1 2 # 1 google.com/A/124 1 2 # 2 google.com/A/125 1 2 # 3 google.com/B/3333 1 2 # 4 google.com/C/11111111 1 2 # 5 google.com/C/11111113 1 2 start = df['path'].str[:12] # first 12 chars of df['path'] out = df.groupby(start).sum() # A B # path # google.com/A 3 6 # google.com/B 1 2 # google.com/C 2 4
-
If the starting string is dynamic, use
.str.extract()
to capture the desired pattern (e.g., up to the 2nd slash)df = pd.DataFrame({'path': ['A.com/A', 'A.com/A/B/C', 'google.com/A/123', 'google.com/A/124', 'google.com/A/125', 'google.com/B/3333', 'google.com/C/11111111', 'google.com/C/11111113'], 'A': 1, 'B': 2}) # path A B # 0 A.com/A 1 2 # 1 A.com/A/B/C 1 2 # 2 google.com/A/123 1 2 # 3 google.com/A/124 1 2 # 4 google.com/A/125 1 2 # 5 google.com/B/3333 1 2 # 6 google.com/C/11111111 1 2 # 7 google.com/C/11111113 1 2 start = df['path'].str.extract(r'^([^/]+/[^/]+)', expand=False) # up to 2nd slash of df['path'] out = df.groupby(start).sum() # A B # path # A.com/A 2 4 # google.com/A 3 6 # google.com/B 1 2 # google.com/C 2 4
Breakdown of
^([^/]+/[^/]+)
regex:^
: beginning of string(
: start capturing[^/]+
: non-slash characters/
: slash[^/]+
: non-slash characters)
: stop capturing
Group by the path whilst disregarding the final subpath and aggregate (sum) the other columns.
df['path'] = df["path"].apply(lambda x: "/".join(x.split("/")[:-1]))
df.groupby("path").sum()