How to sum rows that start with the same string

Question:

I used pandas to clean a CSV file:

import pandas as pd
import numpy as np
# Read each line of the file as one string, then split it on commas.
df = pd.read_csv(r'C:\Users\Leo90\Downloads\data-export.csv', encoding='utf-8', header=None, sep='\n')
df = df[0].str.split(',', expand=True)
df = df.iloc[:, [0, 1, 2, 3, 4, 5, 6, 7]]  # keep the first eight columns
df = df.replace(to_replace='None', value=np.nan).dropna()
df = df.reset_index(drop=True)
# Promote the first row to column names.
columnNames = df.iloc[0]
df = df[1:]
df.columns = columnNames
df.groupby('path').head()

The processed data looks like the screenshot below:

[Screenshot: original data]

I want to use Python to make this DataFrame look like this:

[Screenshot: desired output]

I know I could use str.contains to match these strings, but it only returns booleans, so I can't sum the A and B columns. Is there a solution for this problem?

Asked By: Sid

||

Answers:

Use groupby.sum on the starting string:

  • If the starting string has a fixed length, use .str[:n] to slice the first n characters

    df = pd.DataFrame({'path': ['google.com/A/123', 'google.com/A/124', 'google.com/A/125', 'google.com/B/3333', 'google.com/C/11111111', 'google.com/C/11111113'], 'A': 1, 'B': 2})
    #                     path  A  B
    # 0       google.com/A/123  1  2
    # 1       google.com/A/124  1  2
    # 2       google.com/A/125  1  2
    # 3      google.com/B/3333  1  2
    # 4  google.com/C/11111111  1  2
    # 5  google.com/C/11111113  1  2
    
    start = df['path'].str[:12]  # first 12 chars of df['path']
    out = df.groupby(start)[['A', 'B']].sum()  # select the numeric columns so 'path' itself is not summed
    #               A  B
    # path              
    # google.com/A  3  6
    # google.com/B  1  2
    # google.com/C  2  4
    
  • If the starting string varies in length, use .str.extract() to capture the desired pattern (e.g., everything before the 2nd slash)

    df = pd.DataFrame({'path': ['A.com/A', 'A.com/A/B/C', 'google.com/A/123', 'google.com/A/124', 'google.com/A/125', 'google.com/B/3333', 'google.com/C/11111111', 'google.com/C/11111113'], 'A': 1, 'B': 2})
    #                     path  A  B
    # 0                A.com/A  1  2
    # 1            A.com/A/B/C  1  2
    # 2       google.com/A/123  1  2
    # 3       google.com/A/124  1  2
    # 4       google.com/A/125  1  2
    # 5      google.com/B/3333  1  2
    # 6  google.com/C/11111111  1  2
    # 7  google.com/C/11111113  1  2
    
    start = df['path'].str.extract(r'^([^/]+/[^/]+)', expand=False)  # up to 2nd slash of df['path']
    out = df.groupby(start)[['A', 'B']].sum()  # select the numeric columns so 'path' itself is not summed
    #               A  B
    # path              
    # A.com/A       2  4
    # google.com/A  3  6
    # google.com/B  1  2
    # google.com/C  2  4
    

    Breakdown of ^([^/]+/[^/]+) regex:

    • ^: beginning of string
    • (: start capturing
    • [^/]+: one or more non-slash characters
    • /: a literal slash
    • [^/]+: one or more non-slash characters
    • ): stop capturing
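
    The pattern can be checked on plain strings with Python's built-in re module before applying it to a column. This is only a minimal sketch: the helper name prefix_up_to_second_slash and the 'no-slash-here' sample are hypothetical, added for illustration.

```python
import re

# Same pattern as in the breakdown: capture everything before the 2nd slash.
PATTERN = re.compile(r'^([^/]+/[^/]+)')

def prefix_up_to_second_slash(path):
    """Return the text before the 2nd slash, or None if the pattern does not match."""
    match = PATTERN.match(path)
    return match.group(1) if match else None

# 'no-slash-here' is a made-up value that the pattern cannot match.
samples = ['A.com/A', 'A.com/A/B/C', 'google.com/A/123', 'no-slash-here']
prefixes = [prefix_up_to_second_slash(s) for s in samples]
print(prefixes)  # ['A.com/A', 'A.com/A', 'google.com/A', None]
```

    With .str.extract(), a non-matching row becomes NaN instead of None, and groupby drops NaN keys by default, so such rows would silently disappear from the sums.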
Answered By: tdy

Group by the path whilst disregarding the final subpath and aggregate (sum) the other columns.

df['path'] = df["path"].apply(lambda x: "/".join(x.split("/")[:-1]))
df.groupby("path").sum()
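
Dropping the final subpath and capturing everything before the second slash give the same groups only when every path has exactly two slashes. The sketch below compares the two on a small sample, assuming numeric columns named A and B. It uses .str.rsplit('/', n=1).str[0] as a vectorized stand-in for the lambda above, and it keeps the key in a separate Series rather than overwriting df['path'].

```python
import pandas as pd

df = pd.DataFrame({
    'path': ['A.com/A', 'A.com/A/B/C', 'google.com/A/123', 'google.com/A/124'],
    'A': 1,
    'B': 2,
})

# Drop only the last path segment (vectorized form of the lambda approach).
parent = df['path'].str.rsplit('/', n=1).str[0]

# Keep everything before the second slash (the regex approach).
prefix = df['path'].str.extract(r'^([^/]+/[^/]+)', expand=False)

by_parent = df.groupby(parent)[['A', 'B']].sum()
by_prefix = df.groupby(prefix)[['A', 'B']].sum()

print(by_parent)  # groups: A.com, A.com/A/B, google.com/A
print(by_prefix)  # groups: A.com/A, google.com/A
```

Which grouping is appropriate depends on whether paths of different depth should be merged.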
Answered By: Joynul Islam
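
One caveat when applying either answer to the data in the question: columns produced by str.split(',', expand=True) hold strings, and summing strings concatenates them instead of adding. A minimal sketch of converting before grouping follows, assuming the columns to sum are named A and B as in the answers. The sample frame is a hypothetical stand-in for the cleaned data.

```python
import pandas as pd

# Hypothetical stand-in for the cleaned data: every value is a string.
df = pd.DataFrame({
    'path': ['google.com/A/123', 'google.com/A/124', 'google.com/B/3333'],
    'A': ['1', '1', '1'],
    'B': ['2', '2', '2'],
})

value_cols = ['A', 'B']  # assumed names of the columns to sum

# Convert text to numbers; values that cannot be parsed become NaN.
df[value_cols] = df[value_cols].apply(pd.to_numeric, errors='coerce')

start = df['path'].str.extract(r'^([^/]+/[^/]+)', expand=False)
out = df.groupby(start)[value_cols].sum()
print(out)
```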