pandas, transforming dataframe & combining duplicates with counts

Question:

Current dataframe is as follows:

df = pd.read_csv('filename.csv', delimiter=',')

print(df)
idx   uniqueID      String        CaseType 

0        1           'hello1'      1.0
1        1           'hello2'      1.0
2        1           'goodbye1'    1.0
3        1           'goodbye2'    1.0
4        2           'hello1'      3.0
5        2           'hello2'      3.0
6        2           'hello3'      3.0
7        3           'goodbye1'    1.0
8        3           'goodbye2'    1.0
9        3           'goodbye3'    1.0
10       4           'hello'       2.0
11       4           'goodbye'     2.0

Expected Output:
(Please note: rows are grouped by uniqueID, and the CaseType follows the last String of each uniqueID group.)

idx   Source      Destination 

0        'hello1'     'hello2'
1        'hello2'     'goodbye1'
3        'goodbye1'   'goodbye2'
4        'goodbye2'   '1.0'
6        'hello1'     'hello2'
7        'hello2'     'hello3' 
8        'hello3'     '3.0'
10       'goodbye1'   'goodbye2'
11       'goodbye2'   'goodbye3'
12       'goodbye3'   '1.0'
13       'hello'      'goodbye'
14       'goodbye'    '2.0'

Question: How do I transform the pandas dataframe in this way?

Currently, I am iterating through every row in a for loop; for each uniqueID I add each String (plus the CaseType at the end) to a list, then split that list up and add it to a new dataframe. It is incredibly slow.

Following this, the next step is to get the total counts/occurrences for each row of the output. Essentially, if there are duplicate Source/Destination rows (i.e., if we have 3 rows of 'hello' 'goodbye', the result would be 1 row of 'hello' 'goodbye' with their count as the 3rd column).

Example:

Original: 
idx   Source      Destination  

0     'hello1'     'hello2'   
1     'hello2'     'hello3' 
2     'hello1'     'hello2' 
3     'hello4'     'goodbye'  

Expected Output:

idx   Source      Destination  Count

0     'hello1'     'hello2'    2
1     'hello2'     'hello3'    1
2     'hello4'     'goodbye'   1

I presume the first step is slightly more complex with pandas logic, and the next step is essentially just combining duplicates and getting their count, but I am new to pandas and not entirely sure how to do either. Thank you in advance.
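For context, the loop-based approach the question describes can be sketched roughly like this (a minimal illustration of the slow method, using the column names from the question; not the recommended solution):

```python
import pandas as pd

# Sample data matching the question's dataframe.
df = pd.DataFrame({
    'uniqueID': [1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4],
    'String': ['hello1', 'hello2', 'goodbye1', 'goodbye2',
               'hello1', 'hello2', 'hello3',
               'goodbye1', 'goodbye2', 'goodbye3',
               'hello', 'goodbye'],
    'CaseType': [1.0, 1.0, 1.0, 1.0, 3.0, 3.0, 3.0, 1.0, 1.0, 1.0, 2.0, 2.0],
})

# Loop-based approach described in the question: for each uniqueID, collect
# the strings plus the CaseType at the end, then emit consecutive pairs.
rows = []
for _, group in df.groupby('uniqueID'):
    items = group['String'].tolist() + [str(group['CaseType'].iloc[0])]
    for src, dst in zip(items, items[1:]):
        rows.append({'Source': src, 'Destination': dst})
result = pd.DataFrame(rows)
```

The answers below avoid the Python-level loop entirely by shifting within groups.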

Asked By: Ev0


Answers:

df = pd.DataFrame({'idx': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11], 'uniqueID': [1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4],
                   'String': ["'hello1'", "'hello2'", "'goodbye1'", "'goodbye2'", "'hello1'", "'hello2'", "'hello3'",
                              "'goodbye1'", "'goodbye2'", "'goodbye3'", "'hello'", "'goodbye'"],
                   'CaseType': [1.0, 1.0, 1.0, 1.0, 3.0, 3.0, 3.0, 1.0, 1.0, 1.0, 2.0, 2.0]}).set_index('idx')

df = df.groupby('uniqueID').apply(
    lambda x: pd.DataFrame({'Source': x.String,
                            'Destination': x.String.shift(-1, fill_value=f"'{x.CaseType.iloc[0]}'")}))
print(df)
         Source Destination
idx                        
0      'hello1'    'hello2'
1      'hello2'  'goodbye1'
2    'goodbye1'  'goodbye2'
3    'goodbye2'       '1.0'
4      'hello1'    'hello2'
5      'hello2'    'hello3'
6      'hello3'       '3.0'
7    'goodbye1'  'goodbye2'
8    'goodbye2'  'goodbye3'
9    'goodbye3'       '1.0'
10      'hello'   'goodbye'
11    'goodbye'       '2.0'
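A small caveat (an editorial note, not part of the original answer): on newer pandas versions (2.0+), groupby(...).apply prepends the group keys to the result's index by default, producing a MultiIndex; passing group_keys=False reproduces the flat idx index shown above. A self-contained sketch with toy data:

```python
import pandas as pd

# Toy frame in the same shape as the question's data.
df = pd.DataFrame({'uniqueID': [1, 1, 2],
                   'String': ["'a'", "'b'", "'c'"],
                   'CaseType': [1.0, 1.0, 3.0]})

# group_keys=False keeps each group's own index instead of
# prepending uniqueID as an outer index level.
out = df.groupby('uniqueID', group_keys=False).apply(
    lambda x: pd.DataFrame({'Source': x.String,
                            'Destination': x.String.shift(-1, fill_value=f"'{x.CaseType.iloc[0]}'")}))
```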
Answered By: Алексей Р

You could try the following:

df_res = df[["String"]].rename(columns={"String": "Source"})
df_res["Destination"] = (
    df.groupby("uniqueID")["String"].transform("shift", -1)
      .fillna(df["CaseType"])
      .astype("str")
)

Result for your sample is:

      Source Destination
0     hello1      hello2
1     hello2    goodbye1
2   goodbye1    goodbye2
3   goodbye2         1.0
4     hello1      hello2
5     hello2      hello3
6     hello3         3.0
7   goodbye1    goodbye2
8   goodbye2    goodbye3
9   goodbye3         1.0
10     hello     goodbye
11   goodbye         2.0

Regarding the second part: Try

df_counts = (
    df_res.value_counts(["Source", "Destination"])
    .to_frame(name="Counts").reset_index()
)

to get

     Source Destination  Counts
0  goodbye1    goodbye2       2
1    hello1      hello2       2
2   goodbye         2.0       1
3  goodbye2         1.0       1
4  goodbye2    goodbye3       1
5  goodbye3         1.0       1
6     hello     goodbye       1
7    hello2    goodbye1       1
8    hello2      hello3       1
9    hello3         3.0       1

If you want to keep the Source and Destination columns in the index, remove the .reset_index() at the end.
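As an aside (an equivalent formulation, not from the original answer), the same counts can be computed with groupby(...).size(); the sort_values call only mirrors the descending-by-count ordering that value_counts produces:

```python
import pandas as pd

# Sample Source/Destination frame from the second part of the question.
df_res = pd.DataFrame({
    'Source': ['hello1', 'hello2', 'hello1', 'hello4'],
    'Destination': ['hello2', 'hello3', 'hello2', 'goodbye'],
})

# Group on the (Source, Destination) pair and count rows in each group.
df_counts = (
    df_res.groupby(['Source', 'Destination'])
          .size()
          .reset_index(name='Counts')
          .sort_values('Counts', ascending=False, ignore_index=True)
)
```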

Regarding the modification in the comment: You could try more or less the same, but restricted to a filtered df:

approved = ["hello1", "hello"]
m = df.groupby("uniqueID")["String"].transform("first").isin(approved)
df_res = df.loc[m, ["String"]].rename(columns={"String": "Source"})
df_res["Destination"] = (
    df[m].groupby("uniqueID")["String"].transform("shift", -1)
      .fillna(df.loc[m, "CaseType"])
      .astype("str")
)
df_res = df_res.reset_index(drop=True)

The mask m removes those groups whose first row String is not in the list of approved strings.
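To illustrate how the mask works (a toy example, not from the original answer):

```python
import pandas as pd

df = pd.DataFrame({
    'uniqueID': [1, 1, 2, 2, 3, 3],
    'String': ['hello1', 'hello2', 'goodbye1', 'goodbye2', 'hello', 'goodbye'],
})

approved = ['hello1', 'hello']
# transform('first') broadcasts each group's first String back to every
# row of that group, so isin flags all rows of groups whose first
# String is in the approved list.
m = df.groupby('uniqueID')['String'].transform('first').isin(approved)
```

Here group 2 starts with 'goodbye1', which is not approved, so both of its rows are masked out.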

Result:

     Source Destination
0    hello1      hello2
1    hello2    goodbye1
2  goodbye1    goodbye2
3  goodbye2         1.0
4    hello1      hello2
5    hello2      hello3
6    hello3         3.0
7     hello     goodbye
8   goodbye         2.0
Answered By: Timus