Piviot dataframe with duplicate indexes without applying aggfunction

Question:

I have data that is in long form. It looks like this

Analysis Mode  SID    Run ID  Accession ID   Analyte  Concentration 
QC              -     111522   QC L             Be        0.01
QC              -     111522   QC M             Be        0.04 
SAMPLE        EM123  111522   E2214172001       Be        1.20
SAMPLE        EM124  111522   E2214172002       Be        1.40
............
QC              -     111522   QC L             Be        0.02
QC              -     111522   QC M             Be        0.05 
............
QC              -     111522   QC L             V         0.1
QC              -     111522   QC M             V         0.2
SAMPLE        EM123  111522   E2214172001       V         1.0
SAMPLE        EM124  111522   E2214172002       V         1.3
............
QC              -     111522   QC L             V         0.1
QC              -     111522   QC M             V         0.3

In total the df is size (12879,8). There are 23 different analytes. I want to pivot the data so that the analytes are columns and the values are concentration

columns= [Analysis Mode , SID   , Run ID  ,Accession ID,   Be,  V]

This issue is there are duplicate IDs- In the example above QC L was analyzed for Be in run 111522 twice. I need to retain these numbers exactly meaning I do not want to average or do anything to them. It needs to be its own row

The output df I want would look something like this

 Analysis Mode  SID    Run ID  Accession ID  Be     V    
QC              -     111522   QC L           0.01   0.1       
QC              -     111522   QC M           0.04   0.2        
SAMPLE        EM123  111522   E2214172001     1.20   1.0      
SAMPLE        EM124  111522   E2214172002     1.40   1.3         
............
QC              -     111522   QC L           0.02   0.1        
QC              -     111522   QC M           0.05   0.3          
                    

I tried using pivot and piviot table but I keep getting duplicate index errors. I also trying using groupby and cumsum but it also gives me duplicate index errors

Asked By: Gingerhaze

||

Answers:

Input df:

   AnalysisMode    SID   RunID  AccessionID Analyte  Concentration
0            QC      -  111522          QCL      Be           0.01
1            QC      -  111522          QCM      Be           0.04
2        SAMPLE  EM123  111522  E2214172001      Be           1.20
3        SAMPLE  EM124  111522  E2214172002      Be           1.40
4            QC      -  111522          QCL      Be           0.02
5            QC      -  111522          QCM      Be           0.05
6            QC      -  111522          QCL       V           0.10
7            QC      -  111522          QCM       V           0.20
8        SAMPLE  EM123  111522  E2214172001       V           1.00
9        SAMPLE  EM124  111522  E2214172002       V           1.30
10           QC      -  111522          QCL       V           0.10
11           QC      -  111522          QCM       V           0.30

The following worked for me, but I feel like there should be a better way (I couldn’t figure out how though 🙂 ):

import pandas as pd

df = pd.read_clipboard() # Put your df here

# Note: I edited the columns and values a bit to not have spaces
columns = ["AnalysisMode", "SID", "RunID", "AccessionID"]
dupes = df.groupby(columns + ["Analyte"]).cumcount()
out = df.pivot_table(values="Concentration", columns="Analyte", index=columns+[dupes]).droplevel(level=4).reset_index()

output df:

Analyte AnalysisMode    SID   RunID  AccessionID    Be    V
0                 QC      -  111522          QCL  0.01  0.1
1                 QC      -  111522          QCL  0.02  0.1
2                 QC      -  111522          QCM  0.04  0.2
3                 QC      -  111522          QCM  0.05  0.3
4             SAMPLE  EM123  111522  E2214172001  1.20  1.0
5             SAMPLE  EM124  111522  E2214172002  1.40  1.3
Answered By: Chrysophylaxs
Categories: questions Tags: ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.