Normalize input data frame into separate frames

Question:

I’m trying to create a python script to help me normalize my database. I’ve written up some example code.

I’m trying to check if dataArray exists in df1, under the names of the columns columArray

if it does exist, return the ID of the row (1 would return for both cases)

if it does not exist, add a new row and return the ID of the new row (3 would return for both)

In [df1, df2] I’ve got a few colums I don’t care about scanning (‘ID1′,’ID2′,’C’,’F’)

Here’s my current code.

import pandas as pd

df1=pd.DataFrame({
'ID1': ['0', '1', '2'],
'A': ['A0', 'A1', 'A2'],
'B': ['B0', 'B1', 'B2'],
'C': ['Data does', 'not matter', '']})

df2=pd.DataFrame({
'ID2': ['0', '1', '2'],
'D': ['D0', 'D1', 'D2'],
'E': ['E0', 'E1', 'E2'],
'F': ['Data does', 'not matter', '']})

df_import = pd.DataFrame({
'A': ['A1', 'A4'],
'B': ['B1', 'B4'],
'C': ['C1', 'C4'],
'D': ['D1', 'D4'],
'E': ['E1', 'E4']
})

def n1(dataFrame,pkName,columArray,dataArray):
    #if dataArray in df.columArray:
    if 1!=1:
        return('Win')
        #return index/pkName (ID# '1')
    else:
        dataFrame.loc[len(dataFrame.index)] = [len(dataFrame.index), dataArray[0], dataArray[1], ""]
        return(len(dataFrame.index)-1)

i=0
for index, row in df_import.iterrows():
    pk_test1=n1(df1,'ID1',['A','B'],[row['A'], row['B']])
    pk_test2=n1(df2,'ID2',['D','E'],[row['D'], row['E']])
    print(pk_test1)
    print(pk_test2)
    i+=1

print(df1)
print()
print(df2)
print()
print(df_import)

and the output

3
3
4
4
  ID1   A   B           C
0   0  A0  B0   Data does
1   1  A1  B1  not matter
2   2  A2  B2            
3   3  A1  B1            
4   4  A4  B4            

  ID2   D   E           F
0   0  D0  E0   Data does
1   1  D1  E1  not matter
2   2  D2  E2            
3   3  D1  E1            
4   4  D4  E4            

    A   B   C   D   E
0  A1  B1  C1  D1  E1
1  A4  B4  C4  D4  E4

I’d like the output to be

1
3
1
3
  ID1   A   B           C
0   0  A0  B0   Data does
1   1  A1  B1  not matter
2   2  A2  B2                      
3   3  A4  B4            

  ID2   D   E           F
0   0  D0  E0   Data does
1   1  D1  E1  not matter
2   2  D2  E2            
3   3  D4  E4            

    A   B   C   D   E
0  A1  B1  C1  D1  E1
1  A4  B4  C4  D4  E4

The part of this I’m particularly asking about is the commended out if line, and the return on true.

Asked By: ClaskoTheKnight

||

Answers:

I’m particularly asking about is the commended out if line

>>> df1
  ID1   A   B           C
0   0  A0  B0   Data does
1   1  A1  B1  not matter
2   2  A2  B2
>>> columArray
['A', 'B']
>>> find = ['A1','B1']                              

Use .all(1).any() for the conditional.

if (find == df1[columArray]).all(1).any():

Use .all(0) to find rows where all columns equal the criteria; use .argmax() to return the index of the first row equal to the criteria.

>>> (find == df1[columArray]).all(1).argmax()        
1
>>> 
Answered By: wwii

This function will add the dataArray to the dataFrame if not found in the subcolumns columArray, and return the value of the pkName of the dataFrame at the matching index.

def n1(dataFrame,pkName,columArray,dataArray):
    if not (dataArray == dataFrame[columArray]).all(1).any():
        dataFrame.loc[len(dataFrame.index)] = [len(dataFrame.index)] + dataArray + [""]
    return dataFrame[pkName][(dataArray == dataFrame[columArray]).all(1).argmax()]
Answered By: ClaskoTheKnight
Categories: questions Tags: , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.