Dataframe Check in each columns if anything is repeating, create new column and add. Remove unnecessary " – " or " , " or anything

Question:

In the first row Zip is repeating in same so remove repeated data and in content column remove "-" and then match
If anything is repeating then add in 2nd column like zip -> zip2

Take care about EMPTY value like id 4 column Contact

id  Zip          Contact     
1   12345,12345  201-201-2012, 2012012012
2   67890        801-801-8012, 456-456-4567
3   12345,67890  7897897897
4   89898

Output

id Zip    Zip2   Contact      Contact2
1  12345         2012012012
2  67890         8018018012   4564564567
3  12345  67890  7897897897
4  89898
Asked By: parth hirpara

||

Answers:

You can do like below and it should work with any number of zip, contact values.

df_zip = (
    df["Zip"]
    .apply(lambda x: ",".join(list(OrderedDict.fromkeys(x.split(",")))))
    .str.split(",", expand=True)
)
df_zip.columns = df_zip.columns.map("zip_{}".format)

df_contact = df["Contact"].astype(str).str.replace("-", "")
df_contact = df_contact.apply(
    lambda x: ",".join(list(OrderedDict.fromkeys([e.strip() for e in x.split(",")])))
).str.split(",", expand=True)
df_contact.columns = df_contact.columns.map("Contact_{}".format)

out = pd.concat([df["id"], df_zip, df_contact], axis=1).fillna("")

print(out):

   id  zip_0  zip_1   Contact_0   Contact_1
0   1  12345         2012012012            
1   2  67890         8018018012  4564564567
2   3  12345  67890  7897897897            
3   4  89898    

You see I used OrderedDict to preserve the order of the keys

Answered By: SomeDude
Categories: questions Tags: , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.