Pandas to Pyspark conversion (repeat/explode)

Question:

I’m trying to take a notebook that I’ve written in Python/Pandas and modify/convert it to use Pyspark. The dataset I’m working with is (as real world datasets often are) complete and utter garbage, and so some of the things I have to do to it are potentially a little non-standard as far as built-in Pyspark functions are concerned.

So, the part of the conversion I’m getting hung up on is this (this is what I’ve got in Pandas):

# Unstack all the columns individually

exploded = [
    df[['User_Name', 'cert_len']].loc[df.index.repeat(df['cert_len'])].reset_index(drop=True)['User_Name'],
    df['Certification'].str.split(',').explode().reset_index(drop=True),
    df['Provider'].str.split(',').explode().reset_index(drop=True),
    df['Credential_ID'].str.split(',').explode().reset_index(drop=True),
    …
]

# Concat unstacked columns back together

df_final = pd.concat(exploded, axis=1)

‘User_Name’ values are actually numbers, e.g. 105432. The ‘cert_len’ values are a count of the items in the ‘Certification’ column. Values in the remaining columns are concatenated strings, joined by commas. For instance, if ‘cert_len’ was 5, the value in ‘Certification’ would be something like ‘Certified Scrum Master,AWS Cloud Practitioner,TensorFlow Developer,CompTIA Security+,AWS Developer’. Etc. That is to say, each row has a single ‘User_Name’ value and then each subsequent column value contains all the info about that user’s certs separated by commas. The desired end format is one row per certification.
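The repeat/explode pattern from the pandas code above behaves like this on a toy frame matching that layout (a minimal, self-contained sketch; the column names follow the question but the sample values are made up):

```python
import pandas as pd

# Toy data mirroring the question's layout: one row per user,
# comma-joined cert fields, and cert_len giving the item count.
df = pd.DataFrame({
    'User_Name': [105432, 105433],
    'cert_len': [2, 3],
    'Certification': ['A,B', 'C,D,E'],
    'Provider': ['P,Q', 'R,S,T'],
})

exploded = [
    # Repeat each User_Name row cert_len times, then keep just that column
    df[['User_Name', 'cert_len']]
        .loc[df.index.repeat(df['cert_len'])]
        .reset_index(drop=True)['User_Name'],
    # Split the comma-joined strings and unstack them into rows
    df['Certification'].str.split(',').explode().reset_index(drop=True),
    df['Provider'].str.split(',').explode().reset_index(drop=True),
]

# Concat the unstacked columns back together: five rows, one per cert
df_final = pd.concat(exploded, axis=1)
```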

So my specific issue: you’ll notice in the first line of the ‘exploded’ list that I’m exploding the ‘User_Name’ column slightly differently than the rest of the columns. What I’m doing there is taking the value in the User_Name column and repeating it as new rows so that the number of rows is equal to the number in the ‘cert_len’ column. Then when all the other columns are exploded everything matches up. Hope that makes sense.

The only working solution I’ve been able to come up with involves a UDF, which I presume is the sort of thing you’d want to avoid in Spark, since the whole point is big data and a UDF would run row by row? That strategy actually involved modifying the value in the User_Name column so that it matched the format of the other columns (e.g. if User_Name was 105432 and cert_len was 5, ‘User_Name’ becomes 105432,105432,105432,105432,105432) and then exploding it like the rest. Exploding all the other columns isn’t giving me trouble, just the ‘User_Name’ column.

Basically, what I’m wondering is if there’s a way to do that without a UDF or if there’s some other strategy worth pursuing that anyone can think of that would accomplish the same as all the above. Please of course ask for clarification if I’ve been vague or left anything out. Thanks a bunch!

Asked By: snakeeyes021


Answers:

Setup

df.show()

+---------+--------+-------------+--------+
|User_Name|cert_len|Certification|Provider|
+---------+--------+-------------+--------+
|   105432|       2|          A,B|     P,Q|
|   105433|       3|        C,D,E|   R,S,T|
|   105434|       1|            F|       U|
+---------+--------+-------------+--------+

Pyspark Solution

from pyspark.sql import functions as F

# Define the id columns
ids = ['User_Name', 'cert_len']

# Define the columns which you want to split and explode
cols = ['Certification', 'Provider']

# Like in pandas, we split the strings
arr = [F.split(c, ',').alias(c) for c in cols]

# Zip the split strings in each row and explode
df1 = df.select(*ids, F.explode(F.arrays_zip(*arr)).alias('temp'))

# For each column in cols, extract the corresponding field from the
# temp struct column and select it alongside the id columns
df1 = df1.select(*ids, *[F.col('temp')[c].alias(c) for c in cols])
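
Conceptually, arrays_zip + explode emits one output row per zipped element, while the id columns selected alongside it are carried along unchanged — which is exactly the "repeat ‘User_Name’ cert_len times" behaviour the question implements by hand, with no UDF needed. A plain-Python simulation of those semantics (illustration only, not Spark code):

```python
def split_and_explode(row, ids, cols):
    """Simulate explode(arrays_zip(...)) for one input row: split each
    string column on ',' and emit one output row per zipped element,
    carrying the id columns along unchanged."""
    arrays = [row[c].split(',') for c in cols]
    for zipped in zip(*arrays):
        out = {k: row[k] for k in ids}          # ids repeat automatically
        out.update(dict(zip(cols, zipped)))     # one element per column
        yield out

row = {'User_Name': 105432, 'cert_len': 2,
       'Certification': 'A,B', 'Provider': 'P,Q'}
rows = list(split_and_explode(row,
                              ['User_Name', 'cert_len'],
                              ['Certification', 'Provider']))
```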

Result

df1.show()

+---------+--------+-------------+--------+
|User_Name|cert_len|Certification|Provider|
+---------+--------+-------------+--------+
|   105432|       2|            A|       P|
|   105432|       2|            B|       Q|
|   105433|       3|            C|       R|
|   105433|       3|            D|       S|
|   105433|       3|            E|       T|
|   105434|       1|            F|       U|
+---------+--------+-------------+--------+
Answered By: Shubham Sharma
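
One edge case worth sanity-checking against ‘cert_len’ on a messy dataset: per the Spark SQL documentation, arrays_zip pads shorter arrays with nulls rather than failing, so a dirty row whose comma-joined columns have mismatched item counts produces null cells instead of an error. The padding behaviour matches Python’s itertools.zip_longest (a plain-Python illustration, not Spark code):

```python
from itertools import zip_longest

# A dirty row where Provider has one fewer item than Certification;
# fillvalue=None plays the role of Spark's null padding.
certs = 'A,B,C'.split(',')
providers = 'P,Q'.split(',')

zipped = list(zip_longest(certs, providers, fillvalue=None))
```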