Remove period then email extension after '@' into new column to extract first and last name information

Question:

I have a list of emails that are in the format [email protected]. I would like to create a new column with only the first and last name extracted from the email address.

I am using PySpark. This is an example of the desired output:

data = [{"Email": "[email protected]", "Role": "manager"},
{"Email": "[email protected]", "Role": "vp"}]

df = spark.createDataFrame(data)

type(df)

# original data set
+------------------+-------+
|Email             |Role   |
+------------------+-------+
|[email protected]|manager|
|[email protected]|vp     |
+------------------|-------+

# what I want the output to look like
+------------------+-------+--------+
|Email             |Role   |Name    |
+------------------+-------+--------+
|[email protected]|manager|john doe|
|[email protected]|vp     |jane doe|
+------------------|-------|--------+

How can I remove the period, replace it with a space, then drop everything after the @ into a new column to get the names like the example above?

Asked By: user19454773

||

Answers:

You can use Python’s .split method for strings and a loop to add a "Name" field to each record in your list.

for d in data:
    d["Name] = " ".join(d["Email"].split("@")[0].split("."))

In the above loop we split the "Email" field at the "@" character, creating a list of two elements, of which we take the first one, and then split that on the character ".", which gives us the first and last name. Then we join them with a space (" ") in between.

Answered By: Efrem Rensi

You can use regex_extract and regex_replace.

from pyspark.sql import functions as F
df = df.withColumn('Name', F.regexp_extract(
        F.regexp_replace('Email', '.', ' '), 
        '(.*)@', 
        1)
     )

First, regexp_replace('Email', '.', ' ') will replace . to space in Email column.

Then, regexp_extract(..., '(.*)@', 1) will extract the 1st capture group.

Regex explanation

(.*) => .* is any characters with any length. Wrap with () to make a capture group.
@ => match @ mark. 

(.*)@ => 1st Capture group will capture any characters before @.
Answered By: Emma

It will replace the . and @... with a space which we’ll have to trim from the end.

from pyspark.sql import functions as F

df.withColumn('Name', F.trim(F.regexp_replace('Email', '.|@.*', ' '))).show()
# +------------------+-------+--------+
# |             Email|   Role|    Name|
# +------------------+-------+--------+
# |[email protected]|manager|john doe|
# |[email protected]|     vp|jane doe|
# +------------------+-------+--------+
Answered By: ZygD