PySpark – String matching to create new column
Question:
I have a dataframe like:
ID Notes
2345 Checked by John
2398 Verified by Stacy
3983 Double Checked on 2/23/17 by Marsha
Let’s say for example there are only 3 employees to check: John, Stacy, or Marsha. I’d like to make a new column like so:
ID Notes Employee
2345 Checked by John John
2398 Verified by Stacy Stacy
3983 Double Checked on 2/23/17 by Marsha Marsha
Is regex or grep better here? What kind of function should I try? Thanks!
EDIT: I’ve been trying a bunch of solutions, but nothing seems to work. Should I give up and instead create columns for each employee, with a binary value? IE:
ID Notes John Stacy Marsha
2345 Checked by John 1 0 0
2398 Verified by Stacy 0 1 0
3983 Double Checked on 2/23/17 by Marsha 0 0 1
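For reference, the indicator-column layout described in this edit is easy to build as well; a minimal PySpark sketch using when and rlike, assuming the dataframe above is loaded as df:

from pyspark.sql.functions import when, col

# one 0/1 column per employee, matched on word boundaries
for name in ['John', 'Stacy', 'Marsha']:
    df = df.withColumn(name, when(col('Notes').rlike(r'\b' + name + r'\b'), 1).otherwise(0))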
Answers:
Something like this should work:
import org.apache.spark.sql.functions._
dataFrame.withColumn("Employee", substring_index(col("Notes"), "\t", 2))
In case you want to use regex to extract the proper value, you need something like:
dataFrame.withColumn("Employee", regexp_extract(col("Notes"), "regex", <groupId>))
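Since the question is about PySpark, here is the same idea in Python; the regex and group id are left open above, so this is just one plausible way to fill them in (grab the word that follows by), assuming the frame is named df:

from pyspark.sql.functions import regexp_extract, col

# capture the first word that follows "by"
df = df.withColumn('Employee', regexp_extract(col('Notes'), r'\bby\s+(\w+)', 1))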
Brief
In its simplest form, and according to the example provided, this answer should suffice, though the OP should post more samples if there are cases where the name is preceded by a word other than by.
Code
Regex
^(\w+)[ \t]*(.*\bby[ \t]+(\w+)[ \t]*.*)$
Replacement
\1\t\2\t\3
Results
Input
2345 Checked by John
2398 Verified by Stacy
3983 Double Checked on 2/23/17 by Marsha
Output
2345 Checked by John John
2398 Verified by Stacy Stacy
3983 Double Checked on 2/23/17 by Marsha Marsha
Note: The above output separates the columns with the tab character \t, so it may not look right to the naked eye; pasting the regex and the replacement into an online regex tester will show where each column begins and ends.
Explanation
Regex
^
Assert position at the beginning of the line
(\w+)
Capture one or more word characters (a-zA-Z0-9_) into group 1
[ \t]*
Match any number of spaces or tab characters ([ \t] can be replaced with \h in some regex flavours such as PCRE)
(.*\bby[ \t]+(\w+)[ \t]*.*)
Capture the following into group 2
.*
Match any character (except newline unless the s modifier is used)
\bby
Match a word boundary \b, followed by by literally
[ \t]+
Match one or more spaces or tab characters
(\w+)
Capture one or more word characters (a-zA-Z0-9_) into group 3
[ \t]*
Match any number of spaces or tab characters
.*
Match any character any number of times
$
Assert position at the end of the line
Replacement
\1
Matches the same text as most recently matched by the 1st capturing group
\t
Tab character
\2
Matches the same text as most recently matched by the 2nd capturing group
\t
Tab character
\3
Matches the same text as most recently matched by the 3rd capturing group
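To try the pattern and replacement outside Spark, here is a minimal sketch with Python's re module; it prints the tab-separated rows shown under Results:

import re

pattern = r'^(\w+)[ \t]*(.*\bby[ \t]+(\w+)[ \t]*.*)$'
rows = ['2345 Checked by John',
        '2398 Verified by Stacy',
        '3983 Double Checked on 2/23/17 by Marsha']
for row in rows:
    # \1 = ID, \2 = full note, \3 = employee name
    print(re.sub(pattern, r'\1\t\2\t\3', row))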
In short:
regexp_extract(col('Notes'), r'(.)(by)(\s+)(\w+)', 4)
This expression extracts the employee name from any position where it appears after by and one or more spaces in the text column (col('Notes')).
In Detail:
Create a sample dataframe
data = [('2345', 'Checked by John'),
        ('2398', 'Verified by Stacy'),
        ('2328', 'Verified by Srinivas than some random text'),
        ('3983', 'Double Checked on 2/23/17 by Marsha')]
df = sc.parallelize(data).toDF(['ID', 'Notes'])
df.show()
+----+--------------------+
| ID| Notes|
+----+--------------------+
|2345| Checked by John|
|2398| Verified by Stacy|
|2328|Verified by Srini...|
|3983|Double Checked on...|
+----+--------------------+
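(Side note: sc.parallelize(data).toDF(...) goes through the RDD API; with a modern SparkSession, spark.createDataFrame(data, ['ID', 'Notes']) builds the same dataframe directly.)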
Do the needed imports
from pyspark.sql.functions import regexp_extract, col
On df, extract the Employee name from the Notes column using regexp_extract(column_name, regex, group_number).
Here the regex r'(.)(by)(\s+)(\w+)' means:
- (.) – any character (except newline)
- (by) – the word by in the text
- (\s+) – one or more spaces
- (\w+) – one or more word characters (alphanumeric or underscore)
and group_number is 4 because the group (\w+) is in the 4th position in the expression.
result = df.withColumn('Employee', regexp_extract(col('Notes'), r'(.)(by)(\s+)(\w+)', 4))
result.show()
+----+--------------------+--------+
| ID| Notes|Employee|
+----+--------------------+--------+
|2345| Checked by John| John|
|2398| Verified by Stacy| Stacy|
|2328|Verified by Srini...|Srinivas|
|3983|Double Checked on...| Marsha|
+----+--------------------+--------+
Note:
regexp_extract(col('Notes'), r'.by\s+(\w+)', 1)
is a much cleaner version of the same extraction.
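Applied to the same df, the cleaner pattern yields an identical Employee column (a quick sketch):

result = df.withColumn('Employee', regexp_extract(col('Notes'), r'.by\s+(\w+)', 1))
result.show()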
Reading the question again, the OP may be speaking of a fixed list of employees (“Let’s say for example there are only 3 employees to check: John, Stacy, or Marsha”).
If this is really a known list, then the simplest way is to check against this list of names with word boundaries:
regexp_extract(col('Notes'), r'\b(John|Stacy|Marsha)\b', 1)
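Putting that together, a short sketch reusing the df built in the previous answer (rows mentioning none of the three names get an empty string back from regexp_extract):

from pyspark.sql.functions import regexp_extract, col

employees = ['John', 'Stacy', 'Marsha']
pattern = r'\b(' + '|'.join(employees) + r')\b'  # i.e. \b(John|Stacy|Marsha)\b
result = df.withColumn('Employee', regexp_extract(col('Notes'), pattern, 1))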