How to apply alphabetic and numeric validation rules to database columns in PySpark?

Question:

I have a database containing an emp table with ID, NAME, YEAR, AGE, DEPT columns.
I want to print "pass" if the NAME column contains alphabetic characters only, else "fail".
Likewise, print "pass" if YEAR is in dd-mm-yyyy format, else "fail",
and "pass" if the AGE column contains integers only, else "fail".

Also, is it possible to move the whole process above into one function?

Asked By: Priti M


Answers:

For each part of your question, you can use a simple trick:

name: use a regular expression with the rlike() function.

date: cast the date string with to_date() and check whether the result is valid.

age: cast to integer and check whether the result is valid.

Note that when a cast fails, PySpark returns NULL rather than raising an error, so an isNotNull() check tells you whether the value was valid.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

schema = ['age', 'name', 'date']
data = [
    ("1", "A1", '30-12-2022'),
    ("2", "Aa", '36-11-2022'),
    ("3", "Aa", '2022-10-12'),
    ("4a", "Aa", '30-11-2022'),
    ("5", "Aa", '30-11-2022'),
]
df = spark.createDataFrame(data=data, schema=schema)
(
    df
    # keep rows whose name is letters only
    .filter(F.col('name').rlike("^[a-zA-Z]+$"))
    # keep rows whose date parses cleanly as dd-MM-yyyy
    .filter(F.to_date(F.col('date'), 'dd-MM-yyyy').isNotNull())
    # keep rows whose age casts cleanly to an integer
    .filter(F.col('age').cast('int').isNotNull())
).show()

+---+----+----------+
|age|name|      date|
+---+----+----------+
|  5|  Aa|30-11-2022|
+---+----+----------+