Removing non-ASCII and special characters in a PySpark dataframe column

Question:

I am reading data from CSV files that have about 50 columns; a few of the columns (4 to 5) contain text data with non-ASCII and special characters.

df = spark.read.csv(path, header=True, schema=availSchema)

I am trying to remove all the non-ASCII and special characters and keep only English characters, and I tried to do it as below:

df = df['textcolumn'].str.encode('ascii', 'ignore').str.decode('ascii')

There are no spaces in my column name, and I receive the following error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<command-1486957561378215> in <module>
----> 1 InvFilteredDF = InvFilteredDF['SearchResultDescription'].str.encode('ascii', 'ignore').str.decode('ascii')

TypeError: 'Column' object is not callable

Is there an alternative way to accomplish this? I'd appreciate any help.

Asked By: sab


Answers:

This should work.

First, create a temporary example dataframe:

df = spark.createDataFrame([
    (0, "This is Spark"),
    (1, "I wish Java could use case classes"),
    (2, "Data science is  cool"),
    (3, "This is aSA")
], ["id", "words"])

df.show()

Output

+---+--------------------+
| id|               words|
+---+--------------------+
|  0|       This is Spark|
|  1|I wish Java could...|
|  2|Data science is  ...|
|  3|         This is aSA|
+---+--------------------+

Now we need a UDF, because the pandas-style .str.encode(...) accessor does not exist on a PySpark Column; trying to call those methods on a Column is exactly what produces the TypeError: 'Column' object is not callable error.

Solution

from pyspark.sql.functions import udf

# Plain Python function: drop anything that cannot be encoded as ASCII
def ascii_ignore(x):
    return x.encode('ascii', 'ignore').decode('ascii')

# Wrap it as a Spark UDF (the return type defaults to StringType)
ascii_udf = udf(ascii_ignore)

df.withColumn("foo", ascii_udf('words')).show()

Output

+---+--------------------+--------------------+
| id|               words|                 foo|
+---+--------------------+--------------------+
|  0|       This is Spark|       This is Spark|
|  1|I wish Java could...|I wish Java could...|
|  2|Data science is  ...|Data science is  ...|
|  3|         This is aSA|         This is aSA|
+---+--------------------+--------------------+
Answered By: Rahul P
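
Side note: the same cleanup is possible without a Python UDF at all. Spark's built-in regexp_replace strips the characters inside the JVM, which avoids the serialization overhead of a Python UDF. A minimal sketch, assuming the same df/words example from the answer above:

from pyspark.sql.functions import regexp_replace

# Replace every character outside the printable ASCII range (0x20-0x7E)
# with the empty string; the pattern is a Java regex.
df.withColumn("foo", regexp_replace("words", r"[^\x20-\x7E]", "")).show()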

Rahul P's answer worked well for me, but it doesn't handle NULLs: encode fails when x is None. I added a small mod:

def ascii_ignore(x):
    if x:
        return x.encode('ascii', 'ignore').decode('ascii')
    else:
        return None
Answered By: SinisterPenguin
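
To see the effect of the guard, here is a quick check (the sample data below is assumed for illustration, not from the original answer):

from pyspark.sql.functions import udf

# Assumed sample data containing a NULL row
df_null = spark.createDataFrame([(0, "This is Spark"), (1, None)], ["id", "words"])

ascii_udf = udf(ascii_ignore)
df_null.withColumn("foo", ascii_udf("words")).show()
# Without the `if x` guard the None row raises an AttributeError inside
# the UDF; with it, NULL simply passes through as NULL.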

Both answers are really useful, but I couldn't help noticing that we could just use udf as a decorator and be more Pythonic:

from pyspark.sql.functions import udf

@udf
def ascii_ignore(x):
    return x.encode('ascii', 'ignore').decode('ascii') if x else None

df.withColumn("foo", ascii_ignore('words')).limit(5).show()
Answered By: Partha Mandal
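
One small addition: a bare @udf defaults to a string return type. If you prefer to make that explicit, the decorator also accepts a returnType; this variant behaves the same, just more explicitly:

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

@udf(returnType=StringType())  # explicit, though StringType is the default
def ascii_ignore(x):
    return x.encode('ascii', 'ignore').decode('ascii') if x else None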