Remove specific stopwords in PySpark

Question:

New to PySpark, I'd like to remove some French stopwords from a PySpark column.
Due to some constraints, I can't use NLTK/spaCy; StopWordsRemover is the only option I have.

Below is what I have tried so far, without success:

from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, StopWordsRemover

stop = ['EARL ', 'EIRL ', 'EURL ', 'SARL ', 'SA ', 'SAS ', 'SASU ', 'SCI ', 'SCM ', 'SCP ']
stop = [l.lower() for l in stop]

model = Pipeline(stages=[
    Tokenizer(inputCol="name", outputCol="token"),
    StopWordsRemover(inputCol="token", outputCol="stop", stopWords=stop),
]).fit(df)

result = model.transform(df)

Here is the expected output

|name          |stop          |
|--------------|--------------|
|2A            |2A            |
|AZEJADE       |AZEJADE       |
|MONAZTESANTOS |MONAZTESANTOS |
|SCI SANTOS    |SANTOS        |
|SA FCB        |FCB           |
Asked By: A2N15


Answers:

The problem is that you have trailing spaces in your stop words. Also, you don't need to lowercase them unless you need the StopWordsRemover to be case-sensitive: it is case-insensitive by default, and you can change that with the caseSensitive parameter.
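For instance, a minimal sketch of the default and how to flip it (the column names here are just for illustration):

from pyspark.ml.feature import StopWordsRemover

# Case-insensitive by default: "SCI" matches the stop word "sci".
remover = StopWordsRemover(inputCol="token", outputCol="stop",
                           stopWords=["sci", "sa"])
print(remover.getCaseSensitive())  # False

# Case-sensitive variant: only exact-case tokens are removed,
# so "SCI" would survive a stop list containing only "sci".
strict = StopWordsRemover(inputCol="token", outputCol="stop",
                          stopWords=["sci", "sa"], caseSensitive=True)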

Note that Tokenizer lowercases its output. If you need the output to keep the same case as the input column name, it is preferable to simply split the name column on whitespace.
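As a quick illustration of the difference (a sketch, assuming a SparkSession named spark):

from pyspark.ml.feature import Tokenizer
import pyspark.sql.functions as F

sample = spark.createDataFrame([("SCI SANTOS",)], ["name"])

# Tokenizer lowercases while splitting: [sci, santos]
Tokenizer(inputCol="name", outputCol="token").transform(sample).show(truncate=False)

# split preserves the original case: [SCI, SANTOS]
sample.withColumn("token", F.split("name", r"\s+")).show(truncate=False)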

Try with this:

from pyspark.ml.feature import StopWordsRemover
import pyspark.sql.functions as F

# Stop words without trailing spaces; no lowercasing needed,
# since StopWordsRemover is case-insensitive by default.
stop = ['EARL', 'EIRL', 'EURL', 'SARL', 'SA', 'SAS', 'SASU', 'SCI', 'SCM', 'SCP']
df = spark.createDataFrame([("2A",), ("AZEJADE",), ("MONAZTESANTOS",), ("SCI SANTOS",), ("SA FCB",)], ["name"])

# Split on whitespace (raw string for the regex) instead of Tokenizer,
# so the tokens keep their original case.
df = df.withColumn("tokens", F.split("name", r"\s+"))
remover = StopWordsRemover(stopWords=stop, inputCol="tokens", outputCol="stop")

# Re-join the surviving tokens into a single string column.
result = remover.transform(df).select("name", F.array_join("stop", " ").alias("stop"))

result.show()
#+-------------+-------------+
#|         name|         stop|
#+-------------+-------------+
#|           2A|           2A|
#|      AZEJADE|      AZEJADE|
#|MONAZTESANTOS|MONAZTESANTOS|
#|   SCI SANTOS|       SANTOS|
#|       SA FCB|          FCB|
#+-------------+-------------+
Answered By: blackbishop

To remove the stopwords from a DataFrame, I tried a join-and-filter approach:

  1. Left DataFrame: word-count output in the form of a DataFrame
  2. Right DataFrame: stopwords in a single column
  3. Left join on the required 'text' columns
  4. Filter out the records where there is a match in the joined columns
  5. (Lowercased the words in both DataFrames)

from pyspark.sql.functions import col, explode, split, lower, length

# Word counts from the text column, lowercased for matching.
word_df = (clean_df
    .withColumn('words', explode(split(col('course_title'), ' ')))
    .withColumn('lowerCaseWords', lower(col("words")))
    .groupBy('lowerCaseWords')
    .count())

# Stopwords read from a one-column CSV, also lowercased.
stopwords_df = (spark
    .read
    .option("header", False)
    .csv("/FileStore/tables/standard/stopwords.csv")
    .withColumn("stopword", lower(col("_c0"))))

# Left join: stopword is null wherever a word has no match.
join_word_df = (word_df
    .join(stopwords_df, word_df["lowerCaseWords"] == stopwords_df["stopword"], "left"))

# Keep only non-stopwords of length >= 2, most frequent first.
final_wordcount_df = (join_word_df
    .filter(col("stopword").isNull())
    .filter(length(col("lowerCaseWords")) != 1)
    .filter(length(col("lowerCaseWords")) != 0)
    .drop("stopword", "_c0")
    .orderBy(col("count").desc()))

final_wordcount_df.display()  # Databricks-only; use .show() elsewhere
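Steps 3 and 4 above (the left join followed by the null filter) can also be expressed in one go as a left anti join, which keeps only the left-side rows with no match on the right; a sketch reusing the DataFrames above:

final_wordcount_df = (word_df
    .join(stopwords_df, word_df["lowerCaseWords"] == stopwords_df["stopword"], "left_anti")
    .filter(length(col("lowerCaseWords")) > 1)  # same as dropping lengths 0 and 1
    .orderBy(col("count").desc()))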


Answered By: Vibha