apache-spark-ml

Remove specific stopwords in PySpark

Remove specific stopwords in PySpark Question: New to PySpark, I'd like to remove some French stopwords from a PySpark column. Due to some constraints, I can't use NLTK/spaCy; StopWordsRemover is the only option I have. Below is what I have tried so far without success: from pyspark.ml import * from pyspark.ml.feature import * stop = ['EARL …

Total answers: 2

How do I convert an array (i.e. list) column to Vector

How do I convert an array (i.e. list) column to Vector Question: Short version of the question! Consider the following snippet (assuming spark is already set to some SparkSession): from pyspark.sql import Row source_data = [ Row(city="Chicago", temperatures=[-1.0, -2.0, -3.0]), Row(city="New York", temperatures=[-7.0, -7.0, -5.0]), ] df = spark.createDataFrame(source_data) Notice that the temperatures field is …

Total answers: 3

How to split Vector into columns – using PySpark

How to split Vector into columns – using PySpark Question: Context: I have a DataFrame with 2 columns: word and vector, where the column type of "vector" is VectorUDT. An example: word | vector assert | [435,323,324,212…] And I want to get this: word | v1 | v2 | v3 | v4 | v5 | …

Total answers: 4

Create a custom Transformer in PySpark ML

Create a custom Transformer in PySpark ML Question: I am new to Spark SQL DataFrames and ML on them (PySpark). How can I create a custom tokenizer, which for example removes stop words and uses some libraries from nltk? Can I extend the default one? Asked By: Niko || Source Answers: Can I extend the …

Total answers: 1