How to Find Indices where multiple vectors all are zero
Question:
Beginner pySpark question here:
How do I find the indices where all vectors are zero?
After a series of transformations, I have a Spark DataFrame with ~2.5M rows and a TF-IDF SparseVector column of length ~262K. I would like to perform PCA dimensionality reduction to make this data more manageable for fitting a multi-layer perceptron model, but pyspark's PCA is limited to a maximum of 65,535 columns.
+--------------------+
| tfidf_features| df.count() >>> 2.5M
+--------------------+ Example Vector:
|(262144,[1,37,75,...| SparseVector(262144, {7858: 1.7047, 12326: 1.2993, 15207: 0.0953,
|(262144,[0],[0.12...| 24112: 0.452, 40184: 1.7047,...255115: 1.2993, 255507: 1.2993})
|(262144,[0],[0.12...|
|(262144,[0],[0.12...|
|(262144,[0,6,22,3...|
+--------------------+
Therefore, I would like to delete the indices (columns) of the sparse TF-IDF vector that are zero for all ~2.5M documents (rows). This will hopefully get me under the 65,535-column maximum for PCA.
My plan is to create a udf that (1) converts the SparseVectors to DenseVectors (or np arrays), (2) searches all vectors to find the indices where every vector is zero, and (3) deletes those indices. However, I am struggling with the second part (finding the indices where all vectors equal zero). Here's where I am so far, but I think my plan of attack is far too time-consuming and not very pythonic (especially for such a big dataset):
import numpy as np

row_count = df.count()

def find_zero_indices(df):
    vectors = df.select('tfidf_features').take(row_count)[0]
    zero_indices = []
    to_delete = []
    for vec in vectors:
        vec = vec.toArray()
        for value in vec:
            if value.nonzero():
                zero_indices.append(vec.index(value))
    for value in zero_indices:
        if zero_indices.count(value) == row_count:
            to_delete.append(value)
    return to_delete
Any advice or help appreciated!
Answers:
If anything, it makes more sense to find the indices which should be preserved:
from pyspark.ml.linalg import DenseVector, SparseVector
from pyspark.sql.functions import explode, udf
from operator import itemgetter

@udf("array<integer>")
def indices(v):
    if isinstance(v, DenseVector):
        return list(range(len(v)))
    if isinstance(v, SparseVector):
        return v.indices.tolist()
    return []
indices_list = (df
    .select(explode(indices("tfidf_features")))
    .distinct()
    .rdd.map(itemgetter(0))
    .collect())
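The core idea above is just "take the union of all active indices across rows". It can be sanity-checked without Spark on toy data, treating each sparse vector as a `{index: value}` dict (a stand-in for SparseVector's index/value pairs; the data here is made up for illustration):

```python
# Toy stand-in for the Spark job: each "vector" is a dict {index: value},
# mirroring a SparseVector's stored (index, value) pairs.
vectors = [
    {1: 0.5, 37: 1.2},
    {0: 0.1},
    {0: 0.1, 6: 0.9, 37: 2.0},
]

# Union of all indices that are non-zero in at least one vector --
# the same set the explode(...).distinct() pipeline computes.
indices_list = sorted({i for vec in vectors for i in vec})
print(indices_list)  # [0, 1, 6, 37]
```

Every index outside this union is zero in every row, so dropping the complement loses no information.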
and use VectorSlicer:
from pyspark.ml.feature import VectorSlicer

slicer = VectorSlicer(
    inputCol="tfidf_features",
    outputCol="tfidf_features_subset",
    indices=indices_list)

slicer.transform(df)
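Conceptually, VectorSlicer keeps only the selected indices and re-numbers them compactly to `0..len(indices_list)-1`. A minimal sketch of that remapping on plain Python dicts (toy data, not the Spark API):

```python
# Sketch of the slicing/remapping step: kept indices get compact new
# positions 0..k-1, and each vector's entries are re-keyed accordingly.
indices_list = [0, 1, 6, 37]
position = {old: new for new, old in enumerate(indices_list)}

vec = {1: 0.5, 37: 1.2}  # sparse vector as {index: value}
sliced = {position[i]: v for i, v in vec.items() if i in position}
print(sliced)  # {1: 0.5, 3: 1.2}
```

The resulting vectors have length 4 instead of 262,144, which is what gets you under the PCA column limit.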
However, in practice I would recommend using a fixed-size vector from the start, either with HashingTF:
HashingTF(inputCol="words", outputCol="tfidf_features", numFeatures=65535)
or CountVectorizer:
CountVectorizer(inputCol="words", outputCol="vectorizer_features",
                vocabSize=65535)
In both cases you can combine it with StopWordsRemover.
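For intuition on why `numFeatures` bounds the dimensionality: HashingTF maps each token to a bucket via `hash(token) % numFeatures`, so the vector length is fixed up front at the cost of occasional collisions. A rough pure-Python illustration (Spark's actual HashingTF uses MurmurHash3, not Python's built-in `hash`):

```python
# Rough illustration of the hashing trick behind HashingTF: each token
# lands in a bucket hash(token) % num_features, so vector length is
# fixed regardless of vocabulary size (distinct words may collide).
from collections import Counter

def hashing_tf(words, num_features):
    counts = Counter(hash(w) % num_features for w in words)
    return dict(counts)  # sparse {index: term_frequency}

vec = hashing_tf(["spark", "pca", "spark"], num_features=65535)
assert sum(vec.values()) == 3            # every token lands somewhere
assert all(0 <= i < 65535 for i in vec)  # indices bounded by numFeatures
```

With `numFeatures=65535` the output is already under pyspark's PCA column limit, so no post-hoc column pruning is needed.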