How to Find Indices where multiple vectors all are zero

Question:

Beginner pySpark question here:

How do I find the indices where all vectors are zero?

After a series of transformations, I have a spark df with ~2.5M rows and a tfidf Sparse Vector of length ~262K. I would like to perform PCA dimensionality reduction to make this data more manageable for multi-layer perceptron model fitting, but pyspark’s PCA is limited to a max of 65,535 columns.

+--------------------+
|      tfidf_features|
+--------------------+
|(262144,[1,37,75,...|
|(262144,[0],[0.12...|
|(262144,[0],[0.12...|
|(262144,[0],[0.12...|
|(262144,[0,6,22,3...|
+--------------------+

df.count() >>> ~2.5M

Example vector:
SparseVector(262144, {7858: 1.7047, 12326: 1.2993, 15207: 0.0953,
    24112: 0.452, 40184: 1.7047, ..., 255115: 1.2993, 255507: 1.2993})

Therefore, I would like to delete the indices (columns) of the sparse tfidf vector that are zero for all ~2.5M documents (rows). This will hopefully get me under the 65,535-column maximum for PCA.

My plan is to create a udf that (1) converts the Sparse Vectors to Dense Vectors (or np arrays), (2) searches all Vectors to find the indices where every Vector is zero, and (3) deletes those indices. However, I am struggling with the second part (finding the indices where all vectors equal zero). Here's where I am so far, but I think my plan of attack is way too time-consuming and not very pythonic (especially for such a big dataset):

import numpy as np

def find_zero_indices(df):
    # Pull every tfidf vector back to the driver (very slow for ~2.5M rows)
    vectors = [row['tfidf_features'] for row in df.select('tfidf_features').collect()]
    nonzero_indices = set()
    to_delete = []
    for vec in vectors:
        # Record every index that is nonzero in at least one vector
        for index in np.nonzero(vec.toArray())[0]:
            nonzero_indices.add(index)
    # Any index never seen as nonzero is zero in every row and can be deleted
    for index in range(vectors[0].size):
        if index not in nonzero_indices:
            to_delete.append(index)
    return to_delete

Any advice or help appreciated!

Asked By: whs2k


Answers:

If anything, it makes more sense to find indices which should be preserved:

from pyspark.ml.linalg import DenseVector, SparseVector
from pyspark.sql.functions import explode, udf
from operator import itemgetter

@udf("array<integer>")
def indices(v):
    if isinstance(v, DenseVector):
        return [i for i in range(len(v))]
    if isinstance(v, SparseVector):
        return v.indices.tolist()
    return []

indices_list = (df
    .select(explode(indices("tfidf_features")))
    .distinct()
    .rdd.map(itemgetter(0))
    .collect())

and use VectorSlicer:

from pyspark.ml.feature import VectorSlicer

slicer = VectorSlicer(
    inputCol="tfidf_features",
    outputCol="tfidf_features_subset", indices=indices_list)

slicer.transform(df)
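
Assuming len(indices_list) now comes in at or below the 65,535-column limit, the sliced vectors can be fed straight into PCA. A minimal sketch (k=50 and the output column name are placeholders, not part of the original pipeline):

from pyspark.ml.feature import PCA

sliced = slicer.transform(df)

# k=50 is only an example; pick the number of components your MLP needs
pca = PCA(k=50, inputCol="tfidf_features_subset", outputCol="pca_features")
df_reduced = pca.fit(sliced).transform(sliced)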

However, in practice I would recommend using a fixed-size vector, either with HashingTF:

HashingTF(inputCol="words", outputCol="tfidf_features", numFeatures=65535)

or CountVectorizer:

CountVectorizer(inputCol="words", outputCol="vectorizer_features", 
    vocabSize=65535)

In both cases you can combine it with StopWordsRemover.
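
For example, a fixed-size tf-idf pipeline could look roughly like this (a sketch; tokenized_df and the intermediate column names are assumptions, with "words" holding the tokenized text as above):

from pyspark.ml import Pipeline
from pyspark.ml.feature import StopWordsRemover, HashingTF, IDF

remover = StopWordsRemover(inputCol="words", outputCol="filtered_words")
hashing_tf = HashingTF(
    inputCol="filtered_words", outputCol="raw_features", numFeatures=65535)
idf = IDF(inputCol="raw_features", outputCol="tfidf_features")

# Fit the whole pipeline on the tokenized documents in one go
pipeline = Pipeline(stages=[remover, hashing_tf, idf])
tfidf_df = pipeline.fit(tokenized_df).transform(tokenized_df)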

Answered By: zero323