How to interpret results of Spark OneHotEncoder

Question:

I read the OHE entry from Spark docs,

One-hot encoding maps a column of label indices to a column of binary vectors, with at most a single one-value. This encoding allows algorithms which expect continuous features, such as Logistic Regression, to use categorical features.

but unfortunately it does not fully explain the OHE result. So I ran the given code:

from pyspark.ml.feature import OneHotEncoder, StringIndexer

df = sqlContext.createDataFrame([
    (0, "a"),
    (1, "b"),
    (2, "c"),
    (3, "a"),
    (4, "a"),
    (5, "c")
], ["id", "category"])

stringIndexer = StringIndexer(inputCol="category", outputCol="categoryIndex")
model = stringIndexer.fit(df)
indexed = model.transform(df)

encoder = OneHotEncoder(inputCol="categoryIndex", outputCol="categoryVec")
# On Spark 3.0+, OneHotEncoder is an estimator: use encoder.fit(indexed).transform(indexed)
encoded = encoder.transform(indexed)
encoded.show()

And got the results:

   +---+--------+-------------+-------------+
   | id|category|categoryIndex|  categoryVec|
   +---+--------+-------------+-------------+
   |  0|       a|          0.0|(2,[0],[1.0])|
   |  1|       b|          2.0|    (2,[],[])|
   |  2|       c|          1.0|(2,[1],[1.0])|
   |  3|       a|          0.0|(2,[0],[1.0])|
   |  4|       a|          0.0|(2,[0],[1.0])|
   |  5|       c|          1.0|(2,[1],[1.0])|
   +---+--------+-------------+-------------+

How should I interpret the OHE results (the last column)?

Asked By: Maria

Answers:

One-hot encoding transforms the values in categoryIndex into binary vectors in which at most one value is 1 and the others are 0. Since there are three categories and the last one is dropped by default, each vector has length 2 and the mapping is as follows:

0  -> 10
1  -> 01
2  -> 00

(Why is the mapping like this? See this question about the one-hot encoder dropping the last category.)
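
The mapping above can be reproduced in plain Python. This is only a minimal sketch to illustrate the idea, not Spark's implementation; the helper name one_hot_drop_last is hypothetical.

```python
def one_hot_drop_last(index, num_categories):
    """Encode `index` as a list of length num_categories - 1.

    The last category index gets no slot of its own, so it is
    represented by the all-zero vector.
    """
    vec = [0.0] * (num_categories - 1)
    if index < num_categories - 1:
        vec[index] = 1.0
    return vec


# Three categories, as in the question (indices 0, 1, 2).
for i in range(3):
    print(i, "->", one_hot_drop_last(i, 3))
```

With three categories, index 2 is the dropped one, which is why category b (index 2.0) has no 1 anywhere in its vector.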

The values in column categoryVec are exactly these, but represented in sparse format, in which the zeros of a vector are not printed. The first value (2) is the length of the vector, the second is an array listing the zero or more indices where non-zero entries are found, and the third is an array giving the values at those indices.
So (2,[0],[1.0]) means a vector of length 2 with 1.0 at position 0 and 0 elsewhere, and (2,[],[]) means a vector of length 2 with no non-zero entries, i.e. all zeros.
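
To make the sparse notation concrete, here is a small plain-Python sketch that expands a (size, indices, values) triple into a dense list. The function name sparse_to_dense is made up for illustration and is not a Spark API.

```python
def sparse_to_dense(size, indices, values):
    """Expand a sparse (size, indices, values) triple into a dense list."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense


print(sparse_to_dense(2, [0], [1.0]))  # category a -> [1.0, 0.0]
print(sparse_to_dense(2, [1], [1.0]))  # category c -> [0.0, 1.0]
print(sparse_to_dense(2, [], []))      # category b -> [0.0, 0.0]
```

In PySpark itself, the vector objects stored in the column should offer a toArray() method that does the same conversion.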

See: https://spark.apache.org/docs/latest/mllib-data-types.html#local-vector

Answered By: moe