Tensorflow One Hot Encoder?

Question:

Does TensorFlow have something similar to scikit-learn's one-hot encoder for processing categorical data? Would using a placeholder of tf.string behave as categorical data?

I realize I can manually pre-process the data before sending it to tensorflow, but having it built in is very convenient.

Asked By: Robert Graves

Answers:

tf.one_hot() is available in TF and easy to use.

Let's assume you have 4 possible categories (cat, dog, bird, human) and 2 instances (cat, human). So depth=4 and indices=[0, 3]:

import tensorflow as tf
res = tf.one_hot(indices=[0, 3], depth=4)
with tf.Session() as sess:
    print(sess.run(res))

Keep in mind that if you provide an index of -1 you will get all zeros in that one-hot vector.
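For example (a minimal sketch using the Session API from above):

import tensorflow as tf

# an index of -1 produces an all-zero row
res = tf.one_hot(indices=[0, -1, 3], depth=4)
with tf.Session() as sess:
    print(sess.run(res))
# [[1. 0. 0. 0.]
#  [0. 0. 0. 0.]
#  [0. 0. 0. 1.]]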

Old answer, from when this function was not available:

After looking through the Python documentation, I have not found anything similar. One thing that strengthens my belief that it does not exist is that in their own examples they write one_hot manually:

import numpy

def dense_to_one_hot(labels_dense, num_classes=10):
  """Convert class labels from scalars to one-hot vectors."""
  num_labels = labels_dense.shape[0]
  index_offset = numpy.arange(num_labels) * num_classes
  labels_one_hot = numpy.zeros((num_labels, num_classes))
  labels_one_hot.flat[index_offset + labels_dense.ravel()] = 1
  return labels_one_hot
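For example, a quick sketch of how this helper behaves on a small batch (the labels here are made up):

labels_dense = numpy.array([0, 2, 1])
print(dense_to_one_hot(labels_dense, num_classes=3))
# [[1. 0. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]]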

You can also do this in scikit-learn.

Answered By: Salvador Dali

As of TensorFlow 0.8, there is now a native one-hot op, tf.one_hot, that can convert a set of sparse labels to a dense one-hot representation. This is in addition to tf.nn.sparse_softmax_cross_entropy_with_logits, which in some cases lets you compute the cross entropy directly on the sparse labels instead of converting them to one-hot.
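For instance, the sparse variant takes integer labels directly, so no one-hot step is needed (a minimal sketch; the logits here are a stand-in for real model outputs):

import tensorflow as tf

logits = tf.random_normal([4, 10])   # hypothetical [batch_size, num_classes] outputs
labels = tf.constant([3, 1, 7, 0])   # sparse integer labels, no one-hot conversion
loss = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits)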

Previous answer, in case you want to do it the old way:
@Salvador's answer is correct – there used to be no native op to do it. Instead of doing it in numpy, though, you can do it natively in TensorFlow using the sparse-to-dense operators:

num_labels = 10

# label_batch is a tensor of numeric labels to process
# 0 <= label < num_labels

# Note: this uses the pre-1.0 API (tf.concat took the axis as its first
# argument, and tf.pack was later renamed tf.stack).
sparse_labels = tf.reshape(label_batch, [-1, 1])
derived_size = tf.shape(label_batch)[0]
indices = tf.reshape(tf.range(0, derived_size, 1), [-1, 1])
concated = tf.concat(1, [indices, sparse_labels])
outshape = tf.pack([derived_size, num_labels])
labels = tf.sparse_to_dense(concated, outshape, 1.0, 0.0)

The output, labels, is a one-hot matrix of batch_size x num_labels.

Note also that as of 2016-02-12 (which I assume will eventually be part of a 0.7 release), TensorFlow also has the tf.nn.sparse_softmax_cross_entropy_with_logits op, which in some cases can let you do training without needing to convert to a one-hot encoding.

Edited to add: At the end, you may need to explicitly set the shape of labels. The shape inference doesn’t recognize the size of the num_labels component. If you don’t need a dynamic batch size with derived_size, this can be simplified.
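For instance, continuing the snippet above (a minimal sketch of that explicit shape fix):

# shape inference only knows labels is 2-D; pin the num_labels dimension manually
labels.set_shape([None, num_labels])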

Edited 2016-02-12 to change the assignment of outshape per comment below.

Answered By: dga

Take a look at tf.nn.embedding_lookup. It maps from categorical IDs to their embeddings.

For an example of how it’s used for input data, see here.
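A minimal sketch of the idea (the vocabulary size and embedding dimension here are made up):

import tensorflow as tf

# one trainable row per category ID: 5 categories, 3-dimensional embeddings
embeddings = tf.Variable(tf.random_uniform([5, 3]))
ids = tf.constant([0, 2, 2, 4])                    # categorical IDs
vectors = tf.nn.embedding_lookup(embeddings, ids)  # shape [4, 3]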

Answered By: Markus

Maybe it’s due to changes to Tensorflow since Nov 2015, but @dga’s answer produced errors. I did get it to work with the following modifications:

sparse_labels = tf.reshape(label_batch, [-1, 1])
derived_size = tf.shape(sparse_labels)[0]
indices = tf.reshape(tf.range(0, derived_size, 1), [-1, 1])
concated = tf.concat(1, [indices, sparse_labels])
outshape = tf.concat(0, [tf.reshape(derived_size, [1]), tf.reshape(num_labels, [1])])
labels = tf.sparse_to_dense(concated, outshape, 1.0, 0.0)

Answered By: CFB

You can use tf.sparse_to_dense:

The sparse_indices argument indicates where the ones should go, output_shape should be set to the shape of the dense output (e.g. the batch size by the number of labels), and sparse_values should be 1 with the desired type (the type of sparse_values determines the type of the output).
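For example (a minimal sketch of those arguments; the indices are made up):

import tensorflow as tf

# put a 1.0 at positions (0, 2) and (1, 0) of a 2x4 dense output
one_hot = tf.sparse_to_dense(sparse_indices=[[0, 2], [1, 0]],
                             output_shape=[2, 4],
                             sparse_values=1.0)
# [[0. 0. 1. 0.]
#  [1. 0. 0. 0.]]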

Answered By: Josh11b

There are embedding_ops in Scikit Flow, as well as examples that deal with categorical variables.

If you are just beginning to learn TensorFlow, I would suggest trying out the examples in TensorFlow/skflow first; then, once you are more familiar with TensorFlow, it would be fairly easy for you to insert TensorFlow code to build the custom model you want (there are also examples for this).

I hope these examples for image and text understanding get you started; let us know if you encounter any issues (post issues or tag skflow on SO)!

Answered By: Yuan Tang

Recent versions of TensorFlow (nightlies and maybe even 0.7.1) have an op called tf.one_hot that does what you want. Check it out!

On the other hand, if you have a dense matrix and you want to look up and aggregate values in it, you would want to use the embedding_lookup function.
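A rough sketch of that lookup-and-aggregate pattern (the matrix and IDs here are made up):

import tensorflow as tf

params = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])  # dense matrix
ids = tf.constant([0, 2])
rows = tf.nn.embedding_lookup(params, ids)  # gathers rows 0 and 2
total = tf.reduce_sum(rows, 0)              # aggregates them to [6.0, 8.0]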

Answered By: Eugene Brevdo

Current versions of TensorFlow implement the following function for creating one-hot tensors:

https://www.tensorflow.org/versions/master/api_docs/python/array_ops.html#one_hot

Answered By: Peteris

A simple and short way to one-hot encode any integer or list of integers:

import numpy as np
import tensorflow as tf

a = 5
b = [1, 2, 3]
# one-hot an integer
one_hot_a = tf.nn.embedding_lookup(np.identity(10), a)
# one-hot a list of integers
one_hot_b = tf.nn.embedding_lookup(np.identity(max(b)+1), b)

Answered By: Rajarshee Mitra

numpy does it!

import numpy as np

# index an identity matrix with the integer class labels
np.eye(n_labels)[target_vector]

Answered By: Prakhar Agrawal

There are a couple of ways to do it.

import numpy as np
import tensorflow as tf

ans = tf.constant([[5, 6, 0, 0], [5, 6, 7, 0]])  # batch_size * max_seq_len
labels = tf.reduce_sum(tf.nn.embedding_lookup(np.identity(10), ans), 1)

>>> [[ 2.  0.  0.  0.  0.  1.  1.  0.  0.  0.]
>>> [ 1.  0.  0.  0.  0.  1.  1.  1.  0.  0.]]

(Note that the padding zeros accumulate at index 0, so you may want to mask them out.)

The other way to do it:

labels2 = tf.reduce_sum(tf.one_hot(ans, depth=10, on_value=1, off_value=0, axis=1), 2)

>>> [[2 0 0 0 0 1 1 0 0 0]
>>> [1 0 0 0 0 1 1 1 0 0]]

Answered By: Apurv

My version of @CFB's and @dga's examples, shortened a bit for easier understanding:

import tensorflow as tf

num_labels = 10
labels_batch = [2, 3, 5, 9]

sparse_labels = tf.reshape(labels_batch, [-1, 1])
derived_size = len(labels_batch)
indices = tf.reshape(tf.range(0, derived_size, 1), [-1, 1])
concated = tf.concat(1, [indices, sparse_labels])  # pre-1.0 argument order: axis first
labels = tf.sparse_to_dense(concated, [derived_size, num_labels], 1.0, 0.0)

Answered By: VladimirLenin

As mentioned above by @dga, TensorFlow has tf.one_hot now:

labels = tf.constant([5,3,2,4,1])
highest_label = tf.reduce_max(labels)
labels_one_hot = tf.one_hot(labels, highest_label + 1)

array([[ 0.,  0.,  0.,  0.,  0.,  1.],
       [ 0.,  0.,  0.,  1.,  0.,  0.],
       [ 0.,  0.,  1.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  1.,  0.],
       [ 0.,  1.,  0.,  0.,  0.,  0.]], dtype=float32)

You need to specify depth, otherwise you’ll get a pruned one-hot tensor.
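For example, deriving depth from the batch itself prunes any classes that happen to be absent (a made-up illustration):

labels = tf.constant([1, 0])  # class 2 exists but is absent from this batch
pruned = tf.one_hot(labels, tf.reduce_max(labels) + 1)  # shape [2, 2], not [2, 3]
full = tf.one_hot(labels, depth=3)                      # shape [2, 3], as intended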

If you'd like to do it manually:

labels = tf.constant([5,3,2,4,1])
size = tf.shape(labels)[0]
highest_label = tf.reduce_max(labels)
labels_t = tf.reshape(labels, [-1, 1])
indices = tf.reshape(tf.range(size), [-1, 1])
idx_with_labels = tf.concat([indices, labels_t], 1)
labels_one_hot = tf.sparse_to_dense(idx_with_labels, [size, highest_label + 1], 1.0)

array([[ 0.,  0.,  0.,  0.,  0.,  1.],
       [ 0.,  0.,  0.,  1.,  0.,  0.],
       [ 0.,  0.,  1.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  1.,  0.],
       [ 0.,  1.,  0.,  0.,  0.,  0.]], dtype=float32)

Note the argument order in tf.concat(); it changed to (values, axis) in TF 1.0.

Answered By: Alex Svetkin

In [7]: one_hot = tf.nn.embedding_lookup(np.eye(5), [1,2])

In [8]: one_hot.eval()
Out[8]: 
array([[ 0.,  1.,  0.,  0.,  0.],
       [ 0.,  0.,  1.,  0.,  0.]])

Works on TF version 1.3.0, as of Sep 2017.

Answered By: aerin

TensorFlow 2.0 Compatible Answer: You can do it efficiently using TensorFlow Transform.

Code for performing one-hot encoding using TensorFlow Transform is shown below:

def get_feature_columns(tf_transform_output):
  """Returns the FeatureColumns for the model.

  Args:
    tf_transform_output: A `TFTransformOutput` object.

  Returns:
    A list of FeatureColumns.
  """
  # Wrap scalars as real valued columns.
  real_valued_columns = [tf.feature_column.numeric_column(key, shape=())
                         for key in NUMERIC_FEATURE_KEYS]

  # Wrap categorical columns.
  one_hot_columns = [
      tf.feature_column.categorical_column_with_vocabulary_file(
          key=key,
          vocabulary_file=tf_transform_output.vocabulary_file_by_name(
              vocab_filename=key))
      for key in CATEGORICAL_FEATURE_KEYS]

  return real_valued_columns + one_hot_columns
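Note that these are categorical feature columns; to feed them to a dense model as actual one-hot vectors you would typically wrap each one in tf.feature_column.indicator_column (a sketch, not part of the tutorial code):

one_hot_columns = [
    tf.feature_column.indicator_column(
        tf.feature_column.categorical_column_with_vocabulary_file(
            key=key,
            vocabulary_file=tf_transform_output.vocabulary_file_by_name(
                vocab_filename=key)))
    for key in CATEGORICAL_FEATURE_KEYS]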

For more information, refer to this tutorial on TF Transform.

Answered By: Tensorflow Support