How to inspect a TensorFlow .tfrecord file?
Question:
I have a .tfrecord but I don’t know how it is structured. How can I inspect the schema to understand what the .tfrecord file contains?
All Stack Overflow answers and documentation seem to assume I know the structure of the file.
reader = tf.TFRecordReader()
file = tf.train.string_input_producer("record.tfrecord")
_, serialized_record = reader.read(file)
# ... how to inspect serialized_record?
Answers:
Use TensorFlow tf.TFRecordReader with the tf.parse_single_example decoder, as specified in https://www.tensorflow.org/programmers_guide/reading_data
P.S. A .tfrecord file contains ‘Example’ records, defined in https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/example/example.proto
Once you extract the record into a string, parsing it looks something like this:
a = tf.train.Example()
# ParseFromString fills `a` in place; its return value is not the parsed message.
a.ParseFromString(binary_string_with_example_record)
However, I’m not sure where the raw support for extracting individual records from a file lives; you can track it down in TFRecordReader.
Found it!
import tensorflow as tf
# TF 1.x API; in TF 2.x the same iterator is available as tf.compat.v1.io.tf_record_iterator.
for example in tf.python_io.tf_record_iterator("data/foobar.tfrecord"):
    print(tf.train.Example.FromString(example))
You can also add:
from google.protobuf.json_format import MessageToJson
...
jsonMessage = MessageToJson(tf.train.Example.FromString(example))
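The record framing can also be walked without TensorFlow. Each record in a TFRecord file is an 8-byte little-endian length, a 4-byte masked CRC of that length, the payload bytes, and a 4-byte masked CRC of the payload. The following is a minimal sketch, assuming an uncompressed file and skipping CRC verification; `iter_tfrecord_payloads` is a hypothetical helper name, not a TensorFlow API:

```python
import struct


def iter_tfrecord_payloads(path):
    """Yield the raw payload bytes of each record in an uncompressed TFRecord file.

    Hypothetical helper for illustration; the CRC fields are read but not verified.
    """
    with open(path, "rb") as f:
        while True:
            header = f.read(12)  # uint64 length + uint32 masked CRC of the length
            if not header:
                return  # clean end of file
            if len(header) < 12:
                raise ValueError("truncated record header")
            (length,) = struct.unpack("<Q", header[:8])
            payload = f.read(length)
            footer = f.read(4)  # uint32 masked CRC of the payload
            if len(payload) < length or len(footer) < 4:
                raise ValueError("truncated record body")
            yield payload
```

Each yielded payload is a serialized record, so it can be handed to tf.train.Example.FromString in the same way as the items produced by the iterator above.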
If your .tfrecord contains SequenceExample records, the accepted answer won’t show you everything. You can use:
import tensorflow as tf
for example in tf.python_io.tf_record_iterator("data/foobar.tfrecord"):
    result = tf.train.SequenceExample.FromString(example)
    break
print(result)
This will show you the content of the first example.
Then you can also inspect individual Features using their keys:
result.context.feature["foo_key"]
And for FeatureLists:
result.feature_lists.feature_list["bar_key"]
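To make the two containers concrete, here is a small sketch that builds a SequenceExample in memory, serializes it, and lists its keys after parsing. The keys `foo_key` and `bar_key` mirror the placeholders above; the values are invented for illustration.

```python
import tensorflow as tf

# Build a toy SequenceExample (values are made up for illustration).
seq = tf.train.SequenceExample()
seq.context.feature["foo_key"].int64_list.value.append(7)
for step in (1.0, 2.0):
    # Each add() appends one Feature (one time step) to the FeatureList.
    seq.feature_lists.feature_list["bar_key"].feature.add().float_list.value.append(step)

# Round-trip through bytes, as if the record had been read from a file.
result = tf.train.SequenceExample.FromString(seq.SerializeToString())

context_keys = sorted(result.context.feature.keys())
sequence_keys = sorted(result.feature_lists.feature_list.keys())
num_steps = len(result.feature_lists.feature_list["bar_key"].feature)
print(context_keys, sequence_keys, num_steps)
```

Listing the keys first is usually enough to decide which entries are worth printing in full.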
If it’s an option to install another Python package, tfrecord_lite is very convenient.
Example:
In [1]: import tensorflow as tf
   ...: from tfrecord_lite import decode_example
   ...:
   ...: it = tf.python_io.tf_record_iterator('nsynth-test.tfrecord')
   ...: decode_example(next(it))
   ...:
Out[1]:
{'audio': array([ 3.8138387e-06, -3.8721851e-06, 3.9331076e-06, ...,
        -3.6526076e-06, 3.7041993e-06, -3.7578957e-06], dtype=float32),
 'instrument': array([417], dtype=int64),
 'instrument_family': array([0], dtype=int64),
 'instrument_family_str': [b'bass'],
 'instrument_source': array([2], dtype=int64),
 'instrument_source_str': [b'synthetic'],
 'instrument_str': [b'bass_synthetic_033'],
 'note': array([149013], dtype=int64),
 'note_str': [b'bass_synthetic_033-100-100'],
 'pitch': array([100], dtype=int64),
 'qualities': array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int64),
 'sample_rate': array([16000], dtype=int64),
 'velocity': array([100], dtype=int64)}
You can install it with pip install tfrecord_lite.
The above solutions didn’t work for me, so for TF 2.0 use this:
import tensorflow as tf
raw_dataset = tf.data.TFRecordDataset("path-to-file")
for raw_record in raw_dataset.take(1):
    example = tf.train.Example()
    example.ParseFromString(raw_record.numpy())
    print(example)
https://www.tensorflow.org/tutorials/load_data/tfrecord#reading_a_tfrecord_file_2
I’d recommend the following script: tfrecord-view.
It enables convenient visual inspection of TF records using TF and OpenCV, although it needs a few modifications (for labels and such).
See further instructions inside the repository.
An improvement on the accepted solution:
import tensorflow as tf
import json
from google.protobuf.json_format import MessageToJson
dataset = tf.data.TFRecordDataset("mydata.tfrecord")
for d in dataset:
    ex = tf.train.Example()
    ex.ParseFromString(d.numpy())
    m = json.loads(MessageToJson(ex))
    print(m['features']['feature'].keys())
In my case, I was running on TF2, and a single example was too big to fit on my screen, so I needed to use a dictionary to inspect the keys (the accepted solution returns a full string).
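If the goal is a schema rather than the values, the Feature message itself records which of its three list types is populated. The following is a minimal sketch, assuming the records are tf.train.Example messages; `summarize_example` is a hypothetical helper name, and the toy keys and values are invented for illustration:

```python
import tensorflow as tf


def summarize_example(serialized):
    """Map each feature key to (list kind, number of values) for one serialized Example."""
    ex = tf.train.Example.FromString(serialized)
    summary = {}
    for key, feature in ex.features.feature.items():
        # 'kind' is the oneof in the Feature proto: bytes_list, float_list or int64_list.
        kind = feature.WhichOneof("kind")
        count = len(getattr(feature, kind).value) if kind else 0
        summary[key] = (kind, count)
    return summary


# Toy record built in memory; with a real file, pass raw_record.numpy() instead.
toy = tf.train.Example()
toy.features.feature["pitch"].int64_list.value.append(100)
toy.features.feature["audio"].float_list.value.extend([0.1, 0.2, 0.3])
print(summarize_example(toy.SerializeToString()))
```

The kind and value count per key are typically what is needed to write a matching feature spec for tf.io.parse_single_example.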
The answer from amalik works. In addition, you can decode the record with whatever method you have already implemented. For example, here I check the images saved in the TFRecord by reshaping them to tensors and then converting them to NumPy arrays:
raw_dataset = tf.data.TFRecordDataset('/content/valid.tfrecords')
for raw_record in raw_dataset.take(1):
    x, y = decode_record_spatial_measureimage(raw_record)
    print(type(x.numpy()))
    draw(x)
where I use this method to decode the two images in the TFRecord:
def decode_record_spatial_measureimage(record):
    # input_shape must be defined elsewhere and match the shape the images were written with.
    name_to_features = {
        'input': tf.io.FixedLenFeature([], tf.string),
        'ground': tf.io.FixedLenFeature([], tf.string),
    }
    new_record = tf.io.parse_single_example(record, name_to_features)
    input_raw = tf.io.decode_raw(new_record['input'], out_type=tf.float32)
    ground_raw = tf.io.decode_raw(new_record['ground'], out_type=tf.float32)
    return tf.reshape(input_raw, input_shape), tf.reshape(ground_raw, input_shape)
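The decoding above only works if the feature names, dtype, and shape match how the file was written. Here is a self-contained sketch of the round trip, assuming each image was stored as raw float32 bytes under a single bytes feature. The key `input` follows the answer, while the 2x3 shape and the pixel values are invented for illustration:

```python
import numpy as np
import tensorflow as tf

input_shape = (2, 3)  # assumed shape; must match what the writer used
image = np.arange(6, dtype=np.float32).reshape(input_shape)

# Writer side: store the array as raw bytes in a bytes_list feature.
ex = tf.train.Example()
ex.features.feature["input"].bytes_list.value.append(image.tobytes())
serialized = ex.SerializeToString()

# Reader side: parse with a matching spec, then decode the bytes and reshape.
spec = {"input": tf.io.FixedLenFeature([], tf.string)}
parsed = tf.io.parse_single_example(serialized, spec)
decoded = tf.reshape(tf.io.decode_raw(parsed["input"], out_type=tf.float32), input_shape)
print(decoded.numpy())
```

If the dtype or shape passed on the reader side differs from what was written, decode_raw or reshape will either fail or silently produce wrong values, which is why inspecting the raw Example first is useful.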