Parsing Yann LeCun's MNIST IDX file format
Question:
I would like to understand how to open this version of the MNIST data set. For example, the training set label file train-labels-idx1-ubyte is defined as:
TRAINING SET LABEL FILE (train-labels-idx1-ubyte):
[offset] [type]          [value]           [description]
0000     32 bit integer  0x00000801(2049)  magic number (MSB first)
0004     32 bit integer  60000             number of items
0008     unsigned byte   ??                label
0009     unsigned byte   ??                label
........
xxxx     unsigned byte   ??                label
I found some code online that seems to work, but I do not understand how it works:
import struct

with open('train-labels-idx1-ubyte', 'rb') as f:
    bytes = f.read(8)
    magic, size = struct.unpack(">II", bytes)
print(magic) # 2049
print(size) # 60000
My understanding is that struct.unpack interprets the second argument as a big-endian byte string of two 4-byte integers (see here). When I actually print the value of bytes, though, I get:
b'\x00\x00\x08\x01\x00\x00\xea`'
The first four-byte integer makes sense:
b'\x00\x00\x08\x01'
The first two bytes are 0. The next (0x08) indicates the data are unsigned bytes. And 0x01 indicates a 1-dimensional vector of labels. Assuming my understanding is correct so far, what is happening with the next three (four?) bytes:
...\x00\x00\xea`
How does this translate to 60,000?
Answers:
To understand how it works, you need to convert the bytes to their binary representation.
As you mentioned, Python is correctly extracting the right information:
>>> import struct
>>> with open('train-labels-idx1-ubyte', 'rb') as f:
...     data = f.read(8)
...
>>> print(data)
b'\x00\x00\x08\x01\x00\x00\xea`'
>>> print(struct.unpack('>II', data))
(2049, 60000)
The header consists of two 4-byte integers. We can see their binary and decimal representations by iterating over data:
>>> for char in data:
...     print('{0:08b} - {0:3d} - {1:s}'.format(char, str(bytes([char]))))
...
00000000 -   0 - b'\x00'
00000000 -   0 - b'\x00'
00001000 -   8 - b'\x08'
00000001 -   1 - b'\x01'
00000000 -   0 - b'\x00'
00000000 -   0 - b'\x00'
11101010 - 234 - b'\xea'
01100000 -  96 - b'`'
The easy part is to know that the first 4 bytes are the first integer (the magic number), and the next 4 bytes are the second integer (the number of items).
Then, given these last 4 bytes, there are two ways one can construct the integer value they represent.
The first option (the one used in MNIST) is big-endian, which means that the most significant byte comes first:
00000000 00000000 11101010 01100000
If you check the decimal value of this binary number, it is 60,000, the number of items in the MNIST training set.
Alternatively, we could interpret the bytes as little-endian, in which case the least significant byte comes first:
01100000 11101010 00000000 00000000
In decimal, this is 1,625,948,160.
So, if you convert each byte in \x00\x00\xea` to binary and read the concatenated bits as a single number (reversing the order of the bytes first if the data is little-endian), you get the integer value they represent. Note that the trailing ` is not a stray character: it is how Python prints the byte 0x60 (96), so there are indeed four bytes.
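The arithmetic above can be reproduced without a file. The following is a minimal sketch that hard-codes the four bytes from the question (the variable names are only illustrative) and decodes them both ways, using int.from_bytes and struct.unpack from the standard library:

```python
import struct

raw = b'\x00\x00\xea`'  # the last four header bytes from the question

# Big-endian by hand: weight each byte by a power of 256, leftmost byte highest.
manual = 0x00 * 256**3 + 0x00 * 256**2 + 0xEA * 256 + 0x60

big = int.from_bytes(raw, byteorder='big')
little = int.from_bytes(raw, byteorder='little')

# struct uses '>' for big-endian and '<' for little-endian.
(struct_big,) = struct.unpack('>I', raw)
(struct_little,) = struct.unpack('<I', raw)

print(manual, big, struct_big)  # 60000 60000 60000
print(little, struct_little)    # 1625948160 1625948160
```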
I wrote the following code in case anyone needs to parse the whole dataset of images (as it appears in the question’s title), and not just the header.
import numpy as np
import struct

with open('samples/t10k-images-idx3-ubyte','rb') as f:
    # Header: magic number and image count, then rows and columns (big-endian uint32)
    magic, size = struct.unpack(">II", f.read(8))
    nrows, ncols = struct.unpack(">II", f.read(8))
    # The rest of the file holds one unsigned byte per pixel
    data = np.fromfile(f, dtype=np.dtype(np.uint8).newbyteorder('>'))
    data = data.reshape((size, nrows, ncols))
This assumes you uncompressed the .gz file. You can also work with the compressed file, as indicated by Marktodisco’s answer, by adding import gzip, using gzip.open(...) instead of open(...), and using np.frombuffer(f.read(), ...) instead of np.fromfile(f, ...).
And just to check, show the first digit. In my case it’s a 7.
import matplotlib.pyplot as plt
plt.imshow(data[0,:,:], cmap='gray')
plt.show()
In addition, the following code reads the label file:
with open('samples/t10k-labels-idx1-ubyte','rb') as f:
    # Label files have only two header fields: magic number and item count
    magic, size = struct.unpack(">II", f.read(8))
    data = np.fromfile(f, dtype=np.dtype(np.uint8).newbyteorder('>'))
    data = data.reshape((size,)) # (Optional)
print(data)
# Prints: [7 2 1 ... 4 5 6]
The last reshape can be (size,) or (1, size) depending on your conventions.
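The image and label readers differ only in how many dimension sizes follow the magic number, and the magic number itself says how many there are: its third byte is the data type code (0x08 for unsigned byte) and its fourth byte is the number of dimensions. The sketch below uses that to read any IDX header. The function name read_idx_header is hypothetical, and the demonstration runs on synthetic bytes rather than a real MNIST file:

```python
import io
import struct

def read_idx_header(f):
    """Return (type_code, dims) from a binary file object at the start of an IDX file."""
    # Magic number: two zero bytes, one type-code byte, one dimension-count byte.
    zeros, type_code, ndim = struct.unpack('>HBB', f.read(4))
    if zeros != 0:
        raise ValueError('not an IDX file: the first two bytes should be zero')
    # One big-endian 32-bit size per dimension follows.
    dims = struct.unpack('>' + 'I' * ndim, f.read(4 * ndim))
    return type_code, dims

# Synthetic stand-in for an image file: 2 images of 3 x 4 unsigned bytes.
fake = struct.pack('>IIII', 0x00000803, 2, 3, 4) + bytes(range(24))
with io.BytesIO(fake) as f:
    type_code, dims = read_idx_header(f)
    payload = f.read()

print(hex(type_code), dims, len(payload))  # 0x8 (2, 3, 4) 24
```

The returned dims tuple could then be passed straight to reshape, for example data.reshape(dims), instead of unpacking size, nrows and ncols by hand.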
Carlos’s answer is great, but it breaks if the files are still in .gz format. When I run the code I get the following error:
ValueError: cannot reshape array of size 1648861 into shape (10000,28,28)
Since the raw data downloads with a .gz extension by default, I’ve modified Carlos’s code as shown below.
import gzip
import struct
import numpy as np

with gzip.open('t10k-images-idx3-ubyte.gz','rb') as f:
    # gzip.open decompresses on the fly, so the header is read as before
    magic, size = struct.unpack(">II", f.read(8))
    nrows, ncols = struct.unpack(">II", f.read(8))
    # np.fromfile cannot read from a gzip file object, so read the bytes first
    data = np.frombuffer(f.read(), dtype=np.dtype(np.uint8).newbyteorder('>'))
    data = data.reshape((size, nrows, ncols))
And the images still load correctly.
import matplotlib.pyplot as plt
plt.imshow(data[0,:,:], cmap='gray')
plt.show()
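If you do not want to care whether a given file has been decompressed yet, one option is to peek at its first two bytes, since every gzip stream begins with 0x1f 0x8b. This is only a sketch: open_maybe_gzip is a hypothetical helper, and the demonstration writes throwaway files rather than touching the real downloads.

```python
import gzip
import os
import tempfile

GZIP_MAGIC = b'\x1f\x8b'  # every gzip stream starts with these two bytes

def open_maybe_gzip(path):
    """Open path for binary reading, decompressing on the fly if it is gzipped."""
    with open(path, 'rb') as f:
        is_gzip = f.read(2) == GZIP_MAGIC
    return gzip.open(path, 'rb') if is_gzip else open(path, 'rb')

# Write the same 8-byte label header once raw and once gzipped.
content = b'\x00\x00\x08\x01\x00\x00\xea`'
with tempfile.TemporaryDirectory() as tmp:
    raw_path = os.path.join(tmp, 'labels-raw')
    gz_path = os.path.join(tmp, 'labels-gz')
    with open(raw_path, 'wb') as f:
        f.write(content)
    with gzip.open(gz_path, 'wb') as f:
        f.write(content)
    with open_maybe_gzip(raw_path) as f:
        from_raw = f.read()
    with open_maybe_gzip(gz_path) as f:
        from_gz = f.read()

print(from_raw == from_gz == content)  # True
```

Because an IDX file always starts with two zero bytes, it cannot be mistaken for a gzip stream by this check.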
Merging the answers above, this works for me:
import gzip
import struct
import numpy as np

def load_dataset(path_dataset):
    # Images: 16-byte header (magic, count, rows, cols), then one byte per pixel
    with gzip.open(path_dataset,'rb') as f:
        magic, size = struct.unpack(">II", f.read(8))
        nrows, ncols = struct.unpack(">II", f.read(8))
        data = np.frombuffer(f.read(), dtype=np.dtype(np.uint8).newbyteorder('>'))
        data = data.reshape((size, nrows, ncols))
    return data

def load_label(path_label):
    # Labels: 8-byte header (magic, count), then one byte per label
    with gzip.open(path_label,'rb') as f:
        magic, size = struct.unpack('>II', f.read(8))
        label = np.frombuffer(f.read(), dtype=np.dtype(np.uint8).newbyteorder('>'))
    return label

X = load_dataset(r'samples/train-images-idx3-ubyte.gz')
y = load_label(r'samples/train-labels-idx1-ubyte.gz')