Why does PyTorch officially use mean=[0.485, 0.456, 0.406] and std=[0.229, 0.224, 0.225] to normalize images?

Question:

In this page (https://pytorch.org/vision/stable/models.html), it says that "All pre-trained models expect input images normalized in the same way, i.e. mini-batches of 3-channel RGB images of shape (3 x H x W), where H and W are expected to be at least 224. The images have to be loaded in to a range of [0, 1] and then normalized using mean = [0.485, 0.456, 0.406] and std = [0.229, 0.224, 0.225]".

Shouldn’t the usual mean and std of normalization be [0.5, 0.5, 0.5] and [0.5, 0.5, 0.5]? Why is it setting such strange values?

Asked By: laridzhang


Answers:

Using the mean and std of ImageNet is a common practice. They are calculated based on millions of images. If you want to train from scratch on your own dataset, you can calculate the new mean and std. Otherwise, using the ImageNet pretrained model with its own mean and std is recommended.
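For reference, here is what using those statistics typically looks like in a torchvision preprocessing pipeline (a minimal sketch of the standard transform chain for an ImageNet-pretrained model):

from torchvision import transforms

# Standard preprocessing for ImageNet-pretrained torchvision models:
# resize, center-crop to 224x224, convert to a [0, 1] tensor, then
# normalize with the ImageNet channel statistics from the question.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),  # scales pixel values to [0, 1]
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])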

Answered By: zihaozhihao

In that example, they are using the mean and stddev of ImageNet, but if you look at their MNIST examples, the mean and stddev are 1-dimensional (since the inputs are greyscale, with no RGB channels).

Whether or not to use ImageNet’s mean and stddev depends on your data. Assuming your data are ordinary photos of "natural scenes" (people, buildings, animals, varied lighting/angles/backgrounds, etc.), and assuming your dataset is biased in the same way ImageNet is (in terms of class balance), then it’s ok to normalize with ImageNet’s scene statistics. If the photos are "special" somehow (color filtered, contrast adjusted, uncommon lighting, etc.) or an "un-natural subject" (medical images, satellite imagery, hand drawings, etc.) then I would recommend correctly normalizing your dataset before model training!*

Here’s some sample code to get you started:

import os
import torch
from torchvision import datasets, transforms
from tqdm import tqdm
from time import time

N_CHANNELS = 1

dataset = datasets.MNIST("data", download=True,
                         train=True, transform=transforms.ToTensor())
# batch_size defaults to 1, so each iteration yields a single image
full_loader = torch.utils.data.DataLoader(dataset, shuffle=False, num_workers=os.cpu_count())

before = time()
mean = torch.zeros(N_CHANNELS)
std = torch.zeros(N_CHANNELS)
print('==> Computing mean and std..')
for inputs, _labels in tqdm(full_loader):
    for i in range(N_CHANNELS):
        mean[i] += inputs[:, i, :, :].mean()
        std[i] += inputs[:, i, :, :].std()
# Averaging per-image statistics: the mean is exact (all images have the same
# number of pixels), but the std is only an approximation of the true pooled std.
mean.div_(len(dataset))
std.div_(len(dataset))
print(mean, std)

print("time elapsed: ", time() - before)

In computer vision, "Natural scene" has a specific meaning which isn’t related to nature vs man-made, see https://en.wikipedia.org/wiki/Natural_scene_perception

* Otherwise you run into optimization problems due to elongations in the loss function– see my answer here.

Answered By: crypdick

I wasn’t able to calculate the standard deviation as planned, but did it using the code below. The grayscale ImageNet train dataset’s mean and standard deviation are (round them as much as you like):

Mean: 0.44531356896770125

Standard Deviation: 0.2692461874154524

import os
import multiprocessing

import numpy as np
from imageio import imread  # assumes an RGB-ordered image reader

def calcSTD(d):
    # Accumulate the squared error against the precomputed global mean
    # over every grayscale pixel in one class folder.
    meanValue = 0.44531356896770125
    squaredError = 0
    numberOfPixels = 0
    for f in os.listdir("/home/imagenet/ILSVRC/Data/CLS-LOC/train/" + str(d) + "/"):
        if f.endswith(".JPEG"):
            image = imread("/home/imagenet/ILSVRC/Data/CLS-LOC/train/" + str(d) + "/" + str(f))

            ### Transform to gray if not already gray anyway
            if np.array(image).ndim == 3:
                matrix = np.array(image)
                red = matrix[:, :, 0] / 255    # channel order assumed RGB
                green = matrix[:, :, 1] / 255
                blue = matrix[:, :, 2] / 255
                gray = 0.2989 * red + 0.587 * green + 0.114 * blue
            else:
                gray = np.array(image) / 255
            ### ----------------------------------------------------

            squaredError += np.sum((gray - meanValue) ** 2)
            numberOfPixels += gray.size

    return (squaredError, numberOfPixels)

a_pool = multiprocessing.Pool()
folders = [f.name for f in os.scandir("/home/imagenet/ILSVRC/Data/CLS-LOC/train") if f.is_dir()]
resultStD = a_pool.map(calcSTD, folders)

StD = (sum(r[0] for r in resultStD) / sum(r[1] for r in resultStD)) ** 0.5
print(StD)

Source: https://stackoverflow.com/a/65717887/7156266

Answered By: Engr Ali

TL;DR

I believe the reason is, like many things in (deep) machine learning,
it just happens to work well.

Details

The word ‘normalization’ in statistics can apply to different transformations.
For example:

for all x in X: x -> (x - min(X)) / (max(X) - min(X))

will normalize and stretch the values of X to the [0..1] range.

Another example:

for all x in X: x -> (x - mean(X)) / stdv(X)

will transform the image to have mean = 0 and standard deviation = 1. This transformation is called the standard score, or sometimes standardization. If we multiply the result by sigma and add mu, the result will have mean = mu and stdv = sigma.
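As a toy illustration of the two definitions (a small sketch, not part of the original answer):

import torch

x = torch.tensor([2.0, 4.0, 6.0, 8.0])

# Min-max normalization: stretches the values to the [0, 1] range.
minmax = (x - x.min()) / (x.max() - x.min())
print(minmax)             # tensor([0.0000, 0.3333, 0.6667, 1.0000])

# Standard score (standardization): zero mean, unit standard deviation.
z = (x - x.mean()) / x.std()
print(z.mean(), z.std())  # ~0 and 1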

PyTorch doesn’t do either of these exactly. It applies the standard-score formula, but not with the mean and stdv of X (the image being normalized); instead, it uses the average mean and average stdv computed over a large set of ImageNet images. So it does not actually force the output to any particular mean and stdv.

If the image happens to have the same mean and standard deviation as the ImageNet averages, it will be transformed to have mean 0 and stdv 1.
Otherwise, it will be transformed into something that is a function of its own mean and stdv and of those averages.
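Concretely, torchvision's Normalize applies output[c] = (input[c] - mean[c]) / std[c] for each channel c; a quick sketch of that equivalence, using the ImageNet values from the question:

import torch
from torchvision import transforms

mean = [0.485, 0.456, 0.406]
std = [0.229, 0.224, 0.225]

img = torch.rand(3, 224, 224)  # a fake image already scaled to [0, 1]

normalized = transforms.Normalize(mean, std)(img)

# Equivalent manual computation: a per-channel standard score using the
# ImageNet statistics rather than the image's own statistics.
manual = (img - torch.tensor(mean).view(3, 1, 1)) / torch.tensor(std).view(3, 1, 1)
print(torch.allclose(normalized, manual))  # True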

To me it is not clear what this rigorously means (why an average of stdvs, and why apply the standard score with those averages rather than the image's own statistics?).

Perhaps someone can clarify this?

However, like many things in deep machine learning, the theory is not fully established yet. My guess is that people have tried different normalization and this one just happens to perform well.

It does not mean that this is the best possible normalization, only that it is a decent one. And of course, if you are using pre-trained weights that were learned with this specific normalization, you are probably better off using the same normalization for inference or for a derived model as was used during training.

Answered By: PolarBear2015