Jupyter: The kernel appears to have died. It will restart automatically. (Keras Related)

Question:

I’m trying to train a ResNet50, but no matter what I do, the Jupyter notebook kernel dies ("The kernel appears to have died. It will restart automatically.") the moment training starts (Epoch 1/100). I have a GeForce GTX 1060 Ti. When I run nvidia-smi during training (which lasts only about one second), I see just 80 MB of GPU memory allocated, far less than in earlier runs, and then the kernel dies, as if it tries to start but fails.

Here are the requirements:

pandas==0.25.1
numpy==1.17.2
opencv-python==4.1.1.26
scikit-image==0.15.0
scikit-learn==0.21.3
tensorflow-gpu==1.14.0
Keras==2.2.5
matplotlib==3.1.1
Pillow==6.1.0
albumentations==0.3.2
tqdm==4.35.0
jupyter

which I satisfy. Here is how I set up the training session:

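import os
import tensorflow as tf
import keras
from keras.applications.resnet50 import ResNet50
from keras.layers import Conv2D, UpSampling2D
from keras.models import Model
from keras.optimizers import Adam

config = tf.ConfigProto()
config.gpu_options.allow_growth = False
config.gpu_options.per_process_gpu_memory_fraction = 0.9
sess = tf.Session(config=config)
keras.backend.set_session(sess)

keras.__version__
os.environ["CUDA_VISIBLE_DEVICES"] = '0' #yes, this is the ID of my GPU.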

# create the FCN model
model_mobilenet = ResNet50(input_shape=(1024, 1024, 3), include_top=False) # use the Resnet
model_x8_output = Conv2D(128, (1, 1), activation='relu')(model_mobilenet.layers[-95].output)
model_x8_output = UpSampling2D(size=(8, 8))(model_x8_output)
model_x8_output = Conv2D(3, (3, 3), padding='same', activation='sigmoid')(model_x8_output)
MODEL_x8 = Model(inputs=model_mobilenet.input, outputs=model_x8_output)

MODEL_x8.compile(loss='binary_crossentropy', optimizer=Adam(lr=1e-3), metrics=[jaccard_distance])

MODEL_x8.fit_generator(train_generator, steps_per_epoch=300, epochs=100, verbose=1, validation_data=val_generator, validation_steps=10)

Epoch 1/100
  1/300 [..............................] - ETA: 1:01:59 - loss: 0.7193 - jaccard_distance: 0.1125

I have tried:

  • setting config.gpu_options.allow_growth to True
  • setting config.gpu_options.per_process_gpu_memory_fraction to other arbitrary values, such as 0.1
  • commenting out the line: #os.environ["CUDA_VISIBLE_DEVICES"] = '0'

None of these worked. I would appreciate constructive answers.

Thanks in advance.

EDIT: I have now tried running this as a script (not as a notebook), and as soon as the TensorFlow session is created, the terminal prints the following:

2020-01-28 13:44:55.756819: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcudart.so.10.0'; dlerror: libcudart.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /home/username/ros_ws/devel/lib:/opt/ros/melodic/lib
2020-01-28 13:44:55.757047: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcublas.so.10.0'; dlerror: libcublas.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /home/username/ros_ws/devel/lib:/opt/ros/melodic/lib
2020-01-28 13:44:55.757313: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcufft.so.10.0'; dlerror: libcufft.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /home/username/ros_ws/devel/lib:/opt/ros/melodic/lib
2020-01-28 13:44:55.757526: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcurand.so.10.0'; dlerror: libcurand.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /home/username/ros_ws/devel/lib:/opt/ros/melodic/lib
2020-01-28 13:44:55.757736: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcusolver.so.10.0'; dlerror: libcusolver.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /home/username/ros_ws/devel/lib:/opt/ros/melodic/lib
2020-01-28 13:44:55.757940: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcusparse.so.10.0'; dlerror: libcusparse.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /home/username/ros_ws/devel/lib:/opt/ros/melodic/lib
2020-01-28 13:44:55.808416: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2020-01-28 13:44:55.808444: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1663] Cannot dlopen some GPU libraries. Skipping registering GPU devices...

which is strange because I have CUDA 9.0, not CUDA 10, so TensorFlow should not be looking for these libraries at all. Is my TensorFlow version wrong?
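One way to confirm which of these libraries the dynamic loader can actually find, independently of TensorFlow, is to try opening them with ctypes. This is only a minimal sketch, assuming Linux and the library names from the log above; the helper name find_missing_libs is made up for illustration.

```python
import ctypes

# Library names copied from the log above. The ".10.0" suffix is what
# tensorflow-gpu 1.14 asked for on this machine.
CUDA_LIBS = [
    "libcudart.so.10.0",
    "libcublas.so.10.0",
    "libcufft.so.10.0",
    "libcurand.so.10.0",
    "libcusolver.so.10.0",
    "libcusparse.so.10.0",
    "libcudnn.so.7",
]


def find_missing_libs(names):
    """Return the library names that the dynamic loader cannot open.

    Hypothetical helper, not part of TensorFlow or CUDA.
    """
    missing = []
    for name in names:
        try:
            ctypes.CDLL(name)  # raises OSError if the library cannot be loaded
        except OSError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    for name in find_missing_libs(CUDA_LIBS):
        print("cannot load:", name)
```

If the libraries exist on disk but are still reported as missing, the LD_LIBRARY_PATH shown in the log is the first thing to check, since it lists only the ROS directories.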

Asked By: Schütze

Answers:

Most likely this is because there is not enough memory to hold the data and the model. Your input image size is 1024×1024, which is large. I would suggest training with a smaller image size, such as 256 or even 128, just to see whether it works at all.
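To get a feel for how much the image size matters, here is a rough back-of-the-envelope sketch. It only computes the size of a single float32 input tensor, and the helper name tensor_mib is made up for illustration; the real footprint of the network is much larger, but convolutional activations scale with height × width in the same way.

```python
def tensor_mib(height, width, channels, batch=1, bytes_per_value=4):
    """Size in MiB of a dense tensor (float32 by default). Hypothetical helper."""
    return batch * height * width * channels * bytes_per_value / 2**20


if __name__ == "__main__":
    base = tensor_mib(1024, 1024, 3)
    for side in (1024, 256, 128):
        size = tensor_mib(side, side, 3)
        # The ratio shows how much smaller each per-image tensor becomes.
        print(side, size, "MiB,", base / size, "x smaller than 1024")
```

Going from 1024 to 256 cuts the per-image tensor size by a factor of 16, which is why a small test size is a cheap way to rule out an out-of-memory problem.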

Also, is your GPU being detected by TF?
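A quick way to check this is to ask TensorFlow for its list of local devices. The sketch below assumes a TensorFlow 1.x install like the one in the question; the wrapper name list_gpu_devices is made up for illustration, and it returns None when TensorFlow cannot be imported at all.

```python
def list_gpu_devices():
    """Return names of GPU devices TensorFlow can see.

    Returns None if TensorFlow is not importable. Hypothetical wrapper
    around tensorflow.python.client.device_lib.
    """
    try:
        from tensorflow.python.client import device_lib
    except ImportError:
        return None
    return [
        d.name
        for d in device_lib.list_local_devices()
        if d.device_type == "GPU"
    ]


if __name__ == "__main__":
    # An empty list means TensorFlow loaded but registered no GPU.
    print(list_gpu_devices())
```

An empty list here would match the "Skipping registering GPU devices" warning in the question's log.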

Answered By: Rishabh Sahrawat

Okay, got it.

The problem was that my tensorflow-gpu version (1.14) was not compatible with my CUDA version (9.0). I had to install a version lower than 1.13. But that was not the only catch: my cuDNN version (7.0.5) was also a problem, so I had to downgrade tensorflow-gpu all the way to 1.9.0.

Now everything works.
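The mismatch can be written down as a small lookup. This is only a sketch: the pairings below cover just the versions mentioned in this thread, as I understand TensorFlow's tested build configurations, and the names REQUIRED_CUDA, required_cuda and is_compatible are made up for illustration. Check the official table before relying on it.

```python
# TensorFlow (major.minor) -> CUDA version its GPU build expects.
# Assumption: limited to versions discussed in this thread; verify
# against TensorFlow's tested build configurations.
REQUIRED_CUDA = {
    "1.9": "9.0",
    "1.12": "9.0",
    "1.13": "10.0",
    "1.14": "10.0",
}


def required_cuda(tf_version):
    """Return the expected CUDA version, or None if the version is not listed."""
    major_minor = ".".join(tf_version.split(".")[:2])
    return REQUIRED_CUDA.get(major_minor)


def is_compatible(tf_version, installed_cuda):
    """True only if the version is listed and matches the installed CUDA."""
    needed = required_cuda(tf_version)
    return needed is not None and needed == installed_cuda


if __name__ == "__main__":
    print(is_compatible("1.14.0", "9.0"))  # the failing setup from the question
    print(is_compatible("1.9.0", "9.0"))   # the setup that worked
```

The cuDNN version has to match as well, which this sketch does not model.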

Answered By: Schütze

In my case (Windows 10, RTX 3050 Ti GPU with 4 GB of VRAM), the "The kernel appears to have died" error was resolved by uninstalling CUDA 11 (and its matching cuDNN) and installing CUDA 10.1 (and cuDNN 2.2.0), as well as uninstalling tensorflow-gpu 2.3.0 and installing tensorflow-gpu 2.2.0. Python 3.8 worked for me even though the TensorFlow website lists this build as tested with Python 3.5, so I did not downgrade Python. However, I am not satisfied with the result, because my GPU takes much longer to build models than my Intel Core i7 CPU.

In short, this error seems to be caused by an incompatibility between the GPU and the CUDA version, which can be fixed by downgrading CUDA and installing the cuDNN and TensorFlow versions that match the new CUDA version.

Answered By: Hamed Sabagh