Jupyter: The kernel appears to have died. It will restart automatically. (Keras Related)
Question:
I’m trying to train a ResNet50, but it fails no matter what I do because the Jupyter notebook’s kernel dies ("The kernel appears to have died. It will restart automatically.") the moment training starts (Epoch 1/100). I have a GeForce GTX 1060 Ti, and when I run nvidia-smi during training (which lasts only about 1 second) I see only 80 MB of memory being allocated, much less than in the past, and then the kernel dies, as if it tries but fails.
Here are the requirements:
pandas==0.25.1
numpy==1.17.2
opencv-python==4.1.1.26
scikit-image==0.15.0
scikit-learn==0.21.3
tensorflow-gpu==1.14.0
Keras==2.2.5
matplotlib==3.1.1
Pillow==6.1.0
albumentations==0.3.2
tqdm==4.35.0
jupyter
which I satisfy. Here is how I set up the training session:
config = tf.ConfigProto()
config.gpu_options.allow_growth = False
config.gpu_options.per_process_gpu_memory_fraction = 0.9
sess = tf.Session(config=config)
keras.backend.set_session(sess)
keras.__version__
os.environ["CUDA_VISIBLE_DEVICES"] = '0' #yes, this is the ID of my GPU.
# create the FCN model
model_mobilenet = ResNet50(input_shape=(1024, 1024, 3), include_top=False) # use the Resnet
model_x8_output = Conv2D(128, (1, 1), activation='relu')(model_mobilenet.layers[-95].output)
model_x8_output = UpSampling2D(size=(8, 8))(model_x8_output)
model_x8_output = Conv2D(3, (3, 3), padding='same', activation='sigmoid')(model_x8_output)
MODEL_x8 = Model(inputs=model_mobilenet.input, outputs=model_x8_output)
MODEL_x8.compile(loss='binary_crossentropy', optimizer=Adam(lr=1e-3), metrics=[jaccard_distance])
MODEL_x8.fit_generator(train_generator, steps_per_epoch=300, epochs=100, verbose=1, validation_data=val_generator, validation_steps=10)
Epoch 1/100
1/300 [..............................] - ETA: 1:01:59 - loss: 0.7193 - jaccard_distance: 0.1125
I have tried:
- setting config.gpu_options.allow_growth to True,
- setting config.gpu_options.per_process_gpu_memory_fraction to other arbitrary values such as 0.1,
- commenting out:
#os.environ["CUDA_VISIBLE_DEVICES"] = 0
None of them worked. I appreciate constructive answers.
Thanks in advance.
EDIT: I have now tried running this as a script (not as a notebook), and the moment the TensorFlow session line is reached, the terminal prints the following:
2020-01-28 13:44:55.756819: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcudart.so.10.0'; dlerror: libcudart.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /home/username/ros_ws/devel/lib:/opt/ros/melodic/lib
2020-01-28 13:44:55.757047: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcublas.so.10.0'; dlerror: libcublas.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /home/username/ros_ws/devel/lib:/opt/ros/melodic/lib
2020-01-28 13:44:55.757313: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcufft.so.10.0'; dlerror: libcufft.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /home/username/ros_ws/devel/lib:/opt/ros/melodic/lib
2020-01-28 13:44:55.757526: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcurand.so.10.0'; dlerror: libcurand.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /home/username/ros_ws/devel/lib:/opt/ros/melodic/lib
2020-01-28 13:44:55.757736: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcusolver.so.10.0'; dlerror: libcusolver.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /home/username/ros_ws/devel/lib:/opt/ros/melodic/lib
2020-01-28 13:44:55.757940: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcusparse.so.10.0'; dlerror: libcusparse.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /home/username/ros_ws/devel/lib:/opt/ros/melodic/lib
2020-01-28 13:44:55.808416: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2020-01-28 13:44:55.808444: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1663] Cannot dlopen some GPU libraries. Skipping registering GPU devices...
which is strange because I don’t have CUDA 10 but 9.0, so it should not even be looking for these libraries. Is my TensorFlow version wrong?
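The version suffix in each "Could not dlopen library" message shows which CUDA release the installed TensorFlow binary expects. As a rough sketch (the helper name missing_cuda_libraries is hypothetical, and the sample text is shortened from the log above), the failed libraries and their versions can be pulled out like this:

```python
import re

# Matches the library name and version suffix in TensorFlow's
# "Could not dlopen library '<name>.so.<version>'" messages.
DLOPEN_FAILURE = re.compile(r"Could not dlopen library '(lib\w+)\.so\.([\d.]+)'")


def missing_cuda_libraries(log_text):
    """Return {library_name: version_suffix} for every failed dlopen in the log."""
    return dict(DLOPEN_FAILURE.findall(log_text))


# Shortened excerpt of the log lines quoted in the question.
sample_log = (
    "Could not dlopen library 'libcudart.so.10.0'; dlerror: libcudart.so.10.0: "
    "cannot open shared object file: No such file or directory\n"
    "Could not dlopen library 'libcublas.so.10.0'; dlerror: libcublas.so.10.0: "
    "cannot open shared object file: No such file or directory\n"
    "Successfully opened dynamic library libcudnn.so.7\n"
)

missing = missing_cuda_libraries(sample_log)
expected_cuda_versions = set(missing.values())
print(missing)
print(expected_cuda_versions)
```

If every missing library carries the same suffix (10.0 here), the TensorFlow build was compiled against that CUDA version, regardless of which CUDA version is actually installed.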
Answers:
Most probably this is because there is not enough memory to store the data/model. Your input image size is also 1024×1024. I would suggest training with a smaller image size such as 256 or even 128, just to see whether it works at all.
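To get a feel for why the input size matters, here is a back-of-the-envelope sketch. It assumes float32 values and looks only at a single image and the output of ResNet50's first convolution (stride 2, 64 filters); real training keeps many more activations plus gradients in memory, so treat the numbers as a lower bound. The helper name tensor_mib is hypothetical.

```python
def tensor_mib(height, width, channels, bytes_per_value=4):
    """Size of one dense tensor in MiB (float32 by default)."""
    return height * width * channels * bytes_per_value / 2**20


for side in (1024, 256, 128):
    conv1_side = side // 2  # the first ResNet50 convolution halves each spatial dimension
    input_mib = tensor_mib(side, side, 3)
    conv1_mib = tensor_mib(conv1_side, conv1_side, 64)
    print(f"{side}x{side}: input {input_mib} MiB, first conv output {conv1_mib} MiB per image")
```

Memory for these tensors grows with the square of the side length, so going from 1024 to 256 cuts each of them by a factor of 16.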
Also, is your GPU being detected by TF?
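One way to check this, sketched under the assumption that the internal tensorflow.python.client.device_lib module is importable in your TensorFlow version (the helper names gpu_names and detected_gpus are hypothetical):

```python
def gpu_names(devices):
    """Keep only the names of devices whose device_type is 'GPU'."""
    return [d.name for d in devices if d.device_type == "GPU"]


def detected_gpus():
    """List the GPUs TensorFlow can see; returns None if TensorFlow is not installed."""
    try:
        # Imported lazily so this file still loads without TensorFlow.
        from tensorflow.python.client import device_lib
    except ImportError:
        return None
    return gpu_names(device_lib.list_local_devices())
```

Calling detected_gpus() in the notebook and getting an empty list would mean TensorFlow only sees the CPU, which matches the "Skipping registering GPU devices" warning in the log.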
Okay, got it.
The problem was my tensorflow-gpu version (1.14), which was not compatible with my CUDA version (9.0). I had to install a version lower than 1.13. But that’s not the only catch: my cuDNN version (705) was also problematic, so I had to downgrade tensorflow-gpu all the way to 1.9.0.
Now everything works.
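For anyone hitting the same mismatch, the lookup can be sketched as a small table. The entries below are an excerpt written from memory of TensorFlow's "tested build configurations" for Linux GPU builds, so treat them as assumptions and verify them against the official table; the names TESTED_BUILDS and releases_for_cuda are hypothetical.

```python
# tensorflow-gpu release -> (CUDA version, cuDNN version) it was built and tested with.
# Excerpt only, reproduced from memory; check the official TensorFlow install docs.
TESTED_BUILDS = {
    "1.9.0": ("9.0", "7"),
    "1.12.0": ("9.0", "7"),
    "1.13.1": ("10.0", "7.4"),
    "1.14.0": ("10.0", "7.4"),
}


def releases_for_cuda(cuda_version):
    """Return the listed tensorflow-gpu releases built against the given CUDA version."""
    return [
        release
        for release, (cuda, _cudnn) in TESTED_BUILDS.items()
        if cuda == cuda_version
    ]


print(releases_for_cuda("9.0"))
print(releases_for_cuda("10.0"))
```

With CUDA 9.0 installed, only the releases below 1.13 in this excerpt qualify, which is consistent with the downgrade described above.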
In my case (Windows 10, RTX 3050 Ti GPU with 4 GB of VRAM), the "The kernel appears to have died" error was resolved by uninstalling CUDA 11 (and its matching cuDNN) and installing CUDA 10.1 (and cuDNN 2.2.0), as well as uninstalling tensorflow-gpu 2.3.0 and installing tensorflow-gpu 2.2.0. Python 3.8 worked for me even though the TensorFlow website lists this version as tested with Python 3.5, so I did not downgrade Python. However, I am not satisfied with the result, as my GPU takes too long to build models compared to my Intel Core i7 CPU.
In short, this error seems to be caused by a version incompatibility in the GPU stack, which can be fixed by downgrading CUDA and installing the cuDNN and tensorflow-gpu versions that match the new CUDA version.