Loading a Hugging Face model is taking too much memory

Question:

I am trying to load a large Hugging Face model with code like the following:

from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_from_disc = AutoModelForCausalLM.from_pretrained(path_to_model)
tokenizer_from_disc = AutoTokenizer.from_pretrained(path_to_model)
generator = pipeline("text-generation", model=model_from_disc, tokenizer=tokenizer_from_disc)

The program quickly crashes after the first line because it runs out of memory. Is there a way to load the model in chunks so that the program doesn't crash?


EDIT

See cronoik’s answer for the accepted solution, but here are the relevant pages from Hugging Face’s documentation:

Sharded Checkpoints: https://huggingface.co/docs/transformers/big_models#sharded-checkpoints:~:text=in%20the%20future.-,Sharded%20checkpoints,-Since%20version%204.18.0

Large Model Loading: https://huggingface.co/docs/transformers/main_classes/model#:~:text=the%20weights%20instead.-,Large%20model%20loading,-In%20Transformers%204.20.0
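For reference, here is a minimal sketch of what those two pages describe (path_to_sharded is a hypothetical output directory; the re-saving step has to run once on a machine with enough memory to hold the full model):

from transformers import AutoModelForCausalLM

# path_to_model is the checkpoint directory from the question;
# path_to_sharded is a hypothetical directory for the re-saved copy.
path_to_sharded = "./model-sharded"

# One-off step: re-save the checkpoint as shards of at most 2 GB each.
model = AutoModelForCausalLM.from_pretrained(path_to_model)
model.save_pretrained(path_to_sharded, max_shard_size="2GB")

# The sharded copy can then be loaded shard by shard; with device_map="auto"
# (requires the accelerate package) the weights are dispatched to the
# available devices instead of being fully materialised in CPU RAM first.
model_from_disc = AutoModelForCausalLM.from_pretrained(path_to_sharded, device_map="auto")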

Asked By: Bud Linville


Answers:

You could try to load it with low_cpu_mem_usage:

from transformers import AutoModelForCausalLM

model_from_disc = AutoModelForCausalLM.from_pretrained(path_to_model, low_cpu_mem_usage=True)

Please note that low_cpu_mem_usage requires Accelerate >= 0.9.0 and PyTorch >= 1.9.0.
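Plugged back into the snippet from the question, this would look roughly as follows (a sketch reusing path_to_model from above):

from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# Load the model with low_cpu_mem_usage=True, then build the pipeline as before.
model_from_disc = AutoModelForCausalLM.from_pretrained(path_to_model, low_cpu_mem_usage=True)
tokenizer_from_disc = AutoTokenizer.from_pretrained(path_to_model)
generator = pipeline("text-generation", model=model_from_disc, tokenizer=tokenizer_from_disc)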

Answered By: cronoik