Does converting a seq2seq NLP model to the ONNX format negatively affect its performance?
Question:
I was looking at converting an NLP model to the ONNX format in order to take advantage of the speed increase from ONNX Runtime. However, I don't really understand what fundamentally changes in the converted model compared to the original. I also don't know whether there are any drawbacks. Any thoughts on this would be very appreciated.
Answers:
The accuracy of the model will be the same (considering only the output of the encoder and decoder). Inference performance may vary based on the decoding method you use (e.g. greedy search, beam search, top-k & top-p sampling); see the linked source for more on this.
For an ONNX seq2seq model, you need to implement the model.generate() method by hand. The onnxt5 library has done a good job of implementing greedy search for ONNX models. However, most generative NLP models yield better results with beam search (you can refer to the linked source for how Hugging Face implemented beam search for their models). Unfortunately, for ONNX models you have to implement it yourself.
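To illustrate what "implementing generate() by hand" means, here is a minimal greedy-search loop. This is only a sketch: `decoder_step` is a hypothetical stand-in for a call into an ONNX Runtime decoder session (a real implementation would also feed encoder hidden states and cached past key/values):

```python
def greedy_decode(decoder_step, bos_id, eos_id, max_len=32):
    """Repeatedly pick the highest-scoring next token until EOS or max_len.

    `decoder_step` maps the tokens generated so far to a list of logits for
    the next token; with ONNX Runtime it would wrap a session.run() call
    on the exported decoder.
    """
    tokens = [bos_id]
    for _ in range(max_len):
        logits = decoder_step(tokens)
        next_id = max(range(len(logits)), key=logits.__getitem__)  # argmax
        tokens.append(next_id)
        if next_id == eos_id:
            break
    return tokens

# Toy stand-in decoder: always scores token (last_id + 1) highest, EOS = 3.
def toy_step(tokens):
    logits = [0.0] * 4
    logits[min(tokens[-1] + 1, 3)] = 1.0
    return logits

print(greedy_decode(toy_step, bos_id=0, eos_id=3))  # [0, 1, 2, 3]
```

Beam search follows the same skeleton but keeps the top-k partial sequences at every step instead of a single one, which is why it is noticeably more work to implement by hand.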
The inference speed definitely increases, as shown in this notebook by onnxruntime (the example is on BERT).
You have to run the encoder and decoder separately on ONNX Runtime to take advantage of it. If you want to know more about ONNX and its runtime, refer to this link.
Update: you can refer to the fastT5 library; it implements both greedy and beam search for T5. For BART, have a look at this issue.
Advantages of going from the PyTorch eager world to ONNX include:
- ONNX Runtime is much lighter than PyTorch.
- General and transformer-specific optimizations and quantization from ONNX Runtime can be leveraged.
- ONNX makes it easy to target many backends, notably through the many execution providers supported in ONNX Runtime, from TensorRT to OpenVINO to TVM. Some of them are top notch for inference speed on CPU/GPU.
- For some specific seq2seq architectures (GPT-2, BART, T5), ONNX Runtime supports native BeamSearch and GreedySearch operators: https://github.com/microsoft/onnxruntime/tree/main/onnxruntime/python/tools/transformers/models . These allow you to avoid PyTorch's generate() method, but at the cost of less flexibility.
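Regarding the execution-provider point above: provider choice is just a `providers` list passed when creating a session, and a small helper can fall back gracefully when a preferred provider is not installed. A sketch (the commented session call assumes `onnxruntime` is installed and a file named `model.onnx` exists):

```python
def pick_providers(preferred, available):
    """Keep the preferred ONNX Runtime execution providers that are
    actually available, in preference order; fall back to CPU."""
    chosen = [p for p in preferred if p in available]
    return chosen or ["CPUExecutionProvider"]

# With onnxruntime installed, usage would look like:
# import onnxruntime as ort
# providers = pick_providers(
#     ["TensorrtExecutionProvider", "CUDAExecutionProvider"],
#     ort.get_available_providers(),
# )
# session = ort.InferenceSession("model.onnx", providers=providers)

print(pick_providers(["CUDAExecutionProvider"], ["CPUExecutionProvider"]))
# ['CPUExecutionProvider']
```

This is handy because the same script can then run unchanged on a GPU box (picking TensorRT or CUDA) and on a CPU-only machine.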
A decent compromise / alternative to fastT5, with more flexibility, is to export the encoder and decoder parts of the model separately, run them with ONNX Runtime, but use PyTorch to handle generation. This is exactly what is implemented in the ORTModelForSeq2SeqLM class from the Optimum library:
from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForSeq2SeqLM
tokenizer = AutoTokenizer.from_pretrained("t5-small")
# instead of: `model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")`
# the argument `from_transformers=True` handles the ONNX export on the fly.
model = ORTModelForSeq2SeqLM.from_pretrained("t5-small", from_transformers=True, use_cache=True)
inputs = tokenizer("Translate English to German: Is this model actually good?", return_tensors="pt")
gen_tokens = model.generate(**inputs, use_cache=True)
outputs = tokenizer.batch_decode(gen_tokens)
print(outputs)
# prints: ['<pad> Ist dieses Modell tatsächlich gut?</s>']
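Since the claim above is that accuracy is unchanged by the export, one way to build confidence is to compare logits from the PyTorch model and the exported model on the same inputs. A tolerance check on flattened logits is enough; in this sketch the two lists are hard-coded stand-ins for what a PyTorch forward pass and an ONNX Runtime session.run() would return:

```python
def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two flat logit lists."""
    assert len(a) == len(b), "logit shapes must match"
    return max(abs(x - y) for x, y in zip(a, b))

# Exported models typically match PyTorch within a small float tolerance:
pt_logits = [0.11, -2.30, 5.42]
onnx_logits = [0.11, -2.30, 5.42001]
print(max_abs_diff(pt_logits, onnx_logits) < 1e-4)  # True
```

If the difference stays within a few units of float rounding error, greedy/beam search over the two models will almost always pick the same tokens.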
As a side note, PyTorch will introduce official support for TorchDynamo in PyTorch 2.0, which is in my opinion a strong competitor to the ONNX + ONNX Runtime deployment path. I personally believe that PyTorch/XLA plus a good TorchDynamo backend will rock it for generation.
Disclaimer: I am a contributor to the Optimum library