ONNX Operators for Regex Replacements

Question:

How can I use ONNX operators to do string replacements with regular expressions?

I am trying to export a scikit-learn machine learning pipeline to the Open Neural Network Exchange (ONNX) format. The pipeline takes text as input. Many of the pipeline steps are already covered by the standard, such as a TfIdfVectorizer and a TruncatedSVD transformer. However, the first pipeline step is a custom transformer that makes a set of changes to the input text using regular expressions.

For a custom transformer, the sklearn-onnx docs suggest writing a custom shape function and a custom converter function. The converter function in particular must be built by combining predefined operators from the ONNX standard. However, from what I can tell, it is not possible to do even basic string manipulation with the operators that exist.

One of the regular expression powered replacements that I want to make is a unit conversion, for example:

12m -> 12 meters

With Python’s re module this is trivial:

import re

my_string = "The Empire State Building is 443m tall."

meters_pattern = re.compile("(?<=[0-9])m ")
my_transformed_string = re.sub(meters_pattern, " meters ", my_string)

>>> print(my_transformed_string)
The Empire State Building is 443 meters tall.

However, I cannot conceive of a way to do this with the available ONNX operators. Here’s what I’ve thought to try:

  1. Use a regular expression operator in a similar manner to the Python example above.

Problem: ONNX does not have a regex operator.

  2. Evaluate the input string sequentially, one character at a time. If an "m" follows a digit, change the string as described above.

Problem: This approach requires a comparison of strings: does "this character in the string" equal "m"? However, the existing OnnxEqual operator does not support string comparison.

  3. Translate the input string, character by character, to its ASCII decimal equivalent and then perform step 2.

Problem: ONNX does not have a translate-like operator (like GNU tr) for strings. ONNX also does not support casting non-strictly numeric strings with OnnxCast.

  4. Use the OnnxUnique operator and its inverse_indices output to translate the input string to something approximating each character’s ASCII decimal value.

Problem: This requires prepending a key string \t\n\r !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~ to the beginning of the input string (so that the numerical values found by OnnxUnique's inverse_indices output have a consistent definition) and splitting the input string into a sequence of tensors of one character each. Unfortunately, OnnxSplit errors when trying to split a string tensor (see code example below), and OnnxSequenceInsert does not concatenate strings into a single-element tensor; it only combines a sequence of single-element tensors into one tensor with multiple elements.

import re
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from skl2onnx import to_onnx, update_registered_converter
from skl2onnx.common.data_types import StringTensorType
from skl2onnx.algebra.onnx_ops import OnnxSplit, OnnxConstant
from onnxruntime import InferenceSession

class MyTransformer(BaseEstimator, TransformerMixin):
    def fit_transform(self, X, y=None):
        return re.sub("(?<=[0-9])m ", " meters ", X)

def shape_function(operator):
    input = StringTensorType([1])
    output = StringTensorType([None, 1])
    operator.inputs[0].type = input
    operator.outputs[0].type = output

def converter_function(scope, operator, container):
    op = operator.raw_operator
    opv = container.target_opset
    out = operator.outputs

    X = operator.inputs[0]

    one_tensor = OnnxConstant(value_int=1, op_version=opv)
    string_tensor = OnnxConstant(value_strings=["ab"], op_version=opv)
    string_split_tensor = OnnxSplit(string_tensor, one_tensor, op_version=opv, output_names=out[:1])

    string_split_tensor.add_to(scope, container)

update_registered_converter(MyTransformer, "MyTransformer", shape_function, converter_function)
my_transformer = MyTransformer()
onnx_model = to_onnx(my_transformer, initial_types=[["X", StringTensorType([None, 1])]])

test_string = "The Empire State Building is 443m tall."
sess = InferenceSession(onnx_model.SerializeToString())
output = sess.run(None, {"X": np.array([test_string])})

Yields:

2022-08-16 12:35:46.235861185 [W:onnxruntime:, graph.cc:106 MergeShapeInfo] Error merging shape info for output. 'variable' source:{1} target:{,1}. Falling back to lenient merge.
2022-08-16 12:35:46.237767860 [E:onnxruntime:, inference_session.cc:1530 operator()] Exception during initialization: /onnxruntime_src/onnxruntime/core/optimizer/optimizer_execution_frame.cc:75 onnxruntime::OptimizerExecutionFrame::Info::Info(const std::vector<const onnxruntime::Node*>&, const InitializedTensorSet&, const onnxruntime::Path&, const onnxruntime::IExecutionProvider&, const std::function<bool(const std::__cxx11::basic_string<char>&)>&) [ONNXRuntimeError] : 2 : INVALID_ARGUMENT : string tensor can not use pre-allocated buffer
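
For reference, ideas 2 through 4 above all aim at the same character-level logic. Outside of ONNX, that logic can be sketched in plain Python as follows. This is only an illustration of the intended behaviour, not an ONNX graph; the function name replace_meters is made up for this sketch.

```python
def replace_meters(text: str) -> str:
    """Replace 'm ' with ' meters ' wherever the 'm' directly follows a digit."""
    # Idea 3: work on integer code points instead of characters.
    codes = [ord(ch) for ch in text]
    out = []
    i = 0
    while i < len(codes):
        # Is the previous character an ASCII digit?
        follows_digit = i > 0 and ord("0") <= codes[i - 1] <= ord("9")
        # Is the current character 'm' and the next one a space?
        is_m_space = (
            codes[i] == ord("m")
            and i + 1 < len(codes)
            and codes[i + 1] == ord(" ")
        )
        if follows_digit and is_m_space:
            out.append(" meters ")
            i += 2  # skip the 'm' and the space that were replaced
        else:
            out.append(chr(codes[i]))
            i += 1
    return "".join(out)


print(replace_meters("The Empire State Building is 443m tall."))
# The Empire State Building is 443 meters tall.
```

Each step here (integer comparison, indexing, concatenation) is the kind of operation that would need an ONNX equivalent operating on string tensors.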

How is one to properly manipulate strings with the available ONNX operators?

Asked By: NolantheNerd


Answers:

I asked the ONNX developers this question and, as of August 2022, it simply is not possible to perform regex replacements with ONNX operators. See the full thread here: https://github.com/onnx/onnx/issues/4450
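
One possible workaround, which is my own suggestion and not taken from the linked thread, is to keep the regex step outside the exported graph: run it in ordinary Python and feed the cleaned text to the ONNX model. A minimal sketch, assuming the rest of the pipeline has been exported without the custom transformer; the helper name preprocess, the model file name, and the input name "X" are placeholders:

```python
import re

# Same pattern as in the question: an "m " that directly follows a digit.
METERS_PATTERN = re.compile(r"(?<=[0-9])m ")


def preprocess(texts):
    """Apply the regex replacement to every string before ONNX inference."""
    return [METERS_PATTERN.sub(" meters ", text) for text in texts]


cleaned = preprocess(["The Empire State Building is 443m tall."])
print(cleaned)
# ['The Empire State Building is 443 meters tall.']

# Hypothetical next step (assumes an exported model file and an input named "X"):
# import numpy as np
# from onnxruntime import InferenceSession
# sess = InferenceSession("pipeline_without_regex_step.onnx")
# output = sess.run(None, {"X": np.array(cleaned).reshape(-1, 1)})
```

The trade-off is that the preprocessing has to be reimplemented in every runtime that consumes the model, so the ONNX file is no longer a complete description of the pipeline.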

Answered By: NolantheNerd