Preserve letter order when replacing LTR chars with RTL chars in a word at byte level

Question:

I have a Hebrew word "יתꢀראꢁ" which needs to be "בראשית". To correct it, I am encoding the string and then replacing characters. The replacement works; however, since I am replacing LTR characters with RTL characters, the order gets jumbled.

data="יתꢀראꢁ".encode("unicode_escape")
data=data.replace(b"ua880", b"u05e9")
data=data.replace(b"ua881", b"u05d1")
data=data.decode("unicode_escape")
print(data)

Instead of "בראשית" I get "יתשראב".
Replacing chars at the byte level is my only option. How do I preserve the order after the replacement?

EDIT: The garbage text comes from here https://777codes.com/newtestament/gen1.html after a scrape. While I understand it is best to avoid fixing this kind of mess, scraping and replacing the missing chars seems to be the only solution. My sample is the first word on that page. Any suggestion on how to get the Hebrew text correctly with a straight scrape is most welcome, but I doubt this is possible. The garbage in this case consists of placeholder chars which are rendered correctly by WOFF fonts.

Asked By: ShaneO


Answers:

Analysis

Let’s first look at the data in a form that will be unambiguous and that can be followed by English readers:

>>> import unicodedata
>>> data="יתꢀראꢁ"
>>> [unicodedata.name(c).split()[-1] for c in data]
['YOD', 'TAV', 'ANUSVARA', 'RESH', 'ALEF', 'VISARGA']

Here, the 'ANUSVARA' and 'VISARGA' are the placeholder characters, which have a left-to-right text order; the others are Hebrew and have a right-to-left text order. For the sake of clarity, let’s use those names (and a couple more) to define some single-character constants:

YOD, TAV, ANUSVARA, RESH, ALEF, VISARGA = data
SHIN = 'ש'
BET = 'ב'

We seek to replace ANUSVARA with SHIN and VISARGA with BET. However, there is a complication: while the logical order of the original characters is YOD, TAV, ANUSVARA, RESH, ALEF, VISARGA, they display on screen left to right as TAV, YOD, ANUSVARA, ALEF, RESH, VISARGA – that is, with each Hebrew segment reversed, because Hebrew is written right-to-left.

We want the resulting text to appear, left to right, as TAV, YOD, SHIN, ALEF, RESH, BET. Since it will be all Hebrew text, the actual order of the characters should be reversed completely: BET, RESH, ALEF, SHIN, YOD, TAV.
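One way to check the directionality described above, without relying on how a terminal or browser renders the text, is to ask `unicodedata` for each character's bidirectional class. This is only a minimal sketch: the helper name `bidi_classes` is made up for illustration, and the word is spelled with escapes so the storage order is unambiguous.

```python
import unicodedata

# The scraped word in storage order: YOD, TAV, ANUSVARA, RESH, ALEF, VISARGA.
data = "\u05d9\u05ea\ua880\u05e8\u05d0\ua881"

def bidi_classes(text):
    """Return the Unicode bidirectional class of each character (hypothetical helper)."""
    return [unicodedata.bidirectional(c) for c in text]

classes = bidi_classes(data)
print(classes)
```

Here 'R' marks the right-to-left Hebrew letters and 'L' the left-to-right placeholders, which is why the display order differs from the storage order.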

Approach

Conceptually, we need to take these steps:

YOD, TAV, ANUSVARA, RESH, ALEF, VISARGA

Split the text into LTR and RTL components:

(YOD, TAV), (ANUSVARA,), (RESH, ALEF), (VISARGA,)

Replace the placeholder LTR components with new RTL ones:

(YOD, TAV), (SHIN,), (RESH, ALEF), (BET,)

Reverse the order of the components:

(BET,), (RESH, ALEF), (SHIN,), (YOD, TAV)

Join up the string:

BET, RESH, ALEF, SHIN, YOD, TAV

To split the string, we can use regex:

>>> import re
>>> pattern = re.compile(rf'({re.escape(ANUSVARA)}|{re.escape(VISARGA)})')
>>> parts = pattern.split(data)

The parts will have an empty string at the end; this is of no consequence. Note the capturing group used in the regex: this makes the actual "split" delimiters appear in the parts (otherwise we would only get the Hebrew parts).
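To make the effect of the capturing group concrete, this small sketch compares the result of splitting with and without it. The names `with_group` and `without_group` exist only for this illustration, and the characters are written as escapes.

```python
import re

data = "\u05d9\u05ea\ua880\u05e8\u05d0\ua881"
ANUSVARA, VISARGA = "\ua880", "\ua881"

alternatives = f"{re.escape(ANUSVARA)}|{re.escape(VISARGA)}"

# With a capturing group, the delimiters are kept as their own list items.
with_group = re.compile(f"({alternatives})").split(data)

# Without it, the placeholders are discarded and only the Hebrew runs remain.
without_group = re.compile(alternatives).split(data)

print(with_group)
print(without_group)
```

Both lists end with an empty string because the word ends with a delimiter.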

The overall solution fits into a one-liner:

>>> ''.join(
...     SHIN if c == ANUSVARA else BET if c == VISARGA else c
...     for c in reversed(pattern.split(data))
... )
'בראשית'

The idea is that we use a generator expression to iterate over the reversed components, making substitutions as we go. This feeds into ''.join to join the components back together. Since we are replacing entire components, we don’t use .replace; we have extracted e.g. the ANUSVARA as a separate string by itself, so we do an equality check and conditionally replace with SHIN.

Generalization

To create the pattern for more LTR placeholders, build the regex pattern procedurally. We need a regex-escaped (for robustness) version of each literal that we’re searching for, separated by | and surrounded in parentheses, thus:

def any_literal(candidates):
    """Build a regex that matches any of the candidates as literal text."""
    alternatives = '|'.join(re.escape(c) for c in candidates)
    return re.compile(f'({alternatives})')

To do multiple replacements, build a dictionary:

replacements = {ANUSVARA: SHIN, VISARGA: BET}

and use dictionary lookup for the replacement, defaulting to the original value (i.e., for things which aren’t placeholders, replace them with themselves):

def fix_hebrew_with_placeholders(text, replacements):
    splitter = any_literal(replacements.keys())
    return ''.join(
        replacements.get(c, c)
        for c in reversed(splitter.split(text))
    )

Testing it:

>>> fix_hebrew_with_placeholders(data, {ANUSVARA: SHIN, VISARGA: BET})
'בראשית'
>>> fix_hebrew_with_placeholders(data, {ANUSVARA: SHIN, VISARGA: BET})[0]
'ב'
Answered By: Karl Knechtel

My take on the problem:

The website the string is taken from displays the Hebrew text as LTR text. Its font takes glyphs from the Alphabetic Presentation Forms block and assigns them to code points in the Saurashtra block. So we have an overall LTR orientation and a mixed-script word in which the Hebrew characters are stored in logical order within substrings interspersed with LTR characters.

My approach is to:

  1. Use a regex pattern to split the string on the Saurashtra characters, which should leave the Hebrew characters in logical order within each substring.
  2. Reverse the list so that the substrings are in logical order (rather than reordering individual characters).
  3. Replace the Saurashtra characters with the correct Hebrew characters.

There is the added complication that you are working with RTL text, so it is necessary to work in an environment that has bidirectional support, such as Jupyter Lab: rendering is handled by the web browser, and the Unicode Bidirectional Algorithm (UBA) is well supported in web browsers. Printing the string in a terminal or console may display RTL strings incorrectly, complicating the analysis.
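Because rendering can mislead, it may help to inspect the stored order directly instead of trusting the display. A minimal sketch, where `describe` is a hypothetical helper name, lists each code point and its Unicode name in storage order:

```python
import unicodedata

def describe(text):
    """List (code point, name) pairs in storage order (hypothetical helper)."""
    return [(f"U+{ord(c):04X}", unicodedata.name(c, "<unnamed>")) for c in text]

# The scraped word, written with escapes to avoid any display ambiguity.
data = "\u05d9\u05ea\ua880\u05e8\u05d0\ua881"

for code_point, name in describe(data):
    print(code_point, name)
```

The same helper can be applied to the corrected string to confirm the final order independently of the environment.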

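import regex
data = "יתꢀראꢁ"
# Split on Saurashtra characters (kept by the capturing group), then reverse the pieces.
data = "".join(regex.split(r'(\p{Saurashtra})', data)[::-1])
# Replace each placeholder with the Hebrew letter plus its combining mark.
data = data.replace("\ua880", "\u05E9\u05C1").replace("\ua881", "\u05D1\u05BC")
# Result: U+05D1 U+05BC U+05E8 U+05D0 U+05E9 U+05C1 U+05D9 U+05EA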

Further testing is needed to confirm that the same patterns occur throughout the rest of the document.

Note that the chained str.replace() calls should be rewritten, since you actually need to handle 23 different replacements. Possibly:

import regex
data = "יתꢀראꢁ"
data = "".join(regex.split(r'(\p{Saurashtra})', data)[::-1])
# data = data.replace("\ua880", "\u05E9\u05C1").replace("\ua881", "\u05D1\u05BC")
substitutions = {"\ua880": "\u05E9\u05C1", "\ua881": "\u05D1\u05BC"}
# Apply each placeholder-to-Hebrew substitution in turn.
for char, replacement in substitutions.items():
    data = data.replace(char, replacement)
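As an alternative sketch that avoids the third-party `regex` module, the same steps can be written with the standard library. It assumes every placeholder falls in the Saurashtra block (U+A880 to U+A8DF), so a plain character class stands in for the script property, and it uses `str.translate` to apply all substitutions in one pass. The function name `fix_word` is hypothetical, and only the two mappings shown above are included.

```python
import re

# Assumption: every placeholder lies in the Saurashtra block, U+A880..U+A8DF.
SAURASHTRA = re.compile("([\ua880-\ua8df])")

substitutions = {"\ua880": "\u05E9\u05C1", "\ua881": "\u05D1\u05BC"}
# str.maketrans accepts a dict from single characters to replacement strings.
table = str.maketrans(substitutions)

def fix_word(word):
    """Reverse the split components, then swap placeholders for Hebrew."""
    reordered = "".join(SAURASHTRA.split(word)[::-1])
    return reordered.translate(table)

fixed = fix_word("\u05d9\u05ea\ua880\u05e8\u05d0\ua881")
print([f"U+{ord(c):04X}" for c in fixed])
```

Using a translation table means the loop over substitutions disappears, and no replacement can accidentally be re-replaced by a later one.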

My current hypothesis is that the original document was written with the Ezra SIL font and converted to a PDF file with an embedded, subsetted font (PMPFGN+EzraSIL), and that the text and font for the website were then extracted from the PDF.

Edited:

A more complete substitution matching the webfont:

substitutions = {
    "\ua880": "\u05E9\u05C1", "\ua881": "\u05D1\u05BC", "\ua882": "\u05DD", "\ua883": "\u05DC\u05B9",
    "\ua884": "\u05E9\u05BC\u05C1", "\ua885": "\u05E5", "\ua886": "\u05D5\u05BC", "\ua887": "\u05DA\u05B0",
    "\ua888": "\u05E4\u05BC", "\ua889": "\u05D5\u05B9", "\ua88A": "\u05DE\u05BC", "\ua88B": "\u05D9\u05BC",
    "\ua88C": "\u05DB\u05BC", "\ua88D": "\u05D3\u05BC", "\ua88E": "\u05DF", "\ua88F": "\u05E9\u05C2",
    "\ua890": "\u05EA\u05BC", "\ua891": "\u05E7\u05BC", "\ua892": "\u05DC\u05BC", "\ua893": "\u05D2\u05BC",
    "\ua894": "\u05E3", "\ua895": "\u05E0\u05BC", "\ua896": "\u05D4\u05BC"
}
Answered By: Andj