How can I match a pattern, and then everything up to that pattern again? So, match all the words and acronyms in my example below

Question

Context

I have the following paragraph:

text = """
בביהכנ"ס - בבית הכנסת דו"ח - דין וחשבון הת"ד -  התיקוני דיקנא
בגו"ר  - בגשמיות ורוחניות ה"א - ה' אלוקיכם התמי' - התמיהה
בהנ"ל - בהנזכר לעיל ה"א - ה' אלקיך ואח"כ - ואחר כך
בהשי״ת - בהשם יתברך ה"ה - הרי הוא / הוא הדין ואת"ה - ואיגוד תלמידי 
"""

This paragraph combines Hebrew words and their acronyms.

A word contains quotation marks (").

So for example, some words would be:

[
    'בביהכנ"ס',
     'דו"ח',
     'הת"ד'
 ]

Now, I’m able to match all the words with this regex:

(\b[\u05D0-\u05EA]*"\b[\u05D0-\u05EA]*\b)

Question

But how can I also match all the corresponding acronyms as a separate group? (the acronyms are what’s not matched, so not the green in the picture).

Example acronyms are:

[
    'בבית הכנסת',
    'דין וחשבון',
    'התיקוני דיקנא'
]

Expected output

The expected output should be a dictionary with the Words as keys and the Acronyms as values:

{
    'בביהכנ"ס': 'בבית הכנסת',
    'דו"ח': 'דין וחשבון',
    'הת"ד': 'התיקוני דיקנא'
}

My attempt

What I tried was to match all the words (as above picture):

(\b[\u05D0-\u05EA]*"\b[\u05D0-\u05EA]*\b)

and then match everything until the pattern appears again with .*\1, so the entire regex would be:

(\b[\u05D0-\u05EA]*"\b[\u05D0-\u05EA]*\b).*\1

But as you can see, that doesn’t work:

How can I match the words and acronyms to compose a dictionary with the words/acronyms?

Note

When you print the output, it might be printed in Left-to-right order. But it should really be from Right to left. So if you want to print from right to left, see this answer:

right-to-left languages in Python

Asked By: MendelG

Answer 1

You can try:

import re

# I pasted your Hebrew text into my text editor and it now appears mirrored (the editor probably lacks Hebrew support).
text = """
בביהכנ"ס - בבית הכנסת דו"ח - דין וחשבון הת"ד -  התיקוני דיקנא
בגו"ר  - בגשמיות ורוחניות ה"א - ה' אלוקיכם התמי' - התמיהה
בהנ"ל - בהנזכר לעיל ה"א - ה' אלקיך ואח"כ - ואחר כך
בהשי״ת - בהשם יתברך ה"ה - הרי הוא / הוא הדין ואת"ה - ואיגוד תלמידי 
"""

pat = re.compile(r"\b([\u05D0-\u05EA]*[\"״][\u05D0-\u05EA]*)\b")

data = [
    w.strip(" -") for w in pat.split(" ".join(text.split("\n"))) if w.strip()
]

# To get your desired result I reversed the order of the characters in each string. If your editor supports Hebrew text, you should probably skip this (remove the [::-1] on k and v).
out = dict(((k[::-1], v[::-1]) for v, k in zip(data[::2], data[1::2])))
print(out)

Prints (note the keys/values are swapped)

{
    "תסנכה תיבב": 'ס"נכהיבב',
    "ןובשחו ןיד": 'ח"וד',
    "אנקיד ינוקיתה": 'ד"תה',
    "תוינחורו תוימשגב": 'ר"וגב',
    "ההימתה - 'ימתה םכיקולא 'ה": 'א"ה',
    "ליעל רכזנהב": 'ל"נהב',
    "ךיקלא 'ה": 'א"ה',
    "ךכ רחאו": 'כ"חאו',
    "ךרבתי םשהב": "ת״ישהב",
    "ןידה אוה / אוה ירה": 'ה"ה',
    "ידימלת דוגיאו": 'ה"תאו',
}
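
If your terminal renders Hebrew correctly, a variant without the reversal, keeping the key/value order from the question, might look like the sketch below. It only uses the first line of the sample text, and the names parts and pairs are illustrative, not part of the code above.

```python
import re

# First line of the sample text from the question.
text = 'בביהכנ"ס - בבית הכנסת דו"ח - דין וחשבון הת"ד -  התיקוני דיקנא'

# The capturing group makes re.split keep the matched words in the result.
pat = re.compile(r'\b([\u05D0-\u05EA]*["״][\u05D0-\u05EA]*)\b')

# Drop pieces that are empty once spaces and dashes are stripped.
parts = [p.strip(" -") for p in pat.split(text) if p.strip(" -")]

# parts alternates: word, expansion, word, expansion, ...
pairs = dict(zip(parts[::2], parts[1::2]))
print(pairs)
```

This assumes every word is followed by exactly one expansion; an entry whose word is not matched by the pattern would shift the pairing.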

Answered By: Andrej Kesely

Answer 2

I assume that everything before/after a - is a word (I don’t know whether that’s true). So, I changed your pattern to this:

\b[\u05D0-\u05EA{",',״, ,/}]+

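You can add any other character that can appear in a Hebrew word inside the square brackets (the curly brackets and commas in that character class are themselves matched literally).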

Code

import re


text = """
בביהכנ"ס - בבית הכנסת דו"ח - דין וחשבון הת"ד -  התיקוני דיקנא
בגו"ר  - בגשמיות ורוחניות ה"א - ה' אלוקיכם התמי' - התמיהה
בהנ"ל - בהנזכר לעיל ה"א - ה' אלקיך ואח"כ - ואחר כך
בהשי״ת - בהשם יתברך ה"ה - הרי הוא / הוא הדין ואת"ה - ואיגוד תלמידי 
"""

words = re.findall(r"\b[\u05D0-\u05EA{\",',״, ,/}]+", text)
words = [word.strip() for word in words]

# Pair each even-indexed match (key) with the match that follows it (value).
dictionary = dict(zip(words[0::2], words[1::2]))

print(dictionary)

Output

{
    'בביהכנ"ס': 'בבית הכנסת דו"ח',
    'דין וחשבון הת"ד': 'התיקוני דיקנא',
    'בגו"ר': 'בגשמיות ורוחניות ה"א',
    "ה' אלוקיכם התמי'": 'התמיהה',
    'בהנ"ל': 'בהנזכר לעיל ה"א',
    'ה\' אלקיך ואח"כ': 'ואחר כך',
    'בהשי״ת': 'בהשם יתברך ה"ה',
    'הרי הוא / הוא הדין ואת"ה': 'ואיגוד תלמידי'
}
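
In the output above, each value also carries the word that starts the next entry, because entries on the same line are only separated by spaces. One possible way to keep the "text before/after the dash" idea is to split the line into entries first. The sketch below assumes every entry starts with a token containing " or ״ followed by a dash; the names entries and dictionary are illustrative, and only the first line of the sample is used.

```python
import re

# First line of the sample text from the question.
text = 'בביהכנ"ס - בבית הכנסת דו"ח - דין וחשבון הת"ד -  התיקוני דיקנא'

# Split on whitespace that is directly followed by a token containing
# " or ״ and then a dash, i.e. right before the start of the next entry.
entries = re.split(r'\s+(?=\S*["״]\S*\s+-)', text.strip())

dictionary = {}
for entry in entries:
    # Everything before the first dash is the key, everything after is the value.
    key, _, value = entry.partition("-")
    dictionary[key.strip()] = value.strip()

print(dictionary)
```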

Answered By: pourya90091

Answer 3

You can do something like this:

import re

pattern = r'(\b[\u05D0-\u05EA]*"\b[\u05D0-\u05EA]*\b)\s*-\s*([^"]+)(\s|$)'

text = """בביהכנ"ס - בבית הכנסת דו"ח - דין וחשבון הת"ד -  התיקוני דיקנא"""

for word, acronym, _ in re.findall(pattern, text):
    print(word + ' == ' + acronym)

which outputs

בביהכנ"ס == בבית הכנסת
דו"ח == דין וחשבון
הת"ד == התיקוני דיקנא

Let’s take a closer look at how I built the regex pattern. Here’s the pattern from your question that matches words:

(\b[\u05D0-\u05EA]*"\b[\u05D0-\u05EA]*\b)

This part will match the delimiter between a word and its acronym: \s*-\s* (spaces, then a dash, then spaces)

This part will match anything except a double quote: ([^"]+)

Finally, so as not to match the beginning of the next word, let’s match a space or the end of the string at the end: (\s|$).

Concatenate all the parts above and you’ll get my pattern:
(\b[\u05D0-\u05EA]*"\b[\u05D0-\u05EA]*\b)\s*-\s*([^"]+)(\s|$)
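
To make the concatenation concrete, here is a small sketch that builds the same pattern from its parts and runs it on a shortened sample. The variable names (word_part, delimiter_part and so on) are only there for illustration.

```python
import re

word_part = r'(\b[\u05D0-\u05EA]*"\b[\u05D0-\u05EA]*\b)'  # a word containing a double quote
delimiter_part = r'\s*-\s*'                                # optional spaces, a dash, optional spaces
acronym_part = r'([^"]+)'                                  # one or more characters that are not a double quote
end_part = r'(\s|$)'                                       # whitespace or end of string

pattern = word_part + delimiter_part + acronym_part + end_part

match = re.search(pattern, 'דו"ח - דין וחשבון הת"ד -  התיקוני דיקנא')
print(match.group(1))  # the word
print(match.group(2))  # its acronym text
```

The acronym part first grabs everything up to the next double quote and then backtracks until it is followed by whitespace, which is what keeps the start of the next word out of the match.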

re.findall() returns a list of tuples, one tuple per match. Each tuple contains the strings matched by the groups (the parts within parentheses), in the same order that the groups appear in the pattern. So we need tuple element 0 (the word) and tuple element 1 (the acronym) to build our dict. Element 2 is not needed.
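
Building the dict from those tuples could then look like the following sketch. It uses the same one-line sample; the name result is illustrative, and the strip() call is only a safeguard against stray whitespace.

```python
import re

pattern = r'(\b[\u05D0-\u05EA]*"\b[\u05D0-\u05EA]*\b)\s*-\s*([^"]+)(\s|$)'

text = 'בביהכנ"ס - בבית הכנסת דו"ח - דין וחשבון הת"ד -  התיקוני דיקנא'

# Each tuple is (word, acronym, trailing whitespace or empty string);
# the third element is ignored.
result = {word: acronym.strip() for word, acronym, _ in re.findall(pattern, text)}
print(result)
```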

Answered By: Alex