Print part of the text

Question:

I have a variable called test that holds a string with text in more than one language, and only the text in one of those languages should be printed. For example:
test = "Hello World سلام دنیا"
I want it to print only the text that is written in Farsi.

I should not use regex because the sentence is random and the number of words is unknown

a = "Hello سلام".replace("H", "").replace("e", "").replace("l", "").replace("o", "")

print(a)

Asked By: Mahyar Mortezaei

Answers:

You can use the Unicode code points of the characters in the sentence to tell English (ASCII) letters apart from other alphabets. ord(character) returns the code point of a character, and every ASCII character has a value below 128.
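
As a minimal sketch of this idea (the helper name is_ascii_word is hypothetical, not part of the answer), each word can be checked with ord and kept only if it contains at least one character outside the ASCII range:

```python
def is_ascii_word(word):
    # ASCII characters have code points 0..127
    return all(ord(c) < 128 for c in word)

test = "Hello World سلام دنیا"

# Keep the words that contain at least one non-ASCII character
non_ascii_words = [w for w in test.split() if not is_ascii_word(w)]
print(" ".join(non_ascii_words))  # prints: سلام دنیا
```

Note that this keeps any non-ASCII word, not only Farsi ones, so it is only suitable when the text is known to mix English and Farsi.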

Answered By: Dinura Dissanayake

The challenge here is how best to determine whether a word consists entirely of characters that are valid in Farsi (Persian).

There are 106 valid characters in Persian. Many are shared with other languages.

Their Unicode code points can be represented by the following set:

farsi_characters = {32, 33, 36, 37, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 60, 61, 62, 91, 92, 93, 94, 95, 124, 160, 169, 171, 187, 1545, 1548, 1563, 1567, 1569, 1570, 1571, 1572, 1574, 1575, 1576, 1577, 1578, 1579, 1580, 1581, 1582, 1583, 1584, 1585, 1586, 1587, 1588, 1589, 1590, 1591, 1592, 1593, 1594, 1601, 1602, 1604, 1605, 1606, 1607, 1608, 1611, 1612, 1613, 1617, 1620, 1642, 1643, 1644, 1662, 1670, 1688, 1705, 1711, 1740, 1776, 1777, 1778, 1779, 1780, 1781, 1782, 1783, 1784, 1785, 8206, 8208, 8209, 8230, 8240, 8249, 8250, 8364, 8722}

This means that the problem can be partially solved as follows:

def is_farsi(word):
    return all(ord(c) in farsi_characters for c in word)

test = "Hello World سلام دنیا"

for word in test.split():
    if is_farsi(word):
        print(word)

Output:

سلام
دنیا

Note:

The problem here is ambiguity. What if we have:

test = "Hello 123 World سلام دنیا"

Then the output would be:

123
سلام
دنیا

Why? Because the digits 0-9 (code points 48-57) are in the set: they are used in Farsi in addition to ۰۱۲۳۴۵۶۷۸۹.

You could consider removing values lower than 1545 from the set. This eliminates the digits, ASCII punctuation and the other low code points that are shared with many other languages.
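
A sketch of that stricter variant follows. The set is the one given above with every value below 1545 dropped; the name farsi_only_characters is an assumption made for this example.

```python
# Code points from the set above, keeping only values >= 1545
farsi_only_characters = {
    1545, 1548, 1563, 1567, 1569, 1570, 1571, 1572, 1574, 1575, 1576, 1577,
    1578, 1579, 1580, 1581, 1582, 1583, 1584, 1585, 1586, 1587, 1588, 1589,
    1590, 1591, 1592, 1593, 1594, 1601, 1602, 1604, 1605, 1606, 1607, 1608,
    1611, 1612, 1613, 1617, 1620, 1642, 1643, 1644, 1662, 1670, 1688, 1705,
    1711, 1740, 1776, 1777, 1778, 1779, 1780, 1781, 1782, 1783, 1784, 1785,
    8206, 8208, 8209, 8230, 8240, 8249, 8250, 8364, 8722,
}

def is_farsi(word):
    # True only if every character's code point is in the reduced set
    return all(ord(c) in farsi_only_characters for c in word)

test = "Hello 123 World سلام دنیا"

farsi_words = [word for word in test.split() if is_farsi(word)]
print(farsi_words)  # "123" is no longer included
```

The trade-off is that a Farsi word with ASCII punctuation attached, such as a trailing "!" or ".", is now rejected as well, because those code points were removed along with the digits.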

Answered By: Pingu

Actually, you can use a regex to detect whether a word is in Farsi:

import re

def remove_non_farsi_words(text):
    texts = text.split(" ")
    return " ".join([t for t in texts if re.search(r'[\u0621-\u06CC]', t)])

remove_non_farsi_words("Hello World سلام دنیا")
# Returns: 'سلام دنیا'

Explanation

  • re.search(r'[\u0621-\u06CC]', t) detects whether the word t contains at least one Farsi character.
  • If yes, the word is kept in the list.
  • Finally we use " ".join() to re-combine the Farsi words into a string.

This method is extensible: if you want anything else in your final result, such as digits, simply change the regex to [0-9\u0621-\u06CC].
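
For illustration, the digit-accepting variant could look like the sketch below. The names FARSI_OR_DIGIT and keep_farsi_and_digits are hypothetical and chosen for this example only.

```python
import re

# Matches an ASCII digit or any character in the range U+0621..U+06CC
FARSI_OR_DIGIT = re.compile(r'[0-9\u0621-\u06CC]')

def keep_farsi_and_digits(text):
    # Keep every space-separated token containing at least one matching character
    return " ".join(t for t in text.split(" ") if FARSI_OR_DIGIT.search(t))

print(keep_farsi_and_digits("Hello 123 World سلام دنیا"))
# prints: 123 سلام دنیا
```

Because search() only needs a single matching character, a mixed token such as "abc1" would also be kept. If every character of the token should match, re.fullmatch with the pattern followed by + is one alternative.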

Reference: https://en.wikipedia.org/wiki/Persian_alphabet

Answered By: SimZhou