Python googletrans encoding weird chars

Question:

I have a UI that takes German text, among other things, and translates it into English sentences.

# -*- coding: utf-8 -*-
from googletrans import Translator

def tr(s):
    # Translate a German string to English and return the translated text.
    translator = Translator()
    return translator.translate(s, src='de', dest='en').text

Sometimes I get weird characters from the translator.
For example:

DE: Pascal und PHP sind Programmiersprachen für Softwareentwickler und Ingenieure.

googletrans EN(utf8): Pascal and PHP are programming languages ​​for software developers and engineers.

This is how the string looks in UTF-8. When I open it with the Windows text editor, it looks like this:

googletrans EN: Pascal and PHP are programming languages â€‹â€‹for software developers and engineers.

As you can see, there are 2 weird characters before “for software”, which the translate() function returns. These characters are also in the “googletrans EN(utf8)” string. You can’t see them, but when you step through the string with the arrow keys, the cursor doesn’t move for 2 key presses just before “for software”. So the characters are there but invisible. (Maybe you can’t reproduce it here because the website has already reformatted the string.)

Sometimes other invisible characters also show up after the translation.

I need these characters eliminated. I can’t go ASCII-only, because I also need to save German characters like “ö, ä, ü, ß” in a txt file. Is this maybe just an encoding issue that I don’t understand, or what is wrong here?

Asked By: puzzled

Answers:

The translated text contains two embedded zero-width space ('\u200b') characters.

>>> res = t.translate(wordDE, src='de', dest='en').text
>>> res
'Pascal and PHP are programming languages \u200b\u200bfor software developers and engineers.'

The text editor appears to be decoding the file as cp1252 (or a similar MS 8-bit encoding), hence the mojibake:

>>> res.encode('utf-8').decode('cp1252')
'Pascal and PHP are programming languages â€‹â€‹for software developers and engineers.'
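
If the result is written to a txt file, passing the encoding explicitly when writing may help an editor pick the right decoding. The following is only a sketch under assumptions: it assumes Python 3, the helper name save_text and the file name are hypothetical, and whether a given Windows editor honours the byte order mark that utf-8-sig prepends is not something the question confirms.

```python
import os
import tempfile


def save_text(path, text):
    # Hypothetical helper: write text as UTF-8 with a leading byte order
    # mark ("utf-8-sig"), which some editors use to detect UTF-8.
    with open(path, "w", encoding="utf-8-sig") as f:
        f.write(text)


# Illustrative path and sample text, not taken from the question.
path = os.path.join(tempfile.gettempdir(), "translation_demo.txt")
save_text(path, "Grüße: ö, ä, ü, ß")

# Read the raw bytes back to inspect what was actually written.
with open(path, "rb") as f:
    raw = f.read()
```

This keeps characters such as ö, ä, ü and ß intact. It does not remove the zero-width spaces; they stay in the file and are simply invisible when decoded correctly.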

This is a known bug in the Google Translate API. Pending a fix, you can use str.replace to create a new string that does not contain these characters:

>>> new_res = res.replace('\u200b', '')
>>> new_res
'Pascal and PHP are programming languages for software developers and engineers.'
Answered By: snakecharmerb
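
The question also mentions other invisible characters turning up. One possible generalisation, offered as a sketch rather than as part of the answer above, is to drop every character in the Unicode "Cf" (format) category, which includes the zero-width space. The function name strip_invisible and the sample string are hypothetical.

```python
import unicodedata


def strip_invisible(text):
    # Hypothetical helper: keep every character whose Unicode general
    # category is not "Cf" (format). U+200B ZERO WIDTH SPACE is in "Cf";
    # letters such as ö, ä, ü and ß are not, so they are kept.
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")


# Illustrative input with two zero-width spaces before "for".
sample = "languages \u200b\u200bfor Größe"
cleaned = strip_invisible(sample)
print(ascii(cleaned))
```

The "Cf" category also covers characters that are meaningful in some scripts and in emoji sequences (for example the zero-width joiner), so this filter is broader than the str.replace call and may be too aggressive for text that is not plain German or English.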