Weird encoding in vtt file–python

Question

I am trying to obtain text from a subtitles file (vtt format) as follows:

import requests
r = requests.get('https://nogeovod-fy.atresmedia.com/vsg/sitemap/assets4/2022/09/26/C302281D-5C76-4710-A4FB-9AD7252B7F47/es.vtt')
print(r.encoding)

r.encoding = r.apparent_encoding

print(r.text)

Some characters come out wrong because the reported encoding, ISO-8859-1, is not the right one. However, when I change it to UTF-8, all the accented characters are still garbled.

Asked By: Luis


Answer 1

The file appears to contain the following replaced characters:

Ć for á
Ž for é
Ð for í
Š for ó
ž for ñ
ë for ú
Č for ¡
č for ¿

Given that, replacing these one-to-one should fix your problem. We still don't know which encoding this is, but the damage is limited to these few characters.

# r is the requests response object from the question's code
fixed = r.text.replace("Ć", "á").replace("Ž", "é").replace(
  "Ð", "í").replace("Š", "ó").replace("ž", "ñ").replace(
  "ë", "ú").replace("Č", "¡").replace("č", "¿")
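
The same one-to-one mapping can also be written as a translation table, which may be easier to extend if more bad characters turn up. This is only a sketch: the names REPLACEMENTS, TABLE and fix_subtitles are made up for illustration, and the sample line is invented rather than taken from the actual file.

```python
# Mapping taken from the list above: garbled character -> intended character.
REPLACEMENTS = {
    "Ć": "á",
    "Ž": "é",
    "Ð": "í",
    "Š": "ó",
    "ž": "ñ",
    "ë": "ú",
    "Č": "¡",
    "č": "¿",
}

# str.maketrans accepts a dict whose keys are single characters.
TABLE = str.maketrans(REPLACEMENTS)


def fix_subtitles(text: str) -> str:
    """Return text with each garbled character swapped for its intended one."""
    return text.translate(TABLE)


# Invented sample line (an assumption, not real data from the .vtt file).
sample = "čQuŽ tal, sežor? ČAdiŠs!"
print(fix_subtitles(sample))  # ¿Qué tal, señor? ¡Adiós!
```

With the response from the question this would be called as fix_subtitles(r.text). Like the chained replace calls, it also rewrites any character that was legitimately one of the keys (a genuine "ë", for instance), so it is only safe for text where those characters are not expected.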

Answered By: tripleee
