Weird encoding in vtt file–python

Question:

I am trying to obtain text from a subtitles file (vtt format) as follows:

import requests
r = requests.get('https://nogeovod-fy.atresmedia.com/vsg/sitemap/assets4/2022/09/26/C302281D-5C76-4710-A4FB-9AD7252B7F47/es.vtt')
print(r.encoding)

r.encoding = r.apparent_encoding

print(r.text)

Some characters come out wrong, so the reported encoding (ISO-8859-1) does not seem to be the right one. However, when I change it to utf-8, the accented characters are still garbled.

Asked By: Luis


Answers:

The file appears to contain the following replaced characters:

  • Ć for á
  • Ž for é
  • Ð for í
  • Š for ó
  • ž for ñ
  • ë for ú
  • Č for ¡
  • č for ¿

Replacing these characters one-to-one should fix your problem. This does not tell us which encoding the file actually uses, but the damage appears to be limited to these eight characters.

# `r` is the response object from the question's code.
fixed = (r.text
    .replace("Ć", "á").replace("Ž", "é").replace("Ð", "í")
    .replace("Š", "ó").replace("ž", "ñ").replace("ë", "ú")
    .replace("Č", "¡").replace("č", "¿"))
Answered By: tripleee
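
A possible variation on the answer above (not part of it): the same eight substitutions can be kept in a single translation table and applied in one pass with str.translate. This is only a sketch. The names REPLACEMENTS and fix_subtitle_text are made up for illustration, and it assumes that none of the left-hand characters legitimately occur in the subtitles.

```python
# Mapping taken from the list in the answer: garbled character -> intended one.
REPLACEMENTS = {
    "Ć": "á",
    "Ž": "é",
    "Ð": "í",
    "Š": "ó",
    "ž": "ñ",
    "ë": "ú",
    "Č": "¡",
    "č": "¿",
}

# str.maketrans accepts a dict whose keys are single characters.
TRANSLATION_TABLE = str.maketrans(REPLACEMENTS)


def fix_subtitle_text(text: str) -> str:
    """Return text with the known garbled characters replaced (hypothetical helper)."""
    return text.translate(TRANSLATION_TABLE)


if __name__ == "__main__":
    # With the question's code this would be: fixed = fix_subtitle_text(r.text)
    print(fix_subtitle_text("čQuŽ tal?"))
```

Keeping the mapping in one dict makes it easy to add further pairs if more garbled characters turn up in the file.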