Handling special characters and getting specific text in a large HTML file using lxml and Python

Question:

I'm using Python in Google Colab to extract my YouTube watch history and build a DataFrame from the .html file that contains it.

I'm using lxml to parse the HTML, but I'm facing the following problems:

  • The text obtained with lxml does not decode special characters correctly, e.g. "á", "é", "í", emojis, etc. EDIT (17/03/2023): I've added the <meta charset="utf-8" /> tag to the HTML file's content to solve this point, though I'm open to alternatives that avoid editing the file.
  • I'm unable to get the date on which I viewed the video. This text is inside a div, but there is no clear or easy way to extract it. The desired result for each div is the date, for example: 10 feb 2023, 08:03:13 COT.
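
For the first point, one alternative to editing the file might be to tell the parser the encoding explicitly. The following is only a minimal sketch, assuming the export really is UTF-8 encoded; the sample bytes are made up for illustration and are not taken from the real file.

```python
from lxml import etree

# Hypothetical sample: UTF-8 bytes with no <meta charset> declaration.
raw = "<html><body><div>fanáticos día</div></body></html>".encode("utf-8")

# Passing encoding= tells the parser how to decode the bytes,
# instead of leaving it to guess when no charset is declared.
parser = etree.HTMLParser(encoding="utf-8")
root = etree.fromstring(raw, parser)

text = root.xpath("//div/text()")[0]
print(text)
```

A parser built this way can also be passed to etree.parse(path, parser), which is the call used in the question's own code.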

Here is an extract of the HTML content:

<div class="mdl-grid">
  <div class="outer-cell mdl-cell mdl-cell--12-col mdl-shadow--2dp">
    <div class="mdl-grid">
      <div class="header-cell mdl-cell mdl-cell--12-col">
        <p class="mdl-typography--title">
          YouTube

          <br>
        </p>
      </div>
      <div class="content-cell mdl-cell mdl-cell--6-col mdl-typography--body-1">
        Has visto <a href="https://www.youtube.com/watch?v=zj7kUzvqNBk">El BOCON que NINGUNEO a Chavez y fue HUMILLADO frente a 130 000 fanáticos | Chavez vs Haugen</a>
        <br>
        <a href="https://www.youtube.com/channel/UCivSw2EdxpUfH7vJ21Ob7NA">El Rayo Deportivo</a>
        <br>
        10 feb 2023, 08:03:13 COT
      </div>
      <div class="content-cell mdl-cell mdl-cell--6-col mdl-typography--body-1 mdl-typography--text-right"></div>
      <div class="content-cell mdl-cell mdl-cell--12-col mdl-typography--caption">
        <b>Productos:</b>
        <br>
        &emsp;YouTube

        <br>
        <b>¿Por qué se grabó esta actividad?</b>
        <br>
        &emsp;Se guardó esta actividad en tu cuenta de Google porque estaban habilitadas las siguientes opciones de configuración:&nbsp;Historial de reproducciones de YouTube.&nbsp;Puedes controlar estas opciones de configuración <a href="https://myaccount.google.com/activitycontrols">aquí</a>.

      </div>
    </div>
  </div>
  <div class="outer-cell mdl-cell mdl-cell--12-col mdl-shadow--2dp">
    <div class="mdl-grid">
      <div class="header-cell mdl-cell mdl-cell--12-col">
        <p class="mdl-typography--title">
          YouTube

          <br>
        </p>
      </div>
      <div class="content-cell mdl-cell mdl-cell--6-col mdl-typography--body-1">
        Has visto <a href="https://www.youtube.com/watch?v=VArBMOvwiNE">No hay legado más grande, que el que dura para siempre. #LEOGACY</a>
        <br>
        Visto a las 08:02

        <br>
        10 feb 2023, 08:02:40 COT
      </div>
      <div class="content-cell mdl-cell mdl-cell--6-col mdl-typography--body-1 mdl-typography--text-right"></div>
      <div class="content-cell mdl-cell mdl-cell--12-col mdl-typography--caption">
        <b>Productos:</b>
        <br>
        &emsp;YouTube

        <br>
        <b>Detalles:</b>
        <br>
        &emsp;De los anuncios de Google

        <br>
        <b>¿Por qué se grabó esta actividad?</b>
        <br>
        &emsp;Se guardó esta actividad en tu cuenta de Google porque estaban habilitadas las siguientes opciones de configuración:&nbsp;Actividad en la Web y en Aplicaciones,&nbsp;Historial de reproducciones de YouTube,&nbsp;Historial de búsquedas de YouTube.&nbsp;Puedes controlar estas opciones de configuración <a href="https://myaccount.google.com/activitycontrols">aquí</a>.

      </div>
    </div>
  </div>
  <div class="outer-cell mdl-cell mdl-cell--12-col mdl-shadow--2dp">
    <div class="mdl-grid">
      <div class="header-cell mdl-cell mdl-cell--12-col">
        <p class="mdl-typography--title">
          YouTube

          <br>
        </p>
      </div>
      <div class="content-cell mdl-cell mdl-cell--6-col mdl-typography--body-1">
        Has visto <a href="https://www.youtube.com/watch?v=VLr8hPtyrIU">Noraver Gripa Fast - Descongestiona las vías respiratorias y elimina los demás síntomas de la gripa</a>
        <br>
        Visto a las 08:02

        <br>
        10 feb 2023, 08:02:02 COT
      </div>
      <div class="content-cell mdl-cell mdl-cell--6-col mdl-typography--body-1 mdl-typography--text-right"></div>
      <div class="content-cell mdl-cell mdl-cell--12-col mdl-typography--caption">
        <b>Productos:</b>
        <br>
        &emsp;YouTube

        <br>
        <b>Detalles:</b>
        <br>
        &emsp;De los anuncios de Google

        <br>
        <b>¿Por qué se grabó esta actividad?</b>
        <br>
        &emsp;Se guardó esta actividad en tu cuenta de Google porque estaban habilitadas las siguientes opciones de configuración:&nbsp;Actividad en la Web y en Aplicaciones,&nbsp;Historial de reproducciones de YouTube,&nbsp;Historial de búsquedas de YouTube.&nbsp;Puedes controlar estas opciones de configuración <a href="https://myaccount.google.com/activitycontrols">aquí</a>.

      </div>
    </div>
  </div>
  <div class="outer-cell mdl-cell mdl-cell--12-col mdl-shadow--2dp">
    <div class="mdl-grid">
      <div class="header-cell mdl-cell mdl-cell--12-col">
        <p class="mdl-typography--title">
          YouTube

          <br>
        </p>
      </div>
      <div class="content-cell mdl-cell mdl-cell--6-col mdl-typography--body-1">
        Has visto <a href="https://www.youtube.com/watch?v=YM9kkSFwmg0">Resident Evil 6 Mercenaries No Mercy Requiem for War 2757k Claire Redfield (Cowgirl) PC 1080p</a>
        <br>
        <a href="https://www.youtube.com/channel/UCqPkDBlhiVGe07S3CZOQGWA">Radical Dreamer</a>
        <br>
        10 feb 2023, 07:46:44 COT
      </div>
      <div class="content-cell mdl-cell mdl-cell--6-col mdl-typography--body-1 mdl-typography--text-right"></div>
      <div class="content-cell mdl-cell mdl-cell--12-col mdl-typography--caption">
        <b>Productos:</b>
        <br>
        &emsp;YouTube

        <br>
        <b>¿Por qué se grabó esta actividad?</b>
        <br>
        &emsp;Se guardó esta actividad en tu cuenta de Google porque estaban habilitadas las siguientes opciones de configuración:&nbsp;Historial de reproducciones de YouTube.&nbsp;Puedes controlar estas opciones de configuración <a href="https://myaccount.google.com/activitycontrols">aquí</a>.

      </div>
    </div>
  </div>
</div>
[...]

This is the code I’m using:

from lxml import etree

parser = etree.HTMLParser()
tree = etree.parse("/content/historial de reproducciones.html", parser)

# Get the divs that contain each history entry:
divs = tree.xpath("//div[@class='content-cell mdl-cell mdl-cell--6-col mdl-typography--body-1']")

# Store the data here as a list of dicts (JSON-like records):
js_data = []

# Loop over the divs to extract these values:
# > URL of the video
# > Video title
# > URL of the channel
# > Channel title
# > Watched time: date on which I watched the video.
# NOTE: The Takeout contains removed/deleted videos that do not
# have any links; in that case, the data is an empty row in the DataFrame.
# If "watched_time" is obtained, that row will instead contain ONLY the
# "watched_time" value, and that's the desired output for those cases.
for div in divs:
    temp_links = div.xpath("a/@href")
    temp_links_texts = div.xpath("a")

    # xpath() returns a list (never None), so test for a non-empty result.
    if temp_links:

        # Here, I've omitted some code that fills "temp_links_texts"; it's not
        # related to the problem of getting "watched_time".
        video_link = temp_links[0]
        video_text = temp_links_texts[0].text
        # Some entries have no channel link, so guard the second item.
        channel_link = temp_links[1] if len(temp_links) > 1 else ""
        channel_name = temp_links_texts[1].text if len(temp_links_texts) > 1 else ""

        # When I try to get the date, I'm unable to get the desired text = 10 feb 2023, 08:03:13 COT.
        # I only get the (Has visto ) text instead of (10 feb 2023, 08:03:13 COT).
        watched_time = ""

        # Append the data:
        js_data.append({
            "Video": video_text,  # OK
            "Link": video_link,  # OK
            "Channel": channel_name,
            "URL": channel_link,
            "Watched": watched_time,
        })

These are the actual results, shortened for demonstration purposes (columns are space-separated, and the Watched column is empty in every row):

index Video Link Channel URL Watched
0 El BOCON que NINGUNEO a Chavez y fue HUMILLADO frente a 130 000 fanáticos | Chavez vs Haugen https://www.youtube.com/watch?v=zj7kUzvqNBk El Rayo Deportivo https://www.youtube.com/channel/UCivSw2EdxpUfH7vJ21Ob7NA
1 No hay legado más grande, que el que dura para siempre. #LEOGACY https://www.youtube.com/watch?v=VArBMOvwiNE Gatorade Colombia https://www.youtube.com/@gatoradecolombia9204
2 Noraver Gripa Fast – Descongestiona las vías respiratorias y elimina los demás síntomas de la gripa https://www.youtube.com/watch?v=VLr8hPtyrIU Noraver Colombia https://www.youtube.com/@noravercolombia3979
3 Resident Evil 6 Mercenaries No Mercy Requiem for War 2757k Claire Redfield (Cowgirl) PC 1080p https://www.youtube.com/watch?v=YM9kkSFwmg0 Radical Dreamer https://www.youtube.com/channel/UCqPkDBlhiVGe07S3CZOQGWA
4 Mega Retrospectiva: Los Simpson Hit & Run (Parte 4) https://www.youtube.com/watch?v=Xh9p5fbUgY8 Max Power https://www.youtube.com/channel/UCZ99bYEb57kXEIaSALrj_6A
5 Gana más con Cabify https://www.youtube.com/watch?v=gTwtHdKuozY Cabify https://www.youtube.com/@Cabifys
6 ESTOS DARKLORDS SI PEGAN DURO https://www.youtube.com/watch?v=jwahSkgYhYY Duel Random L 2.0 https://www.youtube.com/channel/UCkty7zBHmasLAwIBc9cgiCA
7 Office Space – fixed the glitch https://www.youtube.com/watch?v=BUE0PPQI3is Peter Ghosh https://www.youtube.com/channel/UCqNUlAWLXWLf9ZwTRGX6Xqw
8 Llegamos a más de 800 destinos y a millones de hogares en Colombia https://www.youtube.com/watch?v=Tq0LDrLZzEI Homecenter Colombia https://www.youtube.com/@homecentercolombia
9 Iron Man: Archivo de Monstruos | Mech Strike https://www.youtube.com/watch?v=JHa0rLQh4cE Marvel HQ LA https://www.youtube.com/@MarvelHQLA
10 Review Temporada 31: Esto no necesitaba dos partes https://www.youtube.com/watch?v=UUYebZN4Xv8 Max Power https://www.youtube.com/channel/UCZ99bYEb57kXEIaSALrj_6A
11 Vuélvete Todo Claro y recibe más beneficios sin pagar más https://www.youtube.com/watch?v=a9pverB6qAU Claro Colombia https://www.youtube.com/@ClaroColombia
12 Atomic Heart = Russian Propaganda? https://www.youtube.com/watch?v=AzDCQzH1sO0 Setarko https://www.youtube.com/channel/UCcznKF2NAgexfaZtsOz7lFw
13 Mortal Kombat as an 80’s Dark Fantasy Film https://www.youtube.com/watch?v=gms5pD-EbMk SON https://www.youtube.com/channel/UCr0lGe_WRgMxXEQtmmEdsCA
14 Pirates of the Caribbean as an 80’s Dark Fantasy Film https://www.youtube.com/watch?v=2z36d5eiCcs SON https://www.youtube.com/channel/UCr0lGe_WRgMxXEQtmmEdsCA
15 What Happens When You Are a Coomer https://www.youtube.com/watch?v=DhXmO64ut4I Lord Wojak https://www.youtube.com/channel/UCULzpwL5TRDydBF_bWfJjrw
16 Wagering our RAREST Yu-Gi-Oh Cards in a Duel! Rare Hunters 11 – Ancient Sanctuary https://www.youtube.com/watch?v=x29ZRFdCpfE Team APS https://www.youtube.com/channel/UCaqlCjzSFmunjBOP4EmYpxg
17 Yu-Gi-Oh – 4,900 Attack Dark Paladin https://www.youtube.com/watch?v=XUHJIVvgkVc GrandMasterKaiba https://www.youtube.com/channel/UCLj6N6_WjY8upsjJVS9Osug
18 T̶E̵T̶R̸I̵S̷ B̶E̸A̶T̶B̸O̵X̵ [VOID] {2019} https://www.youtube.com/watch?v=_xQgxk4OyKI TimTam60.mp5 https://www.youtube.com/channel/UCsObTxoQ2g90OuxriztJyHg
19 Wojak’s New Year New Me https://www.youtube.com/watch?v=hlShmY1XrDo Wojak Life https://www.youtube.com/channel/UCTi2tQgA_yu2uu_9RiyT2yA
20 Evils of Instant Gratification https://www.youtube.com/watch?v=pWQ3LH-DTvk Wojak Life https://www.youtube.com/channel/UCTi2tQgA_yu2uu_9RiyT2yA

Is anything missing from the information posted here? How can the issue described above be solved?

Answers:

The viewing dates are in the tail of the second <br> element within each <div>.
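
To see why reading the div's own text only yields "Has visto", it may help to recall that lxml stores the text that follows a child element in that child's .tail, while .text holds only the text before the first child. A minimal sketch on a made-up fragment (not the real export):

```python
from lxml import etree

# Hypothetical, simplified fragment with the same shape as one entry.
div = etree.fromstring("<div>Has visto <a>Title</a><br/>10 feb 2023</div>")

a, br = div  # the two child elements, in document order

print(repr(div.text))  # text before the first child
print(repr(a.tail))    # nothing sits between </a> and <br/>
print(repr(br.tail))   # text that follows the <br/>
```

In this fragment the date is reachable only through the tail of the <br> element, which is the same mechanism the demo relies on.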

Demo (test.html contains the HTML in the question):

from lxml import html

tree = html.parse("test.html") 
divs = tree.xpath("//div[@class='content-cell mdl-cell mdl-cell--6-col mdl-typography--body-1']")

for div in divs:
    br_elems = div.xpath("br")
    print(br_elems[1].tail.strip())

Output:

10 feb 2023, 08:03:13 COT
10 feb 2023, 08:02:40 COT
10 feb 2023, 08:02:02 COT
10 feb 2023, 07:46:44 COT
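
The demo indexes the second <br>, which fits the entries shown in the question. If some entries (for example the deleted videos the question mentions) contain a different number of <br> elements, that index might not exist. One possible alternative, sketched here under the assumption that the date is always the last non-empty text node directly inside the div, is to collect the div's text nodes and take the last one. The sample HTML and the helper name last_text are hypothetical.

```python
from lxml import html

# Hypothetical, shortened sample modelled on the HTML in the question.
CELL = "content-cell mdl-cell mdl-cell--6-col mdl-typography--body-1"
SAMPLE = f"""
<html><body>
<div class="{CELL}">
  Has visto <a href="https://www.youtube.com/watch?v=zj7kUzvqNBk">Video A</a>
  <br>
  <a href="https://www.youtube.com/channel/UCivSw2EdxpUfH7vJ21Ob7NA">El Rayo Deportivo</a>
  <br>
  10 feb 2023, 08:03:13 COT
</div>
<div class="{CELL}">
  Has visto <a href="https://www.youtube.com/watch?v=VArBMOvwiNE">Video B</a>
  <br>
  Visto a las 08:02
  <br>
  10 feb 2023, 08:02:40 COT
</div>
</body></html>
"""


def last_text(div):
    """Return the last non-empty text node that is a direct child of div."""
    texts = [t.strip() for t in div.xpath("text()")]
    texts = [t for t in texts if t]
    return texts[-1] if texts else ""


root = html.fromstring(SAMPLE)
divs = root.xpath(f"//div[@class='{CELL}']")
dates = [last_text(div) for div in divs]
print(dates)
```

Because the helper returns an empty string when a div has no text at all, it can be called for every entry without first checking how many links or <br> elements the entry contains.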
Answered By: mzjn