Handling special characters and getting specific text in large HTML file using lxml and Python

Question

I’m using Python in Google Colab to extract my YouTube watch history and build a DataFrame from the data in the .html file that contains the history.

I’m using lxml to parse the HTML, but I’m facing the following problems:

The text obtained with lxml does not decode special characters correctly (e.g. "á", "é", "í", emojis). EDIT (17/03/2023): I’ve added the <meta charset="utf-8" /> tag to the HTML file’s content to solve this point, though I’m open to alternatives that avoid editing the file.
I’m unable to get the date on which I viewed each video. This text is inside a div, but there is no clear or easy way to extract it. The desired result for each div is the date, for example: 10 feb 2023, 08:03:13 COT.

Here is an extract of the HTML content:

<div class="mdl-grid">
  <div class="outer-cell mdl-cell mdl-cell--12-col mdl-shadow--2dp">
    <div class="mdl-grid">
      <div class="header-cell mdl-cell mdl-cell--12-col">
        <p class="mdl-typography--title">
          YouTube

          <br>
        </p>
      </div>
      <div class="content-cell mdl-cell mdl-cell--6-col mdl-typography--body-1">
        Has visto <a href="https://www.youtube.com/watch?v=zj7kUzvqNBk">El BOCON que NINGUNEO a Chavez y fue HUMILLADO frente a 130 000 fanáticos | Chavez vs Haugen</a>
        <br>
        <a href="https://www.youtube.com/channel/UCivSw2EdxpUfH7vJ21Ob7NA">El Rayo Deportivo</a>
        <br>
        10 feb 2023, 08:03:13 COT
      </div>
      <div class="content-cell mdl-cell mdl-cell--6-col mdl-typography--body-1 mdl-typography--text-right"></div>
      <div class="content-cell mdl-cell mdl-cell--12-col mdl-typography--caption">
        <b>Productos:</b>
        <br>
        &emsp;YouTube

        <br>
        <b>¿Por qué se grabó esta actividad?</b>
        <br>
        &emsp;Se guardó esta actividad en tu cuenta de Google porque estaban habilitadas las siguientes opciones de configuración:&nbsp;Historial de reproducciones de YouTube.&nbsp;Puedes controlar estas opciones de configuración <a href="https://myaccount.google.com/activitycontrols">aquí</a>.

      </div>
    </div>
  </div>
  <div class="outer-cell mdl-cell mdl-cell--12-col mdl-shadow--2dp">
    <div class="mdl-grid">
      <div class="header-cell mdl-cell mdl-cell--12-col">
        <p class="mdl-typography--title">
          YouTube

          <br>
        </p>
      </div>
      <div class="content-cell mdl-cell mdl-cell--6-col mdl-typography--body-1">
        Has visto <a href="https://www.youtube.com/watch?v=VArBMOvwiNE">No hay legado más grande, que el que dura para siempre. #LEOGACY</a>
        <br>
        Visto a las 08:02

        <br>
        10 feb 2023, 08:02:40 COT
      </div>
      <div class="content-cell mdl-cell mdl-cell--6-col mdl-typography--body-1 mdl-typography--text-right"></div>
      <div class="content-cell mdl-cell mdl-cell--12-col mdl-typography--caption">
        <b>Productos:</b>
        <br>
        &emsp;YouTube

        <br>
        <b>Detalles:</b>
        <br>
        &emsp;De los anuncios de Google

        <br>
        <b>¿Por qué se grabó esta actividad?</b>
        <br>
        &emsp;Se guardó esta actividad en tu cuenta de Google porque estaban habilitadas las siguientes opciones de configuración:&nbsp;Actividad en la Web y en Aplicaciones,&nbsp;Historial de reproducciones de YouTube,&nbsp;Historial de búsquedas de YouTube.&nbsp;Puedes controlar estas opciones de configuración <a href="https://myaccount.google.com/activitycontrols">aquí</a>.

      </div>
    </div>
  </div>
  <div class="outer-cell mdl-cell mdl-cell--12-col mdl-shadow--2dp">
    <div class="mdl-grid">
      <div class="header-cell mdl-cell mdl-cell--12-col">
        <p class="mdl-typography--title">
          YouTube

          <br>
        </p>
      </div>
      <div class="content-cell mdl-cell mdl-cell--6-col mdl-typography--body-1">
        Has visto <a href="https://www.youtube.com/watch?v=VLr8hPtyrIU">Noraver Gripa Fast - Descongestiona las vías respiratorias y elimina los demás síntomas de la gripa</a>
        <br>
        Visto a las 08:02

        <br>
        10 feb 2023, 08:02:02 COT
      </div>
      <div class="content-cell mdl-cell mdl-cell--6-col mdl-typography--body-1 mdl-typography--text-right"></div>
      <div class="content-cell mdl-cell mdl-cell--12-col mdl-typography--caption">
        <b>Productos:</b>
        <br>
        &emsp;YouTube

        <br>
        <b>Detalles:</b>
        <br>
        &emsp;De los anuncios de Google

        <br>
        <b>¿Por qué se grabó esta actividad?</b>
        <br>
        &emsp;Se guardó esta actividad en tu cuenta de Google porque estaban habilitadas las siguientes opciones de configuración:&nbsp;Actividad en la Web y en Aplicaciones,&nbsp;Historial de reproducciones de YouTube,&nbsp;Historial de búsquedas de YouTube.&nbsp;Puedes controlar estas opciones de configuración <a href="https://myaccount.google.com/activitycontrols">aquí</a>.

      </div>
    </div>
  </div>
  <div class="outer-cell mdl-cell mdl-cell--12-col mdl-shadow--2dp">
    <div class="mdl-grid">
      <div class="header-cell mdl-cell mdl-cell--12-col">
        <p class="mdl-typography--title">
          YouTube

          <br>
        </p>
      </div>
      <div class="content-cell mdl-cell mdl-cell--6-col mdl-typography--body-1">
        Has visto <a href="https://www.youtube.com/watch?v=YM9kkSFwmg0">Resident Evil 6 Mercenaries No Mercy Requiem for War 2757k Claire Redfield (Cowgirl) PC 1080p</a>
        <br>
        <a href="https://www.youtube.com/channel/UCqPkDBlhiVGe07S3CZOQGWA">Radical Dreamer</a>
        <br>
        10 feb 2023, 07:46:44 COT
      </div>
      <div class="content-cell mdl-cell mdl-cell--6-col mdl-typography--body-1 mdl-typography--text-right"></div>
      <div class="content-cell mdl-cell mdl-cell--12-col mdl-typography--caption">
        <b>Productos:</b>
        <br>
        &emsp;YouTube

        <br>
        <b>¿Por qué se grabó esta actividad?</b>
        <br>
        &emsp;Se guardó esta actividad en tu cuenta de Google porque estaban habilitadas las siguientes opciones de configuración:&nbsp;Historial de reproducciones de YouTube.&nbsp;Puedes controlar estas opciones de configuración <a href="https://myaccount.google.com/activitycontrols">aquí</a>.

      </div>
    </div>
  </div>
</div>
[...]

This is the code I’m using:

from lxml import etree, html

parser = etree.HTMLParser()
tree = etree.parse("/content/historial de reproducciones.html", parser)

# Get the divs that contain each entry:
divs = tree.xpath("//div[@class='content-cell mdl-cell mdl-cell--6-col mdl-typography--body-1']")

# Store the extracted records here (a list of dicts):
js_data = []

# Loop over the divs and extract these values:
# > URL of the video
# > Video title
# > URL of the channel
# > Channel title
# > Watched time: the date on which I watched the video.
# NOTE: The takeout includes removed/deleted videos that do not
# have any links; in that case, the row in the DataFrame is empty.
# If "watched_time" is obtained, that row should contain ONLY the "watched_time" value, which is the desired output for those cases.
for ind, div in enumerate(divs):
  temp_links = div.xpath("a/@href")
  temp_links_texts = div.xpath("a")

  # xpath() always returns a list (never None), so test its length instead:
  if len(temp_links) >= 2:

    # I've omitted some code that fills "temp_links_texts"; it isn't related
    # to the problem of getting "watched_time".
    video_link = temp_links[0]
    channel_link = temp_links[1]
    video_text = temp_links_texts[0].text
    channel_name = temp_links_texts[1].text

    # When I try to get the date, I can't get the desired text (10 feb 2023, 08:03:13 COT);
    # I only get the "Has visto " text instead.
    watched_time = ""  # placeholder: this is the value I don't know how to extract
    
    # Append the data: 
    js_data.append({ 
        "Video" : video_text, #OK
        "Link": video_link, #OK
        "Channel": channel_name, 
        "URL": channel_link,
        "Watched" : watched_time
    })
    
    # Clear variables: 
    video_link = ""
    video_text = ""
    channel_link = ""
    channel_name = ""
    watched_time = ""

These are the actual results, shortened for demonstration purposes:

index	Video	Link	Channel	URL
0	El BOCON que NINGUNEO a Chavez y fue HUMILLADO frente a 130 000 fanáticos | Chavez vs Haugen	https://www.youtube.com/watch?v=zj7kUzvqNBk	El Rayo Deportivo	https://www.youtube.com/channel/UCivSw2EdxpUfH7vJ21Ob7NA
1	No hay legado más grande, que el que dura para siempre. #LEOGACY	https://www.youtube.com/watch?v=VArBMOvwiNE	Gatorade Colombia	https://www.youtube.com/@gatoradecolombia9204
2	Noraver Gripa Fast – Descongestiona las vías respiratorias y elimina los demás síntomas de la gripa	https://www.youtube.com/watch?v=VLr8hPtyrIU	Noraver Colombia	https://www.youtube.com/@noravercolombia3979
3	Resident Evil 6 Mercenaries No Mercy Requiem for War 2757k Claire Redfield (Cowgirl) PC 1080p	https://www.youtube.com/watch?v=YM9kkSFwmg0	Radical Dreamer	https://www.youtube.com/channel/UCqPkDBlhiVGe07S3CZOQGWA
4	Mega Retrospectiva: Los Simpson Hit & Run (Parte 4)	https://www.youtube.com/watch?v=Xh9p5fbUgY8	Max Power	https://www.youtube.com/channel/UCZ99bYEb57kXEIaSALrj_6A
5	Gana más con Cabify	https://www.youtube.com/watch?v=gTwtHdKuozY	Cabify	https://www.youtube.com/@Cabifys
6	ESTOS DARKLORDS SI PEGAN DURO	https://www.youtube.com/watch?v=jwahSkgYhYY	Duel Random L 2.0	https://www.youtube.com/channel/UCkty7zBHmasLAwIBc9cgiCA
7	Office Space – fixed the glitch	https://www.youtube.com/watch?v=BUE0PPQI3is	Peter Ghosh	https://www.youtube.com/channel/UCqNUlAWLXWLf9ZwTRGX6Xqw
8	Llegamos a más de 800 destinos y a millones de hogares en Colombia	https://www.youtube.com/watch?v=Tq0LDrLZzEI	Homecenter Colombia	https://www.youtube.com/@homecentercolombia
9	Iron Man: Archivo de Monstruos | Mech Strike	https://www.youtube.com/watch?v=JHa0rLQh4cE	Marvel HQ LA	https://www.youtube.com/@MarvelHQLA
10	Review Temporada 31: Esto no necesitaba dos partes	https://www.youtube.com/watch?v=UUYebZN4Xv8	Max Power	https://www.youtube.com/channel/UCZ99bYEb57kXEIaSALrj_6A
11	Vuélvete Todo Claro y recibe más beneficios sin pagar más	https://www.youtube.com/watch?v=a9pverB6qAU	Claro Colombia	https://www.youtube.com/@ClaroColombia
12	Atomic Heart = Russian Propaganda?	https://www.youtube.com/watch?v=AzDCQzH1sO0	Setarko	https://www.youtube.com/channel/UCcznKF2NAgexfaZtsOz7lFw
13	Mortal Kombat as an 80’s Dark Fantasy Film	https://www.youtube.com/watch?v=gms5pD-EbMk	SON	https://www.youtube.com/channel/UCr0lGe_WRgMxXEQtmmEdsCA
14	Pirates of the Caribbean as an 80’s Dark Fantasy Film	https://www.youtube.com/watch?v=2z36d5eiCcs	SON	https://www.youtube.com/channel/UCr0lGe_WRgMxXEQtmmEdsCA
15	What Happens When You Are a Coomer	https://www.youtube.com/watch?v=DhXmO64ut4I	Lord Wojak	https://www.youtube.com/channel/UCULzpwL5TRDydBF_bWfJjrw
16	Wagering our RAREST Yu-Gi-Oh Cards in a Duel! Rare Hunters 11 – Ancient Sanctuary	https://www.youtube.com/watch?v=x29ZRFdCpfE	Team APS	https://www.youtube.com/channel/UCaqlCjzSFmunjBOP4EmYpxg
17	Yu-Gi-Oh – 4,900 Attack Dark Paladin	https://www.youtube.com/watch?v=XUHJIVvgkVc	GrandMasterKaiba	https://www.youtube.com/channel/UCLj6N6_WjY8upsjJVS9Osug
18	T̶E̵T̶R̸I̵S̷ B̶E̸A̶T̶B̸O̵X̵ [VOID] {2019}	https://www.youtube.com/watch?v=_xQgxk4OyKI	TimTam60.mp5	https://www.youtube.com/channel/UCsObTxoQ2g90OuxriztJyHg
19	Wojak’s New Year New Me	https://www.youtube.com/watch?v=hlShmY1XrDo	Wojak Life	https://www.youtube.com/channel/UCTi2tQgA_yu2uu_9RiyT2yA
20	Evils of Instant Gratification	https://www.youtube.com/watch?v=pWQ3LH-DTvk	Wojak Life	https://www.youtube.com/channel/UCTi2tQgA_yu2uu_9RiyT2yA

Is anything missing from the information posted here? How can the issue described above be solved?

Asked By: Marco Aurelio Fernandez Reyes

Answer 1

The viewing dates are in the tail of the second <br> element within each <div>.
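
In lxml, text that follows a child element (including a void element such as <br>) is stored on that child's .tail, while the parent's .text holds only the text before its first child. That is why reading the div's .text yields just "Has visto ". A minimal sketch using an inline fragment; the fragment and the variable names are illustrative and only modelled on one entry from the question:

```python
from lxml import html

# Illustrative fragment modelled on one entry from the question.
fragment = (
    '<div>Has visto <a href="https://www.youtube.com/watch?v=zj7kUzvqNBk">Title</a>'
    '<br><a href="https://www.youtube.com/channel/UCivSw2EdxpUfH7vJ21Ob7NA">Channel</a>'
    '<br>10 feb 2023, 08:03:13 COT</div>'
)
div = html.fromstring(fragment)

# .text is only the text that precedes the first child element.
leading_text = div.text

# Text that follows a child element is stored on that child's .tail.
second_br = div.xpath("br")[1]
watched = second_br.tail.strip()

print(repr(leading_text))
print(watched)
```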

Demo (test.html contains the HTML in the question):

from lxml import html

tree = html.parse("test.html") 
divs = tree.xpath("//div[@class='content-cell mdl-cell mdl-cell--6-col mdl-typography--body-1']")

for div in divs:
    br_elems = div.xpath("br")
    print(br_elems[1].tail.strip())

Output:

10 feb 2023, 08:03:13 COT
10 feb 2023, 08:02:40 COT
10 feb 2023, 08:02:02 COT
10 feb 2023, 07:46:44 COT
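
The question's other two concerns, the character encoding and entries that lack links, could be handled in the same loop. The following is only a sketch under stated assumptions: it assumes the takeout file is UTF-8 encoded and that the date is always the last non-empty text node directly inside the div. The function name extract_rows and the sample bytes are hypothetical and exist only for illustration. Passing encoding="utf-8" to the parser is one way to avoid editing the file to add a <meta charset> tag.

```python
import io

from lxml import etree

CELL_CLASS = "content-cell mdl-cell mdl-cell--6-col mdl-typography--body-1"


def extract_rows(source):
    """Return one dict per history entry; `source` is a path or a file-like object."""
    # Assumption: the file is UTF-8. Telling the parser so means the file
    # itself does not need a <meta charset> tag.
    parser = etree.HTMLParser(encoding="utf-8")
    tree = etree.parse(source, parser)

    rows = []
    for div in tree.xpath(f"//div[@class='{CELL_CLASS}']"):
        links = div.xpath("a")
        # Direct text nodes of the div, with whitespace-only nodes dropped.
        texts = [t.strip() for t in div.xpath("text()") if t.strip()]

        # Entries may have no channel link, or no links at all.
        video = links[0] if len(links) > 0 else None
        channel = links[1] if len(links) > 1 else None

        rows.append({
            "Video": video.text if video is not None else "",
            "Link": video.get("href") if video is not None else "",
            "Channel": channel.text if channel is not None else "",
            "URL": channel.get("href") if channel is not None else "",
            # Assumption: the date is the last non-empty text node.
            "Watched": texts[-1] if texts else "",
        })
    return rows


# Illustrative input modelled on an entry from the question (no channel link).
sample = """<html><body>
<div class="content-cell mdl-cell mdl-cell--6-col mdl-typography--body-1">
  Has visto <a href="https://www.youtube.com/watch?v=VLr8hPtyrIU">Noraver Gripa Fast - Descongestiona las vías respiratorias y elimina los demás síntomas de la gripa</a>
  <br>
  Visto a las 08:02
  <br>
  10 feb 2023, 08:02:02 COT
</div>
</body></html>""".encode("utf-8")

rows = extract_rows(io.BytesIO(sample))
print(rows[0]["Video"])
print(rows[0]["Watched"])
```

The resulting list of dicts can then be passed to pandas.DataFrame(rows) to build the table. Taking the last text node instead of a fixed <br> index is a design choice that should keep working if an entry has more or fewer <br> elements, as long as the date stays last.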

Answered By: mzjn
