Handling special characters and getting specific text in a large HTML file using lxml and Python
Question:
I’m using Python in Google Colab to extract my YouTube watch history and build a DataFrame from the .html file that contains it.
I’m using lxml to parse the HTML, but I’m facing the following problems:
- The text obtained with lxml does not decode special characters correctly, e.g. "á", "é", "í", emojis, etc. EDIT (17/03/2023): I’ve added the <meta charset="utf-8" /> tag to the HTML file’s content to solve this point, though I’m open to alternatives that avoid editing the file.
- I’m unable to get the date on which I viewed the video. This text is inside a div, but there is no clear or easy way to extract it. The desired result for each div is the date, for example: 10 feb 2023, 08:03:13 COT.
Here is an extract of the HTML content:
<div class="mdl-grid">
<div class="outer-cell mdl-cell mdl-cell--12-col mdl-shadow--2dp">
<div class="mdl-grid">
<div class="header-cell mdl-cell mdl-cell--12-col">
<p class="mdl-typography--title">
YouTube
<br>
</p>
</div>
<div class="content-cell mdl-cell mdl-cell--6-col mdl-typography--body-1">
Has visto <a href="https://www.youtube.com/watch?v=zj7kUzvqNBk">El BOCON que NINGUNEO a Chavez y fue HUMILLADO frente a 130 000 fanáticos | Chavez vs Haugen</a>
<br>
<a href="https://www.youtube.com/channel/UCivSw2EdxpUfH7vJ21Ob7NA">El Rayo Deportivo</a>
<br>
10 feb 2023, 08:03:13 COT
</div>
<div class="content-cell mdl-cell mdl-cell--6-col mdl-typography--body-1 mdl-typography--text-right"></div>
<div class="content-cell mdl-cell mdl-cell--12-col mdl-typography--caption">
<b>Productos:</b>
<br>
 YouTube
<br>
<b>¿Por qué se grabó esta actividad?</b>
<br>
 Se guardó esta actividad en tu cuenta de Google porque estaban habilitadas las siguientes opciones de configuración: Historial de reproducciones de YouTube. Puedes controlar estas opciones de configuración <a href="https://myaccount.google.com/activitycontrols">aquí</a>.
</div>
</div>
</div>
<div class="outer-cell mdl-cell mdl-cell--12-col mdl-shadow--2dp">
<div class="mdl-grid">
<div class="header-cell mdl-cell mdl-cell--12-col">
<p class="mdl-typography--title">
YouTube
<br>
</p>
</div>
<div class="content-cell mdl-cell mdl-cell--6-col mdl-typography--body-1">
Has visto <a href="https://www.youtube.com/watch?v=VArBMOvwiNE">No hay legado más grande, que el que dura para siempre. #LEOGACY</a>
<br>
Visto a las 08:02
<br>
10 feb 2023, 08:02:40 COT
</div>
<div class="content-cell mdl-cell mdl-cell--6-col mdl-typography--body-1 mdl-typography--text-right"></div>
<div class="content-cell mdl-cell mdl-cell--12-col mdl-typography--caption">
<b>Productos:</b>
<br>
 YouTube
<br>
<b>Detalles:</b>
<br>
 De los anuncios de Google
<br>
<b>¿Por qué se grabó esta actividad?</b>
<br>
 Se guardó esta actividad en tu cuenta de Google porque estaban habilitadas las siguientes opciones de configuración: Actividad en la Web y en Aplicaciones, Historial de reproducciones de YouTube, Historial de búsquedas de YouTube. Puedes controlar estas opciones de configuración <a href="https://myaccount.google.com/activitycontrols">aquí</a>.
</div>
</div>
</div>
<div class="outer-cell mdl-cell mdl-cell--12-col mdl-shadow--2dp">
<div class="mdl-grid">
<div class="header-cell mdl-cell mdl-cell--12-col">
<p class="mdl-typography--title">
YouTube
<br>
</p>
</div>
<div class="content-cell mdl-cell mdl-cell--6-col mdl-typography--body-1">
Has visto <a href="https://www.youtube.com/watch?v=VLr8hPtyrIU">Noraver Gripa Fast - Descongestiona las vías respiratorias y elimina los demás síntomas de la gripa</a>
<br>
Visto a las 08:02
<br>
10 feb 2023, 08:02:02 COT
</div>
<div class="content-cell mdl-cell mdl-cell--6-col mdl-typography--body-1 mdl-typography--text-right"></div>
<div class="content-cell mdl-cell mdl-cell--12-col mdl-typography--caption">
<b>Productos:</b>
<br>
 YouTube
<br>
<b>Detalles:</b>
<br>
 De los anuncios de Google
<br>
<b>¿Por qué se grabó esta actividad?</b>
<br>
 Se guardó esta actividad en tu cuenta de Google porque estaban habilitadas las siguientes opciones de configuración: Actividad en la Web y en Aplicaciones, Historial de reproducciones de YouTube, Historial de búsquedas de YouTube. Puedes controlar estas opciones de configuración <a href="https://myaccount.google.com/activitycontrols">aquí</a>.
</div>
</div>
</div>
<div class="outer-cell mdl-cell mdl-cell--12-col mdl-shadow--2dp">
<div class="mdl-grid">
<div class="header-cell mdl-cell mdl-cell--12-col">
<p class="mdl-typography--title">
YouTube
<br>
</p>
</div>
<div class="content-cell mdl-cell mdl-cell--6-col mdl-typography--body-1">
Has visto <a href="https://www.youtube.com/watch?v=YM9kkSFwmg0">Resident Evil 6 Mercenaries No Mercy Requiem for War 2757k Claire Redfield (Cowgirl) PC 1080p</a>
<br>
<a href="https://www.youtube.com/channel/UCqPkDBlhiVGe07S3CZOQGWA">Radical Dreamer</a>
<br>
10 feb 2023, 07:46:44 COT
</div>
<div class="content-cell mdl-cell mdl-cell--6-col mdl-typography--body-1 mdl-typography--text-right"></div>
<div class="content-cell mdl-cell mdl-cell--12-col mdl-typography--caption">
<b>Productos:</b>
<br>
 YouTube
<br>
<b>¿Por qué se grabó esta actividad?</b>
<br>
 Se guardó esta actividad en tu cuenta de Google porque estaban habilitadas las siguientes opciones de configuración: Historial de reproducciones de YouTube. Puedes controlar estas opciones de configuración <a href="https://myaccount.google.com/activitycontrols">aquí</a>.
</div>
</div>
</div>
</div>
[...]
This is the code I’m using:
from lxml import etree, html

parser = etree.HTMLParser()
tree = etree.parse("/content/historial de reproducciones.html", parser)

# Get the divs that contain the entries:
divs = tree.xpath("//div[@class='content-cell mdl-cell mdl-cell--6-col mdl-typography--body-1']")

# Store the data here as JSON:
js_data = []

# Loop over the divs and extract the values:
# > URL of the video
# > Video title
# > URL of the channel
# > Channel title
# > Watched time: date on which I watched the video.
# NOTE: In the takeout, there are removed/deleted videos that do not
# provide any links - in this case, the data is an empty row in the dataframe.
# If the "watched_time" is obtained, that row will instead contain ONLY the "watched_time" value, and that's the desired output for those cases.
for ind, div in enumerate(divs):
    temp_links = div.xpath("a/@href")
    temp_links_texts = div.xpath("a")

    if (temp_links is not None):
        # Here, I've omitted certain code that fills "temp_links_texts" - it's not related
        # to getting the "watched_time".
        video_link = temp_links[0]
        channel_link = temp_links[1]
        video_text = temp_links_texts[0].text
        channel_name = temp_links_texts[1].text

        # When I try to get the date, I'm unable to get the desired text = 10 feb 2023, 08:03:13 COT.
        watched_time = ""  # I only get the (Has visto ) text - instead of (10 feb 2023, 08:03:13 COT).

        # Append the data:
        js_data.append({
            "Video": video_text,  # OK
            "Link": video_link,  # OK
            "Channel": channel_name,
            "URL": channel_link,
            "Watched": watched_time
        })

    # Clear variables:
    video_link = ""
    video_text = ""
    channel_link = ""
    channel_name = ""
    watched_time = ""
These are the actual results, shortened for demonstration purposes:
index | Video | Link | Channel | URL | Watched |
---|---|---|---|---|---|
0 | El BOCON que NINGUNEO a Chavez y fue HUMILLADO frente a 130 000 fanáticos \| Chavez vs Haugen | https://www.youtube.com/watch?v=zj7kUzvqNBk | El Rayo Deportivo | https://www.youtube.com/channel/UCivSw2EdxpUfH7vJ21Ob7NA | |
1 | No hay legado más grande, que el que dura para siempre. #LEOGACY | https://www.youtube.com/watch?v=VArBMOvwiNE | Gatorade Colombia | https://www.youtube.com/@gatoradecolombia9204 | |
2 | Noraver Gripa Fast – Descongestiona las vías respiratorias y elimina los demás síntomas de la gripa | https://www.youtube.com/watch?v=VLr8hPtyrIU | Noraver Colombia | https://www.youtube.com/@noravercolombia3979 | |
3 | Resident Evil 6 Mercenaries No Mercy Requiem for War 2757k Claire Redfield (Cowgirl) PC 1080p | https://www.youtube.com/watch?v=YM9kkSFwmg0 | Radical Dreamer | https://www.youtube.com/channel/UCqPkDBlhiVGe07S3CZOQGWA | |
4 | Mega Retrospectiva: Los Simpson Hit & Run (Parte 4) | https://www.youtube.com/watch?v=Xh9p5fbUgY8 | Max Power | https://www.youtube.com/channel/UCZ99bYEb57kXEIaSALrj_6A | |
5 | Gana más con Cabify | https://www.youtube.com/watch?v=gTwtHdKuozY | Cabify | https://www.youtube.com/@Cabifys | |
6 | ESTOS DARKLORDS SI PEGAN DURO | https://www.youtube.com/watch?v=jwahSkgYhYY | Duel Random L 2.0 | https://www.youtube.com/channel/UCkty7zBHmasLAwIBc9cgiCA | |
7 | Office Space – fixed the glitch | https://www.youtube.com/watch?v=BUE0PPQI3is | Peter Ghosh | https://www.youtube.com/channel/UCqNUlAWLXWLf9ZwTRGX6Xqw | |
8 | Llegamos a más de 800 destinos y a millones de hogares en Colombia | https://www.youtube.com/watch?v=Tq0LDrLZzEI | Homecenter Colombia | https://www.youtube.com/@homecentercolombia | |
9 | Iron Man: Archivo de Monstruos \| Mech Strike | https://www.youtube.com/watch?v=JHa0rLQh4cE | Marvel HQ LA | https://www.youtube.com/@MarvelHQLA | |
10 | Review Temporada 31: Esto no necesitaba dos partes | https://www.youtube.com/watch?v=UUYebZN4Xv8 | Max Power | https://www.youtube.com/channel/UCZ99bYEb57kXEIaSALrj_6A | |
11 | Vuélvete Todo Claro y recibe más beneficios sin pagar más | https://www.youtube.com/watch?v=a9pverB6qAU | Claro Colombia | https://www.youtube.com/@ClaroColombia | |
12 | Atomic Heart = Russian Propaganda? | https://www.youtube.com/watch?v=AzDCQzH1sO0 | Setarko | https://www.youtube.com/channel/UCcznKF2NAgexfaZtsOz7lFw | |
13 | Mortal Kombat as an 80’s Dark Fantasy Film | https://www.youtube.com/watch?v=gms5pD-EbMk | SON | https://www.youtube.com/channel/UCr0lGe_WRgMxXEQtmmEdsCA | |
14 | Pirates of the Caribbean as an 80’s Dark Fantasy Film | https://www.youtube.com/watch?v=2z36d5eiCcs | SON | https://www.youtube.com/channel/UCr0lGe_WRgMxXEQtmmEdsCA | |
15 | What Happens When You Are a Coomer | https://www.youtube.com/watch?v=DhXmO64ut4I | Lord Wojak | https://www.youtube.com/channel/UCULzpwL5TRDydBF_bWfJjrw | |
16 | Wagering our RAREST Yu-Gi-Oh Cards in a Duel! Rare Hunters 11 – Ancient Sanctuary | https://www.youtube.com/watch?v=x29ZRFdCpfE | Team APS | https://www.youtube.com/channel/UCaqlCjzSFmunjBOP4EmYpxg | |
17 | Yu-Gi-Oh – 4,900 Attack Dark Paladin | https://www.youtube.com/watch?v=XUHJIVvgkVc | GrandMasterKaiba | https://www.youtube.com/channel/UCLj6N6_WjY8upsjJVS9Osug | |
18 | T̶E̵T̶R̸I̵S̷ B̶E̸A̶T̶B̸O̵X̵ [VOID] {2019} | https://www.youtube.com/watch?v=_xQgxk4OyKI | TimTam60.mp5 | https://www.youtube.com/channel/UCsObTxoQ2g90OuxriztJyHg | |
19 | Wojak’s New Year New Me | https://www.youtube.com/watch?v=hlShmY1XrDo | Wojak Life | https://www.youtube.com/channel/UCTi2tQgA_yu2uu_9RiyT2yA | |
20 | Evils of Instant Gratification | https://www.youtube.com/watch?v=pWQ3LH-DTvk | Wojak Life | https://www.youtube.com/channel/UCTi2tQgA_yu2uu_9RiyT2yA | |
Is anything missing from the information posted here? How can the issue described above be solved?
Answers:
The viewing dates are in the tail of the second <br> element within each <div>.
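As background on why reading the div's own text returns only "Has visto": in lxml, an element's text is just the text before its first child, and any text that follows a child element is stored on that child as its tail. The sketch below illustrates this on a hypothetical one-entry fragment modelled on the HTML in the question (the link texts "Video" and "Channel" are placeholders, not the real titles).

```python
from lxml import html

# Hypothetical one-entry fragment modelled on the HTML in the question.
snippet = (
    '<div class="content-cell">'
    'Has visto <a href="https://www.youtube.com/watch?v=zj7kUzvqNBk">Video</a>'
    '<br>'
    '<a href="https://www.youtube.com/channel/UCivSw2EdxpUfH7vJ21Ob7NA">Channel</a>'
    '<br>'
    '10 feb 2023, 08:03:13 COT'
    '</div>'
)

div = html.fromstring(snippet)

# .text holds only the text that precedes the first child element.
leading_text = div.text

# Each child's .tail holds the text between that child and the next element
# (or the end of the parent). It is None when there is no such text.
tails = [(child.tag, (child.tail or "").strip()) for child in div]

# The date is therefore the tail of the last <br> in the div.
date_text = div.xpath("br")[-1].tail.strip()

print(repr(leading_text))
print(tails)
print(date_text)
```

Only the last `<br>` carries a non-empty tail in this fragment, which is where the date lives.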
Demo (test.html contains the HTML in the question):
from lxml import html

tree = html.parse("test.html")
divs = tree.xpath("//div[@class='content-cell mdl-cell mdl-cell--6-col mdl-typography--body-1']")

for div in divs:
    br_elems = div.xpath("br")
    print(br_elems[1].tail.strip())
Output:
10 feb 2023, 08:03:13 COT
10 feb 2023, 08:02:40 COT
10 feb 2023, 08:02:02 COT
10 feb 2023, 07:46:44 COT
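Putting the pieces together, here is a minimal sketch of how the rows from the question could be collected with the date included. It makes a few assumptions that are not confirmed by the question: that the Takeout file is UTF-8 encoded (so declaring the encoding on the parser may remove the need to add a meta tag to the file), that the date is always the tail of the last <br> in the div, and that entries with fewer than two links should keep empty fields rather than fail. The function name `parse_history` and the `SAMPLE` fragment are hypothetical.

```python
import io
from lxml import etree

# Class string copied from the question; the exact-match XPath is kept as-is.
CELL_XPATH = (
    "//div[@class='content-cell mdl-cell mdl-cell--6-col "
    "mdl-typography--body-1']"
)


def parse_history(source):
    """Hypothetical helper: return one dict per history entry.

    `source` may be a file path or a binary file-like object.
    """
    # Assumption: the file is UTF-8. Declaring that on the parser is meant
    # to avoid editing the file to add a <meta charset> tag.
    parser = etree.HTMLParser(encoding="utf-8")
    tree = etree.parse(source, parser)

    rows = []
    for div in tree.xpath(CELL_XPATH):
        links = div.xpath("a")
        br_elems = div.xpath("br")

        # Some entries (ads, possibly removed videos) have fewer than two links.
        video = links[0] if len(links) > 0 else None
        channel = links[1] if len(links) > 1 else None

        # Assumption: the date is the tail of the last <br> in the div.
        watched = (br_elems[-1].tail or "").strip() if br_elems else ""

        rows.append({
            "Video": video.text if video is not None else None,
            "Link": video.get("href") if video is not None else None,
            "Channel": channel.text if channel is not None else None,
            "URL": channel.get("href") if channel is not None else None,
            "Watched": watched,
        })
    return rows


# Small in-memory sample based on the second entry in the question.
SAMPLE = (
    '<div class="content-cell mdl-cell mdl-cell--6-col mdl-typography--body-1">'
    'Has visto <a href="https://www.youtube.com/watch?v=VArBMOvwiNE">'
    'No hay legado más grande, que el que dura para siempre. #LEOGACY</a>'
    '<br>Visto a las 08:02<br>10 feb 2023, 08:02:40 COT</div>'
).encode("utf-8")

rows = parse_history(io.BytesIO(SAMPLE))
print(rows[0])
```

The returned list of dicts can be passed to `pandas.DataFrame(rows)` to build the table. Taking the last `<br>` rather than the second one is a design choice intended to tolerate entries with a different number of line breaks; it should be checked against the full file.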