Scraping issue with id_tag
Question:
I’m trying to extract data from a website with BeautifulSoup.
I’m currently stuck on this markup:
"Trad. de l’anglais par <a href="/searchinternet/advanced?all_authors_id=35534&SearchAction=1">Camille Fabien</a>"
I want to get the names of the translators, but the href of each link contains their id.
My code is:
translater = soup.find_all("a", href="/searchinternet/advanced?all_authors_id=")
I also tried str.startswith, but it doesn’t work.
Can someone help me, please?
Answers:
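Provided your HTML is correct and static (not loaded by JavaScript after the initial page load), this is one way to select that link (or those links). The CSS attribute selector a[href^="..."] matches any href that starts with the given prefix, whereas find_all("a", href="...") with a plain string only matches the exact attribute value: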
from bs4 import BeautifulSoup as bs
html = '''<p>Trad. de l'anglais par <a href="/searchinternet/advanced?all_authors_id=35534&SearchAction=1">Camille Fabien </a></p>'''
soup = bs(html, 'html.parser')
a = soup.select('a[href^="/searchinternet/advanced?all_authors_id="]')
print(a[0])
print(a[0].get_text(strip=True))
print(a[0].get('href'))
Result in terminal:
<a href="/searchinternet/advanced?all_authors_id=35534&SearchAction=1">Camille Fabien </a>
Camille Fabien
/searchinternet/advanced?all_authors_id=35534&SearchAction=1
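Since the question mentions trying str.startswith, here is a minimal sketch of how a prefix match can also be expressed with find_all itself, by passing either a function or a compiled regular expression as the href filter. The second link ("/other") and the variable names are made up for illustration; only the first link comes from the question.

```python
import re
from bs4 import BeautifulSoup

# The first link is from the question; the "/other" link is a hypothetical
# extra, added only to show that non-matching links are filtered out.
html = '''<p>Trad. de l'anglais par <a href="/searchinternet/advanced?all_authors_id=35534&SearchAction=1">Camille Fabien </a> <a href="/other">Other link</a></p>'''
soup = BeautifulSoup(html, "html.parser")

prefix = "/searchinternet/advanced?all_authors_id="

# Option 1: a function filter. It receives the attribute value,
# which is None when the tag has no href at all.
by_function = soup.find_all("a", href=lambda h: h is not None and h.startswith(prefix))

# Option 2: a compiled regex anchored at the start of the value.
by_regex = soup.find_all("a", href=re.compile("^" + re.escape(prefix)))

names_by_function = [a.get_text(strip=True) for a in by_function]
names_by_regex = [a.get_text(strip=True) for a in by_regex]
print(names_by_function)
print(names_by_regex)
```

Both options should select the same links as the select() call above; which one to use is mostly a matter of taste.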
EDIT: Who doesn’t like a challenge? Based on further comments from the OP, here is a way to obtain titles, authors, translators, and illustrators from that page, allowing for one or more translators and one or more illustrators per book:
from bs4 import BeautifulSoup as bs
import requests
import pandas as pd
headers = {
'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36"
}
url = 'https://www.gallimard.fr/searchinternet/advanced/(editor_brand_id)/1/(fserie)/FOLIO-JUNIOR+LIVRE+HEROS%3A%3AFolio+Junior+-+Un+Livre+dont+Vous+%C3%AAtes+le+H%C3%A9ros+%40+DEFIS+FANTASTIQ%3A%3AS%C3%A9rie+D%C3%A9fis+Fantastiques/(limit)/3?date%5Bfrom%5D=1980-01-01&date%5Bto%5D=1995-01-01&SearchAction=OK'
big_list = []
r = requests.get(url, headers=headers)
soup = bs(r.text, 'html.parser')
items = soup.select('div[class="results bg_white"] > table div[class="item"]')
for i in items:
    title = i.select_one('div[class="title"] h3')
    author = i.select_one('div[class="author"] a')
    history = i.select_one('p[class="collective_work_entries"]')
    # The text node containing "Illustrations" is the pivot:
    # links before it are translators, links after it are illustrators.
    translators = [[y.get_text() for y in x.find_previous_siblings('a')] for x in history.contents if "Illustrations" in x]
    illustrators = [[y.get_text() for y in x.find_next_siblings('a')] for x in history.contents if "Illustrations" in x]
    # Flatten the nested lists and join multiple names with commas.
    big_list.append((title.text.strip(), author.text.strip(), ', '.join([x for y in translators for x in y]), ', '.join([x for y in illustrators for x in y])))
df = pd.DataFrame(big_list, columns = ['Title', 'Author', 'Translator(s)', 'Illustrator(s)'])
print(df)
Result in terminal:
| | Title | Author | Translator(s) | Illustrator(s) |
|---|---|---|---|---|
| 0 | Le Sépulcre des Ombres | Jonathan Green | Noël Chassériau | Alan Langford |
| 1 | La Légende de Zagor | Ian Livingstone | Pascale Houssin | Martin McKenna |
| 2 | Les Mages de Solani | Keith Martin | Noël Chassériau | Russ Nicholson |
| 3 | Le Siège de Sardath | Keith P. Phillips | Yannick Surcouf | Pete Knifton |
| 4 | Retour à la Montagne de Feu | Ian Livingstone | Yannick Surcouf | Martin McKenna |
| 5 | Les Mondes de l’Aleph | Peter Darvill-Evans | Yannick Surcouf | Tony Hough |
| 6 | Les Mercenaires du Levant | Paul Mason | Mona de Pracontal | Terry Oakes |
| 7 | L’Arpenteur de la Lune | Stephen Hand | Pierre de Laubier | Martin McKenna, Terry Oakes |
| 8 | La Tour de la Destruction | Keith Martin | Mona de Pracontal | Pete Knifton |
| 9 | La Légende des Guerriers Fantômes | Stephen Hand | Alexis Galmot | Martin McKenna |
| 10 | Le Repaire des Morts-Vivants | Dave Morris | Nicolas Grenier | David Gallagher |
| 11 | L’Ancienne Prophétie | Paul Mason | Mona de Pracontal | Terry Oakes |
| 12 | La Vengeance des Démons | Jim Bambra | Mona de Pracontal | Martin McKenna |
| 13 | Le Sceptre Noir | Keith Martin | Camille Fabien | David Gallagher |
| 14 | La Nuit des Mutants | Peter Darvill-Evans | Anne Collas | Alan Langford |
| 15 | L’Élu des Six Clans | Luke Sharp | Noël Chassériau | Martin Mac Kenna, Martin McKenna |
| 16 | Le Volcan de Zamarra | Luke Sharp | Olivier Meyer | David Gallagher |
| 17 | Les Sombres Cohortes | Ian Livingstone | Noël Chassériau | Nik William |
| 18 | Le Vampire du Château Noir | Keith Martin | Mona de Pracontal | Martin McKenna |
| 19 | Le Voleur d’Âmes | Keith Martin | Mona de Pracontal | Russ Nicholson |
| 20 | Le Justicier de l’Univers | Martin Allen | Mona de Pracontal | Tim Sell |
| 21 | Les Esclaves de l’Eternité | Paul Mason | Sylvie Bonnet | Bob Harvey |
| 22 | La Créature venue du Chaos | Steve Jackson | Noël Chassériau | Alan Langford |
| 23 | Les Rôdeurs de la Nuit | Graeme Davis | Nicolas Grenier | John Sibbick |
| 24 | L’Empire des Hommes-Lézards | Marc Gascoigne | Jean Lacroix | David Gallagher |
| 25 | Les Gouffres de la Cruauté | Luke Sharp | Sylvie Bonnet | Russ Nicholson |
| 26 | Les Spectres de l’Angoisse | Robin Waterfield | Mona de Pracontal | Ian Miller |
| 27 | Le Chasseur des Étoiles | Luke Sharp | Arnaud Dupin de Beyssat | Cary Mayes, Gary Mayes |
| 28 | Les Sceaux de la Destruction | Robin Waterfield | Sylvie Bonnet | Russ Nicholson |
| 29 | La Crypte du Sorcier | Ian Livingstone | Noël Chassériau | John Sibbick |
| 30 | La Forteresse du Cauchemar | Peter Darvill-Evans | Mona de Pracontal | Dave Carson |
| 31 | La Grande Menace des Robots | Steve Jackson | Danielle Plociennik | Gary Mayes |
| 32 | L’Épée du Samouraï | Mark Smith | Pascale Jusforgues | Alan Langford |
| 33 | L’Épreuve des Champions | Ian Livingstone | Alain Vaulont, Pascale Jusforgues | Brian Williams |
| 34 | Défis Sanglants sur l’Océan | Andrew Chapman | Jean Walter | Bob Harvey |
| 35 | Les Démons des Profondeurs | Steve Jackson | Noël Chassériau | Bob Harvey |
| 36 | Rendez-vous avec la M.O.R.T. | Steve Jackson | Arnaud Dupin de Beyssat | Declan Considine |
| 37 | La Planète Rebelle | Robin Waterfield | C. Degolf | Gary Mayes |
| 38 | Les Trafiquants de Kelter | Andrew Chapman | Anne Blanchet | Nik Spender |
| 39 | Le Combattant de l’Autoroute | Ian Livingstone | Alain Vaulont, Pascale Jusforgues | Kevin Bulmer |
| 40 | Le Mercenaire de l’Espace | Andrew Chapman | Jean Walthers | Geoffroy Senior |
| 41 | Le Temple de la Terreur | Ian Livingstone | Denise May | Bill Houston |
| 42 | Le Manoir de l’Enfer | Steve Jackson | | |
| 43 | Le Marais aux Scorpions | Steve Jackson | Camille Fabien | Duncan Smith |
| 44 | Le Talisman de la Mort | Steve Jackson | Camille Fabien | Bob Harvey |
| 45 | La Sorcière des Neiges | Ian Livingstone | Michel Zénon | Edward Crosby, Gary Ward |
| 46 | La Citadelle du Chaos | Steve Jackson | Marie-Raymond Farré | Russ Nicholson |
| 47 | La Galaxie Tragique | Steve Jackson | Camille Fabien | Peter Jones |
| 48 | La Forêt de la Malédiction | Ian Livingstone | Camille Fabien | Malcolm Barter |
| 49 | La Cité des Voleurs | Ian Livingstone | Henri Robillot | Iain McCaig |
| 50 | Le Labyrinthe de la Mort | Ian Livingstone | Patricia Marais | Iain McCaig |
| 51 | L’Île du Roi Lézard | Ian Livingstone | Fabienne Vimereu | Alan Langford |
| 52 | Le Sorcier de la Montagne de Feu | Steve Jackson | Camille Fabien | Russ Nicholson |
Bear in mind that this method fails for Le Manoir de l'Enfer, because the word ‘Illustrations’ does not appear in its text. It’s down to the OP to find a solution for that one.
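One possible way to handle entries without the word ‘Illustrations’ is to walk the children of the paragraph in order and switch the current role whenever a text node mentions "Trad." or "Illustrations". This is only a sketch: the sample HTML, the author ids, and the helper name split_credits are hypothetical, and the real markup for Le Manoir de l'Enfer has not been checked, so the marker strings may need adjusting.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

def split_credits(history):
    """Assign each <a> to a role based on the most recent marker text before it.

    Hypothetical helper: assumes translators follow a text node containing
    "Trad." and illustrators follow one containing "Illustrations".
    """
    credits = {"translators": [], "illustrators": []}
    if history is None:
        return credits
    role = None
    for node in history.children:
        if isinstance(node, Tag):
            # Only links that appear after a recognised marker are recorded.
            if node.name == "a" and role is not None:
                credits[role].append(node.get_text(strip=True))
        elif isinstance(node, NavigableString):
            text = str(node)
            if "Trad." in text:
                role = "translators"
            elif "Illustrations" in text:
                role = "illustrators"
    return credits

# Made-up sample markup: one entry with both roles, one with a translator only.
sample = '''
<p class="collective_work_entries">Trad. de l'anglais par <a href="/searchinternet/advanced?all_authors_id=1">Alain Vaulont</a>, <a href="/searchinternet/advanced?all_authors_id=2">Pascale Jusforgues</a>. Illustrations de <a href="/searchinternet/advanced?all_authors_id=3">Brian Williams</a></p>
<p class="collective_work_entries">Trad. de l'anglais par <a href="/searchinternet/advanced?all_authors_id=4">Camille Fabien</a></p>
'''
sample_soup = BeautifulSoup(sample, "html.parser")
results = [split_credits(p) for p in sample_soup.select("p.collective_work_entries")]
for r in results:
    print(r)
```

Compared with pivoting on a single "Illustrations" node, this keeps names in document order and still returns the translators when no illustrator is listed.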
BeautifulSoup documentation can be found at https://beautiful-soup-4.readthedocs.io/en/latest/index.html
Also, Pandas docs can be found here: https://pandas.pydata.org/pandas-docs/stable/index.html
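from bs4 import BeautifulSoup
# Parse a local copy of the page; BeautifulSoup returns a parsed document object, not a list.
with open("./test.html", "r") as f:
    soup = BeautifulSoup(f, 'html.parser')
# Collect the text of every <a> tag in the document.
names = []
for elem in soup.find_all("a"):
    names.append(elem.text)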