Data scraping using Python
Question:
I wrote a Python script to scrape the phone names from this URL: https://www.jumia.com.ng/mobile-phones/
Here is my script:
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
my_url = 'https://www.jumia.com.ng/mobile-phones/'
uClient = uReq(my_url)  # open the connection and fetch the page
page_html = uClient.read()  # load the content into a variable
uClient.close()  # close the connection
page_soup = soup(page_html, "html.parser")  # parse the HTML
phone_name = page_soup.findAll("span", {"class": "name"})  # grab each phone-name span
print(phone_name)
My expected result should be something like this:
Marathon M5 Mini 5.0-Inch IPS (2GB, 16GB ROM) Android 5.1 Lollipop, 13MP + 8MP Smartphone - Grey
but what I get is this:
<span class="name" dir="ltr">Marathon M5 Mini 5.0-Inch IPS (2GB, 16GB ROM) Android 5.1 Lollipop, 13MP + 8MP Smartphone - Grey</span>
How do I extract just the text from this span?
Answers:
To extract the name, use the .text attribute of each matched tag:
>>> for phone_name in page_soup.findAll("span", {"class": "name"}):
...     print(phone_name.text)
...
Boom J8 5.5 Inch (2GB, 16GB ROM) Android Lollipop 5.1 13MP + 5MP Smartphone - White (MWFS)
Marathon M5 Mini 5.0-Inch IPS (2GB, 16GB ROM) Android 5.1 Lollipop, 13MP + 8MP Smartphone - Grey
Your script should therefore look like this:
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
my_url = 'https://www.jumia.com.ng/mobile-phones/'
uClient = uReq(my_url)  # open the connection and fetch the page
page_html = uClient.read()  # load the content into a variable
uClient.close()  # close the connection
page_soup = soup(page_html, "html.parser")  # parse the HTML
# print the text of each phone-name span instead of the whole tag
for phone_name in page_soup.findAll("span", {"class": "name"}):
    print(phone_name.text)
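To try the extraction without hitting the live site, here is a minimal offline sketch. The SAMPLE_HTML string and the extract_names helper are hypothetical names introduced only for illustration; the markup is assumed to resemble the span shown in the question, and the real page may differ. It uses get_text(strip=True), which behaves like .text but also trims surrounding whitespace.

```python
from bs4 import BeautifulSoup

# Hypothetical inline sample standing in for the downloaded page,
# so the example runs without a network connection.
SAMPLE_HTML = """
<div>
  <span class="name" dir="ltr">Boom J8 5.5 Inch Smartphone - White</span>
  <span class="name" dir="ltr">  Marathon M5 Mini Smartphone - Grey  </span>
  <span class="brand">Example</span>
</div>
"""


def extract_names(html):
    """Return the stripped text of every <span class="name"> in html."""
    page_soup = BeautifulSoup(html, "html.parser")
    # find_all is the current spelling of findAll; class_ filters by CSS class
    spans = page_soup.find_all("span", class_="name")
    # get_text(strip=True) returns the tag's text with whitespace trimmed
    return [span.get_text(strip=True) for span in spans]


names = extract_names(SAMPLE_HTML)
for name in names:
    print(name)
```

Returning a list rather than printing inside the loop makes it easier to reuse the names later, for example to write them to a file.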
I know this question is old, but for anyone facing a similar challenge, there is a Jumia scraper I built on Apify: https://apify.com/microworlds/jumia-scraper