Beautiful Soup 4: getting output as ['link1'] ['link2'] ['link3']. How to change it to the required format ['link1', 'link2', 'link3']?

Question

With Beautiful Soup 4 I am getting output like ['link1'] ['link2'] ['link3']. How can I change it to the required format, for example ['link1', 'link2', 'link3']?

This is the output I am getting:

['link1']
['link2']
['link3']

I need the output in the format below so that I can build a data frame from it. What do I need to do?

['link1', 'link2', 'link3']

An explanation with code is also fine. Please help me solve this issue. Thanks in advance.

My code

from csv import writer
import requests
from bs4 import BeautifulSoup
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:96.0) Gecko/20100101 Firefox/96.0'}
HOST = 'https://www.zocdoc.com'
#PAGE = 'gastroenterologists/2'
web_page = 'https://www.zocdoc.com/search?address=Houston%2C%20TX&insurance_carrier=&city=Houston&date_searched_for=&day_filter=AnyDay&filters=%7B%7D&gender=-1&language=-1&latitude=29.7604267&locationType=placemark&longitude=-95.3698028&offset=1&insurance_plan=-1&reason_visit=386&search_query=Gastroenterologist&searchType=specialty&sees_children=false&after_5pm=false&before_10am=false&sort_type=Default&dr_specialty=106&state=TX&visitType=inPersonVisit&&timesgridType='
with requests.Session() as session:
    (r := session.get(HOST, headers=headers)).raise_for_status()
    #(r := session.get(f'{HOST}/{PAGE}', headers=headers)).raise_for_status()
    (r := session.get(web_page, headers=headers)).raise_for_status()
    # process content from here
print(r.text)
soup = BeautifulSoup(r.text, 'lxml')
print(soup.prettify())

Code 1, which prints each link as a plain string:

for item in soup.find_all('img'):
    images = []
    items = (item['src'])
    images = 'https:'+items
    print(images)

Code 2, which produces the output format shown below:

for item in soup.find_all('img'):
    c = []
    items = (item['src'])
    image = ('https:'+items)
    c.append(image)
    print(c)

Output:

['link1']
.
.
['linkn']

Asked By: Rabiyulfahim

Answer 1

The reason is that you create a new list on every iteration of the for loop, so the previous one is overwritten each time. Define the list once before the loop and append to it, as shown below.

images = []
for item in soup.find_all('img'):
    src = item['src']
    # src is a single string, so append it as one element
    images.append(f"https:{src}")
print(images)
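
To see why the placement of the list matters, here is a minimal sketch that needs no network access. The `srcs` values are made-up stand-ins for what `item['src']` would return; they are not taken from the real page.

```python
# Hypothetical stand-ins for the item['src'] values found on a page.
srcs = ['//example.com/a.png', '//example.com/b.png', '//example.com/c.png']

# List created inside the loop: it is reset on every pass,
# so each pass ends with a one-element list.
per_iteration = []
for src in srcs:
    c = []
    c.append('https:' + src)
    per_iteration.append(list(c))  # record what print(c) would have shown

# List created once before the loop: it keeps growing.
images = []
for src in srcs:
    images.append('https:' + src)

print(per_iteration)  # three separate one-element lists
print(images)         # one list holding all three links
```

The second form is the one that gives a single flat list.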

Answered By: Sam

Answer 2

You have to append the URLs to a list that is defined outside your loop, to avoid overwriting it and to get the structure you expect:

images = []
for item in soup.find_all('img'):
    images.append('https:'+item['src'])

As an alternative, you can use a list comprehension:

images = ['https:'+item['src'] for item in soup.find_all('img')]
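
Prefixing `'https:'` assumes that every `src` is protocol-relative, meaning it starts with `//`. If a page might mix in absolute or root-relative paths, one possible alternative is `urljoin` from the standard library. This is only a sketch: `BASE` reuses the `HOST` value from the question, and the `srcs` values are hypothetical.

```python
from urllib.parse import urljoin

BASE = 'https://www.zocdoc.com'  # same value as HOST in the question

# Hypothetical src values; a real page may contain any of these forms.
srcs = [
    '//cdn.example.com/a.png',        # protocol-relative
    'https://cdn.example.com/b.png',  # already absolute
    '/static/c.png',                  # root-relative
]

# urljoin fills in whatever part of the URL is missing from BASE.
images = [urljoin(BASE, src) for src in srcs]
print(images)
```

With real data the comprehension would iterate over `soup.find_all('img')` and pass `item['src']` to `urljoin` instead.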

Just a hint: avoid storing scraped information in a bunch of separate lists; use something more structured, like a list of dicts:

data = []
for item in soup.find_all('article'):
    data.append({
        'name':item.find('span',{'itemprop':'name'}).text,
        'image':'https:'+item.img['src'],
        'anyOtherInfo':'anyOtherInfo'
    })
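
A list of dicts in that shape is easy to turn into tabular output. As a rough sketch, assuming rows like the ones built above (the names and URLs here are invented for illustration), the standard-library `csv.DictWriter` can write them out with one column per key:

```python
import csv
import io

# Hypothetical rows in the same shape as the `data` list above.
data = [
    {'name': 'Dr. A', 'image': 'https://example.com/a.png'},
    {'name': 'Dr. B', 'image': 'https://example.com/b.png'},
]

buffer = io.StringIO()  # in-memory stand-in for a real file
writer = csv.DictWriter(buffer, fieldnames=['name', 'image'])
writer.writeheader()    # first row: the column names
writer.writerows(data)  # one CSV row per dict

csv_text = buffer.getvalue()
print(csv_text)
```

If pandas is installed, the same list can also be passed directly to `pandas.DataFrame(data)`, which gives one column per dict key.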

Answered By: HedgeHog
