How to bypass bot detection and scrape a website using Python

Question:

The problem

I am new to web scraping, and I am trying to create a scraper that takes a playlist link and gets the list of songs and their artists.

But the site kept rejecting my connection because it thought I was a bot, so I used fake_useragent's UserAgent to generate a fake user-agent string and try to bypass the filter.

It sort of worked? The problem is that when you visit the website in a browser you can see the contents of the playlist, but when you extract the HTML with requests, the playlist section is just a big blank space.

Maybe I have to wait for the page to load? Or is there a stronger bot filter?

My code

import requests
from bs4 import BeautifulSoup
from fake_useragent import UserAgent

ua = UserAgent()

melon_site = "http://kko.to/IU8zwNmjM"

# send a randomized user-agent string to look less like a bot
headers = {'User-Agent': ua.random}
result = requests.get(melon_site, headers=headers)

print(result.status_code)
src = result.content
soup = BeautifulSoup(src, 'html.parser')
print(soup)

Link of website

playlist link

html I get when using requests

html with blank space where the playlist was supposed to be

Asked By: Andy_ye


Answers:


Points to remember while scraping:

1) Use a good user agent. ua.random may be returning a user agent that the server blocks.

2) If you are doing a lot of scraping, slow your pace with time.sleep() so the server isn't flooded by your IP address; otherwise it will block you.

3) If the server still blocks you, try rotating IPs.

Answered By: Sharyar Vohra

You want to check out this link to get the content you wish to grab.

The following attempt should fetch you the artist names and their song names.

import requests
from bs4 import BeautifulSoup

# direct URL of the endpoint that serves the playlist's song list
url = 'https://www.melon.com/mymusic/playlist/mymusicplaylistview_listSong.htm?plylstSeq=473505374'

r = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
soup = BeautifulSoup(r.text, "html.parser")
for item in soup.select("tr:has(#artistName)"):
    artist_name = item.select_one("#artistName > a[href*='goArtistDetail']")['title']
    song = item.select_one("a[href*='playSong']")['title']
    print(artist_name, song)

The output looks like this (the Korean text comes from the link titles: 페이지 이동 means "go to page" and 재생 - 새 창 means "play - new window"):

Martin Garrix - 페이지 이동 Used To Love (feat. Dean Lewis) 재생 - 새 창
Post Malone - 페이지 이동 Circles 재생 - 새 창
Marshmello - 페이지 이동 Here With Me 재생 - 새 창
Coldplay - 페이지 이동 Cry Cry Cry 재생 - 새 창

Note: your BeautifulSoup version should be 4.7.0 or later in order for the script to support the :has() pseudo-selector.
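If you are stuck on an older BeautifulSoup, the same rows can be matched without :has() by walking every tr and skipping rows that lack an artist cell. This is a sketch against a made-up HTML snippet that mimics the playlist markup (the real page repeats id="artistName" on every row):

```python
from bs4 import BeautifulSoup

# Hypothetical sample imitating the structure of the Melon playlist table.
sample_html = """
<table>
  <tr>
    <td id="artistName"><a href="javascript:goArtistDetail('1')" title="Post Malone">Post Malone</a></td>
    <td><a href="javascript:playSong('1')" title="Circles">Circles</a></td>
  </tr>
  <tr>
    <td id="artistName"><a href="javascript:goArtistDetail('2')" title="Coldplay">Coldplay</a></td>
    <td><a href="javascript:playSong('2')" title="Cry Cry Cry">Cry Cry Cry</a></td>
  </tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")
pairs = []
# Walk every <tr> and skip rows without an artist cell -- this works on
# BeautifulSoup versions older than 4.7.0 too.
for row in soup.find_all("tr"):
    cell = row.find(id="artistName")
    if cell is None:
        continue
    artist = cell.find("a", href=lambda h: h and "goArtistDetail" in h)["title"]
    song = row.find("a", href=lambda h: h and "playSong" in h)["title"]
    pairs.append((artist, song))
    print(artist, song)
```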

Answered By: SIM

That’s because the playlist is loaded via JavaScript API calls AFTER the page has loaded in an actual web browser and the document.ready event has fired. At a quick glance it’s probably "/mymusic/common/mymusiccommon_copyPlaylist.json".

BeautifulSoup does static page manipulation: requests downloads the HTML and BeautifulSoup loads it into a DOM to make extraction of data easier, but IT WILL NOT RUN DYNAMIC WEBPAGES. You need a headless web browser for that, like Selenium, Puppeteer (pptr), or Playwright, which will run the JavaScript code and make the myriad of EXTRA calls websites do to fetch the rest of the actual content. I don’t think they are actually using ANY bot detection.
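A minimal sketch of that approach with Selenium, assuming Chrome and chromedriver are installed; the fixed sleep is a crude stand-in for a proper WebDriverWait, and the selectors are the ones from the answer above:

```python
import time

from bs4 import BeautifulSoup

def fetch_rendered_html(url):
    """Render the page in headless Chrome so the playlist JavaScript runs."""
    # Selenium is imported lazily so extract_pairs below also works without it.
    from selenium import webdriver
    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        time.sleep(5)  # crude: give the playlist XHR time to finish
        return driver.page_source
    finally:
        driver.quit()

def extract_pairs(html):
    """Pull (artist, song) title pairs out of already-rendered playlist HTML."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        (row.select_one("#artistName > a[href*='goArtistDetail']")["title"],
         row.select_one("a[href*='playSong']")["title"])
        for row in soup.select("tr:has(#artistName)")
    ]
```

Usage would be pairs = extract_pairs(fetch_rendered_html(melon_site)).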

Answered By: LordWabbit