How to use Python requests to fake a browser visit a.k.a and generate User Agent?
Question:
I want to get the content from this website.
If I use a browser like Firefox or Chrome I could get the real website page I want, but if I use the Python requests package (or wget
command) to get it, it returns a totally different HTML page.
I thought the developer of the website had made some blocks for this.
Question
How do I fake a browser visit by using python requests or command wget?
Answers:
Provide a User-Agent
header:
import requests
url = 'http://www.ichangtou.com/#company:data_000008.html'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
response = requests.get(url, headers=headers)
print(response.content)
FYI, here is a list of User-Agent strings for different browsers:
As a side note, there is a pretty useful third-party package called fake-useragent that provides a nice abstraction layer over user agents:
fake-useragent
Up to date simple useragent faker with real world database
Demo:
>>> from fake_useragent import UserAgent
>>> ua = UserAgent()
>>> ua.chrome
u'Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1667.0 Safari/537.36'
>>> ua.random
u'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.67 Safari/537.36'
Try doing this, using firefox as fake user agent (moreover, it’s a good startup script for web scraping with the use of cookies):
#!/usr/bin/env python2
# -*- coding: utf8 -*-
# vim_ts=4:sw=4
import cookielib, urllib2, sys
def doIt(uri):
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
page = opener.open(uri)
page.addheaders = [('User-agent', 'Mozilla/5.0')]
print page.read()
for i in sys.argv[1:]:
doIt(i)
USAGE:
python script.py "http://www.ichangtou.com/#company:data_000008.html"
I used fake UserAgent.
How to use:
from fake_useragent import UserAgent
import requests
ua = UserAgent()
print(ua.chrome)
header = {'User-Agent':str(ua.chrome)}
print(header)
url = "https://www.hybrid-analysis.com/recent-submissions?filter=file&sort=^timestamp"
htmlContent = requests.get(url, headers=header)
print(htmlContent)
Output:
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1309.0 Safari/537.17
{'User-Agent': 'Mozilla/5.0 (X11; OpenBSD i386) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36'}
<Response [200]>
The root of the answer is that the person asking the question needs to have a JavaScript interpreter to get what they are after. What I have found is I am able to get all of the information I wanted on a website in json before it was interpreted by JavaScript. This has saved me a ton of time in what would be parsing html hoping each webpage is in the same format.
So when you get a response from a website using requests really look at the html/text because you might find the javascripts JSON in the footer ready to be parsed.
I had a similar issue but I was unable to use the UserAgent
class inside the fake_useragent
module. I was running the code inside a docker container
import requests
import ujson
import random
response = requests.get('https://fake-useragent.herokuapp.com/browsers/0.1.11')
agents_dictionary = ujson.loads(response.text)
random_browser_number = str(random.randint(0, len(agents_dictionary['randomize'])))
random_browser = agents_dictionary['randomize'][random_browser_number]
user_agents_list = agents_dictionary['browsers'][random_browser]
user_agent = user_agents_list[random.randint(0, len(user_agents_list)-1)]
I targeted the endpoint used in the module. This solution still gave me a random user agent however there is the possibility that the data structure at the endpoint could change.
Answer
You need to create a header with a proper formatted User agent String, it server to communicate client-server.
You can check your own user agent Here.
Example
Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0
Mozilla/5.0 (Macintosh; Intel Mac OS X x.y; rv:42.0) Gecko/20100101 Firefox/42.0
Third party Package user_agent 0.1.9
I found this module very simple to use, in one line of code it randomly generates a User agent string.
from user_agent import generate_user_agent, generate_navigator
from pprint import pprint
print(generate_user_agent())
# 'Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.3; Win64; x64)'
print(generate_user_agent(os=('mac', 'linux')))
# 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:36.0) Gecko/20100101 Firefox/36.0'
pprint(generate_navigator())
# {'app_code_name': 'Mozilla',
# 'app_name': 'Netscape',
# 'appversion': '5.0',
# 'name': 'firefox',
# 'os': 'linux',
# 'oscpu': 'Linux i686 on x86_64',
# 'platform': 'Linux i686 on x86_64',
# 'user_agent': 'Mozilla/5.0 (X11; Ubuntu; Linux i686 on x86_64; rv:41.0) Gecko/20100101 Firefox/41.0',
# 'version': '41.0'}
pprint(generate_navigator_js())
# {'appCodeName': 'Mozilla',
# 'appName': 'Netscape',
# 'appVersion': '38.0',
# 'platform': 'MacIntel',
# 'userAgent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:38.0) Gecko/20100101 Firefox/38.0'}
This is how, I have been using a random user agent from a list of nearlly 1000 fake user agents
from random_user_agent.user_agent import UserAgent
from random_user_agent.params import SoftwareName, OperatingSystem
software_names = [SoftwareName.ANDROID.value]
operating_systems = [OperatingSystem.WINDOWS.value, OperatingSystem.LINUX.value, OperatingSystem.MAC.value]
user_agent_rotator = UserAgent(software_names=software_names, operating_systems=operating_systems, limit=1000)
# Get list of user agents.
user_agents = user_agent_rotator.get_user_agents()
user_agent_random = user_agent_rotator.get_random_user_agent()
Example
print(user_agent_random)
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36
For more details visit this link
I use pyuser_agent. this package use get user agnet
import pyuser_agent
import requests
ua = pyuser_agent.UA()
headers = {
"User-Agent" : ua.random
}
print(headers)
uri = "https://github.com/THAVASIGTI/"
res = requests.request("GET",uri,headers=headers)
print(res)
console out
{'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; zh-CN) AppleWebKit/533+ (KHTML, like Gecko)'}
<Response [200]>
User agent is ok but he wants to fetch a JavaScript site.we can use selenium but it is annoying to setup and maintain so the best way to fetch a JavaScript rendered page is requests_html module. Which is a superset of the well known request module. To install use pip
pip install requests-html
And to fetch a JavaScript rendered page use
from requests_html import HTMLSession
session = HTMLSession()
r = session.get('https://python.org/')
Hope it will help. It uses puppter to render javascript and also it downloads chromium but you don’t have to worry everything is happening under the hood.you will get the end result.
I want to get the content from this website.
If I use a browser like Firefox or Chrome I could get the real website page I want, but if I use the Python requests package (or wget
command) to get it, it returns a totally different HTML page.
I thought the developer of the website had made some blocks for this.
Question
How do I fake a browser visit by using python requests or command wget?
Provide a User-Agent
header:
import requests
url = 'http://www.ichangtou.com/#company:data_000008.html'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
response = requests.get(url, headers=headers)
print(response.content)
FYI, here is a list of User-Agent strings for different browsers:
As a side note, there is a pretty useful third-party package called fake-useragent that provides a nice abstraction layer over user agents:
fake-useragent
Up to date simple useragent faker with real world database
Demo:
>>> from fake_useragent import UserAgent
>>> ua = UserAgent()
>>> ua.chrome
u'Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1667.0 Safari/537.36'
>>> ua.random
u'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.67 Safari/537.36'
Try doing this, using firefox as fake user agent (moreover, it’s a good startup script for web scraping with the use of cookies):
#!/usr/bin/env python2
# -*- coding: utf8 -*-
# vim_ts=4:sw=4
import cookielib, urllib2, sys
def doIt(uri):
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
page = opener.open(uri)
page.addheaders = [('User-agent', 'Mozilla/5.0')]
print page.read()
for i in sys.argv[1:]:
doIt(i)
USAGE:
python script.py "http://www.ichangtou.com/#company:data_000008.html"
I used fake UserAgent.
How to use:
from fake_useragent import UserAgent
import requests
ua = UserAgent()
print(ua.chrome)
header = {'User-Agent':str(ua.chrome)}
print(header)
url = "https://www.hybrid-analysis.com/recent-submissions?filter=file&sort=^timestamp"
htmlContent = requests.get(url, headers=header)
print(htmlContent)
Output:
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1309.0 Safari/537.17
{'User-Agent': 'Mozilla/5.0 (X11; OpenBSD i386) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36'}
<Response [200]>
The root of the answer is that the person asking the question needs to have a JavaScript interpreter to get what they are after. What I have found is I am able to get all of the information I wanted on a website in json before it was interpreted by JavaScript. This has saved me a ton of time in what would be parsing html hoping each webpage is in the same format.
So when you get a response from a website using requests really look at the html/text because you might find the javascripts JSON in the footer ready to be parsed.
I had a similar issue but I was unable to use the UserAgent
class inside the fake_useragent
module. I was running the code inside a docker container
import requests
import ujson
import random
response = requests.get('https://fake-useragent.herokuapp.com/browsers/0.1.11')
agents_dictionary = ujson.loads(response.text)
random_browser_number = str(random.randint(0, len(agents_dictionary['randomize'])))
random_browser = agents_dictionary['randomize'][random_browser_number]
user_agents_list = agents_dictionary['browsers'][random_browser]
user_agent = user_agents_list[random.randint(0, len(user_agents_list)-1)]
I targeted the endpoint used in the module. This solution still gave me a random user agent however there is the possibility that the data structure at the endpoint could change.
Answer
You need to create a header with a proper formatted User agent String, it server to communicate client-server.
You can check your own user agent Here.
Example
Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0
Mozilla/5.0 (Macintosh; Intel Mac OS X x.y; rv:42.0) Gecko/20100101 Firefox/42.0
Third party Package user_agent 0.1.9
I found this module very simple to use, in one line of code it randomly generates a User agent string.
from user_agent import generate_user_agent, generate_navigator
from pprint import pprint
print(generate_user_agent())
# 'Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.3; Win64; x64)'
print(generate_user_agent(os=('mac', 'linux')))
# 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:36.0) Gecko/20100101 Firefox/36.0'
pprint(generate_navigator())
# {'app_code_name': 'Mozilla',
# 'app_name': 'Netscape',
# 'appversion': '5.0',
# 'name': 'firefox',
# 'os': 'linux',
# 'oscpu': 'Linux i686 on x86_64',
# 'platform': 'Linux i686 on x86_64',
# 'user_agent': 'Mozilla/5.0 (X11; Ubuntu; Linux i686 on x86_64; rv:41.0) Gecko/20100101 Firefox/41.0',
# 'version': '41.0'}
pprint(generate_navigator_js())
# {'appCodeName': 'Mozilla',
# 'appName': 'Netscape',
# 'appVersion': '38.0',
# 'platform': 'MacIntel',
# 'userAgent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:38.0) Gecko/20100101 Firefox/38.0'}
This is how, I have been using a random user agent from a list of nearlly 1000 fake user agents
from random_user_agent.user_agent import UserAgent
from random_user_agent.params import SoftwareName, OperatingSystem
software_names = [SoftwareName.ANDROID.value]
operating_systems = [OperatingSystem.WINDOWS.value, OperatingSystem.LINUX.value, OperatingSystem.MAC.value]
user_agent_rotator = UserAgent(software_names=software_names, operating_systems=operating_systems, limit=1000)
# Get list of user agents.
user_agents = user_agent_rotator.get_user_agents()
user_agent_random = user_agent_rotator.get_random_user_agent()
Example
print(user_agent_random)
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36
For more details visit this link
I use pyuser_agent. this package use get user agnet
import pyuser_agent
import requests
ua = pyuser_agent.UA()
headers = {
"User-Agent" : ua.random
}
print(headers)
uri = "https://github.com/THAVASIGTI/"
res = requests.request("GET",uri,headers=headers)
print(res)
console out
{'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; zh-CN) AppleWebKit/533+ (KHTML, like Gecko)'}
<Response [200]>
User agent is ok but he wants to fetch a JavaScript site.we can use selenium but it is annoying to setup and maintain so the best way to fetch a JavaScript rendered page is requests_html module. Which is a superset of the well known request module. To install use pip
pip install requests-html
And to fetch a JavaScript rendered page use
from requests_html import HTMLSession
session = HTMLSession()
r = session.get('https://python.org/')
Hope it will help. It uses puppter to render javascript and also it downloads chromium but you don’t have to worry everything is happening under the hood.you will get the end result.