Scraping a webpage with Python but unsure how to deal with a static(?) URL
Question:
I am trying to learn how to pull data from this url:
https://denver.coloradotaxsale.com/index.cfm?folder=auctionResults&mode=preview
However, the problem is that the URL doesn't change when I switch pages, so I am not sure how to enumerate or loop through them. I'm trying to find a better way, since the webpage has about 3,000 sales records.
Here is my starting code. It is very simple, but I would appreciate any help or hints. I think I might need to switch to another package, but I am not sure which one; maybe BeautifulSoup?
import pandas as pd
import requests

url = "https://denver.coloradotaxsale.com/index.cfm?folder=auctionResults&mode=preview"
html = requests.get(url).content
df_list = pd.read_html(html, header=1)[0]
df_list = df_list.drop([0, 1, 2])  # drop unnecessary rows
Answers:
To get the data from more pages you can use this example:
import requests
import pandas as pd
from bs4 import BeautifulSoup

data = {
    "folder": "auctionResults",
    "loginID": "00",
    "pageNum": "1",
    "orderBy": "AdvNum",
    "orderDir": "asc",
    "justFirstCertOnGroups": "1",
    "doSearch": "true",
    "itemIDList": "",
    "itemSetIDList": "",
    "interest": "",
    "premium": "",
    "itemSetDID": "",
}

url = "https://denver.coloradotaxsale.com/index.cfm?folder=auctionResults&mode=preview"

all_data = []
for page_num in range(1, 3):  # <-- increase number of pages here.
    data["pageNum"] = page_num
    soup = BeautifulSoup(requests.post(url, data=data).content, "html.parser")
    # Skip the two header rows of the results table.
    for row in soup.select("#searchResults tr")[2:]:
        tds = [td.text.strip() for td in row.select("td")]
        all_data.append(tds)

columns = [
    "SEQ NUM",
    "Tax Year",
    "Notices",
    "Parcel ID",
    "Face Amount",
    "Winning Bid",
    "Sold To",
]

df = pd.DataFrame(all_data, columns=columns)

# print last 10 items from dataframe:
print(df.tail(10).to_markdown())
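Once the DataFrame is built, the "Face Amount" and "Winning Bid" columns come back as strings like "$905.98", so they need cleaning before any arithmetic. A small follow-up sketch (using hypothetical sample values shaped like the scraped rows):

```python
import pandas as pd

# Sample rows shaped like the scraped table (hypothetical values).
df = pd.DataFrame(
    {
        "Face Amount": ["$905.98", "$1,999.83"],
        "Winning Bid": ["$81.00", "$171.00"],
    }
)

# Strip "$" and "," from the currency strings, then convert to float.
for col in ["Face Amount", "Winning Bid"]:
    df[col] = df[col].str.replace(r"[$,]", "", regex=True).astype(float)

print(df["Face Amount"].sum())  # total face amount across the sample rows
```

The same two lines can be dropped in after the `pd.DataFrame(all_data, columns=columns)` call above to make the scraped columns numeric.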
Prints:
|     | SEQ NUM | Tax Year | Notices | Parcel ID        | Face Amount | Winning Bid | Sold To  |
|----:|---------|----------|---------|------------------|-------------|-------------|----------|
|  96 | 000094  | 2020     |         | 00031-18-001-000 | $905.98     | $81.00      | 00005517 |
|  97 | 000095  | 2020     |         | 00031-18-002-000 | $750.13     | $75.00      | 00005517 |
|  98 | 000096  | 2020     |         | 00031-18-003-000 | $750.13     | $75.00      | 00005517 |
|  99 | 000097  | 2020     |         | 00031-18-004-000 | $750.13     | $75.00      | 00005517 |
| 100 | 000098  | 2020     |         | 00031-18-007-000 | $750.13     | $76.00      | 00005517 |
| 101 | 000099  | 2020     |         | 00031-18-008-000 | $905.98     | $84.00      | 00005517 |
| 102 | 000100  | 2020     |         | 00031-19-001-000 | $1,999.83   | $171.00     | 00005517 |
| 103 | 000101  | 2020     |         | 00031-19-004-000 | $1,486.49   | $131.00     | 00005517 |
| 104 | 000102  | 2020     |         | 00031-19-006-000 | $1,063.44   | $96.00      | 00005517 |
| 105 | 000103  | 2020     |         | 00031-20-001-000 | $1,468.47   | $126.00     | 00005517 |
Use the information wisely and ensure you have the correct permissions to scrape this site and process the information.
If you press F12 on the site, open the Network tab, inspect the Payload, and switch to page two, form data shows up containing the page number. Replicating this form and modifying the page value should allow you to scrape the site.
As always, there's probably a Python package out there that will make this easy.
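pandas itself is one such package: `pd.read_html` parses every `<table>` in an HTML payload into a DataFrame, which could replace the per-row BeautifulSoup loop above. A self-contained sketch on a toy table (the markup below is hypothetical, not the live page's actual structure):

```python
from io import StringIO

import pandas as pd

# Toy HTML standing in for one page of the POST response (hypothetical markup).
html = """
<table id="searchResults">
  <tr><th>SEQ NUM</th><th>Tax Year</th><th>Parcel ID</th><th>Winning Bid</th></tr>
  <tr><td>000094</td><td>2020</td><td>00031-18-001-000</td><td>$81.00</td></tr>
  <tr><td>000095</td><td>2020</td><td>00031-18-002-000</td><td>$75.00</td></tr>
</table>
"""

# read_html returns one DataFrame per <table> found; with the real site you
# would pass the text of requests.post(url, data=data) instead of this string,
# and concatenate the per-page frames with pd.concat.
df = pd.read_html(StringIO(html))[0]
print(df)
```

Note that `read_html` infers dtypes, so zero-padded values like "000094" come back as integers unless you pass `converters` or `dtype` options to keep them as strings.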