How can I use a Pandas column to parse text from the web?
Question:
I’ve used the map function on a dataframe column of postcodes to create a new Series of tuples which I can then manipulate into a new dataframe.
import lxml.html
import requests

def scrape_data(series_data):
    # A bit of code to create the URL goes here
    r = requests.get(url)
    root_content = r.content
    root = lxml.html.fromstring(root_content)
    address = root.cssselect(".lr_results ul")
    for place in address:
        address_property = place.cssselect("li a")[0].text
        house_type = place.cssselect("li")[1].text
        house_sell_price = place.cssselect("li")[2].text
        house_sell_date = place.cssselect("li")[3].text
        return address_property, house_type, house_sell_price, house_sell_date

df = postcode_subset['Postcode'].map(scrape_data)
While it works where there is only one property on a results page, it fails to create a tuple for multiple properties.
What I’d like to be able to do is iterate through a series of pages and then add that content to a dataframe. I know that Pandas can convert nested dicts into dataframes, but I’m really struggling to make it work. I’ve tried to use the answers at How to make a nested dictionary and dynamically append data, but I’m getting lost.
Answers:
At the moment your function only returns a result for the first place in address (in Python you would usually yield rather than return to retrieve all the results as a generator).
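The difference between return and yield can be sketched without any scraping at all. This is only an illustration: the function names and the sample addresses are made up, and plain strings stand in for the parsed page elements.

```python
def first_only(places):
    # return leaves the function during the first pass through the loop,
    # so only the first place is ever processed.
    for place in places:
        return place.upper()


def all_places(places):
    # yield hands back one value per pass and then resumes the loop,
    # so the caller receives every place.
    for place in places:
        yield place.upper()


# Made-up sample data standing in for the elements found on a results page.
places = ["1 high street", "2 high street"]

first = first_only(places)
everything = list(all_places(places))
```

Here first holds a single value, while everything holds one entry per place.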
When subsequently doing an apply/map, you’ll usually want the function to return a Series…
However, I think you just want to return the following DataFrame:
return pd.DataFrame([{'address_property': place.cssselect("li a")[0].text,
                      'house_type': place.cssselect("li")[1].text,
                      'house_sell_price': place.cssselect("li")[2].text,
                      'house_sell_date': place.cssselect("li")[3].text}
                     for place in address],
                    index=address)
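If the function returns one DataFrame per postcode, the per-page frames still need stacking into a single frame. A minimal sketch of one way to do that with pd.concat follows; fake_scrape and its sample rows are hypothetical stand-ins for the real scraping function, so that the example runs without any network access.

```python
import pandas as pd


def fake_scrape(postcode):
    # Hypothetical stand-in for the scraper: returns one row per property.
    rows = [{"address_property": f"{n} Example Road, {postcode}",
             "house_type": "Terraced"}
            for n in (1, 2)]
    return pd.DataFrame(rows)


postcodes = pd.Series(["AB1 2CD", "EF3 4GH"])

# One DataFrame per postcode, then stacked into a single frame.
frames = [fake_scrape(postcode) for postcode in postcodes]
combined = pd.concat(frames, ignore_index=True)
```

Passing ignore_index=True replaces the per-page indexes with one continuous index.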
To make the code work, I eventually reworked Andy Hayden’s solution to:
listed = []
for place in address:
    results = [{'postcode': postcode_bit,
                'address_property': place.cssselect("li a")[0].text,
                'house_type': place.cssselect("li")[1].text,
                'house_sell_price': place.cssselect("li")[2].text,
                'house_sell_date': place.cssselect("li")[3].text}]
    listed.extend(results)
return listed
At least I understand a bit more about how Python data structures work now.
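For anyone wanting to go from those per-postcode lists to a single dataframe, here is a rough sketch of one possible approach. It assumes the function returns a list of dicts as above; fake_scrape and its sample values are hypothetical placeholders for the real scraper.

```python
from itertools import chain

import pandas as pd


def fake_scrape(postcode):
    # Hypothetical stand-in for the scraper: a list of dicts, one per property.
    return [{"postcode": postcode,
             "address_property": f"{n} Example Road",
             "house_sell_price": "100,000"}
            for n in (1, 2)]


postcodes = pd.Series(["AB1 2CD", "EF3 4GH"])

# map gives a Series whose values are lists of dicts.
scraped = postcodes.map(fake_scrape)

# Flatten the lists into one list of dicts and build the dataframe from it.
df = pd.DataFrame(list(chain.from_iterable(scraped)))
```

Each dict becomes one row, and the dict keys become the column names.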