Access href link value from pandas dataframe

Question:

I currently have a dataframe I’ve produced through scraping https://www.cve.org/downloads.

 Format Unix Compressed (.Z)           Gzipped            Raw                                   Additional Notes
0    CSV       allitems.csv.Z   allitems.csv.gz   allitems.csv  NOTE: suitable for import into spreadsheet pro...
1   HTML      allitems.html.Z  allitems.html.gz  allitems.html                                                NaN
2   Text       allitems.txt.Z   allitems.txt.gz   allitems.txt                                                NaN
3    XML       allitems.xml.Z   allitems.xml.gz   allitems.xml                     XML Schema Design: cve_1.0.xsd

Under the Raw column, allitems.csv is actually a link on the website. Once I read the table into a dataframe, the href value of the link can no longer be accessed. Below is the code I currently have, using selenium and pandas:

import pandas as pd
from selenium import webdriver


Browser = webdriver.Safari()

# Navigate to the URL:
Browser.get("http://cve.org/downloads")

# Get the raw HTML string:
RawHtmlString = Browser.page_source

df = pd.read_html(RawHtmlString)[0]

print(df)

How do I edit my program so that it can extract the link and download the file automatically?

Asked By: Isaac Agatep


Answers:

First you have to access the href attribute of the a tag where the link is located, in order to get the text "/data/downloads/file.csv.gz":

import requests
import urllib.parse

s = requests.Session()
link = '/data/downloads/file.csv.gz'
base_url = 'https://cve.mitre.org/'

Then you apply something like this:

s.get(url=urllib.parse.urljoin(base_url, link))
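
To actually save the file rather than just request it, you could write the response body to disk. A minimal sketch, assuming the GET above succeeds; the local filename here is simply the last segment of the path:

response = s.get(urllib.parse.urljoin(base_url, link))
response.raise_for_status()  # stop on 4xx/5xx instead of saving an error page

fname = link.rsplit('/', maxsplit=1)[1]  # -> 'file.csv.gz'
with open(fname, 'wb') as f:
    f.write(response.content)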
Answered By: Art

Get links

If you really want to extract the links, you could first get all the a tags nested inside td elements with the attribute data-label="Raw", and then loop through them and get the hrefs. E.g.

raw = Browser.find_elements("xpath", "//td[@data-label='Raw']/a")

links = [r.get_attribute('href') for r in raw]

print(links)
['https://cve.mitre.org/data/downloads/allitems.csv', 
 'https://cve.mitre.org/data/downloads/allitems.html', 
 'https://cve.mitre.org/data/downloads/allitems.txt', 
 'https://cve.mitre.org/data/downloads/allitems.xml']

But if you’re only interested in the csv, you could use:

csvs = Browser.find_elements(
    "xpath", "//td[@data-label='Raw']/a[contains(@href,'.csv')]")
links = [csv.get_attribute('href') for csv in csvs]

# or just use `find_element`, seeing that there is only one such file:

csv_link = Browser.find_element(
    "xpath", "//td[@data-label='Raw']/a[contains(@href,'.csv')]"
).get_attribute('href')

Of course, in this particular case, these are somewhat pointless exercises. As you can see above, all the links share the same base URL, so you can also simply create an extra column:

BASE = 'https://cve.mitre.org/data/downloads/'
df['Urls'] = BASE + df.Raw

print(df.Urls)
0    https://cve.mitre.org/data/downloads/allitems.csv
1    https://cve.mitre.org/data/downloads/allitems....
2    https://cve.mitre.org/data/downloads/allitems.txt
3    https://cve.mitre.org/data/downloads/allitems.xml
Name: Urls, dtype: object
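
As an aside, a minimal sketch assuming pandas 1.5 or newer: read_html accepts an extract_links argument, so the hrefs need not be lost during parsing in the first place. Each body cell then becomes a (text, href) tuple, with href set to None where a cell contains no link. Note that the hrefs come back exactly as written in the HTML, so relative paths would still need joining with the base URL:

from io import StringIO

# Assumes pandas >= 1.5; recent versions also expect a file-like object
# rather than a raw HTML string, hence the StringIO wrapper.
df2 = pd.read_html(StringIO(RawHtmlString), extract_links='body')[0]

# Cells in 'Raw' now look like ('allitems.csv', '<href from the page>').
raw_hrefs = [href for _, href in df2['Raw']]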

Download files

For downloading, I would rely on urllib.request. Note the warning in the docs, though: "[this function] might become deprecated at some point in the future". Might… that warning has been around for a while. Try something like the following:

from urllib import request

my_path = 'destination_folder_path/' # mind the "/" at the end!

for l in links:
    fname = l.rsplit('/', maxsplit=1)[1]
    print(l) # just to see what we're downloading
    request.urlretrieve(l, f'{my_path}{fname}')
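
If the possible deprecation bothers you, here is a minimal alternative sketch using the third-party requests package (an extra dependency, so only worthwhile if you are happy to install it); it streams each file to disk in chunks rather than holding it all in memory:

import requests

for l in links:
    fname = l.rsplit('/', maxsplit=1)[1]
    with requests.get(l, stream=True, timeout=60) as r:
        r.raise_for_status()  # fail on HTTP errors instead of saving an error page
        with open(f'{my_path}{fname}', 'wb') as f:
            for chunk in r.iter_content(chunk_size=1 << 16):
                f.write(chunk)  # 64 KiB chunks keep memory use flat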
Answered By: ouroboros1