Stripping links returned by BeautifulSoup
Question:
When I use BeautifulSoup, the href attribute comes back as:
"/url?q=http://druid8.sit.aau.dk/acc_papers/kdln4ccpef78ielqg01fuabr81s1.pdf&sa=U&ei=HkNsUauqN_GQiAf5p4CwDg&ved=0CDkQFjAJ&usg=AFQjCNGk0DTzu2K2ieIKS-SXAeS5-VYTgA"
What is the easiest way to cut out only the "http://….pdf" part so I can download the file?
for link in soup.findAll('a'):
    try:
        href = link['href']
        if re.search(re.compile('.(pdf)'), href):
            print href
    except KeyError:
        pass
Answers:
How consistent are the links? If they always look like that,
href.split('q=')[1].split('&')[0]
would work without a regex. This might also do it:
href[7:href.index('&')] # may need +1 after .index call
They both seem to work in my interactive terminal:
>>> s = "/url?q=http://druid8.sit.aau.dk/acc_papers/kdln4ccpef78ielqg01fuabr81s1.pdf&sa=U&ei=HkNsUauqN_GQiAf5p4CwDg&ved=0CDkQFjAJ&usg=AFQjCNGk0DTzu2K2ieIKS-SXAeS5-VYTgA"
>>>
>>> s[7:s.index('&')]
'http://druid8.sit.aau.dk/acc_papers/kdln4ccpef78ielqg01fuabr81s1.pdf'
>>>
>>> s.split('q=')[1].split('&')[0]
'http://druid8.sit.aau.dk/acc_papers/kdln4ccpef78ielqg01fuabr81s1.pdf'
>>>
You can also get there with this regex:
>>> import re
>>>
>>> re.findall(r'http://.*?\.pdf', s)
['http://druid8.sit.aau.dk/acc_papers/kdln4ccpef78ielqg01fuabr81s1.pdf']
>>>
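If the split approach is what you settle on, it can be wrapped in a small defensive helper (a sketch, Python 3 print syntax; the function name, the guard, and the example URL are mine, not from the question):

```python
def extract_target(href):
    """Pull the real URL out of a Google-style "/url?q=..." redirect link.

    Returns None when the href doesn't look like a redirect.
    """
    if 'q=' not in href:
        return None
    return href.split('q=')[1].split('&')[0]

print(extract_target("/url?q=http://example.com/paper.pdf&sa=U&ved=0CDkQ"))
# http://example.com/paper.pdf
print(extract_target("/plain/link.html"))
# None
```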
A more Pythonic way to do it would be to use the urlparse library:
A = "/url?q=http://druid8.sit.aau.dk/acc_papers/kdln4ccpef78ielqg01fuabr81s1.pdf&sa=U&ei=HkNsUauqN_GQiAf5p4CwDg&ved=0CDkQFjAJ&usg=AFQjCNGk0DTzu2K2ieIKS-SXAeS5-VYTgA"
import urlparse
sol = urlparse.parse_qs(A)
print sol["/url?q"][0]
Which gives:
>> http://druid8.sit.aau.dk/acc_papers/kdln4ccpef78ielqg01fuabr81s1.pdf
The syntax is slightly different if you are using Python 3; shown above is the Python 2.7 version. This is really nice if you'd like the other arguments as well, for example:
print sol["ved"]
>> ['0CDkQFjAJ']
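Since the answer notes that the Python 3 syntax differs, here is a sketch of the same idea under Python 3, where urlparse lives in urllib.parse; parsing only the query portion of the string also gives a clean "q" key instead of "/url?q":

```python
from urllib.parse import urlparse, parse_qs

A = "/url?q=http://druid8.sit.aau.dk/acc_papers/kdln4ccpef78ielqg01fuabr81s1.pdf&sa=U&ei=HkNsUauqN_GQiAf5p4CwDg&ved=0CDkQFjAJ&usg=AFQjCNGk0DTzu2K2ieIKS-SXAeS5-VYTgA"

# Split off the path first so parse_qs sees only the query string
params = parse_qs(urlparse(A).query)
print(params["q"][0])   # http://druid8.sit.aau.dk/acc_papers/kdln4ccpef78ielqg01fuabr81s1.pdf
print(params["ved"])    # ['0CDkQFjAJ']
```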