Stripping links returned by BeautifulSoup

Question:

When I scrape a page with BeautifulSoup, the href attributes come back looking like this:

"/url?q=http://druid8.sit.aau.dk/acc_papers/kdln4ccpef78ielqg01fuabr81s1.pdf&sa=U&ei=HkNsUauqN_GQiAf5p4CwDg&ved=0CDkQFjAJ&usg=AFQjCNGk0DTzu2K2ieIKS-SXAeS5-VYTgA"

What is the easiest way to cut out just the "http://….pdf" part so I can download the file?

import re

for link in soup.findAll('a'):
    try:
        href = link['href']
        if re.search(r'\.pdf', href):  # match a literal ".pdf"
            print href
    except KeyError:
        pass
Asked By: raw-bin hood


Answers:

How consistent is the format of these hrefs? If they always look like your example, this works without a regex:

href.split('q=')[1].split('&')[0]

This might also do it:

href[7:href.index('&')]  # 7 == len('/url?q=')

They both seem to work in my interactive terminal:

>>> s = "/url?q=http://druid8.sit.aau.dk/acc_papers/kdln4ccpef78ielqg01fuabr81s1.pdf&sa=U&ei=HkNsUauqN_GQiAf5p4CwDg&ved=0CDkQFjAJ&usg=AFQjCNGk0DTzu2K2ieIKS-SXAeS5-VYTgA"
>>>
>>> s[7:s.index('&')]
'http://druid8.sit.aau.dk/acc_papers/kdln4ccpef78ielqg01fuabr81s1.pdf'
>>>
>>> s.split('q=')[1].split('&')[0]
'http://druid8.sit.aau.dk/acc_papers/kdln4ccpef78ielqg01fuabr81s1.pdf'
>>>

You can also get there with this regex:

>>> import re
>>>
>>> re.findall(r'http://.*?\.pdf', s)
['http://druid8.sit.aau.dk/acc_papers/kdln4ccpef78ielqg01fuabr81s1.pdf']
>>>
Answered By: g.d.d.c
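If you need this in more than one place, the splitting approach above can be wrapped in a small helper that also decodes any percent-escapes in the target URL (a Python 3 sketch; the function name `extract_pdf_url` is my own, not from the answers):

```python
from urllib.parse import unquote

def extract_pdf_url(href):
    """Pull the target URL out of a Google '/url?q=...' redirect link."""
    # Everything between 'q=' and the first '&' is the target URL;
    # unquote() decodes %xx escapes if any are present.
    return unquote(href.split('q=', 1)[1].split('&', 1)[0])

href = ("/url?q=http://druid8.sit.aau.dk/acc_papers/"
        "kdln4ccpef78ielqg01fuabr81s1.pdf&sa=U&ei=HkNsUauqN_GQiAf5p4CwDg"
        "&ved=0CDkQFjAJ&usg=AFQjCNGk0DTzu2K2ieIKS-SXAeS5-VYTgA")
print(extract_pdf_url(href))
```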

A more Pythonic way to do it would be the urlparse module:

A = "/url?q=http://druid8.sit.aau.dk/acc_papers/kdln4ccpef78ielqg01fuabr81s1.pdf&sa=U&ei=HkNsUauqN_GQiAf5p4CwDg&ved=0CDkQFjAJ&usg=AFQjCNGk0DTzu2K2ieIKS-SXAeS5-VYTgA"

import urlparse
sol = urlparse.parse_qs(A)  # the first key comes out as "/url?q"
print sol["/url?q"][0]

Which gives:

>> http://druid8.sit.aau.dk/acc_papers/kdln4ccpef78ielqg01fuabr81s1.pdf

The syntax is slightly different if you are using Python 3; shown above is the Python 2.7 version. This is really nice if you’d like the other arguments as well, for example:

print sol["ved"]
>> ['0CDkQFjAJ']
Answered By: Hooked
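In Python 3 the same idea uses urllib.parse; running parse_qs on just the query component avoids the odd "/url?q" key (a sketch, assuming the same example href):

```python
from urllib.parse import urlparse, parse_qs

A = ("/url?q=http://druid8.sit.aau.dk/acc_papers/"
     "kdln4ccpef78ielqg01fuabr81s1.pdf&sa=U&ei=HkNsUauqN_GQiAf5p4CwDg"
     "&ved=0CDkQFjAJ&usg=AFQjCNGk0DTzu2K2ieIKS-SXAeS5-VYTgA")

# urlparse() separates the path ('/url') from the query string,
# so parse_qs() sees a clean 'q' key instead of '/url?q'.
params = parse_qs(urlparse(A).query)
print(params['q'][0])    # the PDF URL
print(params['ved'][0])  # the 'ved' argument
```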