Separating tag attributes as a dictionary

Question:

My input (the variable is a string):

<a href="https://wikipedia.org/" rel="nofollow ugc">wiki</a>

My expected output:

{
'href': 'https://wikipedia.org/',
'rel': 'nofollow ugc',
'text': 'wiki',
}

How can I do this with Python, without using the BeautifulSoup library?

Please answer using the lxml library.

Asked By: Sardar

Answers:

With BeautifulSoup you can use .attrs to get a dict of a tag's attributes:

from bs4 import BeautifulSoup
# Name the parser explicitly; omitting it makes bs4 emit a warning
soup = BeautifulSoup('<a href="https://wikipedia.org/" rel="nofollow ugc">wiki</a>', 'html.parser')
soup.a.attrs

--> {'href': 'https://wikipedia.org/', 'rel': ['nofollow', 'ugc']}

To also include the text:

...
data = dict(soup.a.attrs)  # copy, so the tag's own attributes are not modified
data.update({'text': soup.a.text})
print(data)

--> {'href': 'https://wikipedia.org/', 'rel': ['nofollow', 'ugc'], 'text': 'wiki'}
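
Note that BeautifulSoup treats rel as a multi-valued attribute and returns it as a list, while the expected output in the question has a single string. A minimal sketch of one way to convert it is below; flatten_attrs is a hypothetical helper name, and the input dict is simply the output printed above, so the snippet runs without bs4 installed.

```python
def flatten_attrs(attrs):
    """Join list-valued attributes into space-separated strings (hypothetical helper)."""
    return {
        key: " ".join(value) if isinstance(value, list) else value
        for key, value in attrs.items()
    }

# Stand-in for the dict built from soup.a.attrs plus the text
data = {'href': 'https://wikipedia.org/', 'rel': ['nofollow', 'ugc'], 'text': 'wiki'}
print(flatten_attrs(data))
# {'href': 'https://wikipedia.org/', 'rel': 'nofollow ugc', 'text': 'wiki'}
```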
Answered By: HedgeHog

Solution with lxml (without BeautifulSoup):

from lxml import etree

xml = '<a href="https://wikipedia.org/" rel="nofollow ugc">wiki</a>'
root = etree.fromstring(xml)
print(root.attrib)

>>> {'href': 'https://wikipedia.org/', 'rel': 'nofollow ugc'}

The attribute mapping does not include the element's text.
You can read it from the text property:

print(root.text)
>>> wiki

Putting it together:

from lxml import etree

xml = '<a href="https://wikipedia.org/" rel="nofollow ugc">wiki</a>'
root = etree.fromstring(xml)
dict_ = {}
dict_.update(root.attrib)
dict_.update({'text': root.text})
print(dict_)
>>> {'href': 'https://wikipedia.org/', 'rel': 'nofollow ugc', 'text': 'wiki'}
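
The same steps can be wrapped in a small function. This is only a sketch: tag_to_dict is a hypothetical name, and the fallback import assumes that the standard library's xml.etree.ElementTree is acceptable when lxml is not installed (both provide fromstring, attrib and itertext). It uses itertext() instead of text, so text inside nested children is collected as well.

```python
try:
    from lxml import etree  # preferred, as in the answer above
except ImportError:
    import xml.etree.ElementTree as etree  # stdlib fallback with a compatible API


def tag_to_dict(markup):
    """Return the root element's attributes plus its text as a plain dict."""
    root = etree.fromstring(markup)
    result = dict(root.attrib)  # copy, so the element itself is not modified
    # itertext() also yields text from nested children, e.g. <a><b>wiki</b></a>
    result['text'] = ''.join(root.itertext())
    return result


print(tag_to_dict('<a href="https://wikipedia.org/" rel="nofollow ugc">wiki</a>'))
print(tag_to_dict('<a href="https://wikipedia.org/"><b>wiki</b></a>'))
```

Like etree.fromstring itself, this expects well-formed markup with a single root element.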

EDIT

Note: parsing [X]HTML with regular expressions is discouraged!

Solution with regex:

import re
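# Text between '>' and '<', word characters only
pattern_text = r"[>](\w+)[<]"
# Double-quoted href value without whitespace
pattern_href = r'href="(\w\S+)"'
# Double-quoted rel value made of letters and spaces
pattern_rel = r'rel="([A-Za-z ]+)"'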

xml = '<a href="https://wikipedia.org/" rel="nofollow ugc">wiki</a>'
dict_ = {
    'href': re.search(pattern_href, xml).group(1),
    'rel': re.search(pattern_rel, xml).group(1),
    'text': re.search(pattern_text, xml).group(1)
}
print(dict_)

>>> {'href': 'https://wikipedia.org/', 'rel': 'nofollow ugc', 'text': 'wiki'}

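This works when the input is a string in exactly this shape: the patterns assume double-quoted attribute values and a text made up of word characters only.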
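
If a third-party library is not an option, a sturdier alternative to regular expressions is the standard library's html.parser. The following is a rough sketch, not a full solution: FirstTagParser and parse_tag are hypothetical names, only the first tag's attributes are kept, and void elements written without a closing slash (such as <br>) are not given special treatment.

```python
from html.parser import HTMLParser


class FirstTagParser(HTMLParser):
    """Collect the attributes of the first tag and the text inside it."""

    def __init__(self):
        super().__init__()
        self.result = None
        self._depth = 0  # number of currently open tags

    def handle_starttag(self, tag, attrs):
        if self.result is None:
            # attrs is a list of (name, value) pairs
            self.result = dict(attrs)
            self.result['text'] = ''
        self._depth += 1

    def handle_endtag(self, tag):
        self._depth -= 1

    def handle_data(self, data):
        # Only keep text that appears while some tag is still open
        if self.result is not None and self._depth > 0:
            self.result['text'] += data


def parse_tag(markup):
    parser = FirstTagParser()
    parser.feed(markup)
    parser.close()
    return parser.result


print(parse_tag('<a href="https://wikipedia.org/" rel="nofollow ugc">wiki</a>'))
# {'href': 'https://wikipedia.org/', 'rel': 'nofollow ugc', 'text': 'wiki'}
```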

Answered By: vovakirdan

This is how you do it with lxml:

from lxml import etree

html = '''<a href="https://wikipedia.org/" rel="nofollow ugc">wiki</a>'''
root = etree.fromstring(html)
attrib_dict = dict(root.attrib)  # copy, so no 'text' attribute is added to the element
attrib_dict['text'] = root.text
print(attrib_dict)

Result:

{'href': 'https://wikipedia.org/', 'rel': 'nofollow ugc', 'text': 'wiki'}
Answered By: platipus_on_fire