re.findall outputs blanks along with correct matches

Question:

I’m trying to get the output list to contain no subgroups or empty strings. I’d like to stick with a regex-only solution because my re.split and array-manipulation approach is really janky and fairly slow.

HTML file (notice that thing3 and thing4 are preceded by /b/ instead of /a/):

<!DOCTYPE html>
<html>
    <head></head>   
    <body>
        <a href="example.com/a/thing1"></a>
        <a href="example.com/a/thing2"></a>
        <a href="example.com/b/thing3"></a>
        <a href="example.com/b/thing4" ><img src="/thing4.png"></a>
    </body>
</html>

Python file:

import re

html = open("help.html", "r").read()
links = re.findall('((?<=.com/a/).*(?="))|((?<=.com/b/).*(?=" ><))|((?<=.com/b/).*(?="></a))',html)

print(links)

Output when I run the Python file above:

[('thing1', '', ''), ('thing2', '', ''), ('', '', 'thing3'), ('', 'thing4', '')]

What I want it to output:

[thing1, thing2, thing3, thing4]
Asked By: Surgemus

||

Answers:

You just have to remove the capturing groups. As the documentation for re.findall states:

Empty matches are included in the result.

The result depends on the number of capturing groups in the pattern. If there are no groups, return a list of strings matching the whole pattern. If there is exactly one group, return a list of strings matching that group. If multiple groups are present, return a list of tuples of strings matching the groups. Non-capturing groups do not affect the form of the result.
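The behaviour quoted above can be seen on a toy string. This is only a minimal sketch; the sample text and the variable names are made up for illustration and are not part of the question:

```python
import re

# Hypothetical sample text, chosen only to keep the results short.
text = "a1 b2"

# No groups: a list of whole matches.
no_groups = re.findall(r"[ab]\d", text)

# Exactly one group: a list of what that group matched.
one_group = re.findall(r"[ab](\d)", text)

# Several groups: a list of tuples. A group that took no part in a
# given match contributes an empty string, which is where the blanks
# in the question come from.
many_groups = re.findall(r"(a\d)|(b\d)", text)

# Non-capturing groups (?:...) do not change the shape of the result.
non_capturing = re.findall(r"(?:a\d)|(?:b\d)", text)

print(no_groups)      # ['a1', 'b2']
print(one_group)      # ['1', '2']
print(many_groups)    # [('a1', ''), ('', 'b2')]
print(non_capturing)  # ['a1', 'b2']
```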

For example, ((?<=.com/a/).*(?=")) is a capturing group because of its outermost parentheses. Remove those outer parentheses, and do the same for the other two groups:

links = re.findall('(?<=.com/a/).*(?=")|(?<=.com/b/).*(?=" ><)|(?<=.com/b/).*(?="></a)', html)

Output:

['thing1', 'thing2', 'thing3', 'thing4']
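If the goal is simply the text between /a/ or /b/ and the closing quote, the three alternatives could probably be collapsed into one pattern with no groups at all. This is a sketch of a different pattern, not the one used above: it assumes every link looks like the ones in the question, and it embeds the HTML as a string (the hypothetical `html` literal below) instead of reading help.html so that it runs on its own.

```python
import re

# Sample HTML copied from the question, inlined so the sketch is self-contained.
html = '''
<a href="example.com/a/thing1"></a>
<a href="example.com/a/thing2"></a>
<a href="example.com/b/thing3"></a>
<a href="example.com/b/thing4" ><img src="/thing4.png"></a>
'''

# The lookbehind accepts either /a/ or /b/ (a character class keeps it
# fixed-width, which Python's re module requires for lookbehinds).
# [^"]* then stops at the closing quote of the href, so no lookahead
# is needed. The dot is escaped so it matches a literal '.'.
links = re.findall(r'(?<=\.com/[ab]/)[^"]*', html)

print(links)  # ['thing1', 'thing2', 'thing3', 'thing4']
```

Because the pattern contains no capturing groups, re.findall returns plain strings rather than tuples.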
Answered By: CreepyRaccoon