re.findall outputs blanks along with the correct matches
Question:
I’m trying to get the list output without subgroups or empty strings. I’d like to stick with a regex-only solution because my re.split and array manipulation method is really janky and sort of slow.
HTML file: (Notice that thing3 and thing4 are preceded by /b/ instead of /a/.)
<!DOCTYPE html>
<html>
<head></head>
<body></body>
<a href="example.com/a/thing1"></a>
<a href="example.com/a/thing2"></a>
<a href="example.com/b/thing3"></a>
<a href="example.com/b/thing4" ><img src="/thing4.png"></a>
</body>
</html>
Python file:
import re
html = open("help.html", "r").read()
links = re.findall('((?<=.com/a/).*(?="))|((?<=.com/b/).*(?=" ><))|((?<=.com/b/).*(?="></a))',html)
print(links)
What is output when I run the above Python file:
[('thing1', '', ''), ('thing2', '', ''), ('', '', 'thing3'), ('', 'thing4', '')]
What I want it to output:
[thing1, thing2, thing3, thing4]
Answers:
You just have to remove the capturing groups. As stated in the documentation for re.findall:
Empty matches are included in the result.
The result depends on the number of capturing groups in the pattern. If there are no groups, return a list of strings matching the whole pattern. If there is exactly one group, return a list of strings matching that group. If multiple groups are present, return a list of tuples of strings matching the groups. Non-capturing groups do not affect the form of the result.
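To make the quoted rule concrete, here is a small sketch of the result shapes it describes. The sample string and patterns are made up for illustration and are not from the question.

```python
import re

# Made-up sample text, only to show the shapes re.findall returns.
text = "a1 b2 c3"

# No groups: a list of strings matching the whole pattern.
no_groups = re.findall(r"[a-z]\d", text)

# Exactly one group: a list of the strings matched by that group.
one_group = re.findall(r"[a-z](\d)", text)

# Several groups: a list of tuples, one slot per group.
# A group that did not take part in a match contributes ''.
many_groups = re.findall(r"(a\d)|(b\d)|(c\d)", text)

# Non-capturing groups (?:...) do not change the shape of the result.
non_capturing = re.findall(r"(?:a|b|c)\d", text)

print(no_groups)      # ['a1', 'b2', 'c3']
print(one_group)      # ['1', '2', '3']
print(many_groups)    # [('a1', '', ''), ('', 'b2', ''), ('', '', 'c3')]
print(non_capturing)  # ['a1', 'b2', 'c3']
```

The third case is what produces the tuples with empty strings in the question: each alternative sits in its own capturing group, and only one of them matches at a time.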
An example of a capturing group is ((?<=.com/a/).*(?=")), so the outermost parentheses should be removed, and likewise for the other two groups:
links = re.findall('(?<=.com/a/).*(?=")|(?<=.com/b/).*(?=" ><)|(?<=.com/b/).*(?="></a)',html)
Output:
['thing1', 'thing2', 'thing3', 'thing4']
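For anyone who wants to check this without creating help.html, here is a self-contained sketch that inlines the four anchor lines from the question. The second pattern is not part of the answer above: it is a hypothetical shorter variant that assumes the wanted text always follows ".com/a/" or ".com/b/" and ends at the next double quote.

```python
import re

# Inline copy of the question's anchor lines, so no help.html file is needed.
html = '''<a href="example.com/a/thing1"></a>
<a href="example.com/a/thing2"></a>
<a href="example.com/b/thing3"></a>
<a href="example.com/b/thing4" ><img src="/thing4.png"></a>
'''

# The answer's pattern: the same three alternatives, without capturing groups.
fixed = re.findall(
    r'(?<=.com/a/).*(?=")|(?<=.com/b/).*(?=" ><)|(?<=.com/b/).*(?="></a)',
    html,
)

# Hypothetical shorter variant (an assumption, not the answer's method):
# one fixed-width lookbehind for either prefix, then everything up to the
# next double quote. The dot is escaped so it only matches a literal ".".
shorter = re.findall(r'(?<=\.com/[ab]/)[^"]*', html)

print(fixed)    # ['thing1', 'thing2', 'thing3', 'thing4']
print(shorter)  # ['thing1', 'thing2', 'thing3', 'thing4']
```

Both calls return a flat list of strings because neither pattern contains a capturing group. For anything beyond a small, predictable snippet like this one, an HTML parser is usually more robust than a regex.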