Question about a multi-line regex in Python
Question:
I want to select groups of lines in a text file in order to get all the jobs related to a given IP ref.
The test file looks like this:
job numbers: (1,2,3), IP refs: (10,12,10)
Text file:
1
… (several lines of text)
xxx 10
2
… (several lines of text)
xxx 12
3
… (several lines of text)
xxx 10
I want to select the job numbers for IPref=10.
Code:
#!/usr/bin/python
import re

# Read the whole file into one string so the regex can span several lines
with open('test2.xml', 'r') as fic:
    texte = fic.read()

#pattern = r'\n?\d(?!(?:\n?xxx \d{2}\n)*)xxx 10'
pattern = r'\n?\d.*?xxx 10'
result = re.findall(pattern, texte, re.DOTALL)
for i, match in enumerate(result, 1):
    print("\nmatch:", i)
    print(match)
Result :
match: 1
1
a
b
xxx 10
match: 2
1
a
b
xxx 12
1
a
b
xxx 10
I tried to replace .* with a negative lookahead assertion, so that a group is only selected if no expression like "\n?xxx \d{2}\n" comes before "xxx 10":
pattern = r'\n?\d(?!(?:\n?xxx \d{2}\n)*)xxx 10'
but it does not work …
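A minimal sketch reproducing both attempts, assuming the test file content is held in a string (the variable `texte` below stands in for test2.xml). It suggests why neither works: the lazy `.*?` simply keeps expanding past "xxx 12" until it reaches the next "xxx 10", and a negative lookahead around a group repeated with `*` can never succeed, because that group can always match the empty string.

```python
import re

# In-memory stand-in for test2.xml (assumption: same layout as the test file)
texte = "1\na\nb\nxxx 10\n1\na\nb\nxxx 12\n1\na\nb\nxxx 10\n"

# Attempt 1: lazy quantifier with DOTALL.
# The second match starts at the second block and runs across "xxx 12".
lazy = re.findall(r'\n?\d.*?xxx 10', texte, re.DOTALL)

# Attempt 2: (?!(?:...)*) always fails, since (?:...)* can match nothing at all.
lookahead = re.findall(r'\n?\d(?!(?:\n?xxx \d{2}\n)*)xxx 10', texte, re.DOTALL)

print(len(lazy))
print("xxx 12" in lazy[1])
print(lookahead)
```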
Answers:
You can write the pattern this way, repeating the newline and asserting that the next line is not xxx followed by 1 or more digits:
^\d(?:\n(?!xxx \d+$).*)*\nxxx 10$
The pattern matches:
^ Start of the string (or of a line, with re.MULTILINE)
\d Match a single digit (or \d+ for 1 or more)
(?: Non-capturing group
\n Match a newline
(?!xxx \d+$) Negative lookahead to assert that the line is not xxx followed by 1+ digits
.* If the assertion is true, match the whole line
)* Close the group and optionally repeat it
\nxxx 10$ Match a newline, xxx and 10
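The breakdown above can be tried out with a short sketch like the following. The sample string is an assumption modelled on the question's file (job numbers 1, 2, 3 with IP refs 10, 12, 10), and `re.MULTILINE` is assumed so that `^` and `$` anchor at line boundaries.

```python
import re

# Sample text modelled on the question's file (assumption)
texte = "1\na\nb\nxxx 10\n2\na\nb\nxxx 12\n3\na\nb\nxxx 10\n"

# re.DOTALL is deliberately not set, so .* stops at the end of each line
pattern = r'^\d(?:\n(?!xxx \d+$).*)*\nxxx 10$'

blocks = re.findall(pattern, texte, re.MULTILINE)
for i, block in enumerate(blocks, 1):
    print("\nmatch:", i)
    print(block)
```

Each block is consumed one line at a time, and the lookahead stops the repetition at the first "xxx <digits>" line, so a block that ends with "xxx 12" cannot be absorbed into a match.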
Good day to you 🙂 and thank you very much for your quick response!
Here is the result.
Note: I replaced re.DOTALL with re.DOTALL|re.MULTILINE, because there is no result without it. Sorry for the previous presentation, it was not very clear.
Text file :
1
a
b
xxx 10
1
a
b
xxx 12
1
a
b
xxx 10
Code with your pattern:
#!/usr/bin/python
import re

# Read the whole file into one string so the regex can span several lines
with open('test2.xml', 'r') as fic:
    texte = fic.read()
print(texte)

#pattern='</?(?!(?:span|br|b)(?: [^>]*)?>)[^>/]*>'
#pattern = r'\n?\d(?!(?:\n?xxx \d{2}\n?)*?)xxx 10'
#pattern = r'\n?\d.*?xxx 10'
pattern = r'^\d(?:\n(?!xxx \d+$).*)*\nxxx 10$'
result = re.findall(pattern, texte, re.DOTALL|re.MULTILINE)
for i, match in enumerate(result, 1):
    print("\nmatch:", i)
    print(match)
Result :
match: 1
1
a
b
xxx 10
1
a
b
xxx 12
1
a
b
xxx 10
But I am trying to obtain:
match: 1
1
a
b
xxx 10
match: 2
1
a
b
xxx 10
Thank you very much (you saved my day!).
As you said, with re.MULTILINE only:
pattern = r'^\d(?:\n(?!xxx \d+$).*)*\nxxx 10$'
result = re.findall(pattern, texte, re.MULTILINE)
Result: OK, the group of lines (1 .. xxx 12) is ignored.
Note: I can adapt this to a case where line 1 gives job information and the "xxx 12" line gives printer IP information.
match: 1
1
a
b
xxx 10
match: 2
1
a
b
xxx 10
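The difference between the two runs seems to come down to `re.DOTALL`: with it, `.*` also consumes newlines, so the first `.*` swallows everything up to the last "xxx 10" and the lookahead is only checked once. A small comparison sketch, assuming the test file content is held in a string:

```python
import re

# In-memory stand-in for test2.xml (assumption: same layout as the test file)
texte = "1\na\nb\nxxx 10\n1\na\nb\nxxx 12\n1\na\nb\nxxx 10\n"
pattern = r'^\d(?:\n(?!xxx \d+$).*)*\nxxx 10$'

# With DOTALL, '.' matches newlines: one match covering all three blocks
with_dotall = re.findall(pattern, texte, re.DOTALL | re.MULTILINE)

# Without DOTALL, '.*' stays on one line: the "xxx 12" block is skipped
without_dotall = re.findall(pattern, texte, re.MULTILINE)

print(len(with_dotall))
print(len(without_dotall))
```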
file :
job_number job_id
1 10202
bla bla
bla bla bla
xxx 100.10.10.100
2 10203
bla bla
bla bla bla
bla bla bla
xxx 100.10.10.102
3 10204
bla bla bla
bla bla bla
xxx 100.10.10.100
Bash script with an embedded Python script:
#!/bin/bash
# get_jobs_ip: print every job block that ends with the given printer IP
# $1: IP address of a printer
get_jobs_ip ()
{
cat <<EOF | python
import re
with open('test3.xml', 'r') as fic:
    texte = fic.read()

# Pattern explained, for ip="100.10.10.100"
# (thank you to Fourth bird for the pattern!):
#   ^\d\s+\d+(?:\n(?!xxx \d+\.\d+\.\d+\.\d+$).*)*\nxxx 100\.10\.10\.100$
# ^\d\s+\d+  At the start of a line, match a digit (job number), whitespace and 1+ digits (job id)
# (?:  Non-capturing group
# \n  Match a newline
# (?!xxx \d+\.\d+\.\d+\.\d+$)  Negative lookahead to assert that the line is not xxx followed by an IP address
# .*  If the assertion is true, match the whole line
# )*  Close the group and optionally repeat it
# \nxxx 100\.10\.10\.100$  Match a newline, xxx and the requested IP address
ip = "$1"  # substituted by bash, because the heredoc delimiter is unquoted
pattern_template = r'^\d\s+\d+(?:\n(?!xxx \d+\.\d+\.\d+\.\d+$).*)*\nxxx @ip@$'
# re.escape keeps the dots of the IP address literal
pattern = pattern_template.replace('@ip@', re.escape(ip))
result = re.findall(pattern, texte, re.MULTILINE)
for i, match in enumerate(result, 1):
    print("\nmatch:", i)
    print(match)
EOF
}
get_jobs_ip "100.10.10.100"
get_jobs_ip "100.10.10.102"
result :
match: 1
1 10202
bla bla
bla bla bla
xxx 100.10.10.100
match: 2
3 10204
bla bla bla
bla bla bla
xxx 100.10.10.100
match: 1
2 10203
bla bla
bla bla bla
bla bla bla
xxx 100.10.10.102
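If only the job numbers and job ids are needed rather than the whole block, one possible variation is to capture them from the header line. This is a sketch under assumptions: the function name `jobs_for_ip` and the in-memory `sample` string are hypothetical, and `[ \t]+` is used instead of `\s+` so that the header cannot spill over a newline.

```python
import re

def jobs_for_ip(texte, ip):
    """Return (job_number, job_id) pairs for blocks ending with 'xxx <ip>'."""
    pattern = (
        r'^(\d+)[ \t]+(\d+)'                      # header line: job_number job_id
        r'(?:\n(?!xxx \d+\.\d+\.\d+\.\d+$).*)*'   # body lines that are not an IP line
        r'\nxxx ' + re.escape(ip) + r'$'          # closing line with the wanted IP
    )
    return re.findall(pattern, texte, re.MULTILINE)

# In-memory stand-in for test3.xml (assumption: same layout as the file above)
sample = (
    "job_number job_id\n"
    "1 10202\nbla bla\nbla bla bla\nxxx 100.10.10.100\n"
    "2 10203\nbla bla\nbla bla bla\nbla bla bla\nxxx 100.10.10.102\n"
    "3 10204\nbla bla bla\nbla bla bla\nxxx 100.10.10.100\n"
)

print(jobs_for_ip(sample, "100.10.10.100"))
print(jobs_for_ip(sample, "100.10.10.102"))
```

With two capture groups, `re.findall` returns a list of tuples instead of the full matched text.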