Extract the lines between two marker lines using regex
Question:
I have a text file like this:
## COL
{ "Id": 1, "key1": "value1", "key2": "valueA", ... "keyN": "valueN"}
{ "Id": 1, "key1": "value1", "key2": "valueB", ... "keyN": "valueN"}
{ "Id": 1, "key1": "value1", "key2": "valueC", ... "keyN": "valueN"}
{ "Id": 2, "key1": "value1", "key2": "valueA", ... "keyN": "valueN"}
{ "Id": 3, "key1": "value1", "key2": "valueA", ... "keyN": "valueN"}
{ "Id": 3, "key1": "value1", "key2": "valueB", ... "keyN": "valueN"}
.
.
.
{ "Id": n, "key1": "value1", "key2": "valueZ", ... "keyN": "valueN"}
## USA
{ "Id": 1, "key1": "value1", "key2": "valueA", ... "keyN": "valueN"}
{ "Id": 1, "key1": "value1", "key2": "valueB", ... "keyN": "valueN"}
{ "Id": 1, "key1": "value1", "key2": "valueC", ... "keyN": "valueN"}
{ "Id": 2, "key1": "value1", "key2": "valueA", ... "keyN": "valueN"}
{ "Id": 3, "key1": "value1", "key2": "valueA", ... "keyN": "valueN"}
{ "Id": 3, "key1": "value1", "key2": "valueB", ... "keyN": "valueN"}
.
.
.
{ "Id": n, "key1": "value1", "key2": "valueZ", ... "keyN": "valueN"}
## ESP
{ "Id": 1, "key1": "value1", "key2": "valueA", ... "keyN": "valueN"}
{ "Id": 1, "key1": "value1", "key2": "valueB", ... "keyN": "valueN"}
{ "Id": 1, "key1": "value1", "key2": "valueC", ... "keyN": "valueN"}
{ "Id": 2, "key1": "value1", "key2": "valueA", ... "keyN": "valueN"}
.
.
.
{ "Id": n, "key1": "value1", "key2": "valueZ", ... "keyN": "valueN"}
I need to extract just the lines for a specific country using regex and Python, for example:
## COL
{ "Id": 1, "key1": "value1", "key2": "valueA", ... "keyN": "valueN"}
{ "Id": 1, "key1": "value1", "key2": "valueB", ... "keyN": "valueN"}
{ "Id": 1, "key1": "value1", "key2": "valueC", ... "keyN": "valueN"}
{ "Id": 2, "key1": "value1", "key2": "valueA", ... "keyN": "valueN"}
{ "Id": 3, "key1": "value1", "key2": "valueA", ... "keyN": "valueN"}
{ "Id": 3, "key1": "value1", "key2": "valueB", ... "keyN": "valueN"}
.
.
.
{ "Id": n, "key1": "value1", "key2": "valueZ", ... "keyN": "valueN"}
Note: There is no key or value that identifies the country; only the marker lines shown in the previous example do.
I tried this regex without success:
(?<=## COL).*[\w\s]*(?=##})
Thanks in advance!
Answers:
With a regex:
import re
# text holds the full contents of the file as a single string
m = re.search(r'^## COL\n(?:(?!##).)+', text, flags=re.S)
if m:
    print(m.group())
More efficient alternative:
m = re.search(r'^## COL\n(?:(?:(?!##).*)\n)+', text).group()
Output:
## COL
{ "Id": 1, "key1": "value1", "key2": "valueA", ... "keyN": "valueN"}
{ "Id": 1, "key1": "value1", "key2": "valueB", ... "keyN": "valueN"}
{ "Id": 1, "key1": "value1", "key2": "valueC", ... "keyN": "valueN"}
{ "Id": 2, "key1": "value1", "key2": "valueA", ... "keyN": "valueN"}
{ "Id": 3, "key1": "value1", "key2": "valueA", ... "keyN": "valueN"}
{ "Id": 3, "key1": "value1", "key2": "valueB", ... "keyN": "valueN"}
.
.
.
{ "Id": n, "key1": "value1", "key2": "valueZ", ... "keyN": "valueN"}
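To pull out a section other than the first one, the same idea can be wrapped in a small helper. This is only a sketch: the function name extract_section and the shortened sample data are made up for illustration, and adding re.M is my assumption so that ^ can match at the start of any line rather than only at the start of the string.

```python
import re

# Hypothetical sample standing in for the file contents (values shortened).
text = (
    '## COL\n'
    '{ "Id": 1, "key2": "valueA"}\n'
    '{ "Id": 2, "key2": "valueB"}\n'
    '## USA\n'
    '{ "Id": 1, "key2": "valueC"}\n'
    '## ESP\n'
    '{ "Id": 1, "key2": "valueD"}\n'
)

def extract_section(text, country):
    """Return the marker line plus its data lines, or None if the marker is absent."""
    # re.M lets ^ match at the start of every line, so the marker need not be
    # the first line of the file; re.S lets . cross line breaks.
    # re.escape guards against regex metacharacters in the country code.
    pattern = rf'^## {re.escape(country)}\n(?:(?!##).)+'
    m = re.search(pattern, text, flags=re.M | re.S)
    return m.group() if m else None

print(extract_section(text, 'USA'))
```

Note that this pattern stops at the first "##" it meets anywhere, so it assumes "##" never appears inside a data line.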
What about ## COL[^#]* instead? It should be sufficient to match the requested pattern, with no lookahead or lookbehind necessary.
See https://regex101.com/r/pc0iaV/1 for demonstration that it works.
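One thing worth checking before relying on the negated character class: it stops at the first # of any kind, not only at the next marker line. The sketch below uses two made-up inputs (the "item #7" value is purely hypothetical) to show both the working case and the truncated one.

```python
import re

pattern = re.compile(r'## COL[^#]*')

# Hypothetical inputs: one without any '#' in the data, one with a '#' inside a value.
clean = '## COL\n{ "Id": 1, "key1": "value1"}\n## USA\n{ "Id": 1, "key1": "value1"}\n'
with_hash = '## COL\n{ "Id": 1, "key1": "item #7"}\n{ "Id": 2, "key1": "value1"}\n## USA\n'

# Works as intended: the match runs up to the '#' that starts "## USA".
full = pattern.search(clean).group()

# Cut short: the match ends at the '#' inside the first value.
cut = pattern.search(with_hash).group()

print(repr(full))
print(repr(cut))
```

So this pattern is fine as long as the data lines are guaranteed not to contain a # character.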
Without the re.S flag, you can write the pattern as:
^## COL(?:\n(?!## ).*)*
Explanation
^
Start of string
## COL
Match literally
(?:
Non capture group
\n(?!## ).*
Match a newline, then the rest of the line, provided the line does not start with "## "
)*
Close the non capture group and optionally repeat it
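The breakdown above can be tried out with a short script. The sample strings are shortened stand-ins for the real file, and dropping the header with splitlines() is just one possible way to get at the data lines.

```python
import re

pattern = re.compile(r'^## COL(?:\n(?!## ).*)*')

# Hypothetical inputs: a normal file and one where the COL section has no data lines.
text = '## COL\n{ "Id": 1}\n{ "Id": 2}\n## USA\n{ "Id": 1}'
empty = '## COL\n## USA\n{ "Id": 1}'

section = pattern.search(text).group()

# Drop the "## COL" marker line to keep only the data lines.
data_lines = section.splitlines()[1:]

# Because the group is repeated with *, a section without data lines
# still matches; the match is then just the marker itself.
only_header = pattern.search(empty).group()

print(data_lines)
print(repr(only_header))
```

Since the newline is matched at the start of each repetition rather than at the end, the last line of the file does not need a trailing newline to be included.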