Regex to extract text after certain characters, regardless of whether it contains one or two \n
Question:
I'm using Python. I need to use a regex to extract certain text from the cells in a row and move it into new columns.
Consider the following example. I have a dataset with many dataframes that look like this:
+-----------------------------------------------------------------------------------------------+
| index | N ° Review | Origen | Fecha de emisión del \nreview | Autor |
+-----------------------------------------------------------------------------------------------+
| 1 | 23-0001-523 | Región: \nATACAMA | 02/01/2022 | JOSE GARCIA |
+-----------------------------------------------------------------------------------------------+
| 2 | | Provincia: \nCOPIAPO | | |
+-----------------------------------------------------------------------------------------------+
| 3 | | Comuna: \nSAN PEDRO DE LA | | |
| | | \nTIERRA AMARILLA | | |
+-----------------------------------------------------------------------------------------------+
And I need it to look like this:
+-----------------------------------------------------------------------------------------------------------------+
|index|numero_review |fecha_emision_review |autor |comuna |provincia |region |
|-----|--------------|----------------------|-----------|---------------------------------|-----------------------|
|1 |23-0001-523 |02/01/2022 |JOSE GARCIA| | |ATACAMA |
+-----------------------------------------------------------------------------------------------------------------+
|2 | | | | |COPIAPO | |
+-----------------------------------------------------------------------------------------------------------------+
|3 | | | | SAN PEDRO DE LA TIERRA AMARILLA | | |
+-----------------------------------------------------------------------------------------------------------------+
I have implemented the following code to achieve this objective:
import camelot

# i: path of the PDF being read (defined elsewhere)
table1 = camelot.read_pdf(i, pages = "all", flavor = "lattice", line_scale=40)
df = table1[0].df
df2 = df.transpose(copy = True)
df2
tabla_1 = df2.rename(columns={0:'numero_review', 1:'origen', 2:'fecha_emision_review', 3:'autor'})
tabla_1 = tabla_1.drop(index=0)
tabla_1['comuna'] = tabla_1['origen'].str.extract(r'Comuna:\s\W+(.+)$').fillna('')
tabla_1['provincia'] = tabla_1['origen'].str.extract(r'Provincia:\s\W+(.+)$').fillna('')
tabla_1['region'] = tabla_1['origen'].str.extract(r'Región:\s\W+(.+)$').fillna('')
tabla_1.drop(['origen'], axis = 1, inplace = True)
tabla_1
However, this regex fails to extract comuna, because that cell contains two \n characters. Consider this result:
+-----------------------------------------------------------------------------------------------------------------+
|index|numero_review |fecha_emision_review |autor |comuna |provincia |region |
|-----|--------------|----------------------|-----------|---------------------------------|-----------------------|
|1 |23-0001-523 |02/01/2022 |JOSE GARCIA| | |ATACAMA |
+-----------------------------------------------------------------------------------------------------------------+
|2 | | | | |COPIAPO | |
+-----------------------------------------------------------------------------------------------------------------+
|3 | | | | | | |
+-----------------------------------------------------------------------------------------------------------------+
So I need a regex that, in the case of comuna, correctly captures the text regardless of whether it contains one or two \n.
Any suggestions? The pdf used in the implemented code can be found here: https://drive.google.com/file/d/1xwTWnYUpYz_8cKqWFe9c6UoAliz2SAgH/view?usp=sharing
Answers:
Use the re.MULTILINE and re.DOTALL flags, as follows.
(re.DOTALL makes '.' match newline characters too, and re.MULTILINE makes '^' and '$' match at every line boundary. See the re module reference for details.)
import re
tabla_1['comuna'] = tabla_1['origen'].str.extract(r'Comuna:\s*(.+)', re.MULTILINE | re.DOTALL).fillna('')
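To see the effect of the flag in isolation, here is a minimal sketch using only the standard re module. The cell values are hypothetical and simply mimic the origen column from the question, and extract_after is an illustrative helper, not part of pandas. Since this pattern contains no '^' or '$', only re.DOTALL changes the result. The extra re.sub step is an assumption about the desired output: it collapses the remaining line break so the value reads as a single line.

```python
import re

# Hypothetical cell values, mimicking the 'origen' column from the question.
cells = [
    "Región: \nATACAMA",
    "Provincia: \nCOPIAPO",
    "Comuna: \nSAN PEDRO DE LA \nTIERRA AMARILLA",
]

def extract_after(label, text, flags=0):
    """Return the text following 'label:', or '' when the label is absent."""
    match = re.search(label + r":\s*(.+)", text, flags)
    if match is None:
        return ""
    # Collapse line breaks and repeated spaces into single spaces.
    return re.sub(r"\s+", " ", match.group(1)).strip()

# With re.DOTALL, '.' also matches '\n', so the capture spans both lines.
comunas = [extract_after("Comuna", c, re.DOTALL) for c in cells]

# Without the flag, '.+' stops at the first '\n' and the value is truncated.
without_dotall = extract_after("Comuna", cells[2])

print(comunas)
print(without_dotall)
```

The same pattern and flag can be passed to str.extract, followed by .str.replace(r'\s+', ' ', regex=True) if the line break should be removed from the column.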
A few further suggestions:
- Don't mix '\s' (whitespace) and '\W' (non-word characters): the latter already includes the former.
- You don't need '$' or '\Z' to capture characters up to the end of the string, because '+' and '*' are greedy by default.
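Both suggestions can be checked with a short sketch using the standard re module. The cell string is a hypothetical stand-in for the comuna value in the question, and the variable names are illustrative only.

```python
import re

# Hypothetical cell value with two line breaks, as in the question.
cell = "Comuna: \nSAN PEDRO DE LA \nTIERRA AMARILLA"

# '\W' matches any non-word character, and whitespace is one of them.
includes_whitespace = re.fullmatch(r"\W+", " \n") is not None

# The question's pattern: '.' stops at the second '\n', so '$' (end of string)
# is never reached and the search fails.
original = re.search(r"Comuna:\s\W+(.+)$", cell)

# With re.DOTALL, the greedy '.+' already runs to the end of the string,
# so adding '$' makes no difference.
with_anchor = re.search(r"Comuna:\s*(.+)$", cell, re.DOTALL).group(1)
without_anchor = re.search(r"Comuna:\s*(.+)", cell, re.DOTALL).group(1)

print(includes_whitespace, original, with_anchor == without_anchor)
```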