Regex to extract text after certain characters, regardless of whether it has one or two \n

Question:

I’m using Python. I need to use a regex to extract certain text from the cells of one column and move it into new columns.
Consider the following example. I have a dataset with many dataframes that look like this:

+-----------------------------------------------------------------------------------------------+
| index | N ° Review  | Origen                    | Fecha de emisión del \nreview | Autor       |
+-----------------------------------------------------------------------------------------------+
| 1     | 23-0001-523 | Región: \nATACAMA         | 02/01/2022                    | JOSE GARCIA |
+-----------------------------------------------------------------------------------------------+
| 2     |             | Provincia: \nCOPIAPO      |                               |             |
+-----------------------------------------------------------------------------------------------+
| 3     |             | Comuna: \nSAN PEDRO DE LA |                               |             |
|       |             | \nTIERRA AMARILLA         |                               |             |
+-----------------------------------------------------------------------------------------------+

And I need it to look like this:

+-----------------------------------------------------------------------------------------------------------------+
|index|numero_review |fecha_emision_review  |autor      |comuna                           |provincia     |region  |
|-----|--------------|----------------------|-----------|---------------------------------|-----------------------|
|1    |23-0001-523   |02/01/2022            |JOSE GARCIA|                                 |              |ATACAMA |
+-----------------------------------------------------------------------------------------------------------------+
|2    |              |                      |           |                                 |COPIAPO       |        |
+-----------------------------------------------------------------------------------------------------------------+
|3    |              |                      |           | SAN PEDRO DE LA TIERRA AMARILLA |              |        |
+-----------------------------------------------------------------------------------------------------------------+         

I have implemented the following code to achieve this objective:

import camelot

# i is the path of the PDF being processed
table1 = camelot.read_pdf(i, pages="all", flavor="lattice", line_scale=40)
df = table1[0].df
df2 = df.transpose(copy=True)
df2
tabla_1 = df2.rename(columns={0: 'numero_review', 1: 'origen', 2: 'fecha_emision_review', 3: 'autor'})
tabla_1 = tabla_1.drop(index=0)  # drop the first row (index 0)
# Pull the value that follows each label into its own column
tabla_1['comuna'] = tabla_1['origen'].str.extract(r'Comuna:\s\W+(.+)$').fillna('')
tabla_1['provincia'] = tabla_1['origen'].str.extract(r'Provincia:\s\W+(.+)$').fillna('')
tabla_1['region'] = tabla_1['origen'].str.extract(r'Región:\s\W+(.+)$').fillna('')
tabla_1.drop(['origen'], axis=1, inplace=True)
tabla_1

However, this regex fails to extract comuna, because that cell contains two \n. Consider this result:

+-----------------------------------------------------------------------------------------------------------------+
|index|numero_review |fecha_emision_review  |autor      |comuna                           |provincia     |region  |
|-----|--------------|----------------------|-----------|---------------------------------|-----------------------|
|1    |23-0001-523   |02/01/2022            |JOSE GARCIA|                                 |              |ATACAMA |
+-----------------------------------------------------------------------------------------------------------------+
|2    |              |                      |           |                                 |COPIAPO       |        |
+-----------------------------------------------------------------------------------------------------------------+
|3    |              |                      |           |                                 |              |        |
+-----------------------------------------------------------------------------------------------------------------+         

So I need a regex that, in the case of comuna, correctly captures the text regardless of whether it contains one or two \n.

Any suggestions? The pdf used in the implemented code can be found here: https://drive.google.com/file/d/1xwTWnYUpYz_8cKqWFe9c6UoAliz2SAgH/view?usp=sharing

Asked By: Joseph

||

Answers:

Use the re.MULTILINE and re.DOTALL flags, as follows.
(The names of the flags are self-explanatory. See the reference for details.)

import re

tabla_1['comuna'] = tabla_1['origen'].str.extract(r'Comuna:\s*(.+)', re.MULTILINE | re.DOTALL).fillna('')
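
To see why this helps, here is a minimal sketch using only Python's re module (as far as I know, str.extract hands the pattern and flags to re for ordinary string columns). The cell text is a made-up sample that mimics row 3 of the question, not something read from the PDF, and the final cleanup step is my own addition in case the line break should not stay in the captured value.

```python
import re

# Hypothetical cell text, mimicking the 'origen' cell for row 3 in the question.
cell = "Comuna: \nSAN PEDRO DE LA \nTIERRA AMARILLA"

# Question's pattern: '.' cannot cross the second '\n', and '$' only matches
# at the end of the string, so the search fails.
old = re.search(r"Comuna:\s\W+(.+)$", cell)

# With re.DOTALL, '.' also matches '\n', so the group runs to the end.
new = re.search(r"Comuna:\s*(.+)", cell, re.DOTALL)

# Optional cleanup (an addition, not part of the answer above): collapse the
# embedded line break and surrounding spaces into a single space.
comuna = re.sub(r"\s+", " ", new.group(1)).strip()

print(old)     # None
print(comuna)  # SAN PEDRO DE LA TIERRA AMARILLA
```

On the dataframe column, the same cleanup could be done with .str.replace(r'\s+', ' ', regex=True) after the extract.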

And there are a few suggestions.

  1. Don’t mix '\s' (whitespace) and '\W' (non-word characters), because the latter already includes the former.

  2. You don’t need '$' or '\Z' to capture characters up to the end of the string, because the '+' and '*' quantifiers are greedy by default.
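
Both suggestions can be checked with a short sketch using only the re module. The sample string is made up for illustration, and the second check assumes re.DOTALL is set, as in the answer's pattern.

```python
import re

# 1. Every whitespace character is also a non-word character,
#    so '\W' already covers everything '\s' can match.
whitespace = " \t\n\r\f\v"
all_nonword = all(re.fullmatch(r"\W", ch) for ch in whitespace)

# 2. Greedy '.+' under re.DOTALL already consumes everything up to the end
#    of the string, so appending '\Z' does not change the captured text.
text = "Provincia: \nCOPIAPO"  # hypothetical sample cell
plain = re.search(r"Provincia:\s*(.+)", text, re.DOTALL).group(1)
anchored = re.search(r"Provincia:\s*(.+)\Z", text, re.DOTALL).group(1)

print(all_nonword)        # True
print(plain == anchored)  # True
```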

Answered By: relent95