How to find the name change based on a specific TEXT and display its id and date using python
Question:
I have a excel file containing three columns as shown below,
id
name
Date
436
Minster
2020-04-15
436
Minster (HTTP gg AG)
2021-12-07
145
Denskin (HTTP geplan)
2020-07-24
145
Denskinf HTTP DTAG
2020-08-15
555
Garben
2021-03-05
555
Wabern (HttP)
2021-09-13
555
Wabern Garben HTTP
2022-04-18
737
oyehausen
2020-06-26
737
WerrePark HTTP ag
2020-07-14
737
Werre Park (http ssd)
2020-08-25
737
Werre Park (HTTP)
2021-03-15
884
klintern
2021-03-23
884
kitern http
2021-04-08
884
Lausen (http los)
2021-06-16
884
kitener (http geplan)
2021-07-24
584
Lausern
2020-08-15
584
Lausern (HTTP DTAG gg)
2021-03-05
Is it possible to filter out the id, name and the date when there is a change in name if HTTP in any form like HttP, (HTTP), http is included in the name at first event of occurance. For Example id:436 doesn’t have any form http text included in its first row but in the second row with the same id:436 HTTP is included, but for the id:145 the first row itself has the HTTP. But I wanted to filter out the change in name which includes HTTP in any form of text either small or captial in the first event of occurence, with its id, name and date.
Expecting the result to be like,
id
name
Date
436
Minster (HTTP gg AG)
2021-12-07
555
Wabern (HttP)
2021-09-13
737
WerrePark HTTP ag
2020-07-14
884
kitern http
2021-04-08
584
Lausern (HTTP DTAG gg)
2021-03-05
Answers:
Create a boolean mask has_http
to identify the rows containing http
then group this mask by id
and use shift
to create another boolean mask to identify whether previous row contains http. Then combine the masks using &
to identify the changes
has_http = df['name'].str.contains(r'(?i)bhttpb')
mask = has_http & ~has_http.groupby(df['id']).shift(fill_value=True)
Now use the resulting mask to filter the rows
df[mask]
id name Date
1 436 Minster (HTTP gg AG) 2021-12-07
5 555 Wabern (HttP) 2021-09-13
8 737 WerrePark HTTP ag 2020-07-14
12 884 kitern http 2021-04-08
16 584 Lausern (HTTP DTAG gg) 2021-03-05
Here is another approach using regex for any possible combination of uppercase/lowercase HTTP:
import re
import pandas as pd
data = [
["436", "Minster", "2020-04-15"],
["436", "Minster (HTTP gg AG)", "2021-12-07"],
["145", "Denskin (HTTP geplan)", "2020-07-24"],
["145", "Denskinf HTTP DTAG", "2020-08-15"],
["884", "klintern", "2021-03-23"],
["884", "klintern http", "2021-04-08"],
["884", "kitener (http geplan)", "2021-07-24"]
]
df = pd.DataFrame(data, columns=['id', 'name', 'Date'])
filtered_df = df.loc[df['name'].str.contains("http", flags=re.IGNORECASE)]
filtered_df = filtered_df.groupby("id").first()
Output:
id name Date
145 Denskin (HTTP geplan) 2020-07-24
436 Minster (HTTP gg AG) 2021-12-07
884 klintern http 2021-04-08
I have a excel file containing three columns as shown below,
id | name | Date |
---|---|---|
436 | Minster | 2020-04-15 |
436 | Minster (HTTP gg AG) | 2021-12-07 |
145 | Denskin (HTTP geplan) | 2020-07-24 |
145 | Denskinf HTTP DTAG | 2020-08-15 |
555 | Garben | 2021-03-05 |
555 | Wabern (HttP) | 2021-09-13 |
555 | Wabern Garben HTTP | 2022-04-18 |
737 | oyehausen | 2020-06-26 |
737 | WerrePark HTTP ag | 2020-07-14 |
737 | Werre Park (http ssd) | 2020-08-25 |
737 | Werre Park (HTTP) | 2021-03-15 |
884 | klintern | 2021-03-23 |
884 | kitern http | 2021-04-08 |
884 | Lausen (http los) | 2021-06-16 |
884 | kitener (http geplan) | 2021-07-24 |
584 | Lausern | 2020-08-15 |
584 | Lausern (HTTP DTAG gg) | 2021-03-05 |
Is it possible to filter out the id, name and the date when there is a change in name if HTTP in any form like HttP, (HTTP), http is included in the name at first event of occurance. For Example id:436 doesn’t have any form http text included in its first row but in the second row with the same id:436 HTTP is included, but for the id:145 the first row itself has the HTTP. But I wanted to filter out the change in name which includes HTTP in any form of text either small or captial in the first event of occurence, with its id, name and date.
Expecting the result to be like,
id | name | Date |
---|---|---|
436 | Minster (HTTP gg AG) | 2021-12-07 |
555 | Wabern (HttP) | 2021-09-13 |
737 | WerrePark HTTP ag | 2020-07-14 |
884 | kitern http | 2021-04-08 |
584 | Lausern (HTTP DTAG gg) | 2021-03-05 |
Create a boolean mask has_http
to identify the rows containing http
then group this mask by id
and use shift
to create another boolean mask to identify whether previous row contains http. Then combine the masks using &
to identify the changes
has_http = df['name'].str.contains(r'(?i)bhttpb')
mask = has_http & ~has_http.groupby(df['id']).shift(fill_value=True)
Now use the resulting mask to filter the rows
df[mask]
id name Date
1 436 Minster (HTTP gg AG) 2021-12-07
5 555 Wabern (HttP) 2021-09-13
8 737 WerrePark HTTP ag 2020-07-14
12 884 kitern http 2021-04-08
16 584 Lausern (HTTP DTAG gg) 2021-03-05
Here is another approach using regex for any possible combination of uppercase/lowercase HTTP:
import re
import pandas as pd
data = [
["436", "Minster", "2020-04-15"],
["436", "Minster (HTTP gg AG)", "2021-12-07"],
["145", "Denskin (HTTP geplan)", "2020-07-24"],
["145", "Denskinf HTTP DTAG", "2020-08-15"],
["884", "klintern", "2021-03-23"],
["884", "klintern http", "2021-04-08"],
["884", "kitener (http geplan)", "2021-07-24"]
]
df = pd.DataFrame(data, columns=['id', 'name', 'Date'])
filtered_df = df.loc[df['name'].str.contains("http", flags=re.IGNORECASE)]
filtered_df = filtered_df.groupby("id").first()
Output:
id name Date
145 Denskin (HTTP geplan) 2020-07-24
436 Minster (HTTP gg AG) 2021-12-07
884 klintern http 2021-04-08