How do I find consecutive repeating numbers in my pandas column?
Question:
I have two columns: one contains a string of digits and the other contains two or three digits. The first column looks like this:
Account number
0 5493455646944
1 56998884221
2 95853255555926
3 5055555555495718323
4 56999998247361
5 6506569568
I would like to create a regex function which displays a flag if the account number contains 5 or more consecutive, repeated digits.
So in theory, the target state is as follows:
Account number test
0 5493455646944 No
1 56998884221 No
2 95853255555926 Yes
3 5055555555495718323 Yes
4 56999998247361 Yes
5 6506569568 No
I was thinking something like:
def reg_finder(x):
    return re.findall('^([0-9])\1{5,}$', x)
I am not good with regex at all so unsure…thanks
Edit: this is what I tried:
def reg_finder(x):
    return re.findall('\b(\d)\1+\b', x)

example_df['test'] = example_df['Account number'].apply(reg_finder)
Account number test
0 5493455646944 []
1 56998884221 []
2 95853255555926 []
3 5055555555495718323 []
4 56999998247361 []
5 6506569568 []
Answers:
You can use
import pandas as pd
import warnings
warnings.filterwarnings("ignore", message="This pattern has match groups")
df = pd.DataFrame({'Account number':["5493455646944","56998884221","95853255555926","5055555555495718323","56999998247361","6506569568"]})
df['test'] = "No"
df.loc[df["Account number"].str.contains(r'([0-9])\1{4,}'), 'test'] = "Yes"
Output:
>>> df
Account number test
0 5493455646944 No
1 56998884221 No
2 95853255555926 Yes
3 5055555555495718323 Yes
4 56999998247361 Yes
5 6506569568 No
Note that the regex r'([0-9])\1{4,}' is defined with a raw string literal, in which backslashes are kept as literal backslashes rather than being treated as the start of a string escape sequence.
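To make the raw-string point concrete, here is a small sketch using only the standard `re` module. The variable names are chosen for illustration. Without the `r` prefix, Python reads `'\1'` as an octal escape for the control character `chr(1)`, so the regex engine never sees a backreference.

```python
import re

# Without the r prefix, '\1' is an octal escape: the pattern contains chr(1).
plain = '([0-9])\1{4,}'
# With the r prefix, the backslash survives and the regex engine sees \1,
# a backreference to group 1.
raw = r'([0-9])\1{4,}'

has_ctrl_char = '\x01' in plain   # control character ended up in the pattern
has_backslash = '\\' in raw       # literal backslash is still there

account = "95853255555926"
plain_match = re.search(plain, account)  # looks for a digit followed by chr(1) x4
raw_match = re.search(raw, account)      # looks for a digit repeated 5+ times

print(has_ctrl_char, has_backslash)
print(plain_match)
print(raw_match.group(0))
```

The same effect likely explains the empty lists in the question's second attempt: in a non-raw string, `'\b'` is a backspace character rather than a word boundary.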
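Problems in your regex re.findall('^([0-9])\1{5,}$', x):
- You use ^ and $, which anchor the pattern to the start and end of the string, so it only matches when the whole string is one digit repeated.
- You want to match 5 or more repeats. The ([0-9]) group already matches the first digit, so \1 only needs to match 4 more, i.e. \1{4,}.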
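The two problems above can be checked directly with the standard `re` module on the sample account numbers from the question. This is a minimal sketch; the variable names are illustrative only.

```python
import re

accounts = [
    "5493455646944",
    "56998884221",
    "95853255555926",
    "5055555555495718323",
    "56999998247361",
    "6506569568",
]

anchored = r'^([0-9])\1{5,}$'   # whole string must be one digit, 6+ times
unanchored = r'([0-9])\1{4,}'   # any run of 5+ identical digits, anywhere

anchored_flags = [bool(re.search(anchored, a)) for a in accounts]
unanchored_flags = [bool(re.search(unanchored, a)) for a in accounts]

# The anchored pattern only accepts strings made of a single repeated digit,
# and {5,} after the group demands six digits in total, not five.
five_same = bool(re.search(anchored, "55555"))
six_same = bool(re.search(anchored, "666666"))

print(anchored_flags)
print(unanchored_flags)
print(five_same, six_same)
```

None of the sample accounts consists of a single repeated digit, so the anchored pattern flags nothing, while the unanchored pattern flags rows 2, 3 and 4.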
You can use
import numpy as np

df['test'] = np.where(df['Account number'].astype(str).str.contains(r'([0-9])\1{4,}'), 'Yes', 'No')
# Or
df['test'] = np.where(df['Account number'].astype(str).str.contains(r'(\d)\1{4,}'), 'Yes', 'No')
print(df)
Account number test
0 5493455646944 No
1 56998884221 No
2 95853255555926 Yes
3 5055555555495718323 Yes
4 56999998247361 Yes
5 6506569568 No
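# df1 is the DataFrame holding the 'Account number' column
# One row per digit of each account number
dd1 = df1.assign(col1=df1['Account number'].astype(str).map(list)).explode("col1")
# Run id: increments whenever the digit differs from the previous row
col2 = dd1.col1.ne(dd1.col1.shift()).cumsum()
# col3 holds the length of the run each digit belongs to
dd2 = dd1.assign(test=col2).assign(col3=lambda dd: dd.groupby(['Account number', col2]).test.transform('size'))
dd2.groupby("Account number", sort=False, as_index=False).apply(lambda dd: "Yes" if dd.col3.ge(5).any() else "No")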
out:
Account number test
0 5493455646944 No
1 56998884221 No
2 95853255555926 Yes
3 5055555555495718323 Yes
4 56999998247361 Yes
5 6506569568 No
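The run-length idea behind the explode/cumsum approach can also be sketched without regex or pandas, using `itertools.groupby` from the standard library. This is an alternative illustration, not the answerer's code. The helper name `has_long_run` and its `min_run` parameter are hypothetical.

```python
from itertools import groupby

def has_long_run(value, min_run=5):
    """Return True if any character repeats min_run or more times in a row."""
    # groupby yields one group per run of identical consecutive characters
    return any(sum(1 for _ in run) >= min_run for _, run in groupby(str(value)))

accounts = [
    "5493455646944",
    "56998884221",
    "95853255555926",
    "5055555555495718323",
    "56999998247361",
    "6506569568",
]

flags = ["Yes" if has_long_run(a) else "No" for a in accounts]
print(flags)
```

With a DataFrame, the same helper could be applied as `df['Account number'].map(has_long_run)`, which returns booleans that can then be mapped to "Yes"/"No".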