Count the number of strings with length in pandas
Question:
I am trying to calculate the number of strings in a column with length of 5 or more. These strings are in a column separated by comma.
df= pd.DataFrame(columns=['first'])
df['first'] = ['Jack Ryan, Tom O','Stack Over Flow, StackOverFlow','Jurassic Park, IT', 'GOT']
Code I have used till now but not creating a new column with counts of strings of more than 5 characters.
df['countStrings'] = df['first'].str.split(',').count(r'[a-zA-Z0-9]{5,}')
Expected Output: Counting Strings of length 5 or More.
first
countString
Jack Ryan, Tom O
0
Stack Over Flow, StackOverFlow
2
Jurassic Park, IT
1
GOT
0
Edge Case: Strings of length more than 5 separated by comma and have multiple spaces
first
wrongCounts
rightCounts
Accounts Payable Goods for Resale
4
1
Corporate Finance, Financial Engineering
4
2
TBD
0
0
Goods for Not Resale, SAP
2
1
Answers:
Pandas str.len() method is used to determine length of each string in a Pandas series. This method is only for series of strings.
Since this is a string method, .str has to be prefixed everytime before calling this method.
Yo can try this :
import pandas as pd
df = pd.DataFrame(columns=['first'])
df['first'] = ['jack,utah,TOMHAWK
Somer,SORITNO','jill','bob,texas','matt,AR','john']
df['first'].replace(',',' ', regex=True, inplace=True)
df['first'].str.count(r'w+').sum()
This is how i would try to get the number of strings with len>=5 in a column:
data=[i for k in df['first']
for i in k.split(',')
if len(i)>=5]
result=len(data)
You can match 5 chars and on the left and right match optional chars other than a comma.
[^,]*[A-Za-z0-9]{5}[^,]*
See a regex demo with the matches.
Example
import pandas as pd
df = pd.DataFrame(columns=['first'])
df['first'] = [
'Accounts Payable Goods for Resale',
'Corporate Finance, Financial Engineering',
'TBD',
'Goods for Not Resale, SAP',
'Jack Ryan, Tom O',
'Stack Over Flow, StackOverFlow',
'Jurassic Park, IT',
'GOT'
]
df['countStrings'] = df['first'].str.count(r'[^,]*[A-Za-z0-9]{5}[^,]*')
print(df)
Output
first countStrings
0 Accounts Payable Goods for Resale 1
1 Corporate Finance, Financial Engineering 2
2 TBD 0
3 Goods for Not Resale, SAP 1
4 Jack Ryan, Tom O 0
5 Stack Over Flow, StackOverFlow 2
6 Jurassic Park, IT 1
7 GOT 0
I am trying to calculate the number of strings in a column with length of 5 or more. These strings are in a column separated by comma.
df= pd.DataFrame(columns=['first'])
df['first'] = ['Jack Ryan, Tom O','Stack Over Flow, StackOverFlow','Jurassic Park, IT', 'GOT']
Code I have used till now but not creating a new column with counts of strings of more than 5 characters.
df['countStrings'] = df['first'].str.split(',').count(r'[a-zA-Z0-9]{5,}')
Expected Output: Counting Strings of length 5 or More.
first | countString |
---|---|
Jack Ryan, Tom O | 0 |
Stack Over Flow, StackOverFlow | 2 |
Jurassic Park, IT | 1 |
GOT | 0 |
Edge Case: Strings of length more than 5 separated by comma and have multiple spaces
first | wrongCounts | rightCounts |
---|---|---|
Accounts Payable Goods for Resale | 4 | 1 |
Corporate Finance, Financial Engineering | 4 | 2 |
TBD | 0 | 0 |
Goods for Not Resale, SAP | 2 | 1 |
Pandas str.len() method is used to determine length of each string in a Pandas series. This method is only for series of strings.
Since this is a string method, .str has to be prefixed everytime before calling this method.
Yo can try this :
import pandas as pd
df = pd.DataFrame(columns=['first'])
df['first'] = ['jack,utah,TOMHAWK
Somer,SORITNO','jill','bob,texas','matt,AR','john']
df['first'].replace(',',' ', regex=True, inplace=True)
df['first'].str.count(r'w+').sum()
This is how i would try to get the number of strings with len>=5 in a column:
data=[i for k in df['first']
for i in k.split(',')
if len(i)>=5]
result=len(data)
You can match 5 chars and on the left and right match optional chars other than a comma.
[^,]*[A-Za-z0-9]{5}[^,]*
See a regex demo with the matches.
Example
import pandas as pd
df = pd.DataFrame(columns=['first'])
df['first'] = [
'Accounts Payable Goods for Resale',
'Corporate Finance, Financial Engineering',
'TBD',
'Goods for Not Resale, SAP',
'Jack Ryan, Tom O',
'Stack Over Flow, StackOverFlow',
'Jurassic Park, IT',
'GOT'
]
df['countStrings'] = df['first'].str.count(r'[^,]*[A-Za-z0-9]{5}[^,]*')
print(df)
Output
first countStrings
0 Accounts Payable Goods for Resale 1
1 Corporate Finance, Financial Engineering 2
2 TBD 0
3 Goods for Not Resale, SAP 1
4 Jack Ryan, Tom O 0
5 Stack Over Flow, StackOverFlow 2
6 Jurassic Park, IT 1
7 GOT 0