re.sub erroring with "Expected string or bytes-like object"
Question:
I have read multiple posts regarding this error, but I still can’t figure it out. When I try to loop through my function:
def fix_Plan(location):
    letters_only = re.sub("[^a-zA-Z]",  # Search for all non-letters
                          " ",          # Replace all non-letters with spaces
                          location)     # Column and row to search
    words = letters_only.lower().split()
    stops = set(stopwords.words("english"))
    meaningful_words = [w for w in words if not w in stops]
    return (" ".join(meaningful_words))

col_Plan = fix_Plan(train["Plan"][0])
num_responses = train["Plan"].size
clean_Plan_responses = []
for i in range(0,num_responses):
    clean_Plan_responses.append(fix_Plan(train["Plan"][i]))
Here is the error:
Traceback (most recent call last):
  File "C:/Users/xxxxx/PycharmProjects/tronc/tronc2.py", line 48, in <module>
    clean_Plan_responses.append(fix_Plan(train["Plan"][i]))
  File "C:/Users/xxxxx/PycharmProjects/tronc/tronc2.py", line 22, in fix_Plan
    location) # Column and row to search
  File "C:\Users\xxxxx\AppData\Local\Programs\Python\Python36\lib\re.py", line 191, in sub
    return _compile(pattern, flags).sub(repl, string, count)
TypeError: expected string or bytes-like object
Answers:
As you stated in the comments, some of the values appear to be floats, not strings. You will need to convert them to strings before passing them to re.sub. The simplest way is to change location to str(location) when calling re.sub. It wouldn’t hurt to do this anyway, even if the value is already a str.
letters_only = re.sub("[^a-zA-Z]",  # Search for all non-letters
                      " ",          # Replace all non-letters with spaces
                      str(location))
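To make the failure and the fix concrete, here is a minimal sketch that needs only the standard library. The sample values and the helper name letters_only are made up for illustration. It assumes the column holds a float NaN where text is missing, which is how pandas usually represents empty cells.

```python
import re

# Hypothetical sample of what the column might hold: two strings and a
# float NaN (assumed here to stand in for an empty cell).
values = ["Plan A: 2 visits", float("nan"), "Follow-up in 3 weeks"]


def letters_only(value):
    """Replace every non-letter with a space, coercing the input to str first."""
    return re.sub("[^a-zA-Z]", " ", str(value))


# Passing the float directly reproduces the error from the traceback.
try:
    re.sub("[^a-zA-Z]", " ", values[1])
except TypeError as exc:
    print("TypeError:", exc)

# With str() every value goes through, but NaN becomes the literal text "nan".
cleaned = [letters_only(v) for v in values]
print(cleaned)
```

One thing to keep in mind: because str() turns NaN into the text "nan", that token will survive the cleaning step. If that is unwanted, replacing missing values with an empty string before cleaning is one option.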
I suppose it would be better to use the re.match() function. Here is an example that may help you.
import re
import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt')
sentences = word_tokenize("I love to learn NLP n 'a :(")
#for i in range(len(sentences)):
sentences = [word.lower() for word in sentences if re.match('^[a-zA-Z]+', word)]
sentences
The simplest solution is to apply Python's str function to the column you are trying to loop through. If you are using pandas, this can be implemented as:
dataframe['column_name']=dataframe['column_name'].apply(str)
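As a slightly fuller sketch of the same idea, the snippet below first counts the offending rows and then converts the column. The frame named train is a made-up stand-in for the asker's data, and filling missing cells with an empty string before converting is an extra step assumed here, not part of the answer above.

```python
import re

import pandas as pd

# Hypothetical stand-in for the asker's data: one cell is missing.
train = pd.DataFrame({"Plan": ["Plan A: 2 visits", None, "Follow-up in 3 weeks"]})

# Count the rows that are not strings; these are the ones re.sub rejects.
non_string_rows = int(train["Plan"].map(lambda v: not isinstance(v, str)).sum())
print("non-string rows:", non_string_rows)

# Fill missing cells with an empty string, then coerce whatever is left to str.
train["Plan"] = train["Plan"].fillna("").astype(str)

# Every value is now safe to pass to re.sub.
cleaned = train["Plan"].map(lambda text: re.sub("[^a-zA-Z]", " ", text))
print(cleaned.tolist())
```

Filling with an empty string first avoids ending up with the word "nan" in the cleaned text, which is what a plain conversion with str would produce for missing cells.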
I had the same problem. Interestingly, nothing I tried solved it until I realized that the string contained two special, invisible characters.
In my case, the text had these two characters: U+200E (Left-to-Right Mark) and U+200C (Zero-width non-joiner).
Deleting these two characters solved the problem for me.
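import re
mystring = "\u200eSome Time W\u200ce"  # starts with U+200E, has U+200C before the final "e"
mystring = re.sub("\u200e", "", mystring)  # remove the Left-to-Right Mark
mystring = re.sub("\u200c", "", mystring)  # remove the Zero-width non-joiner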
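Hidden characters are hard to spot because they do not render. As a rough sketch (the helper name strip_format_chars is made up here), ascii() can reveal them, and the Unicode category "Cf" (format characters) can be used to drop them in one pass. Note that this removes every format character, which may be too aggressive for scripts where the zero-width non-joiner carries meaning.

```python
import unicodedata

# Sample string with the two invisible characters written as escapes:
# U+200E (Left-to-Right Mark) and U+200C (Zero-width non-joiner).
text = "\u200eSome Time W\u200ce"

# ascii() escapes every non-ASCII character, so hidden ones become visible.
revealed = ascii(text)
print(revealed)


def strip_format_chars(s):
    """Drop all characters in Unicode category Cf (format characters)."""
    return "".join(ch for ch in s if unicodedata.category(ch) != "Cf")


cleaned = strip_format_chars(text)
print(cleaned)
```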
I hope this helps someone who runs into the same problem.
In my experience with Python, this error is caused by passing None as the second argument to re.findall().
import re
x = re.findall(r"[(.*?)]", None)
One can reproduce the error with this code sample.
To avoid this error message, one can filter out the null values or add a condition that skips them during processing.
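A condition of that kind could look like the sketch below. The helper safe_findall, the sample records, and the bracket pattern are all invented for illustration; the only point is the isinstance check in front of the re call.

```python
import re

# Hypothetical input list; None stands in for a missing record.
records = ["[alpha] and [beta]", None, "no brackets here"]


def safe_findall(pattern, value):
    """Return re.findall results, or an empty list when value is not a str."""
    if not isinstance(value, str):
        return []
    return re.findall(pattern, value)


# Pattern chosen for this sketch only: capture text between square brackets.
matches = [safe_findall(r"\[(.*?)\]", record) for record in records]
print(matches)
```

Returning an empty list keeps the output aligned with the input, one result per record. Dropping the None entries up front with a list comprehension would work just as well if that alignment is not needed.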