Trying to find human names in a file using NLTK
Question:
I’d like to extract human names from a text file, but for some reason I’m getting an empty list as output. Here is my code:
import re

import nltk
from nltk.corpus import names

nltk.download('names')
nltk.download('punkt')

# Create a set of male and female names from the nltk names corpus
male_names = names.words('male.txt')
female_names = names.words('female.txt')
all_names = set(male_names + female_names)

def flag_people_names(text):
    possible_names = []
    words = nltk.word_tokenize(text)
    for word in words:
        # Split the word by ' ', '.' or '_' and check each part
        parts = re.split('[ _.]', word)
        for part in parts:
            if part.lower() in all_names:
                possible_names.append(word)
                break
    return possible_names

# Read text file
with open('sample.txt', 'r') as file:
    text = file.read()

# Call function to flag possible names
names = flag_people_names(text)
print(names)
Here is the input file, sample.txt:
James is a really nice guy
Gina is a friend of james.
Gina and james like to play with Andy.
I get this as the output:
[]
I’d like to get James, Gina and Andy.
I’m on macOS Catalina with Python 3.8.5.
Any idea what’s not working here?
Answers:
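Try removing the ".lower()" call from "part.lower()". The entries in the NLTK names corpus are capitalized (for example "James"), not all lowercase, so a lowercased word never matches anything in all_names and the function returns an empty list.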
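Dropping ".lower()" still leaves the lowercase "james" in the sample text unmatched. If case-insensitive matching is what you want, one option is to lowercase both sides of the comparison. The sketch below is only an illustration under a few assumptions: SAMPLE_NAMES is a hypothetical stand-in for the set built from the names corpus, and a simple regex replaces nltk.word_tokenize so the snippet runs without downloading any NLTK data.

```python
import re

# Hypothetical stand-in for set(names.words('male.txt') + names.words('female.txt')).
# Like the real corpus entries, these are capitalized.
SAMPLE_NAMES = {"James", "Gina", "Andy"}

def flag_people_names(text, known_names):
    """Return the distinct names found in text, compared case-insensitively."""
    # Lowercase the reference set once so both sides of the comparison agree.
    lowered = {name.lower() for name in known_names}
    found = []
    # Assumption: a plain regex tokenizer instead of nltk.word_tokenize.
    for word in re.findall(r"[A-Za-z]+", text):
        candidate = word.capitalize()
        if word.lower() in lowered and candidate not in found:
            found.append(candidate)
    return found

text = (
    "James is a really nice guy\n"
    "Gina is a friend of james.\n"
    "Gina and james like to play with Andy.\n"
)
result = flag_people_names(text, SAMPLE_NAMES)
print(result)  # ['James', 'Gina', 'Andy']
```

With the full names corpus, passing all_names as known_names should work the same way, but be aware that case-insensitive matching may also flag ordinary words that happen to be valid first names.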