How to extract name from string using nltk

Question:

I am trying to extract a name (Indian) from an unstructured string.

Here is my code:

text = "Balaji Chandrasekaran Bangalore |  Senior Business Analyst/ Lead Business Analyst An accomplished Senior Business Analyst with a track record of handling complex projects in given period of time, exceeding above the expectation. Successful at developing product road maps and leading cross-functional software teams from prototype to release. Professional Competencies Systems Development Life Cycle (SDLC) Agile methodologies Business process improvement Requirements gathering & Analysis Project Management UML Specification UI & UX (Wireframe Designing) Functional Specification Test Scenario Creation SharePoint Admin Work History Senior Business Analyst (Aug 2012 Current) YouBox Technology pvt ltd, Chennai Translating business goals, feature concepts and customer needs into prioritized product requirements and use cases. Expertized in designing innovative wireframes combining user experience analysis and technology models. Extensive Experience in implementing soft wares for Shipping/Logistics firms to handle CRM, Finance, Logistics, Operations, Intermodal, and documentation. Strong interpersonal skills, highly adept at diplomatically facilitating discussions and negotiations with stakeholders. Education Bachelor of Engineering: Electronics & Communication, 2011 CES Tech Hosur Accomplishment Successful onsite implementation at various locations around the globe for Europe Shipping Company. - (Pre Study, General Design, and Functional Specification) Organized Business Analyst Forum and conducted various activities to develop skill sets of Business Analysts."
import nltk

if text != "":
    # Chunk each proper noun (NNP) on its own and label the chunk "PERSON"
    grammar = """PERSON: {<NNP>}"""
    chunkParser = nltk.RegexpParser(grammar)
    tagged = nltk.pos_tag(nltk.word_tokenize(text))
    tree = chunkParser.parse(tagged)

    proper_nouns = []  # text of every chunk labelled "PERSON"
    for subtree in tree.subtrees():
        if subtree.label() == "PERSON":
            proper_nouns.append(' '.join([c[0] for c in subtree]))

    print(proper_nouns)

['Balaji', 'Chandrasekaran', 'Bangalore', '|', 'Senior', 'Business',
'Analys', '/', 'Lead', 'Business', 'Analyst', 'Senior', 'Business',
'Analyst', 'Successful', 'Development', 'Life', 'Cycle', 'SDLC',
'Agile', 'Business', 'Requirements', 'Analysis', 'Project',
'Management', 'UML', 'Specification', 'UI', 'UX', 'Wireframe',
'Designing', 'Functional', 'Specification', 'Test', 'Scenario',
'Creation', 'SharePoint', 'Admin', 'Work', 'History', 'Senior',
'Business', 'Analyst', 'Aug', 'Current', 'Technology', 'Chennai',
'Translating', 'CRM', 'Finance', 'Logistics', 'Operations',
'Intermodal', 'Education', 'Bachelor', 'Engineering', 'Electronics',
'Communication', 'Accomplishment', 'Successful', 'Mediterranean',
'Ship', 'Company', 'MSC', 'Georgia', 'MSC', 'Cambodia', 'MSC', 'MSC',
'South', 'Successful', 'Stake', 'MSC', 'Geneva', 'Switzerland', 'Pre',
'Study', 'General', 'Design', 'Functional', 'Specification', 'O',
'Business', 'Analyst', 'Forum', 'Business']

But I actually need to get only Balaji Chandrasekaran. I even tried the Stanford NER library, which also fails to pick up Balaji Chandrasekaran.

Can anyone help me extract the name from the unstructured string, or suggest a good tutorial on how to do that?

Thank you in advance.

Answers:

Like I said in the comments, you would have to create your own corpus of Indian names and test your text against that. The NLTK Book teaches you how to load your own corpus in Chapter 2 (Section 1.9 to be exact).

import nltk
from nltk.corpus import PlaintextCorpusReader

# You can use a regular expression to find the files, or pass a list of files
files = r".*\.txt"

# "/path/" is a placeholder for the directory that holds your name files
new_corpus = PlaintextCorpusReader("/path/", files)
corpus = nltk.Text(new_corpus.words())
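
Testing the text against such a corpus could look roughly like the sketch below. It is only an illustration under a few assumptions: the name list is already available as a set of lowercase words (the sample names here are invented; in practice they would come from the corpus words), matching is case-insensitive, and `find_known_names` is a hypothetical helper, not part of NLTK.

```python
import re

# Hypothetical name list; in practice, build it from the words of your own corpus.
KNOWN_NAMES = {"balaji", "chandrasekaran", "priya", "ramesh"}

def find_known_names(text, known_names=KNOWN_NAMES):
    """Return runs of consecutive tokens that all appear in known_names."""
    tokens = re.findall(r"[A-Za-z]+", text)  # alphabetic tokens only
    matches, current = [], []
    for token in tokens:
        if token.lower() in known_names:
            current.append(token)  # extend the current run of name tokens
        elif current:
            matches.append(" ".join(current))  # a non-name token ends the run
            current = []
    if current:
        matches.append(" ".join(current))
    return matches

print(find_known_names("Balaji Chandrasekaran Bangalore | Senior Business Analyst"))
# ['Balaji Chandrasekaran']
```

Grouping consecutive matches keeps the first and last name together. A plain lookup like this only finds names that are in the list, so its quality depends entirely on how complete the name corpus is.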

See also: Creating a new corpus with NLTK

Answered By: emporerblk

Named entity recognition is not just about finding known names; the recognizer uses a combination of clues, including the form of words and the structure of the text. The name you fail to recognize occurs in a heading, not in running text, so NLTK’s recognizer (which is not that great anyway) cannot find it. See what happens if you use this name in a sentence:

>>> import nltk
>>> text = "Balaji Chandrasekaran is a senior business analyst and lives in Bangalore."
>>> words = nltk.word_tokenize(text)
>>> print(nltk.ne_chunk(nltk.pos_tag(words)))
(S
  (PERSON Balaji/NNP)
  Chandrasekaran/NNP
  is/VBZ
  a/DT
  senior/JJ
  business/NN
  analyst/NN
  and/CC
  lives/NNS
  in/IN
  (GPE Bangalore/NNP)
  ./.)

It missed the last name (like I said, the recognizer isn’t that great), but it was able to figure out that there’s a name here.

In other words: Your problem is that you’re not mining running text, but resumes. The only good solution is to build and train a recognizer with some annotated resumes in the same format that you want to process. It’s not terribly simple: you’ll need to annotate your training corpus, and to figure out the useful features (clues from the word form and document structure) that your “feature extraction function” will place in a dictionary. Everything you’ll need is described in various parts of chapters 6 and 7 of the NLTK book.
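
To give an idea of what such a feature extraction function might look like, here is a minimal sketch in plain Python. The feature names, the `token_features` helper and the "first five tokens" cutoff are all assumptions made for illustration; which clues actually help is something you would have to find out from your annotated resumes.

```python
def token_features(tokens, i):
    """Build a feature dictionary for tokens[i]; the feature names are illustrative."""
    word = tokens[i]
    prev_word = tokens[i - 1] if i > 0 else "<START>"
    next_word = tokens[i + 1] if i + 1 < len(tokens) else "<END>"
    return {
        # Clues from the word form
        "is_title_case": word.istitle(),
        "is_all_caps": word.isupper(),
        "suffix3": word[-3:].lower(),
        # Clues from the document structure: names tend to open a resume
        "position": i,
        "near_start": i < 5,  # assumed cutoff, tune on real data
        # Clues from the neighbouring tokens
        "prev_is_title_case": prev_word.istitle(),
        "next_is_title_case": next_word.istitle(),
    }

tokens = "Balaji Chandrasekaran Bangalore | Senior Business Analyst".split()
for i in range(3):
    print(tokens[i], token_features(tokens, i))
```

Paired with a label from your annotation (for example whether the token is part of a name), dictionaries of this shape are the kind of input that NLTK classifiers such as `nltk.NaiveBayesClassifier.train` accept as a list of `(features, label)` tuples.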

Answered By: alexis