NLTK: Detecting whether a sentence is interrogative or not?

Question:

I want to create a Python script, using NLTK or whatever library is best, that correctly identifies whether a given sentence is interrogative (a question) or not. I tried using regex, but there are deeper scenarios where regex fails, so I wanted to use Natural Language Processing. Can anybody help?

Asked By: Freakant

Answers:

This will probably solve your problem.

Here is the code:

import nltk

# The NPS Chat corpus labels each post with a dialogue act class
# such as ynQuestion, whQuestion or Statement
nltk.download('nps_chat')
nltk.download('punkt')  # required by nltk.word_tokenize
posts = nltk.corpus.nps_chat.xml_posts()[:10000]


def dialogue_act_features(post):
    # bag-of-words features: one boolean entry per lowercased token
    features = {}
    for word in nltk.word_tokenize(post):
        features['contains({})'.format(word.lower())] = True
    return features

featuresets = [(dialogue_act_features(post.text), post.get('class')) for post in posts]
# hold out the first 10% for testing, train on the remaining 90%
size = int(len(featuresets) * 0.1)
train_set, test_set = featuresets[size:], featuresets[:size]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))

And that should print something like 0.67, which is decent accuracy.
If you want to run a string of text through this classifier, where line is the sentence you want to classify, try:

print(classifier.classify(dialogue_act_features(line)))

And you can categorise strings into whether they are ynQuestion, Statement, etc, and extract what you desire.
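For example, a minimal boolean wrapper over the classifier trained above (a sketch only; the helper name looks_like_question and the sample sentences are my own, not part of the original answer):

QUESTION_CLASSES = {'whQuestion', 'ynQuestion'}

def looks_like_question(sentence):
    label = classifier.classify(dialogue_act_features(sentence))
    return label in QUESTION_CLASSES

print(looks_like_question('are you coming to the party'))  # expected: True, even without a '?'
print(looks_like_question('i am going home now'))          # expected: False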

This approach uses Naive Bayes, which in my opinion is the simplest, but there are certainly many other ways to do this. Hope this helps!

Answered By: PolkaDot

You can improve PolkaDot's solution and reach an accuracy of around 86% with a simple gradient boosting model from the sklearn library. That comes out to something like this:

import nltk
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report

nltk.download('nps_chat')
posts = nltk.corpus.nps_chat.xml_posts()


posts_text = [post.text for post in posts]

# divide train and test 80/20
train_text = posts_text[:int(len(posts_text)*0.8)]
test_text = posts_text[int(len(posts_text)*0.8):]

#Get TFIDF features
vectorizer = TfidfVectorizer(ngram_range=(1,3), 
                             min_df=0.001, 
                             max_df=0.7, 
                             analyzer='word')

X_train = vectorizer.fit_transform(train_text)
X_test = vectorizer.transform(test_text)

y = [post.get('class') for post in posts]

y_train = y[:int(len(posts_text)*0.8)]
y_test = y[int(len(posts_text)*0.8):]

# Fitting Gradient Boosting classifier to the Training set
gb = GradientBoostingClassifier(n_estimators = 400, random_state=0)
#Can be improved with Cross Validation

gb.fit(X_train, y_train)

predictions = gb.predict(X_test)

# accuracy of 86%, not bad
print(classification_report(y_test, predictions))

Then you can use the model to make predictions on new data with gb.predict(vectorizer.transform(['new sentence here'])).
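A small convenience wrapper around that call (a sketch, assuming the gb and vectorizer objects fitted above; the helper name and label set are my own additions):

QUESTION_CLASSES = {'whQuestion', 'ynQuestion'}

def predict_is_question(sentence):
    # vectorizer.transform expects an iterable of documents
    label = gb.predict(vectorizer.transform([sentence]))[0]
    return label in QUESTION_CLASSES

print(predict_is_question('how do I install nltk'))  # expected: True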

Answered By: Jerry Fanelli

Building on @PolkaDot's answer, I created a function that uses NLTK plus some custom code to get more accuracy.

import nltk
nltk.download('nps_chat')
nltk.download('punkt')  # required by nltk.word_tokenize

posts = nltk.corpus.nps_chat.xml_posts()[:10000]

def dialogue_act_features(post):
    features = {}
    for word in nltk.word_tokenize(post):
        features['contains({})'.format(word.lower())] = True
    return features

featuresets = [(dialogue_act_features(post.text), post.get('class')) for post in posts]

# 10% of the total data
size = int(len(featuresets) * 0.1)

# first 10% for test_set to check the accuracy, and rest 90% after the first 10% for training
train_set, test_set = featuresets[size:], featuresets[:size]

# get the classifer from the training set
classifier = nltk.NaiveBayesClassifier.train(train_set)
# to check the accuracy - 0.67
# print(nltk.classify.accuracy(classifier, test_set))

question_types = ["whQuestion","ynQuestion"]
def is_ques_using_nltk(ques):
    question_type = classifier.classify(dialogue_act_features(ques)) 
    return question_type in question_types

and then

question_pattern = ["do i", "do you", "what", "who", "is it", "why","would you", "how","is there",
                    "are there", "is it so", "is this true" ,"to know", "is that true", "are we", "am i", 
                   "question is", "tell me more", "can i", "can we", "tell me", "can you explain",
                   "question","answer", "questions", "answers", "ask"]

helping_verbs = ["is","am","can", "are", "do", "does"]
# check with custom pipeline if still this is a question mark it as a question
def is_question(question):
    question = question.lower().strip()
    if not is_ques_using_nltk(question):
        is_ques = False
        # check if any of pattern exist in sentence
        for pattern in question_pattern:
            is_ques  = pattern in question
            if is_ques:
                break

        # there could be multiple sentences so divide the sentence
        sentence_arr = question.split(".")
        for sentence in sentence_arr:
            if len(sentence.strip()):
                # if question ends with ? or start with any helping verb
                # word_tokenize will strip by default
                first_word = nltk.word_tokenize(sentence)[0]
                if sentence.endswith("?") or first_word in helping_verbs:
                    is_ques = True
                    break
        return is_ques    
    else:
        return True

You just need to call the is_question function to check whether the passed sentence is a question or not.
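For example (sample sentences are my own; results come from the trained NLTK classifier plus the custom patterns, so exact outputs may vary):

print(is_question("what time is it"))          # expected: True (whQuestion)
print(is_question("it is raining outside"))    # expected: False
print(is_question("tell me more about nltk"))  # expected: True, via the custom patterns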

Answered By: Sunil Garg

Building on the previous answers: if your only task is to build a binary classifier that tells whether a given sentence is a question or not, I would rather train a binary classifier directly. You can first preprocess the labels to create binary labels and then train on those.

This boosts the trained classifier to 0.864 accuracy:

import nltk

nltk.download('nps_chat')
posts = nltk.corpus.nps_chat.xml_posts()[:10000]

def dialogue_act_features(post):
    features = {}
    for word in nltk.word_tokenize(post):
        features['contains({})'.format(word.lower())] = True
    return features

def generate_binary_feature(label):
    # treat the question-related dialogue acts as the positive class
    return label in ['whQuestion', 'yAnswer', 'ynQuestion']

featuresets = [(dialogue_act_features(post.text), generate_binary_feature(post.get('class'))) for post in posts]

# 10% of the total data
size = int(len(featuresets) * 0.1)

# first 10% for test_set to check the accuracy, and rest 90% after the first 10% for training
train_set, test_set = featuresets[size:], featuresets[:size]

# get the classifer from the training set
classifier = nltk.NaiveBayesClassifier.train(train_set)
# to check the accuracy
print(nltk.classify.accuracy(classifier, test_set))
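Since the labels are now booleans, classifying a new sentence returns True or False directly (a quick sketch using the classifier trained above; the sample sentences are my own):

print(classifier.classify(dialogue_act_features('is this a question')))   # expected: True
print(classifier.classify(dialogue_act_features('this is a statement')))  # expected: False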

Answered By: code_fan