Removing nonsense words in Python

Question:

I want to remove nonsense words in my dataset.

I tried something like this, which I saw on Stack Overflow:

import nltk
words = set(nltk.corpus.words.words())

sent = "Io andiamo to the beach with my amico."
" ".join(w for w in nltk.wordpunct_tokenize(sent) 
     if w.lower() in words or not w.isalpha())

But now that I have a DataFrame, how do I apply this to a whole column?

I tried something like this:

import nltk
words = set(nltk.corpus.words.words())

sent = df['Chats']
df['Chats'] = df['Chats'].apply(lambda w:" ".join(w for w in 
nltk.wordpunct_tokenize(sent) 
     if w.lower() in words or not w.isalpha()))

But I am getting an error: TypeError: expected string or bytes-like object

Asked By: Questions


Answers:

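The error occurs because sent inside your lambda is the whole Chats column (a Series), not a single string, so the tokenizer receives something that is not a string. Something like the following will generate a new column, Clean, by applying your function to each row of the Chats column: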

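import nltk

# Requires the NLTK 'words' corpus, e.g. via nltk.download('words')
words = set(nltk.corpus.words.words())

def clean_sent(sent):
    # Keep tokens that are known English words or are not purely alphabetic
    return " ".join(w for w in nltk.wordpunct_tokenize(sent)
                    if w.lower() in words or not w.isalpha())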

df['Clean'] = df['Chats'].apply(clean_sent)

To update the Chats column itself, you can overwrite it using the original column:

df['Chats'] = df['Chats'].apply(clean_sent)
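The same TypeError can also appear if some cells in the column are not strings (for example, missing values). Below is a minimal sketch of a more defensive clean_sent, under some assumptions: VOCAB is a small hypothetical stand-in for set(nltk.corpus.words.words()), and TOKEN_RE is a plain regex that mirrors the pattern used by NLTK's wordpunct_tokenize, so the example runs without NLTK or its corpus installed.

```python
import re

# Mirrors the pattern behind NLTK's wordpunct_tokenize:
# runs of word characters, or runs of punctuation.
TOKEN_RE = re.compile(r"\w+|[^\w\s]+")

# Hypothetical stand-in vocabulary (assumption for this sketch);
# in practice this would be set(nltk.corpus.words.words()).
VOCAB = {"to", "the", "beach", "with", "my"}

def clean_sent(sent, vocab=VOCAB):
    """Keep tokens that are in the vocabulary or are not purely alphabetic."""
    if not isinstance(sent, str):
        # Non-string cell (e.g. a missing value): return it unchanged
        # instead of letting the tokenizer raise TypeError.
        return sent
    tokens = TOKEN_RE.findall(sent)
    return " ".join(t for t in tokens if t.lower() in vocab or not t.isalpha())

# A plain list stands in for the Chats column here.
chats = ["Io andiamo to the beach with my amico.", None, "my beach 42!"]
cleaned = [clean_sent(c) for c in chats]
print(cleaned)
```

With a DataFrame, this function could be passed to df['Chats'].apply(clean_sent) in the same way as above.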
Answered By: Wes Doyle

import re

df['Chats'] = [re.sub('n', '', x) for x in df['Chats']]

Answered By: Jainendra Bhiduri