How to remove rows that have 3 words or fewer in a dataframe?

Question:

I want to clean my data well before training so that ambiguous rows are removed. How can I remove all rows that contain 3 words or fewer in Python?

Asked By: John Sall

Answers:

Hello World! This will be my first contribution ever to SO 🙂

Let’s create some data:

import pandas as pd

data = {'Source': ['Hello all Im Happy', 'Its a lie, dont trust him', 'Oops', 'foo', 'bar']}
df = pd.DataFrame(data, columns=['Source'])

My approach is very straightforward and simple, if a little “brute” and inefficient; however, I ran it on a large dataframe (1013952 rows) and the time was fairly acceptable.
Let’s find the indices of the rows that have too few tokens:

from nltk.tokenize import word_tokenize


def get_indices(df, col, n):
    """
    Get the index labels of the rows that have n tokens or fewer in a specific column

    Parameters:

       df(pandas dataframe)
       col(string): column name
       n(int): maximum number of tokens for a row to be dropped

    """
    tmp = []
    for i in range(len(df)):  # df.iterrows() wasnt working for me
        # word_tokenize also counts punctuation marks as tokens
        if len(word_tokenize(df[col].iloc[i])) <= n:
            tmp.append(df.index[i])  # store the label so df.drop() works with any index
    return tmp

Next we just need to call the function and drop the rows at said indices:

tmp = get_indices(df, 'Source', 3)
df_clean = df.drop(tmp)

Best!

Answered By: José Rodrigues
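import pandas as pd
df = pd.DataFrame({"mycolumn": ["", " ", "test string", "test string 1", "test string 2 2"]})
# 3 or more spaces means at least 4 words, assuming words are separated by exactly one space
df = df.loc[df["mycolumn"].str.count(" ") >= 3]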

You should avoid looping over a dataframe; use vectorized operations whenever possible.
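Counting spaces can miscount when a string has leading, trailing, or repeated spaces. As a minimal sketch of another vectorized option, the words can be counted by splitting on whitespace instead. The helper name `drop_short_rows` and its `max_words` parameter are made up for illustration, and the last two sample rows were added here to show the edge cases:

```python
import pandas as pd

# Sample data from the first answer, plus two edge cases added for illustration.
df = pd.DataFrame({
    "Source": [
        "Hello all Im Happy",
        "Its a lie, dont trust him",
        "Oops",
        "foo",
        "bar",
        "  two   words  ",  # extra spaces should not inflate the count
        "",                 # empty string counts as zero words
    ]
})


def drop_short_rows(frame, col, max_words=3):
    """Hypothetical helper: keep rows whose `col` has more than `max_words` words."""
    # str.split() with no arguments splits on any run of whitespace,
    # and str.len() then gives the number of pieces per row.
    word_counts = frame[col].str.split().str.len()
    return frame.loc[word_counts > max_words]


df_clean = drop_short_rows(df, "Source")
print(df_clean)
```

Unlike a tokenizer, this treats "lie," as a single word, so the counts may differ from the `word_tokenize` approach when the text contains punctuation.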

Answered By: Vega