How to delete specific values from a list-column in pandas

Question:

I’ve used POS tagging on German text (so nouns are tagged "NN" or "NE"), and now I am having trouble extracting the nouns into a new column of the pandas DataFrame.

Example:

import pandas as pd
data = {"tagged": [[("waffe", "Waffe", "NN"), ("haus", "Haus", "NN")], [("groß", "groß", "ADJD"), ("bereich", "Bereich", "NN")]]}
df = pd.DataFrame(data=data)
df
df["nouns"] = df["tagged"].apply(lambda x: [word for word, tag in x if tag in ["NN", "NE"]])

This results in the following error message: "ValueError: too many values to unpack (expected 2)"

I think the code would work if I could delete the first value of each tagged word, but I cannot figure out how to do that.

Asked By: stacksterppr


Answers:

Because each tuple has three values, unpack it into three variables (word1, word2 and tag) and keep word2:

df["nouns"] = df["tagged"].apply(
    lambda x: [word2 for word1, word2, tag in x if tag in ["NN", "NE"]]
)

Or use the same approach in a plain list comprehension, without apply:

df["nouns"] = [[word2 for word1, word2, tag in x if tag in ["NN", "NE"]]
               for x in df["tagged"]]

print(df)
                                         tagged          nouns
0        [(waffe, Waffe, NN), (haus, Haus, NN)]  [Waffe, Haus]
1  [(groß, groß, ADJD), (bereich, Bereich, NN)]      [Bereich]
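
The idea from the question, deleting the first value of each tuple so that the two-variable unpacking works, can also be sketched directly. This is only an illustration under assumptions: the helper name drop_first, the constant NOUN_TAGS and the intermediate column "trimmed" are made up here and are not part of the original code.

```python
import pandas as pd

data = {"tagged": [[("waffe", "Waffe", "NN"), ("haus", "Haus", "NN")],
                   [("groß", "groß", "ADJD"), ("bereich", "Bereich", "NN")]]}
df = pd.DataFrame(data=data)

NOUN_TAGS = {"NN", "NE"}  # hypothetical constant; a set for membership tests


def drop_first(tuples):
    """Return the tuples without their first element (hypothetical helper)."""
    return [t[1:] for t in tuples]


# Step 1: strip the lowercased token, leaving (word, tag) pairs.
df["trimmed"] = df["tagged"].apply(drop_first)

# Step 2: the two-variable unpacking from the question now works.
df["nouns"] = df["trimmed"].apply(
    lambda pairs: [word for word, tag in pairs if tag in NOUN_TAGS]
)

print(df["nouns"].tolist())
# [['Waffe', 'Haus'], ['Bereich']]
```

Unpacking three variables in one pass, as shown earlier in this answer, avoids the intermediate column; the two-step version is mainly useful if the trimmed tuples are needed elsewhere.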
Answered By: jezrael

I think it would be easier with a function call. The function below builds, for each row, a list of the NN or NE tags it finds (the tags themselves, not the words). If you would like to deduplicate the result, you need to update the function.

import pandas as pd
data = {"tagged": [[("waffe", "Waffe", "NN"), ("haus", "Haus", "NN")], [("groß", "groß", "ADJD"), ("bereich", "Bereich", "NN")]]}
df = pd.DataFrame(data=data)

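# function: collect every NN/NE tag found in one row
def getNoun(obj):
    ret = []  # start with an empty list as the default value
    for word_group in obj:  # iterate over the tuples in the row
        for tag in word_group:  # iterate over each element of the tuple
            if tag in ['NN', 'NE']:
                ret.append(tag)  # add the matching tag to the return list
    return ret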

#call new column creation
df['noun']=df['tagged'].apply(getNoun)

#result
print(df['noun'])

#output:
#0    [NN, NN]
#1        [NN]
#Name: noun, dtype: object
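
Since the function above collects the tags rather than the words, here is a hedged sketch of a variant that returns the nouns and optionally deduplicates them. The name get_nouns, the dedupe parameter and the column "noun_unique" are hypothetical, and the sketch assumes every tuple has exactly three elements with the tag last, as in the sample data.

```python
import pandas as pd

data = {"tagged": [[("waffe", "Waffe", "NN"), ("haus", "Haus", "NN")],
                   [("groß", "groß", "ADJD"), ("bereich", "Bereich", "NN")]]}
df = pd.DataFrame(data=data)


def get_nouns(tagged, dedupe=False):
    """Return the second element of each (token, word, tag) tuple tagged NN or NE.

    Hypothetical helper; assumes exactly three elements per tuple.
    """
    nouns = [word for _, word, tag in tagged if tag in ("NN", "NE")]
    if dedupe:
        # dict.fromkeys drops repeats while keeping first-seen order
        nouns = list(dict.fromkeys(nouns))
    return nouns


df["noun"] = df["tagged"].apply(get_nouns)
df["noun_unique"] = df["tagged"].apply(lambda row: get_nouns(row, dedupe=True))

print(df["noun"].tolist())
# [['Waffe', 'Haus'], ['Bereich']]

# Deduplication only shows an effect when a row repeats a noun:
print(get_nouns([("haus", "Haus", "NN"), ("haus", "Haus", "NN")], dedupe=True))
# ['Haus']
```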
Answered By: bracko