Splitting a string made from a DataFrame row-wise

Question:

I’m trying to tokenize the words within a DataFrame that looks like this:

  A            B       C          D            E           F
0 Orange     robot   x eyes   discomfort   striped tee    nan
1 orange     robot  blue beams   grin      vietnam jacket nan
2 aquamarine robot   3d          bored        cigarette   nan   
     

After removing all the special characters, the DataFrame becomes a string:

import re

df_str = df.to_string(header=False)

normalised_text = df_str.lower()
text = re.sub(r"[^a-zA-Z0-9 ]", "", normalised_text)

print(text)

    1    orange   robot   x eyes   discomfort   striped tee   nan
    2    orange   robot   blue beams   grin   vietnam jacket  nan
    3    aquamarine  robot   3d       bored       cigarette    nan   

so when I tokenize this string with the code below,

from nltk.tokenize import word_tokenize

def tokenize(obj):
    if obj is None:
        return None
    elif isinstance(obj, str):
        return word_tokenize(obj)
    elif isinstance(obj, list):
        return [tokenize(i) for i in obj]
    else:
        return obj

tokenized_text = tokenize(text)

I get the output

['orange', 'robot', 'x', 'eyes', 'discomfort', 'striped', 'tee', nan,'orange', 'robot', 'blue', 'beams', 'grin', 'vietnam', 'jacket', nan,'aquamarine', 'robot', '3d', 'bored', 'cigarette', nan, 'sea', 'captains', 'hat']

which is quite different from the output I expected

[['orange'], ['robot'], ['x', 'eyes'], ['discomfort'], ['striped', 'tee'], nan]
[['orange'], ['robot'], ['blue', 'beams'], ['grin'], ['vietnam', 'jacket'], nan]
[['aquamarine'], ['robot'], ['3d'], ['bored', 'cigarette'], nan, ['sea', 'captains', 'hat']]

Any ideas on how I can get the output I expected?
Any help would be greatly appreciated!

Asked By: mimiskims


Answers:

Don’t convert the DataFrame to a string; instead, work with each text value in the DataFrame separately.

Use .applymap(function) to run the function on every cell in the DataFrame.

new_df = df.applymap(tokenize)

result = new_df.values.tolist()

Minimal working example:

import pandas as pd
from nltk.tokenize import word_tokenize

data = {
    'Background': ['Orange', 'Orange', 'Aqua'], 
    'Fur': ['Robot', 'Robot', 'Robot'], 
    'Eyes': ['X Eyes', 'Blue Beams', '3d'],
    'Mouth': ['Discomfort', 'Grin', 'Bored Cigarette'],
    'Clothes': ['Striped Tee', 'Vietman Jacket', None],
    'Hat': [None, None, "Sea Captain's Hat"],
}

df = pd.DataFrame(data)

print(df.to_string())  # `to_string()` to display full dataframe without `...`

# ----------------------------------------

def tokenize(obj):
    if obj is None:
        return None
    elif isinstance(obj, str): 
        return word_tokenize(obj)
    elif isinstance(obj, list):
        return [tokenize(i) for i in obj]
    else:
        return obj

new_df = df.applymap(tokenize)

result = new_df.values.tolist()

print(result)

Result:

  Background    Fur        Eyes            Mouth         Clothes                Hat
0     Orange  Robot      X Eyes       Discomfort     Striped Tee               None
1     Orange  Robot  Blue Beams             Grin  Vietman Jacket               None
2       Aqua  Robot          3d  Bored Cigarette            None  Sea Captain's Hat

[
  [['Orange'], ['Robot'], ['X', 'Eyes'], ['Discomfort'], ['Striped', 'Tee'], None], 
  [['Orange'], ['Robot'], ['Blue', 'Beams'], ['Grin'], ['Vietman', 'Jacket'], None], 
  [['Aqua'], ['Robot'], ['3d'], ['Bored', 'Cigarette'], None, ['Sea', 'Captain', "'s", 'Hat']]
]
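Note that newer pandas versions (2.1+) deprecate DataFrame.applymap in favour of DataFrame.map, which does the same element-wise mapping. A minimal sketch of a version-tolerant variant, using a plain str.split() as a stand-in for nltk's word_tokenize so it runs without downloading NLTK data (the column names are just illustrative):

```python
import pandas as pd

def tokenize(obj):
    # None stays None so missing values survive the mapping
    if obj is None:
        return None
    if isinstance(obj, str):
        return obj.split()  # stand-in for nltk's word_tokenize
    return obj

df = pd.DataFrame({
    'Eyes': ['X Eyes', 'Blue Beams'],
    'Hat': [None, "Sea Captain's Hat"],
})

# DataFrame.map exists from pandas 2.1; fall back to applymap on older versions
mapper = df.map if hasattr(df, 'map') else df.applymap
new_df = mapper(tokenize)

result = new_df.values.tolist()
print(result)
```

Unlike word_tokenize, str.split() keeps "Captain's" as one token; swap word_tokenize back in if you need NLTK's tokenization rules.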
Answered By: furas