How to extract data from lists as strings, and select data by value, in pandas?

Question:

I have a dataframe like this:

col1              col2
[abc, bcd, dog]   [[.4], [.5], [.9]]
[cat, bcd, def]   [[.9], [.5], [.4]]

The numbers in the col2 lists describe the elements in col1, matched by list index position. So ".4" in col2 describes "abc" in col1.

I want to create 2 new columns: one that pulls only the elements in col1 whose score in col2 is >= .9, and another that holds the matching number from col2; so ".9" for both rows.

Result:

col3     col4
[dog]   .9
[cat]   .9

I think removing the nested lists from col2 would be a fine route to take, but that’s harder than it sounds. I’ve been trying for an hour to remove those brackets.

Attempts:

spec_chars3 = ["[","]"]

for char in spec_chars3: # didn't work, turned everything to nan
    df1['avg_jaro_company_word_scores'] = df1['avg_jaro_company_word_scores'].str.replace(char, '')

df.col2.str.strip('[]') #didn't work b/c the nested list is still in a list, not a string

I haven’t even figured out how to pull out the list index number and filter col1 on that

Asked By: max


Answers:

  • Based on the explanation at the end of the question, it seems that both columns are str type, and need to be converted to list type
    • Use .applymap with ast.literal_eval (DataFrame.applymap is deprecated since pandas 2.1 in favor of DataFrame.map).
    • If only one column is str type, then use df[col] = df[col].apply(literal_eval)
  • The lists of data in each column must be extracted by using pandas.DataFrame.explode
    • The second (chained) explode on c2 unpacks the remaining single-element lists into scalars (i.e. [0.4] to 0.4).
  • Once the values are on separate rows, use Boolean Indexing to select data in the desired range.
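  • If you want to combine df with df_new, use df.join(df_new, rsuffix='_extracted'). The join aligns on the index, so omit ignore_index=True from the explode call so that df_new keeps the original row labels.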
  • Tested in python 3.10, pandas 1.4.3
import pandas as pd
from ast import literal_eval

# setup the test data: this data is lists
# data = {'c1': [['abc', 'bcd', 'dog'], ['cat', 'bcd', 'def']], 'c2': [[[.4], [.5], [.9]], [[.9], [.5], [.4]]]}

# setup the test data: this data is strings
data = {'c1': ["['abc', 'bcd', 'dog', 'cat']", "['cat', 'bcd', 'def']"], 'c2': ["[[.4], [.5], [.9], [1.0]]", "[[.9], [.5], [.4]]"]}

# create the dataframe
df = pd.DataFrame(data)

# the description leads me to think the data is columns of strings, not lists
# convert the columns from string type to list type
# the following line is only required if the columns are strings
df = df.applymap(literal_eval)

# explode the lists in each column, and then explode the remaining lists in 'c2'
df_new = df.explode(['c1', 'c2'], ignore_index=True).explode('c2')

# use Boolean Indexing to select the desired data
df_new = df_new[df_new['c2'] >= 0.9]

# display(df_new)
    c1   c2
2  dog  0.9
3  cat  1.0
4  cat  0.9
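
The bullet about joining df_new back to df depends on index alignment, and the question asks for one result row per original row. The following is a minimal sketch of that step, not part of the tested code above: it assumes the list-typed test data, drops ignore_index so the exploded rows keep their source row label, and uses the illustrative names exploded, kept and result.

```python
import pandas as pd

# list-typed version of the test data used above
data = {
    'c1': [['abc', 'bcd', 'dog', 'cat'], ['cat', 'bcd', 'def']],
    'c2': [[[.4], [.5], [.9], [1.0]], [[.9], [.5], [.4]]],
}
df = pd.DataFrame(data)

# explode WITHOUT ignore_index so every exploded row keeps its source row label
exploded = df.explode(['c1', 'c2']).explode('c2')

# Boolean Indexing, as in the code above
kept = exploded[exploded['c2'] >= 0.9]

# collapse back to one row per original row, collecting the matches into lists
df_new = kept.groupby(level=0).agg(list)

# join on the preserved index; rsuffix avoids the column-name clash
result = df.join(df_new, rsuffix='_extracted')
print(result[['c1_extracted', 'c2_extracted']])
```

Rows of df without any value in range would get NaN in the _extracted columns, since the join keeps every row of df.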
Answered By: Trenton McKinney

You can use list comprehensions to populate new columns with your criteria.

df['col3'] = [
    [value for value, score in zip(c1, c2) if score[0] >= 0.9]
    for c1, c2 in zip(df['col1'], df['col2'])
]
df['col4'] = [
    [score[0] for score in c2 if score[0] >= 0.9]
    for c2 in df['col2']
]

Output

              col1                   col2   col3   col4
0  [abc, bcd, dog]  [[0.4], [0.5], [0.9]]  [dog]  [0.9]
1  [cat, bcd, def]  [[0.9], [0.5], [0.4]]  [cat]  [0.9]
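
The question also asks how to remove the inner brackets and how to get the list index positions to filter col1 on. A minimal sketch of those two steps, assuming the columns already hold real lists (not strings); flat and keep are illustrative names, not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({
    'col1': [['abc', 'bcd', 'dog'], ['cat', 'bcd', 'def']],
    'col2': [[[.4], [.5], [.9]], [[.9], [.5], [.4]]],
})

# the .str methods only act on string cells, so check what the cells really are
print(df['col2'].map(type).unique())

# flatten [[.4], [.5], [.9]] to [.4, .5, .9]
flat = df['col2'].apply(lambda scores: [s[0] for s in scores])

# per row, the index positions whose score meets the threshold
keep = flat.apply(lambda scores: [i for i, s in enumerate(scores) if s >= 0.9])

# use those positions to pick the matching items from both columns
df['col3'] = [[c1[i] for i in idx] for c1, idx in zip(df['col1'], keep)]
df['col4'] = [[sc[i] for i in idx] for sc, idx in zip(flat, keep)]
print(df[['col3', 'col4']])
```

If a single number per row is wanted for col4, as in the question's expected result, df['col4'].str[0] should return the first match of each list, assuming every row has at least one.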
Answered By: RichieV