python pandas extracting an element from .str.split()

Question:

I want to extract the rows of a DataFrame where the middle part of a string is "BB" (see code):

import pandas as pd

data = {'col1': ['AA.BB.CC', 'AA.EE.CC']}
df = pd.DataFrame(data)
print(df)

df['col2'], df['col3'], df['col4'] = df['col1'].str.split('.', expand=True, n=3)
# Why is .str.split('.', expand=True, n=3) producing numbers instead of substrings?  What is going on here?
print(df)

df['col5'] = df['col1'].str.split('.')
# col5 holds lists of substrings, okay.
print(df)

df = df[df['col5'][1] == 'BB']
# disaster
print(df)

How do I actually do it?

Asked By: PlanetAtkinson


Answers:

Q. Why is .str.split('.', expand=True, n=3) producing numbers instead of substrings? What is going on here?

Answer: This happens because you are using tuple unpacking. Whatever is on the right-hand side is unpacked by iterating over it; for simplicity, you can think of it as calling list(RHS).

In your case, list(RHS) is list(df['col1'].str.split('.', expand=True, n=3)), which returns the column labels of the expanded DataFrame, [0, 1, 2], because iterating over a DataFrame yields its column labels rather than its columns.

For example:

list(df['col1'].str.split('.', expand=True, n=3))
# [0, 1, 2]
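
To make the effect of the unpacking concrete, here is a minimal sketch (the names parts, labels, first, second and third are only illustrative) showing that each target on the left-hand side receives a column label, which pandas then broadcasts to every row when it is assigned to a new column:

```python
import pandas as pd

df = pd.DataFrame({'col1': ['AA.BB.CC', 'AA.EE.CC']})
parts = df['col1'].str.split('.', expand=True)

# Iterating over a DataFrame yields its column labels, not its columns.
labels = list(parts)
print(labels)  # [0, 1, 2]

# Tuple unpacking iterates the right-hand side, so each name gets a label.
first, second, third = parts
print(first, second, third)  # 0 1 2

# Assigning a scalar label to a new column broadcasts it to every row.
df['col2'] = first
print(df['col2'].tolist())  # [0, 0]
```

This is why col2, col3 and col4 in the question end up filled with 0, 1 and 2.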

Fix this by assigning to a list of column names instead of unpacking:

df[['a', 'b', 'c']] = df['col1'].str.split('.', expand=True, n=3)
#        col1   a   b   c
# 0  AA.BB.CC  AA  BB  CC
# 1  AA.EE.CC  AA  EE  CC

Solution to your question

Split, then use the .str[1] accessor to get the item at index 1 for each row, and compare it with 'BB':

mask = df['col1'].str.split('.').str[1].eq('BB')
df[mask]

#        col1   a   b   c
# 0  AA.BB.CC  AA  BB  CC
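
As a self-contained sketch of the same idea (the variable names middle, split_col and row_one are only illustrative), the following also shows why the attempt in the question fails: plain [1] on the split Series selects the row labelled 1, which is a single list, so comparing it with 'BB' gives one boolean instead of a per-row mask.

```python
import pandas as pd

df = pd.DataFrame({'col1': ['AA.BB.CC', 'AA.EE.CC']})

# .str.split('.') gives one list per row; .str[1] picks element 1 of each list.
middle = df['col1'].str.split('.').str[1]
mask = middle.eq('BB')
result = df[mask]
print(result)

# By contrast, plain [1] selects the ROW labelled 1, i.e. a single list.
split_col = df['col1'].str.split('.')
row_one = split_col[1]
print(row_one)  # ['AA', 'EE', 'CC']
```

Element-wise access into each row's list always goes through the .str accessor; indexing the Series directly is label-based row selection.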
Answered By: Shubham Sharma