python pandas extracting an element from .str.split()
Question:
I want to extract rows from a dataframe where the middle part of a string is "BB" (see code)
import pandas as pd
data = {'col1': ['AA.BB.CC', 'AA.EE.CC']}
df = pd.DataFrame(data)
print(df)
df['col2'], df['col3'], df['col4'] = df['col1'].str.split('.', expand=True, n=3)
# Why is .str.split('.', expand=True, n=3) producing numbers instead of substrings? What is going on here?
print(df)
df['col5'] = df['col1'].str.split('.')
# col5 holds lists of substrings, okay.
print(df)
df = df[df['col5'][1] == 'BB']
# disaster
print(df)
How do I actually do it?
Answers:
Q. Why is .str.split('.', expand=True, n=3) producing numbers instead of substrings? What is going on here?
Answer: This happens because you are using tuple unpacking. Whatever is on the right-hand side gets unpacked by iterating over it; for the sake of simplicity you can think of this as calling list(RHS).
In your case list(RHS) would be list(df['col1'].str.split('.', expand=True, n=3)). Iterating over a dataframe yields its column labels, not its columns, so this returns the column labels of the expanded dataframe: 0, 1 and 2.
For example:
list(df['col1'].str.split('.', expand=True, n=3))
# [0, 1, 2]
Fix this by not using unpacking. Assign the expanded dataframe to a list of new column names instead:
df[['a', 'b', 'c']] = df['col1'].str.split('.', expand=True, n=3)
#        col1   a   b   c
# 0  AA.BB.CC  AA  BB  CC
# 1  AA.EE.CC  AA  EE  CC
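If you would rather keep the unpacking style from the question, one possible workaround is to unpack a list of the column Series rather than the dataframe itself. This is only a sketch; the name parts is introduced here for illustration and is not part of the original code.

```python
import pandas as pd

df = pd.DataFrame({'col1': ['AA.BB.CC', 'AA.EE.CC']})

# "parts" is a hypothetical helper name for the expanded dataframe.
parts = df['col1'].str.split('.', expand=True)

# Iterating over "parts" yields the labels 0, 1, 2, so look up each
# column by its label and unpack the resulting list of Series.
df['col2'], df['col3'], df['col4'] = [parts[label] for label in parts]
```

Assigning to df[['a', 'b', 'c']] as above is shorter; this variant only shows what the unpacking needed to receive.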
Solution to your question
Split, then use the .str[1] accessor to get the item at index 1 for each row, and compare it with 'BB':
mask = df['col1'].str.split('.').str[1].eq('BB')
df[mask]
#        col1   a   b   c
# 0  AA.BB.CC  AA  BB  CC
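As for why df[df['col5'][1] == 'BB'] from the question fails: plain [1] on a Series looks up the row labelled 1, so the comparison produces a single boolean rather than one value per row. Indexing the dataframe with that single value is then treated as a column lookup, which would typically raise a KeyError. The sketch below contrasts the two; row_one, single_flag and result are illustrative names only.

```python
import pandas as pd

df = pd.DataFrame({'col1': ['AA.BB.CC', 'AA.EE.CC']})
df['col5'] = df['col1'].str.split('.')

# Plain indexing on the Series selects the row with label 1: one list.
row_one = df['col5'][1]

# Comparing that list with a string gives one bool, not a row mask.
single_flag = row_one == 'BB'

# The .str accessor indexes into the list held in every row instead.
mask = df['col5'].str[1] == 'BB'
result = df.loc[mask, 'col1']
```

Here mask holds one boolean per row, which is what boolean indexing on a dataframe expects.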