Pandas groupby returning indexes that are not in the dataframe derived by subsetting with copy()

Question:

Problem: find the ZIPs that are not repeated in df.ZIP (each must occur no more than once) and for which df.ST does not have the value '.'.
So I subset the original dataframe and applied groupby, but this still brought back a few rows that did not meet the subset criterion (df.ST != '.'). I then created a separate df_us by subsetting with copy(). Groupby still gives the same indexes.

grouped = df[df.ST != '.'].groupby(['ZIP_CD'],sort=False) # grouping
df_size = pd.DataFrame({'ZIP':grouped.size().index, 'Count':grouped.size().values}) # Forming df around the group
df_count = df_size[df_size.Count==1] #df with Count=1
one_index = df_count.index.tolist() #gathering index
df_one = df.loc[one_index] #final df

df_us = df[df.ST != '.'].copy() # tried this too

The last line above still gives some indexes for '.' values when I groupby. But df_us does not contain any '.' at all. So this results in the same index column as the method above, but for the '.' values the rest of the row values are empty, because df_us does not have them!

Groupby keeps finding those indexes with '.' values no matter what I do.
Any solution?

Update:

Sample data:

index   ST        ZIP_CD
123     ca        94025
124     Toronto   .
125     ga        30306
126     Italy     .
127     ca        94025

So the correct answer is:

    index   ST   ZIP_CD
0   123     ca   94025
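
For reference, a minimal sketch that rebuilds this sample as a DataFrame. It assumes ZIP codes are stored as strings (so '.' can share the column) and keeps 'index' as an ordinary column, so the frame itself has a default 0..4 RangeIndex:

import pandas as pd

# Rebuild the sample data shown above.
# 'index' is treated as an ordinary column, so the DataFrame keeps a default RangeIndex 0..4.
df = pd.DataFrame({
    'index':  [123, 124, 125, 126, 127],
    'ST':     ['ca', 'Toronto', 'ga', 'Italy', 'ca'],
    'ZIP_CD': ['94025', '.', '30306', '.', '94025'],
})
print(df)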

Update:
@Naveed's solution and mine below work fine. I do not know why the code above is flawed.

Asked By: user4504270


Answers:

# use loc where ZIP ne '.'
# and ZIP is not duplicated

# df.duplicated(subset='ZIP') : flags duplicates based on the ZIP code and returns a True/False series
# df.loc selects the rows from df where duplicated is True
# df['ZIP'].isin : checks whether each row's ZIP is among those duplicated ZIPs
# the negation (~) eliminates them from being selected

# the first condition checks, via loc, that ZIP is not equal to '.'

# combining these two with logical AND filters df to the rows where both hold

# please note: while the same df is used repeatedly, the filtered result is different for each of them


(df.loc[df['ZIP'].ne('.') &
        ~df['ZIP'].isin(df.loc[df.duplicated(subset='ZIP')]['ZIP'])]
 )

    index   ST  ZIP_CD
0   123     ca  94025
2   125     ga  30306
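
To make the two conditions concrete, here is a small sketch of the intermediate masks on the sample data; note the column is named ZIP to match the code above, whereas the question's sample calls it ZIP_CD:

import pandas as pd

# Sample data with the column named ZIP, as in the answer's code.
df = pd.DataFrame({'ST':  ['ca', 'Toronto', 'ga', 'Italy', 'ca'],
                   'ZIP': ['94025', '.', '30306', '.', '94025']})

not_dot  = df['ZIP'].ne('.')                            # True where ZIP is not '.'
dup_zips = df.loc[df.duplicated(subset='ZIP'), 'ZIP']   # ZIP values that occur more than once
not_dup  = ~df['ZIP'].isin(dup_zips)                    # True where ZIP is not one of those
print(df.loc[not_dot & not_dup])                        # rows whose ZIP is valid and not repeated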
Answered By: Naveed

@Naveed, thanks for helping me learn. Your negation trick is new to me. I also wrote an alternative solution using it.

df1 = df[df.ZIP != '.'] # eliminate invalid entries
v = df1.ZIP.value_counts() # count occurrences of each ZIP
df2 = df1[~df1.ZIP.isin(v.index[v.gt(1)])] # negate the ZIPs that occur more than once

Link to try:
https://trinket.io/python3/b26eae2e0e

Also worth mentioning is @Jerrold110's approach, slightly modified:

items = v[v<2].index # items that appear less than twice
df2 = df1[df1['ZIP'].isin(items)] # keep only those ZIPs
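
As a quick check that the two formulations agree, a sketch on the sample data (again with the column named ZIP, as in the code above):

import pandas as pd

df = pd.DataFrame({'ST':  ['ca', 'Toronto', 'ga', 'Italy', 'ca'],
                   'ZIP': ['94025', '.', '30306', '.', '94025']})

df1 = df[df.ZIP != '.']       # drop the invalid entries first
v = df1.ZIP.value_counts()    # occurrence count per ZIP, e.g. 94025 -> 2, 30306 -> 1

keep_neg = df1[~df1.ZIP.isin(v.index[v.gt(1)])]   # negate the repeated ZIPs
keep_pos = df1[df1['ZIP'].isin(v[v < 2].index)]   # keep the non-repeated ZIPs
print(keep_neg.equals(keep_pos))                  # True -- both keep only the unique ZIPs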

I still do not know why the original groupby solution failed.

Answered By: user4504270

Here you go with your original approach, using groupby:

grouped = df[df.ST != '.'].groupby(['ZIP_CD'], sort=False) # grouping
item = grouped.size()[grouped.size() < 2].index # ZIP values that occur only once
df_one = df[df.ZIP_CD.isin(item)] # final df

I tested and it worked.
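
As for why the original version misbehaved, a likely cause: df_size is built as a brand-new DataFrame, so it gets a fresh RangeIndex (0, 1, ...); one_index therefore holds group positions rather than row labels of df, and df.loc[one_index] pulls whichever rows of df happen to carry those small integer labels. A minimal sketch on the sample data (filtering on ZIP_CD here, since that is where the '.' values sit):

import pandas as pd

# Same sample data as in the sketch after the question's update.
df = pd.DataFrame({
    'index':  [123, 124, 125, 126, 127],
    'ST':     ['ca', 'Toronto', 'ga', 'Italy', 'ca'],
    'ZIP_CD': ['94025', '.', '30306', '.', '94025'],
})

grouped = df[df.ZIP_CD != '.'].groupby(['ZIP_CD'], sort=False)
df_size = pd.DataFrame({'ZIP': grouped.size().index, 'Count': grouped.size().values})
print(df_size)            # note the fresh 0, 1, ... index on the left, one label per ZIP group

one_index = df_size[df_size.Count == 1].index.tolist()   # [1] -- a group position, not a df label
print(df.loc[one_index])  # returns row 1 of df (124, Toronto, '.'), one of the filtered-out rows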

Answered By: silicon23