Query the non-nan value from a pivoted df

Question

I have a pivoted df:

data = np.column_stack([["alpha", "beta", "gamma", "delta"], ["a", "b", "c", "d"], [0, 1, 2, 3], othercol, othercol2])
df = pn.DataFrame(data, columns=["greek", "latin", "distance", "othercol", "othercol2"])
piv = df.pivot(index="greek", columns="latin", values="distance")

and would like to access piv’s values by name, so I figured .loc is what I need.
Passing piv.loc["gamma", "c"] works as intended, but what if I wanted to access piv in a loop where I am iterating over random combinations of the greek and latin column names? In that case, one of the two orderings of a pair would return NaN.

In other words, is there a way to have .loc retrieve the non-nan value of a given combination of row/column names?

Edit:
Thanks @mozway for the detailed explanation!
Here is a more complete version of the code:

def _get_distance_df():
    for df in list_of_dfs:
        rows = []
        for (a, b), (id_a, id_b) in zip(
            itertools.combinations(df.obj_to_calc_dist_on, 2),
            itertools.combinations(df.id, 2),
        ):
            dist = get_distance(a, b)
            row = [
                df.formula.iloc[0],
                df.description.iloc[0],
                a,
                b,
                dist,
                id_a,
                id_b,
            ]
            rows.append(row)
        newdf = pn.DataFrame(rows, columns=distance_df.columns)
        distance_df = pn.concat([distance_df, newdf], ignore_index=True)
    reduced_distance_df = distance_df[["id1", "id2", "distance"]]
    distance_df = distance_df
    piv_distance_df = distance_df.pivot(
        index="id1", columns="id2", values="distance"
    )

I pivoted the distance df to have values (I thought) easier to access, in the hope of using something like .loc to then query it in the next function:

def main_logic():
    for df in self.list_of_dfs:
        id1 = df.iloc[0]["id"]
        id2 = df.iloc[1]["id"]
        try:
            # distance = self.piv_distance_df.loc[id1, id2]
            distance = self.reduced_distance_df.loc[(id1, id2)]
        except KeyError:
            distance = self.reduced_distance_df.loc[(id2, id1)]
            # distance = self.piv_distance_df.loc[id2, id1]
        print(distance)

but eventually realized the pivot is not needed, as reduced_distance_df can also be accessed quite easily. I feel like this way of handling the distance calculation and then querying back to retrieve the ids is a bit clunky, but I couldn’t think of a better one so far.

Asked By: Antonio Carnevali

Answer 1

In other words, is there a way to have .loc retrieve the non-nan value of a given combination of row/column names?

I don’t think so, not on the pivoted DataFrame. But if you reshape it with DataFrame.stack, the missing values are dropped and you can select by tuple from the resulting MultiIndex Series:

out = piv.stack()
# with the latest pandas version:
#out = piv.stack(future_stack=False)


print (out.loc[('alpha','a')])
0

The list of all combinations that return non-NaN values:

print (out.index.tolist())
[('alpha', 'a'), ('beta', 'b'), ('delta', 'd'), ('gamma', 'c')]

And for random values:

import random
rn = random.sample(out.index.tolist(), 2)
print (rn)
[('delta', 'd'), ('alpha', 'a')]
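
For reference, here is a self-contained sketch of this approach on a toy frame with the same greek/latin/distance columns as the question. The extra columns are left out, and pandas is imported as pd here, which is an assumption. The explicit dropna() is an addition, so that the result does not depend on whether the installed pandas version drops NaN while stacking:

```python
import pandas as pd

# Toy data mirroring the question's frame (without the undefined extra columns).
df = pd.DataFrame({
    "greek": ["alpha", "beta", "gamma", "delta"],
    "latin": ["a", "b", "c", "d"],
    "distance": [0, 1, 2, 3],
})
piv = df.pivot(index="greek", columns="latin", values="distance")

# Long format: one entry per (greek, latin) pair. dropna() removes empty cells
# regardless of how stack() treats NaN in the installed pandas version.
out = piv.stack().dropna()

value = out.loc[("gamma", "c")]        # an existing pair
exists = ("c", "gamma") in out.index   # the reversed pair is not in the index
print(value, exists)
```

Checking membership in out.index first is one way to avoid the KeyError that .loc raises for a pair that is not present.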

Answered By: jezrael

Answer 2

If you want to avoid NaNs in general, just stack; this will drop all the NaNs:

tmp = piv.stack()

Output:

greek  latin
alpha  a        0
beta   b        1
delta  d        3
gamma  c        2
dtype: object

Then you can slice directly:

tmp.loc[('alpha', 'a')]

Or, to handle possibly missing combinations:

tmp.get(('alpha', 'a'), 'missing')

Output: 0

Note that if you want a random item, no need to know the indices, just sample:

tmp = piv.stack()
chosen = tmp.sample(1)
chosen.index[0]
# ('beta', 'b')

chosen.squeeze()
# 1

Or for multiple values at once:

tmp.sample(5, replace=True)

Output:

greek  latin
delta  d        3
alpha  a        0
       a        0
gamma  c        2
beta   b        1
dtype: object

If you have arbitrary pairs in any order, the best would be to design your loop to provide the combination in the correct order.
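
One possible way to put a pair into the correct order before the lookup, as a rough sketch: the helper name orient is hypothetical, and it assumes that row labels and column labels do not overlap (true for the greek/latin toy data, but not necessarily for a table where the same ids appear on both axes).

```python
import pandas as pd

# Toy pivot with the question's greek/latin/distance columns.
df = pd.DataFrame({
    "greek": ["alpha", "beta", "gamma", "delta"],
    "latin": ["a", "b", "c", "d"],
    "distance": [0, 1, 2, 3],
})
piv = df.pivot(index="greek", columns="latin", values="distance")


def orient(piv, a, b):
    """Return (row, column) for two labels given in either order.

    Hypothetical helper: assumes a label is either a row or a column, not both.
    """
    if a in piv.index and b in piv.columns:
        return a, b
    if b in piv.index and a in piv.columns:
        return b, a
    raise KeyError((a, b))


row, col = orient(piv, "c", "gamma")   # labels passed in the "wrong" order
print(row, col, piv.at[row, col])
```

The returned pair can still point at a NaN cell, so a pd.notna check on the value may be needed afterwards.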

Now, assuming that you can’t, you could use a try except:

idx, col = 'a', 'alpha'
try:
    piv.loc[idx, col]
except KeyError:
    piv.loc[col, idx]

Alternatively, stack again and make your index a frozenset:

idx, col = 'a', 'alpha'

tmp = piv.stack()
tmp.index = tmp.index.map(frozenset)
tmp.get(frozenset((idx, col)), None)

Output: 0
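
The same frozenset idea can also be applied without pivoting at all. The following is only a sketch, assuming a long-format table with id1, id2 and distance columns as in the question's edit; the sample values and the names lookup and distance_between are made up for illustration.

```python
import pandas as pd

# Hypothetical long-format distances: one row per unordered pair of ids.
reduced_distance_df = pd.DataFrame({
    "id1": [1, 1, 2],
    "id2": [2, 3, 3],
    "distance": [0.5, 1.5, 2.5],
})

# Key each distance by the unordered pair, so (1, 3) and (3, 1) hit the same entry.
lookup = {
    frozenset((a, b)): d
    for a, b, d in zip(
        reduced_distance_df["id1"],
        reduced_distance_df["id2"],
        reduced_distance_df["distance"],
    )
}


def distance_between(x, y, default=None):
    """Order-independent lookup; returns default when the pair is unknown."""
    return lookup.get(frozenset((x, y)), default)


print(distance_between(3, 1), distance_between(1, 3), distance_between(1, 4))
```

Note that frozenset((x, x)) collapses to a single element, so this only suits pairs of distinct ids.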

Answered By: mozway

Answer 3

If I understood correctly, this code should do it.

First, create a list of the combinations that you want.
Then loop over the list and use .loc.
Add an if statement that checks whether the result is not NaN, and if so print the result.
If it is NaN, print that instead, or add it to a list, or whatever.

# Example iteration through combinations
combinations = [("gamma", "c"), ("alpha", "d")]  # Example you can add more combinations

for greek, latin in combinations:
    #check if the combination exists
    if pd.notna(piv.loc[greek, latin]):
        print(f"Value at {greek}, {latin}: {piv.loc[greek, latin]}")
    # or add to a list notnan.append((greek, latin))
    else:
        print(f"Combination {greek}, {latin} is NaN.")
    # Same here you can add it to a list or whatever

Answered By: Afriend
