Sorting a DataFrame by a column containing char-digit-char values

Question:

I have a column in my Dataframe that has values that look like this (I want to sort my Dataframe by this column):

Mutations=['A67D','C447E','C447F','C447G','C447H','C447I','C447K','C447L','C447M','C447N','C447P','C447Q','C447R','C447S','C447T','C447V','C447W','C447Y','C447_','C92A','C92D','C92E','C92F','C92G','C92H','C92I','C92K','C92L','C92M','C92N','C92P','C92Q','C92R','C92S','C92T','C92V','C92W','C92_','D103A','D103C','D103F','D103G','D103H','D103I','D103K','D103L','D103M','D103N','D103R','D103S','D103T','D103V','silent_C88G','silent_G556R'] 

Basically, all the values are in the format Char_1-Digit-Char_2. I want to sort them with Digit as the highest priority and Char_2 as the second-highest priority. With that in mind, I want to sort my whole DataFrame by this column.

I thought I could do that with sorted(), using this function as its key= argument:

import re

def alpha_numeric_sort_key(unsorted_list):
    # Despite the parameter name, sorted() passes one string at a time, e.g. 'C447E' -> 447
    return int("".join(re.findall(r"\d+", unsorted_list)))

This works for lists. I tried the same thing on my DataFrame but got this error:

df = raw_df.sort_values(by='Mutation',key=alpha_numeric_sort_key,ignore_index=True) #sorts values by one letter amino acid code

TypeError: expected string or bytes-like object

I just need to understand the right way to use key= in df.sort_values(), explained in a way that is understandable to someone with an intermediate level of Python experience.

I have also provided the head of my DataFrame in case it is helpful for answering my question. If not, ignore it.

Thanks!

raw_df=pd.DataFrame({'0': {0: 100, 1: 100, 2: 100, 3: 100}, 'Mutation': {0: 'F100D', 1: 'F100S', 2: 'F100M', 3: 'F100G'},
                 'rep1_AGGTTGGG-TCGATTAG': {0: 2.0, 1: 15.0, 2: 49.0, 3: 19.0},
                 'Input_AGGTTGGG-TCGATTAG': {0: 48.0, 1: 125.0, 2: 52.0, 3: 98.0}, 'rep2_GTGTGGTG-TGTTCTAG': {0: 8.0, 1: 40.0, 2: 33.0, 3: 11.0}, 'WT_plasmid_GTGTGGTG-TGTTCTAG': {0: 1.0, 1: 4.0, 2: 1.0, 3: 1.0},
                 'Amplicon': {0: 'Amp1', 1: 'Amp1', 2: 'Amp1', 3: 'Amp1'},
                 'WT_plas_norm': {0: 1.9076506328630974e-06, 1: 7.63060253145239e-06, 2: 1.9076506328630974e-06, 3: 1.9076506328630974e-06},
                 'Input_norm': {0: 9.171121666392808e-05, 1: 0.0002388312933956, 2: 9.935381805258876e-05, 3: 0.0001872437340221},
                 'escape_rep1_norm': {0: 4.499235130027895e-05, 1: 0.000337442634752, 2: 0.0011023126068568, 3: 0.0004274273373526},
                 'escape_rep1_fitness': {0: -1.5465897459555915, 1: -1.087197258196361, 2: -0.1921857678502714, 3: -0.8788509789836517} } )

Asked By: CodeDependency

||

Answers:

If you look at the description of the key parameter in the sort_values documentation, it says:

It should expect a Series and return a Series with the same shape as
the input. It will be applied to each column in by independently.

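The key function therefore receives the entire column as a Series rather than one value at a time, which is why re.findall raised the TypeError. You cannot pass a function that expects a single scalar directly as the key.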
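As a rough illustration of that point, a per-value function like the one in the question can still be reused if it is applied element-wise inside the key, for example with Series.map. This is only a sketch: the small frame and the helper name digits_of are made up for the demonstration, and it sorts by the digits only.

```python
import re
import pandas as pd

def digits_of(value):
    # Hypothetical scalar helper: 'C447E' -> 447
    return int("".join(re.findall(r"\d+", value)))

# Made-up sample data in the same format as the question's column
df = pd.DataFrame({"Mutation": ["D103A", "C92A", "C447E", "A67D"]})

# key is called once with the whole column (a Series), so map the scalar
# function over it and return a Series of the same shape
result = df.sort_values(
    by="Mutation", key=lambda col: col.map(digits_of), ignore_index=True
)
print(result["Mutation"].tolist())  # ['A67D', 'C92A', 'D103A', 'C447E']
```

Series.map calls the helper once per row, so it is slower than the vectorized .str methods on large frames, but it keeps the original function unchanged.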

You can do sorting in two ways:

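  1. Chain two sorts, each with its own key function:
# Extract the digits as integers so that 92 sorts before 103 (as strings, "103" < "92").
sort_int_key = lambda col: col.str.extract(r"(\d+)", expand=False).astype(int)
# Extract the character(s) that follow the digits.
sort_char_key = lambda col: col.str.extract(r"\d+(\w+)", expand=False)
# Sort by the lower-priority key first, then stably by the higher-priority key.
raw_df.sort_values(by="Mutation", key=sort_char_key).sort_values(
    by="Mutation", key=sort_int_key, kind="mergesort"
)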
  2. Assign the extracted values as temporary columns and sort by those columns using the by parameter:
raw_df.assign(
    sort_int=raw_df["Mutation"].str.extract(r"(\d+)", expand=False).astype(int),
    sort_char=raw_df["Mutation"].str.extract(r"\d+(\w+)", expand=False),
).sort_values(by=["sort_int", "sort_char"])
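For reference, here is a minimal end-to-end sketch of the temporary-column approach on a handful of values taken from the question's list. The two-group regex, the variable names parts and sorted_df, and the final drop of the helper columns are choices made for this illustration. The digits are cast to int so that 92 sorts before 103.

```python
import pandas as pd

# A few values from the question's list, deliberately out of order
df = pd.DataFrame(
    {"Mutation": ["C447F", "C92D", "D103A", "C447E", "A67D", "C92A", "silent_C88G"]}
)

# One regex with two capture groups: column 0 holds the digits,
# column 1 holds the character(s) that follow them
parts = df["Mutation"].str.extract(r"(\d+)(\w+)")

sorted_df = (
    df.assign(sort_int=parts[0].astype(int), sort_char=parts[1])
    # Digit is the primary key, trailing character the secondary key
    .sort_values(by=["sort_int", "sort_char"], ignore_index=True)
    # Remove the helper columns once the rows are in order
    .drop(columns=["sort_int", "sort_char"])
)
print(sorted_df["Mutation"].tolist())
# ['A67D', 'silent_C88G', 'C92A', 'C92D', 'D103A', 'C447E', 'C447F']
```

Passing both helper columns to by in a single call avoids having to reason about sort stability, which the chained version depends on.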
Answered By: SomeDude