Sorting a Dataframe by column containing char-digit-char values
Question:
I have a column in my Dataframe that has values that look like this (I want to sort my Dataframe by this column):
Mutations=['A67D','C447E','C447F','C447G','C447H','C447I','C447K','C447L','C447M','C447N','C447P','C447Q','C447R','C447S','C447T','C447V','C447W','C447Y','C447_','C92A','C92D','C92E','C92F','C92G','C92H','C92I','C92K','C92L','C92M','C92N','C92P','C92Q','C92R','C92S','C92T','C92V','C92W','C92_','D103A','D103C','D103F','D103G','D103H','D103I','D103K','D103L','D103M','D103N','D103R','D103S','D103T','D103V','silent_C88G','silent_G556R']
Basically all the values are in the format Char_1-Digit-Char_2. I want to sort them with Digit being the highest priority and Char_2 being the second highest priority. With that in mind, I want to sort my whole Dataframe by this column.
I thought I could do that with sorted(), using this list sorting function as my sorted( , key=):
import re
def alpha_numeric_sort_key(unsorted_list):
    return int( "".join( re.findall(r"\d*", unsorted_list) ) )
This works for lists. I tried the same thing for my Dataframe but got this error:
df = raw_df.sort_values(by='Mutation',key=alpha_numeric_sort_key,ignore_index=True) #sorts values by one letter amino acid code
TypeError: expected string or bytes-like object
I just need to understand the right way to use key= in df.sort_values(), explained in a way that's understandable to someone with an intermediate level of experience using Python.
I have also provided the head of my Dataframe below in case it is helpful for answering my question. If not, ignore it.
Thanks!
raw_df=pd.DataFrame({'0': {0: 100, 1: 100, 2: 100, 3: 100}, 'Mutation': {0: 'F100D', 1: 'F100S', 2: 'F100M', 3: 'F100G'},
'rep1_AGGTTGGG-TCGATTAG': {0: 2.0, 1: 15.0, 2: 49.0, 3: 19.0},
'Input_AGGTTGGG-TCGATTAG': {0: 48.0, 1: 125.0, 2: 52.0, 3: 98.0}, 'rep2_GTGTGGTG-TGTTCTAG': {0: 8.0, 1: 40.0, 2: 33.0, 3: 11.0}, 'WT_plasmid_GTGTGGTG-TGTTCTAG': {0: 1.0, 1: 4.0, 2: 1.0, 3: 1.0},
'Amplicon': {0: 'Amp1', 1: 'Amp1', 2: 'Amp1', 3: 'Amp1'},
'WT_plas_norm': {0: 1.9076506328630974e-06, 1: 7.63060253145239e-06, 2: 1.9076506328630974e-06, 3: 1.9076506328630974e-06},
'Input_norm': {0: 9.171121666392808e-05, 1: 0.0002388312933956, 2: 9.935381805258876e-05, 3: 0.0001872437340221},
'escape_rep1_norm': {0: 4.499235130027895e-05, 1: 0.000337442634752, 2: 0.0011023126068568, 3: 0.0004274273373526},
'escape_rep1_fitness': {0: -1.5465897459555915, 1: -1.087197258196361, 2: -0.1921857678502714, 3: -0.8788509789836517} } )
Answers:
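If you look at the definition of the key parameter in sort_values, it says:
It should expect a Series and return a Series with the same shape as the input. It will be applied to each column in by independently.
So key is called once with the whole column (a Series), not once per value. A function written for a single string, like yours, fails with the TypeError above because re.findall receives a Series instead of a string. You cannot use a function that returns a single scalar as the key.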
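To make the Series-in, Series-out contract concrete, here is a minimal sketch that adapts a per-string helper with Series.map. The helper name position_of and the four-row frame are made up for illustration; they are not from the question's data.

```python
import re

import pandas as pd


def position_of(mutation: str) -> int:
    """Scalar helper (hypothetical): first run of digits in one mutation string."""
    return int(re.search(r"\d+", mutation).group())


# Small illustrative frame, not the asker's raw_df.
df = pd.DataFrame({"Mutation": ["C447E", "A67D", "D103A", "C92A"]})

# key receives the whole "Mutation" column as a Series, so the scalar helper
# is applied element by element with map, which returns a Series of ints.
by_position = df.sort_values(
    by="Mutation",
    key=lambda col: col.map(position_of),
    ignore_index=True,
)
print(by_position["Mutation"].tolist())
# ['A67D', 'C92A', 'D103A', 'C447E']
```

map is convenient when a scalar function already exists; the vectorized .str methods are usually the faster choice on large columns.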
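You can do the sorting in two ways:
- First way: pass key functions that return a Series. Sort by the secondary key (Char_2) first, then by the primary key (Digit) with a stable sort, so that rows with equal digits keep their Char_2 order:
# Digits converted to int, so that 67 sorts before 103 (as strings, "103" < "67").
sort_int_key = lambda col: col.str.extract(r"(\d+)", expand=False).astype(int)
# The character(s) that follow the digits.
sort_char_key = lambda col: col.str.extract(r"\d+(\w+)", expand=False)
raw_df.sort_values(by="Mutation", key=sort_char_key).sort_values(
    by="Mutation", key=sort_int_key, kind="stable"
)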
- Second way: assign the extracted values as temporary columns, sort by those columns using the by parameter, then drop them:
raw_df.assign(
    sort_int=raw_df["Mutation"].str.extract(r"(\d+)", expand=False).astype(int),
    sort_char=raw_df["Mutation"].str.extract(r"\d+(\w+)", expand=False),
).sort_values(by=["sort_int", "sort_char"]).drop(columns=["sort_int", "sort_char"])
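As a quick check of the two-level ordering, the sketch below runs the same idea on a handful of values taken from the question's Mutations list. The group names pos and sub and the variable names are assumptions chosen for illustration, and the pattern assumes every value ends with digits followed by at least one non-digit character.

```python
import pandas as pd

# A few values from the question's list, deliberately shuffled.
df = pd.DataFrame(
    {"Mutation": ["C447F", "A67D", "C92_", "C447E", "silent_C88G", "C92A"]}
)

# Named groups give a two-column frame: the digits and the trailing character(s).
parts = df["Mutation"].str.extract(r"(?P<pos>\d+)(?P<sub>\D+)$")
parts["pos"] = parts["pos"].astype(int)  # compare positions as numbers, not text

# Sort the helper frame by both priorities, then reorder the original rows.
order = parts.sort_values(by=["pos", "sub"]).index
sorted_df = df.loc[order].reset_index(drop=True)
print(sorted_df["Mutation"].tolist())
# ['A67D', 'silent_C88G', 'C92A', 'C92_', 'C447E', 'C447F']
```

Because the helper values live in a separate frame, nothing has to be dropped from the result afterwards. Note that "_" sorts after the uppercase letters, so C92_ comes after C92A.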