Sorting by name part of the columns in a data frame without changing the position of the other columns in python pandas?

Question:

I have a data frame with 906 columns. 160 columns are names of world languages. So, the data frame columns are more or less like this:

[c1,c2,c3,c4,c....,Italian, English, German, French, Albanian, Spanish,... c903, c904, c905, c906]

I know how to sort the columns, but considering all of the columns of the data frame, not only part of it. How can I sort alphabetically in Python only the columns with the names of the languages without changing the order of the other columns?
My desired output should look like this:

[c1,c2,c3,c4,c....,Albanian, English, French, German, Italian, Spanish,... c903, c904, c905, c906]

Many thanks in advance for the help!

Asked By: Fabiano Ruggeri

||

Answers:

Assuming all your non countries columns can be identified and converted to a boolean (here using a regex to match cd+, bus this could be anything), you can use numpy.lexsort:

# identify non-target columns
# the regex here is just an example
# any other method can be used
a = df.columns.str.fullmatch('cd+')
# array([ True,  True,  True,  True, False, False, False,
#        False, False, False,  True,  True,  True,  True])

# compute a first sorter with a custom cumsum
# the odd numbers are the original True
pos = (2*a).cumsum()-a
# array([ 1,  3,  5,  7,  8,  8,  8,  8,  8,  8,  9, 11, 13, 15])

# sort by first sorter then column names
df2 = df.iloc[:, np.lexsort((df.columns, pos))]

To get the languages columns from a list of valid languages, it is possible to use langcodes:

from langcodes import language_lists

lang = language_lists.WIKT_LANGUAGE_NAMES['en']

a = ~df.columns.isin(lang)
# array([ True,  True,  True,  True, False, False, False,
#        False, False, False,  True,  True,  True,  True])

pos = (2*a).cumsum()-a
df2 = df.iloc[:, np.lexsort((df.columns, pos))]

output:

['c1', 'c2', 'c3', 'c4', 'Albanian', 'English', 'French', 'German',
       'Italian', 'Spanish', 'c903', 'c904', 'c905', 'c906']

used input:

cols = ['c1', 'c2', 'c3', 'c4', 'Italian', 'English', 'German',
        'French', 'Albanian', 'Spanish', 'c903', 'c904', 'c905', 'c906']
df = pd.DataFrame(columns=cols)
Answered By: mozway