How to use usecols elements which are regex rather than strings?

Question:

I created a script to go over the needed data, using pandas.
I’m now receiving more files that I need to go over, and sadly these files do not have the same headers.

For example, my list of columns to use includes ‘id_num’, but in some of the files that column appears as ‘num_id’.

Is it possible to still use the usecols list I created, and allow certain elements in it to “connect” with different header strings, for example by using regex?

Asked By: Lafayette

Answers:

I assume you’re referring to the usecols keyword in pd.read_csv (or one of the analogous pandas readers)? A list passed to usecols is compared against the header names literally, so regex patterns placed in that list won’t match anything. (usecols does also accept a callable that is evaluated against each column name, which is one way to apply a regex at read time.)
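
As a minimal sketch of the callable form of usecols, the snippet below filters columns by regex while the file is being read. The sample CSV text, the `wanted` pattern, and the extra `score` column are made up for illustration and are not part of the question.

```python
import io
import re

import pandas as pd

# Hypothetical sample data standing in for one of the incoming files
csv_text = "num_id,name,score\n1,a,10\n2,b,20\n"

# Assumed pattern: accept either spelling of the id column, plus 'score'
wanted = re.compile(r"^(id_num|num_id|score)$")

# usecols may be a callable; pandas calls it with each column name
# and keeps the columns for which it returns True
df = pd.read_csv(
    io.StringIO(csv_text),
    usecols=lambda name: bool(wanted.search(name)),
)

print(list(df.columns))
```

Anchoring the pattern with ^ and $ keeps it from matching longer names that merely contain one of the variants.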

However, after you read the csv into a dataframe (let’s name it df for the sake of the example), you could very easily filter the columns of interest using regexes.

For example, suppose your new dataframe is loaded into df:

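import re

potential_columns = ['num_id', 'id_num']  # regex patterns for the header variants

# keep every column whose name matches at least one pattern
df_cols = [col for col in df.columns if re.search('|'.join(potential_columns), col)]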

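List every header variant you want to match in potential_columns. Joining them with '|' builds a single regex alternation, and the list comprehension collects each name in df.columns that matches it. Note that re.search matches substrings, so a column such as 'num_id_old' would also be kept; anchor the patterns (for example '^num_id$') if you need exact matches. Once that’s done, finish by selecting those columns: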

df = df[df_cols]

Dealing with duplicate columns and coming up with clever patterns to search for are left as an exercise for you.
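
One possible way to approach that exercise, sketched under assumptions: map each canonical column name to a regex for its accepted header variants, raise an error if a file contains more than one variant, and rename whatever matched so the rest of the script always sees the same names. `COLUMN_PATTERNS` and `select_and_rename` are hypothetical names invented for this illustration, and the sample dataframe is made up.

```python
import re

import pandas as pd

# Hypothetical mapping: canonical name -> regex for accepted header variants
COLUMN_PATTERNS = {
    "id_num": r"^(id_num|num_id)$",
    "score": r"^score$",
}


def select_and_rename(df, patterns=COLUMN_PATTERNS):
    """Keep columns matching any pattern and rename them to canonical names."""
    rename = {}
    for canonical, pattern in patterns.items():
        matches = [c for c in df.columns if re.search(pattern, str(c))]
        if len(matches) > 1:
            # Two variants of the same column in one file: refuse to guess
            raise ValueError(f"multiple columns match {canonical!r}: {matches}")
        if matches:
            rename[matches[0]] = canonical
    # Select the matched columns, then give them their canonical names
    return df[list(rename)].rename(columns=rename)


# Made-up dataframe using the 'num_id' spelling
df = pd.DataFrame({"num_id": [1, 2], "name": ["a", "b"], "score": [10, 20]})
result = select_and_rename(df)
print(list(result.columns))
```

Raising on ambiguous matches is only one design choice; depending on the data, preferring the first match or merging the columns may make more sense.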

Answered By: Brian