How to use usecols elements which are regex rather than strings?
Question:
I created a script to go over the needed data, using pandas.
I’m now receiving more files that I need to go over, and sadly these files do not have the same headers.
For example I have placed in my list of columns to use ‘id_num’ and in some of the files it appears as ‘num_id’.
Is it possible to still use the usecols list I created, and allow certain elements in it to “connect” with different header strings, for example by using regex?
Answers:
I assume you’re referring to the usecols keyword in pd.read_csv (or an analogous pandas reader)? When usecols is a list of strings, pandas matches them against the file’s headers literally, not as regex patterns, so your existing list won’t pick up the renamed columns on its own.
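That said, recent versions of pd.read_csv also accept a callable for usecols; it is evaluated against each header name, so you can apply a regex at read time after all. A minimal sketch (the in-memory CSV and pattern are just illustrative):

```python
import io
import re

import pandas as pd

# Illustrative file whose header uses the "num_id" variant.
csv_data = "num_id,name,score\n1,alice,10\n2,bob,20\n"

# usecols may be a callable: a column is kept when it returns True,
# so one regex covers either header spelling.
pattern = re.compile(r"num_id|id_num")
df = pd.read_csv(io.StringIO(csv_data),
                 usecols=lambda col: bool(pattern.search(col)))
```

The callable is evaluated against the header row only, so you still get the usual column pruning while parsing the rest of the file.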
Alternatively, after you read the csv into a dataframe, you can easily filter the columns of interest using regexes. For example, suppose the new file is loaded into df:
import re

potential_columns = ['num_id', 'id_num']
df_cols = [col for col in df.columns if re.search('|'.join(potential_columns), col)]
List every header variant you want to accept in potential_columns. Joining them with '|' produces a single alternation regex, and the list comprehension then collects every column in df.columns that matches it. Once that’s done you can finish the process by calling:
df = df[df_cols]
Dealing with duplicate matches and crafting more precise patterns (for example, anchoring with ^ and $) is left as an exercise for you.
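Putting it together, a self-contained run of the approach above (the sample dataframe is made up):

```python
import re

import pandas as pd

# Made-up dataframe standing in for one of the files with renamed headers.
df = pd.DataFrame({"num_id": [1, 2], "name": ["alice", "bob"], "extra": [0, 0]})

potential_columns = ["num_id", "id_num"]
# One alternation regex covering every accepted header spelling.
df_cols = [col for col in df.columns if re.search("|".join(potential_columns), col)]
df = df[df_cols]
```

For what it’s worth, pandas can also do this filtering in one call with df.filter(regex="num_id|id_num", axis=1).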