How to manage column names containing multiple spaces when using read_csv

Question:

I use the same piece of code to import multiple dataframes. Usually they have the same column names with different data. However, sometimes the column names have a different number of spaces before or after them.

  df = pd.read_csv(
      file_path,
      delimiter="|",
      low_memory=True,
      dtype=schema,
      usecols=schema.keys(),
  )

The schema of the file is in a different file:

file_schema = {
    " Age ": str,
    " Name ": str,
    " Country ": str,
}

In some other cases, there are no spaces before or after the names:

file_schema = {
    "Age": str,
    "Name": str,
    "Country": str,
}

Currently, with a single schema, if the spaces around the column names in the file do not match the schema, I get errors related to usecols.
Is there a way to write the column names once, in a single schema file, so that it works no matter how many spaces appear before or after the names?

Asked By: the phoenix

||

Answers:

I think it should be possible to match the column names with

pd.read_csv(..., usecols=lambda x: x.strip() in schema.keys())

and then either strip them afterwards with

df.columns = df.columns.str.strip()

or, even better, try to pass them explicitly with

pd.read_csv(..., header=0, names=schema.keys())

if you know that all columns declared in schema will be in the file and in order.

I'm not sure whether dtype=schema will immediately cause the next problem, though.
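
One way to sidestep the dtype question might be to read only the header row first, then re-key the schema by the names exactly as they appear in that file. The following is a minimal sketch under that assumption, not a tested fix for the original files: read_with_schema is a hypothetical helper name, the sample data and the Extra column are made up for illustration, and the schema is assumed to be written with unpadded keys.

```python
import io

import pandas as pd

# Schema written once, with no padding around the names (assumption).
schema = {"Age": str, "Name": str, "Country": str}


def read_with_schema(source, schema, sep="|"):
    """Hypothetical helper: match header names to schema keys, ignoring padding."""
    # Read only the header row to learn the column names as they appear in the file.
    raw_names = pd.read_csv(source, sep=sep, nrows=0).columns
    if hasattr(source, "seek"):
        source.seek(0)  # rewind file-like objects; plain paths need no rewind
    # Re-key the schema by the raw (possibly padded) names found in this file.
    raw_dtypes = {
        raw: schema[raw.strip()] for raw in raw_names if raw.strip() in schema
    }
    df = pd.read_csv(source, sep=sep, usecols=list(raw_dtypes), dtype=raw_dtypes)
    # Normalize the names so downstream code always sees the stripped form.
    df.columns = df.columns.str.strip()
    return df


# Made-up sample inputs: one padded header, one unpadded header.
padded = io.StringIO(" Age | Name | Country | Extra \n31|Ana|PT|x\n")
plain = io.StringIO("Age|Name|Country|Extra\n45|Bo|SE|y\n")

df_padded = read_with_schema(padded, schema)
df_plain = read_with_schema(plain, schema)
print(list(df_padded.columns))
print(list(df_plain.columns))
```

Both calls should end up with the columns Age, Name and Country. Because dtype and usecols are built from the raw header names, they always agree with what is in the file, at the cost of opening each file twice. Only the header is stripped here; padding inside the data values is left untouched.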

Answered By: maow