How to read csv files based on multiple index_col argument?

Question:

I have a problem which might be easy for majority of people here.

I have four folders: SA1, SA2, SA3, SA4.
Each folder has around 60 csv files. I have defined the path like this:

my_string = "{0}/folder1/{1}/{1} something/{2}/AUST"
analytics_path = "C:/Users/...../SharePoint - Documents/folder2"
year = "2016" # User should define this
level = "SA3" # User should define this
path = my_string.format(analytics_path, year, level)

Once the user defines the year and level in the path above, I to combine all the csv files under the "level" folder based on the "index_col=" parameter.

For exmaple, for SA1, I want to combine the CSV files based on the "SA1_code" column. For SA2, I want to combine the CSV files based on
the "SA2_MAIN_DIGIT_CODE" column. For SA3 and SA4, the index_col should be "SA3_MULTI" and "SA4_REGIONS" respectively. As you can see all the columns names for CSV files under these four folders are different.

So far, I have attempted the following things.
I have defined the function as

def combine_csv(path):
    """
    Concatenates the csv files and create the huge dataframe combing the information in all the csv files in a given folder
    
    Parameters
    ----------
    Path (string); Location of the folder and files therein
    
    Returns
    ----------
    A dataframe of all the concatenated csv files
    
    """

    # setting the path for joining multiple files
    files = os.path.join(path+"/*.csv")

    # list of merged files returned
    files = glob.glob(files)
    
    # joining files with concat and read_csv
    list_csv = []
    for filename in files: 
        list_df = pd.read_csv(filename) # Can't give the "index_col" as there four different strings for teh csv files in each folder
        list_csv.append(list_df)
    df = pd.concat(list_csv, axis=1, ignore_index=False)
    return df

data_df = combine_csv(path)

gives me the combined dataframe. But I want to combine it based on "SA1_code" if the user chooses to go to SA1 folder or "SA2_MAIN_DIGIT_CODE" if they choose to combine CSV files from SA2 folder, and so on and so forth.

How do I do this?

Asked By: surviving-grad

||

Answers:

You don’t have four separate index columns, you just have one that changes depending on user input. Therefore, the solution to your problem is relatively simple. First, modify your combine_csv method:

def combine_csv(path, index):
    """
    Concatenates the csv files and create the huge dataframe combing the information in all the csv files in a given folder
    
    Parameters
    ----------
    Path (string); Location of the folder and files therein
    
    Returns
    ----------
    A dataframe of all the concatenated csv files
    
    """

    # setting the path for joining multiple files
    files = os.path.join(path+"/*.csv")

    # list of merged files returned
    files = glob.glob(files)
    
    # joining files with concat and read_csv
    list_csv = []
    for filename in files: 
        list_df = pd.read_csv(filename, index_col = index)
        list_csv.append(list_df)
    df = pd.concat(list_csv, axis=1, ignore_index=False)
    return df

All I did was inject a value, index, that will be used for the indeX_col argument to read_csv.

Next, we need to determine the value for index based on the value for level, as input by the user. According to your question, it seems that there should be a one-to-one relationship between these values. So, we can use a dictionary for this:

LevelIndexMapping = {
    "SA1": "SA1_code",
    "SA2": "SA2_MAIN_DIGIT_CODE",
    "SA3": "SA3_MULTI",
    "SA4": "SA4_REGIONS"
}

my_string = "{0}/folder1/{1}/{1} something/{2}/AUST"
analytics_path = "C:/Users/...../SharePoint - Documents/folder2"
year = "2016" # User should define this
level = "SA3" # User should define this
path = my_string.format(analytics_path, year, level)

combine_csv(path, LevelIndexMapping[level])

Here, I created a dictionary that maps your level variable to its associated index column value, and then accesses that mapping when calling combine_csv.

Answered By: Woody1193