Is there any way to capture the input file name of multiple parquet files read in with a wildcard in pandas/awswrangler?

Question:

This is the exact python analogue of the following Spark question:

Is there any way to capture the input file name of multiple parquet files read in with a wildcard in Spark?

I am reading in a wildcard list of parquet files using (variously) pandas and awswrangler.

Is there a way to retrieve a column containing the original filename of each row loaded into the eventual combined dataframe, exactly as per the Spark version of this quesiton?

Update: This is possibly a way to do it – Reading DataFrames saved as parquet with pyarrow, save filenames in columns

Update2: The present question is the reverse of https://stackoverflow.com/a/59682461/1021819

Asked By: jtlz2

||

Answers:

You’ll need to add the filename as a new column to each dataframe as you load them. For example, here is how to do this with a set of CSV files since that is easier to run as an example. You’ll follow a similar pattern for parquet files.

from pathlib import Path

import pandas as pd

# write a couple fake csv files to our disk as examples
Path("csv1.csv").write_text("a,bn1,2n1,2")
Path("csv2.csv").write_text("b,cn3,4n3,4")

all_dfs = []

# for every csv file that we want to load
for f in Path(".").glob("csv*.csv"):
    
    # read the csv
    df = pd.read_csv(f)
    
    # add a filename column to the dataframe
    df["filename"] = f.name
    
    # store the dataframe to concat later
    all_dfs.append(df)
    
# put together the dataframes for each file
pd.concat(all_dfs)
#>      a  b  filename    c
#> 0  1.0  2  csv1.csv  NaN
#> 1  1.0  2  csv1.csv  NaN
#> 0  NaN  3  csv2.csv  4.0
#> 1  NaN  3  csv2.csv  4.0
Answered By: hume

Sorry this isn’t an answer, I couldn’t see how to post a reply.

I have tried this solution and it appears to work but only for the last file in the set. I’ve tried this with csv and parquet files. Here are the files in my folder:

folder containing both file types

As you can see from this image, its all the same file name (all 146000 rows) – so its the result of only one csv file – what ahs happened to the rest? Does it only read the last file?

csv file with new column

So I guess my question is, how do I do this for all csvparquet files to append the filename for each file into one csv (so combine all files into one while having the file names for each file)

Answered By: David Ashworth