Importing multiple .cnv files, extract filename, and attach as a column

Question:

I have multiple .cnv files that I can import and concatenate with the following code:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import glob

# Get CSV files list from a folder
path ='/Users/mariacristinaalvarez/Documents/High3/'
csv_files = glob.glob(path + "/*_high3.cnv")

# Read each CSV file into DataFrame
# This creates a list of dataframes
df_list = (pd.read_csv(file, encoding="ISO-8859-1",delim_whitespace=True,skiprows=316 , header=None) for file in csv_files1)

# Concatenate all DataFrames
stations_df = pd.concat(df_list)

This code works for me to get all files concatenated into one, but I would like to have a column with the filename that is contained between ‘
.
For example:
filename = ‘HLY2202_008_high3_predown_av1dbar.cnv’
I would like to only extract the numbers after HLY2202
AND before _high3
So the return should be "008"
I want to do this for each file and add the name as a column so it becomes a identifier when I do explorative data analysis.

Answers:

Use and assign the filename/substring you need to a new column.

import pandas as pd
from pathlib import Path

path = Path("/Users/mariacristinaalvarez/Documents/High3/")
csv_files = path.glob("*_high3.cnv")

df_list = [pd.read_csv(file, encoding="ISO-8859-1", delim_whitespace=True, skiprows=316, header=None)
               .assign(filename=file.stem.split("_")[1]) for file in csv_files]

stations_df = pd.concat(df_list)
Answered By: Timeless

You can create a dict of dataframes instead of a list. The key will be used by pd.concat to create a new index:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pathlib

# Use high-level module pathlib rather than os/glob
path = pathlib.Path('/Users/mariacristinaalvarez/Documents/High3/')
csv_files = path.glob('*_high3_*.cnv')

# CSV parameters
params = {'encoding': 'ISO-8859-1', 'delim_whitespace': True, 
          'skiprows': 316, 'header': None}

stations_df = (pd.concat({file.stem.split('_')[1]: pd.read_csv(file, **params) 
                          for file in csv_files}, names=['id'], axis=0)
                 .reset_index(level='id').reset_index(drop=True))

Output:

>>> stations_df

    id        date  value
0  008  2020-01-01     35  # file HLY2202_008_high3_predown_av1dbar.cnv
1  008  2020-01-02     40
2  008  2020-01-03     45
3  009  2020-01-01     35  # HLY2202_009_high3_predown_av1dbar.cnv
4  009  2020-01-02     40
5  009  2020-01-03     45
Answered By: Corralien
Categories: questions Tags: , , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.